In a recent blog post, my colleague Adrian talked in great detail about the technical reasons why audio data downloaded from websites like YouTube are unsuitable for sound recognition.

It should go without saying that if you use poor quality data for training then you get poor models which, in turn, create poor consumer experiences. But if poor end-user experiences weren’t enough of a reason, we must also consider the other risks associated with using data you can find online, such as legal action and the resulting negative brand association.

Every company and institution working in data and machine learning should understand the rules around personal data and how they differ around the world. You can read more about those differing approaches in our whitepaper, ‘Audio for Machine Learning: The Law and your Reputation’, which we published a few months ago.

One area that we touch on within the whitepaper is that organisations need to know they have the correct licences from the copyright owners. This means that it is necessary to have a detailed understanding of where your data comes from, as well as the ability to trace it back to a legal document granting permission to use it for commercial machine learning products.

For example, take AudioSet from Google Research. It is “an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labelled 10-second sound clips drawn from YouTube videos.” Although it is mostly speech and music, Google Research have made AudioSet publicly available to be used by academia for sound recognition evaluation and training. Importantly, Google can’t grant permission for the underlying YouTube content to be used. A quick review of the YouTube videos included in this dataset highlights significant and obvious potential copyright infringement. The labellers who built this dataset have included protected IP like TV shows, music, and video games. Clearly their aim was to build a non-commercial dataset for academic use, but it has the potential to be misused.

Copyright vests automatically in the creator of an original work of authorship that has at least some minimal degree of creativity, which could include audio recordings. In some jurisdictions, such as South Korea, it is ‘neighboring rights’ rather than ‘copyright’ that are granted to the creator of a recording. However, under neighboring rights the overall framework, including the need to secure a licence, remains the same as under copyright. In general, the owner of the copyright has the exclusive right to copy, prepare derivative works of, distribute, publicly perform, and publicly display the work (and to allow others to exercise those rights).

For the purposes of machine learning, it is important that we understand the impact of copyright and licensing. For example, YouTube typically restricts each user’s right to view or listen to content on its platform to that user’s personal, non-commercial use. Copyright holders who upload content to YouTube typically grant other users a licence to access the content that is hosted on the platform only as enabled by YouTube. There are some exceptions to the standard licensing terms, such as Creative Commons, but these do not necessarily grant the required rights for machine learning.

For example, IBM took images uploaded to Flickr under Creative Commons licence conditions and used them to train their facial recognition system. This approach resulted in an avalanche of negative media coverage.

A year later, IBM’s CEO, Arvind Krishna, wrote to Congress to confirm that the company had cancelled its facial recognition system. However, the issue has not yet disappeared. At the time of writing, a class-action lawsuit (Vance v. International Business Machines) is working its way through the court system in Illinois.

Although they differ around the world, copyright laws allow for ‘fair use’ to balance the rights of copyright owners with a range of other rights, interests, and freedoms. However, the exact wording of the legislation differs, as does the judicial interpretation and application. Crucially, ‘fair use’ does not permit commercial exploitation. Therefore, within an academic setting, the use of copyrighted audio data may be acceptable under non-commercial fair use terms, but if the resultant model were made commercially available, then the data used may no longer be ‘fair’ to use. This is a potential area where organisations will be caught out if their focus is on the model and its capabilities, rather than on the underlying data used to train it.

Within the EU, there are other relevant intellectual property rights to consider. An organisation may have database copyright or a standalone database right. These rights come into existence automatically without any registration requirements. For example, they can arise if an organisation has collated sounds in a systematic or methodical way, where the resulting collection constitutes the author’s intellectual creation, or has made a substantial investment in obtaining, verifying or presenting the contents of a database. As a result, it then becomes possible for infringement of database rights to occur even if other forms of copyright are not infringed. This is a particular risk in the context of AI and ML, where organisations want as much data as possible. Stumbling across a database of sounds and using those sounds could easily constitute extraction or reutilisation of a substantial part of a database, which, if the database is protected, will mean an infringing act has been committed.

There are many aspects to a licence to use audio data. For example, there is its defined purpose, and whether it is royalty-bearing or royalty-free, exclusive or non-exclusive, limited or unlimited in use, and commercial or non-commercial. If your machine learning system requires audio recordings that have been created or recorded by other people, then you need to license this data and make sure they have the right to license it to you. If you license audio data from a third party, it carries significant technical risk, which we discuss in a previous whitepaper titled Why Real Sounds Matter for Machine Learning. Additionally, existing licences are not designed for the purpose of machine learning and commercial deployment, and explaining what machine learning entails to people who are unfamiliar with the concept is not easy. As a result, significant ambiguity remains over usage rights, and this leads to significant potential liabilities.

To further complicate matters, it is the responsibility of the licensee to have fully understood the licence chain. If you are licensing recordings directly from the copyright holder, then there is one licence to agree. However, if you are licensing audio recordings via a central intermediary, then there are two (or more) licence agreements that are needed to allow you to train an ML model for commercial deployment. The misplaced assumption that liability ends with the intermediary can leave companies exposed to claims for unauthorised use.
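To make the idea of a licence chain concrete, here is a minimal Python sketch of how a data platform might check that every agreement between the copyright holder and your organisation actually carries the rights you need. The `LicenceLink` fields and the `chain_problems` helper are hypothetical names invented for illustration; a real licence has far more terms than two booleans, and a check like this is no substitute for legal review.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LicenceLink:
    """One agreement in the chain (all fields are illustrative assumptions)."""
    licensor: str
    licensee: str
    allows_commercial_ml: bool   # does this agreement cover commercial ML training?
    allows_sublicensing: bool    # may the licensee pass rights further down the chain?


def chain_problems(chain: List[LicenceLink], copyright_holder: str, end_user: str) -> List[str]:
    """Return human-readable gaps in the chain; an empty list means none were found."""
    if not chain:
        return ["no licence agreements on record"]

    problems = []
    if chain[0].licensor != copyright_holder:
        problems.append("chain does not start at the copyright holder")
    if chain[-1].licensee != end_user:
        problems.append("chain does not end at the end user")

    for i, link in enumerate(chain):
        label = f"{link.licensor} -> {link.licensee}"
        if not link.allows_commercial_ml:
            problems.append(f"{label}: no grant for commercial ML")
        # Every link except the last must be able to pass rights onwards,
        # and must hand over to the licensor named in the next agreement.
        if i < len(chain) - 1:
            if not link.allows_sublicensing:
                problems.append(f"{label}: licensee may not sublicense")
            if chain[i + 1].licensor != link.licensee:
                problems.append(f"{label}: next agreement has a different licensor")
    return problems
```

The point of the sketch is that the intermediary's agreement with you is only one link: if the agreement between the copyright holder and the intermediary lacks a commercial ML grant, the function reports a problem even though your own contract looks fine.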

If you can agree to licence terms with each rightsholder, then the next matter at hand is to determine whether you have a perpetual right or one that carries a limited term-of-use. Any machine learning system built upon a limited-term licence would need to be retrained without that data at some future point in time, which may compromise the reliability of the system. This places a burden on the data platform to provide traceability so that expiring licensing terms can be easily assessed and mitigated. However, it is better to avoid expiring licensing terms in the first place, which is why emphasis should be placed on primary data collection and augmentation.
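As a rough illustration of the traceability described above, the following Python sketch flags training clips whose licence has expired or will expire within a chosen horizon, so that retraining can be planned in advance. The `ClipLicence` record, its field names, and the 180-day default are assumptions made for this example rather than a description of any particular data platform.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ClipLicence:
    """Links one training clip to the agreement that permits its use (hypothetical schema)."""
    clip_id: str
    licence_ref: str           # identifier of the signed agreement on file
    expires: Optional[date]    # None means the licence is perpetual


def expiring_clips(records: Iterable[ClipLicence], today: date, horizon_days: int = 180) -> List[str]:
    """Return IDs of clips whose licence has already expired or expires within the horizon."""
    cutoff = today + timedelta(days=horizon_days)
    return [
        r.clip_id
        for r in records
        if r.expires is not None and r.expires <= cutoff
    ]


# Example records: one perpetual licence and two limited-term licences.
records = [
    ClipLicence("clip-001", "AGR-17", None),
    ClipLicence("clip-002", "AGR-23", date(2024, 3, 31)),
    ClipLicence("clip-003", "AGR-23", date(2026, 1, 1)),
]
at_risk = expiring_clips(records, today=date(2024, 1, 1))
```

With these example records, only `clip-002` falls inside the 180-day window. Any model trained on it would need to be retrained without that clip once the term ends, which is exactly the burden that perpetual rights and primary data collection avoid.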

A final note of caution related to criminal liability: in certain jurisdictions (for example, under the UK’s current copyright law, the Copyright, Designs and Patents Act 1988), laws make express reference to how, in certain circumstances, acts of infringement can constitute a criminal offence.

As ML researchers and engineers, we have a moral and legal obligation to treat data as sacrosanct. It is entirely possible to collect the data that you need for any ML task without infringing somebody’s copyright or invading their privacy. For too long, people have treated data as though it is some freely available resource, but the tide is turning, and organisations will soon need to answer a very simple question: “do you have permission to use this data to train this model?”