April 20, 2021
“Be careful about how you get that data set” – AI industry warned by US Federal Trade Commission
“The bigger the data set, the better the algorithm, and the better the product for consumers, end of story… right? Not so fast. Be careful about how you get that data set […] how you get the data may matter a great deal.”
Machine learning requires data. And lots of it. As well as the right quantities, it also must be diverse, representative of the task, and labelled correctly. If you follow poor practices, use questionable public datasets or inappropriate sources, then your AI’s performance is likely to be sub-standard in real-world applications.
Audio data quality is essential for sound recognition. There is, however, another reason why taking data seriously is so important: the significant financial and reputational risks that poor practice creates. This is the subject of a detailed whitepaper titled ‘Audio for Machine Learning: The Law and your Reputation’ that we’ve published today.
So, what are the risks?
- In January 2021, the FTC announced that it had reached a settlement with California-based Everalbum, which required the company to delete all its data and the models that were built on it.
- In February 2021, the Canadian privacy regulators requested that Clearview AI remove all Canadian subjects from the images it had scraped from the internet for its “unlawful” facial recognition system.
- In 2019, IBM announced that it had used images shared on Flickr under Creative Commons licences to train a facial recognition system. It faced an instant backlash from Flickr users, shut the service down a year later, and a legal case is still going through the courts in the US state of Illinois.
In the whitepaper, we highlight the risks that organisations may be taking with the data they source for ML. This includes datasets built for research purposes, like AudioSet, as well as other sound libraries such as those built for games, movies or generic media. In both cases, it is important for there to be a chain of consent that stretches back to the original IP owner. Even if this data is made available, the organisation building the model has a responsibility to ensure that all the correct permissions have been obtained for the ML training task. For example, if a sound has been recorded on the London Underground, you need permission for that use from Transport for London (the network operator). Considering the amount of data required, this creates a significant overhead.
There are three problems with using datasets where traceability back to the original content owners is unknown, such as content scraped from YouTube or downloaded elsewhere on the internet.
The first revolves around making sure that you have the content owner’s consent to use their data to train commercial AI systems. The ‘fair use’ argument does not typically cover content used for commercial purposes, so without that consent you have no solid legal footing.
The second issue is that the terms of service of sites such as YouTube do not grant permission to use their data for commercial machine learning purposes.
The third is ensuring that the data you are using is not the intellectual property of another organisation or individual, which is impossible to verify without complete traceability back to the original content owner and control over your data sources. For example, somebody could upload a video to YouTube featuring another person’s content, such as a cartoon character crying on TV. If that clip is then used to train your ML model, not only is the data unrepresentative of the real world, you may also be using the cartoon creator’s protected IP. This is a real example found within the ‘baby cry’ class of sounds in AudioSet.
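In practice, the traceability described above can be recorded as metadata attached to each clip. The sketch below is a hypothetical illustration (none of these class or field names come from an actual product): each recording carries a chain of consent links, and a clip only passes the check if that chain starts with the original content owner and every link permits the intended use.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ConsentLink:
    """One step in the chain of consent: who granted which use to whom."""
    grantor: str        # party granting permission (ideally the original IP owner)
    grantee: str        # party receiving permission
    permitted_use: str  # e.g. "commercial ML training"

@dataclass
class AudioRecord:
    source: str                    # where the clip came from
    label: str                     # sound class, e.g. "baby cry"
    original_owner: Optional[str]  # None when traceability is unknown
    consent_chain: List[ConsentLink] = field(default_factory=list)

def has_traceable_consent(rec: AudioRecord, use: str) -> bool:
    """A clip is usable only if its consent chain is non-empty, starts
    with the original owner, and every link permits the intended use."""
    if rec.original_owner is None or not rec.consent_chain:
        return False
    if rec.consent_chain[0].grantor != rec.original_owner:
        return False
    return all(link.permitted_use == use for link in rec.consent_chain)
```

With this check, a clip scraped from YouTube with an unknown owner is rejected outright, while a clip licensed directly from its owner for the stated purpose passes; the point is that the decision is made per recording, from recorded provenance, not assumed for the dataset as a whole.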
We believe that regulatory bodies and the courts around the world will become much more explicit and restrictive around data and its application in machine learning, as we have already started to see. They will require companies to seek express consent from data providers and to have data management practices that prove they can trace that consent right back to the original content owner.
Consumers are also expecting tech companies – big and small – to do more. In our recent independent survey of 2,000 US and UK consumers aged between 18 and 44, 75% of respondents said that there should be stricter regulations and limitations around how companies use their data without their knowledge to train AI systems.
The whitepaper looks at South Korea, the European Union, and California, USA in more detail. It also sets out the pillars that are essential for a sensible and robust approach to data management that keeps organisations on the right side of the law now and well into the future.
The principles that we set out in the whitepaper form the foundation of our Alexandria™ dataset, which now has 30 million labelled recordings across 1,000 sound classes and 400 million metadata points. Our approach to data shows that if you take it seriously, you can build high-performance AI that meets a global audience’s legal and ethical expectations.
Like this? You can subscribe to our blog and receive an alert every time we publish an announcement, a comment on the industry or something more technical.