December 7, 2021
Painting with sounds: The important role of augmentation in ML
We recently talked about our soft-temporal modelling design philosophy and why it was so important to our technology. In the blog we highlighted one of the significant challenges with understanding sound, which is that sounds vary, even if they are considered the same class. For example, emergency vehicle sirens differ around the world and there are about 350 different breeds of dog.
What’s more, recognising sounds – whether events or acoustic scenes – means that ML models must detect more than just a fairly bounded and structured space like in the case of speech recognition. Speech consists of vibrations generated by the human vocal folds, sustained, amplified and shaped by the various vocal tract components. These produce a limited range of sounds which are interpreted as symbols constrained within a human-made system of communication.
Sound – as a whole – involves a considerably wider range of vibrations, and hence a broader set of physical phenomena, comprising numerous types of mechanical triggers as sources and numerous propagation media and environments. We see speech as a small and very specific sub-space of sound.
Furthermore, not only does each sound come with its own ‘anatomy’ within its class, but every single sound instance in nature is also unique and unrepeatable. For example, can that dog bark twice in exactly the same way? This means that the possible acoustic scenarios of a given target sound we aim to detect are virtually infinite. Fourier's work points the same way: re-creating an arbitrary vibration perfectly can require an infinite number of sinusoids. So unless it is by some Divine Intervention, I doubt any of us have all those sinusoids at hand.
Therefore, this virtually infinite source of data presents anybody wanting to recognise sound with a major problem: How do you build training and evaluation datasets that are representative of the world in which the end system will have to operate?
The solution is Acoustically Smart Temporal Augmentation or ASTA for short. It’s an approach that we had to design to cope with the realities of modelling an unbounded and infinite world.
Before I get into describing what augmentation and ASTA are, it is worth briefly highlighting that they are very different from synthesised audio data.
When we talk about augmentation, we are talking about the creation – or auralisation – of high-fidelity polyphonic soundscapes that sound 100% realistic both to the human ear and to a machine. This is because each new recording is created using different audio components in realistic but varied ways. Synthesised audio data, on the other hand, is a bit like Foley Art in Hollywood movies: it might sound like the right thing to a human, but to a machine it is very different.
In order to create these accurate augmented recordings you need to capture each component in a way that gives you complete freedom to construct each soundscape. You then need tools that can take the various components and build a suitable dataset that contains the relevant edge cases without inventing scenarios that are extremely unlikely or impossible in the real world.
Here’s a quick overview of how we do it through ASTA:
- Primary data collection in our sound labs – these act like a green screen for audio. We can isolate the sounds of interest and potential sounds of interference. Importantly, while we do capture with high-grade expensive microphones, we also capture using representative consumer devices which feature a range of microphones, housing and audio paths. This is because the resultant model needs to be able to cope with a wide range of hardware and DSP artifacts.
- Primary data collection in the field – we capture the target and non-target sounds in their natural environment.
- Environmental impulse responses (the signatures for different acoustic environments) and measurements – to recreate any given environment and any possible positioning of the sound source and capturing device, we have to be able to measure the acoustic space.
- Metadata, metadata, metadata – Metadata is crucial. To give you a sense of how much importance we place on it, our Alexandria™ dataset contains 40 million audio recordings and over 500 million metadata points. We capture everything from the sensitivity of the recording device to the construction material, detailed information about the recorded subjects and the distance between them and the microphones. This is yet another reason why you can’t just use any audio dataset found on the internet. Without the metadata it is simply impossible to simulate, in a scientific manner, an accurate acoustic scenario with specific hardware and software effects. You can read more about the technical limitations of third-party data sources in our ‘Why Real Sounds Matter’ whitepaper.
Our ASTA tools are then used to combine all of the above information to create a detailed soundscape, in which audio is emulated as if the subject were in a particular environment along with other interfering sound sources, while preserving their natural temporal characteristics. The resulting soundscape includes all the environmental interactions and is generated as if it were recorded by any device of interest. It can therefore match the variation and diversity found in the real world.
It’s equivalent to painting with sound.
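To make the combination step a little more concrete, here is a minimal sketch of two generic building blocks that this kind of augmentation relies on: convolving an isolated recording with an environmental impulse response, then mixing in an interfering sound at a chosen signal-to-noise ratio. This is not the ASTA implementation. The function names, the toy signals and the 10 dB figure are all assumptions made purely for illustration, and a real pipeline would work on full-length audio with measured impulse responses.

```python
import math

def convolve(signal, impulse_response):
    """Direct linear convolution; output length is len(signal) + len(ir) - 1."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))

def mix_at_snr(target, interference, snr_db):
    """Scale the interference so the target sits snr_db above it, then sum.

    Assumes neither signal is silent (non-zero RMS).
    """
    n = min(len(target), len(interference))
    target, interference = target[:n], interference[:n]
    gain = rms(target) / (rms(interference) * 10 ** (snr_db / 20))
    return [t + gain * i for t, i in zip(target, interference)]

# Toy inputs (hypothetical): a short "dry" sound captured in isolation and a
# two-tap room response (direct path plus one quieter reflection).
dry = [1.0, 0.5, 0.0, 0.0]
room_ir = [1.0, 0.0, 0.3]
wet = convolve(dry, room_ir)          # the sound as heard in that room

# Hypothetical interfering sound, mixed in 10 dB below the target.
noise = [0.2, -0.2, 0.2, -0.2, 0.2, -0.2]
scene = mix_at_snr(wet, noise, snr_db=10.0)
```

The metadata matters here because every parameter in the sketch (which impulse response, which interference, which level) has to come from something that was actually measured rather than guessed.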
Below is a simplified illustration that highlights the role of augmentation.
The outcome of ASTA is the power to create a series of recordings that would be expensive or very difficult to capture in a controlled way. For example, we can accurately auralise a sound recognition model running on a pair of earbuds picking up a dog barking 10m away and the reversing beeper and engine of a truck 5m away, all in a busy market, above the ambient sound of rain.
In addition to building training sets, augmentation is key in building evaluation datasets too. It is time-consuming and expensive to take a model out on the road and objectively evaluate its performance. Take window glass break. You could visit thousands of different homes, smash their windows and confirm it works, or you could build an equally realistic dataset in a fraction of the time. In addition, our approach to augmentation means that the process is repeatable, so that you know you are getting an apples-to-apples comparison.
This challenge with evaluating models is not specific to sound. Automotive companies like Waymo, for example, can’t build and then crash thousands of different types of cars in many different location types with all the variables involved. Trent Victor, their Director of Safety Research, blogged about the company’s efforts to reconstruct actual crashes through computer simulations to test their models (read it here).
There are three fundamental success criteria for sound recognition. The models have to be accurate, robust and compact. Augmentation plays a critical role in all three. Accurately augmented data means that models are guided towards learning the right acoustic and temporal features, which leads to optimal true-positive (TP) and false-positive (FP) performance. Evaluating the model against a varied and diverse dataset ensures that models are robust enough to cope with the real world and aren’t affected by edge cases. And by teaching the model what it needs to know, rather than letting it learn spurious and undesired behaviours, the model is optimised for the task and isn’t inflated unnecessarily.
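As a rough illustration of what repeatability can look like in practice, the sketch below draws a set of evaluation scenarios from a seeded random generator, so the same seed always yields the same set. The scenario fields, value ranges and labels are hypothetical and are not taken from our tools.

```python
import random

def sample_scenarios(seed, count):
    """Draw `count` evaluation scenarios; the same seed gives the same list."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    environments = ["kitchen", "living room", "street"]   # hypothetical labels
    devices = ["smart speaker", "earbuds", "doorbell"]    # hypothetical labels
    return [
        {
            "environment": rng.choice(environments),
            "device": rng.choice(devices),
            "source_distance_m": round(rng.uniform(0.5, 10.0), 2),
            "snr_db": rng.randrange(-5, 21, 5),  # one of -5, 0, 5, 10, 15, 20
        }
        for _ in range(count)
    ]

# Two runs with the same seed describe exactly the same evaluation set,
# so two model versions can be compared on identical conditions.
run_a = sample_scenarios(seed=7, count=5)
run_b = sample_scenarios(seed=7, count=5)
```

Keeping the seed alongside the results is what makes a later comparison between two model versions apples-to-apples.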
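For readers less familiar with the TP and FP terminology, here is a small sketch of how the two rates can be computed from a labelled evaluation set. The labels and predictions are made-up toy values, and the function name is an assumption. A deployed system might report false positives per hour of audio rather than as a rate over clips.

```python
def tp_fp_rates(labels, predictions):
    """Return (true-positive rate, false-positive rate) for boolean sequences."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0   # share of real events detected
    fpr = fp / negatives if negatives else 0.0   # share of non-events that triggered
    return tpr, fpr

# Toy evaluation set: True means the target sound is present in the clip.
labels      = [True, True, True, False, False, False, False, True]
predictions = [True, False, True, False, True, False, False, True]

tpr, fpr = tp_fp_rates(labels, predictions)
print(tpr, fpr)  # 0.75 0.25
```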
About Audio Analytic
Audio Analytic is on a mission to create exceptional human experiences through a greater sense of hearing. This empowers machines with the ability to react to the world around us, helping satisfy our entertainment, safety, security, wellbeing, convenience, and communication needs across a huge range of consumer products.
Audio Analytic’s ai3™ and ai3-nano™ compact software platforms are suitable for embedding into a wide range of products from smart speakers and video doorbells, to smartphones and wearables. They have been licensed to some of the world’s most prominent consumer technology companies and bring accurate and robust sound recognition to many products available today.