May 1, 2020
Putting sound into context
Imagine there’s a tap running in your house. If you’re doing the washing up, that’s not a problem. But what if you’ve just left the house for work and that tap’s still running? You could return to a sky-high water bill and a flooded home.
A smart speaker with a broader sense of hearing could prevent this. It would understand the significance of the running tap, conclude that you are leaving the house from the other sounds you’ve made, and use this context to alert you to the free-flowing tap before you leave.
That’s not all. A contextually aware product could remind you to do any number of things before you leave the house. It could also identify a worsening cough, automatically implement ‘do not disturb’ functionality if it sounds like you’re having a bath, provide reassurance if you’re woken in the night by an unexpected noise, and so on, unleashing an unprecedented number of use cases both in your home and as you navigate the world outside, via your smartphone or earbuds.
For our machine learning researchers, building tech that can understand context is a really exciting area of development and for us is a key part of what we call Second Generation Sound Recognition.
We have already laid a strong foundation in this area, providing smart home devices with a sense of hearing to recognise and intelligently react to a range of important audio events, such as glass breaking, smoke alarms, dogs barking, babies crying and car alarms. We’re also working with companies in the smartphone and hearables markets to empower products with the ability to react to dynamic acoustic scenes.
Now, we are developing a broader sense of hearing, providing contextual understanding that connects audio events, acoustic scenes and what they represent. It’s a game-changing step for the sound recognition market – and one that will provide consumers with truly intelligent and valuable experiences.
In this post, we will explain how our Second Generation Sound Recognition technology can understand context from an audio environment and how we are developing special forms of recurrent neural networks to help us achieve contextual awareness at the edge, in a compact and computationally efficient manner.
Building context in sound
We’ve already begun addressing the challenges of scaling up to provide context recognition, where you need to understand audio events and acoustic scenes with unprecedented depth. This is where machine learning models can help, specifically those models that exploit long-term correlations.
Sound events are not just isolated acoustic snapshots; they are temporal sequences that change over time. For example, the sound of a crying baby isn’t just one “waah” but a repetition of this soundbite over time, potentially with other sounds included in the cry and from the surrounding environment as well. Our future systems need to recognise these acoustic and temporal features of the target sound and its environment across a broader range of sounds and over longer periods of time.
This is where models which exploit long-term correlations come in. They can identify and then store significant audio information over time. In effect, they provide the system with a memory to identify and understand different audio events, bringing some much-needed context to your audio landscape.
Long Short-Term Memory (LSTM) networks are one well-known neural network component that can model longer-term correlations in the data. We have already used an adapted version of these networks in our own research to model medium-term sound events, and there is a strong argument to explore them for longer-term context. More on this later.
LSTMs are widely used for language understanding in speech and text, but things get complicated when we move from identifying spoken words to identifying sound events. Think of a simple sentence, such as “Alexa, set a timer for 10…”. The next word in this unfinished sentence is likely to be “minutes”. This is relatively simple for the system to infer. It has a wake word to tell the system to listen and it only has to focus on one sentence, where the information needed to make this prediction sits directly in front of the missing word.
Now, go back to our running tap example. The system has to gather a broad range of data over time, recognising and understanding significant audio events to make an accurate decision. Unlike a structured sentence, this audio information could also be in any sequence, with important data detected long before the moment of decision.
Understanding the challenges
When bringing context into an audio environment, we must make sense of complex interleaving sounds across time.
We need to model sounds acoustically and temporally across the short-, mid- and long-term. The system needs contextual awareness to understand what’s happening in the house (for example), allowing it to differentiate between an insignificant and potentially significant activity. We also need to understand how different sounds interact and overlap, identifying and using those sounds that are meaningful for the environment they exist in. This is a huge undertaking, addressing a range of fundamental technology challenges.
These sounds are diluted across both time and space – where an important noise may be masked by a louder (but insignificant) sound or may sound completely different depending on its surrounding environment. Because these sounds take place over time, they could also appear in any possible sequence, making it difficult to infer the correct context for any given situation. As time passes, the system needs to link different sounds together to build a cohesive picture of what’s going on – and what to look out for.
The key challenge is to identify which sounds are important. While humans automatically disregard insignificant noises, a machine may assume every sound has meaning and hold onto all of this unnecessary data.
This is where machine learning comes into play, helping the system build the right context into its decision-making algorithms. Compared to the identification of wake words, this is a complex undertaking where the system is constantly searching for and storing many different acoustic elements. Even if the system has a vast amount of data to work with, the diversity of sound combinations needed to build the necessary context is huge.
Identifying the solutions
LSTMs could be a way to address this challenge. We’ve already seen success in this area, where our modelling of mid-term audio events with customised LSTMs suggests that this type of DNN architecture could be used to model long-term context, albeit not ‘out of the box’.
LSTMs are particularly useful for tackling the vanishing gradient problem. Here, a plain recurrent neural network can struggle to retain long-term information, because the influence of past data effectively fades away the further back in time it occurred. An LSTM selects the essential information from past data and stores it as a summary of the current situation, or ‘state’. This state is retained for as long as it is required to build context within the system.
However, LSTMs are computationally expensive to run and optimise, especially high-performance variants such as bi-directional LSTMs.
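To make the idea of a retained ‘state’ concrete, here is a minimal sketch of one step of a standard textbook LSTM cell in NumPy. It is not our customised network: the function names (`lstm_step`, `retain_demo`), the gate ordering and the bias values are assumptions chosen purely for illustration. The demo holds the input gate almost closed and compares a forget gate that is almost fully open with one that sits near 0.5.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    Shapes: x (n_in,), h_prev and c_prev (n_hidden,),
    W (4*n_hidden, n_in), U (4*n_hidden, n_hidden), b (4*n_hidden,).
    Assumed gate order in the stacked weights: input, forget, candidate, output.
    """
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:n])          # input gate: how much new information to write
    f = sigmoid(z[n:2 * n])      # forget gate: how much of the old state to keep
    g = np.tanh(z[2 * n:3 * n])  # candidate values to write
    o = sigmoid(z[3 * n:4 * n])  # output gate: how much of the state to expose
    c = f * c_prev + i * g       # updated cell state (the 'memory')
    h = o * np.tanh(c)           # updated hidden output
    return h, c

def retain_demo(steps=100, n_hidden=3, n_in=2, forget_bias=20.0):
    """Run `steps` random inputs through a cell whose input gate is held shut.

    Illustrative values only: a large forget bias keeps the state,
    a forget bias of 0 lets it fade away.
    """
    rng = np.random.default_rng(0)
    W = 0.1 * rng.normal(size=(4 * n_hidden, n_in))
    U = np.zeros((4 * n_hidden, n_hidden))
    b = np.zeros(4 * n_hidden)
    b[0:n_hidden] = -20.0                   # input gate ~ 0: ignore new input
    b[n_hidden:2 * n_hidden] = forget_bias  # controls how much state survives
    c0 = np.linspace(-1.0, 1.0, n_hidden)
    h, c = np.zeros(n_hidden), c0.copy()
    for _ in range(steps):
        h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
    return c0, c

c0, kept = retain_demo(forget_bias=20.0)  # state survives 100 steps
_, faded = retain_demo(forget_bias=0.0)   # state decays towards zero
```

In a trained network the gates are not fixed like this; they are learned functions of the input and the previous output, which is what lets the cell decide what is worth remembering.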
This is a pertinent problem for Audio Analytic because our technology is implemented at the edge, where we require a small computational footprint. Therefore, we cannot simply increase the complexity of the network by adding more GPUs, extra neural layers or sophisticated processors to overcome this issue.
Instead, we work smarter: we use an unprecedented amount of primary-sourced and augmented audio data, along with other proprietary techniques, to shrink our models and boost their efficiency.
Looking ahead to give sound context
We are tackling two main contextual awareness issues in our research. First, we have designed powerful machine learning models with low computational complexity, which are capable of understanding acoustic and temporal audio activities. In addition to other areas of research, our customised LSTM network and Temporal Resolution Scaling are both stepping stones to tackling this problem.
Second, with our Acoustically Smart Temporal Augmentation (ASTA) technique, we are using our large audio datasets to teach our models the underlying significance of these activities, helping them understand sound events and scenes and the broader context that they appear in.
Let’s explore all these solutions now.
Firstly, we have developed a new variant of LSTMs, which we call Look-Ahead LSTMs (LA-LSTMs). LA-LSTMs combine an LSTM architecture with a training method that aims to preserve the ability of bi-directional LSTMs to use future information in their predictions, but at a much lower computational cost.
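The details of LA-LSTMs are beyond the scope of this post, but the general trade-off can be illustrated with a well-known generic technique: training a one-directional model with delayed targets. The output at frame t + k is trained to predict the label of frame t, so the model has seen k future frames before it commits to a decision, without running a second, backward pass over the whole sequence. The sketch below is an assumption-laden illustration of that generic idea only, not a description of LA-LSTMs; `delay_targets` and its parameters are hypothetical names.

```python
def delay_targets(labels, lookahead, pad_label=None):
    """Shift per-frame labels right by `lookahead` frames.

    After the shift, position t + lookahead holds the label of frame t,
    so a one-directional model trained on these targets sees `lookahead`
    future frames before deciding. The first `lookahead` positions have
    no label yet and are filled with `pad_label` (to be ignored in the loss).
    """
    if lookahead < 0:
        raise ValueError("lookahead must be non-negative")
    n = len(labels)
    k = min(lookahead, n)
    return [pad_label] * k + list(labels[:n - k])

# Hypothetical frame labels for a short clip.
frame_labels = ["background", "background", "tap", "tap", "background"]
delayed = delay_targets(frame_labels, lookahead=2)
# delayed == [None, None, "background", "background", "tap"]
```

The price of this kind of look-ahead is latency: every decision arrives `lookahead` frames late, which has to be weighed against the accuracy gained.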
Second, we have also developed a method called Temporal Resolution Scaling (TRS), which we use to construct deep LSTM networks without increasing the computational complexity. In a similar fashion to image processing techniques that use wavelets, different levels of the network can operate at different, appropriate temporal resolutions, providing us with better modelling of long-term sequential data.
Third, and finally, LA-LSTMs also use a special form of data augmentation, ASTA, which allows us to fully capitalise on our wealth of existing acoustic data for the purposes of understanding long-term audio activities. ASTA helps us create acoustically and temporally correct occurrences of data. This approach goes beyond many traditional augmentation methods, which artificially distort the sound data for different environments. In addition to acoustic augmentation, we augment the context occurrences (in other words, the order and time spacing in which things happen) in a way that is field-realistic, before training our LA-LSTMs on the acoustically and temporally augmented data.
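A rough way to see why running different levels at different temporal resolutions keeps the cost down is to count recurrent steps. The sketch below assumes, purely for illustration, that each layer runs at half the frame rate of the layer beneath it, with adjacent frames averaged on the way up. This is a generic multi-resolution stack, not the TRS method itself, and `downsample` and `recurrent_steps` are hypothetical helper names.

```python
import numpy as np

def downsample(frames, factor):
    """Average each run of `factor` consecutive frames.

    frames: array of shape (n_frames, n_features).
    Frames left over at the end (fewer than `factor`) are dropped.
    """
    n = (frames.shape[0] // factor) * factor
    return frames[:n].reshape(-1, factor, frames.shape[1]).mean(axis=1)

def recurrent_steps(n_frames, n_layers, factor):
    """Recurrent steps per layer when each layer runs at 1/factor
    of the temporal resolution of the layer below it."""
    steps = []
    for _ in range(n_layers):
        steps.append(n_frames)
        n_frames //= factor
    return steps

scaled = recurrent_steps(1000, n_layers=4, factor=2)  # [1000, 500, 250, 125]
full_rate = [1000] * 4                                # every layer at full rate
print(sum(scaled), "steps vs", sum(full_rate))        # 1875 steps vs 4000
```

With a factor of 2, the total number of steps stays below twice that of the bottom layer however many layers are stacked, while the upper layers each cover a longer stretch of time per step.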
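As a toy illustration of augmenting the order and time spacing of events, the sketch below takes a timeline of labelled events and produces a variant with rescaled gaps and, optionally, a shuffled order. It is not ASTA: the function name, the event format and the uniform random scaling are all assumptions, and nothing here checks that the result is field-realistic, which is the hard part.

```python
import random

def augment_timeline(events, gap_scale=(0.5, 2.0), shuffle=False, seed=None):
    """Return a new timeline with randomly rescaled gaps between events.

    events: list of (label, onset_s, duration_s) tuples.
    gap_scale: range of the random factor applied to each silent gap.
    shuffle: if True, also permute the order in which events occur.
    Durations are preserved. Overlaps in the input are flattened into
    back-to-back events (a simplification made for this sketch).
    """
    rng = random.Random(seed)
    events = sorted(events, key=lambda e: e[1])
    # Silent gap between the end of each event and the start of the next.
    gaps = [max(nxt[1] - (cur[1] + cur[2]), 0.0)
            for cur, nxt in zip(events, events[1:])]
    order = list(events)
    if shuffle:
        rng.shuffle(order)
    out = []
    t = events[0][1] if events else 0.0
    for idx, (label, _, duration) in enumerate(order):
        out.append((label, t, duration))
        if idx < len(gaps):
            t += duration + gaps[idx] * rng.uniform(*gap_scale)
    return out

# Hypothetical 'leaving the house' timeline: (label, onset in s, duration in s).
timeline = [("tap_running", 0.0, 30.0), ("keys_jangling", 40.0, 2.0),
            ("door_closing", 45.0, 1.0)]
variant = augment_timeline(timeline, shuffle=True, seed=1)
```

A realistic system would constrain which orders and spacings are allowed, so that augmented timelines still resemble sequences that occur in real homes.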
How will sound-based context recognition impact our lives?
Machine learning models and DNN architectures that can model over longer periods of time are exciting and will play a key part in building this longer-term contextual understanding into future consumer products as wide-ranging as smart speakers, video doorbells, smartphones, earbuds and cars.
Novel techniques such as LA-LSTMs, TRS and ASTA are important steps in the right direction as we build technologies that can not only hear the world around us, but fully understand it.
These exciting developments are also just one piece of a very complex puzzle. Our research takes place on a multitude of fronts: delivering compact tinyML models that enable cloudless AI on all manner of devices, and creating a multi-class model that recognises more than 50 sounds and scenes for an even broader sense of hearing.
We are building the foundational technology for short-, medium- and long-term contextual awareness across a range of audio events and scenes. In doing so, products that utilise our technology will understand sound over relatively long periods of time, fuelling unprecedented innovation in both voice assistants and wider consumer technologies.
By building context into our Second Generation Sound Recognition, we can create the next wave of intelligent products, unlocking a vast range of innovative solutions to flood the market (not your kitchen).