September 30, 2021

Why our ‘soft temporal modelling’ design philosophy is key to understanding sounds


To guide our research and development, we set a core technical design philosophy. It gives us the flexibility needed to take very diverse sound objects and sequences and translate them into stable, unique detections. As you read more about our technology, it is useful to understand how this approach runs through our entire ML pipeline, from the data we collect, label and augment to the models we train and evaluate.

First, let’s briefly explore why defining the structure of sounds is so challenging for machines. For this example, I’m going to use a sound familiar to all of us: the dog bark. Detecting the sound of your dog barking supports a range of consumer use cases, from smartphones and earbuds that alert you to sounds happening around you if you can’t hear them, to smart speakers and smart home devices that protect your property or enable you to track your pet’s anxiety levels while you are out.

Unless you suffer from some form of significant hearing loss, you will immediately be able to recall the sound of a dog barking, whether that is a Chihuahua or a German Shepherd. But how do you impart this understanding to a machine, especially when a vast array of acoustic features makes up the sound? For example, there are big dogs, small dogs, old dogs and young dogs. Different breeds have a variety of snout shapes, which shape the sounds that they make. There are different types of barks, such as when a dog is stressed or just looking for friendly attention. And then there is the environment in which the animal barks, which can be inside a house or garage, or out in a street, garden or park. That is a vast number of different sound combinations for the dog part alone. On top of that, each environment features additional sources of sound.

As experts in the understanding of sounds, our number one goal is to develop algorithms and methods able to map all of these sounds to one label – in our example, mapping every possible dog bark in every possible location to these two words, “dog barking”. How do we do that?

One perspective on that problem is to find what is common to all dog barks at the acoustic level. Indeed, dog sounds exist within certain audio frequency boundaries, and are composed of “dog tones” and “dog noises” within certain ranges of variation. This gives part of the answer. But there is another part to it: finding what is different between dog bark audio sequences, and making the algorithm tolerant to these differences. That is the goal of our core technical design philosophy that we call “soft temporal modelling”, which looks at both the acoustic and the temporal variability of sounds. In the example of dogs, you could think of this as the musical score of a dog barking episode: how many individual barks, how long they were, how each individual dog bark is structured over time and where in time they happen relative to each other. “Woof”, “yap-yap”, “woof woof — woof”, “woof – arf woof – woof”, “woof woof woof arf woof” – five sound sequences in this example, still covered by two words or just one concept: dog barking.
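To make the “musical score” idea concrete, here is a minimal sketch of how one barking episode could be written down as a list of timed events and then summarised. The `Bark` class, the `describe_episode` function and the timings are hypothetical names and values chosen for illustration only; they are not taken from our pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bark:
    """One individual bark; times are in seconds (hypothetical structure)."""
    onset: float
    offset: float


def describe_episode(barks):
    """Summarise the temporal structure of one barking episode."""
    ordered = sorted(barks, key=lambda b: b.onset)
    # How long each individual bark lasts.
    durations = [b.offset - b.onset for b in ordered]
    # Silence between the end of one bark and the start of the next.
    gaps = [nxt.onset - cur.offset for cur, nxt in zip(ordered, ordered[1:])]
    return {"count": len(ordered), "durations": durations, "gaps": gaps}


# "woof woof (pause) woof" written as timed events (illustrative values).
episode = [Bark(0.0, 0.5), Bark(1.0, 1.5), Bark(3.0, 3.25)]
summary = describe_episode(episode)
```

Two episodes with very different counts, durations and gaps would still carry the same label, which is exactly the variability that the modelling has to tolerate.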


Three very different dog bark examples

The necessity for softness along the temporal modelling dimension comes from the fact that even for humans, it can be difficult to agree very precisely on where the start and end of each sound or each tone are located, or how long a silence between sounds needs to be to qualify as an interruption. Although intuitively we seem to know what a dog bark is, when it comes to labelling the boundaries of it on a given audio recording, we are not so sure anymore – for example, data labellers often disagree on whether a short silence amidst a sequence of sounds is a boundary or is short enough to be considered part of the whole sound event. This is why our data is labelled the way it is. We know that various levels of labelling are key to various levels of modelling, so we label simultaneously at the fine and episodic levels in addition to the basic level of weak labels. As a result, we have been able to develop temporal modelling methods which are tolerant to variations in sequence. Two words, one concept, against an infinity of relevant sound combinations and labelling opinions.
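The labelling disagreement described above can be illustrated with a small sketch. It assumes events are plain `(onset, offset)` pairs in seconds and uses a hypothetical `merge_events` helper with a `max_gap` tolerance; it shows only how the choice of tolerance changes where the boundaries fall, and is not our labelling tool or modelling method.

```python
def merge_events(events, max_gap):
    """Merge events separated by a silence of at most max_gap seconds.

    events: iterable of (onset, offset) pairs in seconds.
    Returns a sorted list of merged (onset, offset) pairs.
    """
    merged = []
    for onset, offset in sorted(events):
        if merged and onset - merged[-1][1] <= max_gap:
            # Silence is short enough: treat it as part of the same event.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            # Silence is too long: start a new event.
            merged.append((onset, offset))
    return merged


barks = [(0.0, 0.5), (0.75, 1.25), (3.0, 3.5)]
fine_labels = merge_events(barks, max_gap=0.1)      # three separate barks
episode_labels = merge_events(barks, max_gap=2.0)   # one barking episode
```

Neither result is wrong. A labeller who tolerates short silences produces the episodic view, while one who does not produces the fine view, which is why labelling at several levels at once is useful.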

Our design philosophy flows down to all aspects of our technical innovation, both in terms of what we’ve already developed and what drives our continuous evolution. For example:

  • The tolerance to sequence interruptions is baked into the way we train the network via our patented loss function, which tells the network what the sound is and what it is not. For example, “the recognition of dog barks should be continuous and tolerant to short interruptions” – thus helping the network to find the boundaries between the sound objects in softer ways, instead of imposing hard and precise boundaries.
  • Our PSDS evaluation method, which has been adopted by the organisers of the DCASE Challenge Task 4 as the default metric, considers the quantity of overlap between sounds and their labels, rather than strict and instantaneous sound boundaries, to evaluate whether a detection was correct or not.
  • Similarly, our patented Temporal Decision Engine helps the system to make classification decisions based on soft temporal models of appropriate sound sequences for each type of sound event, in combination with decisions about the tones themselves.
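The idea of judging a detection by how much it overlaps its label, rather than by exact boundaries, can be sketched as follows. This is a simplified illustration and not the PSDS metric itself, which is defined more fully than this. The function names, the 0.5 threshold and the assumption that ground-truth labels do not overlap each other are all choices made for the example.

```python
def overlap(a, b):
    """Length in seconds of the intersection of two (onset, offset) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def is_correct_detection(detection, ground_truths, min_fraction=0.5):
    """Accept a detection if enough of its duration is covered by labels.

    detection: (onset, offset) pair in seconds.
    ground_truths: non-overlapping (onset, offset) labels of the same class.
    min_fraction: assumed tolerance; 0.5 is an arbitrary illustrative value.
    """
    duration = detection[1] - detection[0]
    if duration <= 0:
        return False
    covered = sum(overlap(detection, gt) for gt in ground_truths)
    return covered / duration >= min_fraction


labels = [(0.0, 2.0), (3.0, 4.0)]
# The detection starts late and spans a short silence, yet half of it is covered.
accepted = is_correct_detection((1.5, 3.5), labels)
```

A strict boundary-matching rule would reject this detection because its start and end do not line up with any single label, whereas an overlap-based rule can accept it.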

Combining softness and precision, just the right amount of each, is how we get to truly define sounds. In my dog barking example, this is how we summarise thousands of sound instances into just one concept.

Our technology exists to create valuable and exciting experiences for consumers. This means that the models have to be accurate, robust and compact to be commercially successful, a feat that would be impossible without our design philosophy.




 About Audio Analytic 

Audio Analytic is the pioneer of AI sound recognition technology. The company is on a mission to give machines a compact sense of hearing. This empowers them with the ability to react to the world around us, helping satisfy our entertainment, safety, security, wellbeing, convenience, and communication needs across a huge range of consumer products.

Audio Analytic’s ai3™ and ai3-nano™ sound recognition software enables device manufacturers to equip products at the edge with the ability to recognize and automatically respond to our growing list of sounds and acoustic scenes.
