With the rise of the Internet of Things and the rebirth of Home Automation, the world is seeing an ever-increasing number of connected devices that provide value to people by analysing ambient data automatically. Such systems and devices are usually referred to as “smart” or “intelligent” because they include some degree of high-level interpretation of the captured data. The data itself varies in nature, ranging from images captured by connected cameras, through to body signals sensed by wearables, or house parameters measured by smart meters and thermostats. However, few devices are yet able to capture and analyse the high-value information held in ambient sound.

What is intelligent sound detection?

Humans have long dreamt of creating machines able to emulate human behaviour: sci-fi is full of talking robots such as Robby from Forbidden Planet or HAL 9000 from 2001: A Space Odyssey, and early attempts at building machines able to talk, play games or otherwise help humans date back to the 18th century. In recent times, computer systems able to emulate spoken communication and music recognition have been popularised and have met commercial success with services such as Siri, Shazam and SoundHound. The common denominator of such services is the recognition of the particularly valuable sounds that are speech and music.

But what about the rest of the sounds around us? Wouldn’t it be useful if your surveillance camera or some other connected device could tell you if something were to go wrong in your home by the sound of it, for example if a smoke alarm went off or if someone were trying to break one of your windows? Enter Audio Analytic: our software performs intelligent sound detection.

How is intelligent sound detection different from speech and music recognition?

Recognising sounds beyond speech and music poses multiple challenges. To begin with, the structure of a “soundscape” (a landscape of sounds) can be very different from a speech utterance or a musical piece: while sentences are structured by a grammar and musical pieces by a score, in a soundscape any sound can follow any other sound, with very loose sequential constraints, if any. Sounds can also overlap, with a sound of interest occurring on top of a noisy background. Further, while the sounds of speech are restricted to those the human vocal tract can produce, and the sounds of music are most often structured as tuned notes produced by resonance processes, environmental sounds can truly sound like anything, arising from a multiplicity of different physical processes: crashing objects, explosions, beeping circuits, animal calls, humming machines.

The quality of audio captured by devices is a further challenge. Speech is most often spoken close to the microphone of a mobile phone, or at the sweet spot of an in-car audio capture system, situations known as close-talk or near-field audio capture. Environmental sounds, on the other hand, can happen anywhere at a distance from the microphone, a situation known as far-field audio capture. In the case of music, the audio can be delivered to the system as high-quality recordings when identified from the user’s music collection, or with some degree of care for audio quality when captured from, say, a mobile phone’s microphone. But environmental sound recognition has to work on any embedded device, including devices whose audio circuitry was designed with cost reduction in mind rather than quality of audio transmission. For example, many consumer devices natively use mono audio capture, thus preventing the use of beamforming, a technique which can help far-field audio capture but requires an array of at least two microphones operating in stereo. Dealing with such practical constraints and suboptimal audio is thus part of the art of automatic environmental sound recognition.
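To make the mono-versus-stereo point concrete, here is a minimal sketch of delay-and-sum beamforming, the simplest form of the technique mentioned above. It is a generic illustration under assumed parameters (16 kHz sampling, 5 cm microphone spacing), not a description of any particular product, and it is only possible at all because two channels are available:

```python
import numpy as np

SAMPLE_RATE = 16000     # Hz, assumed capture rate
MIC_SPACING = 0.05      # metres between the two microphones (assumed)
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def delay_and_sum(left: np.ndarray, right: np.ndarray, angle_deg: float) -> np.ndarray:
    """Steer a two-microphone array towards a source at angle_deg (0 = broadside).

    A far-field wavefront arriving off-axis reaches one microphone slightly
    after the other; compensating for that delay before summing reinforces
    sound from the steered direction and attenuates sound from elsewhere.
    """
    # Time difference of arrival between the two capsules for this angle.
    tdoa = MIC_SPACING * np.sin(np.deg2rad(angle_deg)) / SPEED_OF_SOUND
    shift = int(round(tdoa * SAMPLE_RATE))   # delay in whole samples
    aligned_right = np.roll(right, shift)    # wrap-around is negligible for long buffers
    return 0.5 * (left + aligned_right)
```

With a single mono channel there is no second signal to align and sum, which is why beamforming is simply unavailable on many cost-optimised consumer devices.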

In addition, while mainstream speech and music recognition services are backed by huge computational power hosted in enormous data centres, environmental sound recognition software is most often expected to run “on the edge”, i.e., it has to work with the limited computational power available on its embedded host, for example, directly on the chip of a surveillance camera.

In order to solve these major challenges of complex soundscapes, variable audio quality and low computational power, a sound recognition system must be very good at modelling a wide range of acoustic phenomena, while also being able to cope with a multiplicity of noise conditions and microphone types. Just as speech recognition must tolerate variations in people’s voices, for example when they have a cold, a sound recognition system must be able to generalise across different instances of a sound, e.g., glass of different thicknesses and sizes being broken, or different models of smoke alarm. And all this intelligence must run at a very small computational cost.
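One standard way to build this kind of noise tolerance, sketched below as a general technique from the field rather than a description of Audio Analytic’s training pipeline, is to mix clean examples of a target sound with varied backgrounds at a range of signal-to-noise ratios and train on the results (the sound names in the usage comment are hypothetical):

```python
import numpy as np

def mix_at_snr(target: np.ndarray, background: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay a target sound on a background at a chosen signal-to-noise ratio (dB)."""
    background = background[:len(target)]  # assume the background is at least as long
    target_power = np.mean(target ** 2)
    background_power = np.mean(background ** 2) + 1e-12  # avoid divide-by-zero
    # Scale the background so that 10*log10(target_power / noise_power) == snr_db.
    scale = np.sqrt(target_power / (background_power * 10 ** (snr_db / 10)))
    return target + scale * background

# Hypothetical usage: generate training variants across noise levels.
# for snr in (-5, 0, 5, 10, 20):
#     variant = mix_at_snr(glass_break, kitchen_noise, snr)
```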

How is Audio Analytic approaching this challenge?

We are experts in the development and application of sound recognition research.

The inherent complexity of intelligent sound detection means that there are no credible canned solutions. So Audio Analytic has developed intelligent solutions that involve a unique blend of state-of-the-art knowledge about acoustic modelling and machine learning, and hands-on know-how about audio data, audio capture devices, highly efficient embedded software, and a multitude of other practical aspects of sound recognition. We hold three technology patent families in the field.

Our in-house research team is highly experienced in the field. Most of the engineers hold a PhD, and their backgrounds include industrial R&D with prominent companies such as Toshiba and Nuance, and academic research at some of the world’s best electrical engineering university programmes.

Sound recognition is a more recent academic research topic than speech and music recognition, and there is a growing number of research centres around the world with which Audio Analytic maintains contact. In particular, we run a continuous programme of mutually valuable research projects with the top academic labs in the field, such as the University of Surrey’s Centre for Vision, Speech and Signal Processing (Prof. Mark Plumbley) and Queen Mary University of London’s Centre for Digital Music (Prof. Simon Dixon).

How does it work?

So how can sound recognition be done, and how does Audio Analytic do it? Audio Analytic’s underlying technology is known as “machine learning”: just as humans learn to recognise sounds by example, our algorithms learn from large collections of audio recordings. To push the human analogy further, humans perceive sounds through the combined action of the ears, which capture sounds and extract some acoustic descriptors (e.g., tone, volume etc.), and the brain, which recognises the acoustic patterns and is able to generalise this recognition across a variety of instances of a particular sound.

Similarly, the Audio Analytic system is divided into a feature extraction module, which extracts acoustic characteristics as the ears do, and a pattern matching engine, which learns from data as the brain does; a simplified code sketch of this two-stage pipeline follows the list below. In building this technology, Audio Analytic has brought some deep skills to bear:

  • Collecting and managing large amounts of audio recordings of environmental sounds, for training and testing purposes. Many of these were recorded as actual field data rather than lab simulations.
  • Developing a feature extraction module able to capture the rich range of acoustic characteristics exhibited by a multitude of environmental sounds, which are much richer than the characteristics of speech and music.
  • Choosing and tuning the best machine learning algorithm for sound classification.
  • Making all this technology run on small embedded devices at a manageable and practical computational cost.
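The sketch promised above: a deliberately minimal two-stage pipeline, with log-mel features standing in for the “ears” and an off-the-shelf classifier for the “brain”. The feature set, classifier choice, file names and labels are illustrative assumptions, not Audio Analytic’s proprietary models, which are considerably richer:

```python
import numpy as np
import librosa  # audio loading and feature extraction
from sklearn.ensemble import RandomForestClassifier

def extract_features(path: str, sr: int = 16000) -> np.ndarray:
    """'Ears': summarise a clip as a fixed-length acoustic descriptor."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    # Log-mel spectrogram: energy in perceptually spaced frequency bands.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=40)
    log_mel = librosa.power_to_db(mel)
    # Collapse the time axis with mean and std so every clip maps to one vector.
    return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])

# 'Brain': learn acoustic patterns from labelled examples.
# train_paths and train_labels (e.g. "smoke_alarm", "glass_break",
# "background") are hypothetical placeholders for a labelled dataset.
X = np.stack([extract_features(p) for p in train_paths])
clf = RandomForestClassifier(n_estimators=100).fit(X, train_labels)

prediction = clf.predict([extract_features("unknown_clip.wav")])
```

The split mirrors the analogy directly: swapping in a different classifier changes the “brain” without touching the “ears”, and vice versa.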

The alternative approaches we see in the market are typically too simplistic for the task. The sound recognition algorithms used are mostly based solely on volume detection or auditory models, which is akin to ears without a brain: somehow able to characterise sounds, but unable to recognise them in a way that would generalise sufficiently across variations of the sounds of interest.
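For contrast, a volume-only detector amounts to little more than the following toy sketch (an illustration of the class of approach, not any specific vendor’s code). It fires on anything loud, so breaking glass, a slammed door and a hand clap are indistinguishable:

```python
import numpy as np

def volume_detector(audio: np.ndarray, frame_len: int = 1024,
                    threshold_db: float = -20.0) -> bool:
    """Fire whenever any frame's energy exceeds a fixed threshold.

    This is 'ears without a brain': it measures loudness but learns
    nothing about acoustic patterns, so all loud events look the same.
    """
    n_frames = len(audio) // frame_len
    frames = audio[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return bool(np.any(energy_db > threshold_db))
```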

Conclusions

Sound recognition is inherently a hard computational problem, yet when properly addressed it yields high value by informing people when sounds of interest occur while they are not present to hear them. As the Smart Home and Internet of Things market grows and develops, sound recognition is becoming recognised as an increasingly important source of valuable and actionable information. This means that any sound recognition technology deployed across connected devices in the home, workplace and general environment needs to be as accurate, reliable and efficient as possible.
We believe that Audio Analytic’s expertise and experience make us the best in the world at sound recognition, and we continually develop and evolve our technology for the benefit of our customers and end-users.

*****

Like this? You can subscribe to our blog and receive an alert every time we publish an announcement, a comment on the industry or something more technical.

About Audio Analytic

Audio Analytic is the pioneer of AI sound recognition software. The company is on a mission to map the world of sounds, giving consumer technology a sense of hearing. By transferring this sense of hearing to consumer products and digital personal assistants, we give them the ability to react to the world around us, helping satisfy our entertainment, safety, security, wellbeing and communication needs.

Audio Analytic’s ai3™ sound recognition software enables device manufacturers and chip companies to equip products with Artificial Audio Intelligence, recognising and automatically responding to our growing list of sound profiles.
