TL;DR – All sounds are not created equal. When it comes to training and evaluating sound recognition systems that deliver consistently good performance in consumer applications, you cannot rely on recordings downloaded from the internet (YouTube, Freesound, etc.), due to a range of technical and legal limitations.

“On two occasions I have been asked, ‘Pray, Mr Babbage, if you put into the machine wrong figures, will the right answers come out?’ … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.”

Charles Babbage, Passages from the Life of a Philosopher

Although it can be accessed at the click of a mouse, audio and video content shared online is made available for specific purposes only – typically social networking, entertainment and media production. It wasn’t uploaded so that researchers and engineers could use it to build sound recognition systems.

In many cases, commercial use of this content is prohibited, and copyright is held by the person or company who created, filmed or recorded it. There are also various technical limitations to using internet-downloaded recordings, which is why a detailed and considered approach is required when training and evaluating sound recognition technology that is fit for the real world.

In this blog, I look at six of these technical limitations and explain the impact that each can have on sound recognition systems.

***

Depending on the evaluation circumstances, an internet-downloaded sound event may be correctly classified, misclassified or missed altogether by a sound recognition system. The outcome depends heavily on the distance between the playback speaker and the microphone, the type of sound, its frequency content and the effects applied to it. Internet-downloaded audio files are unknown quantities: information about the recording environment and processing is unclear. These inconsistencies make the files unsuitable for training, and undesirable for evaluating, a sound recognition system fit for consumer applications.

The inconsistencies and problems introduced by internet-downloaded audio recordings span many factors, as highlighted in figure 1 below, which compares real sounds in real environments with files found on the internet. The combination – or mixture – of these factors makes it difficult to pinpoint the root cause (or causes) of any resulting issues in the source file, leading to mistakes in the evaluation and training of sound recognition systems.

 

Figure 1: Using downloaded and unknown audio recordings for training or evaluation introduces a range of distortions. This ‘pipeline of distortions’ is clearly illustrated when comparing evaluations (i) in real-world conditions and (ii) using an audio file from the internet

1. Subject matter diversity and variability of sounds

A significant limitation of training with internet-downloaded audio files is that they may not represent the diverse range of real sounds. Sometimes this is because the person training the system doesn’t have information about what was actually recorded, or because the file was tagged according to the uploader’s judgement rather than a consistent taxonomy. As mentioned at the start, these files are often unknown quantities, and this lack of metadata may mean that systems are trained using whatever is available – including a large proportion of noise – rather than what is representative of the real world.

Not only do ML and data engineers need to understand the quality and variability of their training data, they also need to fully understand the target application of the end product in order to plan the data collection stage carefully.

To highlight the impact of using narrow or unknown data, consider this example: if a system built to detect a glass window being broken is trained only on recordings featuring one type of glass – because those were the only recordings available online – then that system may not cope with the multitude of other types of glass window present in consumers’ homes. It becomes even worse if the category tagged as ‘glass’ includes sounds of kitchen glasses being broken, or merely ‘chinked’ for a toast.

Fortunately, this does not happen with Audio Analytic’s data, because we know precisely what has been recorded, and all the sounds are labelled according to a coherent taxonomy. We also make sure that the recorded sound variations are consistent with what happens in real life: ‘glass’ really means windows in your home. All of this information is stored in our Alexandria™ dataset, which contains 40 million labelled recordings across 1,200 sound classes and includes over half-a-billion metadata points covering audio events such as car horns, bicycle bells, smoke alarms, window glass breaks, people shouting and laughter.
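As an illustration only – this is a hypothetical schema sketched for this post, not Audio Analytic’s actual Alexandria™ format – a consistently labelled recording might carry metadata along these lines:

```python
from dataclasses import dataclass, field

@dataclass
class LabelledRecording:
    """Hypothetical metadata for one recording in a curated dataset."""
    file_path: str         # lossless source audio, e.g. a WAV file
    sound_class: str       # drawn from a fixed taxonomy, not free-text tags
    taxonomy_version: str  # so labels can be audited as the taxonomy evolves
    source_object: str     # e.g. "window pane", not just "glass"
    environment: str       # description of the recording space
    microphone: str        # capture hardware, for channel-effect analysis
    spl_dba: float         # calibrated sound pressure level of the event
    tags: list = field(default_factory=list)

example = LabelledRecording(
    file_path="recordings/window_break_0001.wav",   # placeholder path
    sound_class="glass_break.window",
    taxonomy_version="2.3",
    source_object="4mm annealed window pane",
    environment="furnished living room",
    microphone="measurement mic, flat 20 Hz-20 kHz",
    spl_dba=98.5,
)
```

The point of a structure like this is that every field is known at capture time; none of it can be recovered from an anonymous internet download.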

 

2. Random recording devices

Most internet-downloaded audio files are recorded on the simple microphones found in smartphones, laptops, tablets and other personal devices.

Sound recognition systems must tolerate the channel distortions introduced by consumer microphones: their frequency responses vary significantly, they lack low and high frequencies, and additional audio processing is applied to their output. When internet-downloaded audio files are used – inappropriately – to evaluate or train a sound recognition system, such channel effects are overlaid and the reproduced audio is not true to life. Further problems arise during playback, when system distortions, codec distortions and playback environments are combined with these effects.
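To make the band-limiting part of this concrete, here is a minimal sketch (Python with NumPy/SciPy; the cut-off frequencies are illustrative, not measured from any particular device) of how a consumer microphone’s restricted frequency range can be approximated on a clean recording:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def simulate_consumer_mic(audio: np.ndarray, sample_rate: int) -> np.ndarray:
    """Crudely approximate a phone/laptop mic by removing low and high
    frequencies. Real devices also apply AGC and noise reduction, which
    this sketch ignores."""
    # Illustrative cut-offs: roll off below 200 Hz and above 7 kHz.
    sos = butter(4, [200, 7000], btype="bandpass", fs=sample_rate, output="sos")
    return sosfilt(sos, audio)

# Example: a 1-second test signal at 16 kHz with low, mid and high tones.
sr = 16000
t = np.arange(sr) / sr
clean = 0.3 * (np.sin(2 * np.pi * 100 * t)     # attenuated (below 200 Hz)
               + np.sin(2 * np.pi * 1000 * t)  # passes through
               + np.sin(2 * np.pi * 7500 * t)) # attenuated (above 7 kHz)
degraded = simulate_consumer_mic(clean, sr)
```

A file recorded on such a device has already lost the content outside the pass band, and no later processing step can restore it.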

 

3. Codec compression

The application of lossy codecs to compress audio is a major factor affecting the quality of audio files from the internet. Many such files are encoded using lossy codecs such as MP3, AAC or Vorbis. Once encoded, a portion of the audio information is lost, and the quality is lower than that of lossless formats such as WAV or FLAC. This may be fine if the decoded audio is fed directly from a device’s audio channel to a sound recognition module – for example, when sound recognition has to deal with a security camera’s AAC audio stream. However, encoding/decoding effects can accumulate or amplify in undesired ways if decoded files are propagated further down the chain for system testing or data augmentation.

These codecs are specifically designed for encoding music and speech, and they often remove information that humans cannot perceive. They may, for example, use frequency masking, which can make it difficult for a machine to detect specific sound events, especially those with a distinctive time-frequency structure.

Figure 2: Spectrograms of high-quality baby cry recording (left) and MP3 encoded version (right)
Figure 3: Spectrograms of high-quality dog bark recording (left) and AAC encoded version (right)

Figures 2 and 3 demonstrate this point clearly, showing spectrograms of original high-quality recordings alongside their encoded versions. The black zones represent holes in the frequency content and band-limiting effects caused by the encoder; these are redundant for human listeners but are likely to affect the performance of a sound recognition system if used without care.
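You can reproduce this kind of comparison yourself. A minimal sketch, assuming `ffmpeg` is installed, a local mono file `baby_cry.wav` exists (file names are placeholders), and Python with SciPy:

```python
import subprocess
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

# Round-trip the original through a low-bitrate MP3 encode/decode.
subprocess.run(["ffmpeg", "-y", "-i", "baby_cry.wav",
                "-b:a", "64k", "baby_cry.mp3"], check=True)
subprocess.run(["ffmpeg", "-y", "-i", "baby_cry.mp3",
                "baby_cry_roundtrip.wav"], check=True)

sr, original = wavfile.read("baby_cry.wav")       # assumes mono audio
_, roundtrip = wavfile.read("baby_cry_roundtrip.wav")

# Compare average energy per frequency band; the encoder's band-limiting
# and masking show up as bands with far less energy after the round trip.
f, _, s_orig = spectrogram(original.astype(float), fs=sr)
_, _, s_rt = spectrogram(roundtrip[:len(original)].astype(float), fs=sr)
loss_db = 10 * np.log10((s_orig.mean(axis=1) + 1e-12) /
                        (s_rt.mean(axis=1) + 1e-12))
print("Bands with >10 dB energy loss:", f[loss_db > 10])
```

On typical material, the losses concentrate in exactly the high-frequency and masked regions visible as black zones in the figures above.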

In addition to encoding and normalisation, other audio effects may have been applied to a file, and these are difficult to identify just by listening to the recording. Such effects include the following (a minimal compressor sketch follows this list):

  • Dynamic range compression: reduces the volume of the loud parts of the audio and amplifies the quiet parts, compressing the signal’s dynamic range
  • Equalisation: further alters the frequency response of the recording
  • Synthesisers: add artificial sounds on top of the original recording, or synthesise the sound entirely
  • Noise reduction: can corrupt the target sounds.
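As a sketch of the first of these, here is a minimal hard-knee dynamic range compressor (NumPy only; the threshold and ratio values are illustrative):

```python
import numpy as np

def compress(audio: np.ndarray, threshold_db: float = -20.0,
             ratio: float = 4.0) -> np.ndarray:
    """Hard-knee, sample-by-sample compressor sketch. Real compressors add
    attack/release smoothing and make-up gain; this shows only the core
    idea: levels above the threshold are scaled down, squeezing the
    signal's dynamic range."""
    eps = 1e-12
    level_db = 20 * np.log10(np.abs(audio) + eps)
    over = np.maximum(level_db - threshold_db, 0.0)   # dB above threshold
    gain_db = -over * (1.0 - 1.0 / ratio)             # reduce the excess
    return audio * 10 ** (gain_db / 20)
```

After such processing, the relative loudness of the loud and quiet parts of a sound event – itself an acoustic cue – no longer matches reality.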

 

4. Lack of Sound Pressure Level (SPL) information

Most internet-based recordings lack Sound Pressure Level (SPL) information for the acoustic event. SPL indicates the natural loudness at which a sound physically occurs in the real world: when a sound is recorded, the analogue signal is converted into a digital waveform, and that waveform is directly related to the SPL of the acoustic event.

The underlying question is whether, for example, a recording of a baby crying will be played back as loud as a real baby in a real room. If you are not training a model on a correct representation of the target sound, the performance of the model will suffer, often missing occurrences of that sound. If you evaluate a high-quality system using the same recording, you may be playing it at a volume that doesn’t represent its natural loudness and, as a result, the system may not confidently recognise it, because a key acoustic feature is missing.

To play back a recording at exactly the same SPL, we must know the sensitivity of the microphone as well as any other gains applied at each stage of the audio path. In general, this requires a calibration signal with a known SPL to establish that mapping.
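Given such a calibration signal, the mapping itself is simple. A minimal sketch, assuming mono recordings and a reference tone captured at a known level (for instance a 94 dB SPL calibrator) on the same microphone and gain settings:

```python
import numpy as np

def rms(x: np.ndarray) -> float:
    """Root-mean-square amplitude of a signal."""
    return float(np.sqrt(np.mean(x.astype(float) ** 2)))

def estimate_spl(event: np.ndarray, cal_tone: np.ndarray,
                 cal_spl_db: float = 94.0) -> float:
    """Map digital amplitude to SPL using a reference tone recorded at a
    known level with the SAME microphone and gain settings. Without that
    reference, the absolute level of the event cannot be recovered."""
    return cal_spl_db + 20 * np.log10(rms(event) / rms(cal_tone))

# e.g. an event whose RMS is twice the calibrator's sits ~6 dB above 94 dB SPL.
```

An internet download carries no such reference, so the link between its digital amplitude and the real-world loudness is broken.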

Furthermore, the amplitude of the microphone signal is often boosted before recordings are uploaded, to increase the volume on consumer devices. This makes it infeasible to play back the recording at its correct natural sound level.

Because all of our audio data is primary-sourced, we embed the SPL information in the recording alongside other important metadata. This is very useful for system evaluation, because we can reproduce the evaluation sounds at their original, natural loudness – just as they would occur in the environment around a particular consumer device.

 

5. Limitations of playback devices

When evaluating a sound recognition system in a simulated environment, accurate playback is required to ensure the accuracy of that evaluation. To achieve this, you must use loudspeakers with a flat frequency response to minimise distortions. This is not the case for most consumer-grade speakers found in laptops, smartphones and many other consumer devices.

Figure 4: Examples of measured speaker frequency responses of two laptops and a high-quality speaker

In figure 4, you can see the frequency responses of a high-quality loudspeaker and two consumer-grade laptop speakers; the latter struggle to reach the high sound pressure levels (SPLs) of sounds in the real world, and their low-frequency content (below 500 Hz) is significantly reduced.

In addition, Laptop 1 struggles to reproduce high frequencies above 4,000 Hz. In comparison, the high-quality loudspeaker can reach realistically high SPLs with a flat frequency response, meaning that it can play back sounds at their natural volume, without distortion, across the whole frequency range.

Listeners tolerate such frequency response deviations in consumer-grade speakers because they are optimised for speech and music playback, whereas more general sound playback requires different measurements and calibration.

For example, the speaker must reproduce sufficiently high SPLs across the full frequency range, which is beyond the reach of laptop speakers: the maximum measured SPLs are 84 dBA for Laptop 1 and 86 dBA for Laptop 2.

Figure 5 shows the distribution of SPLs for four different types of audio event, to illustrate the importance of SPL. For accurate sound reproduction, a speaker must be capable of reaching these levels, which are generally outside laptop speakers’ range. Reaching levels of 100 dB SPL is possible in principle if the volume is cranked all the way up, but this results in significant sound distortion.

Figure 5: Distributions of SPL for various types of sound events with estimated maximum SPL of consumer-grade speakers

In addition to poor-quality speakers, some devices – especially laptops running the Windows operating system – apply additional audio processing. Advanced users can disable some of this processing; however, the only way to guarantee that no unwanted processing is applied by hidden or undocumented processing modules is to conduct a test measurement over the whole audio chain, as highlighted in figure 1(ii).
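One common way to run such a test measurement is to pass a swept sine through the whole playback-plus-capture chain. A minimal sketch, assuming the `sounddevice` library, a measurement microphone, and a sound card that supports simultaneous play and record (latency alignment and absolute calibration are omitted):

```python
import numpy as np
import sounddevice as sd
from scipy.signal import chirp

SR = 48000
DUR = 5.0
t = np.arange(int(SR * DUR)) / SR
# Logarithmic sweep covering the audible range.
sweep = 0.5 * chirp(t, f0=20, f1=20000, t1=DUR, method="logarithmic")

# Play the sweep through the speaker under test and record it back
# through the measurement microphone.
recorded = sd.playrec(sweep.astype(np.float32), samplerate=SR,
                      channels=1, blocking=True).flatten()

# Magnitude response of the whole chain: recorded spectrum over played
# spectrum. Note: this sketch ignores the play/record latency offset.
eps = 1e-12
response_db = 20 * np.log10(np.abs(np.fft.rfft(recorded)) /
                            (np.abs(np.fft.rfft(sweep)) + eps) + eps)
freqs = np.fft.rfftfreq(len(sweep), 1 / SR)
# Large dips or peaks in response_db reveal speaker, room or hidden
# processing colouration across frequency.
```

If the measured response is far from flat, the playback chain is altering the test sounds before they ever reach the system under evaluation.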

Here again, the point is to make sure that the sounds haven’t deviated too much from the application domain when testing or training: what you want is recognition of real sounds from real devices, not a playback recogniser.

 

6. Playback room effects

Think about shouting in a cave compared to a wide-open space. Your voice will sound different depending on where you are. The same applies to any recorded sound: the room response affects the recording.

Many audio files are recorded in a reverberant environment, as opposed to an anechoic chamber, where the sound is not affected by the room. When training a sound recognition system, these room responses need to be fully understood so that the model is robust enough to adapt to the different environments in which consumers will use it.

When playing back such files in a regular room to evaluate a sound recognition system, you inadvertently add another room response on top of the one already present in the recording. If you are using an internet-downloaded file whose original room response is not understood in the first place, this causes further problems when testing: you have to contend with both the room where you are playing back the sound and the room where it was originally recorded.

As a result, it is difficult to determine whether a performance deterioration stems from the room effects of the space where the audio was originally recorded or of the room where it is being played back.
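The stacking effect is easy to simulate. A minimal sketch, assuming NumPy/SciPy and two room impulse responses (RIRs) stored as WAV files – all file names here are placeholders:

```python
import numpy as np
from scipy.signal import fftconvolve
from scipy.io import wavfile

# A dry (anechoic) recording and two room impulse responses.
sr, dry = wavfile.read("dog_bark_dry.wav")
_, rir_recording_room = wavfile.read("rir_room_a.wav")
_, rir_playback_room = wavfile.read("rir_room_b.wav")

# One room response: what a well-controlled recording would contain.
in_room_a = fftconvolve(dry.astype(float), rir_recording_room.astype(float))

# Two stacked responses: an internet file (already reverberant) played
# back in your test room. The second convolution smears the sound again,
# in a way no real single-room event would exhibit.
double_reverb = fftconvolve(in_room_a, rir_playback_room.astype(float))
```

The doubly convolved signal has reverberation statistics that never occur in a genuine single-room event, which is precisely why such evaluations mislead.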

To understand the detrimental effect of these multiple layers of room responses, this video from Alvin Lucier on YouTube lets you hear and appreciate the impact.

So, one layer of room effects is fine – indeed, a high-performance sound recognition system should be robust across the vast range of environmental responses observed in consumer applications, both indoors and outdoors. Overlaying many room effects, by contrast, can end up sounding as strange and unreal as Alvin Lucier’s experiment.

***

Sound recognition is a field that is radically different from other types of machine learning, such as image, facial, speech or music recognition. But it faces the same issue that all machine learning systems encounter: if you put garbage in, you get garbage out. The particular challenge for sound recognition is that our own sense of hearing limits our ability, as humans, to recognise what is garbage.

I hope you now understand and appreciate why diverse, high-quality data – as well as the ability to capture, specify and manage it – is so important to machine learning for real-world sound recognition systems running on consumer devices.

This blog is an extract from my whitepaper “Why Real Sounds Matter for Machine Learning.” You can download the full whitepaper here.
