October 22, 2019
SANE 2019: Addressing the limitations of sound event detection metrics
Evaluation metrics currently in use to measure performance in Sound Event Detection (SED) tasks were inherited from the broader domain of information retrieval and may not sufficiently consider the specifics of sound recognition.
At SANE 2019, Audio Analytic’s research team will present its findings around the key issues that need to be addressed for a more robust, more versatile, more meaningful and more accurate evaluation metric to benefit academia and industry.
Why do we need an evaluation metric?
Evaluation metrics are a key element of the evolution of all technologies and products which are based on machine learning.
Performance progress – whether it’s throughout the various generations of a technology or across competing solutions – can only be measured and advertised on the basis of commonly understood and widely agreed upon evaluation metrics.
For example, the field of Automatic Speech Recognition (ASR) uses the Word Error Rate (WER) as a performance metric, which is widely accepted as a standard throughout both the research community and industry.
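To make the comparison concrete, WER is simply the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcript. A minimal sketch, using a standard dynamic-programming edit distance (the function name and example sentences are illustrative):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# "on" and "the" are missing from the hypothesis: 2 errors over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat mat"))
```

Because one number summarises performance regardless of the recogniser's internals, WER lets any two ASR systems be compared directly; this is exactly the kind of common currency SED currently lacks.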
However, the criteria used to evaluate model performance in Sound Event Detection (SED) still vary across academic publications and competitive data challenges.
In addition to this, sound recognition is becoming a much more prevalent feature in smart consumer devices and offers a rapidly expanding range of benefits, from smoke alarm and glass break detection to exciting use cases within smartphones and hearables.
Therefore a dedicated and practical metric to measure sound recognition performance is required for this area to progress further.
Audio Analytic is a world expert in the detection of a wide variety of sounds, and its software can benefit many applications ranging from security to context awareness in a range of devices.
At SANE 2019, a research workshop taking place at Columbia University on October 24, Audio Analytic’s research team will highlight issues that need to be addressed to create an evaluation framework and metric that is relevant to SED tasks.
The team will then present their research and reasoning behind a proposed metric, which will be known as the Polyphonic Sound Detection Score (PSDS).
Dr Sacha Krstulović, Director of Labs at Audio Analytic, said: “In machine learning, the performance of classification systems is often evaluated by Precision, Recall, F-score and Accuracy.
“These metrics have been used as de-facto standards to evaluate SED tasks in previous DCASE challenges. However, they can present significant drawbacks.
“In our poster, we outline three key areas that need to be addressed when evaluating the performance of SED models.
“These are: the definition of sound events from a technical perspective, the distinction between cross-triggers and false positives, and the independence from the operating point.
“This work will also outline our research and experiments related to developing the PSDS.”
Dr Krstulović added: “Audio Analytic is dedicated to delivering the world’s best sound recognition software, therefore our research in this area is very strong, fueled by years of practical experience in the field, and supported by very large and very relevant proprietary data sets.
“At SANE 2019, we will present our findings on the specific limitations of the existing metrics, as well as our proposed solution to the wider research community. It will be interesting to hear the views of peers and other experts.
“We also hope our discussions there and the publications we are preparing for the near future will lead to an evaluation methodology for SED that is more comprehensive and more relevant for both academia and industry.”
Three current issues with sound event detection criteria…
Compared with, for example, ASR, SED is a relatively young field of audio recognition. It was only natural that SED researchers initially applied generic metrics, inherited broadly from pattern recognition and information retrieval, to the evaluation of SED systems.
However, SED tasks have unique key characteristics. For example, the variability of different sound event types is much more complex than, say, the variability of voice identities and accents in speech recognition.
While speech always involves vocal tracts and structured language models, polyphonic sound events are born from a huge variety of sound production processes that can happen at random times.
1. The definition of sound events needs to be revisited
Audio Analytic’s abstract for SANE 2019 explains that there are inconsistencies in how true positives and false positives are currently defined across SED tasks.
To begin with, it is not always clear when sound events start and end.
For example, the sound of a dog barking can sometimes include several barks without any pauses in between. Depending on the person listening and labelling the recording, this may be annotated as a single event or as several.
Evaluation metrics should therefore be robust enough to allow for various interpretations around the structure of sound events.
In this area, the poster proposes looking at SED much more broadly by measuring the performance of sound event recognition for events that may span many audio frames, may have a variable duration and may include interruptions.
Çağdaş Bilen, Senior Research Engineer at Audio Analytic, explains: “In many cases, the definition of true positives can change the way in which sound events are labelled.
“In other words, there can be more than one way to label a sound event: it may be a true positive in one labelling standard, and a false positive in another.
“Our approach involves being more flexible in the definitions of true positives and false positives thus opening the door to measuring the performance in more accurate and more meaningful ways.”
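One way such flexibility can be expressed is to accept a detection as a true positive when it overlaps the labelled event sufficiently on both sides, instead of demanding exact onset and offset matches. The sketch below is illustrative only; the threshold names and default values are assumptions, not the formal PSDS definition:

```python
def is_true_positive(detection, ground_truth, det_tolerance=0.5, gt_coverage=0.5):
    """Count a detection as a true positive if enough of the detection lies
    inside the labelled event AND enough of the event is covered.
    Intervals are (start, end) in seconds; thresholds are illustrative."""
    d_start, d_end = detection
    g_start, g_end = ground_truth
    overlap = max(0.0, min(d_end, g_end) - max(d_start, g_start))
    det_ratio = overlap / (d_end - d_start)  # fraction of the detection inside the event
    gt_ratio = overlap / (g_end - g_start)   # fraction of the event that was detected
    return det_ratio >= det_tolerance and gt_ratio >= gt_coverage

# A detection from 1.0s to 3.0s against a barking event labelled 1.5s-3.5s:
print(is_true_positive((1.0, 3.0), (1.5, 3.5)))  # True: 1.5s of overlap on each measure
```

Under such a criterion, whether an annotator split a bout of barking into one event or several matters far less, since partial but substantial overlap still counts.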
2. Cross-triggers should be considered a special case of false positives
A false positive is essentially an incorrect result. In some cases, a false positive is due to a sound that is acoustically similar to another sound of interest for the SED task.
A baby cry, for example, can sometimes be mistaken for a person speaking, when speech is itself another class of interest for detection.
Such incidents, known as cross-triggers, should be assessed further when evaluating models for both synthetic benchmarking tasks and real-world field performance.
Dr Bilen added: “If a sound being detected is a false positive, but very similar to another class of interest for detection, then this needs to be explored further.
“We then need to ask: ‘Why is one type of sound being detected, but being identified as a different class? Is this happening often? And most importantly, which modifications of the sound recognition system will correct this problem?’
“The distinction of cross-triggers from more generic false positives allows for finer analysis of the biases arising from the data. Interestingly, our findings reveal that this plays a significant role in the improvement of real-world performance.”
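The distinction can be operationalised by checking whether a false positive overlaps a labelled event belonging to a different class of interest. A minimal sketch (the function name, overlap threshold and class labels are illustrative assumptions):

```python
def classify_false_positive(detection, detected_class, ground_truth_events,
                            min_overlap=0.5):
    """Label a false positive as a 'cross-trigger' when it mostly overlaps a
    ground-truth event of a *different* class of interest; otherwise treat it
    as a generic false positive. Thresholds and names are illustrative."""
    d_start, d_end = detection
    for event_class, (g_start, g_end) in ground_truth_events:
        if event_class == detected_class:
            continue  # overlap with its own class would make it a true positive
        overlap = max(0.0, min(d_end, g_end) - max(d_start, g_start))
        if overlap / (d_end - d_start) >= min_overlap:
            return "cross-trigger"
    return "false positive"

# A 'baby_cry' detection that actually sits on a labelled 'speech' event:
events = [("speech", (2.0, 6.0)), ("glass_break", (10.0, 10.5))]
print(classify_false_positive((2.5, 4.5), "baby_cry", events))  # cross-trigger
```

Splitting the false-positive count this way shows whether errors stem from confusion between classes of interest or from genuinely unrelated background sounds, which call for different fixes.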
3. Algorithm performance needs to be assessed independently from the operating point
The same system under different settings, for example different decision thresholds defining more conservative or more sensitive operating points, may yield different performance measurements even under the same metric.
If an operating point is chosen badly, randomly or on the basis of unclear criteria, a better acoustic model may be overlooked.
In other words, the proposed SED metric should concentrate on comparing the data modeling power of the various systems under investigation, rather than their respective choices of operating points.
An evaluation that spans a range of operating points is therefore much more robust, and more meaningful, when assessing the performance of an algorithm.
“With our proposed metric, it becomes possible to evaluate the performance of an algorithm across a wide range of tradeoffs,” said Dr Bilen.
“For example, let’s take the point where the number of false positives may be doubled because a more permissive operating point was enforced by a particular application.
“One system may gain a lot more true positives whereas another system may not gain as many – a standardized metric can therefore help make better-informed comparisons.”
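Evaluating across operating points amounts to sweeping the decision threshold and recording a performance point at each setting, as in an ROC curve. A minimal sketch, with made-up scores and labels (this illustrates the general idea of threshold-independent evaluation, not the PSDS computation itself):

```python
def roc_points(scores, labels, thresholds):
    """Sweep decision thresholds and report (false-positive rate,
    true-positive rate) at each operating point, so systems can be
    compared across the whole curve rather than at a single threshold."""
    points = []
    pos = sum(labels)
    neg = len(labels) - pos
    for t in thresholds:
        predictions = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(predictions, labels))
        fp = sum(p and not l for p, l in zip(predictions, labels))
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]   # hypothetical per-clip confidences
labels = [True, True, False, True, False, False]
for fpr, tpr in roc_points(scores, labels, [0.2, 0.5, 0.7]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A system whose curve dominates another's across the sweep is the stronger model regardless of which single operating point an application later enforces.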
In line with these topics, the research team’s ongoing work is shaping the foundation of an improved and more meaningful evaluation metric for SED.
This will significantly empower academic research as well as the development and commercialisation of high-performance sound recognition technology – a market where Audio Analytic is a worldwide leader.