October 31, 2019

Introducing the Polyphonic Sound Detection Score, a robust evaluation framework and metric for sound recognition

Audio Analytic’s research team has defined a robust metric to evaluate the performance of polyphonic sound event detection systems in a technical paper that has been accepted by the International Conference on Acoustics, Speech and Signal Processing 2020 (ICASSP 2020).

Audio Analytic’s submission to ICASSP 2020: A Framework for the Robust Evaluation of Sound Event Detection.

The PSDS source code can be downloaded from Audio Analytic’s GitHub repository here.

Members of the Audio Analytic Labs team presented a preview of this research to academics and other stakeholders during the poster session at SANE 2019.

Sacha Krstulović, Director of Audio Analytic Labs, said: “The initial response from the community at SANE 2019 and DCASE has been very positive.

“The participants, particularly those who are specialists in the field of sound recognition, acknowledged that current metrics have their limitations and that a dedicated metric is needed to evaluate polyphonic sound event detection tasks.

“Our hope is that the results of our research work on developing a more robust, more versatile, more meaningful and more accurate evaluation framework will help steer worldwide research in promising directions for the future.”

The Polyphonic Sound Detection Score (PSDS)

Audio Analytic has identified three key limitations that need to be addressed for an evaluation metric to be meaningful and robust when detecting sound events from multiple classes (for example glass break, dog bark etc.), which can occur simultaneously.

  • Redefining sound event detection. Valid sound event detections should be defined by intersection with the ground truth labels, rather than collars around start and end times.
  • Consideration for cross-triggers. Cross-triggers, when treated as a special case of false positives, yield extra insight into data properties, and also help error analysis.
  • Dependence on operating point. The metric should be independent of the system’s operating point, while remaining relevant to user experience.
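
To make the first point concrete, here is a minimal Python sketch contrasting the two definitions. It is not the PSDS reference implementation: the function names, the fixed 0.2 s collar and the 0.5 coverage threshold are assumptions chosen for illustration, and the detections are assumed to belong to the same class as the label and not to overlap each other.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (onset, offset) in seconds


def overlap(a: Interval, b: Interval) -> float:
    """Duration shared by two intervals (0.0 if they do not intersect)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def collar_match(det: Interval, gt: Interval, collar: float = 0.2) -> bool:
    """Collar rule (simplified): onset and offset must both fall within
    `collar` seconds of the label boundaries. The 0.2 s value is an assumption."""
    return abs(det[0] - gt[0]) <= collar and abs(det[1] - gt[1]) <= collar


def intersection_match(gt: Interval, detections: List[Interval],
                       min_coverage: float = 0.5) -> bool:
    """Intersection rule (simplified): the label counts as detected when the
    detections cover at least `min_coverage` of its duration.
    The 0.5 threshold is a hypothetical value."""
    covered = sum(overlap(gt, det) for det in detections)
    return covered / (gt[1] - gt[0]) >= min_coverage


# One 4-second label, detected as two fragments separated by a short gap.
label = (1.0, 5.0)
fragments = [(1.1, 2.9), (3.1, 4.9)]

collar_hit = any(collar_match(det, label) for det in fragments)
intersection_hit = intersection_match(label, fragments)
print(collar_hit, intersection_hit)
```

In this example the collar rule rejects both fragments, because neither matches the onset and the offset of the label at once, whereas the intersection rule accepts the label as detected since the fragments together cover most of its duration.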

In the paper submitted to ICASSP, all three areas above are addressed by design in the new evaluation framework.

Given the new definition of true positives (TPs) and false positives (FPs) based on intersection, the effective TP rate and effective FP rate are calculated across a set of system sensitivity settings and plotted on a receiver operating characteristic (ROC) curve.

The normalized area under this curve is the global metric which allows researchers to evaluate the system across all possible operating points for a particular SED task, and is called the Polyphonic Sound Detection Score (PSDS).

The PSDS values are between zero and one, where better performing systems achieve higher scores.
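
As a rough illustration of how a normalized area under a curve stays between zero and one, the Python sketch below integrates a list of (FP rate, TP rate) operating points up to a maximum FP rate and divides by that maximum. It is a simplified sketch, not the published PSDS computation: the function name `normalized_auc`, the `max_fpr` parameter, the linear interpolation between points and the horizontal extension after the last point are all assumptions.

```python
from typing import List, Tuple


def normalized_auc(points: List[Tuple[float, float]], max_fpr: float) -> float:
    """Area under a (FP rate, TP rate) curve on [0, max_fpr], divided by max_fpr.

    Assumes TP rates lie in [0, 1], so the result also lies in [0, 1].
    """
    pts = sorted(points)
    if pts[0][0] > 0.0:
        # Anchor the curve at the origin when no point sits at FP rate 0.
        pts = [(0.0, 0.0)] + pts

    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the segment at max_fpr using linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0  # trapezoid rule

    last_x, last_y = pts[-1]
    if last_x < max_fpr:
        # Hold the last TP rate constant up to max_fpr (an assumption).
        area += (max_fpr - last_x) * last_y

    return area / max_fpr


# Illustrative operating points only, not results from the paper.
score = normalized_auc([(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)], max_fpr=1.0)
print(score)
```

A system that reaches a TP rate of one with no false positives would score one under this sketch, and a system that never detects anything would score zero.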

Results from initial experiments

To assess the evaluation framework, Audio Analytic’s research team used three systems which are publicly available from the DCASE challenge 2019.

One was the baseline system (System 1), and the other two were entered by challenge participants who ranked first and fourth in the challenge (System 2 and System 3 respectively).

Dr Krstulović explains: “In this first set of results, we wanted to assess the impact of defining sound event detection by using intersections against the ground truth labels.

“The team computed the standard DCASE F-Score by assessing true positives and false positives with collars, which is how the challenge has been assessed previously.

“They then compared this to the intersections approach, which is more robust against discontinuities in detections or ground truth labels.”

He added: “The impact on F-score was significant insofar as lots of false positives were reinstated as legitimate true positives, which supports our point that sound event detection should be robust against interruptions in both the detections and the labels.

“Indeed, when looking at the performance with TPs and FPs based on collars, the three systems appear to be performing very differently. Yet under a definition of TPs and FPs based on intersection and tolerant of interruptions, they appear to be performing quite similarly.

“Thus, the conclusion on which system performs best is being modified by a different and more robust perspective on the definition of TPs and FPs.”

Diagram 1. F-Score using a collar-based definition (left) versus intersections (right).

As these experiments were conducted with a given operating point, researchers delved further into system performance across a range of operating points (Diagram 2).

Dr Krstulović added: “In this scenario, we suddenly see that although System 2 still appears to be performing best overall, System 1 and System 2 might have different advantages at different operating points.

“When we look at the first plot of Diagram 2, System 1 may yield better results when low FP rates are desired. System 2 seems to be better for applications which are ready to trade FP performance against higher TP requirements. This evaluation framework finds the best system in conjunction with the type of operating point required by the application.

“Accounting for the cross-triggering behaviour of the various systems also brings another dimension. For example, System 2 could be preferred to System 1 at high TP operating points if it was particularly important for the application to avoid confusion between the classes recognised by the system.”

Diagram 2. PSD-ROC curves for the three systems under evaluation. Tables show the PSDS, i.e. the area under the ROC curve. This diagram illustrates the effect of accounting for cross-triggers in the final metric.

Why is this significant?

This evaluation framework allows researchers and product engineers to find the best system for a given application. In other words, the metric allows researchers to factor in the application requirements when looking for the best system.

For example:

  • Which is the best system at a particular TP/FP trade-off?
  • How important are cross-triggers for the system?
  • How stable does the classification performance remain across multiple sound classes?

The influence of these factors on system selection can be controlled in the evaluation process to analyse how the performance may change across different application requirements.
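
The idea that the preferred system can change with the application's FP tolerance can be sketched in a few lines of Python. Everything here is hypothetical: the helper names, the `fpr_budget` parameter and the two curves are invented for illustration and are not results from the paper or from the DCASE systems.

```python
from typing import Dict, List, Tuple

Curve = List[Tuple[float, float]]  # (FP rate, TP rate) operating points


def tpr_at_budget(curve: Curve, fpr_budget: float) -> float:
    """Highest TP rate among operating points whose FP rate fits the budget."""
    feasible = [tpr for fpr, tpr in curve if fpr <= fpr_budget]
    return max(feasible, default=0.0)


def best_system(curves: Dict[str, Curve], fpr_budget: float) -> str:
    """Name of the system with the highest TP rate within the FP budget."""
    return max(curves, key=lambda name: tpr_at_budget(curves[name], fpr_budget))


# Made-up curves: system_a is stronger at low FP rates, system_b at high ones.
curves = {
    "system_a": [(0.05, 0.40), (0.20, 0.55), (0.50, 0.60)],
    "system_b": [(0.05, 0.25), (0.20, 0.60), (0.50, 0.80)],
}

strict_choice = best_system(curves, fpr_budget=0.10)
relaxed_choice = best_system(curves, fpr_budget=0.50)
print(strict_choice, relaxed_choice)
```

With these invented numbers, a strict FP budget selects `system_a` while a relaxed one selects `system_b`, which mirrors the point that the choice of system depends on the operating point the application requires.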

Dr Krstulović added: “With this framework, more detailed insights can be obtained from the three systems under evaluation.

“In some scenarios, System 2 is performing better at some operating points whereas the other two systems are performing better in other situations.

“So there is no clear winner across all situations, as the best system may depend on the application requirements which are driven by the end-user experience that you want to deliver.

“This is helpful because if you know the requirements that the system must meet for the application, then the evaluation can find the best system based on these requirements. In fact, different requirements may lead to selecting different systems.

“This is also why it is important to know the product performance requirements or, in other words, the evaluation conditions. The framework gives you the opportunity to adapt your evaluation to your product requirements.”

You can find the full technical paper on the Polyphonic Sound Detection Score, along with links to GitHub and the Jupyter Notebook, here.