February 1, 2021
Latest scientific paper supports universal adoption of PSDS as definitive SED metric
For the second year running we will be presenting a paper at ICASSP (IEEE International Conference on Acoustics, Speech and Signal Processing), the world’s largest conference on signal processing and its applications. The event will take place virtually between June 6th and June 12th.
At last year’s ICASSP, we introduced our Polyphonic Sound Detection Score (PSDS) as a more sophisticated metric for analysing sound event detection (SED) systems. It was subsequently adopted by the annual DCASE Challenge organisers alongside the existing F1-score metric for Task 4. This year, our paper titled ‘Improving Sound Event Detection Metrics: Insights from DCASE 2020’, co-authored with Nicolas Turpault and Romain Serizel at Université de Lorraine, presents an in-depth analysis of how PSDS can uniquely inform users about SED system performance in comparison with conventional metrics.
The insights presented in the paper further support the universal adoption of PSDS within academic and industrial research teams. We also hope to see PSDS adopted by the DCASE Challenge organisers as the primary metric for the 2021 Task 4 challenge, so that participants can submit their best overall system instead of several slightly differently tuned versions of the same system.
What makes PSDS an improvement over the conventional metrics?
For more details about PSDS formulation, please see our previous blog post.
Overall performance metrics, such as the F1-score, are typically computed from the output of two well-known event validation criteria borrowed by the SED community from other fields of research. These criteria are collar-based or segment-based, and they identify the number of correct detections (i.e., true positives) and incorrect detections (i.e., false positives). The collar-based criterion treats events of different durations with different levels of strictness, thus over-penalising long sounds with respect to short ones. The segment-based criterion, on the other hand, lacks precision and may be application-dependent, because the chosen segment size can favour sounds whose duration is comparable to the segment length.
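To make the contrast concrete, here is a minimal Python sketch of the two conventional criteria. The function names, collar width, segment size and event times are all made-up illustrative values, not part of any standard toolkit. It shows how a fixed onset collar rejects a long event detected only half a second late, while a segment-based count judges the same detection almost entirely correct:

```python
def collar_match(ref, det, onset_collar=0.2):
    """Collar-based check (simplified): the detection onset must fall within
    a fixed collar of the reference onset. Real implementations also apply
    an offset condition; this sketch keeps only the onset test."""
    return abs(det[0] - ref[0]) <= onset_collar

def segment_match(ref, det, seg=1.0, t_max=10.0):
    """Segment-based check (simplified): split the timeline into fixed-size
    segments and compare per-segment activity of reference and detection.
    Returns (true-positive segments, false-positive segments)."""
    n = int(t_max / seg)
    ref_act = [ref[0] < (i + 1) * seg and ref[1] > i * seg for i in range(n)]
    det_act = [det[0] < (i + 1) * seg and det[1] > i * seg for i in range(n)]
    tp = sum(r and d for r, d in zip(ref_act, det_act))
    fp = sum(d and not r for r, d in zip(ref_act, det_act))
    return tp, fp

# An 8-second reference event detected with a 0.5 s late onset:
long_ref, long_det = (1.0, 9.0), (1.5, 9.0)
print(collar_match(long_ref, long_det))   # False: rejected by the collar
print(segment_match(long_ref, long_det))  # (8, 0): near-perfect per segment
```

The same 0.5-second onset error that fails the collar test entirely would be negligible for a short event, which is precisely the asymmetry described above.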
In our study, we compare these approaches against the validation criterion introduced by PSDS: intersections, i.e., the amount of overlap between system outputs and the reference annotations. We show its superior flexibility: it can easily be adjusted to enforce levels of constraint similar to either the segment-based or the collar-based criterion (see Figure 1 from the paper below). In contrast to the conventional criteria, the intersection-based logic adapts by design to any sound event duration, and it treats short and long audio events fairly.
Furthermore, the intersection criterion makes SED evaluation more robust to labelling subjectivity. Strong annotation (i.e., annotating the onset and offset of each sound event) depends on the annotators’ perception of where an event starts and ends. The intersection criterion deals with such issues by better accounting for merged detections (i.e., a single detection spanning multiple annotations) and interrupted ones (i.e., multiple detections spanning a single annotation) in the detection scores.
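The following sketch illustrates the spirit of an intersection-based check. The function names, the 50% threshold and the event times are assumptions for illustration only, not the exact PSDS definition (which applies paired tolerance criteria to both detections and ground truths). The key point is that overlap is summed across detections, so an interrupted detection still validates the reference event:

```python
def overlap(a, b):
    """Length of the intersection of two (onset, offset) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def ground_truth_covered(ref, detections, min_ratio=0.5):
    """Intersection-based check (simplified): the reference event counts as
    detected if the summed overlap with all detections covers at least
    `min_ratio` of its duration. Summing over detections means a detection
    broken into fragments still validates the annotation."""
    covered = sum(overlap(ref, d) for d in detections)
    return covered / (ref[1] - ref[0]) >= min_ratio

# One 6-second reference annotation detected as two fragments:
ref = (2.0, 8.0)
dets = [(2.1, 4.5), (5.0, 7.8)]
print(ground_truth_covered(ref, dets))  # True: 5.2 s of 6 s covered
```

A collar-based criterion would count the second fragment above as a false positive; the intersection view recognises that, together, the fragments cover most of the annotated event.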
Also, conventional SED metrics, such as the F1-score, depend on the choice of an operating point. This is not ideal for assessing the acoustic modelling power of a system. Indeed, a machine learning model can be described along two main axes of variability: how well it models the data it processes (such as the acoustics) and how reactive or sensitive it is. The quality of the system is mostly about the first, whereas the second can be tuned to the needs of the application. An evaluation that decouples the two aspects prevents one from erroneously concluding that one system models sounds better than another when, in practice, the difference stems only from a different choice of sensitivity tuning.
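A toy example makes the operating-point dependence visible. With made-up scores and labels, the very same model evaluated at two different decision thresholds yields quite different F1-scores, even though its underlying acoustic modelling has not changed at all:

```python
def f1_at_threshold(scores, labels, thr):
    """F1-score of thresholded predictions against boolean labels."""
    preds = [s >= thr for s in scores]
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(l and not p for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative model scores and ground-truth labels (made up):
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]
labels = [True, True, False, True, True, False, False]
print(f1_at_threshold(scores, labels, 0.5))   # sensitive tuning
print(f1_at_threshold(scores, labels, 0.75))  # conservative tuning
```

Ranking systems by a single-threshold F1-score therefore partly ranks their tuning, not just their modelling capacity.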
Looking at the PSD-ROC (Polyphonic Sound Detection Receiver Operating Characteristic) curves of the systems submitted by each team (Figure 2), we found that they closely overlap; hence, the PSDS values differ only marginally for most teams. In one case (Team 2), the PSD-ROC and the PSDS value are significantly different, because the team submitted two different model architectures with different acoustic modelling capacities. For the other teams, the PSD-ROC curves and PSDS values confirm that the underlying acoustic modelling capacity of the proposed systems does not vary significantly.
To recap, PSDS is a metric that summarises SED system performance across several operating points. This allows systems to be evaluated or ranked regardless of user-related specifications of error costs and trade-offs. As a result, model comparison is driven by the best overall sound event modelling capacity. The F1-score, by comparison, depends on a specific system tuning, which produces different rankings for different application scenarios (see Figure 3 from the paper below) and makes it challenging to identify the best modelling architecture overall.
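The idea of summarising performance across operating points can be sketched as a normalised area under a curve of true-positive rate against false positives per hour. This is a deliberately simplified illustration in the spirit of PSDS, not its actual formula (the real score is defined over an effective, cross-class-penalised TPR); the curve points and the maximum false-positive rate below are made-up values:

```python
def area_across_operating_points(points, e_max=100.0):
    """Trapezoidal area under a (false positives per hour, TPR) curve,
    normalised by e_max so that a perfect system would score 1.0.
    `points` is a list of (fp_per_hour, tpr) operating points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (y0 + y1) / 2.0 * (x1 - x0)
    return area / e_max

# Operating points from sweeping a detection threshold (illustrative):
curve = [(0.0, 0.0), (10.0, 0.55), (30.0, 0.70), (60.0, 0.80), (100.0, 0.85)]
print(round(area_across_operating_points(curve), 4))  # 0.7075
```

Because the whole curve contributes, a system cannot improve its score merely by moving along the curve with a different sensitivity tuning; only genuinely better modelling lifts the curve itself.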
We look forward to sharing our insights with peers at this year’s ICASSP conference and discussing how PSDS can be used to support the development of better sound recognition models.
Like this? You can subscribe to our blog and receive an alert every time we publish an announcement, a comment on the industry or something more technical.
About Audio Analytic
Audio Analytic is the pioneer of AI sound recognition technology. The company is on a mission to give machines a compact sense of hearing. This empowers them with the ability to react to the world around us, helping satisfy our entertainment, safety, security, wellbeing, convenience, and communication needs across a huge range of consumer products.
Audio Analytic’s ai3™ and ai3-nano™ sound recognition software enables device manufacturers to equip products at the edge with the ability to recognise and automatically respond to our growing list of sounds and acoustic scenes.