For the second year running we will be presenting a paper at ICASSP (IEEE International Conference on Acoustics, Speech and Signal Processing), the world’s largest conference on signal processing and its applications. The event will take place virtually between June 6th and June 12th.

At last year’s ICASSP, we introduced our Polyphonic Sound Detection Score (PSDS) as a more sophisticated metric for analysing sound event detection (SED) systems. It was subsequently adopted by the annual DCASE Challenge organisers alongside the existing F1-score metric for Task 4. This year, our paper, titled Improving Sound Event Detection Metrics: Insights from DCASE 2020 and written in collaboration with Nicolas Turpault and Romain Serizel at Université de Lorraine, presents an in-depth analysis of how PSDS can uniquely inform users about SED system performance in comparison with conventional metrics.

The insights presented in the paper further support the universal adoption of PSDS within academic and industrial research teams. We also hope to see PSDS adopted by the DCASE Challenge organisers as the primary metric for the 2021 Task 4 challenge, so that participants can submit their best overall system instead of several slightly differently tuned versions.

You can read this year’s paper here.

What makes PSDS an improvement over the conventional metrics?

For more details about the PSDS formulation, please see our previous blog post.

Overall performance metrics, such as the F1-score, are typically computed from the output of two well-known event validation criteria borrowed by the SED community from other fields of research. These criteria are collar-based and segment-based, and they identify the number of correct detections (i.e., true positives) and incorrect ones (i.e., false positives). The collar-based criterion treats events of different durations with different levels of strictness, over-penalising long sounds with respect to short ones. The segment-based criterion, on the other hand, lacks precision and may be application-dependent, because the chosen segment size can favour sounds whose duration is comparable to that of the reference segment.
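To make these two criteria concrete, here is a minimal Python sketch of how a single detection, represented as an (onset, offset) pair in seconds, could be validated under each of them. The function names, the 200 ms collar and the 1 s segment size are illustrative choices of ours rather than the exact parameters used in the paper.

```python
def collar_match(detection, reference, collar=0.2):
    """Collar-based criterion: the detected onset and offset must each fall
    within +/- `collar` seconds of the reference boundaries. The same
    absolute tolerance applies to a 0.5 s beep and a 30 s siren, which is
    why long sounds end up penalised more harshly than short ones."""
    det_on, det_off = detection
    ref_on, ref_off = reference
    return abs(det_on - ref_on) <= collar and abs(det_off - ref_off) <= collar


def segment_hits(detection, reference, segment=1.0, duration=10.0):
    """Segment-based criterion: time is divided into fixed-size segments,
    and a segment counts as a true positive when both the detection and the
    reference are active within it. The outcome depends on the chosen
    segment size, which is why the criterion can be application-dependent."""
    hits = 0
    for i in range(int(duration / segment)):
        seg_start, seg_end = i * segment, (i + 1) * segment
        det_active = detection[0] < seg_end and detection[1] > seg_start
        ref_active = reference[0] < seg_end and reference[1] > seg_start
        hits += int(det_active and ref_active)
    return hits
```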

In our study, we compare these approaches against the validation criterion introduced by PSDS: intersections, i.e., the amount of overlap between system outputs and the reference annotations. We show that it is more flexible, since it can easily be adjusted to enforce levels of constraint similar to either the segment-based or collar-based criteria (see Figure 1 from the paper below). Unlike the conventional criteria, the intersection-based logic adapts by design to any sound event duration and treats short and long audio events fairly.

(Figure 1) Assessment of the F1-score for eight different systems across different evaluation criteria: segment-based (SB), intersection-based (IB-DTC=GTC*), collar-based (CB) and collar onset only (CB-O). * The intersection-based evaluation uses DTC (Detection Tolerance Criterion) = GTC (Ground Truth intersection Criterion) values ranging from 0.1 to 0.9. Please consult the paper for more information.
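As a rough illustration of the intersection-based logic, the sketch below validates a single detection against a single reference event; the full evaluation aggregates overlaps across all detections and annotations. The function and parameter names are ours, but dtc and gtc correspond to the DTC and GTC thresholds mentioned in the caption above.

```python
def interval_overlap(a, b):
    """Length in seconds of the intersection of two (onset, offset) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def intersection_match(detection, reference, dtc=0.5, gtc=0.5):
    """Intersection-based criterion (simplified to one detection and one
    reference): the detection is a true positive when a large enough share
    of it overlaps the reference (DTC) and a large enough share of the
    reference is covered by it (GTC). Because both tests are ratios of
    durations, short and long events get the same relative tolerance."""
    inter = interval_overlap(detection, reference)
    dtc_ratio = inter / (detection[1] - detection[0])
    gtc_ratio = inter / (reference[1] - reference[0])
    return dtc_ratio >= dtc and gtc_ratio >= gtc


# Low DTC/GTC values behave like a lenient segment-style check, while values
# close to 1.0 approach the strictness of a tight collar.
print(intersection_match((1.0, 9.0), (0.0, 10.0), dtc=0.5, gtc=0.5))  # True
```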

Furthermore, the intersection criterion makes SED evaluation more robust to labelling subjectivity. Strong annotation (i.e., the task of annotating the onset and offset of a sound event) is subject to each annotator’s perception of where an event starts and ends. The intersection criterion deals with such issues by better accounting for merged detections (i.e., a single detection spanning multiple annotations) and interrupted ones (i.e., multiple detections spanning a single annotation) in the detection scores.

Also, conventional SED metrics, such as the F1-score, depend on the choice of an operating point. This is not ideal for assessing the acoustic modelling power of a system. Indeed, a machine learning model can be described along two main axes of variability: how well it models the data it processes (here, the acoustics) and how reactive or sensitive it is. The quality of the system is mainly about the first, whereas the second can be tuned to the needs of the application. An evaluation that decouples the two aspects prevents us from erroneously concluding that one system is better than another at modelling sounds when, in practice, the difference comes only from a different choice of sensitivity tuning.
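As a hedged illustration of that decoupling, the sketch below sweeps a decision threshold over hypothetical per-event confidence scores: each threshold is one operating point, and a single F1-score freezes the comparison at just one of them.

```python
import numpy as np

def operating_points(scores, labels, thresholds):
    """Sweep the decision threshold over per-event confidence scores.
    Each threshold yields one (TP, FP) operating point; evaluating the
    whole sweep separates modelling quality from sensitivity tuning."""
    points = []
    for th in thresholds:
        detected = scores >= th
        tp = int(np.sum(detected & (labels == 1)))
        fp = int(np.sum(detected & (labels == 0)))
        points.append((th, tp, fp))
    return points

# Hypothetical confidence scores and ground-truth labels for six events.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0])
print(operating_points(scores, labels, thresholds=[0.3, 0.5, 0.75]))
# [(0.3, 3, 2), (0.5, 3, 1), (0.75, 2, 0)]
```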

Looking at the PSD-ROC (Receiver Operating Characteristic) curves of the systems submitted by each team (Figure 2), we found that they closely overlap; hence, the PSDS values change only marginally across most teams’ systems. In one case (Team 2), the PSD-ROC curves and PSDS values are significantly different, because the team submitted two different model architectures with different acoustic modelling capacities. For the other teams, the PSD-ROC curves and PSDS values confirm that the underlying acoustic modelling capacity of the proposed systems does not vary significantly.

(Figure 2) PSD-ROC curves of all the systems that each team has submitted. Note that two teams have submitted only a single system, so they are not considered here.

To recap, PSDS is a metric that summarises SED system performance across several operating points. This allows systems to be evaluated and ranked regardless of user-related specifications of error costs and trade-offs; as a result, model comparison is driven by the best overall sound event modelling capacity. By contrast, the F1-score depends on a specific system tuning, which produces different rankings in different application scenarios (see Figure 3 from the paper below) and makes it challenging to identify the best modelling architecture overall.

(Figure 3) PSD-ROC curves for each of the eight teams’ best systems. The team ranking varies at three different operating points across the effective false positive rate (eFPR) per hour (along the x-axis), which represents different tolerances to the number of false positives a user could potentially receive as a trade-off against the corresponding effective true positive rate (eTPR). Comparing systems across the whole PSD-ROC curves factors out this dependency and gives a complete picture of model performance.
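For readers who want the summary step in code, here is a minimal numpy sketch of the final computation: PSDS is, in essence, the normalised area under the PSD-ROC up to a maximum eFPR. It assumes the curve is already available as eFPR/eTPR arrays and omits the cross-trigger and cross-class variability penalties defined in the paper; the complete metric is specified in the paper and in our open-source psds_eval package.

```python
import numpy as np

def psds_from_curve(efpr, etpr, max_efpr=100.0):
    """Normalised area under a PSD-ROC curve, truncated at max_efpr
    (effective false positives per hour). A system that sustains a high
    eTPR across the whole eFPR range scores close to 1."""
    efpr = np.asarray(efpr, dtype=float)
    etpr = np.asarray(etpr, dtype=float)
    keep = efpr <= max_efpr
    return np.trapz(etpr[keep], efpr[keep]) / max_efpr

# Hypothetical curve: eTPR rises as more false positives per hour are tolerated.
print(psds_from_curve(efpr=[0.0, 10.0, 50.0, 100.0],
                      etpr=[0.0, 0.55, 0.75, 0.85]))  # ~0.69
```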

We look forward to sharing our insights with peers at this year’s ICASSP conference and discussing how PSDS can be used to support the development of better sound recognition models.

***** 

Like this? You can subscribe to our blog and receive an alert every time we publish an announcement, a comment on the industry or something more technical.

About Audio Analytic

Audio Analytic is the pioneer of AI sound recognition technology. The company is on a mission to give machines a compact sense of hearing. This empowers them with the ability to react to the world around us, helping satisfy our entertainment, safety, security, wellbeing, convenience, and communication needs across a huge range of consumer products.

Audio Analytic’s ai3™ and ai3-nano™ sound recognition software enables device manufacturers to equip products at the edge with the ability to recognize and automatically respond to our growing list of sounds and acoustic scenes.
