Forensic Automatic Speaker Recognition based on Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network

Francesco Sigona; Giuseppe Vitolo; Mirko Grimaldi

doi:10.17469/O2111AISV000026

Authors

Francesco Sigona Laboratory CRIL (Centro di Ricerca Interdisciplinare sul Linguaggio) & DReaM, Department of Humanities, University of Salento, Lecce, Italy https://orcid.org/0000-0003-2939-0009
Giuseppe Vitolo Department of Engineering for Innovation, University of Salento, Italy
Mirko Grimaldi Centro di Ricerca Interdisciplinare sul Linguaggio (CRIL), Department of Humanities, University of Salento, Italy https://orcid.org/0000-0002-0940-3645

Keywords:

forensic linguistics, voice comparison, speaker verification, speaker recognition

Abstract

In automatic forensic voice comparison (FVC), neural network models are increasingly used in the processing chain. Recently, the Emphasized Channel Attention, Propagation and Aggregation in Time Delay Neural Network (ECAPA-TDNN) has demonstrated high discrimination and accuracy in non-forensic speaker verification. This contribution first illustrates the fundamental differences between the forensic and non-forensic speaker verification tasks from a linguistic, logical and methodological perspective. It then evaluates the performance of a software implementation of automatic FVC based on ECAPA-TDNN under different simulated operating conditions (noise level, net speech duration). Preliminary results confirm excellent performance in non-critical operating conditions. They also preliminarily confirm the hypothesized performance trend: as the duration of the speech samples increases and the noise level decreases, the evaluation metrics improve at a rate that depends on the combination of these two factors.
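
The abstract does not describe how the operating conditions were simulated. As a minimal sketch only, assuming additive noise scaled to a target signal-to-noise ratio (SNR) and plain truncation as a stand-in for net speech duration, a grid of degraded samples could be built as follows. The function names, the SNR values and the durations are all hypothetical and are not taken from the paper.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # Target noise power follows from SNR_dB = 10 * log10(P_speech / P_noise).
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    return [s + gain * n for s, n in zip(speech, noise)]


def truncate(samples, sample_rate, seconds):
    """Keep the first `seconds` of signal (assumed proxy for net speech duration)."""
    return samples[:int(round(sample_rate * seconds))]


def make_conditions(speech, noise, sample_rate, snrs_db, durations_s):
    """Build one degraded sample per (SNR, duration) combination."""
    grid = {}
    for snr_db in snrs_db:
        for seconds in durations_s:
            clip = truncate(speech, sample_rate, seconds)
            grid[(snr_db, seconds)] = mix_at_snr(clip, noise, snr_db)
    return grid


# Illustrative usage with synthetic signals (a sine tone and Gaussian noise).
rng = random.Random(0)
rate = 8000
speech = [math.sin(2 * math.pi * 220 * t / rate) for t in range(2 * rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(2 * rate)]
conditions = make_conditions(speech, noise, rate, [0, 10, 20], [0.5, 1.0, 2.0])
```

Each entry of `conditions` would then be passed to the speaker comparison system, so that the evaluation metrics can be tabulated per (SNR, duration) pair. A real study would measure net speech with voice activity detection rather than by truncation.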