Modeling Hesitations. Speech Synthesis Application and Evaluation

Loredana Schettino; Antonio Origlia; Giacomo Matrone

doi:10.17469/O2109AISV000010

Authors

Loredana Schettino, Centro Interdipartimentale di Ricerca Urban/Eco, Università degli Studi di Napoli Federico II, Italia; Dipartimento di Studi Umanistici, Università degli Studi di Salerno, Italia. https://orcid.org/0000-0002-3788-3754
Antonio Origlia, Dipartimento di Ingegneria Elettrica e delle Tecnologie dell’Informazione, Università degli Studi di Napoli Federico II, Italia. https://orcid.org/0000-0002-8635-1623
Giacomo Matrone, Dipartimento di Ingegneria Elettrica e delle Tecnologie dell’Informazione, Università degli Studi di Napoli Federico II, Italia. https://orcid.org/0009-0001-5318-7249

Keywords:

disfluency, pauses, speech synthesis, Deep Neural Network, perception

Abstract

Studies have shown that elements such as silent pauses, segmental lengthenings, and fillers are naturally involved in the economy of speech and, in specific patterns, may contribute to communication in both human-human and human-machine interactions. Research on speech synthesis has therefore aimed to develop more natural-sounding systems by inserting hesitation phenomena. However, audio issues were found to arise when synthesising filled pauses, and only recently have speech synthesisers based on Deep Neural Networks achieved better performance. In this study, we provide a first perceptual evaluation of a model of the occurrence of hesitations (lengthenings, silent pauses, and fillers) in Italian utterances using a state-of-the-art neural TTS system. A set of experimental stimuli was synthesised and submitted to listeners’ evaluation in a discrimination test. Results show that synthetic utterances that include hesitations placed according to the linguistic model are judged as more natural-sounding than utterances that do not include any.