Modeling Hesitations. Speech Synthesis Application and Evaluation
DOI:
https://doi.org/10.17469/O2109AISV000010
Keywords:
disfluency, pauses, speech synthesis, Deep Neural Network, perception
Abstract
Studies have shown that elements such as silent pauses, segmental lengthenings, and fillers are a natural part of the economy of speech and, in specific patterns, may contribute to communication in both human-human and human-machine interactions. Research on speech synthesis has therefore aimed to develop more natural-sounding systems by inserting hesitation phenomena. However, audio issues were found to arise when synthesizing filled pauses, and only recently have speech synthesizers based on Deep Neural Networks achieved better performance. In this study, we provide a first perceptual evaluation of a model of the occurrence of hesitations (lengthenings, silent pauses, and fillers) in Italian utterances using a state-of-the-art neural TTS system. A set of experimental stimuli was synthesized and submitted to listeners’ evaluation in a discrimination test. Results show that synthetic utterances that include hesitations placed according to the linguistic model are judged as more natural-sounding than utterances that do not include any.
License
Copyright (c) 2022 AISV - Associazione Italiana di Scienze della Voce [Italian Association for Speech Sciences]
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.