On the Role of Dialogue Context in Predicting Speaking Style
DOI:
https://doi.org/10.17469/O2111AISV000020
Keywords:
Natural Language Processing, Deep learning, Expressive speech synthesis, Global Style Token
Abstract
Text-to-Speech (TTS) synthesis is a problem almost as old as Natural Language Processing (NLP) itself. Its goal is to build tools capable of generating a voice uttering a given text. Deep learning-based solutions go beyond mere speech generation from text: newer models factorise the probability distribution learned by the generative model so that it can be conditioned on aspects such as the speaker's voice, speaking style, or prosody. In this work, we focused on predicting the speaking style of a given text within a conversation, with application to conversational agents and chatbots. To this end, we developed and trained a neural network module that acts as a connector between the chatbot's textual component (i.e., the neural language model for dialogue understanding and generation) and its speech synthesis component (i.e., the neural TTS synthesis model with speaking style conditioning).
Published
29-12-2023
Section
Articles
License
Copyright (c) 2023 AISV - Associazione Italiana di Scienze della Voce [Italian Association for Speech Sciences]

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.