On the Role of Dialogue Context in Predicting Speaking Style

Authors

DOI:

https://doi.org/10.17469/O2111AISV000020

Keywords:

Natural Language Processing, Deep learning, Expressive speech synthesis, Global Style Token

Abstract

Text-to-Speech (TTS) synthesis is a problem almost as old as Natural Language Processing (NLP) itself. Its goal is to build tools that generate a voice uttering a given text. Deep learning-based solutions try to go beyond mere speech generation from text: newer models factorise the probability distribution learned by the generative model so that it can be conditioned on aspects such as the speaker's voice, speaking style, or prosody. In this work, we focused on predicting the speaking style of a given text within a conversation, with application to conversational agents and chatbots. To this end, we developed and trained a neural network module that acts as a connector between the textual component of a chatbot (i.e., the neural language model for dialogue understanding and generation) and its speech synthesis component (i.e., the neural TTS synthesis model with speaking style conditioning).
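
The abstract does not describe the connector's internals, so the following is only a minimal sketch of one way such a module could work, loosely inspired by the Global Style Token idea named in the keywords. It assumes the dialogue language model exposes a fixed-size context vector and that the TTS model accepts a style embedding formed as a weighted sum of learned style tokens. All names, dimensions, and the random "weights" here are hypothetical and are not taken from the article.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_style_embedding(context_vec, projection, style_tokens):
    """Map a dialogue-context vector to a style embedding.

    context_vec  : hypothetical output of the dialogue language model
    projection   : hypothetical connector weights, one row per style token
    style_tokens : hypothetical bank of style token embeddings
    """
    # One score per style token: dot product of its projection row
    # with the context vector.
    scores = [sum(w * x for w, x in zip(row, context_vec)) for row in projection]
    # Normalise the scores into weights that sum to one.
    weights = softmax(scores)
    # The style embedding is the weighted sum of the style tokens.
    dim = len(style_tokens[0])
    style = [
        sum(wt * tok[d] for wt, tok in zip(weights, style_tokens))
        for d in range(dim)
    ]
    return weights, style


# Toy, randomly initialised values standing in for trained parameters.
rng = random.Random(0)
CONTEXT_DIM, NUM_TOKENS, STYLE_DIM = 8, 4, 6
context_vec = [rng.uniform(-1, 1) for _ in range(CONTEXT_DIM)]
projection = [
    [rng.uniform(-1, 1) for _ in range(CONTEXT_DIM)] for _ in range(NUM_TOKENS)
]
style_tokens = [
    [rng.uniform(-1, 1) for _ in range(STYLE_DIM)] for _ in range(NUM_TOKENS)
]

weights, style = predict_style_embedding(context_vec, projection, style_tokens)
```

In this sketch the connector is a single linear layer followed by a softmax. A trained module would learn the projection from data, and the resulting style embedding would be passed to the TTS model as its style conditioning input.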

Published

29-12-2023
