On the Role of Dialogue Context in Predicting Speaking Style

Vincenzo Scotti; Roberto Tedesco

doi:10.17469/O2111AISV000020

Authors

Vincenzo Scotti, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Italy, https://orcid.org/0000-0002-8765-604X
Roberto Tedesco, Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Italy, https://orcid.org/0000-0002-2830-4247

DOI:

https://doi.org/10.17469/O2111AISV000020

Keywords:

Natural Language Processing, Deep learning, Expressive speech synthesis, Global Style Token

Abstract

Text-to-Speech (TTS) synthesis is a problem almost as old as Natural Language Processing (NLP) itself. Its goal is to build tools capable of generating a voice that utters a given text. Deep learning-powered solutions go beyond mere speech generation from text: newer models factorise the probability modelled by these generative systems so that generation can be conditioned on aspects such as the speaker's voice, speaking style, or prosody. In this work, we focused on predicting the speaking style of a given text within a conversation, with application to conversational agents and chatbots. To this end, we developed and trained a neural network module that acts as a connector between the textual component of a chatbot (i.e., the neural language model for dialogue understanding and generation) and its speech synthesis component (i.e., the neural TTS synthesis model with speaking style conditioning).