Dimensions of quality for state of the art synthetic speech

Seebauer FM, Wagner P (2022)
In: Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum . Bruggeman A, Ludusan B, Universität Bielefeld (Eds); Bielefeld: Universität Bielefeld.

Conference paper | Published | English
 
Editor
Bruggeman, Anna; Ludusan, Bogdan
Issuing body
Universität Bielefeld
Abstract / Note
Synthetic speech has a long-standing tradition of being employed in experiments in phonetics and laboratory phonology. The choice of synthesis method and system is commonly made by the researcher(s) to fit the specific quality criteria and study design. The overall quality of a given system, however, remains a confound that is difficult to control for [1]. In speech technology, newly proposed systems are usually compared along specific dimensions, e.g., ‘Intelligibility’ and ‘Naturalness’. These dimensions have already been extensively studied and evaluated within the context of old diphone and formant synthesis networks [2]. We contend, however, that these traditional dimensions need to be re-examined in the context of state-of-the-art text-to-speech (TTS) systems, as those newer models exhibit different quality deteriorations. Our work aims to bridge the conflicting demands for quality criteria that are easily computed and applied during TTS development, while at the same time remaining descriptive and meaningful for phonetic research. As a first step in this endeavor, we carried out an experiment to find suitable dimensions of TTS quality with a bottom-up approach based on descriptions provided by 11 participants (phonetic experts). The participants were instructed to label speech samples generated by 8 different state-of-the-art text-to-speech systems (varieties of English). Each system produced a stimulus consisting of two sentences of the phonetically balanced ‘caterpillar story’ [3]. To ensure that all systems were evaluated across different phonetic contexts in a balanced way, the sentences were rotated between participants so that each participant heard the complete story, but with different parts read by different systems. The experimental setup is loosely based on the work in [4]. The participants were instructed to write down nouns, adjectives or sentences describing the quality of a given stimulus.
Using embeddings generated by a pretrained BERT model [5] for semantic distances, we determined which of the participants' terms were semantically similar. A subsequent affinity propagation clustering yielded 39 meaningfully different clusters, each representing a dimension of quality for synthetic voices. Keeping in mind that these dimensions are later to be used for ratings in actual evaluation experiments, it was decided to reduce the number of clusters to a more practical 10 and to recalculate the clustering as a spectral clustering with a precomputed cosine affinity matrix. The resulting clusters and their respective quality descriptions are depicted in fig. 1. A manual analysis of the resulting dimensions led to the following descriptive labels: ‘artificiality/voice quality’, ‘intonation/noise/prosody’, ‘voice/audio quality’, ‘audio cuts’, ‘style/recording quality’, ‘emotion/voice quality/attitude’, ‘engagedness’, ‘human likeness’, ‘hyperarticulation’. From the assigned cluster descriptions it is evident that the semantic embeddings sometimes conflated several seemingly unrelated quality features into single dimensions (e.g. prosody and background noise), while occasionally splitting almost synonymous terms into multiple clusters (e.g. ‘artificiality’, ‘roboticness’ and ‘metallicness’). To evaluate these shortcomings of the semantic model, two independent manual clusterings were carried out. Both were limited to 10 clusters and reached a modified Jaccard agreement index of 63.44 with each other, while agreeing with the automatically computed clusters at 54.48 and 57.93, respectively. The low interrater agreement between the manual clusterings suggests that a panel decision process might be needed to determine the final quality dimensions. Subsequent research will evaluate clusters created by naïve listeners and quality dimensions of different sub-tasks in synthetic speech, such as voice conversion.
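The two computational steps named in the abstract, a precomputed cosine affinity matrix and an agreement score between clusterings, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy two-dimensional vectors stand in for BERT embeddings, the function names `cosine_affinity` and `pair_jaccard` are hypothetical, and the plain pair-counting Jaccard index stands in for the "modified Jaccard agreement index", whose exact definition the abstract does not give.

```python
import math
from itertools import combinations


def cosine_affinity(vectors):
    """Pairwise cosine similarity, clipped at 0 so it can serve as an affinity.

    Assumes every vector is non-zero; a zero vector would divide by zero.
    """
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            matrix[i][j] = max(0.0, dot / (norms[i] * norms[j]))
    return matrix


def pair_jaccard(labels_a, labels_b):
    """Pair-counting Jaccard index between two clusterings of the same items.

    Counts item pairs placed in the same cluster by both clusterings and
    divides by the pairs placed together by at least one of them.
    NOTE: a stand-in, not the paper's (unspecified) modified index.
    """
    same_both = 0
    same_either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        same_both += together_a and together_b
        same_either += together_a or together_b
    return same_both / same_either if same_either else 1.0


# Toy stand-ins for embeddings of four descriptive terms (hypothetical values).
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
affinity = cosine_affinity(toy_vectors)

# Two hypothetical manual clusterings of the same four terms.
rater_1 = [0, 0, 1, 1]
rater_2 = [0, 0, 0, 1]
print(pair_jaccard(rater_1, rater_2))  # 0.25
```

A matrix of this shape is what a spectral clustering routine that accepts a precomputed affinity would take as input, for example scikit-learn's `SpectralClustering(n_clusters=10, affinity="precomputed")`. Whether the authors used that library is not stated in the abstract.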
Keywords
speech synthesis; speech technology; voice dimensions; biphonetics
Year of publication
2022
Title of conference proceedings
Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum
Conference
18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum
Conference location
Bielefeld
Conference date
2022-10-06 – 2022-10-07
Page URI
https://pub.uni-bielefeld.de/record/2967012

Cite

Seebauer FM, Wagner P. Dimensions of quality for state of the art synthetic speech. In: Bruggeman A, Ludusan B, Universität Bielefeld, eds. Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum. Bielefeld: Universität Bielefeld; 2022.
Seebauer, F. M., & Wagner, P. (2022). Dimensions of quality for state of the art synthetic speech. In A. Bruggeman, B. Ludusan, & Universität Bielefeld (Eds.), Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum. Bielefeld: Universität Bielefeld. https://doi.org/10.11576/pundp2022-1016
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC-BY 4.0)
Full text(s)
Access Level
OA Open Access
Last uploaded
2022-11-15T09:53:13Z
MD5 checksum
066a899533d3f20b3f4dce53c8073f33

