Dimensions of quality for state of the art synthetic speech
Seebauer FM, Wagner P (2022)
In: Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum. Bruggeman A, Ludusan B, Universität Bielefeld (Eds); Bielefeld: Universität Bielefeld.
Conference paper
| Published | English
Download
PP_2022_paper_0145-4.pdf
245.33 KB
Editors
Bruggeman, Anna;
Ludusan, Bogdan
Issuing body
Universität Bielefeld
Abstract / Note
Synthetic speech has a long-standing tradition of being employed for experiments in phonetics
and laboratory phonology. The choice of synthesis method and system is commonly made by the
researcher(s) to fit the specific quality criteria and study design. The overall quality of a given
system, however, remains a confound that is difficult to control for [1]. In speech technology,
newly proposed systems are usually compared across specific dimensions, e.g., ‘Intelligibility’ and
‘Naturalness’. These dimensions have already been extensively studied and evaluated within the
context of old diphone and formant synthesis networks [2]. We contend, however, that these
traditional dimensions need to be re-examined in the context of state-of-the-art text-to-speech (TTS)
systems, as those newer models exhibit different quality deteriorations. Our work aims to reconcile
the conflicting demands for quality criteria that are easily computed and applied during TTS
development, while at the same time remaining descriptive and meaningful for phonetic research.
As a first step in this endeavor, we carried out an experiment to find suitable dimensions of TTS
quality with a bottom-up approach based on descriptions provided by 11 participants (phonetic
experts). The participants were instructed to label speech samples generated by 8 different
state-of-the-art text-to-speech systems (varieties of English). Each system produced a stimulus
consisting of two sentences of the phonetically balanced ‘caterpillar story’ [3]. To ensure that all
systems were evaluated across different phonetic contexts in a balanced way, the sentences were
rotated between participants so that each participant heard the complete story but with different
parts read by different systems. The experimental setup is loosely based on the work in [4]. The
participants were instructed to write down nouns, adjectives or sentences describing the quality
of a given stimulus. Using embeddings generated by a pretrained BERT model [5] to compute semantic
distances, we determined which of the participants’ terms were semantically similar. A subsequent
affinity propagation clustering yielded 39 meaningfully different clusters, each
representing a dimension of quality for synthetic voices. Because these dimensions are
later to be used for ratings in actual evaluation experiments, we reduced the number of
clusters to a more practical 10 and recomputed the clusters using spectral clustering with a
precomputed cosine affinity matrix. The resulting clusters and their respective quality
descriptions are depicted in fig. 1. A manual analysis of the resulting dimensions led to the following
descriptive labels: ‘artificiality/voice quality’, ‘intonation/noise/prosody’, ‘voice/audio quality’,
‘audio cuts’, ‘style/recording quality’, ‘emotion/voice quality/attitude’, ‘engagedness’, ‘human
likeness’, ‘hyperarticulation’. From the assigned cluster descriptions it is evident that the
semantic embeddings sometimes conflated several seemingly unrelated quality features into single
dimensions (e.g. prosody and background noise), while occasionally splitting almost synonymous
terms into multiple clusters (e.g. ‘artificiality’, ‘roboticness’ and ‘metallicness’). To evaluate these
shortcomings of the semantic model, two independent manual clusterings were carried out. Both
were limited to 10 clusters; they reached a modified Jaccard agreement index of 63.44 with each
other, and of 54.48 and 57.93, respectively, with the automatically computed clusters. The low
interrater agreement between the manual clusterings suggests that a panel decision process might be
needed to determine the final quality dimensions. Subsequent research will evaluate clusters
created by naïve listeners and quality dimensions of different sub-tasks in synthetic speech, such as
voice conversion.
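Two steps described in the abstract, the rotation of story parts across participants and the comparison of two clusterings, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `rotate_assignments` and `pair_jaccard` are hypothetical, the rotation is assumed to be a simple cyclic (Latin-square style) shift over as many story parts as there are systems, and the index shown is the standard pair-counting Jaccard index, used here only as a stand-in because the abstract does not specify how the "modified" Jaccard index was defined.

```python
from itertools import combinations


def rotate_assignments(n_systems: int, n_participants: int) -> list[list[int]]:
    """Assumed cyclic rotation: row p lists, per story part, the system reading it.

    Each row is a permutation of the systems, so every participant hears the
    whole story with each part read by a different system.
    """
    return [
        [(part + p) % n_systems for part in range(n_systems)]
        for p in range(n_participants)
    ]


def pair_jaccard(labels_a: list[int], labels_b: list[int]) -> float:
    """Pair-counting Jaccard index (in percent) between two clusterings.

    Counts item pairs placed in the same cluster by both clusterings and
    divides by the pairs placed together by at least one of them.
    This is a stand-in, not the paper's unspecified "modified" variant.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    # Two clusterings made only of singletons agree trivially.
    return 100.0 * both / either if either else 100.0


if __name__ == "__main__":
    plan = rotate_assignments(n_systems=8, n_participants=11)
    print(plan[0])  # systems heard by the first participant, in story order
    print(pair_jaccard([0, 0, 1, 1], [0, 0, 1, 2]))  # 50.0
```

Because the index only compares which items are grouped together, it does not depend on how the clusters are numbered, which is convenient when comparing manual and automatic clusterings of the same terms.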
Keywords
speech synthesis;
speech technology;
voice dimensions;
biphonetics
Year of publication
2022
Title of the proceedings
Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum
Conference
18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum
Conference location
Bielefeld
Conference date
2022-10-06 – 2022-10-07
Page URI
https://pub.uni-bielefeld.de/record/2967012
Cite
Seebauer FM, Wagner P. Dimensions of quality for state of the art synthetic speech. In: Bruggeman A, Ludusan B, Universität Bielefeld, eds. Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum. Bielefeld: Universität Bielefeld; 2022.
Seebauer, F. M., & Wagner, P. (2022). Dimensions of quality for state of the art synthetic speech. In A. Bruggeman, B. Ludusan, & Universität Bielefeld (Eds.), Berichtsband der 18. Konferenz für Phonetik und Phonologie im deutschsprachigen Raum. Bielefeld: Universität Bielefeld. https://doi.org/10.11576/pundp2022-1016
All files available under the following license(s):
Creative Commons Attribution 4.0 International Public License (CC BY 4.0)
Full text(s)
Name
PP_2022_paper_0145-4.pdf
245.33 KB
Access Level
Open Access
Last uploaded
2022-11-15T09:53:13Z
MD5 checksum
066a899533d3f20b3f4dce53c8073f33