Synthetic Speech Evaluation: Quality and Context

Seebauer FM (2026)
Bielefeld: Universität Bielefeld.

Bielefeld E-Dissertation | English
 
Abstract
As speech synthesis progresses towards more sophisticated technologies, it has become unclear what the actual target properties of a perfect synthesis system are. Given that there is no gold standard for any given utterance and that synthetic speech seems to have reached a ceiling in traditional evaluation paradigms, it seems reasonable to re-examine what it means for synthetic speech to be of good quality. This thesis addresses the question from multiple viewpoints. A definition of synthetic speech quality derived from the literature shows that the main constituents of a quality estimate are the listener, the context in which the evaluation takes place, and the system’s acoustic properties, not all of which are usually controlled for in synthetic speech evaluation. An investigation of the sub-components of this overarching construct, combining empirical methods with existing literature, finds that the most stable sub-dimensions are Naturalness, concerning perceived artificiality; Intelligibility, pertaining to the ease of comprehension; Audio quality, which queries features of the auxiliary sound properties; and Pleasantness, capturing a hedonic judgement of friendliness and perceived warmth. This yields a set of global features in a neutral context, which can be extended and verified for specific applications. Moving to the field of evaluation, a taxonomy derived by classifying synthesis evaluation methodologies according to their properties yields the following dichotomies: instrumental versus human evaluation, the type of human quality percept (behavioural, physiological, interpretative), the target dimension of quality, and the time of quality elicitation (online or offline). Four further aspects are derived that seem to behave in a more quantitative manner: validity, reliability, cost and expressivity.
Based on the distribution of existing evaluation techniques along these dimensions, two potential gaps in methodology are explored further. It is found that annotation paradigms that promise to offer more expressive quality estimates seem to be reasonably reliable, and that instrumental evaluation could benefit greatly from becoming more expressive, although the investigated method of probing an automated quality predictor did not produce interpretable results. Finally, it is confirmed that both idiosyncratic listener effects and the application context influence the outcome of a listening test in some of the quality aspects. This final finding highlights the need for more application-oriented testing. The use of virtual reality is then proposed to address this new requirement for synthetic speech evaluation. A joint assessment of the same application scenarios, carried out in both virtual reality and real in-person testing, showed promising results, with distributionally equivalent ratings between different synthesis systems.
Year
2026
Page(s)
185
Page URI
https://pub.uni-bielefeld.de/record/3016370

Cite

Seebauer, F. M. (2026). Synthetic Speech Evaluation: Quality and Context. Bielefeld: Universität Bielefeld. https://doi.org/10.4119/unibi/3016370
All files are available under the following license(s):
Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0)
Full text(s)
Access Level
OA Open Access
Last Uploaded
2026-05-05T13:44:20Z
MD5 Checksum
040c32338997d0a21e0609d59f89730e

