Synthetic Speech Evaluation: Quality and Context
Seebauer FM (2026)
Bielefeld: Universität Bielefeld.
Bielefeld E-Dissertation | English
Author
Reviewer / Supervisor
Abstract / Note
As speech synthesis progresses towards more sophisticated technologies, it has become unclear
what the actual target properties of a perfect synthesis system are. Given that there is no gold
standard for any given utterance and that synthetic speech seems to have reached a ceiling
in traditional evaluation paradigms, it seems reasonable to re-examine what it even means for
synthetic speech to be of good quality. This thesis addresses the question from multiple
viewpoints. A definition of synthetic speech quality derived from the literature shows that the
main constituents of a quality estimation are the listener, the context in which the evaluation
takes place, and the system’s acoustic properties, not all of which are usually controlled for
when carrying out synthetic speech evaluation. Investigating the sub-components of this
overarching construct using empirical methods and relating them to existing literature, the
thesis finds that the most stable sub-dimensions are Naturalness, concerning perceived
artificiality; Intelligibility, pertaining to the ease of comprehension; Audio quality, which
queries features of the auxiliary sound properties; and Pleasantness, capturing a hedonic
judgement of friendliness and perceived warmth. This yields a set of global features in a neutral
context, which can be extended and verified for specific applications. Moving into the field of
evaluation, a taxonomy derived by classifying synthesis evaluation methodologies based on their
properties yields the following dichotomies: instrumental versus human evaluation, the type of
human quality percept (behavioural, physiological, interpretative), the target dimension of
quality, and the time of quality elicitation (online or offline). Four further aspects are
derived that seem to behave in a more quantitative manner: validity, reliability, cost, and
expressivity. After examining the distribution of existing evaluation techniques along these
dimensions, the thesis explores two potential gaps in methodology further. It is found that
annotation paradigms which promise to offer more expressive quality estimates seem to be
reasonably reliable, and that instrumental evaluation could benefit greatly from becoming more
expressive, although the investigated method of probing an automated quality predictor did not
produce interpretable results. Finally, it is confirmed that both idiosyncratic listener effects
and the application context influence the outcome of a listening test in some of the quality
aspects. This final finding highlights the need for more application-oriented testing. A proposal
is then put forward for the use of virtual reality to meet this new requirement for synthetic
speech evaluation. A joint assessment of the same application scenarios, carried out both in
virtual reality and in real in-person testing, showed promising results, with distributionally
equivalent ratings between different synthesis systems.
Year
2026
Page(s)
185
Copyright / Licenses
Page URI
https://pub.uni-bielefeld.de/record/3016370
Cite
Seebauer FM. Synthetic Speech Evaluation: Quality and Context. Bielefeld: Universität Bielefeld; 2026.
Seebauer, F. M. (2026). Synthetic Speech Evaluation: Quality and Context. Bielefeld: Universität Bielefeld. https://doi.org/10.4119/unibi/3016370
Seebauer, Fritz Michael. 2026. Synthetic Speech Evaluation: Quality and Context. Bielefeld: Universität Bielefeld.
Seebauer, F. M. (2026). Synthetic Speech Evaluation: Quality and Context. Bielefeld: Universität Bielefeld.
Seebauer, F.M., 2026. Synthetic Speech Evaluation: Quality and Context, Bielefeld: Universität Bielefeld.
F.M. Seebauer, Synthetic Speech Evaluation: Quality and Context, Bielefeld: Universität Bielefeld, 2026.
Seebauer, F.M.: Synthetic Speech Evaluation: Quality and Context. Universität Bielefeld, Bielefeld (2026).
Seebauer, Fritz Michael. Synthetic Speech Evaluation: Quality and Context. Bielefeld: Universität Bielefeld, 2026.
All files available under the following license(s):
Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0)
Full text(s)
Name
Access Level
Open Access
Last Uploaded
2026-05-05T13:44:20Z
MD5 Checksum
040c32338997d0a21e0609d59f89730e
