Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion

Gburrek, Tobias; Ebbers, Janek; Häb-Umbach, Reinhold; Wagner, Petra

Gburrek T, Ebbers J, Häb-Umbach R, Wagner P (2019)
In: Proceedings of the 10th Speech Synthesis Workshop (SSW10).

Conference paper | English

Download

No files have been uploaded. Publication record only.

Authors

Gburrek, Tobias; Ebbers, Janek; Häb-Umbach, Reinhold; Wagner, Petra (UniBi)

Institution

Center of Excellence - Cognitive Interaction Technology CITEC
Faculty of Linguistics and Literary Studies > Department of Linguistics

Abstract / Note

This paper presents an approach to voice conversion that requires neither parallel data nor speaker or phone labels for training. It can convert between speakers that are not in the training set by employing the previously proposed concept of a factorized hierarchical variational autoencoder. Here, linguistic and speaker-induced variations are separated based on the notion that content-induced variations change at a much shorter time scale, i.e., at the segment level, than speaker-induced variations, which vary at the longer utterance level. In this contribution we propose to employ convolutional instead of recurrent network layers in the encoder and decoder blocks, which is shown to achieve better frame-level phone recognition accuracy on the latent segment variables due to their better temporal resolution. For voice conversion, the mean of the utterance variables is replaced with the respective estimated mean of the target speaker. The resulting log-mel spectra of the decoder output are used as local conditions of a WaveNet, which is utilized for synthesis of the speech waveforms. Experiments show both good disentanglement properties of the latent space variables and good voice conversion performance, as assessed both quantitatively and qualitatively.
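
The mean-replacement step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that "replacing the mean" amounts to shifting the utterance-level latent vectors so that their average equals the target speaker's estimated mean, and it uses random arrays in place of real encoder outputs. The function names `estimate_speaker_mean` and `convert_utterance_latents`, as well as the latent dimension of 16, are hypothetical.

```python
import numpy as np


def estimate_speaker_mean(z_utt):
    """Estimate a speaker mean by averaging utterance-level latents.

    z_utt: array of shape (num_segments, latent_dim), hypothetical encoder output.
    """
    return np.asarray(z_utt, dtype=float).mean(axis=0)


def convert_utterance_latents(z_utt_source, target_mean):
    """Shift source latents so that their mean equals target_mean.

    Assumption: the per-segment deviations from the mean are kept unchanged;
    only the mean, which is taken to carry the speaker identity, is swapped.
    """
    z = np.asarray(z_utt_source, dtype=float)
    source_mean = z.mean(axis=0, keepdims=True)
    return z - source_mean + target_mean


# Random stand-ins for encoder outputs of a source and a target utterance.
rng = np.random.default_rng(0)
z_source = rng.normal(loc=1.0, size=(50, 16))
z_target = rng.normal(loc=-2.0, size=(80, 16))

target_mean = estimate_speaker_mean(z_target)
z_converted = convert_utterance_latents(z_source, target_mean)
```

In a full system, the converted utterance variables would be passed to the decoder together with the unchanged segment variables, and the resulting log-mel spectra would condition the vocoder; those stages are omitted here.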

Keywords

biphonetics

Year of publication

2019

Title of the conference proceedings

Proceedings of the 10th Speech Synthesis Workshop (SSW10)

Conference location

Vienna, Austria

Page URI

https://pub.uni-bielefeld.de/record/2936382

Cite

Gburrek T, Ebbers J, Häb-Umbach R, Wagner P. Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion. In: Proceedings of the 10th Speech Synthesis Workshop (SSW10). 2019.

Gburrek, T., Ebbers, J., Häb-Umbach, R., & Wagner, P. (2019). Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion. Proceedings of the 10th Speech Synthesis Workshop (SSW10).

Gburrek, Tobias, Ebbers, Janek, Häb-Umbach, Reinhold, and Wagner, Petra. 2019. “Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion”. In Proceedings of the 10th Speech Synthesis Workshop (SSW10).

Gburrek, T., Ebbers, J., Häb-Umbach, R., and Wagner, P. (2019). “Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion” in Proceedings of the 10th Speech Synthesis Workshop (SSW10).

Gburrek, T., et al., 2019. Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion. In Proceedings of the 10th Speech Synthesis Workshop (SSW10).

T. Gburrek, et al., “Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion”, Proceedings of the 10th Speech Synthesis Workshop (SSW10), 2019.

Gburrek, T., Ebbers, J., Häb-Umbach, R., Wagner, P.: Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion. Proceedings of the 10th Speech Synthesis Workshop (SSW10). (2019).

Gburrek, Tobias, Ebbers, Janek, Häb-Umbach, Reinhold, and Wagner, Petra. “Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion”. Proceedings of the 10th Speech Synthesis Workshop (SSW10). 2019.
