Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion
Gburrek T, Ebbers J, Häb-Umbach R, Wagner P (2019)
In: Proceedings of the 10th Speech Synthesis Workshop (SSW10).
Conference paper | English
Download
No files have been uploaded. Publication record only!
Author
Gburrek, Tobias;
Ebbers, Janek;
Häb-Umbach, Reinhold;
Wagner, Petra (UniBi)
Institution
Abstract / Note
This paper presents an approach to voice conversion that
requires neither parallel data nor speaker or phone labels for
training. It can convert between speakers that are not in the
training set by employing the previously proposed concept of a
factorized hierarchical variational autoencoder. Here, linguistic
and speaker-induced variations are separated based on the notion
that content-induced variations change at a much shorter time
scale, i.e., at the segment level, than speaker-induced variations,
which vary at the longer utterance level. In this contribution we
propose to employ convolutional instead of recurrent network
layers in the encoder and decoder blocks, which is shown to
achieve better phone recognition accuracy on the latent segment
variables at the frame level due to their better temporal resolution.
For voice conversion the mean of the utterance variables is replaced
with the respective estimated mean of the target speaker.
The resulting log-mel spectra of the decoder output are used as
local conditions of a WaveNet, which is used to synthesize
the speech waveforms. Experiments show both good disentanglement
properties of the latent space variables, and good
voice conversion performance, as assessed both quantitatively
and qualitatively.
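The conversion step described in the abstract (replacing the mean of the utterance variables with the estimated mean of the target speaker) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `replace_utterance_mean`, the array layout (one row per segment, one column per latent dimension), and the use of a plain sample average as the mean estimate are all assumptions made for illustration. The encoder, decoder, and WaveNet stages are not shown.

```python
import numpy as np


def replace_utterance_mean(z_utt_source, z_utt_target):
    """Shift source utterance-level latents to the target speaker's mean.

    Hypothetical helper, not taken from the paper.
    Both inputs are assumed to have shape (num_segments, latent_dim).
    """
    # Assumption: each mean is estimated as the sample average over segments.
    mu_source = z_utt_source.mean(axis=0)
    mu_target = z_utt_target.mean(axis=0)
    # Remove the source mean and add the target mean; the per-segment
    # deviations around the mean are left unchanged.
    return z_utt_source - mu_source + mu_target


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in latents; in practice these would come from a trained encoder.
    source = rng.normal(loc=0.0, size=(5, 4))
    target = rng.normal(loc=3.0, size=(7, 4))
    converted = replace_utterance_mean(source, target)
    print("target mean:   ", target.mean(axis=0))
    print("converted mean:", converted.mean(axis=0))
```

In this sketch only the utterance-level variables are modified. The segment-level variables, which the abstract associates with linguistic content, would be passed to the decoder unchanged.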
Keywords
biphonetics
Year of publication
2019
Title of conference proceedings
Proceedings of the 10th Speech Synthesis Workshop (SSW10)
Conference location
Vienna, Austria
Page URI
https://pub.uni-bielefeld.de/record/2936382
Cite
Gburrek T, Ebbers J, Häb-Umbach R, Wagner P. Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion. In: Proceedings of the 10th Speech Synthesis Workshop (SSW10). 2019.