More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation
Voß H (2026)
Bielefeld: Universität Bielefeld.
Bielefeld E-Dissertation | English
Abstract / Note
Human communication is inherently multimodal, with gestures playing a crucial role in conveying meaning beyond words alone. However, current gesture generation systems predominantly focus on visual naturalness while neglecting the communicative purpose of gestures, resulting in movements that appear fluid yet lack semantic information. This thesis addresses the fundamental challenge of creating non-verbal behaviors that are not only visually natural but also semantically meaningful and contextually grounded, thereby enhancing multimodal communication. It establishes a new category of gesture generation approaches, called context-driven, that combines the strengths of both intent-driven and speech-driven gesture generation methods. Through five interconnected papers, this work develops and evaluates several novel frameworks. First, the AQ-GT model demonstrates how quantization and hybrid GRU-Transformer architectures can generate highly realistic beat gestures, while the AQ-GT-A model extends this work by incorporating form and meaning annotations to guide the gesture generation process. An evaluation study of both models highlights the weaknesses of current speech-driven gesture generation and shifts the focus of this thesis from speech-driven to context-driven approaches. Based on this, the TF-JAX-IK algorithm provides a real-time inverse kinematics solution for mapping high-level gesture concepts onto natural human motion. The TF-JAX-IK algorithm is then used in the ImaGGen framework, which introduces a Semantic Planning approach that generates contextually grounded semantic gestures by analyzing visual input without requiring extensive training data. The evaluation of ImaGGen's contextually grounded gestures shows that they significantly improve the delivery of information, particularly when speech is ambiguous, while maintaining naturalness through the integration of speech-driven beat gestures. The work presented here makes several theoretical and practical contributions to the field of multimodal interaction and gesture generation. It challenges the prevailing assumption of recent years that visual naturalness alone is sufficient for effective gesture generation, emphasizing instead the crucial role of communicative intent and semantic grounding in producing meaningful gestures. The developed methods have direct applications in virtual reality and assistive communication tools, enabling virtual agents with ImaGGen-like capabilities to deliver clearer explanations and support more natural interactions. Finally, this thesis follows an open-source research approach, with a transparent and extensible codebase for all presented papers.
Year
2026
Page(s)
275
Copyright / Licenses
Page URI
https://pub.uni-bielefeld.de/record/3016358
Cite
Voß H. More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation. Bielefeld: Universität Bielefeld; 2026.
Voß, H. (2026). More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation. Bielefeld: Universität Bielefeld. https://doi.org/10.4119/unibi/3016358
Voß, Hendric. 2026. More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation. Bielefeld: Universität Bielefeld.
Voß, H. (2026). More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation. Bielefeld: Universität Bielefeld.
Voß, H., 2026. More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation, Bielefeld: Universität Bielefeld.
H. Voß, More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation, Bielefeld: Universität Bielefeld, 2026.
Voß, H.: More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation. Universität Bielefeld, Bielefeld (2026).
Voß, Hendric. More Than Just Natural. Contextually Relevant and Semantically Meaningful Gesture Generation. Bielefeld: Universität Bielefeld, 2026.
All files available under the following license(s):
Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)
Full text(s)
Name
Hendric_Diss_final.pdf
17.42 MB
Access Level
Open Access
Last Uploaded
2026-04-29T13:22:41Z
MD5 Checksum
d236bfbb3e2f81ceb87662e3658c9caa
Material in PUB:
Part of this dissertation
AQ-GT: A Temporally Aligned and Quantized GRU-Transformer for Co-Speech Gesture Synthesis
Voß H, Kopp S (2023)
In: Proceedings of the 25th International Conference on Multimodal Interaction (ICMI 2023). André E, Chetouani M, Vaufreydaz D, Lucas G, Schultz T, Morency L-P, Vinciarelli A (Eds); New York: ACM Press: 60-69.
Part of this dissertation
Augmented Co-Speech Gesture Generation: Including Form and Meaning Features to Guide Learning-Based Gesture Synthesis
Voß H, Kopp S (2023)
In: ACM International Conference on Intelligent Virtual Agents (IVA '23). New York: ACM.
Part of this dissertation
Conveying Meaning through Gestures: An Investigation into Semantic Co-Speech Gesture Generation
Voß H, Bohnenkamp LM, Kopp S (Submitted)
arXiv:2510.17599.
Part of this dissertation
Real-Time Inverse Kinematics for Generating Multi-Constrained Movements of Virtual Human Characters
Voß H, Kopp S (2025)
In: Proceedings of the 25th ACM International Conference on Intelligent Virtual Agents. Gebhard P, Schneeberger T, Biancardi B, Sabouret N, Schmitz M, Yumak Z (Eds); New York, NY, USA: ACM.
Part of this dissertation
ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input
Voß H, Kopp S (Submitted)
arXiv:2510.17617.
