Multi-modal scene understanding using probabilistic models

Wachsmuth, Sven

Wachsmuth S (2001)
Bielefeld (Germany): Bielefeld University.

Bielefeld E-Dissertation | English

URN

urn:nbn:de:hbz:361-4816

Author

Wachsmuth, Sven

Reviewer / Supervisor

Sagerer, Gerhard

Department

Faculty of Technology (Technische Fakultät) > Applied Informatics Group (AG Angewandte Informatik)

Abstract / Note

This thesis addresses the problem of relating spoken utterances to the simultaneously perceived visual scene context. The development of systems that integrate verbal and visual information is an expanding field of research. It is driven by applications such as the indexing and querying of video databases, service robotics, augmented reality, document analysis, documentation systems with multi-modal interfaces, and other multi-media systems. Each of these applications has to relate two or more different input modalities, a task also known as the correspondence problem. Relating realistic inputs such as speech or images is complicated by the fact that the interpretations of the surface modalities are often erroneous or incomplete, so an integration component must cope with noisy and partial interpretations. As a consequence, this thesis treats the correspondence problem as a probabilistic decoding process. This perspective distinguishes the approach from others that propose rule-based translation schemes or integrated knowledge bases and assume that a visual representation can be logically transformed into a verbal representation and vice versa.

This thesis successfully applies Bayesian networks to the task of integrating speech and images. The correspondence problem is solved in the language of Bayesian networks in a consistent and efficient way by using a novel combination of conditioning and elimination techniques. The experimental study identifies Bayesian networks as an adequate formalism for speech and image integration tasks. The mental models of the speaker are partially reconstructed by estimating conditional probabilities from the data of psycholinguistic experiments. Context-dependent shifts of word meanings are modeled by the structure of the network.

The proposed Bayesian network scheme for integrating multi-modal input has been applied to a construction scenario. A robot is instructed by a speaker to grasp objects from a table, join them together, and put them down again. In this thesis an integration component is realized that is able to identify objects in the visual scene that are verbally referred to by the speaker. This task is performed successfully despite vague descriptions, erroneous recognition results, and the use of names with unknown semantics. Several interaction tasks have been implemented that perform multi-modal object recognition, link unknown object names to scene objects, disambiguate alternative interpretations of utterances, predict undetected mounting relations, or determine the selected reference frame of the speaker.
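
The idea of treating object reference as probabilistic decoding can be illustrated with a very small sketch. It is not the network from the thesis and it does not use the combination of conditioning and elimination mentioned in the abstract; it simply enumerates a hidden "true attribute" variable for each spoken word. All probability tables, attribute names, and function names below are hypothetical values chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Dist = Dict[str, float]

# Hypothetical vision confusion model: P(true value | value reported by vision).
P_TRUE_GIVEN_SEEN: Dict[str, Dist] = {
    "red": {"red": 0.8, "orange": 0.2},
    "orange": {"red": 0.3, "orange": 0.7},
    "blue": {"blue": 1.0},
    "bolt": {"bolt": 0.9, "bar": 0.1},
    "bar": {"bolt": 0.1, "bar": 0.9},
}

# Hypothetical naming model: P(spoken word | true value).
P_WORD_GIVEN_TRUE: Dict[str, Dist] = {
    "red": {"red": 0.9, "orange": 0.1},
    "orange": {"red": 0.2, "orange": 0.8},
    "blue": {"blue": 1.0},
    "bolt": {"bolt": 0.95, "bar": 0.05},
    "bar": {"bolt": 0.05, "bar": 0.95},
}


def word_likelihood(word: str, seen: str) -> float:
    """P(word | seen), summing out the hidden true attribute value."""
    return sum(
        p_true * P_WORD_GIVEN_TRUE[true].get(word, 0.0)
        for true, p_true in P_TRUE_GIVEN_SEEN[seen].items()
    )


def referent_posterior(
    words: Sequence[str], scene: Sequence[Tuple[str, ...]]
) -> List[float]:
    """Posterior over scene objects being the referent of the utterance.

    Each object is a tuple of vision labels aligned with `words`
    (here: colour, type). A uniform prior over objects is assumed.
    """
    scores = []
    for labels in scene:
        score = 1.0 / len(scene)
        for word, seen in zip(words, labels):
            score *= word_likelihood(word, seen)
        scores.append(score)
    total = sum(scores)
    if total == 0.0:
        raise ValueError("utterance is impossible under the given tables")
    return [s / total for s in scores]


if __name__ == "__main__":
    scene = [("red", "bolt"), ("orange", "bolt"), ("blue", "bar")]
    for labels, p in zip(scene, referent_posterior(["red", "bolt"], scene)):
        print(labels, round(p, 3))
```

Even in this toy setting, an object that vision labels "orange" keeps some probability of being the referent of "the red bolt", which is the kind of tolerance to vague descriptions and erroneous recognition results that the abstract describes.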

Keywords

Bayesian networks; Computer vision; Speech understanding; Multi-modal systems; Statistical inference; Machine learning; Human-machine communication; Image understanding

Year

2001

Page URI

https://pub.uni-bielefeld.de/record/2302984

Cite

Wachsmuth S. Multi-modal scene understanding using probabilistic models. Bielefeld (Germany): Bielefeld University; 2001.

Wachsmuth, S. (2001). Multi-modal scene understanding using probabilistic models. Bielefeld (Germany): Bielefeld University.

Wachsmuth, Sven. 2001. Multi-modal scene understanding using probabilistic models. Bielefeld (Germany): Bielefeld University.

Wachsmuth, S., 2001. Multi-modal scene understanding using probabilistic models, Bielefeld (Germany): Bielefeld University.

S. Wachsmuth, Multi-modal scene understanding using probabilistic models, Bielefeld (Germany): Bielefeld University, 2001.

Wachsmuth, S.: Multi-modal scene understanding using probabilistic models. Bielefeld University, Bielefeld (Germany) (2001).

Wachsmuth, Sven. Multi-modal scene understanding using probabilistic models. Bielefeld (Germany): Bielefeld University, 2001.

All files available under the following license(s):

Copyright Statement:

This object is protected by copyright and/or related rights. [...]

Full text(s)

Name

0053.pdf

Access Level

Open Access

Last uploaded

2019-09-06T08:57:41Z

MD5 checksum

30ea575a89dc07da8aa711aeb2a9f92a
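
A downloaded copy can be compared against the checksum listed above. The following is a minimal sketch using Python's standard `hashlib`; the local path `0053.pdf` and the helper names are assumptions, not part of the record.

```python
import hashlib

# Checksum as listed in this record.
EXPECTED_MD5 = "30ea575a89dc07da8aa711aeb2a9f92a"


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path: str, expected: str = EXPECTED_MD5) -> bool:
    """True if the file's MD5 digest equals the expected hex string."""
    return md5_of_file(path) == expected.strip().lower()


if __name__ == "__main__":
    # Assumes the PDF was saved next to this script under its listed name.
    print(matches_record("0053.pdf"))
```

MD5 is suitable here only for detecting a corrupted or incomplete download; it is not a safeguard against deliberate tampering.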
