Bilingual word and chunk alignment : a hybrid system for Amharic and English

Teserra SA (2007)
Bielefeld: Bielefeld University.

Bielefeld Dissertation | English
Author
Teserra, Saba Amsalu
Supervisor
Gibbon, Dafydd
Alternative Title
Bilinguales Wort und Chunk Alignment : ein hybrides System für Amharisch und Englisch
Abstract
This thesis presents efficient algorithms for aligning words and word parts in sentence-aligned parallel corpora. It relies mainly on methods from statistical translation modelling that use term distribution matrices to identify correlations between terms. Linguistic rules of morphology and syntax are also used to complement the statistical methods. Two models are developed: Model I searches a target document for the translation of a source word, while Model II aligns the words of a source sentence with the words of the translated target sentence. Both models are evaluated on Amharic-English parallel corpora. In addition, Model II is compared with the implementation of the IBM alignment model (GIZA++).

This thesis presents efficient word alignment algorithms for sentence-aligned parallel corpora. For the most part, it employs procedures from statistical translation modelling that use measures of term distribution as the basis for finding correlations between terms. Linguistic rules of morphology and syntax are also used to complement the statistical methods. Two models have been developed, briefly described as follows.

Alignment Model I. The first model is a statistical global alignment method in which the entire target document is searched for the translation of a source word. The term in the target language whose distribution is most similar to that of the source term is taken as the best translation. The output of this algorithm is a 1:1 alignment of complex Amharic words with simple English words. In reality, one word in one language is not necessarily translated into a single word in the other language, and vice versa. This phenomenon is even more pronounced in disparate languages such as English and Amharic. Therefore an enhancement method, a relaxing routine that scales up the 1:1 alignments into 1:m alignments, has been devised. This approach, which synthesises English chunks equivalent to Amharic words from parallel corpora, is also described in this study. The procedure brings together several words in the simpler language to form a chunk equivalent to the complex word in the other language. The relaxing procedure may resolve the shortcomings of a 1:1 alignment, but it does not correct the distortion in word statistics caused by morphological variants. Hence finite-state shallow stemmers that strip salient affixes in both languages have also been developed.

Alignment Model II. Model II performs local alignment of a word in a source-language sentence to a word in the corresponding target-language sentence. The search for the translation of a word is limited to the corresponding sentence rather than the entire document. This is a step towards increased recall, which is vital when dealing with languages for which translated texts are scarce. The procedure, however, results in a drop in precision. To improve precision, two procedures have been integrated into it:
1. Reuse of the lexicon from Model I: known translations are excluded from the search space, leaving a limited number of words from which to choose the most likely translation.
2. A pattern recognition approach that recognises morphological and syntactic features and allows translations within sentences to be guessed.

The study also compares the performance of Model I across Amharic, English, Hebrew and German, and observes the impact of the complexities and typological disparities of these languages on the performance of the alignment method. A further attempt to exploit translation texts made in the course of this research was to recognise nouns in Amharic by transfer from German translations. Since nouns in German are recognisable by their initial capital letter, aligning nouns leads to the recognition of nouns in Amharic, which have no special features that distinguish them from words in other word classes. All components of the system have been evaluated on text aligned at sentence level. On the same data, a comparison with the IBM alignment model implementation (GIZA++) has also been made.
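To make the interplay of the two models more concrete, the following is a minimal sketch and not the thesis implementation. It assumes that "similarity in distribution" is measured with the Dice coefficient over the sets of sentence indices in which a term occurs; the abstract does not name the measure. The function names, the `min_count` cut-off and the toy source tokens (`xa`, `xb`, ...) are hypothetical and invented for illustration. The relaxing routine, the stemmers and the pattern recognition step are not modelled.

```python
from collections import defaultdict


def occurrence_sets(sentences):
    """Map each word to the set of sentence indices in which it occurs."""
    occ = defaultdict(set)
    for i, sent in enumerate(sentences):
        for word in sent.split():
            occ[word].add(i)
    return occ


def dice(a, b):
    """Dice coefficient of two index sets (assumed similarity measure)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def model1_lexicon(src_occ, tgt_occ, min_count=2):
    """Global search: compare each source term with every target term.

    min_count is a hypothetical cut-off: terms seen fewer times are
    skipped because their distribution is too sparse to trust.
    """
    lexicon = {}
    for s, s_set in src_occ.items():
        if len(s_set) < min_count:
            continue
        best, best_score = None, 0.0
        for t, t_set in tgt_occ.items():
            score = dice(s_set, t_set)
            if score > best_score:
                best, best_score = t, score
        if best is not None:
            lexicon[s] = best
    return lexicon


def model2_align(src_sent, tgt_sent, lexicon, src_occ, tgt_occ):
    """Local search: align words within one sentence pair only."""
    src_words, tgt_words = src_sent.split(), tgt_sent.split()
    links, used = {}, set()
    # Step 1: reuse known translations from the Model I lexicon.
    for s in src_words:
        if lexicon.get(s) in tgt_words:
            links[s] = lexicon[s]
            used.add(lexicon[s])
    # Step 2: exclude known translations from the search space, then
    # pick the most similar remaining target word (m:1 links allowed).
    candidates = [t for t in tgt_words if t not in used]
    for s in src_words:
        if s in links or not candidates:
            continue
        links[s] = max(candidates, key=lambda t: dice(src_occ[s], tgt_occ[t]))
    return links


# Toy sentence-aligned corpus; the source tokens are made up.
src = ["xa xb", "xa xc", "xd xb"]
tgt = ["dog runs", "dog sleeps", "cat runs"]
src_occ, tgt_occ = occurrence_sets(src), occurrence_sets(tgt)

lexicon = model1_lexicon(src_occ, tgt_occ)
alignment = model2_align(src[1], tgt[1], lexicon, src_occ, tgt_occ)
```

In this toy run the rare tokens `xc` and `xd` are left out of the global lexicon, and the local step recovers `xc` once `dog` has been removed from the candidates. That mirrors the recall argument made for Model II, under the assumptions stated above.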
Year
2007

Cite this

Teserra SA. Bilingual word and chunk alignment : a hybrid system for Amharic and English. Bielefeld: Bielefeld University; 2007.
Teserra, S. A. (2007). Bilingual word and chunk alignment : a hybrid system for Amharic and English. Bielefeld: Bielefeld University.
Teserra, S.A., 2007. Bilingual word and chunk alignment : a hybrid system for Amharic and English, Bielefeld: Bielefeld University.
S.A. Teserra, Bilingual word and chunk alignment : a hybrid system for Amharic and English, Bielefeld: Bielefeld University, 2007.
Teserra, S.A.: Bilingual word and chunk alignment : a hybrid system for Amharic and English. Bielefeld University, Bielefeld (2007).
Teserra, Saba Amsalu. Bilingual word and chunk alignment : a hybrid system for Amharic and English. Bielefeld: Bielefeld University, 2007.
Access Level
Open Access
