MTTK : Machine Translation Toolkit

Version 1.0 - Yonggang Deng,  Bill Byrne

MTTK is a collection of software tools for the alignment of parallel text for use in Statistical Machine Translation.  
The toolkit was written by Yonggang Deng in the course of his Ph.D. at The Johns Hopkins University Center for
Language and Speech Processing.

With MTTK you can ...

Further Reading ...

Y. Deng and W. Byrne. MTTK: An alignment toolkit for statistical machine translation. Presented at the HLT-NAACL Demonstrations Program, June 2006. Presentation (slides, poster, etc.)
The MTTK alignment toolkit for statistical machine translation can be used for word, phrase, and sentence alignment of parallel documents. It is designed mainly for building statistical machine translation systems, but can be exploited in other multilingual applications. It provides computationally efficient alignment and estimation procedures that can be used for the unsupervised alignment of parallel text collections in a language independent fashion. MTTK Version 1.0 is available under the Open Source Educational Community License.
Y. Deng and W. Byrne. HMM word and phrase alignment for statistical machine translation. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):494–507, March 2008.
Efficient estimation and alignment procedures for word and phrase alignment HMMs are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model-4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model-4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.
Y. Deng, S. Kumar, and W. Byrne. Segmentation and alignment of parallel text for statistical machine translation. Journal of Natural Language Engineering, 13(3):235–260, 2006.
We address the problem of extracting bilingual chunk pairs from parallel text to create training sets for statistical machine translation. We formulate the problem in terms of a stochastic generative process over text translation pairs, and derive two different alignment procedures based on the underlying alignment model. The first procedure is a now-standard dynamic programming alignment model which we use to generate an initial coarse alignment of the parallel text. The second procedure is a divisive clustering parallel text alignment procedure which we use to refine the first-pass alignments. This latter procedure is novel in that it permits the segmentation of the parallel text into sub-sentence units which are allowed to be reordered to improve the chunk alignment. The quality of chunk pairs is measured by the performance of machine translation systems trained from them. We show practical benefits of divisive clustering as well as how system performance can be improved by exploiting portions of the parallel text that otherwise would have to be discarded. We also show that chunk alignment as a first step in word alignment can significantly reduce word alignment error rate.
Y. Deng and W. Byrne. HMM word and phrase alignment for statistical machine translation. In Proceedings of HLT-EMNLP, 2005. Presentation (slides, poster, etc.)
HMM-based models are developed for the alignment of words and phrases in bitext. The models are formulated so that alignment and parameter estimation can be performed efficiently. We find that Chinese-English word alignment performance is comparable to that of IBM Model-4 even over large training bitexts. Phrase pairs extracted from word alignments generated under the model can also be used for phrase-based translation, and in Chinese to English and Arabic to English translation, performance is comparable to systems based on Model-4 alignments. Direct phrase pair induction under the model is described and shown to improve translation performance.
Yonggang's dissertation: Bitext Alignment for Statistical Machine Translation, Electrical and Computer Engineering, The Johns Hopkins University, 2005
  Slides from thesis defense: [.pdf]

How to Obtain MTTK

MTTK is released under the Open Source Educational Community License.
You're free to use, modify, and distribute MTTK under the terms of this license.
To request a copy, send an email to bill . byrne @ eng.cam.ac.uk .