MTTK: Machine Translation Toolkit
Version 1.0 - Yonggang Deng, Bill Byrne
MTTK is a collection of software tools for the alignment of parallel text for use in Statistical Machine Translation.
The toolkit was written by Yonggang Deng in the course of his Ph.D. at The Johns Hopkins University Center for Language and Speech Processing.
With MTTK you can ...
- Align document translation pairs at the sentence or sub-sentence level, sometimes known as chunking. This is a useful pre-processing step to prepare collections of translations for use in estimating the parameters of complex alignment models. Sub-sentence alignment in particular makes it possible to segment long sentences, which would otherwise have to be discarded, into shorter aligned segments.
- Train statistical models for parallel text alignment.
The following models are supported:
- IBM Model-1 and Model-2
- Word-to-Word HMMs
- Word-to-Phrase HMMs, with bigram translation probabilities
- Parallelize your model training procedures. If you have multiple CPUs available, you can partition your translation training texts into subsets, which speeds up iterative parameter re-estimation and reduces the amount of memory needed in training. Partitioned training still uses exact EM-based parameter estimation procedures.
- Generate word-to-word and word-to-phrase alignments of parallel text. MTTK can generate Viterbi alignments of parallel text (both training text and other texts) under the supported alignment models.
- Extract word-to-word translation tables from aligned bitext and from the estimated models.
- Extract phrase-to-phrase translation tables (phrase-pair inventories) from aligned parallel text.
- Use the HMM alignment models to induce phrase translations directly under the statistical models. Phrase-pair induction can generate richer inventories of phrase translations than can be extracted from Viterbi alignments.
- Edit the C++ source code to implement your own estimation and alignment procedures.
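To make the training, parallelization, and Viterbi alignment items above concrete, here is a minimal sketch in Python. It is not MTTK code and does not use MTTK's interfaces: the function names (`collect_counts`, `m_step`, `train`, `viterbi`) and the toy bitext are hypothetical, and IBM Model-1 stands in for the more complex HMM models. The sketch assumes each partition is a list of (source words, target words) pairs. Because the E-step only produces counts that are summed across partitions before the M-step, partitioned training should match unpartitioned training up to floating-point rounding.

```python
from collections import defaultdict


def collect_counts(partition, t):
    """E-step on one partition: expected word-pair counts under IBM Model-1.

    t[(f, e)] is the translation probability t(f | e). Hypothetical helper,
    not part of MTTK.
    """
    counts = defaultdict(float)  # expected count of f generated by e
    totals = defaultdict(float)  # expected count of e generating anything
    for e_words, f_words in partition:
        e_words = ["NULL"] + e_words  # empty word for unaligned targets
        for f in f_words:
            z = sum(t[(f, e)] for e in e_words)
            for e in e_words:
                p = t[(f, e)] / z  # posterior that f is aligned to e
                counts[(f, e)] += p
                totals[e] += p
    return counts, totals


def m_step(partition_stats):
    """M-step: sum the statistics from all partitions, then renormalize."""
    counts, totals = defaultdict(float), defaultdict(float)
    for part_counts, part_totals in partition_stats:
        for key, value in part_counts.items():
            counts[key] += value
        for key, value in part_totals.items():
            totals[key] += value
    return {(f, e): c / totals[e] for (f, e), c in counts.items()}


def train(partitions, iterations=10):
    """EM training; each partition's E-step could run on a separate CPU."""
    t = defaultdict(lambda: 1.0)  # uniform (unnormalized) initialization
    for _ in range(iterations):
        stats = [collect_counts(part, t) for part in partitions]
        t = m_step(stats)
    return t


def viterbi(e_words, f_words, t):
    """Most likely Model-1 alignment: for each target word, the index of the
    best source word (0 means the NULL word). Unseen pairs score 0."""
    e_words = ["NULL"] + e_words
    return [
        max(range(len(e_words)), key=lambda i: t.get((f, e_words[i]), 0.0))
        for f in f_words
    ]


if __name__ == "__main__":
    bitext = [
        (["the", "house"], ["das", "haus"]),
        (["the", "book"], ["das", "buch"]),
        (["a", "book"], ["ein", "buch"]),
    ]
    model = train([bitext[:1], bitext[1:]])  # two partitions
    print(viterbi(["the", "house"], ["das", "haus"], model))
```

In this sketch the partitions are processed in a simple loop; a real parallel run would dispatch each `collect_counts` call to its own process and only exchange the count tables, which is also why each process needs to hold just its own subset of the text in memory.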
Further Reading ...
- Y. Deng and W. Byrne. MTTK: An alignment toolkit for statistical machine translation. Presented in the HLT-NAACL Demonstrations Program, June 2006.
Presentation (slides, poster, etc.)
The MTTK alignment toolkit for statistical machine translation can be used for word, phrase, and sentence alignment of parallel documents. It is designed mainly for building statistical machine translation systems, but can be exploited in other multilingual applications. It provides computationally efficient alignment and estimation procedures that can be used for the unsupervised alignment of parallel text collections in a language independent fashion. MTTK Version 1.0 is available under the Open Source Educational Community License.
- Y. Deng and W. Byrne. HMM word and phrase alignment for statistical machine translation. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):494–507, March 2008.
Efficient estimation and alignment procedures for word and phrase alignment HMMs are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model-4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model-4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.
- Y. Deng, S. Kumar, and W. Byrne. Segmentation and alignment of parallel text for statistical machine translation. Journal of Natural Language Engineering, 13(3):235–260, 2006.
We address the problem of extracting bilingual chunk pairs from parallel text to create training sets for statistical machine translation. We formulate the problem in terms of a stochastic generative process over text translation pairs, and derive two different alignment procedures based on the underlying alignment model. The first procedure is a now-standard dynamic programming alignment model which we use to generate an initial coarse alignment of the parallel text. The second procedure is a divisive clustering parallel text alignment procedure which we use to refine the first-pass alignments. This latter procedure is novel in that it permits the segmentation of the parallel text into sub-sentence units which are allowed to be reordered to improve the chunk alignment. The quality of chunk pairs is measured by the performance of machine translation systems trained from them. We show practical benefits of divisive clustering as well as how system performance can be improved by exploiting portions of the parallel text that otherwise would have to be discarded. We also show that chunk alignment as a first step in word alignment can significantly reduce word alignment error rate.
- Y. Deng and W. Byrne. HMM word and phrase alignment for statistical machine translation. In Proceedings of HLT-EMNLP, 2005.
Presentation (slides, poster, etc.)
HMM-based models are developed for the alignment of words and phrases in bitext. The models are formulated so that alignment and parameter estimation can be performed efficiently. We find that Chinese-English word alignment performance is comparable to that of IBM Model-4 even over large training bitexts. Phrase pairs extracted from word alignments generated under the model can also be used for phrase-based translation, and in Chinese to English and Arabic to English translation, performance is comparable to systems based on Model-4 alignments. Direct phrase pair induction under the model is described and shown to improve translation performance.
- Yonggang's dissertation: Bitext Alignment for Statistical Machine Translation, Electrical and Computer Engineering, The Johns Hopkins University, 2005
  Slides from thesis defense: [.pdf]
How to Obtain MTTK
MTTK is released under the Open Source Educational Community License.
You're free to use, modify, and distribute MTTK under the terms of this license.
To request a copy, send an email to bill . byrne @ eng.cam.ac.uk .