Bill Byrne
Publications and Presentations
2012
- W. Byrne.
Hierarchical
phrase-based translation representations.
Workshop on 'More Structure for Better Statistical Machine Translation?',
University of Amsterdam, Netherlands, January 2012.
Invited lecture.
- W. Byrne.
Weighted
finite state transducers in statistical machine translation.
International Winter School in Language and Speech Technologies (WSLST 2012),
Tarragona, Spain, January 2012.
Six lecture short course.
This short course will present some recent advances in statistical
machine translation (SMT) using modelling approaches based on Weighted Finite
State Transducers (WFSTs) and Finite State Automata (FSA). The course focus
will be on decoding procedures for SMT, i.e. the generation of translations
using stochastic translation grammars and language models. WFSTs can offer a
very powerful modelling framework for language processing. For problems which
can be formulated in terms of WFSTs or FSAs, there are general purpose
algorithms which can be used to implement efficient and exact search and
estimation procedures. This is true even for problems which are not
inherently finite state, such as translation with some stochastic context
free grammars. The course will begin with an introduction to WFSTs, pushdown
automata, and semirings in the context of SMT. The use of WFST and FSA
modelling approaches will be presented for: SMT decoding with phrase-based
models; SMT decoding with stochastic synchronous context free grammars (e.g.
Hiero); SMT parameter optimisation (MERT); the use of large language models
and 'fast' grammars in translation; translation lattice generation; and
rescoring procedures such as minimum Bayes risk decoding and system
combination. Implementations using the OpenFst toolkit will also be
described. The course material will be suitable for researchers already
familiar with SMT and who wish to learn about alternative methods in decoder
design. Enough background will be given so that researchers new to machine
translation or unfamiliar with applications of WFSTs in natural language
processing will also find the material appropriate.
- Kei Hashimoto, Junichi
Yamagishi, William Byrne, Simon King, and Keiichi Tokuda.
Impacts of machine translation and speech synthesis on speech-to-speech
translation.
Speech Communication, to appear, 2012.
This paper analyzes the impacts of machine translation and speech
synthesis on speech-to-speech translation systems. A typical speech-to-speech
translation system consists of three components: speech recognition, machine
translation and speech synthesis. Many techniques have been proposed for
integration of speech recognition and machine translation. However,
corresponding techniques have not yet been considered for speech synthesis.
The focus of the current work is machine translation and speech synthesis,
and we present a subjective evaluation designed to analyze their impact on
speech-to-speech translation. The results of these analyses show that the
naturalness and intelligibility of the synthesized speech are strongly
affected by the fluency of the translated sentences. In addition, various
features were found to correlate well with the average fluency of the
translated sentences and the average naturalness of the synthesized
speech.
2011
- K. Hashimoto, J. Yamagishi, W. Byrne,
S. King, and K. Tokuda.
An analysis of machine translation and speech synthesis in speech-to-speech
translation system.
In Proceedings of IEEE Conference on Acoustics, Speech and Signal
Processing, 2011.
This paper provides an analysis of the impacts of machine
translation and speech synthesis on speech-to-speech translation systems. The
speech-to-speech translation system consists of three components: speech
recognition, machine translation and speech synthesis. Recently, many
techniques for integration of speech recognition and machine translation have
been proposed. However, speech synthesis has not yet been considered. The
quality of synthesized speech is important, since users will not understand
what the system said if the quality of synthesized speech is bad. Therefore,
in this paper, we focus on the machine translation and speech synthesis
components, and report a subjective evaluation to analyze the impact of each
component. The results of these analyses show that the machine translation
component affects the performance of speech-to-speech translation greatly,
and that fluent sentences lead to higher naturalness and lower word error
rate of synthesized speech.
- M. Shannon, H. Zen, and
W. Byrne.
The effect of using normalized models in statistical speech synthesis.
In Proceedings of the 12th Annual Conference of the International
Speech Communication Association, 2011.
The standard approach to HMM-based speech synthesis is inconsistent
in the enforcement of the deterministic constraints between static and
dynamic features. The trajectory HMM and autoregressive HMM have been
proposed as normalized models which rectify this inconsistency. This paper
investigates the practical effects of using these normalized models, and
examines the strengths and weaknesses of the different models as
probabilistic models of speech. The most striking difference observed is that
the standard approach greatly underestimates predictive variance. We argue
that the normalized models have better predictive distributions than the
standard approach, but that all the models we consider are still far from
satisfactory probabilistic models of speech. We also present evidence that
better intra-frame correlation modelling goes some way towards improving
existing normalized models.
- Gonzalo Iglesias, Cyril
Allauzen, William Byrne, Adrià de Gispert, and Michael Riley.
Hierarchical phrase-based
translation representations.
In Proceedings of the 2011 Conference on Empirical Methods in Natural
Language Processing, pages 1373–1383, Edinburgh, Scotland, UK, July
2011. Association for Computational Linguistics.
This paper compares several translation representations for a
synchronous context-free grammar parse including CFGs/hypergraphs,
finite-state automata (FSA), and pushdown automata (PDA). The representation
choice is shown to determine the form and complexity of target LM
intersection and shortest-path algorithms that follow. Intersection, shortest
path, FSA expansion and RTN replacement algorithms are presented for PDAs.
Chinese-to-English translation experiments using HiFST and HiPDT, FSA and
PDA-based decoders, are presented using admissible (or exact) search,
possible for HiFST with compact SCFG rulesets and HiPDT with compact LMs. For
large rulesets with large LMs, we introduce a two-pass search strategy which
we then analyze in terms of search errors and translation
performance.
- J. Dines, H. Liang, L. Saheer, M. Gibson,
W. Byrne, K. Oura, K. Tokuda, J. Yamagishi, S. King, M. Wester,
T. Hirsimäki, R. Karhila, and M. Kurimo.
Personalising speech-to-speech translation: Unsupervised cross-lingual speaker
adaptation for HMM-based speech synthesis.
Computer Speech and Language, 2011.
In this paper we present results of unsupervised cross-lingual
speaker adaptation applied to text-to-speech synthesis. The application of
our research is the personalisation of speech-to-speech translation in which
we employ an HMM statistical framework for both speech recognition and
synthesis. This framework provides a logical mechanism to adapt synthesised
speech output to the voice of the user by way of speech recognition. In this
work we present results of several different unsupervised and cross-lingual
adaptation approaches as well as an end-to-end speaker adaptive
speech-to-speech translation system. Our experiments show that we can
successfully apply speaker adaptation in both unsupervised and cross-lingual
scenarios and our proposed algorithms seem to generalise well for several
language pairs. We also discuss important future directions including the
need for better evaluation metrics.
- M. Gibson and W. Byrne.
Unsupervised intra-lingual and cross-lingual speaker adaptation for HMM-based
speech synthesis using two-pass decision tree construction.
IEEE Transactions on Audio, Speech, and Language Processing,
19(4):895–904, 2011.
Accepted, to appear.
Hidden Markov model (HMM)-based speech synthesis systems possess
several advantages over concatenative synthesis systems. One such advantage
is the relative ease with which HMM-based systems are adapted to speakers not
present in the training dataset. Speaker adaptation methods used in the field
of HMM-based automatic speech recognition (ASR) are adopted for this task. In
the case of unsupervised speaker adaptation, previous work has used a
supplementary set of acoustic models to estimate the transcription of the
adaptation data. This paper first presents an approach to the unsupervised
speaker adaptation task for HMM-based speech synthesis models which avoids
the need for such supplementary acoustic models. This is achieved by defining
a mapping between HMM-based synthesis models and ASR-style models, via a
two-pass decision tree construction process. Second, it is shown that this
mapping also enables unsupervised adaptation of HMM-based speech synthesis
models without the need to perform linguistic analysis of the estimated
transcription of the adaptation data. Third, this paper demonstrates how this
technique lends itself to the task of unsupervised cross-lingual adaptation
of HMM-based speech synthesis models, and explains the advantages of such an
approach. Finally, listener evaluations reveal that the proposed unsupervised
adaptation methods deliver performance approaching that of supervised
adaptation.
- A. de Gispert, W. Byrne, J. Xu,
R. Zbib, J. Makhoul, A. Chalabi, H. Nader, N. Habash, and F. Sadat.
Preprocessing Arabic for Arabic-English statistical machine translation.
In J. Olive, C. Christianson, and J. McCary, editors, Handbook of
natural language processing and machine translation. DARPA Global
Autonomous Language Exploitation, pages 135–145. Springer,
2011.
2010
- William Byrne.
Hierarchical phrase-based translation with weighted finite state transducers.
Natural Language Processing Group, Department of Computer Science, University
of Sheffield, UK, December 2010.
- William Byrne.
Hierarchical phrase-based translation with
weighted finite state transducers.
7th International Workshop on Spoken Language Translation, Paris, France,
December 2010.
Keynote lecture.
- William Byrne.
Recent research in statistical machine translation.
Winton Capital Management Internal Research Conference, November 2010.
Invited presentation.
- William Byrne.
Hierarchical phrase-based translation with
weighted finite state transducers.
FALA 2010 Conference (VI Jornadas en Tecnologías del Habla and II Iberian
Workshop on Speech and Language Technologies for Iberian Languages), Vigo,
Spain, November 2010.
Keynote lecture.
- William Byrne.
Hierarchical phrase-based translation with weighted finite state transducers.
Dublin Computational Linguistics Research Seminar, Dublin, Ireland, November
2010.
- Matthew Gibson and William Byrne.
EMIME
project overview.
European Commission Information Society Conference (ICT 2010), Brussels,
Belgium, September 2010.
- William Byrne and Adrià
de Gispert.
Fast Hiero grammars.
DARPA GALE PI Meeting, Scottsdale, AZ, USA, April 2010.
- William Byrne.
Hierarchical phrase-based translation with weighted finite state transducers.
Columbia University, New York, NY, USA, April 2010.
- William Byrne.
Hierarchical phrase-based translation with weighted finite state transducers.
Google, Inc, Mountain View, CA, USA, April 2010.
- William Byrne.
FAUST project overview.
ICT-FP7 Language Technology Days, Luxembourg, March 2010.
- Adrià de Gispert, Juan Pino,
and William Byrne.
Hierarchical phrase-based translation
grammars extracted from alignment posterior probabilities.
In Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP), Cambridge, MA, 2010.
We report on investigations into hierarchical phrase-based
translation grammars based on rules extracted from posterior distributions
over alignments of the parallel text. Rather than restrict rule extraction to
a single alignment, such as Viterbi, we instead extract rules based on
posterior distributions provided by the HMM word-to-word alignment model. We
define translation grammars progressively by adding classes of rules to a
basic phrase-based system. We assess these grammars in terms of their
expressive power, measured by their ability to align the parallel text from
which their rules are extracted, and the quality of the translations they
yield. In Chinese-to-English translation, we find that rule extraction from
posteriors gives translation improvements. We also find that grammars with
rules with only one nonterminal, when extracted from posteriors, can
outperform more complex grammars extracted from Viterbi alignments. Finally,
we show that the best way to exploit source-to-target and target-to-source
alignment models is to build two separate systems and combine their output
translation lattices.
- Matthew Shannon and
William Byrne.
Autoregressive
clustering for HMM speech synthesis.
In Proceedings of INTERSPEECH, 2010.
The autoregressive HMM has been shown to provide efficient
parameter estimation and high-quality synthesis, but in previous experiments
decision trees derived from a non-autoregressive system were used. In this
paper we investigate the use of autoregressive clustering for autoregressive
HMM-based speech synthesis. We describe decision tree clustering for the
autoregressive HMM and highlight differences to the standard clustering
procedure. Subjective listening evaluation results suggest that
autoregressive clustering improves the naturalness of the resulting speech.
We find that the standard minimum description length (MDL) criterion for
selecting model complexity is inappropriate for the autoregressive HMM.
Investigating the effect of model complexity on naturalness, we find that a
large degree of overfitting is tolerated without a substantial decrease in
naturalness.
- Graeme Blackwood, Adrià
de Gispert, and William Byrne.
Efficient path counting transducers for minimum
Bayes-risk decoding of statistical machine translation lattices.
In Proceedings of the Annual Meeting of the Association for Computational
Linguistics, 2010.
This paper presents an efficient implementation of linearised
lattice minimum Bayes-risk decoding using weighted finite state transducers.
We introduce transducers to efficiently count lattice paths containing
n-grams and use these to gather the required statistics. We show that these
procedures can be implemented exactly through simple transformations of word
sequences to sequences of n-grams. This yields a novel implementation of
lattice minimum Bayes-risk decoding which is fast and exact even for very
large lattices.
- Mikko Kurimo and William Byrne
et al.
Personalising speech-to-speech translation in
the EMIME project.
In Proceedings of the Annual Meeting of the Association for Computational
Linguistics, 2010.
Demo Session.
- Graeme Blackwood, Adrià
de Gispert, and William Byrne.
Fluency constraints for minimum Bayes-risk
decoding of statistical machine translation lattices.
In Proceedings of the International Conference on Computational
Linguistics (COLING), 2010.
A novel and robust approach to incorporating natural language
generation into statistical machine translation is developed within a
minimum Bayes-risk decoding framework. Segmentation of translation lattices
is guided by confidence measures over the maximum likelihood translation
hypothesis in order to focus on regions with potential translation errors.
Modeling techniques intended to improve fluency in low confidence regions
are introduced so as to improve overall translation
fluency.
- Juan Pino, Gonzalo Iglesias, Adrià
de Gispert, Graeme Blackwood, Jamie Brunning, and William Byrne.
The CUED HiFST system for the WMT10 translation
shared task.
In Proceedings of the ACL 2010 Joint Fifth Workshop on Statistical
Machine Translation, 2010.
This paper describes the Cambridge University Engineering
Department submission to the Fifth Workshop on Statistical Machine
Translation. We report results for the French-English and Spanish-English
shared translation tasks in both directions. The CUED system is based on
HiFST, a hierarchical phrase-based decoder implemented using weighted
finite-state transducers. In the French-English task, we investigate the use
of context-dependent alignment models. We also show that lattice minimum
Bayes-risk decoding is an effective framework for multi-source translation,
leading to large gains in BLEU score.
- M. Kurimo, S. Virpioja, V. T. Turunen,
G. W. Blackwood, and W. Byrne.
Overview and results of Morpho Challenge
2009.
In C. Peters et al., editors, Multilingual Information Access Evaluation,
10th Workshop of the Cross-Language Evaluation Forum - CLEF 2009,
volume 1 of Revised Selected Papers, Lecture Notes in Computer Science,
LNCS 6241, pages 579–598. Springer, 2010.
The goal of Morpho Challenge 2009 was to evaluate unsupervised
algorithms that provide morpheme analyses for words in different languages
and in various practical applications. Morpheme analysis is particularly
useful in speech recognition, information retrieval and machine translation
for morphologically rich languages where the amount of different word forms
is very large. The evaluations consisted of: 1. a comparison to grammatical
morphemes, 2. using morphemes instead of words in information retrieval
tasks, and 3. combining morpheme and word based systems in statistical
machine translation tasks. The evaluation languages were: Finnish, Turkish,
German, English and Arabic. This paper describes the tasks, evaluation
methods, and obtained results. The Morpho Challenge was part of the EU
Network of Excellence PASCAL Challenge Program and organized in collaboration
with CLEF.
- Matthew Gibson, Teemu Hirsimäki,
Reima Karhila, Mikko Kurimo, and William Byrne.
Unsupervised cross-lingual speaker
adaptation for HMM-based speech synthesis using two-pass decision tree
construction.
In Proceedings of IEEE Conference on Acoustics, Speech and Signal
Processing, 2010.
This paper demonstrates how unsupervised cross-lingual adaptation
of HMM-based speech synthesis models may be performed without explicit
knowledge of the adaptation data language. A two-pass decision tree
construction technique is deployed for this purpose. Using parallel
translated datasets, cross-lingual and intralingual adaptation are compared
in a controlled manner. Listener evaluations reveal that the proposed method
delivers performance approaching that of unsupervised intralingual
adaptation.
- M. Kurimo, S. Virpioja,
V.T. Turunen, G.W. Blackwood, and W. Byrne.
Overview and
results of Morpho Challenge 2009.
In Multilingual Information Access Evaluation, 10th Workshop of the
Cross-Language Evaluation Forum, CLEF 2009, volume 1 of Lecture
Notes in Computer Science, pages 578–597. Springer, 2010.
In the Morpho Challenge 2009 unsupervised algorithms that provide
morpheme analyses for words in different languages were evaluated in various
practical applications. Morpheme analysis is particularly useful in speech
recognition, information retrieval and machine translation for
morphologically rich languages where the amount of different word forms is
very large. The evaluations consisted of: 1. a comparison to grammatical
morphemes, 2. using morphemes instead of words in information retrieval
tasks, and 3. combining morpheme and word based systems in statistical
machine translation tasks. The evaluation languages in 2009 were: Finnish,
Turkish, German, English and Arabic. This overview paper describes the tasks,
evaluation methods, and obtained results. The Morpho Challenge is part of the
EU Network of Excellence PASCAL Challenge Program and organized in
collaboration with CLEF.
- A. de Gispert, G. Iglesias,
G. Blackwood, E. R. Banga, and W. Byrne.
Hierarchical phrase-based translation with weighted
finite state transducers and shallow-N grammars.
Computational Linguistics, 36(3), September 2010.
In this paper we describe HiFST, a lattice-based decoder for
hierarchical phrase-based translation and alignment. The decoder is
implemented with standard Weighted Finite-State Transducer (WFST) operations
as an alternative to the well-known cube pruning procedure. We find that the
use of WFSTs rather than k-best lists requires less pruning in translation
search, resulting in fewer search errors, better parameter optimization, and
improved translation performance. The direct generation of translation
lattices in the target language can improve subsequent rescoring procedures,
yielding further gains when applying long-span language models and Minimum
Bayes Risk decoding. We also give insight as to how to control the size of
the search space defined by hierarchical rules. We show that shallow-N
grammars, low-level rule catenation and other search constraints can help to
match the power of the translation system to specific language
pairs.
2009
- William Byrne.
Hierarchical phrase-based translation with weighted finite state transducers.
The Johns Hopkins University Center for Language and Speech Processing,
Baltimore, MD, USA, November 2009.
- A de Gispert, G Iglesias, G Blackwood,
J Brunning, and B Byrne.
The CUED NIST 2009 Arabic-English SMT System.
NIST Open Machine Translation 2009 Evaluation (MT09) Workshop, Ottawa, ON,
Canada, August 2009.
- W. Byrne.
Context-dependent alignment models and hierarchical phrase-based translation
with weighted finite state transducers.
GALE PI Meeting, Tampa, FL, USA, May 2009.
- Gonzalo Iglesias, Adrià
de Gispert, Eduardo R. Banga, and William Byrne.
The HiFST system for the Europarl
Spanish-to-English task.
In Proceedings of SEPLN, pages 207–214, 2009.
In this paper we present results for the Europarl
Spanish-to-English translation task. We use HiFST, a novel hierarchical
phrase-based translation system implemented with finite-state technology that
creates target lattices rather than k-best lists.
- M. Shannon and W. Byrne.
Autoregressive HMMs for speech
synthesis.
In Proceedings of INTERSPEECH, 2009.
We propose the autoregressive HMM for speech synthesis. We show
that the autoregressive HMM supports efficient EM parameter estimation and
that we can use established effective synthesis techniques such as synthesis
considering global variance with minimal modification. The autoregressive HMM
uses the same model for parameter estimation and synthesis in a consistent
way, in contrast to the standard HMM synthesis framework, and supports easy
and efficient parameter estimation, in contrast to the trajectory HMM. We
find that the autoregressive HMM gives performance comparable to the standard
HMM synthesis framework on a Blizzard Challenge-style naturalness
evaluation.
- A. de Gispert, S. Virpioja,
M. Kurimo, and W. Byrne.
Minimum Bayes risk combination of
translation hypotheses from alternative morphological decompositions.
In Proceedings of NAACL-HLT, 2009.
We describe a simple strategy to achieve translation performance
improvements by combining output from identical statistical machine
translation systems trained on alternative morphological decompositions of
the source language. Combination is done by means of Minimum Bayes Risk
decoding over a shared N-best list. When translating into English from two
highly inflected languages such as Arabic and Finnish we obtain significant
improvements over simply selecting the best morphological
decomposition.
- J. Brunning, A. de Gispert, and
W. Byrne.
Context-dependent alignment models for
statistical machine translation.
In Proceedings of NAACL-HLT, 2009.
We introduce alignment models for Machine Translation that take
into account the context of a source word when determining its translation.
Since the use of these contexts alone causes data sparsity problems, we
develop a decision tree algorithm for clustering the contexts based on
optimisation of the EM auxiliary function. We show that our context-dependent
models lead to an improvement in alignment quality, and an increase in
translation quality when the alignments are used to build a machine
translation system.
- G. Iglesias, A. de Gispert,
E. R. Banga, and W. Byrne.
Hierarchical phrase-based translation with
weighted finite state transducers.
In Proceedings of NAACL-HLT, 2009.
This paper describes a lattice-based decoder for hierarchical
phrase-based translation. The decoder is implemented with standard WFST
operations as an alternative to the well-known cube pruning procedure. We
find that the use of WFSTs rather than k-best lists requires less pruning in
translation search, resulting in fewer search errors, direct generation of
translation lattices in the target language, better parameter optimization,
and improved translation performance when rescoring with long-span language
models and MBR decoding. We report translation experiments for the
Arabic-to-English and Chinese-to-English NIST translation tasks and contrast
the WFST-based hierarchical decoder with hierarchical translation under cube
pruning.
- G. Iglesias, A. de Gispert, E. R.
Banga, and W. Byrne.
Rule filtering by pattern for efficient
hierarchical translation.
In Proceedings of the 12th Conference of the European Chapter of the
Association for Computational Linguistics (EACL 2009), 2009.
We describe refinements to hierarchical translation search
procedures intended to reduce both search errors and memory usage through
modifications to hypothesis expansion in cube pruning and reductions in the
size of the rule sets used in translation. Rules are put into syntactic
classes based on the number of non-terminals and the pattern, and various
filtering strategies are then applied to assess the impact on translation
speed and quality. Results are reported on the 2008 NIST Arabic-to-English
evaluation task.
2008
- W. Byrne.
Statistical techniques in machine translation.
Google EMEA Faculty Summit, Zurich, Switzerland, 2008.
Keynote lecture.
- W. Byrne.
Phrase-based statistical machine
translation with weighted finite state transducers.
IRTG Summer School in Computational Linguistics and Psycholinguistics,
University of Edinburgh, UK, September 2008.
Invited tutorial.
The Transducer Translation Model (TTM) for phrase-based statistical
machine translation follows a generative model of translation and is
implemented by the composition of component models realized as Weighted
Finite State Transducers via the OpenFst Toolkit. This flexible architecture
requires no special purpose decoder and readily handles the large-scale
natural language processing demands of state-of-the-art machine translation
systems. This presentation describes how the system was used for the NIST
2008 Arabic-English machine translation evaluation task and for the
Spanish-English and French-English translation in the ACL 2008 Third Workshop
on Statistical Machine Translation Shared Task. General issues in using WFSTs
for such tasks will also be discussed.
- A. de Gispert, G. Blackwood,
J. Brunning, and W. Byrne.
The CUED NIST 2008 Arabic-English SMT
System.
NIST MT Workshop, Alexandria, VA, USA, March 2008.
- W. Byrne.
Statistical machine translation.
Advanced Machine Learning Tutorial Lectures Series, Cambridge University
Engineering Department, UK, February 2008.
- G. Blackwood, A. de Gispert,
J. Brunning, and W. Byrne.
Large-scale statistical machine translation with
weighted finite state transducers.
In Proceedings of FSMNLP 2008: Finite-State Methods and Natural Language
Processing, Ispra, Lago Maggiore, Italy, September 2008.
The Cambridge University Engineering Department phrase-based
statistical machine translation system follows a generative model of
translation and is implemented by the composition of component models
realised as Weighted Finite State Transducers. Our flexible architecture
requires no special purpose decoder and readily handles the large-scale
natural language processing demands of state-of-the-art machine translation
systems. In this paper we describe the CUED participation in the NIST 2008
Arabic-English machine translation evaluation task.
- G. Blackwood, A. de Gispert, and
W. Byrne.
Phrasal segmentation models for statistical
machine translation.
In Proceedings of the 22nd International Conference on Computational
Linguistics, Manchester, UK, August 2008.
Phrasal segmentation models define a mapping from the words of a
sentence to sequences of translatable phrases. We discuss the estimation of
these models from large quantities of monolingual training text and describe
their realization as weighted finite state transducers for incorporation into
phrase-based statistical machine translation systems. Results are reported on
the NIST Arabic-English translation tasks showing significant complementary
gains in BLEU score with large 5-gram and 6-gram language
models.
- G. Blackwood, A. de Gispert, J. Brunning,
and W. Byrne.
European language translation with weighted finite
state transducers: The CUED MT system for the 2008 ACL workshop on
statistical machine translation.
In Proceedings of the ACL 2008 Third Workshop on Statistical Machine
Translation, June 2008.
We describe the Cambridge University Engineering Department
phrase-based statistical machine translation system for Spanish-English and
French-English translation in the ACL 2008 Third Workshop on Statistical
Machine Translation Shared Task. The CUED system follows a generative model
of translation and is implemented by composition of component models realised
as Weighted Finite State Transducers, without the use of a special-purpose
decoder. Details of system tuning for both Europarl and News translation
tasks are provided.
- Y. Deng and W. Byrne.
HMM word and phrase alignment for
statistical machine translation.
IEEE Transactions on Audio, Speech, and Language Processing,
16(3):494–507, March 2008.
Efficient estimation and alignment procedures for word and phrase
alignment HMMs are developed for the alignment of parallel text. The
development of these models is motivated by an analysis of the desirable
features of IBM Model 4, one of the original and most effective models for
word alignment. These models are formulated to capture the desirable aspects
of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and
compared to human-generated reference alignments, and the ability of these
models to capture different types of alignment phenomena is evaluated. In
analyzing alignment performance, Chinese-English word alignments are shown to
be comparable to those of IBM Model-4 even when models are trained over large
parallel texts. In translation performance, phrase-based statistical machine
translation systems based on these HMM alignments can equal and exceed
systems based on Model-4 alignments, and this is shown in Arabic-English and
Chinese-English translation. These alignment models can also be used to
generate posterior statistics over collections of parallel text, and this is
used to refine and extend phrase translation tables with a resulting
improvement in translation quality.
2007
- K.-C. Sim, W. Byrne, M. Gales,
H. Sahbi, and P.C. Woodland.
Consensus network decoding for statistical
machine translation system combination.
In IEEE Conference on Acoustics, Speech and Signal Processing,
2007.
This paper presents a simple and robust consensus decoding approach
for combining multiple Machine Translation (MT) system outputs. A consensus
network is constructed from an N-best list by aligning the hypotheses
against an alignment reference, where the alignment is based on minimising
the translation edit rate (TER). The Minimum Bayes Risk (MBR) decoding
technique is investigated for the selection of an appropriate alignment
reference. Several alternative decoding strategies are proposed to retain
coherent phrases in the original translations. Experimental results are
presented primarily based on three-way combination of Chinese-English
translation outputs, and results are also presented for six-way system
combination. It is shown that worthwhile improvements in translation
performance can be obtained using the methods discussed.
- X. A. Liu, W. J. Byrne, M. J. F.
Gales, A. de Gispert, M. Tomalin, P. C. Woodland, and K. Yu.
Discriminative language model adaptation for Mandarin broadcast speech
transcription and translation.
In Proc. IEEE Automatic Speech Recognition and Understanding
(ASRU), Kyoto, Japan, 2007.
- V. Venkataramani,
S. Chakrabartty, and W. Byrne.
Gini support vector machines for
segmental minimum Bayes risk decoding of continuous speech.
Computer Speech and Language, 21:423–442, 2007.
Published online by Elsevier Ltd., 2 October 2006.
We describe the use of Support Vector Machines (SVMs) for
continuous speech recognition by incorporating them in Segmental Minimum
Bayes Risk decoding. Lattice cutting is used to convert the Automatic Speech
Recognition search space into sequences of smaller recognition problems. SVMs
are then trained as discriminative models over each of these problems and
used in a rescoring framework. We pose the estimation of a posterior
distribution over hypotheses in these regions of acoustic confusion as a
logistic regression problem. We also show that GiniSVMs can be used as an
approximation technique to estimate the parameters of the logistic regression
problem. On a small vocabulary recognition task we show that the use of
GiniSVMs can improve the performance of a well trained Hidden Markov Model
system trained under the Maximum Mutual Information criterion. We also find
that it is possible to derive reliable confidence scores over the GiniSVM
hypotheses and that these can be used to good effect in hypothesis
combination. We discuss the problems that we expect to encounter in extending
this approach to Large Vocabulary Continuous Speech Recognition and describe
initial investigation of constrained estimation techniques to derive feature
spaces for SVMs.
2006
- Y. Deng and W. Byrne.
MTTK: An alignment toolkit for statistical
machine translation.
HLT-NAACL Demonstrations Program, New York, NY, USA, June 2006.
The MTTK alignment toolkit for statistical machine translation can
be used for word, phrase, and sentence alignment of parallel documents. It is
designed mainly for building statistical machine translation systems, but can
be exploited in other multilingual applications. It provides computationally
efficient alignment and estimation procedures that can be used for the
unsupervised alignment of parallel text collections in a language independent
fashion. MTTK Version 1.0 is available under the Open Source Educational
Community License.
- W. Byrne.
Integrating automatic speech recognition and statistical machine translation.
TC-STAR OpenLab on Speech Translation, Trento, Italy, April 2006.
Invited tutorial.
- W. Byrne.
Statistical phrase-based speech translation.
GALE Mid-Phase PI Meeting, Boston, MA, USA, March 2006.
- W. Byrne.
Minimum Bayes risk estimation and decoding in large vocabulary continuous
speech recognition.
University of Sheffield, UK, January 2006.
Progress in automatic speech recognition is frequently measured by
easily computed, task-neutral measures such as Word Error Rate. Ideally it
would be possible to design systems tailored for any application no matter
how complex or specialized the performance criteria. Minimum Bayes-Risk (MBR)
processing is a modeling framework that attempts to minimize the empirical
expected risk under task-specific loss functions that describe desired system
behavior. This presentation will describe risk-based recognition and model
estimation procedures developed for the refinement of automatic speech
recognition systems. The MBR formulation has also made it possible to
implement a hybrid estimation and discriminative training approach called
Acoustic Code-Breaking. This is a divide-and-conquer strategy that breaks
continuous speech recognition problems into a sequence of smaller, distinct
subproblems that can be solved independently using specially trained
discriminative models such as Support Vector Machines. These estimation and
decoding approaches will be described, along with evaluation of their
performance on various automatic speech recognition tasks.
- V. Venkataramani and
W. Byrne.
Sub-problem selection for acoustic
code-breaking.
Technical Report CUED/F/INFENG/TR563, Cambridge University Engineering
Department, December 2006.
We have developed lattice rescoring procedures which apply
specialized acoustic and language models, e.g. discriminatively trained HMMs
and Support Vector Machines, to improve the transcription performance of
general purpose ASR systems. Initial investigations showed that large gains
in small vocabulary tasks are possible, but that large improvements in large
vocabulary recognition are harder to achieve. In this paper we analyze the
feasibility of the entire approach. We find that sub-problems can be identified
reliably, and that the approach has potential, but that the selection of
sub-problems for solution should consider additional factors, such as the
availability of training data. This report analyzes a particular previously
developed decoding framework designed to identify possible recognition errors
in a first-pass recognition hypothesis; Support Vector Machines were then
used to improve the original ASR hypotheses. Here we analyze the improvements
obtained to better understand the decoding framework. We also present methods
for selecting a larger number of sub-problems in large vocabulary tasks that
offer scope for additional improvement.
- L. Mathias and W. Byrne.
Statistical phrase-based speech
translation.
In IEEE Conference on Acoustics, Speech and Signal Processing,
2006.
A generative statistical model of speech-to-text translation is
developed as an extension of existing models of phrase-based text
translation. Speech is translated by mapping ASR word lattices to lattices of
phrase sequences which are then translated using operations developed for
text translation. Performance is reported on Chinese to English translation
of Mandarin Broadcast News.
- Y. Deng, S. Kumar, and W. Byrne.
Segmentation and alignment of parallel
text for statistical machine translation.
Journal of Natural Language Engineering, 13(3):235–260, 2006.
We address the problem of extracting bilingual chunk pairs from
parallel text to create training sets for statistical machine translation. We
formulate the problem in terms of a stochastic generative process over text
translation pairs, and derive two different alignment procedures based on the
underlying alignment model. The first procedure is a now-standard dynamic
programming alignment model which we use to generate an initial coarse
alignment of the parallel text. The second procedure is a divisive clustering
parallel text alignment procedure which we use to refine the first-pass
alignments. This latter procedure is novel in that it permits the
segmentation of the parallel text into sub-sentence units which are allowed
to be reordered to improve the chunk alignment. The quality of chunk pairs
is measured by the performance of machine translation systems trained from
them. We show practical benefits of divisive clustering as well as how system
performance can be improved by exploiting portions of the parallel text that
otherwise would have to be discarded. We also show that chunk alignment as a
first step in word alignment can significantly reduce word alignment error
rate.
- W. Byrne.
Minimum Bayes risk estimation and
decoding in large vocabulary continuous speech recognition.
Proceedings of the Institute of Electronics, Information, and
Communication Engineers, Japan – Special Section on Statistical Modeling for
Speech Processing, E89-D(3), March 2006.
Invited paper.
- S. Kumar,
Y. Deng, and W. Byrne.
A weighted finite state transducer translation
template model for statistical machine translation.
Journal of Natural Language Engineering, 12(1):35–75, March 2006.
We present a Weighted Finite State Transducer Translation Template
Model for statistical machine translation. The approach we describe allows us
to implement each constituent distribution of the model as a weighted finite
state transducer or acceptor. We show that bitext word alignment and
translation under the model can be performed with standard FSM operations
involving these transducers. One of the benefits of using this framework is
that it avoids the need to develop specialized search procedures, even for
the generation of lattices or N-Best lists of bitext word alignments and
translation hypotheses. We report and analyze bitext word alignment and
translation performance of the model on French-English and Chinese-English
tasks.
See also CLSP Tech. Rep. 48, 2004.
- J. Li, F. Zheng, W. Byrne, and
D. Jurafsky.
A dialectal Chinese speech recognition
framework.
Journal of Computer Science and Technology (Science Press, Beijing,
China), (1):106–115, January 2006.
A framework for dialectal Chinese speech recognition is proposed
and studied, where a relatively small dialectal Chinese (or in other words
Chinese influenced by the native dialect) speech corpus and the
dialect-related knowledge are adopted to translate a standard Chinese (or
Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese
speech recognizer. There are two kinds of knowledge sources: one is human
experts and the other is a small dialectal Chinese corpus. This knowledge
includes four levels: a phonetics level, a lexicon level, a language level, and
an acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an
example target language with the goal of deriving an acceptable WDC speech
recognizer from an existing PTH speech recognizer. Based on the Initial-Final
structure of the Chinese language and a study of how dialectal Chinese
speakers speak Putonghua, we proposed to use the knowledge of the
context-independent PTH-IF mappings (where IF means either a Chinese Initial
or a Chinese Final), the context-independent WDC-IF mappings, and the
syllable-dependent WDC-IF mappings obtained from either experts or data, and
then to combine these with the surface-form based maximum likelihood linear
regression (MLLR) acoustic model adaptation method. To reduce the size of the
multi-pronunciation lexicon introduced by the IF mappings, which might entail
confusion in the lexicon and hence lead to performance degradation, a
Multi-Pronunciation Expansion (MPE) method based on an accumulated uni-gram
probability (AUP) was proposed. Compared with the original PTH speech
recognizer, the resulting WDC speech recognizer achieved over 10% absolute
Character Error Rate (CER) reduction when recognizing WDC with only 0.62% CER
increase when recognizing PTH. The proposed framework and methods are
intended to work not only for Wu dialectal Chinese but also for other
dialectal Chinese languages and even other languages.
2005
- W. Byrne.
Minimum Bayes risk estimation and decoding in large vocabulary continuous
speech recognition.
Google, Inc, Mountain View, CA, USA, September 2005.
Progress in automatic speech recognition is frequently measured by
easily computed, task-neutral measures such as Word Error Rate. Ideally it
would be possible to design systems tailored for any application no matter
how complex or specialized the performance criteria. Minimum Bayes-Risk (MBR)
processing is a modeling framework that attempts to minimize the empirical
expected risk under task-specific loss functions that describe desired system
behavior. This presentation will describe risk-based recognition and model
estimation procedures developed for the refinement of automatic speech
recognition systems. The MBR formulation has also made it possible to
implement a hybrid estimation and discriminative training approach called
Acoustic Code-Breaking. This is a divide-and-conquer strategy that breaks
continuous speech recognition problems into a sequence of smaller, distinct
subproblems that can be solved independently using specially trained
discriminative models such as Support Vector Machines. These estimation and
decoding approaches will be described, along with evaluation of their
performance on various automatic speech recognition tasks.
- S. Kumar, Y. Deng, and W. Byrne.
Johns Hopkins University - Cambridge University Chinese-English and
Arabic-English 2005 NIST MT Evaluation Systems.
2005 NIST MT Workshop, Bethesda, MD, USA, June 2005.
- W. Byrne.
Current Research in Phrase-Based Statistical Machine Translation – and some
links to ASR.
King's College London, UK, May 2005.
- W. Byrne.
Phrase-based statistical machine translation using finite state machines –
with some links to ASR.
University of Washington, Seattle, WA, USA, May 2005.
- S. Kumar, Y. Deng, and
W. Byrne.
JHU/CUED Chinese-English translation system – 2005 TC-STAR evaluation.
TC-STAR Evaluation Meeting, Trento, Italy, April 2005.
- W. Byrne.
Current research in phrase-based statistical machine translation and some links
to ASR.
Machine Intelligence Laboratory Speech Seminar, Cambridge University
Engineering Department, UK, March 2005.
- W. Byrne.
Current research in phrase-based statistical machine translation and some links
to ASR.
Seminar Series, Institute for Collaborative and Communicating Systems and Human
Communication Research Centre, University of Edinburgh, UK, January 2005.
- S. Kumar and W. Byrne.
Local phrase reordering models for
statistical machine translation.
In Proceedings of HLT-EMNLP, 2005.
We describe stochastic models of local phrase movement that can be
incorporated into a Statistical Machine Translation (SMT) system. These
models provide properly formulated, non-deficient, probability distributions
over reordered phrase sequences. They are implemented by Weighted Finite
State Transducers. We describe EM-style parameter re-estimation procedures
based on phrase alignment under the complete translation model incorporating
reordering. Our experiments show that the reordering model yields substantial
improvements in translation performance on Arabic-to-English and
Chinese-to-English MT tasks. We also show that the procedure scales as the
bitext size is increased.
- Y. Deng and W. Byrne.
HMM word and phrase alignment for
statistical machine translation.
In Proceedings of HLT-EMNLP, 2005.
HMM-based models are developed for the alignment of words and
phrases in bitext. The models are formulated so that alignment and parameter
estimation can be performed efficiently. We find that Chinese-English word
alignment performance is comparable to that of IBM Model-4 even over large
training bitexts. Phrase pairs extracted from word alignments generated under
the model can also be used for phrase-based translation, and in Chinese to
English and Arabic to English translation, performance is comparable to
systems based on Model-4 alignments. Direct phrase pair induction under the
model is described and shown to improve translation
performance.
- J. Psutka, P. Ircing, J.V.
Psutka, J. Hajic, W. Byrne, and J. Mirovski.
Automatic transcription of Czech,
Russian, and Slovak spontaneous speech in the MALACH project.
In Proceedings of EUROSPEECH, 2005.
This paper describes the 3.5-year effort put into building LVCSR
systems for recognition of spontaneous speech of Czech, Russian, and Slovak
witnesses of the Holocaust in the MALACH project. For processing of
colloquial, highly emotional and heavily accented speech of elderly people
containing many non-speech events we have developed techniques that very
effectively handle both non-speech events and colloquial and accented
variants of uttered words. Manual transcripts as one of the main sources for
language modeling were automatically "normalized" using a standardized
lexicon, which brought about a 2 to 3% reduction of the word error rate (WER).
The subsequent interpolation of such LMs with models built from an additional
collection (consisting of topically selected sentences from general text
corpora) resulted in an additional improvement of performance of up to 3%.
- V. Venkataramani and W. Byrne.
Lattice segmentation and support vector
machines for large vocabulary continuous speech recognition.
In IEEE Conference on Acoustics, Speech and Signal Processing,
2005.
Lattice segmentation procedures are used to spot possible
recognition errors in first-pass recognition hypotheses produced by a large
vocabulary continuous speech recognition system. This approach is analyzed in
terms of its ability to reliably identify, and provide good alternatives for,
incorrectly hypothesized words. A procedure is described to train and apply
Support Vector Machines to strengthen the first pass system where it was
found to be weak, resulting in small but statistically significant
recognition improvements on a large test set of conversational
speech.
- S. Tsakalidis and W. Byrne.
Acoustic training from heterogeneous data
sources: Experiments in Mandarin conversational telephone speech
transcription.
In IEEE Conference on Acoustics, Speech and Signal Processing,
2005.
In this paper we investigate the use of heterogeneous data sources
for acoustic training. We describe an acoustic normalization procedure for
enlarging an ASR acoustic training set with out-of-domain acoustic data. A
larger in-domain training set is created by effectively transforming the
out-of-domain data before incorporation in training. Baseline experimental
results in Mandarin conversational telephone speech transcription show that a
simple attempt to add out-of-domain data degrades performance. Preliminary
experiments assess the effectiveness of the proposed cross-corpus acoustic
normalization.
- A. Gunawardana and W. Byrne.
Convergence theorems for generalized
alternating minimization procedures.
Journal of Machine Learning Research, 6:2049–2073, December
2005.
The EM algorithm is widely used to develop iterative parameter
estimation procedures for statistical models. In cases where these procedures
strictly follow the EM formulation, the convergence properties of the
estimation procedures are well understood. In some instances there are
practical reasons to develop procedures that do not strictly fall within the
EM framework. We study EM variants in which the E-Step is not performed
exactly, either to obtain improved rates of convergence, or due to
approximations needed to compute statistics under a model family over which
E-Steps cannot be realized. Since these variants are not EM procedures, the
standard (G)EM convergence results do not apply to them. We present an
information geometric framework for describing such algorithms and analyzing
their convergence properties. We apply this framework to analyze the
convergence properties of incremental EM and variational EM. For incremental
EM, we discuss conditions under which these algorithms converge in likelihood. For
variational EM, we show how the E-Step approximation prevents convergence to
local maxima in likelihood.
- V. Doumpiotis and W. Byrne.
Lattice segmentation and minimum Bayes
risk discriminative training for large vocabulary continuous speech
recognition.
Speech Communication, (2):142–160, 2005.
Lattice segmentation techniques developed for Minimum Bayes Risk
decoding in large vocabulary speech recognition tasks are used to compute the
statistics for discriminative training algorithms that estimate HMM
parameters so as to reduce the overall risk over the training data. New
estimation procedures are developed and evaluated for small vocabulary and
large vocabulary recognition tasks, and additive performance improvements are
shown relative to maximum mutual information estimation. These relative gains
are explained through a detailed analysis of individual word recognition
errors.
- V. Doumpiotis, S. Tsakalidis, and
W. Byrne.
Discriminative linear transforms for
feature normalization and speaker adaptation in HMM estimation.
IEEE Transactions on Speech and Audio Processing, 13(3):367–376,
May 2005.
Linear transforms have been used extensively for training and
adaptation of HMM-based ASR systems. Recently procedures have been developed
for the estimation of linear transforms under the Maximum Mutual Information
(MMI) criterion. In this paper we introduce discriminative training
procedures that employ linear transforms for feature normalization and for
speaker adaptive training. We integrate these discriminative linear
transforms into MMI estimation of HMM parameters for improvement of large
vocabulary conversational speech recognition systems.
2004
- W. Byrne.
Minimum Bayes risk estimation and decoding in large vocabulary continuous
speech recognition.
ATR Workshop "Beyond HMMs", Kyoto, Japan, December 2004.
Invited paper and lecture.
- W. Byrne.
Current research in statistical machine translation and links with automatic
speech recognition.
ISM Open Lectures on Statistical Speech Processing, The Institute for
Statistical Mathematics, Tokyo, Japan, December 2004.
Invited lecture.
- S. Kumar et al.
The Johns Hopkins University 2004 Chinese-English and Arabic-English MT
Evaluation Systems.
2004 NIST MT Workshop, Alexandria, VA, USA, June 2004.
- W. Byrne.
Minimum Risk Estimation and Decoding for Speech and Language Processing.
Microsoft Research, Redmond, Washington, USA, February 2004.
- W. Byrne.
Minimum Risk Estimation and Decoding for Speech and Language Processing.
Signal, Speech and Language Interpretation Lab, University of Washington,
Seattle, WA, USA, February 2004.
- W. Byrne.
Minimum Risk Estimation and Decoding for Speech and Language Processing.
Speech Analysis and Interpretation Laboratory, University of Southern California
School of Engineering, Los Angeles, CA, USA, February 2004.
- W. Byrne.
Minimum Bayes risk estimation and
decoding in large vocabulary continuous speech recognition.
In Proceedings of the ATR Workshop "Beyond HMMs", Kyoto, Japan,
2004.
Invited paper.
Minimum risk estimation and decoding strategies based on lattice
segmentation techniques can be used to refine large vocabulary continuous
speech recognition systems through the estimation of the parameters of the
underlying hidden Markov models and through the identification of smaller
recognition tasks which provides the opportunity to incorporate novel
modeling and decoding procedures in LVCSR. These techniques are discussed in
the context of going beyond HMMs.
- I. Shafran and W. Byrne.
Task-specific minimum Bayes-risk decoding
using learned edit distance.
In Proc. of the International Conference on Spoken Language
Processing, 2004.
This paper extends the minimum Bayes-risk framework to incorporate
a loss function specific to the task and the ASR system. The errors are
modeled as a noisy channel and the parameters are learned from the data. The
resulting loss function is used in the risk criterion for decoding.
Experiments on a large vocabulary conversational speech recognition system
demonstrate significant gains of about 1% absolute over the MAP hypothesis and
about 0.6% absolute over an untrained loss function. The approach is general
enough to be applicable to other sequence recognition problems such as in
Optical Character Recognition (OCR) and in analysis of biological
sequences.
- V. Doumpiotis and W. Byrne.
Pinched lattice minimum Bayes risk
discriminative training for large vocabulary continuous speech
recognition.
In Proc. of the International Conference on Spoken Language
Processing, 2004.
Iterative estimation procedures that minimize empirical risk based
on general loss functions such as the Levenshtein distance have been derived
as extensions of the Extended Baum Welch algorithm. While reducing expected
loss on training data is a desirable training criterion, these algorithms can
be difficult to apply. They are unlike MMI estimation in that they require an
explicit listing of the hypotheses to be considered and in complex problems
such lists tend to be prohibitively large. To overcome this difficulty,
modeling techniques originally developed to improve search efficiency in
Minimum Bayes Risk decoding can be used to transform these estimation
algorithms so that exact update, risk minimization procedures can be used for
complex recognition problems. Experimental results in two large vocabulary
speech recognition tasks show improvements over conventionally trained MMIE
models.
- J. Psutka, P. Ircing,
J. Hajic, V. Radova, J.V. Psutka, W. Byrne, and S. Gustman.
Issues in annotation of the Czech spontaneous
speech corpus in the MALACH project.
In Proceedings of the International Conference on Language Resources and
Evaluation (LREC), 2004.
The paper presents the issues encountered in processing spontaneous
Czech speech in the MALACH project. Specific problems connected with a
frequent occurrence of colloquial words in spontaneous Czech are analyzed; a
partial solution is proposed and experimentally evaluated.
- J. Psutka, J. Hajic, and
W. Byrne.
Slavic languages in the MALACH
project.
In IEEE Conference on Acoustics, Speech and Signal Processing.
IEEE, 2004.
Invited Paper in Special Session on Multilingual Speech Processing.
The development of acoustic training material for Slavic languages
within the MALACH project is described. Initial experience with the variety
of speakers and the difficulties encountered in transcribing Czech, Slovak,
and Russian language oral history are described along with ASR recognition
results intended to investigate the effectiveness of different transcription
conventions that address language specific phenomena within the task
domain.
- S. Kumar and W. Byrne.
Minimum Bayes-risk decoding for statistical
machine translation.
In Proceedings of HLT-NAACL, 2004.
We present Minimum Bayes-Risk (MBR) decoding for statistical
machine translation. This statistical approach aims to minimize expected loss
of translation errors under loss functions that measure translation
performance. We describe a hierarchy of loss functions that incorporate
different levels of linguistic information from word strings, word-to-word
alignments from an MT system, and syntactic structure from parse-trees of
source and target language sentences. We report the performance of the MBR
decoders on a Chinese-to-English translation task. Our results show that MBR
decoding can be used to tune statistical MT performance for specific loss
functions.
- W. Byrne, D. Doermann,
M. Franz, S. Gustman, J. Hajic, D. Oard, M. Picheny, J. Psutka,
B. Ramabhadran, D. Soergel, T. Ward, and W.-J. Zhu.
Automatic recognition of spontaneous speech for access to multilingual oral
history archives.
IEEE Transactions on Speech and Audio Processing, Special Issue on
Spontaneous Speech Processing, pages 420–435, July 2004.
The MALACH project has the goal of developing the technologies
needed to facilitate access to large collections of spontaneous speech. Its
aim is to dramatically improve the state of the art in key Automatic Speech
Recognition (ASR) and Natural Language Processing (NLP) technologies for use in
large-scale retrieval systems. The project leverages a unique collection of
oral history interviews with survivors of the Holocaust that has been
assembled and extensively annotated by the Survivors of the Shoah Visual
History Foundation. This paper describes the collection, 116,000 hours of
interviews in 32 languages, and the way in which system requirements have
been discerned through user studies. It discusses ASR methods for very
difficult speech (heavily accented, emotional, and elderly spontaneous
speech), including transcription to create training data and methods for
language modeling and speaker adaptation. Results are presented for
English and Czech. NLP results are presented for named entity tagging, topic
segmentation, and supervised topic classification, and the architecture of an
integrated search system that uses these results is
described.
- V. Goel, S. Kumar, and W. Byrne.
Segmental minimum
Bayes-risk decoding for automatic speech recognition.
IEEE Transactions on Speech and Audio Processing, 12:234–249,
May 2004.
Minimum Bayes-Risk (MBR) speech recognizers have been shown to
yield improvements over conventional maximum a-posteriori probability (MAP)
decoders through N-best list rescoring and A* search over word lattices. We present a Segmental
Minimum Bayes-Risk decoding (SMBR) framework that simplifies the
implementation of MBR recognizers through the segmentation of the N-best
lists or lattices over which the recognition is to be performed. This paper
presents lattice cutting procedures that underlie SMBR decoding. Two of these
procedures are based on a risk minimization criterion while a third one is
guided by word-level confidence scores. In conjunction with SMBR decoding,
these lattice segmentation procedures give consistent improvements in
recognition word error rate (WER) on the Switchboard corpus. We also discuss
an application of risk-based lattice cutting to multiple-system SMBR decoding
and show that it is related to other system combination techniques such as
ROVER. This strategy combines lattices produced from multiple ASR systems and
is found to give WER improvements in a Switchboard evaluation system.
Correction available: In our
recently published paper, we presented a risk-based lattice cutting procedure
to segment ASR word lattices into smaller sub-lattices as a means to
improve the efficiency of Minimum Bayes-Risk (MBR) rescoring. In the
experiments reported, some of the hypotheses in the original lattices were
inadvertently discarded during segmentation, and this affected MBR
performance adversely. This note gives the corrected results as well as
experiments demonstrating that the segmentation process does not discard any
paths from the original lattice.
2003
- W. Byrne, S. Khudanpur, W. Kim,
S. Kumar, P. Pecina, P. Virga, P. Xu, and D. Yarowsky.
The Johns Hopkins University 2003 Chinese-English machine translation
system.
2003 NIST MT Workshop, Gaithersburg, MD, USA, June 2003.
- W. Byrne.
Minimum Bayes-Risk Estimation and Decoding Procedures for Speech and Language
Processing.
University of Edinburgh, UK, May 2003.
- V. Venkataramani, S. Chakrabartty,
and W. Byrne.
Support vector machines for segmental
minimum Bayes risk decoding of continuous speech.
In IEEE Automatic Speech Recognition and Understanding Workshop,
2003.
Segmental Minimum Bayes Risk (SMBR) Decoding involves the
refinement of the search space into manageable confusion sets, i.e.,
smaller sets of confusable words. We describe the application of Support
Vector Machines (SVMs) as discriminative models for the refined search space.
We show that SVMs, which in their basic formulation are binary classifiers of
fixed dimensional observations, can be used for continuous speech
recognition. We also study the use of GiniSVMs, which are a variant of the
basic SVM. On a small vocabulary task, we show this two-pass scheme
outperforms MMI trained HMMs. Using system combination we also obtain further
improvements over discriminatively trained HMMs.
- W. Byrne, S. Khudanpur, W. Kim,
S. Kumar, P. Pecina, P. Virga, P. Xu, and D. Yarowsky.
The Johns Hopkins University 2003
Chinese-English Machine Translation System.
In Machine Translation Summit IX. The Association for Machine
Translation in the Americas, 2003.
We describe a Chinese to English Machine Translation system
developed at the Johns Hopkins University for the NIST 2003 MT evaluations.
The system is based on a Weighted Finite State Transducer implementation of
the alignment template translation model for statistical machine translation.
The baseline MT system was trained using 100,000 sentence pairs selected from
a static bitext training collection. Information retrieval techniques were
then used to create specific training collections for each document to be
translated. This document-specific training set included bitext and name
entities that were then added to the baseline system by augmenting the
library of alignment templates. We report translation performance of baseline
and IR-based systems on two NIST MT evaluation test sets.
- J. Psutka, I. Iljuchin, P. Ircing,
J.V. Psutka, V. Trejbal, W. Byrne, J. Hajic, and S. Gustman.
Building LVCSR systems for transcription of spontaneously produced Russian
witnesses in the MALACH project: initial steps and first results.
In Proceedings of the Text, Speech, and Dialog Workshop, 2003.
The MALACH project uses the world's largest digital archive of
video oral histories collected by the Survivors of the Shoah Visual History
Foundation (VHF) and attempts to access such archives by advancing the
state-of-the-art in Automatic Speech Recognition and Information Retrieval.
This paper discusses the initial steps and first results in building large
vocabulary continuous speech recognition (LVCSR) systems for the
transcription of Russian witnesses. As the third language processed in the
MALACH project (following English and Czech), Russian has posed new ASR
challenges, especially in phonetic modeling. Although most of the Russian
testimonies were provided by native Russian survivors, the speakers come from
many different regions and countries resulting in a diverse collection of
accented spontaneous Russian speech.
- J. Psutka, P. Ircing, J. V. Psutka,
V. Radova, W. Byrne, J. Hajic, and S. Gustman.
Towards automatic transcription of
spontaneous Czech speech in the MALACH project.
In Proceedings of the Text, Speech, and Dialog Workshop, 2003.
Our paper discusses the progress achieved during a one-year effort
in building the Czech LVCSR system for the automatic transcription of
spontaneously produced testimonies in the MALACH project. The difficulty of
this task stems from the highly inflectional nature of the Czech language and
is further multiplied by the presence of many colloquial words in spontaneous
Czech speech as well as by the need to handle emotional speech filled with
disfluencies, heavy accents, age-related coarticulation and language
switching. In this paper we concentrate mainly on the acoustic modeling
issues - the proper choice of front-end parameterization, the handling of
non-speech events in acoustic modeling, and unsupervised acoustic adaptation
via MLLR. A method for selecting suitable language modeling data is also
briefly discussed.
- V. Doumpiotis, S. Tsakalidis,
and W. Byrne.
Lattice segmentation and minimum Bayes
risk discriminative training.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), 2003.
Modeling approaches are presented that incorporate discriminative
training procedures in segmental Minimum Bayes-Risk decoding (SMBR). SMBR is
used to segment lattices produced by a general automatic speech recognition
(ASR) system into sequences of separate decision problems involving small
sets of confusable words. We discuss two approaches to incorporating these
segmented lattices in discriminative training. We investigate the use of
acoustic models specialized to discriminate between the competing words in
these classes which are then applied in subsequent SMBR rescoring passes.
Refinement of the search space that allows the use of specialized
discriminative models is shown to be an improvement over rescoring with
conventionally trained discriminative models.
- J. Psutka, P. Ircing,
J.V. Psutka, V. Radovic, W. Byrne, J. Hajic, Jiri Mirovsky, and Samuel
Gustman.
Large vocabulary ASR for spontaneous
Czech in the MALACH project.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), 2003.
This paper describes LVCSR research into the automatic
transcription of spontaneous Czech speech in the MALACH (Multilingual Access
to Large Spoken Archives) project. This project attempts to provide improved
access to the large multilingual spoken archives collected by the Survivors
of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the
state of the art in automated speech recognition. We describe a baseline ASR
system and discuss the problems in language modeling that arise from the
nature of Czech as a highly inflectional language that also exhibits
diglossia between its written and spontaneous forms. The difficulties of this
task are compounded by heavily accented, emotional and disfluent speech along
with frequent switching between languages. To overcome the limited amount of
relevant language model data we use statistical techniques for selecting an
appropriate training corpus from a large unstructured text collection
resulting in significant reductions in word error rate. recognition and
retrieval techniques to improve cataloging efficiency and eventually to
provide direct access into the archive itself.
- A. Ikeno, B. Pellom, D. Cer,
A. Thornton, J. M. Brenier, D. Jurafsky, W. Ward, and W. Byrne.
Issues in recognition of
Spanish-accented spontaneous English.
In Proceedings of the ISCA and IEEE workshop on Spontaneous Speech
Processing and Recognition, Tokyo Institute of Technology, Tokyo,
Japan, 2003. ISCA and IEEE.
We describe a recognition experiment and two analytic experiments
on a database of strongly Hispanic-accented English. We show the crucial
importance of training on the Hispanic-accented data for acoustic model
performance, and describe the tendency of Spanish-accented speakers to use
longer, and presumably less-reduced, schwa vowels than native-English
speakers.
- V. Doumpiotis, S. Tsakalidis,
and W. Byrne.
Discriminative training for segmental
minimum Bayes-risk decoding.
In IEEE Conference on Acoustics, Speech and Signal Processing.
IEEE, 2003.
A modeling approach is presented that incorporates discriminative
training procedures within segmental Minimum Bayes-Risk decoding (SMBR). SMBR
is used to segment lattices produced by a general automatic speech
recognition (ASR) system into sequences of separate decision problems
involving small sets of confusable words. Acoustic models specialized to
discriminate between the competing words in these classes are then applied in
subsequent SMBR rescoring passes. Refinement of the search space that allows
the use of specialized discriminative models is shown to be an improvement
over rescoring with conventionally trained discriminative
models.
- D. Oard, D. Doermann, B. Dorr,
D. He, P. Resnik, W. Byrne, S. Khudanpur, D. Yarowsky, A. Leuski, P. Koehn,
and K. Knight.
Desperately seeking Cebuano.
In Proceedings of HLT-NAACL, 2003.
This paper describes an effort to rapidly develop language
resources and component technology to support searching Cebuano news stories
using English queries. Results from the first 60 hours of the exercise are
presented.
- S. Kumar and W. Byrne.
A weighted finite state transducer
implementation of the alignment template model for statistical machine
translation.
In Proceedings of HLT-NAACL, 2003.
We present a derivation of the alignment template model for
statistical machine translation and an implementation of the model using
weighted finite state transducers. The approach we describe allows us to
implement each constituent distribution of the model as a weighted finite
state transducer or acceptor. We show that bitext word alignment and
translation under the model can be performed with standard FSM operations
involving these transducers. One of the benefits of using this framework is
that it obviates the need to develop specialized search procedures, even for
the generation of lattices or N-Best lists of bitext word alignments and
translation hypotheses. We evaluate the implementation of the model on the
French-to-English Hansards task and report alignment and translation
performance.
- O. Kolak, W. Byrne, and P. Resnik.
A generative probabilistic OCR model for
NLP applications.
In Proceedings of HLT-NAACL, 2003.
In this paper we introduce a generative probabilistic optical
character recognition (OCR) model that describes an end-to-end process in the
noisy channel framework, progressing from generation of true text through its
transformation into the noisy output of an OCR system. The model is designed
for use in error correction, with a focus on post-processing the output of
black-box OCR systems in order to make them more useful for NLP tasks. We
present an implementation of the model based on finite-state models,
demonstrate the model's ability to significantly reduce character and word
error rate, and provide evaluation results involving automatic extraction of
translation lexicons from printed text.
- V. Goel and W. Byrne.
Minimum Bayes-risk automatic speech recognition.
In W. Chou and B.-H. Juang, editors, Pattern Recognition in Speech and
Language Processing, pages 51–77. CRC Press, 2003.
2002
- W. Byrne, V. Doumpiotis, S. Kumar,
S. Tsakalidis, and V. Venkataramani.
The Johns Hopkins University 2002 Large Vocabulary Conversational Speech
Recognition System.
NIST 2002 Rich Transcription Workshop, Vienna, VA, USA, 2002.
- W. Byrne.
MALACH:
Multilingual Access to Large Spoken Archives.
AT&T Speech Days, Florham Park, NJ, USA, October 2002.
Invited talk.
- S. Gustman, D. Soergel, D. Oard, W. Byrne,
M. Picheny, B. Ramabhadran, and D. Greenberg.
Supporting access to large digital oral
history archives.
In Proceedings of the Joint Conference on Digital Libraries, 2002.
This paper describes our experience with the creation, indexing,
and provision of access to a very large archive of videotaped oral histories
- 116,000 hours of digitized interviews in 32 languages from 52,000
survivors, liberators, rescuers, and witnesses of the Nazi Holocaust. It goes
on to identify a set of critical research issues that must be addressed if we
are to provide full and detailed access to collections of this size: issues
in user requirement studies, automatic speech recognition, automatic
classification, segmentation, summarization, retrieval, and user interfaces.
The paper ends by inviting others to discuss use of these materials in their
own research.
- W. Ward, H. Krech, X. Yu, K. Herold,
G. Figgs, A. Ikeno, D. Jurafsky, and W. Byrne.
Lexicon adaptation for LVCSR: speaker
idiosyncracies, non-native speakers, and pronunciation choice.
In ISCA ITR Workshop on Pronunciation Modeling and Lexicon
Adaptation, 2002.
We report on our preliminary experiments on building dynamic
lexicons for native-speaker conversational speech and for foreign-accented
conversational speech. Our goal is to build a lexicon with a set of
pronunciations for each word, in which the probability distribution over
pronunciation is dynamically computed. The set of pronunciations is derived
from hand-written rules (for foreign accent) or clustering (for
phonetically-transcribed Switchboard data). The dynamic
pronunciation-probability will take into account specific characteristics of
the speaker as well as factors such as language-model probability,
disfluencies, sentence position, and phonetic context.
- D. Oard, D. Demner-Fushman,
J. Hajic, B. Ramabhadran, S. Gustman, W. Byrne, D. Soergel, B. Dorr, P. Resnik,
and M. Picheny.
Cross-language access to recorded speech in the
MALACH project.
In Proceedings of the Text, Speech, and Dialog Workshop, 2002.
The MALACH project seeks to help users find information in a vast
multilingual collection of untranscribed oral history interviews. This paper
introduces the goals of the project and focuses on supporting access by users
who are unfamiliar with the interview language. It begins with a review of
the state of the art in cross-language speech retrieval: approaches that will
be investigated in the project are then described. Czech was selected as the
first non-English language to be supported; results of an initial
experiment with Czech/English cross-language retrieval are
reported.
- J. Psutka, P. Ircing, J. Psutka,
V. Radova, W. Byrne, J. Hajic, S. Gustman, and B. Ramabhadran.
Automatic transcription of Czech language
oral history in the MALACH project: Resources and initial experiments.
In Proceedings of the Text, Speech, and Dialog Workshop, 2002.
In this paper we describe the initial stages of the ASR component
of the MALACH project. This project will attempt to provide improved access
to the large multilingual spoken archives collected by the Survivors of the
Shoah Visual History Foundation by advancing the state of the art in
automated speech recognition. In order to train the ASR system, it is
necessary to manually transcribe a large amount of speech data, identify the
appropriate vocabulary, and obtain relevant text for language modeling. We
give a detailed description of the speech annotation process; show the
specific properties of the spontaneous speech contained in the archives; and
present baseline speech recognition results.
- S. Kumar and W. Byrne.
Minimum Bayes-risk alignment of bilingual
texts.
In Proc. of the Conference on Empirical Methods in Natural Language
Processing, Philadelphia, PA, USA, 2002.
We present Minimum Bayes-Risk word alignment for machine
translation. This statistical, model-based approach attempts to minimize the
expected risk of alignment errors under loss functions that measure alignment
quality. We describe various loss functions, including some that incorporate
linguistic analysis as can be obtained from parse trees, and show that these
approaches can improve alignments of the English-French
Hansards.
- S. Kumar and W. Byrne.
Risk based lattice cutting for segmental
minimum Bayes-risk decoding.
In Proc. of the International Conference on Spoken Language
Processing, Denver, Colorado, USA, 2002.
Minimum Bayes-Risk (MBR) speech recognizers have been shown to give
improvements over the conventional maximum a-posteriori probability (MAP)
decoders through N-best list rescoring and A-star search over word lattices.
Segmental MBR (SMBR) decoders simplify the implementation of MBR recognizers
by segmenting the N-best lists or lattices over which the recognition is
performed. We present a lattice cutting procedure that attempts to minimize
the total Bayes-Risk of all word strings in the segmented lattice. We provide
experimental results on the Switchboard conversational speech corpus showing
that this segmentation procedure, in conjunction with SMBR decoding, gives
modest but significant improvements over MAP decoders as well as MBR decoders
on unsegmented lattices.
- S. Tsakalidis, V. Doumpiotis, and
W. Byrne.
Discriminative linear transforms for
feature normalization and speaker adaptation in HMM estimation.
In Proc. of the International Conference on Spoken Language
Processing, Denver, Colorado, USA, 2002.
Linear transforms have been used extensively for training and
adaptation of HMM-based ASR systems. Recently procedures have been developed
for the estimation of linear transforms under the Maximum Mutual Information
(MMI) criterion. In this paper we introduce discriminative training
procedures that employ linear transforms for feature normalization and for
speaker adaptive training. We integrate these discriminative linear
transforms into MMI estimation of HMM parameters for improvement of large
vocabulary conversational speech recognition systems.
- F. Zheng, Z. Song, P. Fung, and
W. Byrne.
Mandarin pronunciation modeling based on the
CASS corpus.
Journal of Computer Science and Technology (Science Press, Beijing,
China), 17(3), May 2002.
16 pages.
Pronunciation variability is an important issue that must be
faced when developing practical automatic spontaneous speech recognition
systems. In this paper, the factors that may affect the recognition
performance are analyzed, including those specific to the Chinese language.
By studying the INITIAL/FINAL (IF) characteristics of Chinese language and
developing the Bayesian equation, we propose the concepts of generalized
INITIAL/FINAL (GIF) and generalized syllable (GS), the GIF modeling and the
IF-GIF modeling, as well as the context-dependent pronunciation weighting,
based on a well phonetically transcribed seed database. By using these
methods, the Chinese syllable error rate (SER) was reduced by 6.3% and 4.2%
compared with the GIF modeling and IF modeling respectively when the language
model, such as syllable or word N-gram, is not used. The effectiveness of
these methods is also proved when more data without the phonetic
transcription is used to refine the acoustic model using the proposed
iterative force-alignment based transcribing (IFABT) method, achieving a
5.7% SER reduction.
2001
- W. Byrne.
Minimum Bayes-Risk Automatic Speech Recognition.
University of Colorado, Boulder, CO, USA, November 2001.
- W. Byrne.
Minimum Bayes-Risk Automatic Speech Recognition.
Signal, Speech and Language Interpretation Lab, University of Washington,
Seattle, WA, USA, June 2001.
- F. Zheng, Z. Song,
P. Fung, and W. Byrne.
Modeling pronunciation variation using context-dependent weighting and B/S
refined acoustic modeling.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), 2001.
Pronunciation variability is an important issue that must be faced
when developing practical automatic spontaneous speech recognition
systems. By studying the initial/final (IF) characteristics of Chinese
language and developing the Bayesian equation, we propose the concepts of
generalized initial/final (GIF) and generalized syllable (GS), the GIF
modeling method and the IF-GIF modeling method, as well as the
context-dependent pronunciation weighting method. By using these approaches,
the IF-GIF modeling reduces the Chinese syllable error rate (SER) by 6.3% and
4.2% compared with the GIF modeling and IF modeling respectively when
the language modeling, such as syllable or word N-gram, is not
used.
- P. Ircing, P. Krebc, J. Hajic,
S. Khudanpur, F. Jelinek, J. Psutka, and W. Byrne.
On large vocabulary continuous speech recognition of highly inflectional
language - Czech.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), 2001.
- V. Goel, S. Kumar, and
W. Byrne.
Confidence based lattice segmentation and
minimum Bayes-risk decoding.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), volume 4, pages 2569–2572, Aalborg, Denmark,
2001.
Minimum Bayes Risk (MBR) speech recognizers have been shown to
yield improvements over the conventional maximum a-posteriori probability
(MAP) decoders in the context of N-best list rescoring and search over
recognition lattices. Segmental MBR (SMBR) procedures have been developed to
simplify implementation of MBR recognizers, by segmenting the N-best list or
lattice, to reduce the size of the search space over which MBR recognition is
carried out. In this paper we describe lattice cutting as a method to segment
recognition word lattices into regions of low confidence and high confidence.
We present two SMBR decoding procedures that can be applied on low confidence
segment sets. Results obtained on the Switchboard conversational telephone
speech corpus show modest but significant improvements relative to MAP
decoders.
- A. Gunawardana
and W. Byrne.
Discriminative speaker adaptation
with conditional maximum likelihood linear regression.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), 2001.
We present a simplified derivation of the extended Baum-Welch
procedure, which shows that it can be used for Maximum Mutual Information
(MMI) of a large class of continuous emission density hidden Markov models
(HMMs). We use the extended Baum-Welch procedure for discriminative
estimation of MLLR-type speaker adaptation transformations. The resulting
adaptation procedure, termed Conditional Maximum Likelihood Linear Regression
(CMLLR), is used successfully for supervised and unsupervised adaptation
tasks on the Switchboard corpus, yielding an improvement over MLLR. The
interaction of unsupervised CMLLR with segmental minimum Bayes risk lattice
voting procedures is also explored, showing that the two procedures are
complementary.
- W. Byrne, V. Venkataramani,
T. Kamm, T.F. Zheng, Z. Song, P. Fung, Y. Liu, and U. Ruhi.
Automatic generation of pronunciation
lexicons for Mandarin casual speech.
In IEEE Conference on Acoustics, Speech and Signal Processing,
volume 1, pages 569–572, Salt Lake City, Utah, 2001. IEEE.
Pronunciation modeling for large vocabulary speech recognition
attempts to improve recognition accuracy by identifying and modeling
pronunciations that are not in the ASR system's pronunciation lexicon.
Pronunciation variability in spontaneous Mandarin is studied using the newly
created CASS corpus of phonetically annotated spontaneous speech.
Pronunciation modeling techniques developed in English are applied to this
corpus to train pronunciation models which are then applied in Mandarin
Broadcast News transcription.
- V. Venkataramani and W. Byrne.
MLLR adaptation techniques for
pronunciation modeling.
In IEEE Workshop on Automatic Speech Recognition and
Understanding, Madonna di Campiglio, Italy, 2001.
Multiple regression class MLLR transforms are investigated for use
with pronunciation models that predict variation in the observed
pronunciations given the phonetic context. Regression classes can be
constructed so that MLLR transforms can be estimated and used to model
specific acoustic changes associated with pronunciation variation. The
effectiveness of this modeling approach is evaluated on the phonetically
transcribed portion of the SWITCHBOARD conversational speech
corpus.
- A. Gunawardana and W. Byrne.
Convergence of DLLR rapid speaker adaptation
algorithms.
In ISCA ITR-Workshop on Adaptation Methods for Automatic Speech
Recognition, 2001.
Discounted Likelihood Linear Regression (DLLR) is a speaker
adaptation technique for cases where there is insufficient data for MLLR
adaptation. Here, we provide an alternative derivation of DLLR by using a
censored EM formulation which postulates additional adaptation data which is
hidden. This derivation shows that DLLR, if allowed to converge, provides
maximum likelihood solutions. Thus the robustness of DLLR to small amounts of
data is obtained by slowing down the convergence of the algorithm and by
allowing termination of the algorithm before overtraining occurs. We then
show that discounting the observed adaptation data by postulating additional
hidden data can also be extended to MAP estimation of MLLR-type adaptation
transformations.
- A. Gunawardana and
W. Byrne.
Discounted likelihood linear regression for rapid speaker adaptation.
Computer Speech and Language, 15(1):15–38, Jan 2001.
The widely used maximum likelihood linear regression speaker
adaptation procedure suffers from overtraining when used for rapid adaptation
tasks in which the amount of adaptation data is severely limited. This is a
well known difficulty associated with the expectation maximization algorithm.
We use an information geometric analysis of the expectation maximization
algorithm as an alternating minimization of a Kullback-Leibler-type
divergence to see the cause of this difficulty, and propose a more robust
discounted likelihood estimation procedure. This gives rise to a discounted
likelihood linear regression procedure, which is a variant of maximum
likelihood linear regression suited for small adaptation sets. Our procedure
is evaluated on an unsupervised rapid adaptation task defined on the
Switchboard conversational telephone speech corpus, where our proposed
procedure improves word error rate by 1.6% (absolute) with as little as five
seconds of adaptation data, which is a situation in which maximum likelihood
linear regression overtrains in the first iteration of adaptation. We compare
several realizations of discounted likelihood linear regression with maximum
likelihood linear regression and other simple maximum likelihood linear
regression variants, and discuss issues that arise in implementing our
discounted likelihood procedures.
2000
- W. Byrne.
Discounted likelihood linear regression for rapid speaker adaptation.
Tsinghua University, Beijing, China, October 2000.
- LI A., ZHENG F., W. Byrne, P. Fung,
T. Kamm, LIU Yi, SONG Z., U. Ruhi, V. Venkataramani, and CHEN X.
CASS: A phonetically transcribed corpus of
Mandarin spontaneous speech.
In Proc. of the International Conference on Spoken Language
Processing, 2000.
A collection of Chinese spoken language has been collected and
phonetically annotated to capture spontaneous speech and language effects.
The Chinese Annotated Spontaneous Speech (CASS) corpus contains phonetically
transcribed spontaneous speech. This corpus was created to begin to collect
samples of most of the phonetic variations in Mandarin spontaneous speech due
to pronunciation effects, including allophonic changes, phoneme reduction,
phoneme deletion and insertion, as well as duration changes. It is intended
for use in pronunciation modeling for improved automatic speech recognition
and will be used at the 2000 Johns Hopkins University Language Engineering
Workshop by the project on Pronunciation Modeling of Mandarin Casual
Speech.
- W. Byrne, P. Beyerlein, J. Huerta,
S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and
W. Wang.
Towards language independent acoustic
modeling.
In IEEE Conference on Acoustics, Speech and Signal Processing,
pages 1029–1032, Istanbul, Turkey, 2000. IEEE.
We describe procedures and experimental results using speech from
diverse source languages to build an ASR system for a single target language.
This work is intended to improve ASR in languages for which large amounts of
training data are not available. We have developed both knowledge-based and
automatic methods to map phonetic units from the source languages to the
target language. We employed HMM adaptation techniques and Discriminative
Model Combination to combine acoustic models from the individual source
languages for recognition of speech in the target language. Experiments are
described in which Czech Broadcast News is transcribed using acoustic models
trained from small amounts of Czech read speech augmented by English,
Spanish, Russian, and Mandarin acoustic models.
- V. Goel, S. Kumar, and W. Byrne.
Segmental minimum Bayes-risk ASR voting
strategies.
In Proc. of the International Conference on Spoken Language
Processing, volume 3, pages 139–142, Beijing, China, 2000.
ROVER and its successor voting procedures have been shown to be
quite effective in reducing the recognition word error rate (WER). The
success of these methods has been attributed to their minimum Bayes-risk
(MBR) nature: they produce the hypothesis with the least expected word error.
In this paper we develop a general procedure within the MBR framework, called
segmental MBR recognition, that encompasses current voting techniques and
allows further extensions that yield lower expected WER. It also allows
incorporation of loss functions other than the WER. We present a derivation
of voting procedure of N-best ROVER as an instance of segmental MBR
recognition. We then present an extension, called e-ROVER, that alleviates
some of the restrictions of N-best ROVER by better approximating the WER.
e-ROVER is compared with N-best ROVER on a multi-lingual acoustic modeling task
and is shown to yield modest yet significant and easily obtained
improvements.
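As a rough illustration of the voting idea summarised in the abstract above, the following Python sketch picks, in each aligned segment, the word with the largest total posterior mass. Under a 0/1 loss within a segment this is the minimum Bayes-risk choice. It is not the authors' N-best ROVER or e-ROVER implementation: the function name `segmental_vote`, the assumption that hypotheses arrive already aligned into equal-length segment sequences, and the use of an empty string as a null word are all choices made for this example only.

```python
from collections import defaultdict
from typing import List


def segmental_vote(aligned_hyps: List[List[str]], posteriors: List[float]) -> List[str]:
    """Vote segment by segment over pre-aligned hypotheses.

    aligned_hyps: hypotheses of equal length; "" marks a null (deleted) word.
    posteriors:   one posterior weight per hypothesis (hypothetical values).
    """
    n_segments = len(aligned_hyps[0])
    if any(len(h) != n_segments for h in aligned_hyps):
        raise ValueError("hypotheses must be aligned to the same number of segments")

    output = []
    for s in range(n_segments):
        # Accumulate posterior mass for each candidate word in this segment.
        mass = defaultdict(float)
        for hyp, p in zip(aligned_hyps, posteriors):
            mass[hyp[s]] += p
        # Under 0/1 loss the minimum-risk word is the one with most mass.
        best = max(mass, key=mass.get)
        if best:  # drop null words from the final string
            output.append(best)
    return output


# Toy example with made-up posteriors.
hyps = [
    ["the", "cat", "sad"],
    ["a", "cat", "sat"],
    ["the", "hat", "sat"],
]
post = [0.4, 0.35, 0.25]
voted = segmental_vote(hyps, post)
print(" ".join(voted))
```

In this toy example the voted string is not any single entry of the list, which is the sense in which segment-level voting can move beyond picking the top-scoring hypothesis.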
- A. Gunawardana and
W. Byrne.
Robust estimation for rapid adaptation using
discounted likelihood techniques.
In International Conference on Acoustics, Speech, and Signal
Processing. IEEE, 2000.
The discounted likelihood procedure, which is a robust extension of
the usual EM procedure, is presented, and two approximations which lead to
two different variants of the usual MLLR adaptation scheme are introduced.
These schemes are shown to robustly estimate speaker adaptation transforms
with very little data. The evaluation is carried out on the Switchboard
corpus.
- D. Vergyri, S. Tsakalidis,
and W. Byrne.
Minimum risk acoustic clustering for
multilingual acoustic model combination.
In International Conference on Spoken Language Processing, 2000.
In this paper we describe procedures for combining multiple
acoustic models, obtained using training corpora from different languages, in
order to improve ASR performance in languages for which large amounts of
training data are not available. We treat these models as multiple sources of
information whose scores are combined in a log-linear model to compute the
hypothesis likelihood. The model combination can either be performed in a
static way, with constant combination weights, or in a dynamic way, with
parameters that can vary for different segments of a hypothesis. The aim is
to optimize the parameters so as to achieve minimum word error rate. In order
to achieve robust parameter estimation in the dynamic combination case, the
parameters are defined to be piecewise constant on different phonetic classes
that form a partition of the space of hypothesis segments. The partition is
defined, using phonological knowledge, on segments that correspond to
hypothesized phones. We examine different ways to define such a partition,
including an automatic approach that gives a binary tree structured partition
which tries to achieve the minimum WER with the minimum number of
classes.
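The static case described in the abstract above, where scores from several acoustic models are combined log-linearly with constant weights, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's system: the model names, log scores, and weights are invented, and the weights are fixed by hand rather than optimised for minimum word error rate.

```python
from typing import Dict, List


def combine_log_linear(
    model_log_scores: Dict[str, List[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Return one combined score per hypothesis.

    model_log_scores[m][k] is the log score model m assigns to hypothesis k.
    The combined score is sum_m weights[m] * model_log_scores[m][k].
    """
    n_hyps = len(next(iter(model_log_scores.values())))
    return [
        sum(weights[m] * scores[k] for m, scores in model_log_scores.items())
        for k in range(n_hyps)
    ]


def best_hypothesis(scores: List[float]) -> int:
    """Index of the highest-scoring hypothesis."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical log scores for three hypotheses from two source-language models.
log_scores = {
    "czech": [-10.0, -12.0, -11.0],
    "english": [-20.0, -14.0, -19.0],
}

combined = combine_log_linear(log_scores, {"czech": 1.0, "english": 0.5})
czech_only = combine_log_linear(log_scores, {"czech": 1.0, "english": 0.0})
print(best_hypothesis(combined), best_hypothesis(czech_only))
```

The dynamic case in the abstract would replace the single weight per model with weights that depend on the phonetic class of each hypothesis segment; that extension is not shown here.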
- J. McDonough and W. Byrne.
On the incremental addition of regression classes for speaker adaptation.
In IEEE Conference on Acoustics, Speech and Signal Processing.
IEEE, 2000.
- William J. Byrne, Jan Hajic, Pavel Krbec,
Pavel Ircing, and Josef Psutka.
Morpheme based language models for speech recognition of Czech.
In TDS '00: Proceedings of the Third International Workshop on Text,
Speech and Dialogue, pages 211–216, London, UK, 2000.
Springer-Verlag.
- V. Goel and W. Byrne.
Minimum Bayes-Risk automatic speech recognition.
Computer Speech and Language, 14(2):115–135, 2000.
In this paper we address the problem of efficient implementation of
the minimum Bayes-risk classifiers for automatic speech recognition.
Simplifying assumptions that allow computationally feasible approximations to
these classifiers are proposed. Under these assumptions an approximate
implementation as an A-star search algorithm over recognition lattice is
constructed. This algorithm improves upon the previously proposed N-best
list rescoring implementation of these classifiers. The minimum Bayes-risk
classifiers are shown to outperform the most commonly used maximum
a-posteriori probability (MAP) classifier on three speech recognition tasks:
reduction of word error rate, reduction of content word error rate, and
identification of Named Entities in speech. The A-star implementation is also
contrasted with the N-best list rescoring implementation and is found to
obtain modest but significant improvements in accuracy with little
computational overhead.
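For readers unfamiliar with the N-best list rescoring baseline mentioned in the abstract above, the following Python sketch shows the basic minimum Bayes-risk decision rule with word edit distance as the loss. It is a simplified illustration, not the authors' implementation, and it does not cover the A-star lattice search. The function names and the toy posteriors are assumptions introduced for this example.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def map_decode(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Pick the hypothesis with the highest posterior."""
    return max(nbest, key=lambda item: item[1])[0]


def mbr_decode(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Pick the hypothesis with the lowest expected edit distance,
    where the expectation is taken over the (renormalised) N-best list."""
    total = sum(p for _, p in nbest)
    best_hyp, best_risk = nbest[0][0], float("inf")
    for hyp, _ in nbest:
        risk = sum((p / total) * edit_distance(ref, hyp) for ref, p in nbest)
        if risk < best_risk:
            best_hyp, best_risk = hyp, risk
    return best_hyp


# Toy N-best list with made-up posteriors.
nbest = [
    ("a b c".split(), 0.40),
    ("a x d".split(), 0.35),
    ("a x e".split(), 0.25),
]
print(map_decode(nbest), mbr_decode(nbest))
```

In this toy list the MAP rule selects the first hypothesis, while the MBR rule selects the second because it is closer on average to the other probable hypotheses.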
- W. Byrne and A. Gunawardana.
Comments on 'Efficient training algorithms for HMM's using incremental
estimation'.
IEEE Transactions on Speech and Audio Processing, 8(6):751–754,
Nov 2000.
``Efficient Training Algorithms for HMM's using Incremental
Estimation'' investigates EM procedures that increase training speed. The
authors' claim that these are GEM procedures is incorrect. We discuss why
this is so, provide an example of non-monotonic convergence to a local
maximum in likelihood, and outline conditions that guarantee such
convergence.
1999
- V. Digalakis, S. Berkowitz,
E. Bochieri, C. Boulis, W. Byrne, H. Collier, A. Corduneanu, A. Kannan,
S. Khudanpur, J. McDonough, and A. Sankar.
Rapid speech recognizer adaptation to new
speakers.
In IEEE Conference on Acoustics, Speech and Signal Processing.
IEEE, 1999.
This paper summarizes the work of the ``Rapid Speech Recognizer
Adaptation'' team in the workshop held at Johns Hopkins University in the
summer of 1998. The project addressed the modeling of dependencies between
units of speech with the goal of making more effective use of small amounts
of data for speaker adaptation. A variety of methods were investigated and
their effectiveness in a rapid adaptation task defined on the SWITCHBOARD
conversational speech corpus is reported.
- W. Byrne, P. Beyerlein, J. Huerta,
S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and
W. Wang.
Towards language independent acoustic
modeling.
In IEEE Workshop on Automatic Speech Recognition and
Understanding, Keystone, Colorado, 1999.
We describe procedures and experimental results using speech from
diverse source languages to build an ASR system for a single target language.
This work is intended to improve ASR in languages for which large amounts of
training data are not available. We have developed both knowledge based and
automatic methods to map phonetic units from the source languages to the
target language. We employed HMM adaptation techniques and Discriminative
Model Combination to combine acoustic models from the individual source
languages for recognition of speech in the target language. Experiments are
described in which Czech Broadcast News is transcribed using acoustic models
trained from small amounts of Czech read speech augmented by English,
Spanish, Russian, and Mandarin acoustic models.
- W. Byrne and A. Gunawardana.
Convergence of EM variants.
In IEEE Information Theory Workshop on Detection, Estimation,
Classification, and Imaging, page 64, 1999.
- W. Byrne and A. Gunawardana.
Discounted likelihood linear regression
for rapid adaptation.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), 1999.
Rapid adaptation schemes that employ the EM algorithm may suffer
from overtraining problems when used with small amounts of adaptation data.
An algorithm to alleviate this problem is derived within the information
geometric framework of Csiszár and Tusnády, and is used to improve MLLR
adaptation on NAB and Switchboard adaptation tasks. It is shown how this
algorithm approximately optimizes a discounted likelihood
criterion.
- W. Byrne, J. Hajic, P. Ircing, F. Jelinek,
S. Khudanpur, J. McDonough, N. Peterek, and J. Psutka.
Large vocabulary speech recognition for read
and broadcast Czech.
In Proceedings of the Text, Speech, and Dialog Workshop, 1999.
We describe read speech and broadcast news corpora collected as
part of a multi-year international collaboration for the development of large
vocabulary speech recognition systems in the Czech language. Initial
investigations into language modeling for Czech automatic speech recognition
are described and preliminary recognition results on the read speech corpus
are presented.
- J. McDonough and W. Byrne.
Single-pass adapted training with
all-pass transforms.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), 1999.
In recent work, the all-pass transform (APT) was proposed as the
basis of a speaker adaptation scheme intended for use with a large vocabulary
speech recognition system. It was shown that APT-based adaptation reduces to
a linear transformation of cepstral means, much like the better known maximum
likelihood linear regression (MLLR), but is specified by far fewer free
parameters. Due to its linearity, APT-based adaptation can be used in
conjunction with speaker-adapted training (SAT), an algorithm for performing
maximum likelihood estimation of the parameters of an HMM when speaker
adaptation is to be employed during both training and test. In this work, we
propose a refinement of SAT called single-pass adapted training (SPAT) which
achieves the same improvement in system performance as SAT but requires much
less computation for HMM training. In a set of speech recognition experiments
conducted on the Switchboard Corpus, we report a word error rate reduction of
5.3% absolute using a single, global APT.
- V. Goel and W. Byrne.
Task dependent loss functions in speech
recognition: A-star search over recognition lattices.
In Proc. of the European Conference on Speech Communication and
Technology (EUROSPEECH), 1999.
A recognition strategy that can be matched to specific system
performance criteria has recently been found to yield improvements over the
usual maximum a posteriori probability strategy. Some examples of different
system performance criteria are word error rate (WER), F-measure for Named
Entity extraction tasks, and word-specific errors for keyword spotting tasks.
In the matched-to-the-task strategy the hypothesis is chosen to minimize the
expected loss or the Bayes Risk under a loss function defined by the
performance measure of interest. Due to the prohibitively expensive
implementation of this strategy, only an approximate implementation as an
N-best list rescoring scheme has been used so far. Our goal is to improve the
performance of such risk-based decoders by developing search strategies that
can incorporate more acoustic evidence. In this paper we present search
algorithms to implement the risk-based recognition strategy over word
lattices that contain acoustic and language model scores. These algorithms
are extensions of the N-best list rescoring approximation and are formulated
as A-star algorithms. We first present a single stack A-star search and show
how to obtain an under-estimate and an over-estimate of the cost needed for
the search. For loss functions that do not depend on time segmentation of
hypotheses, a prefix-tree based simplification of the single stack algorithm
is then derived. For yet a further subset of loss functions, including the
usual Levenshtein distance based loss for WER reduction tasks, we describe a
search organization that facilitates further efficiencies in computation and
storage. Finally we present a path equivalence criterion for merging of
prefix tree nodes during search to allow for a larger search space. We find
that restricted loss functions yield the most efficient search procedures.
However the general single stack search can be applied quite broadly even in
principle to loss functions that measure semantic agreement between
sentences. Preliminary experiments were performed for a WER reduction task on
the Switchboard corpus, using the dev-test set of the 1997 JHU-LVCSR workshop. We
obtain an error rate reduction of 0.8-0.9% absolute over a baseline of
38.5% WER. The search speed is comparable to the N-best list rescoring
procedure, which is much more restrictive in the number of hypotheses
considered for search and produces slightly inferior results (0.5-0.6%
absolute improvement). At the conference we will present the framework of
the task dependent recognition strategy, its implementation as A-star search,
and the speed and accuracy comparison of the search with the N-best list
rescoring procedure.
- V. Goel and W. Byrne.
Task dependent loss functions in speech recognition: Application to named
entity extraction.
In ESCA-ETR Workshop on accessing information in spoken audio,
1999.
- J. McDonough and W. Byrne.
Speaker adaptation with all-pass
transforms.
In International Conference on Acoustics, Speech, and Signal
Processing. IEEE, 1999.
In recent work, a class of transforms was proposed which achieves a
remapping of the frequency axis much like conventional vocal tract length
normalization. These mappings, known collectively as all-pass
transforms (APT), were shown to produce substantial improvements in the
performance of a large vocabulary speech recognition system when used to
normalize incoming speech prior to recognition. In this application, the most
advantageous characteristic of the APT was its cepstral-domain linearity;
this linearity makes speaker normalization simple to implement, and provides
for the robust estimation of the parameters characterizing individual
speakers. In the current work, we exploit the APT to develop a speaker
adaptation scheme in which the cepstral means of a speech recognition model
are transformed to better match the speech of a given speaker. In a set of
speech recognition experiments conducted on the Switchboard Corpus, we report
reductions in word error rate of 3.7% absolute.
- M. Riley, W. Byrne, M. Finke,
S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters, and
G. Zavaliagkos.
Stochastic pronunciation modeling from hand-labelled phonetic corpora.
Speech Communication, pages 109–116, November 1999.
In the early '90s, the availability of the TIMIT read-speech
phonetically transcribed corpus led to work at AT&T on the automatic
inference of pronunciation variation. This work, briefly summarized here,
used stochastic decision trees trained on phonetic and linguistic features,
and was applied to the DARPA North American Business News read-speech ASR
task. More recently, the ICSI spontaneous-speech phonetically transcribed
corpus was collected at the behest of the 1996 and 1997 LVCSR Summer
Workshops held at Johns Hopkins University. A 1997 workshop (WS97) group
focused on pronunciation inference from this corpus for application to the
DoD Switchboard spontaneous telephone speech ASR task. We describe several
approaches taken there. These include (1) one analogous to the AT&T
approach, (2) one, inspired by work at WS96 and CMU, that involved adding
pronunciation variants of a sequence of one or more words (`multiwords') in
the corpus (with corpus-derived probabilities) into the ASR lexicon, and
(1+2) a hybrid approach in which a decision-tree model was used to
automatically phonetically transcribe a much larger speech corpus than ICSI
and then the multiword approach was used to construct an ASR recognition
pronunciation lexicon.
1998
- W. Byrne, M. Finke, S. Khudanpur,
J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and
G. Zavaliagkos.
Pronunciation modelling using a hand-labelled
corpus for conversational speech recognition.
In IEEE International Conference on Acoustics, Speech and Signal
Processing. IEEE, 1998.
Accurately modelling pronunciation variability in conversational
speech is an important component of an automatic speech recognition system.
We describe some of the projects undertaken in this direction during and
after WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins
University, Baltimore, in July-August, 1997. We first illustrate a use of
hand-labelled phonetic transcriptions of a portion of the Switchboard corpus,
in conjunction with statistical techniques, to learn alternatives to
canonical pronunciations of words. We then describe the use of these
alternate pronunciations in an automatic speech recognition system. We
demonstrate that the improvement in recognition performance from
pronunciation modelling persists as the system is enhanced with better
acoustic and language models.
- W. Byrne, M. Finke, S. Khudanpur,
A. Ljolje, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and
G. Zavaliagkos.
Stochastic pronunciation modeling from
hand-labeled phonetic corpora.
In Proceedings of the Workshop on Modeling Pronunciation Variation for
Automatic Speech Recognition, 1998.
- J. McDonough, W. Byrne, and
X. Luo.
Speaker normalization with all-pass
transforms.
In International Conference on Spoken Language Processing, 1998.
Speaker normalization is a process in which the short-time features
of speech from a given speaker are transformed so as to better match some
speaker independent model. Vocal tract length normalization (VTLN) is a
popular speaker normalization scheme wherein the frequency axis of the
short-time spectrum associated with a speaker's speech is rescaled or warped
prior to the extraction of cepstral features. In this work, we develop a
novel speaker normalization scheme by exploiting the fact that frequency
domain transformations similar to that inherent in VTLN can be accomplished
entirely in the cepstral domain through the use of conformal maps. We propose
a class of such maps, designated all-pass transforms for reasons given
hereafter, and in a set of speech recognition experiments conducted on the
Switchboard Corpus demonstrate their capacity to achieve word error rate
reductions of 3.7% absolute.
- V. Goel, W. Byrne, and
S. Khudanpur.
LVCSR rescoring with modified loss
functions: a decision theoretic perspective.
In International Conference on Acoustics, Speech, and Signal
Processing. IEEE, 1998.
In this work, the problem of speech decoding is viewed in a
Decision Theoretic framework. A modified speech decoding procedure to
minimize the expected word error rate is formulated in this framework, and
its implementation in N-best list rescoring is presented. Preliminary
experiments on the Switchboard corpus show small but statistically significant
error rate improvements.
1997
- W. Byrne, M. Finke, S. Khudanpur,
J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and
G. Zavaliagkos.
Pronunciation modelling for conversational
speech recognition: A status report from WS97.
In IEEE Automatic Speech Recognition and Understanding Workshop,
1997.
Accurately modelling pronunciation variability in conversational
speech is an important component for automatic speech recognition. We
describe some of the projects undertaken in this direction at WS97, the Fifth
LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in
July-August, 1997. We first illustrate a use of hand-labelled phonetic
transcriptions of a portion of the Switchboard corpus, in conjunction with
statistical techniques, to learn alternatives to canonical pronunciations of
words. We then describe the use of these alternate pronunciations in a
recognition experiment as well as in the acoustic training of an automatic
speech recognition system. Our results show a reduction of word error rate in
both cases, and of 2.2% with acoustic retraining.
- W. Byrne, S. Khudanpur,
E. Knodt, and J. Bernstein.
Is automatic speech recognition ready for
non-native speech? A data collection effort and initial experiments in
modeling conversational Hispanic English.
In ESCA-ITR Workshop on speech technology in language learning,
1997.
We describe the protocol used for collecting a corpus of
conversational English speech from non-native speakers at several levels of
proficiency, and report the results of preliminary automatic speech
recognition (ASR) experiments on this corpus using HTK-based ASR systems. The
speech corpus contains both read and conversational speech recorded
simultaneously on wide-band and telephone channels, and has detailed time
aligned transcriptions. The immediate goal of the ASR experiments is to
assess the difficulty of the ASR problem in language learning exercises and
thus to gauge how current ASR technology may be used in conversational
computer assisted language learning (CALL) systems. The long-term goal of
this research, of which the data collection and experiments are a first step,
is to incorporate ASR into computer-based conversational language instruction
systems.
- W. Byrne and S. Shamma.
Neurocontrol in sequence
recognition.
In O. Omidvar and D. Elliott, editors, Progress in Neural Networks:
Neural Networks for Control, pages 31–56. Academic Press, 1997.
An artificial neural network intended for sequence modeling and
recognition is described. The network is based on a lateral inhibitory
network with controlled, oscillatory behavior so that it naturally models
sequence generation. Dynamic programming algorithms can be used to transform
the network into a sequence recognizer. Markov decision theory is used to
develop novel and more ``neural'' recognition control strategies as
alternatives to dynamic programming.
1996
- M. Ostendorf, W. Byrne, M. Bacchiani,
M. Finke, A. Gunawardana, K. Ross, S. Roweis, E. Shriberg, D. Talkin,
A. Waibel, B. Wheatley, and T. Zeppenfeld.
Modeling systematic variations in pronunciation via a language-dependent hidden
speaking mode.
In Proceedings of the International Conference on Spoken Language
Processing, 1996.
- W. Byrne.
Information geometry and maximum likelihood
criteria.
In Conference on Information Sciences and Systems, Princeton, NJ,
1996.
This paper presents a brief comparison of two information
geometries as they are used to describe the EM algorithm used in maximum
likelihood estimation from incomplete data. The Alternating Minimization
framework based on the I-Geometry developed by Csiszár is presented first,
followed by the em-algorithm of Amari. Following a comparison of these
algorithms, a discussion of a variation in likelihood criterion is presented.
The EM algorithm is usually formulated so as to improve the marginal
likelihood criterion. Closely related algorithms also exist which are
intended to maximize different likelihood criteria. The 1-Best criterion, for
example, leads to the Viterbi training algorithm used in Hidden Markov
Modeling. This criterion has an information geometric description that
results from a minor modification of the marginal likelihood
formulation.
1994
- S. Young, P. Woodland, and W. Byrne.
Spontaneous speech recognition for the credit card corpus using the HTK
toolkit.
IEEE Transactions on Speech and Audio Processing, pages
615–621, 1994.
This paper describes the speech recognition system which was
provided as a baseline for the Summer Workshop on Robust Speech Processing
held at the Rutgers CAIP Center in July/August 1993.
1993
- W. Byrne.
Generalization and maximum likelihood from small
data sets.
In IEEE-SP Workshop on Neural Networks in Signal Processing,
1993.
An often encountered learning problem is maximum likelihood
training of exponential models. When the state is only partially specified by
the training data, iterative training algorithms are used to produce a
sequence of models that assign increasing likelihood to the training data.
Although the performance as measured on the training set continues to improve
as the algorithms progress, performance on related data sets may eventually
begin to deteriorate. The cause of this behavior can be seen when the
training problem is stated in the Alternating Minimization framework. A
modified maximum likelihood training criterion is suggested to counter this
behavior. It leads to a simple modification of the learning algorithms which
relates generalization to learning speed. Training Boltzmann Machines and
Hidden Markov Models is discussed under this modified
criterion.
- K. Wang, S. Shamma, and W. Byrne.
Noise robustness in the auditory representation of speech signals.
In International Conference on Acoustics, Speech, and Signal
Processing. IEEE, 1993.
1992
- W. Byrne.
Alternating Minimization and Boltzmann Machine learning.
IEEE Transactions on Neural Networks, 3(4):612–620, 1992.
Training a Boltzmann machine with hidden units is appropriately
treated in information geometry using the information divergence and the
technique of alternating minimization. The resulting algorithm is shown to be
closely related to gradient descent Boltzmann machine learning rules, and the
close relationship of both to the EM algorithm is described. An iterative
proportional fitting procedure is described and incorporated into the
alternating minimization algorithm.
1989
- W. Byrne, J. Robinson, and
S. Shamma.
The auditory processing and recognition of speech.
In Proceedings of the Speech and Natural Language Workshop, pages
325–331, October 1989.
1986
- W. Byrne,
R. Zapp, P. Flynn, and M. Siegel.
Adaptive filter processing in remote heart monitors.
IEEE Transactions on Biomedical Engineering, pages 717–722,
1986.
1985
- W. Byrne,
R. Zapp, P. Flynn, and M. Siegel.
Adaptive filtering in microwave remote heart monitors.
In IEEE Engineering in Medicine and Biology Society, Seventh Annual
Conference, 1985.