Bill Byrne - Publications and Presentations

2014    Pushdown automata in statistical machine translation. C. Allauzen, W. Byrne, A. de Gispert, G. Iglesias, and M. Riley. Computational Linguistics, 2014. Accepted. To appear. Paper [PDF].

This paper describes the use of pushdown automata (PDA) in the context of statistical machine translation and alignment under a synchronous context-free grammar. We use PDAs to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence. General-purpose PDA algorithms for replacement, composition, shortest path, and expansion are presented. We describe HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms. We contrast the complexity of this decoder with a decoder based on a finite-state automaton (FSA) representation, showing that PDAs provide a more suitable framework to achieve exact decoding for larger SCFGs and smaller language models. We assess this experimentally on a large-scale Chinese-to-English alignment and translation task. In translation, we propose a two-pass decoding strategy involving a weaker language model in the first pass to address the results of the PDA complexity analysis. We study in depth the experimental conditions and tradeoffs in which HiPDT can achieve state-of-the-art performance for large-scale SMT.

         Investigating automatic and human filled pause insertion for synthetic speech. Rasmus Dall, Marcus Tomalin, Mirjam Wester, William Byrne, and Simon King. In Proceedings of INTERSPEECH, September 2014.

Filled pauses are pervasive in conversational speech and have been shown to serve a range of psychological and structural purposes. Despite this, they are seldom modelled overtly by state-of-the-art speech synthesis systems. This paper seeks to motivate the incorporation of filled pauses into speech synthesis systems by exploring their use in conversational speech, and by comparing the performance of several automatic systems that insert filled pauses into fluent texts. Two initial experiments are described which seek to determine whether people's predictions about appropriate insertion points for filled pauses are consistent with actual practice and/or with each other. The experiments also investigate whether there are 'right' and 'wrong' places to insert filled pauses in a given sentence. The results summarised in this paper show good consistency between people's predictions of usage and their actual practice, as well as a perceptual preference for the 'right' placement. The third experiment contrasts the performance of several automatic systems that insert filled pauses into fluent sentences. The best performance (as determined by precision, recall and F-measure) was produced by interpolating a recurrent neural network and a 4-gram language model. The research presented in this paper offers new insights into the way in which filled pauses are used and perceived by humans, and how automatic systems can be used to predict the locations of filled pauses in fluent input text.

         Effective incorporation of source syntax into hierarchical phrase-based translation. Tong Xiao, Adrià de Gispert, Jingbo Zhu, and Bill Byrne. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 2064–2074, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics. http://www.aclweb.org/anthology/C14-1195 [Slides].

In this paper we explicitly consider source language syntactic information in both rule extraction and decoding for hierarchical phrase-based translation. We obtain tree-to-string rules by the GHKM method and use them to complement Hiero-style rules. All these rules are then employed to decode new sentences with source language parse trees. We experiment with our approach in a state-of-the-art Chinese-English system and demonstrate +1.2 and +0.8 BLEU improvements on the NIST newswire and web evaluation data of MT08 and MT12.

         Word ordering with phrase-based grammars. Adrià de Gispert, Marcus Tomalin, and Bill Byrne. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 259–268, Gothenburg, Sweden, April 2014. Association for Computational Linguistics. Paper [PDF].

We describe an approach to word ordering using modelling techniques from statistical machine translation. The system incorporates a phrase-based model of string generation that aims to take unordered bags of words and produce fluent, grammatical sentences. We describe the generation grammars and introduce parsing procedures that address the computational complexity of generation under permutation of phrases. Against the best previous results reported on this task, obtained using syntax-driven models, we report substantial quality improvements, with BLEU score gains of more than 20 points, which we confirm with human fluency judgements. Our system incorporates dependency language models, large n-gram language models, and minimum Bayes risk decoding.

         A graph-based approach to string regeneration. Matic Horvat and William Byrne. In Proceedings of the Student Research Workshop at the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 85–95, Gothenburg, Sweden, April 2014. Association for Computational Linguistics. Paper [PDF].

The string regeneration problem is the problem of generating a fluent sentence from a bag of words. We explore the N-gram language model approach to string regeneration. The approach computes the highest probability permutation of the input bag of words under an N-gram language model. We describe a graph-based approach for finding the optimal permutation. The evaluation of the approach on a number of datasets yielded promising results, which were confirmed by conducting a manual evaluation study.
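
The search the abstract describes — the highest-probability permutation of a bag of words under an N-gram language model — can be sketched as dynamic programming over subsets of the bag, where a state records which words have been placed and which word was placed last. This is a minimal bigram illustration, not the paper's method; the toy probabilities stand in for a real language model:

```python
from math import log

# Hypothetical bigram log-probabilities; a real system would query a trained LM.
BIGRAM_LOGP = {
    ("<s>", "the"): log(0.4), ("the", "cat"): log(0.3),
    ("cat", "sat"): log(0.2), ("sat", "</s>"): log(0.5),
}

def bigram_logp(prev, word):
    # Back off to a small floor probability for unseen bigrams.
    return BIGRAM_LOGP.get((prev, word), log(1e-4))

def best_permutation(bag):
    """Highest-probability ordering of `bag` under a bigram LM, via
    dynamic programming over subsets (Held-Karp style):
    state = (bitmask of words placed, index of last word placed)."""
    n = len(bag)
    # dp[(mask, last)] = (best log-prob, back-pointer to previous state)
    dp = {(1 << i, i): (bigram_logp("<s>", w), None) for i, w in enumerate(bag)}
    for mask in range(1, 1 << n):
        for last in range(n):
            if (mask, last) not in dp:
                continue
            score, _ = dp[(mask, last)]
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue  # word already placed
                cand = score + bigram_logp(bag[last], bag[nxt])
                key = (mask | (1 << nxt), nxt)
                if key not in dp or cand > dp[key][0]:
                    dp[key] = (cand, (mask, last))
    full = (1 << n) - 1
    # Close the sentence and pick the best final state.
    end, best = max(((last, dp[(full, last)][0] + bigram_logp(bag[last], "</s>"))
                     for last in range(n)), key=lambda t: t[1])
    # Reconstruct the word order by following back-pointers.
    order, state = [], (full, end)
    while state is not None:
        mask, last = state
        order.append(bag[last])
        state = dp[state][1]
    return list(reversed(order)), best

words, score = best_permutation(["sat", "the", "cat"])
print(words)
```

The subset DP is exponential in the bag size, which is why practical approaches prune or restructure the search space, as the paper's graph-based formulation does.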

         Source-side preordering for translation using logistic regression and depth-first branch-and-bound search. Laura Jehl, Adrià de Gispert, Mark Hopkins, and Bill Byrne. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 239–248, Gothenburg, Sweden, April 2014. Association for Computational Linguistics. Paper [PDF].

We present a simple preordering approach for machine translation based on a feature-rich logistic regression model to predict whether two children of the same node in the source-side parse tree should be swapped or not. Given the pair-wise children regression scores we conduct an efficient depth-first branch-and-bound search through the space of possible children permutations, avoiding using a cascade of classifiers or limiting the list of possible ordering outcomes. We report experiments in translating English to Japanese and Korean, demonstrating superior performance as (a) the number of crossing links drops by more than 10% absolute with respect to other state-of-the-art preordering approaches, (b) BLEU scores improve by 2.2 points over a baseline with a lexicalised reordering model, and (c) decoding can be carried out 80 times faster.
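
The search over sibling permutations can be sketched as below; the pairwise scores here are hypothetical stand-ins for the logistic-regression outputs, and the admissible bound is one simple choice, not necessarily the one used in the paper:

```python
from math import log

def best_order(children, pair_logp):
    """Depth-first branch-and-bound over permutations of sibling nodes.
    pair_logp[(a, b)] is the (hypothetical) log-probability, from a
    pairwise classifier, that a should precede b."""
    best = {"score": float("-inf"), "order": None}

    def bound(remaining):
        # Admissible bound: every undecided pair contributes its better direction.
        r = list(remaining)
        return sum(max(pair_logp[(a, b)], pair_logp[(b, a)])
                   for i, a in enumerate(r) for b in r[i + 1:])

    def dfs(prefix, remaining, score):
        if not remaining:
            if score > best["score"]:
                best["score"], best["order"] = score, prefix
            return
        if score + bound(remaining) <= best["score"]:
            return  # prune: even the optimistic bound cannot beat the incumbent
        for c in list(remaining):
            rest = remaining - {c}
            # Placing c next fixes its order relative to everything remaining.
            gain = sum(pair_logp[(c, o)] for o in rest)
            dfs(prefix + [c], rest, score + gain)

    dfs([], frozenset(children), 0.0)
    return best["order"]

# Toy example with three siblings; the scores favour the order B, A, C.
sibs = ["A", "B", "C"]
p = {("B", "A"): log(0.9), ("A", "B"): log(0.1),
     ("A", "C"): log(0.8), ("C", "A"): log(0.2),
     ("B", "C"): log(0.7), ("C", "B"): log(0.3)}
print(best_order(sibs, p))
```

Because the bound never underestimates the best completion, pruning is safe and the search remains exact while exploring far fewer than n! permutations in practice.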

2013    Pushdown automata in statistical machine translation, W. Byrne. International Conference on Finite-State Methods and Natural Language Processing, FSMNLP, 2013. Keynote lecture. Presentation [PDF]. http://fsmnlp2013.cs.st-andrews.ac.uk/abstracts.html#byrne.

This talk will present some recent work investigating pushdown automata (PDA) in the context of statistical machine translation and alignment under synchronous context-free grammars (SCFGs). PDAs can be used to compactly represent the space of candidate translations generated by the grammar when applied to an input sentence, and this presentation will give an overview of general-purpose PDA algorithms for replacement, composition, shortest path, and expansion. HiPDT, a hierarchical phrase-based decoder using the PDA representation and these algorithms, will be described, and the complexity of the HiPDT decoder operations will be compared to decoders based on finite state automata and the widely used hypergraph representations. PDAs have strengths in a particular translation scenario: exact decoding with large SCFGs and relatively smaller language models. This talk is based on recent work with Adrià de Gispert and Gonzalo Iglesias at the University of Cambridge, and Michael Riley and Cyril Allauzen at Google Research.

         The University of Cambridge Russian-English system at WMT13. Juan Pino, Aurelien Waite, Tong Xiao, Adrià de Gispert, Federico Flego, and William Byrne. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 200–205, Sofia, Bulgaria, August 2013. Association for Computational Linguistics. http://www.aclweb.org/anthology/W13-2225.

This paper describes the University of Cambridge submission to the Eighth Workshop on Statistical Machine Translation. We report results for the Russian-English translation task. We use multiple segmentations for the Russian input language. We employ the Hadoop framework to extract rules. The decoder is HiFST, a hierarchical phrase-based decoder implemented using weighted finite-state transducers. Lattices are rescored with a higher order language model and minimum Bayes-risk objective.

         Syntax-based statistical machine translation, and evaluation of machine translation systems, W. Byrne. Cognition Institute Summer School: Bilingual Minds, Bilingual Machines, June 2013. Three lecture short course. http://www.plymouth.ac.uk/pages/dynamic.asp?page=events&eventID=7506&showEvent=1

         Fast, low-artifact speech synthesis considering global variance. M. Shannon and W. Byrne. In Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing, June 2013. http://mi.eng.cam.ac.uk/~sms46/papers/shannon2013fast-submitted.pdf.

Speech parameter generation considering global variance (GV generation) is widely acknowledged to dramatically improve the quality of synthetic speech generated by HMM-based systems. However it is slower and has higher latency than the standard speech parameter generation algorithm. In addition it is known to produce artifacts, though existing approaches to prevent artifacts are effective. In this paper we present a simple new mathematical analysis of speech parameter generation considering global variance based on Lagrange multipliers. This analysis sheds light on one source of artifacts and suggests a way to reduce their occurrence. It also suggests an approximation to exact GV generation that allows fast, low latency synthesis. In a subjective evaluation the naturalness of our fast approximate algorithm is as good as conventional GV generation.

2012    N-gram posterior probability confidence measures for statistical machine translation: an empirical study. A. de Gispert, G. Blackwood, G. Iglesias, and W. Byrne. Machine Translation, pages 1–30 (31 pages), 2012. Published online 1 September 2012. http://dx.doi.org/10.1007/s10590-012-9132-2.

We report an empirical study of n-gram posterior probability confidence measures for statistical machine translation (SMT). We first describe an efficient and practical algorithm for rapidly computing n-gram posterior probabilities from large translation word lattices. These probabilities are shown to be a good predictor of whether or not the n-gram is found in human reference translations, motivating their use as a confidence measure for SMT. Comprehensive n-gram precision and word coverage measurements are presented for a variety of different language pairs, domains and conditions. We analyze the effect on reference precision of using single or multiple references, and compare the precision of posteriors computed from k-best lists to those computed over the full evidence space of the lattice. We also demonstrate improved confidence by combining multiple lattices in a multi-source translation framework.
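
As an illustration of the underlying quantity (not the paper's lattice algorithm), n-gram posteriors can be computed directly from a weighted k-best list; the abstract notes that the paper compares exactly these k-best posteriors against full-lattice ones. The hypotheses and scores below are toy values:

```python
from collections import defaultdict
from math import exp

def ngram_posteriors(kbest, n=2):
    """n-gram posterior probabilities from a weighted k-best list.
    Each entry is (hypothesis_words, log_probability)."""
    # Normalise hypothesis weights into a posterior distribution.
    total = sum(exp(lp) for _, lp in kbest)
    post = defaultdict(float)
    for words, lp in kbest:
        p = exp(lp) / total
        # An n-gram's posterior is the total probability mass of the
        # hypotheses containing it (counted once per hypothesis).
        seen = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
        for gram in seen:
            post[gram] += p
    return dict(post)

kbest = [("the cat sat".split(), -1.0),
         ("the cat slept".split(), -1.5),
         ("a cat sat".split(), -3.0)]
for gram, p in sorted(ngram_posteriors(kbest).items(), key=lambda kv: -kv[1]):
    print(gram, round(p, 3))
```

The k-best list only approximates the evidence space; computing the same sums over the full lattice, as the paper does, requires counting each n-gram once per path without enumerating the paths.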

         Simple and efficient model filtering in statistical machine translation. J. Pino, A. Waite, and W. Byrne. The Prague Bulletin of Mathematical Linguistics, (98):5–24 (20 pages), 2012. Published online 6 September 2012. http://ufal.mff.cuni.cz/pbml-91-100.html.

Data availability and distributed computing techniques have allowed statistical machine translation (SMT) researchers to build larger models. However, decoders need to be able to retrieve information efficiently from these models to be able to translate an input sentence or a set of input sentences. We introduce an easy-to-implement and general-purpose solution to tackle this problem: we store SMT models as a set of key-value pairs in an HFile. We apply this strategy to two specific tasks: test set hierarchical phrase-based rule filtering and n-gram count filtering for language model lattice rescoring. We compare our approach to alternative strategies and show that its trade-offs in terms of speed, memory and simplicity are competitive.
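
The test-set filtering strategy can be sketched with an in-memory dictionary standing in for the on-disk HFile; the key layout and record format here are illustrative assumptions, not the paper's actual schema:

```python
def filter_rules(rule_store, test_sentences, max_len=3):
    """Test-set rule filtering via key-value lookup.
    `rule_store` stands in for an HFile: keys are source-side phrases,
    values are lists of (target, features) records."""
    # Collect every source n-gram up to max_len that occurs in the test set.
    queries = set()
    for sent in test_sentences:
        words = sent.split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                queries.add(" ".join(words[i:i + n]))
    # Query in sorted key order: sequential reads are what make the
    # sorted on-disk key-value layout efficient.
    return {k: rule_store[k] for k in sorted(queries) if k in rule_store}

# Toy French-English rule store.
store = {"le chat": [("the cat", [0.5])],
         "chat": [("cat", [0.9])],
         "chien": [("dog", [0.8])]}
print(filter_rules(store, ["le chat dort"]))
```

Only rules whose source side actually appears in the test set survive, so the decoder loads a small fraction of the full model.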

         Autoregressive models for statistical parametric speech synthesis. M. Shannon, H. Zen, and W. Byrne. IEEE Transactions on Audio, Speech and Language Processing, 2012. Paper [PDF].

We propose using the autoregressive hidden Markov model (HMM) for speech synthesis. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard approach to statistical parametric speech synthesis. It supports easy and efficient parameter estimation using expectation maximization, in contrast to the trajectory HMM. At the same time its similarities to the standard approach allow use of established high quality synthesis algorithms such as speech parameter generation considering global variance. The autoregressive HMM also supports a speech parameter generation algorithm not available for the standard approach or the trajectory HMM and which has particular advantages in the domain of real-time, low latency synthesis. We show how to do efficient parameter estimation and synthesis with the autoregressive HMM and look at some of the similarities and differences between the standard approach, the trajectory HMM and the autoregressive HMM. We compare the three approaches in subjective and objective evaluations. We also systematically investigate which choices of parameters such as autoregressive order and number of states are optimal for the autoregressive HMM.

         Impacts of machine translation and speech synthesis on speech-to-speech translation. K. Hashimoto, J. Yamagishi, W. Byrne, S. King, and K. Tokuda. Speech Communication, 54(7):857–866 (10 pages), September 2012. http://www.sciencedirect.com/science/article/pii/S0167639312000283.

This paper analyzes the impacts of machine translation and speech synthesis on speech-to-speech translation systems. A typical speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques have been proposed for integration of speech recognition and machine translation. However, corresponding techniques have not yet been considered for speech synthesis. The focus of the current work is machine translation and speech synthesis, and we present a subjective evaluation designed to analyze their impact on speech-to-speech translation. The results of these analyses show that the naturalness and intelligibility of the synthesized speech are strongly affected by the fluency of the translated sentences. In addition, various features were found to correlate well with the average fluency of the translated sentences and the average naturalness of the synthesized speech.

         The CUED OpenMT12 Arabic-English and Chinese-English SMT systems. NIST Open MT Workshop, Washington, DC, July 2012. Presentation [PDF]. http://www.nist.gov/itl/iad/mig/openmt12results.cfm

         Lattice-based minimum error rate training using weighted finite-state transducers with tropical polynomial weights. A. Waite, G. Blackwood, and W. Byrne. In Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing (FSMNLP 2012), Donostia-San Sebastian, Spain, July 2012. (11 pages). Paper [PDF], Presentation [PDF]. http://aclweb.org/anthology-new/W/W12/W12-6219.pdf.

Minimum Error Rate Training (MERT) is a method for training the parameters of a log-linear model. One advantage of this method of training is that it can use the large number of hypotheses encoded in a translation lattice as training data. We demonstrate that the MERT line optimisation can be modelled as computing the shortest distance in a weighted finite-state transducer using a tropical polynomial semiring.
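
The line optimisation at the heart of MERT reduces to an upper-envelope computation over hypothesis score lines f(γ) = a + b·γ: each hypothesis is a line, and the envelope tells us which hypothesis the decoder would pick on each interval of γ. The paper's contribution is computing this envelope for all lattice paths at once via a tropical polynomial semiring; this explicit-list sketch does not attempt that, and the example lines are toy values:

```python
def upper_envelope(lines):
    """Upper envelope of lines f(g) = a + b*g, given as (a, b) pairs.
    Returns (hull, xs): the envelope lines in order of increasing slope,
    and xs[k] = the g value where hull[k+1] overtakes hull[k]."""
    # Sort by slope; for slope ties keep only the highest intercept.
    lines = sorted(lines, key=lambda ab: (ab[1], ab[0]))
    dedup = []
    for a, b in lines:
        if dedup and dedup[-1][1] == b:
            dedup.pop()  # same slope, lower intercept: dominated
        dedup.append((a, b))
    hull, xs = [], []
    for a, b in dedup:
        while hull:
            a0, b0 = hull[-1]
            x = (a0 - a) / (b - b0)  # intersection with current top line
            if xs and x <= xs[-1]:
                hull.pop(); xs.pop()  # top line is never maximal: discard
            else:
                xs.append(x)
                break
        hull.append((a, b))
    return hull, xs

# Three toy hypotheses: the middle line (2, 0) never reaches the envelope.
hull, xs = upper_envelope([(0, 1), (2, 0), (5, -1)])
print(hull, xs)
```

Given the envelope, MERT evaluates the error count of each interval's winning hypothesis and moves γ to the interval boundary region with the lowest error.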

         Statistical machine translation, W. Byrne. Cambridge Language Sciences Launch Event, Newnham College, Cambridge, May 2012. http://www.languagesciences.cam.ac.uk/event-reports/cambridge-language-sciences-launch-event

         Hierarchical phrase-based translation representations, W. Byrne. Workshop on ‘More Structure for Better Statistical Machine Translation?’, University of Amsterdam, Netherlands, January 2012. Invited lecture. http://staff.science.uva.nl/~simaan/workshop2012.html

         Weighted finite state transducers in statistical machine translation, W. Byrne. International Winter School in Language and Speech Technologies (WSLST 2012), Tarragona, Spain, January 2012. Six lecture short course. http://grammars.grlmc.com/wslst2012/courseDescription.php#Byrne.

This short course will present some recent advances in statistical machine translation (SMT) using modelling approaches based on Weighted Finite State Transducers (WFSTs) and Finite State Automata (FSA). The course focus will be on decoding procedures for SMT, i.e. the generation of translations using stochastic translation grammars and language models. WFSTs can offer a very powerful modelling framework for language processing. For problems which can be formulated in terms of WFSTs or FSAs, there are general purpose algorithms which can be used to implement efficient and exact search and estimation procedures. This is true even for problems which are not inherently finite state, such as translation with some stochastic context free grammars. The course will begin with an introduction to WFSTs, pushdown automata, and semirings in the context of SMT. The use of WFST and FSA modelling approaches will be presented for: SMT decoding with phrase-based models; SMT decoding with stochastic synchronous context free grammars (e.g. Hiero); SMT parameter optimisation (MERT); the use of large language models and 'fast' grammars in translation; translation lattice generation; and rescoring procedures such as minimum Bayes risk decoding and system combination. Implementations using the OpenFst toolkit will also be described. The course material will be suitable for researchers already familiar with SMT and who wish to learn about alternative methods in decoder design. Enough background will be given so that researchers new to machine translation or unfamiliar with applications of WFSTs in natural language processing will also find the material appropriate.

2011    Preprocessing Arabic for Arabic-English statistical machine translation. A. de Gispert, W. Byrne, J. Xu, R. Zbib, J. Makhoul, A. Chalabi, H. Nader, N. Habash, and F. Sadat. In J. Olive, C. Christianson, and J. McCary, editors, Handbook of natural language processing and machine translation. DARPA Global Autonomous Language Exploitation, pages 135–145 (11 pages). Springer, 2011.

         Personalising speech-to-speech translation: Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis. J. Dines, H. Liang, L. Saheer, M. Gibson, W. Byrne, K. Oura, K. Tokuda, J. Yamagishi, S. King, M. Wester, T. Hirsimäki, R. Karhila, and M. Kurimo. Computer Speech and Language (18 pages), 2011. In press. Available online 17 September 2011. doi:10.1016/j.csl.2011.08.003. http://www.sciencedirect.com/science/article/pii/S0885230811000441.

In this paper we present results of unsupervised cross-lingual speaker adaptation applied to text-to-speech synthesis. The application of our research is the personalisation of speech-to-speech translation in which we employ a HMM statistical framework for both speech recognition and synthesis. This framework provides a logical mechanism to adapt synthesised speech output to the voice of the user by way of speech recognition. In this work we present results of several different unsupervised and cross-lingual adaptation approaches as well as an end-to-end speaker adaptive speech-to-speech translation system. Our experiments show that we can successfully apply speaker adaptation in both unsupervised and cross-lingual scenarios and our proposed algorithms seem to generalise well for several language pairs. We also discuss important future directions including the need for better evaluation metrics.

         Unsupervised intra-lingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction. M. Gibson and W. Byrne. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):895 – 904 (10 pages), 2011. http://dx.doi.org/10.1109/TASL.2010.2066968.

Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to estimate the transcription of the adaptation data. This paper first presents an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for such supplementary acoustic models. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models, via a two-pass decision tree construction process. Second, it is shown that this mapping also enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data. Third, this paper demonstrates how this technique lends itself to the task of unsupervised cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of such an approach. Finally, listener evaluations reveal that the proposed unsupervised adaptation methods deliver performance approaching that of supervised adaptation.

         An analysis of machine translation and speech synthesis in speech-to-speech translation system. K. Hashimoto, J. Yamagishi, W. Byrne, S. King, and K. Tokuda. In Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing, pages 5108 – 5111 (4 pages), 2011. http://dx.doi.org/10.1109/ICASSP.2011.5946361.

This paper provides an analysis of the impacts of machine translation and speech synthesis on speech-to-speech translation systems. The speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Recently, many techniques for integration of speech recognition and machine translation have been proposed. However, speech synthesis has not yet been considered. The quality of synthesized speech is important, since users will not understand what the system said if the quality of synthesized speech is bad. Therefore, in this paper, we focus on the machine translation and speech synthesis components, and report a subjective evaluation to analyze the impact of each component. The results of these analyses show that the machine translation component affects the performance of speech-to-speech translation greatly, and that fluent sentences lead to higher naturalness and lower word error rate of synthesized speech.

         The effect of using normalized models in statistical speech synthesis. M. Shannon, H. Zen, and W. Byrne. In Proceedings of the 12th Annual Conference of the International Speech Communication Association, 2011. (4 pages). Paper [PDF], Presentation [PDF].

The standard approach to HMM-based speech synthesis is inconsistent in the enforcement of the deterministic constraints between static and dynamic features. The trajectory HMM and autoregressive HMM have been proposed as normalized models which rectify this inconsistency. This paper investigates the practical effects of using these normalized models, and examines the strengths and weaknesses of the different models as probabilistic models of speech. The most striking difference observed is that the standard approach greatly underestimates predictive variance. We argue that the normalized models have better predictive distributions than the standard approach, but that all the models we consider are still far from satisfactory probabilistic models of speech. We also present evidence that better intra-frame correlation modelling goes some way towards improving existing normalized models.

         Hierarchical phrase-based translation representations. G. Iglesias, C. Allauzen, W. Byrne, A. de Gispert, and M. Riley. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1373–1383 (11 pages), Edinburgh, Scotland, UK., July 2011. Association for Computational Linguistics. http://www.aclweb.org/anthology/D11-1127.

This paper compares several translation representations for a synchronous context-free grammar parse including CFGs/hypergraphs, finite-state automata (FSA), and pushdown automata (PDA). The representation choice is shown to determine the form and complexity of target LM intersection and shortest-path algorithms that follow. Intersection, shortest path, FSA expansion and RTN replacement algorithms are presented for PDAs. Chinese-to-English translation experiments using HiFST and HiPDT, FSA and PDA-based decoders, are presented using admissible (or exact) search, possible for HiFST with compact SCFG rulesets and HiPDT with compact LMs. For large rulesets with large LMs, we introduce a two-pass search strategy which we then analyze in terms of search errors and translation performance.

2010    Efficient path counting transducers for minimum Bayes-risk decoding of statistical machine translation lattices. G. Blackwood, A. de Gispert, and W. Byrne. In Proceedings of the Annual Meeting of the Association for Computational Linguistics – Short Papers, pages 27–32 (6 pages), 2010. Paper [PDF], Presentation [PDF].

This paper presents an efficient implementation of linearised lattice minimum Bayes-risk decoding using weighted finite state transducers. We introduce transducers to efficiently count lattice paths containing n-grams and use these to gather the required statistics. We show that these procedures can be implemented exactly through simple transformations of word sequences to sequences of n-grams. This yields a novel implementation of lattice minimum Bayes-risk decoding which is fast and exact even for very large lattices.

         Fluency constraints for minimum Bayes-risk decoding of statistical machine translation lattices. G. Blackwood, A. de Gispert, and W. Byrne. In Proceedings of the International Conference on Computational Linguistics (COLING), pages 71–79 (9 pages), 2010. Paper [PDF], Presentation [PDF].

A novel and robust approach to incorporating natural language generation into statistical machine translation is developed within a minimum Bayes-risk decoding framework. Segmentation of translation lattices is guided by confidence measures over the maximum likelihood translation hypothesis in order to focus on regions with potential translation errors. Modeling techniques intended to improve fluency in low confidence regions are introduced so as to improve overall translation fluency.

         Hierarchical phrase-based translation grammars extracted from alignment posterior probabilities. A. de Gispert, J. Pino, and W. Byrne. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 545–554 (10 pages), Cambridge, MA, 2010. Paper [PDF].

We report on investigations into hierarchical phrase-based translation grammars based on rules extracted from posterior distributions over alignments of the parallel text. Rather than restrict rule extraction to a single alignment, such as Viterbi, we instead extract rules based on posterior distributions provided by the HMM word-to-word alignment model. We define translation grammars progressively by adding classes of rules to a basic phrase-based system. We assess these grammars in terms of their expressive power, measured by their ability to align the parallel text from which their rules are extracted, and the quality of the translations they yield. In Chinese-to-English translation, we find that rule extraction from posteriors gives translation improvements. We also find that grammars with rules with only one nonterminal, when extracted from posteriors, can outperform more complex grammars extracted from Viterbi alignments. Finally, we show that the best way to exploit source-to-target and target-to-source alignment models is to build two separate systems and combine their output translation lattices.

         Unsupervised cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction. M. Gibson, T. Hirsimaki, R. Karhila, M. Kurimo, and W. Byrne. In Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing, pages 4642 – 4645 (4 pages), 2010. Paper [PDF], Presentation [PDF].

This paper demonstrates how unsupervised cross-lingual adaptation of HMM-based speech synthesis models may be performed without explicit knowledge of the adaptation data language. A two-pass decision tree construction technique is deployed for this purpose. Using parallel translated datasets, cross-lingual and intralingual adaptation are compared in a controlled manner. Listener evaluations reveal that the proposed method delivers performance approaching that of unsupervised intralingual adaptation.

         Personalising speech-to-speech translation in the EMIME project. M. Kurimo, W. Byrne, J. Dines, P. Garner, M. Gibson, Y. Guan, T. Hirsimäki, R. Karhila, S. King, H. Liang, K. Oura, L. Saheer, M. Shannon, S. Shiota, J. Tian, K. Tokuda, M. Wester, Y.-J. Wu, and J. Yamagishi. In Proceedings of the Annual Meeting of the Association for Computational Linguistics – Demonstration Systems, pages 48–53 (6 pages), 2010. Demo Session. Paper [PDF], Presentation [PDF]

         Overview and results of Morpho Challenge 2009. M. Kurimo, S. Virpioja, V. T. Turunen, G. W. Blackwood, and W. Byrne. In C. Peters et al., editor, Multilingual Information Access Evaluation, 10th Workshop of the Cross-Language Evaluation Forum - CLEF 2009, volume 1 of Revised Selected Papers, Lecture Notes in Computer Science, LNCS 6241, pages 579–598 (20 pages). Springer, 2010. Paper [PDF].

The goal of Morpho Challenge 2009 was to evaluate unsupervised algorithms that provide morpheme analyses for words in different languages and in various practical applications. Morpheme analysis is particularly useful in speech recognition, information retrieval and machine translation for morphologically rich languages where the amount of different word forms is very large. The evaluations consisted of: 1. a comparison to grammatical morphemes, 2. using morphemes instead of words in information retrieval tasks, and 3. combining morpheme and word based systems in statistical machine translation tasks. The evaluation languages were: Finnish, Turkish, German, English and Arabic. This paper describes the tasks, evaluation methods, and obtained results. The Morpho Challenge was part of the EU Network of Excellence PASCAL Challenge Program and organized in collaboration with CLEF.

         The CUED HiFST system for the WMT10 translation shared task. J. Pino, G. Iglesias, A. de Gispert, G. Blackwood, J. Brunning, and W. Byrne. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 155–160 (6 pages), 2010. Paper [PDF], Presentation [PDF].

This paper describes the Cambridge University Engineering Department submission to the Fifth Workshop on Statistical Machine Translation. We report results for the French-English and Spanish-English shared translation tasks in both directions. The CUED system is based on HiFST, a hierarchical phrase-based decoder implemented using weighted finite-state transducers. In the French-English task, we investigate the use of context-dependent alignment models. We also show that lattice minimum Bayes-risk decoding is an effective framework for multi-source translation, leading to large gains in BLEU score.

         Autoregressive clustering for HMM speech synthesis. M. Shannon and W. Byrne. In Proceedings of INTERSPEECH, 2010. (4 pages). Paper [PDF], Presentation [PDF].

The autoregressive HMM has been shown to provide efficient parameter estimation and high-quality synthesis, but in previous experiments decision trees derived from a non-autoregressive system were used. In this paper we investigate the use of autoregressive clustering for autoregressive HMM-based speech synthesis. We describe decision tree clustering for the autoregressive HMM and highlight differences to the standard clustering procedure. Subjective listening evaluation results suggest that autoregressive clustering improves the naturalness of the resulting speech. We find that the standard minimum description length (MDL) criterion for selecting model complexity is inappropriate for the autoregressive HMM. Investigating the effect of model complexity on naturalness, we find that a large degree of overfitting is tolerated without a substantial decrease in naturalness.

         Hierarchical phrase-based translation with weighted finite state transducers, William Byrne. 7th International Workshop on Spoken Language Translation, Paris, France, December 2010. Keynote lecture. http://iwslt2010.fbk.eu/

         Hierarchical phrase-based translation with weighted finite state transducers, William Byrne. Natural Language Processing Group, Department of Computer Science, University of Sheffield, UK, December 2010

         Hierarchical phrase-based translation with weighted finite state transducers, William Byrne. Dublin Computational Linguistics Research Seminar, Dublin, Ireland, November 2010

         Hierarchical phrase-based translation with weighted finite state transducers, William Byrne. FALA 2010 Conference (VI Jornadas en Tecnologías del Habla and II Iberian Workshop on Speech and Language Technologies for Iberian Languages), Vigo, Spain, November 2010. Keynote lecture. http://fala2010.uvigo.es/

         Recent research in statistical machine translation, William Byrne. Winton Capital Management Internal Research Conference, November 2010. Invited presentation

         Hierarchical phrase-based translation with weighted finite state transducers and shallow-N grammars. A. de Gispert, G. Iglesias, G. Blackwood, E. R. Banga, and W. Byrne. Computational Linguistics, 36(3):505–533 (29 pages), September 2010. Paper [PDF].

In this paper we describe HiFST, a lattice-based decoder for hierarchical phrase-based translation and alignment. The decoder is implemented with standard Weighted Finite-State Transducer (WFST) operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, better parameter optimization, and improved translation performance. The direct generation of translation lattices in the target language can improve subsequent rescoring procedures, yielding further gains when applying long-span language models and Minimum Bayes Risk decoding. We also give insight into how to control the size of the search space defined by hierarchical rules. We show that shallow-N grammars, low-level rule catenation and other search constraints can help to match the power of the translation system to specific language pairs.

         EMIME project overview, Matthew Gibson and William Byrne. European Commission Information Society Conference (ICT 2010), Brussels, Belgium, September 2010. http://ec.europa.eu/information_society/events/cf/ict2010/item-display.cfm?id=3322

         Hierarchical phrase-based translation with weighted finite state transducers, William Byrne. Columbia University, New York, NY, USA, April 2010

         Hierarchical phrase-based translation with weighted finite state transducers, William Byrne. Google, Inc, Mountain View, CA, USA, April 2010

         Fast Hiero grammars, William Byrne and Adrià de Gispert. DARPA GALE PI Meeting, Scottsdale, AZ, USA, April 2010

         FAUST project overview, William Byrne. ICT-FP7 Language Technology Days, Luxembourg, March 2010

2009    Context-dependent alignment models for statistical machine translation. J. Brunning, A. de Gispert, and W. Byrne. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 110–118 (9 pages), 2009. Paper [PDF], Presentation [PDF].

We introduce alignment models for Machine Translation that take into account the context of a source word when determining its translation. Since the use of these contexts alone causes data sparsity problems, we develop a decision tree algorithm for clustering the contexts based on optimisation of the EM auxiliary function. We show that our context-dependent models lead to an improvement in alignment quality, and an increase in translation quality when the alignments are used to build a machine translation system.

         Minimum Bayes risk combination of translation hypotheses from alternative morphological decompositions. A. de Gispert, S. Virpioja, M. Kurimo, and W. Byrne. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 73–76 (4 pages), 2009. Paper [PDF], Presentation [PDF].

We describe a simple strategy to achieve translation performance improvements by combining output from identical statistical machine translation systems trained on alternative morphological decompositions of the source language. Combination is done by means of Minimum Bayes Risk decoding over a shared N-best list. When translating into English from two highly inflected languages such as Arabic and Finnish we obtain significant improvements over simply selecting the best morphological decomposition.
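The selection step described above can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: it uses word-level edit distance as a stand-in for the sentence-level loss, and the hypotheses and posterior probabilities below are invented for the example.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]


def mbr_select(nbest):
    """Pick the hypothesis minimising expected loss under the posterior.

    nbest: list of (hypothesis, posterior) pairs, e.g. the union of the
    N-best lists produced by the systems being combined.
    """
    best_hyp, best_risk = None, float("inf")
    for hyp, _ in nbest:
        risk = sum(p * edit_distance(hyp.split(), ref.split())
                   for ref, p in nbest)
        if risk < best_risk:
            best_hyp, best_risk = hyp, risk
    return best_hyp


# A merged N-best list (hypotheses and posteriors are invented):
nbest = [("the cat sat", 0.5), ("a cat sat", 0.3), ("the cat sits", 0.2)]
print(mbr_select(nbest))  # -> the cat sat
```

In the paper the lists come from systems trained on different morphological decompositions, so the hypotheses agree on content but differ in surface realisation, which is exactly the situation in which consensus-style MBR selection helps.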

         The HiFST system for the europarl spanish-to-english task. G. Iglesias, A. de Gispert, E. Banga, and W. Byrne. In Proceedings of SEPLN, pages 207–214 (8 pages), 2009. Paper [PDF], Presentation [PDF].

In this paper we present results for the Europarl Spanish-to-English translation task. We use HiFST, a novel hierarchical phrase-based translation system implemented with finite-state technology that creates target lattices rather than k-best lists.

         Hierarchical phrase-based translation with weighted finite state transducers. G. Iglesias, A. de Gispert, E. R. Banga, and W. Byrne. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 433–441 (9 pages), 2009. Paper [PDF], Presentation [PDF].

This paper describes a lattice-based decoder for hierarchical phrase-based translation. The decoder is implemented with standard WFST operations as an alternative to the well-known cube pruning procedure. We find that the use of WFSTs rather than k-best lists requires less pruning in translation search, resulting in fewer search errors, direct generation of translation lattices in the target language, better parameter optimization, and improved translation performance when rescoring with long-span language models and MBR decoding. We report translation experiments for the Arabic-to-English and Chinese-to-English NIST translation tasks and contrast the WFST-based hierarchical decoder with hierarchical translation under cube pruning.

         Rule filtering by pattern for efficient hierarchical translation. G. Iglesias, A. de Gispert, E. R. Banga, and W. Byrne. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2009), pages 380–388 (9 pages), 2009. Paper [PDF], Presentation [PDF].

We describe refinements to hierarchical translation search procedures intended to reduce both search errors and memory usage through modifications to hypothesis expansion in cube pruning and reductions in the size of the rule sets used in translation. Rules are put into syntactic classes based on the number of non-terminals and the pattern, and various filtering strategies are then applied to assess the impact on translation speed and quality. Results are reported on the 2008 NIST Arabic-to-English evaluation task.

         Autoregressive HMMs for speech synthesis. M. Shannon and W. Byrne. In Proceedings of INTERSPEECH, 2009. (4 pages). Paper [PDF], Presentation [PDF].

We propose the autoregressive HMM for speech synthesis. We show that the autoregressive HMM supports efficient EM parameter estimation and that we can use established effective synthesis techniques such as synthesis considering global variance with minimal modification. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard HMM synthesis framework, and supports easy and efficient parameter estimation, in contrast to the trajectory HMM. We find that the autoregressive HMM gives performance comparable to the standard HMM synthesis framework on a Blizzard Challenge-style naturalness evaluation.

         Hierarchical phrase-based translation with weighted finite state transducers, William Byrne. The Johns Hopkins University Center for Language and Speech Processing, Baltimore, MD, USA, November 2009

         The CUED NIST 2009 Arabic-English SMT System, A. de Gispert, G. Iglesias, G. Blackwood, J. Brunning, and W. Byrne. NIST Open Machine Translation 2009 Evaluation (MT09) Workshop, Ottawa, ON, Canada, August 2009. Presentation [PDF]

         Context-dependent alignment models and hierarchical phrase-based translation with weighted finite state transducers, W. Byrne. GALE PI Meeting, Tampa, FL, USA, May 2009. Presentation [PDF]

2008    Statistical techniques in machine translation, W. Byrne. Google EMEA Faculty Summit, Zurich, Switzerland, 2008. Keynote lecture. Presentation [PDF]

         Large-scale statistical machine translation with weighted finite state transducers. G. Blackwood, A. de Gispert, J. Brunning, and W. Byrne. In Proceedings of FSMNLP 2008: Finite-State Methods and Natural Language Processing, Ispra, Lago Maggiore, Italy, September 2008. (12 pages). Paper [PDF].

The Cambridge University Engineering Department phrase-based statistical machine translation system follows a generative model of translation and is implemented by the composition of component models realised as Weighted Finite State Transducers. Our flexible architecture requires no special purpose decoder and readily handles the large-scale natural language processing demands of state-of-the-art machine translation systems. In this paper we describe the CUED participation in the NIST 2008 Arabic-English machine translation evaluation task.

         Phrase-based statistical machine translation with weighted finite state transducers, W. Byrne. IRTG Summer School in Computational Linguistics and Psycholinguistics, University of Edinburgh, UK, September 2008. Invited tutorial. Presentation [PDF].

The Transducer Translation Model (TTM) for phrase-based statistical machine translation follows a generative model of translation and is implemented by the composition of component models realized as Weighted Finite State Transducers via the OpenFst Toolkit. This flexible architecture requires no special purpose decoder and readily handles the large-scale natural language processing demands of state-of-the-art machine translation systems. This presentation describes how the system was used for the NIST 2008 Arabic-English machine translation evaluation task and for Spanish-English and French-English translation in the ACL 2008 Third Workshop on Statistical Machine Translation Shared Task. General issues in using WFSTs for such tasks will also be discussed.

         Phrasal segmentation models for statistical machine translation. G. Blackwood, A. de Gispert, and W. Byrne. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 19–22 (4 pages), Manchester, UK, August 2008. Paper [PDF].

Phrasal segmentation models define a mapping from the words of a sentence to sequences of translatable phrases. We discuss the estimation of these models from large quantities of monolingual training text and describe their realization as weighted finite state transducers for incorporation into phrase-based statistical machine translation systems. Results are reported on the NIST Arabic-English translation tasks showing significant complementary gains in BLEU score with large 5-gram and 6-gram language models.
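The mapping these models define can be illustrated with a toy dynamic program. This is a unigram-phrase stand-in for the n-gram segmentation models of the paper, and the phrase inventory and probabilities below are invented:

```python
import math


def best_segmentation(words, phrase_logprob, max_len=4):
    """Highest-scoring segmentation of `words` into translatable phrases
    under a toy unigram phrasal segmentation model.

    phrase_logprob: dict mapping a phrase (tuple of words) to its log-prob.
    """
    n = len(words)
    best = [float("-inf")] * (n + 1)   # best[j]: score of best split of words[:j]
    back = [0] * (n + 1)               # back[j]: start of the last phrase
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            phrase = tuple(words[i:j])
            if phrase in phrase_logprob and best[i] + phrase_logprob[phrase] > best[j]:
                best[j] = best[i] + phrase_logprob[phrase]
                back[j] = i
    # trace back the winning segmentation
    segs, j = [], n
    while j > 0:
        i = back[j]
        segs.append(words[i:j])
        j = i
    return list(reversed(segs))


words = "machine translation is fun".split()
logp = {("machine", "translation"): math.log(0.4),
        ("machine",): math.log(0.2), ("translation",): math.log(0.2),
        ("is",): math.log(0.5), ("fun",): math.log(0.5),
        ("is", "fun"): math.log(0.3)}
print(best_segmentation(words, logp))  # -> [['machine', 'translation'], ['is', 'fun']]
```

In the paper the analogous computation is carried out by WFST composition, so that the segmentation model can be applied to lattices rather than single sentences.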

         European language translation with weighted finite state transducers: The CUED MT system for the 2008 ACL workshop on statistical machine translation. G. Blackwood, A. de Gispert, J. Brunning, and W. Byrne. In Proceedings of the ACL 2008 Third Workshop on Statistical Machine Translation, pages 131–134 (4 pages), June 2008. Paper [PDF].

We describe the Cambridge University Engineering Department phrase-based statistical machine translation system for Spanish-English and French-English translation in the ACL 2008 Third Workshop on Statistical Machine Translation Shared Task. The CUED system follows a generative model of translation and is implemented by composition of component models realised as Weighted Finite State Transducers, without the use of a special-purpose decoder. Details of system tuning for both Europarl and News translation tasks are provided.

         The CUED NIST 2008 Arabic-English SMT System, A. de Gispert, G. Blackwood, J. Brunning, and W. Byrne. NIST MT Workshop, Alexandria, VA, USA, March 2008. Presentation [PDF]

         HMM word and phrase alignment for statistical machine translation. Y. Deng and W. Byrne. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):494–507 (14 pages), March 2008. Paper [PDF].

Efficient estimation and alignment procedures for word and phrase alignment HMMs are developed for the alignment of parallel text. The development of these models is motivated by an analysis of the desirable features of IBM Model 4, one of the original and most effective models for word alignment. These models are formulated to capture the desirable aspects of Model 4 in an HMM alignment formalism. Alignment behavior is analyzed and compared to human-generated reference alignments, and the ability of these models to capture different types of alignment phenomena is evaluated. In analyzing alignment performance, Chinese-English word alignments are shown to be comparable to those of IBM Model-4 even when models are trained over large parallel texts. In translation performance, phrase-based statistical machine translation systems based on these HMM alignments can equal and exceed systems based on Model-4 alignments, and this is shown in Arabic-English and Chinese-English translation. These alignment models can also be used to generate posterior statistics over collections of parallel text, and this is used to refine and extend phrase translation tables with a resulting improvement in translation quality.
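A minimal sketch of Viterbi word alignment under an HMM alignment model of this general family (jump-width transition probabilities, word-translation emissions) follows. The vocabulary and probabilities are invented, and the paper's refinements inspired by Model 4, such as phrase emission and distortion conditioning, are omitted:

```python
import math


def viterbi_align(src, tgt, trans_prob, jump_prob):
    """Viterbi word alignment under a toy HMM alignment model.

    src: source words e_1..e_I (hidden alignment states point into src)
    tgt: target words f_1..f_J (observations)
    trans_prob[e][f]: translation probability p(f | e)
    jump_prob[d]: probability of a jump d = i - i_prev between states

    Returns, for each target position j, the aligned source index.
    """
    I, J = len(src), len(tgt)
    NEG = float("-inf")
    delta = [[NEG] * I for _ in range(J)]  # best log-prob ending in state i
    back = [[0] * I for _ in range(J)]
    for i in range(I):  # uniform initial state distribution
        delta[0][i] = math.log(1.0 / I) + math.log(trans_prob[src[i]][tgt[0]])
    for j in range(1, J):
        for i in range(I):
            best_i, best = 0, NEG
            for ip in range(I):
                score = delta[j - 1][ip] + math.log(jump_prob.get(i - ip, 1e-9))
                if score > best:
                    best_i, best = ip, score
            delta[j][i] = best + math.log(trans_prob[src[i]][tgt[j]])
            back[j][i] = best_i
    # trace back the best alignment
    i = max(range(I), key=lambda k: delta[J - 1][k])
    align = [i]
    for j in range(J - 1, 0, -1):
        i = back[j][i]
        align.append(i)
    return list(reversed(align))


src, tgt = ["das", "haus"], ["the", "house"]
tp = {"das": {"the": 0.9, "house": 0.1}, "haus": {"the": 0.1, "house": 0.9}}
jp = {0: 0.2, 1: 0.6, -1: 0.2}
print(viterbi_align(src, tgt, tp, jp))  # -> [0, 1]
```

The forward-backward recursion over the same trellis is what makes EM parameter estimation efficient for these models, in contrast to Model 4, whose alignment space does not factor this way.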

         Statistical machine translation, W. Byrne. Advanced Machine Learning Tutorial Lectures Series, Cambridge University Engineering Department, UK, February 2008

2007    Discriminative language model adaptation for mandarin broadcast speech transcription and translation. X. A. Liu, W. J. Byrne, M. J. F. Gales, A. de Gispert, M. Tomalin, P. C. Woodland, and K. Yu. In Proc. IEEE Automatic Speech Recognition and Understanding (ASRU), pages 153– 158 (6 pages), Kyoto, Japan, 2007

         Consensus network decoding for statistical machine translation system combination. K.-C. Sim, W. Byrne, M. Gales, H. Sahbi, and P.C. Woodland. In IEEE Conference on Acoustics, Speech and Signal Processing, 2007. (4 pages). Paper [PDF].

This paper presents a simple and robust consensus decoding approach for combining multiple Machine Translation (MT) system outputs. A consensus network is constructed from an N-best list by aligning the hypotheses against an alignment reference, where the alignment is based on minimising the translation edit rate (TER). The Minimum Bayes Risk (MBR) decoding technique is investigated for the selection of an appropriate alignment reference. Several alternative decoding strategies are proposed to retain coherent phrases in the original translations. Experimental results are presented primarily for three-way combination of Chinese-English translation outputs, with additional results for six-way system combination. It is shown that worthwhile improvements in translation performance can be obtained using the methods discussed.

         Gini support vector machines for segmental minimum Bayes risk decoding of continuous speech. V. Venkataramani, S. Chakrabartty, and W. Byrne. Computer Speech and Language, 21:423–442 (20 pages), 2007. Published online by Elsevier Ltd., 2 October 2006. Paper [PDF].

We describe the use of Support Vector Machines (SVMs) for continuous speech recognition by incorporating them in Segmental Minimum Bayes Risk decoding. Lattice cutting is used to convert the Automatic Speech Recognition search space into sequences of smaller recognition problems. SVMs are then trained as discriminative models over each of these problems and used in a rescoring framework. We pose the estimation of a posterior distribution over hypotheses in these regions of acoustic confusion as a logistic regression problem. We also show that GiniSVMs can be used as an approximation technique to estimate the parameters of the logistic regression problem. On a small vocabulary recognition task we show that the use of GiniSVMs can improve the performance of a well-trained Hidden Markov Model system estimated under the Maximum Mutual Information criterion. We also find that it is possible to derive reliable confidence scores over the GiniSVM hypotheses and that these can be used to good effect in hypothesis combination. We discuss the problems that we expect to encounter in extending this approach to Large Vocabulary Continuous Speech Recognition and describe an initial investigation of constrained estimation techniques to derive feature spaces for SVMs.

2006    Segmentation and alignment of parallel text for statistical machine translation. Y. Deng, S. Kumar, and W. Byrne. Journal of Natural Language Engineering, 13(3):235–260 (26 pages), 2006. Paper [PDF].

We address the problem of extracting bilingual chunk pairs from parallel text to create training sets for statistical machine translation. We formulate the problem in terms of a stochastic generative process over text translation pairs, and derive two different alignment procedures based on the underlying alignment model. The first procedure is a now-standard dynamic programming alignment model which we use to generate an initial coarse alignment of the parallel text. The second procedure is a divisive clustering parallel text alignment procedure which we use to refine the first-pass alignments. This latter procedure is novel in that it permits the segmentation of the parallel text into sub-sentence units which are allowed to be reordered to improve the chunk alignment. The quality of chunk pairs is measured by the performance of machine translation systems trained from them. We show practical benefits of divisive clustering as well as how system performance can be improved by exploiting portions of the parallel text that otherwise would have to be discarded. We also show that chunk alignment as a first step in word alignment can significantly reduce word alignment error rate.

         Statistical phrase-based speech translation. L. Mathias and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing, 2006. (4 pages). Paper [PDF], Presentation [PDF].

A generative statistical model of speech-to-text translation is developed as an extension of existing models of phrase-based text translation. Speech is translated by mapping ASR word lattices to lattices of phrase sequences which are then translated using operations developed for text translation. Performance is reported on Chinese to English translation of Mandarin Broadcast News.

         MTTK: An alignment toolkit for statistical machine translation, Y. Deng and W. Byrne. HLT-NAACL Demonstrations Program, New York, NY, USA, June 2006. Paper [PDF], Presentation [PDF].

The MTTK alignment toolkit for statistical machine translation can be used for word, phrase, and sentence alignment of parallel documents. It is designed mainly for building statistical machine translation systems, but can be exploited in other multilingual applications. It provides computationally efficient alignment and estimation procedures that can be used for the unsupervised alignment of parallel text collections in a language independent fashion. MTTK Version 1.0 is available under the Open Source Educational Community License.

         Integrating automatic speech recognition and statistical machine translation, W. Byrne. TC-STAR OpenLab on Speech Translation, Trento, Italy, April 2006. Invited tutorial. Presentation [PDF]

         Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition. W. Byrne. Proceedings of the Institute of Electronics, Information, and Communication Engineers, Japan – Special Section on Statistical Modeling for Speech Processing, E89-D(3):900–907 (8 pages), March 2006. Invited paper. Paper [PDF]

         Statistical phrase-based speech translation, W. Byrne. GALE Mid-Phase PI Meeting, Boston, MA, USA, March 2006. Presentation [PDF]

         A weighted finite state transducer translation template model for statistical machine translation. S. Kumar, Y. Deng, and W. Byrne. Journal of Natural Language Engineering, 12(1):35–75 (41 pages), March 2006. Paper [PDF].

We present a Weighted Finite State Transducer Translation Template Model for statistical machine translation. The approach we describe allows us to implement each constituent distribution of the model as a weighted finite state transducer or acceptor. We show that bitext word alignment and translation under the model can be performed with standard FSM operations involving these transducers. One of the benefits of using this framework is that it avoids the need to develop specialized search procedures, even for the generation of lattices or N-Best lists of bitext word alignments and translation hypotheses. We report and analyze bitext word alignment and translation performance of the model on French-English and Chinese-English tasks.
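The core FSM operation the model relies on, weighted composition, can be sketched as a product construction. This is a toy epsilon-free version in the tropical semiring (weights are negative log-probs and add along a path), not the optimized implementation used in real toolkits, and the example transducers are invented:

```python
def compose(t1, t2):
    """Compose two weighted transducers in the tropical semiring.

    Each transducer is a dict with:
      'start':  start state,
      'finals': set of final states,
      'arcs':   list of (state, in_label, out_label, weight, next_state).
    Epsilon transitions are omitted for simplicity.
    """
    start = (t1["start"], t2["start"])
    arcs, finals, stack, seen = [], set(), [start], {start}
    while stack:
        s1, s2 = stack.pop()
        if s1 in t1["finals"] and s2 in t2["finals"]:
            finals.add((s1, s2))
        for (q1, a, b, w1, r1) in t1["arcs"]:
            if q1 != s1:
                continue
            for (q2, c, d, w2, r2) in t2["arcs"]:
                # match t1's output label against t2's input label
                if q2 == s2 and b == c:
                    nxt = (r1, r2)
                    arcs.append(((s1, s2), a, d, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


# A one-arc "translation" transducer composed with a one-arc acceptor:
t1 = {"start": 0, "finals": {1}, "arcs": [(0, "chat", "cat", 0.5, 1)]}
t2 = {"start": 0, "finals": {1}, "arcs": [(0, "cat", "cat", 0.2, 1)]}
c = compose(t1, t2)
# c has a single arc (0,0) --chat:cat/0.7--> (1,1)
```

Chaining such compositions over the component distributions of a generative model, then taking the shortest path, is what lets the approach dispense with a specialized decoder.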
See also CLSP Tech. Rep. 48, 2004 – Download

         Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition, W. Byrne. University of Sheffield, UK, January 2006. Presentation [PDF].

Progress in automatic speech recognition is frequently measured by easily computed, task-neutral measures such as Word Error Rate. Ideally it would be possible to design systems tailored for any application no matter how complex or specialized the performance criteria. Minimum Bayes-Risk (MBR) processing is a modeling framework that attempts to minimize the empirical expected risk under task-specific loss functions that describe desired system behavior. This presentation will describe risk-based recognition and model estimation procedures developed for the refinement of automatic speech recognition systems. The MBR formulation has also made it possible to implement a hybrid estimation and discriminative training approach called Acoustic Code-Breaking. This is a divide-and-conquer strategy that breaks continuous speech recognition problems into a sequence of smaller, distinct subproblems that can be solved independently using specially trained discriminative models such as Support Vector Machines. These estimation and decoding approaches will be described, along with evaluation of their performance on various automatic speech recognition tasks.

         A dialectal Chinese speech recognition framework. J. Li, F. Zheng, W. Byrne, and D. Jurafsky. Journal of Computer Science and Technology (Science Press, Beijing, China), (1):106–115 (10 pages), January 2006. Paper [PDF].

A framework for dialectal Chinese speech recognition is proposed and studied, in which a relatively small dialectal Chinese (that is, Chinese influenced by the speaker's native dialect) speech corpus and dialect-related knowledge are used to adapt a standard Chinese (or Putonghua, abbreviated as PTH) speech recognizer into a dialectal Chinese speech recognizer. There are two kinds of knowledge sources: one is human experts and the other is a small dialectal Chinese corpus. This knowledge spans four levels: the phonetics level, the lexicon level, the language level, and the acoustic decoder level. This paper takes Wu dialectal Chinese (WDC) as an example target language, with the goal of deriving an acceptable WDC speech recognizer from an existing PTH speech recognizer. Based on the Initial-Final structure of the Chinese language and a study of how dialectal Chinese speakers speak Putonghua, we propose to use knowledge of the context-independent PTH-IF mappings (where IF means either a Chinese Initial or a Chinese Final), the context-independent WDC-IF mappings, and the syllable-dependent WDC-IF mappings, obtained from either experts or data, and to combine these with the surface-form based maximum likelihood linear regression (MLLR) acoustic model adaptation method. To reduce the size of the multi-pronunciation lexicon introduced by the IF mappings, which might introduce confusion into the lexicon and hence degrade performance, a Multi-Pronunciation Expansion (MPE) method based on an accumulated uni-gram probability (AUP) is proposed. Compared with the original PTH speech recognizer, the resulting WDC speech recognizer achieved over a 10% Character Error Rate (CER) reduction when recognizing WDC, with only a 0.62% increase when recognizing PTH. The proposed framework and methods are intended to work not only for Wu dialectal Chinese but also for other dialectal Chinese languages and even other languages.

2005    HMM word and phrase alignment for statistical machine translation. Y. Deng and W. Byrne. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 169–176 (8 pages), 2005. Paper [PDF], Presentation [PDF].

HMM-based models are developed for the alignment of words and phrases in bitext. The models are formulated so that alignment and parameter estimation can be performed efficiently. We find that Chinese-English word alignment performance is comparable to that of IBM Model-4 even over large training bitexts. Phrase pairs extracted from word alignments generated under the model can also be used for phrase-based translation, and in Chinese to English and Arabic to English translation, performance is comparable to systems based on Model-4 alignments. Direct phrase pair induction under the model is described and shown to improve translation performance.

         Lattice segmentation and minimum Bayes risk discriminative training for large vocabulary continuous speech recognition. V. Doumpiotis and W. Byrne. Speech Communication, (2):142–160 (19 pages), 2005. Paper [PDF].

Lattice segmentation techniques developed for Minimum Bayes Risk decoding in large vocabulary speech recognition tasks are used to compute the statistics for discriminative training algorithms that estimate HMM parameters so as to reduce the overall risk over the training data. New estimation procedures are developed and evaluated for small vocabulary and large vocabulary recognition tasks, and additive performance improvements are shown relative to maximum mutual information estimation. These relative gains are explained through a detailed analysis of individual word recognition errors.

         Local phrase reordering models for statistical machine translation. S. Kumar and W. Byrne. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 161–168 (8 pages), 2005. Paper [PDF], Presentation [PDF].

We describe stochastic models of local phrase movement that can be incorporated into a Statistical Machine Translation (SMT) system. These models provide properly formulated, non-deficient, probability distributions over reordered phrase sequences. They are implemented by Weighted Finite State Transducers. We describe EM-style parameter re-estimation procedures based on phrase alignment under the complete translation model incorporating reordering. Our experiments show that the reordering model yields substantial improvements in translation performance on Arabic-to-English and Chinese-to-English MT tasks. We also show that the procedure scales as the bitext size is increased.

         Automatic transcription of Czech, Russian, and Slovak spontaneous speech in the MALACH project. J. Psutka, P. Ircing, J.V. Psutka, J. Hajic, W. Byrne, and J. Mirovski. In Proceedings of EUROSPEECH, 2005. (4 pages). Paper [PDF].

This paper describes the 3.5-year effort put into building LVCSR systems for recognition of the spontaneous speech of Czech, Russian, and Slovak witnesses of the Holocaust in the MALACH project. For processing the colloquial, highly emotional, and heavily accented speech of elderly people, containing many non-speech events, we have developed techniques that very effectively handle both non-speech events and colloquial and accented variants of uttered words. Manual transcripts, one of the main sources for language modeling, were automatically "normalized" using a standardized lexicon, which brought about a 2 to 3% reduction in word error rate (WER). The subsequent interpolation of such LMs with models built from an additional collection (consisting of topically selected sentences from general text corpora) resulted in an additional performance improvement of up to 3%.

         Acoustic training from heterogeneous data sources: Experiments in Mandarin conversational telephone speech transcription. S. Tsakalidis and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing, 2005. (4 pages). Paper [PDF], Presentation [PDF].

In this paper we investigate the use of heterogeneous data sources for acoustic training. We describe an acoustic normalization procedure for enlarging an ASR acoustic training set with out-of-domain acoustic data. A larger in-domain training set is created by effectively transforming the out-of-domain data before incorporation in training. Baseline experimental results in Mandarin conversational telephone speech transcription show that a simple attempt to add out-of-domain data degrades performance. Preliminary experiments assess the effectiveness of the proposed cross-corpus acoustic normalization.

         Lattice segmentation and support vector machines for large vocabulary continuous speech recognition. V. Venkataramani and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing, 2005. (4 pages). Paper [PDF], Presentation [PDF].

Lattice segmentation procedures are used to spot possible recognition errors in first-pass recognition hypotheses produced by a large vocabulary continuous speech recognition system. This approach is analyzed in terms of its ability to reliably identify, and provide good alternatives for, incorrectly hypothesized words. A procedure is described to train and apply Support Vector Machines to strengthen the first pass system where it was found to be weak, resulting in small but statistically significant recognition improvements on a large test set of conversational speech.

         Convergence theorems for generalized alternating minimization procedures. A. Gunawardana and W. Byrne. Journal of Machine Learning Research, (6):2049–2073 (25 pages), December 2005. Paper [PDF].

The EM algorithm is widely used to develop iterative parameter estimation procedures for statistical models. In cases where these procedures strictly follow the EM formulation, the convergence properties of the estimation procedures are well understood. In some instances there are practical reasons to develop procedures that do not strictly fall within the EM framework. We study EM variants in which the E-Step is not performed exactly, either to obtain improved rates of convergence, or due to approximations needed to compute statistics under a model family over which E-Steps cannot be realized. Since these variants are not EM procedures, the standard (G)EM convergence results do not apply to them. We present an information geometric framework for describing such algorithms and analyzing their convergence properties. We apply this framework to analyze the convergence properties of incremental EM and variational EM. For incremental EM, we discuss conditions under which these algorithms converge in likelihood. For variational EM, we show how the E-Step approximation prevents convergence to local maxima in likelihood.
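The quantities at issue can be summarized by the familiar free-energy decomposition (a sketch in standard notation, not drawn from the paper itself):

```latex
\mathcal{F}(q,\theta)
  = \mathbb{E}_{q(z)}\big[\log p(x,z\mid\theta)\big]
    - \mathbb{E}_{q(z)}\big[\log q(z)\big]
  = \log p(x\mid\theta)
    - \mathrm{KL}\big(q(z)\,\big\|\,p(z\mid x,\theta)\big)
```

Exact EM sets $q(z) = p(z\mid x,\theta)$ in the E-Step, so the KL gap vanishes and increasing $\mathcal{F}$ increases the likelihood; variational EM restricts $q$ to a tractable family, so the gap need not vanish, which is why convergence to local maxima in likelihood can fail.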

         Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition, W. Byrne. Google, Inc, Mountain View, CA, USA, September 2005. Presentation [PDF].

Progress in automatic speech recognition is frequently measured by easily computed, task-neutral measures such as Word Error Rate. Ideally it would be possible to design systems tailored for any application no matter how complex or specialized the performance criteria. Minimum Bayes-Risk (MBR) processing is a modeling framework that attempts to minimize the empirical expected risk under task-specific loss functions that describe desired system behavior. This presentation will describe risk-based recognition and model estimation procedures developed for the refinement of automatic speech recognition systems. The MBR formulation has also made it possible to implement a hybrid estimation and discriminative training approach called Acoustic Code-Breaking. This is a divide-and-conquer strategy that breaks continuous speech recognition problems into a sequence of smaller, distinct subproblems that can be solved independently using specially trained discriminative models such as Support Vector Machines. These estimation and decoding approaches will be described, along with evaluation of their performance on various automatic speech recognition tasks.

         Johns Hopkins University - Cambridge University Chinese-English and Arabic-English 2005 NIST MT Evaluation Systems, S. Kumar, Y. Deng, and W. Byrne. 2005 NIST MT Workshop, Bethesda, MD, USA, June 2005. Presentation [PDF]

         Current Research in Phrase-Based Statistical Machine Translation – and some links to ASR, W. Byrne. Kings College London, UK, May 2005. Presentation [PDF]

         Phrase-based statistical machine translation using finite state machines – with some links to ASR, W. Byrne. University of Washington, Seattle, WA, USA, May 2005. Presentation [PDF]

         Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation. V. Doumpiotis, S. Tsakalidis, and W. Byrne. IEEE Transactions on Speech and Audio Processing, 13(3):367–376 (10 pages), May 2005. Paper [PDF].

Linear transforms have been used extensively for training and adaptation of HMM-based ASR systems. Recently procedures have been developed for the estimation of linear transforms under the Maximum Mutual Information (MMI) criterion. In this paper we introduce discriminative training procedures that employ linear transforms for feature normalization and for speaker adaptive training. We integrate these discriminative linear transforms into MMI estimation of HMM parameters for improvement of large vocabulary conversational speech recognition systems.

         JHU/CUED Chinese-English translation system – 2005 TC-STAR evaluation, S. Kumar, Y. Deng, and W. Byrne. TC-STAR Evaluation Meeting, Trento, Italy, April 2005. Presentation [PDF]

         Current research in phrase-based statistical machine translation and some links to ASR, W. Byrne. Machine Intelligence Laboratory Speech Seminar, Cambridge University Engineering Department, UK, March 2005. Presentation [PDF]

         Current research in phrase-based statistical machine translation and some links to ASR, W. Byrne. Seminar Series, Institute for Collaborative and Communicating Systems and Human Communication Research Centre, University of Edinburgh, UK, January 2005. Presentation [PDF]

2004    Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition. W. Byrne. In Proceedings of the ATR Workshop "Beyond HMMs", Kyoto, Japan, 2004. (6 pages). Paper [PDF], Presentation [PDF].

Minimum risk estimation and decoding strategies based on lattice segmentation techniques can be used to refine large vocabulary continuous speech recognition systems through the estimation of the parameters of the underlying hidden Markov models and through the identification of smaller recognition tasks, which provide the opportunity to incorporate novel modeling and decoding procedures in LVCSR. These techniques are discussed in the context of going beyond HMMs.

         Pinched lattice minimum Bayes risk discriminative training for large vocabulary continuous speech recognition. V. Doumpiotis and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, 2004. (4 pages). Paper [PDF], Presentation [PDF].

Iterative estimation procedures that minimize empirical risk based on general loss functions such as the Levenshtein distance have been derived as extensions of the Extended Baum Welch algorithm. While reducing expected loss on training data is a desirable training criterion, these algorithms can be difficult to apply. They are unlike MMI estimation in that they require an explicit listing of the hypotheses to be considered and in complex problems such lists tend to be prohibitively large. To overcome this difficulty, modeling techniques originally developed to improve search efficiency in Minimum Bayes Risk decoding can be used to transform these estimation algorithms so that exact update, risk minimization procedures can be used for complex recognition problems. Experimental results in two large vocabulary speech recognition tasks show improvements over conventionally trained MMIE models.

         Minimum Bayes-risk decoding for statistical machine translation. S. Kumar and W. Byrne. In Proceedings of HLT-NAACL, pages 169–176 (8 pages), 2004. Paper [PDF], Presentation [PDF].

We present Minimum Bayes-Risk (MBR) decoding for statistical machine translation. This statistical approach aims to minimize expected loss of translation errors under loss functions that measure translation performance. We describe a hierarchy of loss functions that incorporate different levels of linguistic information from word strings, word-to-word alignments from an MT system, and syntactic structure from parse-trees of source and target language sentences. We report the performance of the MBR decoders on a Chinese-to-English translation task. Our results show that MBR decoding can be used to tune statistical MT performance for specific loss functions.
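As a toy illustration of the MBR decision rule described above (not the paper's system: the hypotheses, posteriors, and the word-level Levenshtein loss here are illustrative stand-ins for translation loss functions), the decoder picks the hypothesis with minimum expected loss under the model's posterior rather than the single most probable one:

```python
def levenshtein(a, b):
    # Word-level edit distance via dynamic programming.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def mbr_decode(nbest):
    """nbest: list of (hypothesis, posterior) pairs with posteriors summing to 1.
    Returns the hypothesis minimizing expected Levenshtein loss."""
    best, best_risk = None, float("inf")
    for hyp, _ in nbest:
        risk = sum(p * levenshtein(hyp.split(), ref.split()) for ref, p in nbest)
        if risk < best_risk:
            best, best_risk = hyp, risk
    return best

# The MAP hypothesis ("nice", posterior 0.4) loses to a hypothesis that
# agrees more closely with the rest of the posterior mass.
nbest = [("nice", 0.4), ("rice field", 0.3), ("rice yield", 0.3)]
print(mbr_decode(nbest))  # → rice field
```

Swapping in a task-specific loss (e.g. one computed from alignments or parse trees, as in the paper) only changes the `levenshtein` call; the decision rule is unchanged.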

         Slavic languages in the MALACH project. J. Psutka, J. Hajic, and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 2004. Invited Paper in Special Session on Multilingual Speech Processing (4 pages). Presentation [PDF].

The development of acoustic training material for Slavic languages within the MALACH project is described. Initial experience with the variety of speakers and the difficulties encountered in transcribing Czech, Slovak, and Russian language oral history are described, along with ASR recognition results intended to investigate the effectiveness of different transcription conventions that address language-specific phenomena within the task domain.

         Issues in annotation of the Czech spontaneous speech corpus in the MALACH project. J. Psutka, P. Ircing, J. Hjic, V. Radova, J.V. Psutka, W. Byrne, and S. Gustman. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), 2004. (4 pages). Paper [PDF].

The paper presents the issues encountered in processing spontaneous Czech speech in the MALACH project. Specific problems connected with the frequent occurrence of colloquial words in spontaneous Czech are analyzed; a partial solution is proposed and experimentally evaluated.

         Task-specific minimum Bayes-risk decoding using learned edit distance. I. Shafran and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, 2004. (4 pages). Paper [PDF], Presentation [PDF].

This paper extends the minimum Bayes-risk framework to incorporate a loss function specific to the task and the ASR system. The errors are modeled as a noisy channel and the parameters are learned from the data. The resulting loss function is used in the risk criterion for decoding. Experiments on a large vocabulary conversational speech recognition system demonstrate significant gains of about 1% over the MAP hypothesis and about 0.6% over MBR decoding with the standard Levenshtein distance. The approach is general enough to be applicable to other sequence recognition problems such as Optical Character Recognition (OCR) and the analysis of biological sequences.

         Current research in statistical machine translation and links with automatic speech recognition, W. Byrne. ISM Open Lectures on Statistical Speech Processing, The Institute for Statistical Mathematics, Tokyo, Japan, December 2004. Invited lecture. Presentation [PDF]

         Minimum Bayes risk estimation and decoding in large vocabulary continuous speech recognition, W. Byrne. ATR Workshop "Beyond HMMs", Kyoto, Japan, December 2004. Invited paper and lecture. Presentation [PDF]

         Automatic recognition of spontaneous speech for access to multilingual oral history archives. W. Byrne, D. Doermann, M. Franz, S. Gustman, J. Hajič, D. Oard, M. Picheny, J. Psutka, B. Ramabhadran, D. Soergel, T. Ward, and W.-J. Zhu. IEEE Transactions on Speech and Audio Processing, Special Issue on Spontaneous Speech Processing, pages 420–435 (16 pages), July 2004.

The MALACH project has the goal of developing the technologies needed to facilitate access to large collections of spontaneous speech. Its aim is to dramatically improve the state of the art in key Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) technologies for use in large-scale retrieval systems. The project leverages a unique collection of oral history interviews with survivors of the Holocaust that has been assembled and extensively annotated by the Survivors of the Shoah Visual History Foundation. This paper describes the collection, 116,000 hours of interviews in 32 languages, and the way in which system requirements have been discerned through user studies. It discusses ASR methods for very difficult speech (heavily accented, emotional, and elderly spontaneous speech), including transcription to create training data and methods for language modeling and speaker adaptation. Results are presented for English and Czech. NLP results are presented for named entity tagging, topic segmentation, and supervised topic classification, and the architecture of an integrated search system that uses these results is described.

         The Johns Hopkins University 2004 Chinese-English and Arabic-English MT Evaluation Systems, S. Kumar et al. 2004 NIST MT Workshop, Alexandria, VA, USA, June 2004. Presentation [PDF]

         Segmental minimum Bayes-risk decoding for automatic speech recognition. V. Goel, S. Kumar, and W. Byrne. IEEE Transactions on Speech and Audio Processing, 12:234–249 (16 pages), May 2004. http://dx.doi.org/10.1109/TSA.2004.825678.

Minimum Bayes-Risk (MBR) speech recognizers have been shown to yield improvements over conventional maximum a-posteriori probability (MAP) decoders through N-best list rescoring and search over word lattices. We present a Segmental Minimum Bayes-Risk (SMBR) decoding framework that simplifies the implementation of MBR recognizers through the segmentation of the N-best lists or lattices over which the recognition is to be performed. This paper presents lattice cutting procedures that underlie SMBR decoding. Two of these procedures are based on a risk minimization criterion while a third is guided by word-level confidence scores. In conjunction with SMBR decoding, these lattice segmentation procedures give consistent improvements in recognition word error rate (WER) on the Switchboard corpus. We also discuss an application of risk-based lattice cutting to multiple-system SMBR decoding and show that it is related to other system combination techniques such as ROVER. This strategy combines lattices produced from multiple ASR systems and is found to give WER improvements in a Switchboard evaluation system.
Correction Available: In our recently published paper, we presented a risk-based lattice cutting procedure to segment ASR word lattices into smaller sub-lattices as a means to improve the efficiency of Minimum Bayes-Risk (MBR) rescoring. In the experiments reported, some of the hypotheses in the original lattices were inadvertently discarded during segmentation, and this affected MBR performance adversely. This note gives the corrected results as well as experiments demonstrating that the segmentation process does not discard any paths from the original lattice.

         Minimum Risk Estimation and Decoding for Speech and Language Processing, W. Byrne. Microsoft Research, Redmond, Washington, USA, February 2004

         Minimum Risk Estimation and Decoding for Speech and Language Processing, W. Byrne. Speech Analysis and Interpretation Laboratory, University of Southern California School of Engineering, Los Angeles, CA, USA, February 2004

         Minimum Risk Estimation and Decoding for Speech and Language Processing, W. Byrne. Signal, Speech and Language Interpretation Lab, University of Washington, Seattle, WA, USA, February 2004

2003    The Johns Hopkins University 2003 Chinese-English Machine Translation System. W. Byrne, S. Khudanpur, W. Kim, S. Kumar, P. Pecina, P. Virga, P. Xu, and D. Yarowsky. In Machine Translation Summit IX. The Association for Machine Translation in the Americas, 2003. (4 pages). Paper [PDF], Presentation [PDF].

We describe a Chinese to English Machine Translation system developed at the Johns Hopkins University for the NIST 2003 MT evaluations. The system is based on a Weighted Finite State Transducer implementation of the alignment template translation model for statistical machine translation. The baseline MT system was trained using 100,000 sentence pairs selected from a static bitext training collection. Information retrieval techniques were then used to create specific training collections for each document to be translated. This document-specific training set included bitext and named entities that were then added to the baseline system by augmenting the library of alignment templates. We report translation performance of baseline and IR-based systems on two NIST MT evaluation test sets.

         Discriminative training for segmental minimum Bayes-risk decoding. V. Doumpiotis, S. Tsakalidis, and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 2003. (4 pages). Paper [PDF], Presentation [PDF].

A modeling approach is presented that incorporates discriminative training procedures within segmental Minimum Bayes-Risk decoding (SMBR). SMBR is used to segment lattices produced by a general automatic speech recognition (ASR) system into sequences of separate decision problems involving small sets of confusable words. Acoustic models specialized to discriminate between the competing words in these classes are then applied in subsequent SMBR rescoring passes. Refinement of the search space that allows the use of specialized discriminative models is shown to be an improvement over rescoring with conventionally trained discriminative models.

         Lattice segmentation and minimum Bayes risk discriminative training. V. Doumpiotis, S. Tsakalidis, and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2003. (4 pages). Paper [PDF], Presentation [PDF].

Modeling approaches are presented that incorporate discriminative training procedures in segmental Minimum Bayes-Risk decoding (SMBR). SMBR is used to segment lattices produced by a general automatic speech recognition (ASR) system into sequences of separate decision problems involving small sets of confusable words. We discuss two approaches to incorporating these segmented lattices in discriminative training. We investigate the use of acoustic models specialized to discriminate between the competing words in these classes which are then applied in subsequent SMBR rescoring passes. Refinement of the search space that allows the use of specialized discriminative models is shown to be an improvement over rescoring with conventionally trained discriminative models.

         Minimum Bayes-risk automatic speech recognition. V. Goel and W. Byrne. In W. Chou and B.-H. Juang, editors, Pattern Recognition in Speech and Language Processing, pages 51–77 (27 pages). CRC Press, 2003

         Issues in recognition of Spanish-accented spontaneous English. A. Ikeno, B. Pellom, D. Cer, A. Thornton, J. M. Brenier, D. Jurafsky, W. Ward, and W. Byrne. In Proceedings of the ISCA and IEEE workshop on Spontaneous Speech Processing and Recognition, Tokyo Institute of Technology, Tokyo, Japan, 2003. ISCA and IEEE. (4 pages). Paper [PDF].

We describe a recognition experiment and two analytic experiments on a database of strongly Hispanic-accented English. We show the crucial importance of training on the Hispanic-accented data for acoustic model performance, and describe the tendency of Spanish-accented speakers to use longer, and presumably less-reduced, schwa vowels than native-English speakers.

         A generative probabilistic OCR model for NLP applications. O. Kolak, W. Byrne, and P. Resnik. In Proceedings of HLT-NAACL, pages 55–62 (8 pages), 2003. Paper [PDF].

In this paper we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make them more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model’s ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.
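A minimal sketch of the noisy-channel decision rule underlying such OCR post-correction (not the paper's finite-state implementation: the vocabulary, probabilities, and per-character channel below are toy assumptions):

```python
import math

# Hypothetical language model: candidate true words and unigram probabilities.
LM = {"hello": 0.6, "hollo": 0.05, "help": 0.35}

def channel_char(obs_c, true_c):
    # Toy per-character channel: high probability of faithful transmission,
    # a small uniform confusion mass otherwise (illustrative numbers).
    return 0.9 if obs_c == true_c else 0.1

def channel(obs, true):
    # Independent per-character channel; same-length candidates only,
    # for simplicity (a real model would also handle insertions/deletions).
    if len(obs) != len(true):
        return 0.0
    return math.prod(channel_char(o, t) for o, t in zip(obs, true))

def correct(obs):
    # Noisy-channel decoding: argmax over w of P(w) * P(obs | w).
    return max(LM, key=lambda w: LM[w] * channel(obs, w))

# The language model overrules the verbatim OCR output "hollo".
print(correct("hollo"))  # → hello
```

The same argmax structure carries over to the finite-state setting, where the language model and channel become weighted transducers and decoding is a shortest-path computation over their composition.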

         A weighted finite state transducer implementation of the alignment template model for statistical machine translation. S. Kumar and W. Byrne. In Proceedings of HLT-NAACL, pages 63 – 70 (8 pages), 2003. Paper [PDF], Presentation [PDF].

We present a derivation of the alignment template model for statistical machine translation and an implementation of the model using weighted finite state transducers. The approach we describe allows us to implement each constituent distribution of the model as a weighted finite state transducer or acceptor. We show that bitext word alignment and translation under the model can be performed with standard FSM operations involving these transducers. One of the benefits of using this framework is that it obviates the need to develop specialized search procedures, even for the generation of lattices or N-Best lists of bitext word alignments and translation hypotheses. We evaluate the implementation of the model on the French-to-English Hansards task and report alignment and translation performance.
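The "standard FSM operation" that extracts the best hypothesis from the composed machines is a shortest-path computation over the tropical semiring. A toy sketch (the lattice, words, and costs are invented for illustration, not taken from the paper):

```python
import heapq

# A toy translation lattice: state -> list of (next_state, word, cost),
# with costs as negative log probabilities (tropical semiring).
lattice = {
    0: [(1, "the", 0.1), (1, "a", 0.9)],
    1: [(2, "house", 0.3), (2, "home", 0.5)],
    2: [],
}

def shortest_path(lattice, start=0, final=2):
    # Dijkstra search: the minimum over paths of summed arc costs,
    # i.e. shortest distance in the tropical semiring.
    heap = [(0.0, start, [])]
    seen = set()
    while heap:
        cost, state, words = heapq.heappop(heap)
        if state == final:
            return words, cost
        if state in seen:
            continue
        seen.add(state)
        for nxt, word, c in lattice[state]:
            heapq.heappush(heap, (cost + c, nxt, words + [word]))
    return None

words, cost = shortest_path(lattice)
print(" ".join(words), round(cost, 2))  # → the house 0.4
```

In a toolkit such as OpenFst this corresponds to composing the component transducers and calling the library's shortest-path operation; N-best lists fall out of the same search by retaining more than one path into the final state.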

         Desperately seeking Cebuano. D. Oard, D. Doermann, B. Dorr, D. He, P. Resnik, W. Byrne, S. Khudanpur, D. Yarowsky, A. Leuski, P. Koehn, and K. Knight. In Proceedings of HLT-NAACL, 2003. (3 pages). Paper [PDF].

This paper describes an effort to rapidly develop language resources and component technology to support searching Cebuano news stories using English queries. Results from the first 60 hours of the exercise are presented.

         Building LVCSR systems for transcription of spontaneously produced Russian witnesses in the MALACH project: initial steps and first results. J. Psutka, I. Iljuchin, P. Ircing, J.V. Psutka, V. Trejbal, W. Byrne, J. Hajic, and S. Gustman. In Proceedings of the Text, Speech, and Dialog Workshop, pages 214–219 (6 pages), 2003.

The MALACH project uses the world’s largest digital archive of video oral histories collected by the Survivors of the Shoah Visual History Foundation (VHF) and attempts to access such archives by advancing the state-of-the-art in Automatic Speech Recognition and Information Retrieval. This paper discusses the initial steps and first results in building large vocabulary continuous speech recognition (LVCSR) systems for the transcription of Russian witnesses. As the third language processed in the MALACH project (following English and Czech), Russian has posed new ASR challenges, especially in phonetic modeling. Although most of the Russian testimonies were provided by native Russian survivors, the speakers come from many different regions and countries, resulting in a diverse collection of accented spontaneous Russian speech.

         Towards automatic transcription of spontaneous Czech speech in the MALACH project. J. Psutka, P. Ircing, J. V. Psutka, V. Radova, W. Byrne, J. Hajic, and S. Gustman. In Proceedings of the Text, Speech, and Dialog Workshop, pages 327–332 (6 pages), 2003. Paper [PDF].

Our paper discusses the progress achieved during a one-year effort in building the Czech LVCSR system for the automatic transcription of spontaneously produced testimonies in the MALACH project. The difficulty of this task stems from the highly inflectional nature of the Czech language and is further multiplied by the presence of many colloquial words in spontaneous Czech speech as well as by the need to handle emotional speech filled with disfluencies, heavy accents, age-related coarticulation and language switching. In this paper we concentrate mainly on the acoustic modeling issues - the proper choice of front-end parameterization, the handling of non-speech events in acoustic modeling, and unsupervised acoustic adaptation via MLLR. A method for selecting suitable language modeling data is also briefly discussed.

         Large vocabulary ASR for spontaneous Czech in the MALACH project. J. Psutka, P. Ircing, J.V. Psutka, V. Radovic, W. Byrne, J. Hajic, Jiri Mirovsky, and Samuel Gustman. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2003. (4 pages). Paper [PDF].

This paper describes LVCSR research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. This project attempts to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automated speech recognition. We describe a baseline ASR system and discuss the problems in language modeling that arise from the nature of Czech as a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The difficulties of this task are compounded by heavily accented, emotional and disfluent speech along with frequent switching between languages. To overcome the limited amount of relevant language model data we use statistical techniques for selecting an appropriate training corpus from a large unstructured text collection, resulting in significant reductions in word error rate. The project ultimately aims to use recognition and retrieval techniques to improve cataloging efficiency and eventually to provide direct access into the archive itself.

         Support vector machines for segmental minimum Bayes risk decoding of continuous speech. V. Venkataramani, S. Chakrabartty, and W. Byrne. In IEEE Automatic Speech Recognition and Understanding Workshop, 2003. (6 pages). Paper [PDF], Presentation [PDF].

Segmental Minimum Bayes Risk (SMBR) decoding involves the refinement of the search space into manageable confusion sets, i.e., smaller sets of confusable words. We describe the application of Support Vector Machines (SVMs) as discriminative models for the refined search space. We show that SVMs, which in their basic formulation are binary classifiers of fixed-dimensional observations, can be used for continuous speech recognition. We also study the use of GiniSVMs, a variant of the basic SVM. On a small vocabulary task, we show this two-pass scheme outperforms MMI-trained HMMs. Using system combination we also obtain further improvements over discriminatively trained HMMs.

         The Johns Hopkins University 2003 Chinese-English machine translation system, W. Byrne, S. Khudanpur, W. Kim, S. Kumar, P. Pecina, P. Virga, P. Xu, and D. Yarowsky. 2003 NIST MT Workshop, Gaithersburg, MD, USA, June 2003. Presentation [PDF]

         Minimum Bayes-Risk Estimation and Decoding Procedures for Speech and Language Processing, W. Byrne. University of Edinburgh, UK, May 2003

2002    The Johns Hopkins University 2002 Large Vocabulary Conversational Speech Recognition System, W. Byrne, V. Doumpiotis, S. Kumar, S. Tsakalidis, and V. Venkataramani. NIST 2002 Rich Transcription Workshop, Vienna, VA, USA, 2002. Presentation [PDF]

         Supporting access to large digital oral history archives. S. Gustman, D. Soergel, D. Oard, W. Byrne, M. Picheny, B. Ramabhadran, and D. Greenberg. In Proceedings of the Joint Conference on Digital Libraries, 2002. (10 pages). Paper [PDF].

This paper describes our experience with the creation, indexing, and provision of access to a very large archive of videotaped oral histories - 116,000 hours of digitized interviews in 32 languages from 52,000 survivors, liberators, rescuers, and witnesses of the Nazi Holocaust. It goes on to identify a set of critical research issues that must be addressed if we are to provide full and detailed access to collections of this size: issues in user requirement studies, automatic speech recognition, automatic classification, segmentation, summarization, retrieval, and user interfaces. The paper ends by inviting others to discuss use of these materials in their own research.

         Minimum Bayes-risk alignment of bilingual texts. S. Kumar and W. Byrne. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 140–147 (8 pages), Philadelphia, PA, USA, 2002. Paper [PDF], Presentation [PDF].

We present Minimum Bayes-Risk word alignment for machine translation. This statistical, model-based approach attempts to minimize the expected risk of alignment errors under loss functions that measure alignment quality. We describe various loss functions, including some that incorporate linguistic analysis as can be obtained from parse trees, and show that these approaches can improve alignments of the English-French Hansards.

         Risk based lattice cutting for segmental minimum Bayes-risk decoding. S. Kumar and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, Denver, Colorado, USA, 2002. (4 pages). Paper [PDF], Presentation [PDF].

Minimum Bayes-Risk (MBR) speech recognizers have been shown to give improvements over the conventional maximum a-posteriori probability (MAP) decoders through N-best list rescoring and A-star search over word lattices. Segmental MBR (SMBR) decoders simplify the implementation of MBR recognizers by segmenting the N-best lists or lattices over which the recognition is performed. We present a lattice cutting procedure that attempts to minimize the total Bayes-Risk of all word strings in the segmented lattice. We provide experimental results on the Switchboard conversational speech corpus showing that this segmentation procedure, in conjunction with SMBR decoding, gives modest but significant improvements over MAP decoders as well as MBR decoders on unsegmented lattices.

         Cross-language access to recorded speech in the MALACH project. D. Oard, D. Demner-Fushman, J. Hajic, B. Ramabhadran, S. Gustman, W. Byrne, D. Soergel, B. Dorr, P. Resnik, and M. Picheny. In Proceedings of the Text, Speech, and Dialog Workshop, 2002. (8 pages). Paper [PDF].

The MALACH project seeks to help users find information in a vast multilingual collection of untranscribed oral history interviews. This paper introduces the goals of the project and focuses on supporting access by users who are unfamiliar with the interview language. It begins with a review of the state of the art in cross-language speech retrieval; approaches that will be investigated in the project are then described. Czech was selected as the first non-English language to be supported; results of an initial experiment with Czech/English cross-language retrieval are reported.

         Automatic transcription of Czech language oral history in the MALACH project: Resources and initial experiments. J. Psutka, P. Ircing, J. Psutka, V. Radova, W. Byrne, J. Hajic, S. Gustman, and B. Ramabhadran. In Proceedings of the Text, Speech, and Dialog Workshop, 2002. (8 pages). Paper [PDF].

In this paper we describe the initial stages of the ASR component of the MALACH project. This project will attempt to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation by advancing the state of the art in automated speech recognition. In order to train the ASR system, it is necessary to manually transcribe a large amount of speech data, identify the appropriate vocabulary, and obtain relevant text for language modeling. We give a detailed description of the speech annotation process; show the specific properties of the spontaneous speech contained in the archives; and present baseline speech recognition results.

         Discriminative linear transforms for feature normalization and speaker adaptation in HMM estimation. S. Tsakalidis, V. Doumpiotis, and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, Denver, Colorado, USA, 2002. (5 pages). Paper [PDF], Presentation [PDF].

Linear transforms have been used extensively for training and adaptation of HMM-based ASR systems. Recently procedures have been developed for the estimation of linear transforms under the Maximum Mutual Information (MMI) criterion. In this paper we introduce discriminative training procedures that employ linear transforms for feature normalization and for speaker adaptive training. We integrate these discriminative linear transforms into MMI estimation of HMM parameters for improvement of large vocabulary conversational speech recognition systems.

         Lexicon adaptation for LVCSR: speaker idiosyncracies, non-native speakers, and pronunciation choice. W. Ward, H. Krech, X. Yu, K. Herold, G. Figgs, A. Ikeno, D. Jurafsky, and W. Byrne. In ISCA ITR Workshop on Pronunciation Modeling and Lexicon Adaptation, 2002. (4 pages). Paper [PDF].

We report on our preliminary experiments on building dynamic lexicons for native-speaker conversational speech and for foreign-accented conversational speech. Our goal is to build a lexicon with a set of pronunciations for each word, in which the probability distribution over pronunciation is dynamically computed. The set of pronunciations are derived from hand-written rules (for foreign accent) or clustering (for phonetically-transcribed Switchboard data). The dynamic pronunciation-probability will take into account specific characteristics of the speaker as well as factors such as language-model probability, disfluencies, sentence position, and phonetic context.

         MALACH: Multilingual Access to Large Spoken Archives, W. Byrne. AT&T Speech Days, Florham Park, NY, USA, October 2002. Invited talk. http://www.research.att.com/conf/spchday2002/program.html

         Mandarin pronunciation modeling based on the CASS corpus. F. Zheng, Z. Song, P. Fung, and W. Byrne. Journal of Computer Science and Technology (Science Press, Beijing, China), 17(3), May 2002. (16 pages). Paper [PDF].

Pronunciation variability is an important issue that must be addressed when developing practical automatic spontaneous speech recognition systems. In this paper, the factors that may affect recognition performance are analyzed, including those specific to the Chinese language. By studying the INITIAL/FINAL (IF) characteristics of the Chinese language and developing the Bayesian equation, we propose the concepts of generalized INITIAL/FINAL (GIF) and generalized syllable (GS), the GIF modeling and the IF-GIF modeling, as well as the context-dependent pronunciation weighting, based on a well phonetically transcribed seed database. By using these methods, the Chinese syllable error rate (SER) was reduced by 6.3% and 4.2% compared with the GIF modeling and IF modeling respectively when a language model, such as a syllable or word N-gram, is not used. The effectiveness of these methods is also demonstrated when more data without phonetic transcription is used to refine the acoustic model using the proposed iterative force-alignment based transcribing (IFABT) method, achieving a 5.7% SER reduction.

2001    Automatic generation of pronunciation lexicons for Mandarin casual speech. W. Byrne, V. Venkataramani, T. Kamm, T.F. Zheng, Z. Song, P. Fung, Y. Liu, and U. Ruhi. In IEEE Conference on Acoustics, Speech and Signal Processing, volume 1, pages 569–572 (4 pages), Salt Lake City, Utah, 2001. IEEE. Paper [PDF].

Pronunciation modeling for large vocabulary speech recognition attempts to improve recognition accuracy by identifying and modeling pronunciations that are not in the ASR system's pronunciation lexicon. Pronunciation variability in spontaneous Mandarin is studied using the newly created CASS corpus of phonetically annotated spontaneous speech. Pronunciation modeling techniques developed in English are applied to this corpus to train pronunciation models which are then applied in Mandarin Broadcast News transcription.

         Confidence based lattice segmentation and minimum Bayes-risk decoding. V. Goel, S. Kumar, and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), volume 4, pages 2569–2572 (4 pages), Aalborg, Denmark, 2001. Paper [PDF].

Minimum Bayes Risk (MBR) speech recognizers have been shown to yield improvements over the conventional maximum a-posteriori probability (MAP) decoders in the context of N-best list rescoring and search over recognition lattices. Segmental MBR (SMBR) procedures have been developed to simplify implementation of MBR recognizers, by segmenting the N-best list or lattice, to reduce the size of the search space over which MBR recognition is carried out. In this paper we describe lattice cutting as a method to segment recognition word lattices into regions of low confidence and high confidence. We present two SMBR decoding procedures that can be applied on low confidence segment sets. Results obtained on the Switchboard conversational telephone speech corpus show modest but significant improvements relative to MAP decoders.
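To make the segmentation idea concrete, here is a hedged toy sketch, not the paper's lattice-cutting algorithm: operating on a confusion network (a linear sequence of word/posterior slots), it pins slots whose best word is confident and groups runs of low-confidence slots into segment sets on which SMBR decoding would be run. The network and threshold are invented for illustration.

```python
def segment_lattice(slots, threshold=0.9):
    """slots: confusion-network slots, each a dict word -> posterior.
    Returns a list of ("high", [slot]) and ("low", [slot, ...]) segments."""
    segments, current = [], []
    for slot in slots:
        best_p = max(slot.values())
        if best_p >= threshold:
            if current:                      # close a run of low-confidence slots
                segments.append(("low", current))
                current = []
            segments.append(("high", [slot]))
        else:
            current.append(slot)
    if current:
        segments.append(("low", current))
    return segments

cn = [{"the": 0.98, "a": 0.02},
      {"cat": 0.55, "hat": 0.45},
      {"sat": 0.60, "sad": 0.40},
      {"down": 0.95}]
print([kind for kind, _ in segment_lattice(cn)])  # -> ['high', 'low', 'high']
```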

         Convergence of DLLR rapid speaker adaptation algorithms. A. Gunawardana and W. Byrne. In ISCA ITR-Workshop on Adaptation Methods for Automatic Speech Recognition, 2001. (4 pages). Paper [PDF].

Discounted Likelihood Linear Regression (DLLR) is a speaker adaptation technique for cases where there is insufficient data for MLLR adaptation. Here, we provide an alternative derivation of DLLR by using a censored EM formulation which postulates additional adaptation data which is hidden. This derivation shows that DLLR, if allowed to converge, provides maximum likelihood solutions. Thus the robustness of DLLR to small amounts of data is obtained by slowing down the convergence of the algorithm and by allowing termination of the algorithm before overtraining occurs. We then show that discounting the observed adaptation data by postulating additional hidden data can also be extended to MAP estimation of MLLR-type adaptation transformations.

         Discriminative speaker adaptation with conditional maximum likelihood linear regression. A. Gunawardana and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2001. (4 pages). Paper [PDF].

We present a simplified derivation of the extended Baum-Welch procedure, which shows that it can be used for Maximum Mutual Information (MMI) estimation of a large class of continuous emission density hidden Markov models (HMMs). We use the extended Baum-Welch procedure for discriminative estimation of MLLR-type speaker adaptation transformations. The resulting adaptation procedure, termed Conditional Maximum Likelihood Linear Regression (CMLLR), is used successfully for supervised and unsupervised adaptation tasks on the Switchboard corpus, yielding an improvement over MLLR. The interaction of unsupervised CMLLR with segmental minimum Bayes risk lattice voting procedures is also explored, showing that the two procedures are complementary.

         On large vocabulary continuous speech recognition of highly inflectional language - Czech. P. Ircing, P. Krbec, J. Hajic, S. Khudanpur, F. Jelinek, J. Psutka, and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2001. (4 pages)

         MLLR adaptation techniques for pronunciation modeling. V. Venkataramani and W. Byrne. In IEEE Workshop on Automatic Speech Recognition and Understanding, Madonna di Campiglio, Italy, 2001. (4 pages). Paper [PDF].

Multiple regression class MLLR transforms are investigated for use with pronunciation models that predict variation in the observed pronunciations given the phonetic context. Regression classes can be constructed so that MLLR transforms can be estimated and used to model specific acoustic changes associated with pronunciation variation. The effectiveness of this modeling approach is evaluated on the phonetically transcribed portion of the SWITCHBOARD conversational speech corpus.

         Modeling pronunciation variation using context-dependent weighting and B/S refined acoustic modeling. F. Zheng, Z. Song, P. Fung, and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2001. (4 pages).

Pronunciation variability is an important issue that must be addressed when developing practical automatic spontaneous speech recognition systems. By studying the initial/final (IF) characteristics of the Chinese language and developing the Bayesian equation, we propose the concepts of generalized initial/final (GIF) and generalized syllable (GS), the GIF modeling method and the IF-GIF modeling method, as well as the context-dependent pronunciation weighting method. By using these approaches, the IF-GIF modeling reduces the Chinese syllable error rate (SER) by 6.3% and 4.2% compared with the GIF modeling and IF modeling respectively when language modeling, such as a syllable or word N-gram, is not used.

         Minimum Bayes-Risk Automatic Speech Recognition, W. Byrne. University of Colorado, Boulder, CO, USA, November 2001

         Minimum Bayes-Risk Automatic Speech Recognition, W. Byrne. Signal, Speech and Language Interpretation Lab, University of Washington, Seattle, WA, USA, June 2001

         Discounted likelihood linear regression for rapid speaker adaptation. A. Gunawardana and W. Byrne. Computer Speech and Language, 15(1):15–38 (24 pages), Jan 2001.

The widely used maximum likelihood linear regression speaker adaptation procedure suffers from overtraining when used for rapid adaptation tasks in which the amount of adaptation data is severely limited. This is a well known difficulty associated with the expectation-maximization algorithm. We use an information geometric analysis of the expectation-maximization algorithm as an alternating minimization of a Kullback-Leibler-type divergence to see the cause of this difficulty, and propose a more robust discounted likelihood estimation procedure. This gives rise to a discounted likelihood linear regression procedure, which is a variant of maximum likelihood linear regression suited for small adaptation sets. Our procedure is evaluated on an unsupervised rapid adaptation task defined on the Switchboard conversational telephone speech corpus, where our proposed procedure improves word error rate by 1.6% (absolute) with as little as five seconds of adaptation data, which is a situation in which maximum likelihood linear regression overtrains in the first iteration of adaptation. We compare several realizations of discounted likelihood linear regression with maximum likelihood linear regression and other simple maximum likelihood linear regression variants, and discuss issues that arise in implementing our discounted likelihood procedures.
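As a hedged toy sketch of the MLLR-style transform that discounted likelihood regularizes (not the paper's procedure itself): MLLR estimates an affine map of Gaussian means, mu' = A mu + b. Assuming identity covariances, the estimate reduces to an occupancy-weighted least-squares fit with a closed form. The means, shift, and counts below are invented.

```python
import numpy as np

def mllr_transform(means, data_means, gammas):
    """Fit W = [A | b] so W @ [mu; 1] matches observed adaptation-data means.
    means: (G, D) model means; data_means: (G, D) per-Gaussian weighted means
    of adaptation frames; gammas: (G,) occupation counts.  Identity covariances
    are assumed, reducing MLLR to weighted least squares."""
    G, D = means.shape
    xi = np.hstack([means, np.ones((G, 1))])        # extended means [mu; 1]
    lhs = (gammas[:, None] * data_means).T @ xi      # sum_g gamma_g xbar_g xi_g^T
    rhs = (gammas[:, None] * xi).T @ xi              # sum_g gamma_g xi_g xi_g^T
    return lhs @ np.linalg.inv(rhs)                  # W: (D, D+1)

# toy check: adaptation data generated by a known shift is recovered exactly
mu = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
shift = np.array([0.5, -0.25])
W = mllr_transform(mu, mu + shift, np.array([10.0, 5.0, 8.0]))
adapted = (W @ np.hstack([mu, np.ones((3, 1))]).T).T
print(np.allclose(adapted, mu + shift))              # -> True
```

With only seconds of data the per-Gaussian counts are tiny and this fit overtrains, which is the failure mode the discounted likelihood procedure above is designed to slow down.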

2000    Towards language independent acoustic modeling. W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and W. Wang. In IEEE Conference on Acoustics, Speech and Signal Processing, pages 1029–1032 (4 pages), Istanbul, Turkey, 2000. IEEE. Paper [PDF].

We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge-based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.

         Morpheme based language models for speech recognition of Czech. William J. Byrne, Jan Hajic, Pavel Krbec, Pavel Ircing, and Josef Psutka. In TDS ’00: Proceedings of the Third International Workshop on Text, Speech and Dialogue, pages 211–216 (6 pages), London, UK, 2000. Springer-Verlag

         Minimum Bayes-Risk automatic speech recognition. V. Goel and W. Byrne. Computer Speech and Language, 14(2):115–135 (21 pages), 2000.

In this paper we address the problem of efficient implementation of the minimum Bayes-risk classifiers for automatic speech recognition. Simplifying assumptions that allow computationally feasible approximations to these classifiers are proposed. Under these assumptions an approximate implementation as an A-star search algorithm over recognition lattices is constructed. This algorithm improves upon the previously proposed N-best list rescoring implementation of these classifiers. The minimum Bayes-risk classifiers are shown to outperform the most commonly used maximum a-posteriori probability (MAP) classifier on three speech recognition tasks: reduction of word error rate, reduction of content word error rate, and identification of Named Entities in speech. The A-star implementation is also contrasted with the N-best list rescoring implementation and is found to obtain modest but significant improvements in accuracy with little computational overhead.

         Segmental minimum Bayes-risk ASR voting strategies. V. Goel, S. Kumar, and W. Byrne. In Proc. of the International Conference on Spoken Language Processing, volume 3, pages 139–142 (4 pages), Beijing, China, 2000. Paper [PDF].

ROVER and its successor voting procedures have been shown to be quite effective in reducing the recognition word error rate (WER). The success of these methods has been attributed to their minimum Bayes-risk (MBR) nature: they produce the hypothesis with the least expected word error. In this paper we develop a general procedure within the MBR framework, called segmental MBR recognition, that encompasses current voting techniques and allows further extensions that yield lower expected WER. It also allows incorporation of loss functions other than the WER. We present a derivation of the voting procedure of N-best ROVER as an instance of segmental MBR recognition. We then present an extension, called e-ROVER, that alleviates some of the restrictions of N-best ROVER by better approximating the WER. e-ROVER is compared with N-best ROVER on a multi-lingual acoustic modeling task and is shown to yield modest yet significant and easily obtained improvements.
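A hedged sketch of the voting step, assuming the hard part (aligning hypotheses into correspondence slots) has already been done: each slot is decided by a weighted plurality vote, which is the minimum-Bayes-risk choice under a slot-wise 0/1 loss. The hypotheses and weights are invented; this is not the e-ROVER algorithm.

```python
from collections import Counter

def rover_vote(aligned_hyps, weights=None):
    """aligned_hyps: equal-length word sequences already aligned into slots,
    with '' marking a deletion; weights: optional per-system confidences.
    Returns the word sequence chosen by slot-wise weighted voting."""
    if weights is None:
        weights = [1.0] * len(aligned_hyps)
    result = []
    for slot in zip(*aligned_hyps):
        votes = Counter()
        for word, w in zip(slot, weights):
            votes[word] += w
        winner = max(votes, key=votes.get)
        if winner:                       # drop slots where the deletion wins
            result.append(winner)
    return result

hyps = [["the", "cat", "sat", ""],
        ["the", "hat", "sat", "down"],
        ["the", "cat", "",    "down"]]
print(rover_vote(hyps))  # -> ['the', 'cat', 'sat', 'down']
```

Note that the voted output can differ from every individual system's hypothesis, which is where the WER reductions come from.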

         Robust estimation for rapid adaptation using discounted likelihood techniques. A. Gunawardana and W. Byrne. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 2000. (4 pages). Paper [PDF].

The discounted likelihood procedure, which is a robust extension of the usual EM procedure, is presented, and two approximations which lead to two different variants of the usual MLLR adaptation scheme are introduced. These schemes are shown to robustly estimate speaker adaptation transforms with very little data. The evaluation is carried out on the Switchboard corpus.

         CASS: A phonetically transcribed corpus of Mandarin spontaneous speech. A. Li, F. Zheng, W. Byrne, P. Fung, T. Kamm, Y. Liu, Z. Song, U. Ruhi, V. Venkataramani, and X. Chen. In Proc. of the International Conference on Spoken Language Processing, 2000. (4 pages). Paper [PDF].

A corpus of Chinese spoken language has been collected and phonetically annotated to capture spontaneous speech and language effects. The Chinese Annotated Spontaneous Speech (CASS) corpus contains phonetically transcribed spontaneous speech. This corpus was created to begin collecting samples of most of the phonetic variations in Mandarin spontaneous speech due to pronunciation effects, including allophonic changes, phoneme reduction, phoneme deletion and insertion, as well as duration changes. It is intended for use in pronunciation modeling for improved automatic speech recognition and will be used at the 2000 Johns Hopkins University Language Engineering Workshop by the project on Pronunciation Modeling of Mandarin Casual Speech.

         On the incremental addition of regression classes for speaker adaptation. J. McDonough and W. Byrne. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 2000. (4 pages)

         Minimum risk acoustic clustering for multilingual acoustic model combination. D. Vergyri, S. Tsakalidis, and W. Byrne. In International Conference on Spoken Language Processing, 2000. (4 pages). Paper [PDF].

In this paper we describe procedures for combining multiple acoustic models, obtained using training corpora from different languages, in order to improve ASR performance in languages for which large amounts of training data are not available. We treat these models as multiple sources of information whose scores are combined in a log-linear model to compute the hypothesis likelihood. The model combination can either be performed in a static way, with constant combination weights, or in a dynamic way, with parameters that can vary for different segments of a hypothesis. The aim is to optimize the parameters so as to achieve minimum word error rate. In order to achieve robust parameter estimation in the dynamic combination case, the parameters are defined to be piecewise constant on different phonetic classes that form a partition of the space of hypothesis segments. The partition is defined, using phonological knowledge, on segments that correspond to hypothesized phones. We examine different ways to define such a partition, including an automatic approach that gives a binary tree structured partition which tries to achieve the minimum WER with the minimum number of classes.
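The static combination case can be sketched as follows, as a hedged illustration rather than the paper's system: each model contributes a log-probability per hypothesis, and the combined score is the weighted sum sum_i lambda_i log p_i. The hypothesis names and scores are invented.

```python
import math

def loglinear_combine(hyp_scores, lambdas):
    """hyp_scores: dict hypothesis -> list of per-model log-probabilities;
    lambdas: one combination weight per model.  Returns the hypothesis with
    the highest weighted log-linear score."""
    def score(logps):
        return sum(l * lp for l, lp in zip(lambdas, logps))
    return max(hyp_scores, key=lambda h: score(hyp_scores[h]))

# toy example: two acoustic models scoring three hypotheses
scores = {
    "hyp_a": [math.log(0.5), math.log(0.1)],
    "hyp_b": [math.log(0.3), math.log(0.4)],
    "hyp_c": [math.log(0.2), math.log(0.5)],
}
print(loglinear_combine(scores, [1.0, 0.0]))  # -> hyp_a (model 1 alone)
print(loglinear_combine(scores, [0.5, 0.5]))  # -> hyp_b (equal weights)
```

The dynamic case described above generalizes this by letting the lambdas vary with the phonetic class of each hypothesis segment, with the weights tuned toward minimum WER rather than fixed by hand.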

         Comments on ’Efficient training algorithms for HMM’s using incremental estimation’. W. Byrne and A. Gunawardana. IEEE Transactions on Speech and Audio Processing, 8(6):751–754 (4 pages), Nov 2000.

“Efficient Training Algorithms for HMM’s using Incremental Estimation” investigates EM procedures that increase training speed. The authors’ claim that these are GEM procedures is incorrect. We discuss why this is so, provide an example of non-monotonic convergence to a local maximum in likelihood, and outline conditions that guarantee such convergence.

         Discounted likelihood linear regression for rapid speaker adaptation, W. Byrne. Tsinghua University, Beijing, China, October 2000. Presentation [PDF]

1999    Towards language independent acoustic modeling. W. Byrne, P. Beyerlein, J. Huerta, S. Khudanpur, B. Marthi, J. Morgan, N. Peterek, J. Picone, D. Vergyri, and W. Wang. In IEEE Workshop on Automatic Speech Recognition and Understanding, Keystone, Colorado, 1999. (4 pages). Paper [PDF].

We describe procedures and experimental results using speech from diverse source languages to build an ASR system for a single target language. This work is intended to improve ASR in languages for which large amounts of training data are not available. We have developed both knowledge based and automatic methods to map phonetic units from the source languages to the target language. We employed HMM adaptation techniques and Discriminative Model Combination to combine acoustic models from the individual source languages for recognition of speech in the target language. Experiments are described in which Czech Broadcast News is transcribed using acoustic models trained from small amounts of Czech read speech augmented by English, Spanish, Russian, and Mandarin acoustic models.

         Convergence of EM variants. W. Byrne and A. Gunawardana. In IEEE Information Theory Workshop on Detection, Estimation, Classification, and Imaging, page 64 (1 page), 1999. Paper [PDF]

         Discounted likelihood linear regression for rapid adaptation. W. Byrne and A. Gunawardana. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 1999. (4 pages). Paper [PDF].

Rapid adaptation schemes that employ the EM algorithm may suffer from overtraining problems when used with small amounts of adaptation data. An algorithm to alleviate this problem is derived within the information geometric framework of Csiszár and Tusnády, and is used to improve MLLR adaptation on NAB and Switchboard adaptation tasks. It is shown how this algorithm approximately optimizes a discounted likelihood criterion.

         Large vocabulary speech recognition for read and broadcast Czech. W. Byrne, J. Hajic, P. Ircing, F. Jelinek, S. Khudanpur, J. McDonough, N. Peterek, and J. Psutka. In Proceedings of the Text, Speech, and Dialog Workshop, 1999. (6 pages). Paper [PDF].

We describe read speech and broadcast news corpora collected as part of a multi-year international collaboration for the development of large vocabulary speech recognition systems in the Czech language. Initial investigations into language modeling for Czech automatic speech recognition are described and preliminary recognition results on the read speech corpus are presented.

         Rapid speech recognizer adaptation to new speakers. V. Digalakis, S. Berkowitz, E. Bocchieri, C. Boulis, W. Byrne, H. Collier, A. Corduneanu, A. Kannan, S. Khudanpur, J. McDonough, and A. Sankar. In IEEE Conference on Acoustics, Speech and Signal Processing. IEEE, 1999. (4 pages). Paper [PDF].

This paper summarizes the work of the “Rapid Speech Recognizer Adaptation” team in the workshop held at Johns Hopkins University in the summer of 1998. The project addressed the modeling of dependencies between units of speech with the goal of making more effective use of small amounts of data for speaker adaptation. A variety of methods were investigated and their effectiveness in a rapid adaptation task defined on the SWITCHBOARD conversational speech corpus is reported.

         Task dependent loss functions in speech recognition: A-star search over recognition lattices. V. Goel and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 1999. (4 pages). Paper [PDF].

A recognition strategy that can be matched to specific system performance criteria has recently been found to yield improvements over the usual maximum a posteriori probability strategy. Some examples of different system performance criteria are word error rate (WER), F-measure for Named Entity extraction tasks, and word-specific errors for keyword spotting tasks. In the matched-to-the-task strategy the hypothesis is chosen to minimize the expected loss or the Bayes Risk under a loss function defined by the performance measure of interest. Due to the prohibitively expensive implementation of this strategy, only an approximate implementation as an N-best list rescoring scheme has been used so far. Our goal is to improve the performance of such risk-based decoders by developing search strategies that can incorporate more acoustic evidence. In this paper we present search algorithms to implement the risk-based recognition strategy over word lattices that contain acoustic and language model scores. These algorithms are extensions of the N-best list rescoring approximation and are formulated as A-star algorithms. We first present a single stack A-star search and show how to obtain an under-estimate and an over-estimate of the cost needed for the search. For loss functions that do not depend on time segmentation of hypotheses, a prefix-tree based simplification of the single stack algorithm is then derived. For yet a further subset of loss functions, including the usual Levenshtein distance based loss for WER reduction tasks, we describe a search organization that facilitates further efficiencies in computation and storage. Finally we present a path equivalence criterion for merging of prefix tree nodes during search to allow for a larger search space. We find that restricted loss functions yield the most efficient search procedures.
However the general single stack search can be applied quite broadly, in principle even to loss functions that measure semantic agreement between sentences. Preliminary experiments were performed for a WER reduction task on the Switchboard corpus, dev-test set of the 1997 JHU-LVCSR workshop. We obtain an error rate reduction of 0.8-0.9% absolute over a baseline of 38.5% WER. The search speed is comparable to the N-best list rescoring procedure, which is much more restrictive in the number of hypotheses considered for search and produces slightly inferior results (0.5-0.6% absolute improvement). At the conference we will present the framework of the task dependent recognition strategy, its implementation as A-star search, and the speed and accuracy comparison of the search with the N-best list rescoring procedure.

         Task dependent loss functions in speech recognition: Application to named entity extraction. V. Goel and W. Byrne. In ESCA-ETR Workshop on accessing information in spoken audio, 1999. (4 pages)

         Single-pass adapted training with all-pass transforms. J. McDonough and W. Byrne. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 1999. (4 pages). Paper [PDF].

In recent work, the all-pass transform (APT) was proposed as the basis of a speaker adaptation scheme intended for use with a large vocabulary speech recognition system. It was shown that APT-based adaptation reduces to a linear transformation of cepstral means, much like the better known maximum likelihood linear regression (MLLR), but is specified by far fewer free parameters. Due to its linearity, APT-based adaptation can be used in conjunction with speaker-adapted training (SAT), an algorithm for performing maximum likelihood estimation of the parameters of an HMM when speaker adaptation is to be employed during both training and test. In this work, we propose a refinement of SAT called single-pass adapted training (SPAT) which achieves the same improvement in system performance as SAT but requires much less computation for HMM training. In a set of speech recognition experiments conducted on the Switchboard Corpus, we report a word error rate reduction of 5.3% absolute using a single, global APT.

         Speaker adaptation with all-pass transforms. J. McDonough and W. Byrne. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1999. (4 pages). Paper [PDF].

In recent work, a class of transforms were proposed which achieve a remapping of the frequency axis much like conventional vocal tract length normalization. These mappings, known collectively as all-pass transforms (APT), were shown to produce substantial improvements in the performance of a large vocabulary speech recognition system when used to normalize incoming speech prior to recognition. In this application, the most advantageous characteristic of the APT was its cepstral-domain linearity; this linearity makes speaker normalization simple to implement, and provides for the robust estimation of the parameters characterizing individual speakers. In the current work, we exploit the APT to develop a speaker adaptation scheme in which the cepstral means of a speech recognition model are transformed to better match the speech of a given speaker. In a set of speech recognition experiments conducted on the Switchboard Corpus, we report reductions in word error rate of 3.7% absolute.

         Stochastic pronunciation modeling from hand-labelled phonetic corpora. M. Riley, W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Saraclar, C. Wooters, and G. Zavaliagkos. Speech Communication, pages 109–116 (8 pages), November 1999.

In the early ’90s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, briefly summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News read-speech ASR task. More recently, the ICSI spontaneous-speech phonetically transcribed corpus was collected at the behest of the 1996 and 1997 LVCSR Summer Workshops held at Johns Hopkins University. A 1997 workshop (WS97) group focused on pronunciation inference from this corpus for application to the DoD Switchboard spontaneous telephone speech ASR task. We describe several approaches taken there. These include (1) one analogous to the AT&T approach, (2) one, inspired by work at WS96 and CMU, that involved adding pronunciation variants of a sequence of one or more words (‘multiwords’) in the corpus (with corpus-derived probabilities) into the ASR lexicon, and (1+2) a hybrid approach in which a decision-tree model was used to automatically phonetically transcribe a much larger speech corpus than ICSI and then the multiword approach was used to construct an ASR recognition pronunciation lexicon.

1998    Stochastic pronunciation modeling from hand-labeled phonetic corpora. W. Byrne, M. Finke, S. Khudanpur, A. Ljolje, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. In Proceedings of the Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition, 1998. (8 pages). Paper [PDF]

         Pronunciation modelling using a hand-labelled corpus for conversational speech recognition. W. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. In IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 1998. (4 pages). Paper [PDF].

Accurately modelling pronunciation variability in conversational speech is an important component of an automatic speech recognition system. We describe some of the projects undertaken in this direction during and after WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July- August, 1997. We first illustrate a use of hand-labelled phonetic transcriptions of a portion of the Switchboard corpus, in conjunction with statistical techniques, to learn alternatives to canonical pronunciations of words. We then describe the use of these alternate pronunciations in an automatic speech recognition system. We demonstrate that the improvement in recognition performance from pronunciation modelling persists as the system is enhanced with better acoustic and language models.

         LVCSR rescoring with modified loss functions: a decision theoretic perspective. V. Goel, W. Byrne, and S. Khudanpur. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1998. (4 pages). Paper [PDF].

In this work, the problem of speech decoding is viewed in a Decision Theoretic framework. A modified speech decoding procedure to minimize the expected word error rate is formulated in this framework, and its implementation in N-best list rescoring is presented. Preliminary experiments on the Switchboard corpus show small but statistically significant error rate improvements.

         Speaker normalization with all-pass transforms. J. McDonough, W. Byrne, and X. Luo. In International Conference on Spoken Language Processing, 1998. (4 pages). Paper [PDF].

Speaker normalization is a process in which the short-time features of speech from a given speaker are transformed so as to better match some speaker independent model. Vocal tract length normalization (VTLN) is a popular speaker normalization scheme wherein the frequency axis of the short-time spectrum associated with a speaker’s speech is rescaled or warped prior to the extraction of cepstral features. In this work, we develop a novel speaker normalization scheme by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We propose a class of such maps, designated all-pass transforms for reasons given hereafter, and in a set of speech recognition experiments conducted on the Switchboard Corpus demonstrate their capacity to achieve word error rate reductions of 3.7% absolute.
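As a hedged illustration of the frequency-axis remapping described above (a sketch of the first-order bilinear all-pass warp commonly used for VTLN, not necessarily the exact map family in the paper): the warp is the phase of z^-1 passed through the all-pass section (z^-1 - alpha)/(1 - alpha z^-1), and the warp parameter alpha below is invented.

```python
import math

def bilinear_warp(omega, alpha):
    """First-order all-pass (bilinear) frequency warp of a normalized
    frequency omega in [0, pi].  alpha in (-1, 1); alpha = 0 is identity."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

# the endpoints 0 and pi are fixed; interior frequencies are stretched
print(round(bilinear_warp(0.0, 0.2), 4))      # -> 0.0
print(round(bilinear_warp(math.pi, 0.2), 4))  # -> 3.1416 (pi stays pi)
print(bilinear_warp(1.0, 0.2) > 1.0)          # -> True (alpha > 0 warps upward)
```

The cepstral-domain linearity exploited in the paper means this warp need never be applied to the spectrum directly: its effect on a cepstral feature vector is a fixed matrix multiply determined by alpha.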

1997    Pronunciation modelling for conversational speech recognition: A status report from WS97. W. Byrne, M. Finke, S. Khudanpur, J. McDonough, H. Nock, M. Riley, M. Saraclar, C. Wooters, and G. Zavaliagkos. In IEEE Automatic Speech Recognition and Understanding Workshop, 1997. (8 pages). Paper [PDF].

Accurately modelling pronunciation variability in conversational speech is an important component of automatic speech recognition. We describe some of the projects undertaken in this direction at WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July-August 1997. We first illustrate a use of hand-labelled phonetic transcriptions of a portion of the Switchboard corpus, in conjunction with statistical techniques, to learn alternatives to canonical pronunciations of words. We then describe the use of these alternate pronunciations in a recognition experiment as well as in the acoustic training of an automatic speech recognition system. Our results show a reduction in word error rate in both cases, with a 2.2% reduction when acoustic retraining is used.

         Is automatic speech recognition ready for non-native speech? A data collection effort and initial experiments in modeling conversational Hispanic English. W. Byrne, S. Khudanpur, E. Knodt, and J. Bernstein. In ESCA-ITR Workshop on speech technology in language learning, 1997. (4 pages). Paper [PDF].

We describe the protocol used for collecting a corpus of conversational English speech from non-native speakers at several levels of proficiency, and report the results of preliminary automatic speech recognition (ASR) experiments on this corpus using HTK-based ASR systems. The speech corpus contains both read and conversational speech recorded simultaneously on wide-band and telephone channels, and has detailed time-aligned transcriptions. The immediate goal of the ASR experiments is to assess the difficulty of the ASR problem in language learning exercises and thus to gauge how current ASR technology may be used in conversational computer assisted language learning (CALL) systems. The long-term goal of this research, of which the data collection and experiments are a first step, is to incorporate ASR into computer-based conversational language instruction systems.

         Neurocontrol in sequence recognition. W. Byrne and S. Shamma. In O. Omidvar and D. Elliott, editors, Progress in Neural Networks: Neural Networks for Control, pages 31–56 (26 pages). Academic Press, 1997. Paper [PDF].

An artificial neural network intended for sequence modeling and recognition is described. The network is based on a lateral inhibitory network with controlled, oscillatory behavior so that it naturally models sequence generation. Dynamic programming algorithms can be used to transform the network into a sequence recognizer. Markov decision theory is used to develop novel and more “neural” recognition control strategies as alternatives to dynamic programming.

1996    Information geometry and maximum likelihood criteria. W. Byrne. In Conference on Information Sciences and Systems, Princeton, NJ, 1996. (6 pages). Paper [PDF].

This paper presents a brief comparison of two information geometries as they are used to describe the EM algorithm used in maximum likelihood estimation from incomplete data. The Alternating Minimization framework based on the I-Geometry developed by Csiszar is presented first, followed by the em-algorithm of Amari. Following a comparison of these algorithms, a discussion of a variation in likelihood criterion is presented. The EM algorithm is usually formulated so as to improve the marginal likelihood criterion. Closely related algorithms also exist which are intended to maximize different likelihood criteria. The 1-Best criterion, for example, leads to the Viterbi training algorithm used in Hidden Markov Modeling. This criterion has an information geometric description that results from a minor modification of the marginal likelihood formulation.
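The EM algorithm's marginal likelihood criterion can be illustrated on a textbook incomplete-data problem. This sketch is not taken from the paper: it runs EM for a 50/50 mixture of two biased coins, where the identity of the coin behind each trial is the missing datum. Each iteration performs one pass of the two coordinate minimizations in the alternating scheme, so the marginal likelihood never decreases.

```python
import math

def em_two_coins(trials, p, q, iters):
    """EM for a 50/50 mixture of two biased coins with biases p and q.

    trials: list of (heads, tails) counts; which coin produced each
    trial is the hidden variable that EM averages over.
    """
    for _ in range(iters):
        # E-step: posterior responsibility of coin A for each trial
        num_a = den_a = num_b = den_b = 0.0
        for h, t in trials:
            la = p ** h * (1.0 - p) ** t
            lb = q ** h * (1.0 - q) ** t
            ra = la / (la + lb)
            num_a += ra * h
            den_a += ra * (h + t)
            num_b += (1.0 - ra) * h
            den_b += (1.0 - ra) * (h + t)
        # M-step: re-estimate each bias from its expected counts
        p, q = num_a / den_a, num_b / den_b
    return p, q

def log_marginal(trials, p, q):
    """Marginal log-likelihood with the hidden coin summed out."""
    return sum(math.log(0.5 * p ** h * (1.0 - p) ** t
                        + 0.5 * q ** h * (1.0 - q) ** t)
               for h, t in trials)

trials = [(9, 1), (8, 2), (1, 9), (2, 8), (9, 1)]
p, q = em_two_coins(trials, 0.6, 0.5, iters=10)
```

Replacing the E-step's soft responsibilities with a hard assignment to the more likely coin would give the 1-Best (Viterbi-style) variant discussed above, which optimizes a different likelihood criterion.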

         Modeling systematic variations in pronunciation via a language-dependent hidden speaking mode. M. Ostendorf, W. Byrne, M. Bacchiani, M. Finke, A. Gunawardana, K. Ross, S. Roweis, E. Shriberg, D. Talkin, A. Waibel, B. Wheatley, and T. Zeppenfeld. In Proceedings of the International Conference on Spoken Language Processing, 1996. (4 pages)

1994    Spontaneous speech recognition for the credit card corpus using the HTK toolkit. S. Young, P. Woodland, and W. Byrne. IEEE Transactions on Speech and Audio Processing, pages 615–621 (6 pages), 1994.

This paper describes the speech recognition system which was provided as a baseline for the Summer Workshop on Robust Speech Processing held at the Rutgers CAIP Center in July/August 1993.

1993    Generalization and maximum likelihood from small data sets. W. Byrne. In IEEE-SP Workshop on Neural Networks in Signal Processing, 1993. (7 pages). Paper [PDF].

An often encountered learning problem is maximum likelihood training of exponential models. When the state is only partially specified by the training data, iterative training algorithms are used to produce a sequence of models that assign increasing likelihood to the training data. Although the performance as measured on the training set continues to improve as the algorithms progress, performance on related data sets may eventually begin to deteriorate. The cause of this behavior can be seen when the training problem is stated in the Alternating Minimization framework. A modified maximum likelihood training criterion is suggested to counter this behavior. It leads to a simple modification of the learning algorithms which relates generalization to learning speed. Training Boltzmann Machines and Hidden Markov Models is discussed under this modified criterion.

         Noise robustness in the auditory representation of speech signals. K. Wang, S. Shamma, and W. Byrne. In International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1993. (4 pages)

1992    Alternating Minimization and Boltzmann Machine learning. W. Byrne. IEEE Transactions on Neural Networks, 3(4):612–620 (9 pages), 1992. Paper [PDF].

Training a Boltzmann machine with hidden units is appropriately treated in information geometry using the information divergence and the technique of alternating minimization. The resulting algorithm is shown to be closely related to gradient descent Boltzmann machine learning rules, and the close relationship of both to the EM algorithm is described. An iterative proportional fitting procedure is described and incorporated into the alternating minimization algorithm.
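The iterative proportional fitting procedure mentioned here can be shown on a small contingency table. This is a generic IPF sketch, not the paper's Boltzmann-machine formulation: rows and columns are alternately rescaled until the table matches the target marginals, each rescaling being a divergence-minimizing projection onto one marginal constraint set.

```python
def ipf(table, row_targets, col_targets, iters=100):
    """Iterative proportional fitting on a joint probability table.

    Alternately rescales rows and columns so that the table's marginals
    match the targets.
    """
    t = [row[:] for row in table]
    for _ in range(iters):
        for i, rt in enumerate(row_targets):
            s = sum(t[i])
            t[i] = [x * rt / s for x in t[i]]
        for j, ct in enumerate(col_targets):
            s = sum(row[j] for row in t)
            for i in range(len(t)):
                t[i][j] *= ct / s
    return t

# Fit a uniform 2x2 table to row marginals (0.3, 0.7) and
# column marginals (0.4, 0.6).
fitted = ipf([[1.0, 1.0], [1.0, 1.0]], [0.3, 0.7], [0.4, 0.6])
```

Starting from a uniform table, the fitted distribution is the product of its marginals; starting from a non-uniform table, IPF preserves the table's interaction structure while matching the constraints.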

1989    The auditory processing and recognition of speech. W. Byrne, J. Robinson, and S. Shamma. In Proceedings of the Speech and Natural Language Workshop, pages 325–331 (7 pages), October 1989

1986    Adaptive filter processing in remote heart monitors. W. Byrne, R. Zapp, P. Flynn, and M. Siegel. IEEE Transactions on Biomedical Engineering, pages 717–722 (6 pages), 1986

1985    Adaptive filtering in microwave remote heart monitors. W. Byrne, R. Zapp, P. Flynn, and M. Siegel. In IEEE Engineering in Medicine and Biology Society, Seventh Annual Conference, 1985. (4 pages)