THE USE OF CONTEXT IN LARGE VOCABULARY SPEECH RECOGNITION
In recent years, considerable progress has been made in the field of continuous speech recognition, where the predominant technology is based on hidden Markov models (HMMs). HMMs represent sequences of time-varying speech spectra using probabilistic functions of an underlying Markov chain.
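The core HMM computation alluded to above can be illustrated with a minimal sketch (not taken from the thesis): the forward algorithm, which evaluates the probability of an observation sequence under a discrete HMM. All parameter values here are invented for illustration.

```python
# Sketch: the forward algorithm for a discrete-observation HMM.
# pi: initial state probabilities, A: transition matrix,
# B: per-state emission probabilities. Toy values, for illustration only.

def forward(obs, pi, A, B):
    """Return P(obs) under the HMM (pi, A, B)."""
    n_states = len(pi)
    # Initialise with the first observation.
    alpha = [pi[s] * B[s][obs[0]] for s in range(n_states)]
    # Propagate forward through the remaining observations.
    for o in obs[1:]:
        alpha = [sum(alpha[r] * A[r][s] for r in range(n_states)) * B[s][o]
                 for s in range(n_states)]
    return sum(alpha)

# Toy two-state model over a binary observation alphabet.
pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.9, 0.1], [0.2, 0.8]]
print(forward([0, 1, 0], pi, A, B))
```

Real recognisers use continuous (e.g. Gaussian mixture) emission densities and work in the log domain, but the recursion is the same.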
However, because the probability distribution represented by an HMM is very simple, its discriminative ability is limited. As a consequence, a careful choice of the units represented by each model is required in order to accurately model the variation inherent in natural speech. In practice, much of the variation is due to consistent contextual effects and can be accounted for by using context dependent models.
In large vocabulary recognition the use of context dependent models introduces two major problems. Firstly, some method must be devised to determine the set of contexts which require distinct models. Furthermore, this must be done in a way which takes account of the sparsity and unevenness of the training data. Secondly, a strategy must be devised which allows efficient decoding using models incorporating context dependencies both within words and across word boundaries. This thesis addresses both of these key problems.
Firstly, a method of constructing robust and accurate recognisers using decision-tree based clustering techniques is described. The strength of this approach lies in its ability to accurately model contexts not appearing in the training data. Linguistic knowledge is used, in conjunction with the data, to decide which contexts are similar and can share parameters. A key feature of this approach is that it allows the construction of models which are dependent upon contextual effects occurring across word boundaries.
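The clustering idea can be sketched in simplified form: phonetically motivated yes/no questions partition a pool of contexts, and the question chosen is the one giving the largest likelihood gain. The question set, contexts, and scalar data below are invented; real systems use state occupation statistics and multivariate Gaussians rather than one scalar per context.

```python
import math

# Hypothetical phonetic questions: each maps a left-context phone to True/False.
QUESTIONS = {
    "L_is_nasal": lambda left: left in {"m", "n", "ng"},
    "L_is_vowel": lambda left: left in {"aa", "iy", "uw"},
}

def loglik(values):
    """Log-likelihood of data under a single shared Gaussian (ML estimate),
    the standard criterion for likelihood-based state clustering."""
    n = len(values)
    if n < 2:
        return 0.0
    mean = sum(values) / n
    var = max(sum((v - mean) ** 2 for v in values) / n, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(contexts, data):
    """Greedily pick the question that maximises the likelihood gain of
    splitting the pooled data into a yes-cluster and a no-cluster."""
    base = loglik(data)
    best = None
    for name, q in QUESTIONS.items():
        yes = [d for c, d in zip(contexts, data) if q(c)]
        no  = [d for c, d in zip(contexts, data) if not q(c)]
        gain = loglik(yes) + loglik(no) - base
        if best is None or gain > best[1]:
            best = (name, gain)
    return best
```

Applied recursively to each resulting cluster, this yields a binary decision tree; an unseen context is then synthesised simply by answering the questions down the tree, which is what gives the approach its robustness to contexts absent from the training data.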
The use of cross word context dependent models presents problems for conventional decoders. The second part of the thesis therefore presents a new decoder design which is capable of using these models efficiently. The decoder is suitable for use with very large vocabularies and long span language models. It is also capable of generating a lattice of word hypotheses with little computational overhead. These lattices can be used to constrain further decoding, allowing efficient use of complex acoustic and language models.
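Why cross-word models complicate decoding can be seen in a toy expansion (not the thesis decoder): the boundary triphones of a word depend on the neighbouring words, so the static network must be expanded per context. The lexicon and phone symbols below are invented for illustration.

```python
# Toy lexicon mapping words to (invented) phone sequences.
LEX = {"and": ["ae", "n", "d"], "the": ["dh", "ah"], "sat": ["s", "ae", "t"]}

def triphones(prev_word, word, next_word):
    """Expand a word into triphones written l-c+r, taking the last phone of
    the previous word and the first phone of the next word as the boundary
    contexts (silence, 'sil', at utterance edges)."""
    phones = LEX[word]
    left = LEX[prev_word][-1] if prev_word else "sil"
    right = LEX[next_word][0] if next_word else "sil"
    ctx = [left] + phones + [right]
    return [f"{ctx[i]}-{ctx[i + 1]}+{ctx[i + 2]}" for i in range(len(phones))]

# The same word needs different boundary models for different predecessors:
print(triphones("the", "sat", None))
print(triphones("and", "sat", None))
```

Because every distinct predecessor or successor selects different boundary models, a naive network expansion grows rapidly with the vocabulary, which is the efficiency problem the decoder described in the thesis is designed to solve.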
The effectiveness of these techniques has been assessed on a variety of large vocabulary continuous speech recognition tasks, and results are presented which analyse performance in terms of computational complexity and recognition accuracy. The experiments demonstrate state-of-the-art performance, and a recogniser using these techniques was used in the 1994 US ARPA CSR Evaluations, where it returned the lowest error rate of any system tested.