Abstract for knill_tr230

Cambridge University Engineering Department Technical Report CUED/F-INFENG/TR230

SPEAKER DEPENDENT KEYWORD SPOTTING FOR ACCESSING STORED SPEECH

Kate Knill and Steve Young

September 1995

Many word-spotting applications require an open keyword vocabulary, allowing the user to search for any term in an audio document database. This in turn requires an automatic method of determining the acoustic representation of an arbitrary keyword. For an HMM-based system, where the keyword is represented by a concatenated string of phones, the keyword phone string (KPS), the phonetic transcription must be estimated. This report describes automatic transcription methods for three keyword input modes: orthographically spelt, spoken, and combined spelt and spoken.

The spoken keyword case is examined in more detail for the following reasons. Firstly, interaction with an audio-based system is more natural by voice than by typing at a keyboard or speaking the orthographic spelling. This is of particular interest for hand-held devices with no keyboard, or only a limited one. Secondly, retrieval requests are likely to contain many real names and much user-defined jargon, which are difficult to cover fully in spelling-based systems. The basic approach considered is that of using a phone-level speech recogniser to hypothesise one or more keyword transcriptions. The effect on the KPSs of four factors is evaluated: the number of pronunciation strings, the HMM complexity, the language model used in the phone recogniser, and the number of sample keyword utterances. The evaluation is carried out through a series of speaker-dependent word-spotting experiments on spontaneous speech messages from the Video Mail Retrieval database.
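As a rough illustration of the idea of hypothesising several keyword transcriptions from one or more spoken examples, the sketch below pools the top-N phone strings from each utterance into a single pronunciation set. It is not the report's implementation: the function name, the (phone string, score) input format, the assumption that a higher score is better, and the example phone symbols and scores are all hypothetical. The N-best lists themselves are assumed to come from some phone recogniser that is not shown.

```python
from typing import Iterable, List, Sequence, Tuple

# A phone string is a tuple of phone labels, e.g. ("k", "ae", "t").
PhoneString = Tuple[str, ...]


def pool_keyword_phone_strings(
    nbest_per_utterance: Iterable[Sequence[Tuple[PhoneString, float]]],
    n: int = 7,
) -> List[PhoneString]:
    """Pool the top-n phone strings of each utterance, dropping duplicates.

    Each utterance contributes a list of (phone_string, score) pairs.
    Assumption: a higher score means a more likely hypothesis.
    """
    pooled: List[PhoneString] = []
    seen = set()
    for hypotheses in nbest_per_utterance:
        # Rank this utterance's hypotheses, best first.
        ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
        for phones, _score in ranked[:n]:
            if phones not in seen:
                seen.add(phones)
                pooled.append(phones)
    return pooled


# Hypothetical N-best output for two utterances of the same keyword.
utterance_1 = [(("k", "ae", "t"), -10.0), (("k", "ah", "t"), -12.5)]
utterance_2 = [(("k", "ae", "t"), -9.0), (("g", "ae", "t"), -11.0)]

if __name__ == "__main__":
    for phones in pool_keyword_phone_strings([utterance_1, utterance_2], n=2):
        print(" ".join(phones))
```

Setting n=1 keeps only the single best string per utterance, which corresponds loosely to using one best-path transcription for each example.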

Overall, it was found that speech-derived KPSs are less robust than KPSs defined by a phonetic dictionary. However, since the speech-based system does not use a dictionary, it has the advantage that it can handle any word or sound. It also requires less memory. Given a single keyword utterance, producing multiple keyword pronunciations using a 7 N-best recogniser was found to give the best word-spotting performance, with a 9.3% drop in performance relative to the phonetic dictionary defined system for a null grammar, monophone HMM-based KPS recogniser. If two utterances are available, greater robustness can be achieved, as the problem of poor keyword examples is partially overcome. Again, a 7 N-best approach yielded the best performance (6.1% relative drop), but good performance was also achieved using the Viterbi string for each utterance (8.5% relative drop), which has a lower computational cost.
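For readers unfamiliar with the term, a relative drop expresses the loss of a system as a percentage of the baseline score rather than as an absolute difference. The sketch below shows the arithmetic only. The function name and the scores in the example are hypothetical and are not taken from the report, and this abstract does not state which performance measure the figures refer to.

```python
def relative_drop(baseline: float, system: float) -> float:
    """Return the drop of `system` below `baseline` as a percentage of `baseline`."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return 100.0 * (baseline - system) / baseline


if __name__ == "__main__":
    # Hypothetical scores, chosen only to illustrate the calculation.
    dictionary_score = 80.0  # baseline: dictionary-defined KPSs
    speech_score = 72.0      # system under test: speech-derived KPSs
    print(f"{relative_drop(dictionary_score, speech_score):.1f}% relative drop")
```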

(ftp:) knill_tr230.ps.Z (http:) knill_tr230.ps.Z
PDF (automatically generated from original PostScript document - may be badly aliased on screen):
(ftp:) knill_tr230.pdf | (http:) knill_tr230.pdf
