Many word-spotting applications require an open keyword vocabulary, so that the user can search for any term in an audio document database. This in turn requires an automatic method of determining the acoustic representation of an arbitrary keyword. For an HMM-based system, where the keyword is represented by a concatenated string of phones, the keyword phone string (KPS), the phonetic transcription must be estimated. This is equivalent to the new word problem in speech recognition, except that the spelling of the word is not essential and no revision of the language model is required.
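As an illustration of what a KPS is, the following minimal sketch builds one for a key phrase by concatenating per-word phone sequences. The lexicon, its phone symbols, and the function name `keyword_phone_string` are hypothetical and chosen only for this example. The lookup failure for an unlisted word is the situation in which the transcription has to be estimated instead.

```python
from typing import Dict, List

# Hypothetical toy lexicon; the phone symbols are illustrative only.
TOY_LEXICON: Dict[str, List[str]] = {
    "speech": ["s", "p", "iy", "ch"],
    "model": ["m", "aa", "d", "ax", "l"],
}


def keyword_phone_string(phrase: str, lexicon: Dict[str, List[str]]) -> List[str]:
    """Concatenate per-word phone sequences into one keyword phone string."""
    phones: List[str] = []
    for word in phrase.lower().split():
        if word not in lexicon:
            # An open vocabulary means this case must be handled by
            # estimating a transcription rather than by a fixed lookup.
            raise KeyError(f"no transcription available for {word!r}")
        phones.extend(lexicon[word])
    return phones


if __name__ == "__main__":
    print(keyword_phone_string("speech model", TOY_LEXICON))
```

In an HMM-based word-spotter, each phone in the returned list would select one phone model, and the keyword model would be the concatenation of those models in order.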
A new keyword (or key phrase) can be entered by text and/or voice, depending on the device. This gives three possible input modes for the new word: orthographic spelling, spoken example, and a combination of the two. Work in all these areas is reviewed in a CUED technical report, Techniques for automatically transcribing unknown keywords for open keyword set HMM-based word-spotting. The spoken keyword case is examined in more detail because it is a more natural mode of interaction with audio-based systems, which is particularly important for hand-held devices with no keyboard or only a limited one. Retrieval requests are also likely to contain many real names and user-defined jargon, which are difficult to cover fully in spelling-based systems.
Overall, speech-derived KPSs were found to be less robust than KPSs defined by a phonetic dictionary. However, because the speech-based system does not use a dictionary, it can handle any word or sound, and it also requires less memory. Given a single keyword utterance, producing multiple keyword pronunciations using an N-best recogniser was found to give the best word-spotting performance. Greater robustness can be achieved if more keyword utterances are available. Again, an N-best approach yielded the best performance, but good performance was also achieved using the Viterbi string for each utterance, which has a lower computational cost.
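One way to picture how several keyword utterances could contribute pronunciations is sketched below. This is not the procedure used in the work described here. It assumes that a phone recogniser has already produced, for each utterance, a list of hypothesised phone strings with log-likelihood scores; the `Hypothesis` type, the function `pool_pronunciations`, and the example scores are all hypothetical. Setting `n=1` keeps only the single best string per utterance, which corresponds to the lower-cost Viterbi option.

```python
from typing import Dict, List, Sequence, Tuple

# (phone string, log-likelihood); both are assumed outputs of a phone recogniser.
Hypothesis = Tuple[Tuple[str, ...], float]


def pool_pronunciations(
    nbest_lists: Sequence[Sequence[Hypothesis]], n: int
) -> List[Tuple[str, ...]]:
    """Keep the top-n hypotheses of each utterance and merge duplicates.

    The result is ordered by the best score seen for each phone string.
    Comparing raw scores across utterances is a simplification made
    for this sketch.
    """
    best: Dict[Tuple[str, ...], float] = {}
    for hyps in nbest_lists:
        top = sorted(hyps, key=lambda h: h[1], reverse=True)[:n]
        for phones, score in top:
            if phones not in best or score > best[phones]:
                best[phones] = score
    return sorted(best, key=lambda p: best[p], reverse=True)


if __name__ == "__main__":
    # Made-up hypotheses for two utterances of the same keyword.
    utt1 = [(("k", "ae", "t"), -10.0), (("k", "eh", "t"), -12.0), (("g", "ae", "t"), -15.0)]
    utt2 = [(("k", "ae", "t"), -9.0), (("k", "ae", "d"), -11.0)]
    for pron in pool_pronunciations([utt1, utt2], n=2):
        print(" ".join(pron))
```

Each pooled phone string would then serve as an alternative pronunciation of the keyword in the word-spotter, so a larger `n` trades extra computation for more coverage of pronunciation variability.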
Paper on Keyword training using a single spoken example for applications in audio document retrieval, presented at ICSST, Perth, Australia, 1994. Use of an N-best recogniser is proposed to derive multiple pronunciations of the keyword to model some of the variability inherent in its pronunciation.
Technical report on Techniques for automatically transcribing unknown keywords for open keyword set HMM-based word-spotting.