
Multimedia Document Retrieval (MDR)



Main Project Staff

¹ Cambridge University Engineering Department
² Cambridge University Computer Laboratory

Industrial Collaborators

Entropic Ltd.
Dr. Ken Wood ³ (krw@uk.research.att.com)
³ AT&T Laboratories, Cambridge


Project Objectives

The aims of the project are to develop techniques for using speech recognition to automatically transcribe and index audio and video material; and to integrate these techniques with a probabilistic information retrieval model to provide large-scale retrieval of multimedia documents.

The work will pursue the following specific objectives against which the success of the project may be judged.

  1. To establish an infrastructure for recognition system development and spoken document retrieval based on the US ARPA "Broadcast News" task.
  2. To develop techniques for transcribing broadcast news material; specifically techniques for
    1. audio segmentation and classification
    2. speaker clustering and tracking
    3. recognition with background speech and music
    4. robust adaptation of acoustic and language models
    5. modelling fast speech pronunciation effects
    6. integrating phone-lattice based word spotting for recognising new words.
  3. To understand the behaviour of probabilistic retrieval models with respect to increasing size of corpora and the effects of word recognition errors; and to refine the models accordingly.
  4. To develop a relevance feedback mechanism for spoken document retrieval using both term reweighting and query expansion.
  5. To evaluate the above by participation in the US ARPA CSR (continuous speech recognition) and TREC (information retrieval) evaluations.
  6. To illustrate the effectiveness of the above by extending the existing VMR demonstration system (produced within the VMR project GR/H87629) to support the interactive retrieval of broadcast news.

Progress to Date (summary)

For a more detailed description of our progress, please read our project progress page.

September 1997 - Hub 4 evaluation and Recogniser Improvements
Work initially focussed on the 1997 Hub 4 Broadcast News evaluation. New segmentation, clustering and improved modelling techniques were introduced. The final HTK system yielded an overall word error rate of 22.0% on the 1996 unpartitioned broadcast news development test data and just 16.2% on the evaluation test set, the lowest overall word error rate in the 1997 DARPA broadcast news evaluation by a statistically significant margin (see NIST report).

Our adaptation software was revised and extended to include speaker adaptive training (SAT), and a new maximum-likelihood based clustering scheme was developed and shown to reduce recognition word error rates. In addition, a tool for locating and analysing errors in found speech was developed.

January 1998 - Speech and IR Development
In preparation for the TREC-7 Spoken Document Retrieval evaluation, work was done by Entropic Ltd. to speed up the decoder. The final two-pass transcription system, incorporating the new segmentation and clustering algorithms, gender/bandwidth-dependent models, MLLR adaptation, a 4-gram language model and a 65k vocabulary, ran in approximately 50 times real time and gave a word error rate of around 25% on the 50 hours of TREC-6 SDR data, which we subsequently used for information retrieval development work.

IR development was carried out on a query set developed in-house. Standard notions of stopping and stemming were applied, with extra text processing, such as dealing with abbreviations (e.g. "U.S.A.") and known stemming exceptions (e.g. news/new; Californian/California). Query terms were weighted by their part-of-speech and some modest statistical pre-search query expansion was used to add terms to the query that were more common statistical hyponyms of the query terms. Finally, term-position indexing was implemented to allow phrasal terms, extracted from the query using the part of speech information, to be included. The combined weight formula was used to score the documents. All of these measures were shown to increase performance on our own development query set.
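The combined weight formula mentioned above is the standard Robertson/Sparck Jones combined weight used in Okapi-style probabilistic retrieval. The sketch below is an illustrative Python version under that assumption; the constants K1 and B and all function names are illustrative, not the project's actual settings.

```python
import math

# Illustrative tuning constants, not the project's actual values.
K1, B = 1.4, 0.6

def combined_weight(tf, n_docs_with_term, n_docs, doc_len, avg_doc_len):
    """Okapi-style combined weight for one query term in one document."""
    cfw = math.log(n_docs / n_docs_with_term)   # collection frequency weight
    ndl = doc_len / avg_doc_len                 # normalised document length
    return cfw * tf * (K1 + 1) / (K1 * ((1 - B) + B * ndl) + tf)

def score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs):
    """Document score: sum of combined weights over matching query terms."""
    return sum(combined_weight(doc_tf[t], df[t], n_docs, doc_len, avg_doc_len)
               for t in query_terms if t in doc_tf)
```

The weight rises with within-document term frequency but saturates (controlled by K1), rewards rare terms via the collection frequency weight, and penalises long documents via the normalised document length.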

May 1998 - The TREC-7 Evaluation
The overall word error rate for our recognition system on the 100 hours of TREC-7 SDR data was 24.8%, the lowest in the TREC-7 SDR evaluation. Retrieval was run on these transcriptions and those from other competing sites with word error rates ranging from 29% to 66%. Our retrieval system performed well, and the evaluation showed that for this relatively small task (in IR terms) the degradation of performance with word error rate was small (around 6% in mean average precision at a 25% word error rate). We introduced the concept of a Processed Term Error Rate (PTER) to evaluate transcription errors from a retrieval-based perspective, and showed it varied approximately linearly with mean average precision for our system.
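A retrieval-oriented term error rate of the kind PTER describes can be sketched as follows: stop and stem both transcripts, then count per-term occurrence mismatches relative to the processed reference. This is a simplified illustration, not the project's exact definition; the toy stop list and suffix-stripping stand-in for a real stemmer are assumptions.

```python
from collections import Counter

STOP = {"the", "a", "of", "in", "and", "to"}  # toy stop list

def stem(w):
    # Crude suffix stripping; a stand-in for a real stemmer such as Porter.
    for suf in ("ing", "ed", "s"):
        if w.endswith(suf) and len(w) > len(suf) + 2:
            return w[: -len(suf)]
    return w

def process(text):
    """Apply stopping and stemming, mirroring the IR indexing pipeline."""
    return [stem(w) for w in text.lower().split() if w not in STOP]

def pter(reference, hypothesis):
    """Per-term count mismatch after stop/stem processing, divided by the
    number of processed reference terms."""
    ref, hyp = Counter(process(reference)), Counter(process(hypothesis))
    errors = sum(abs(ref[t] - hyp[t]) for t in ref | hyp)
    return errors / sum(ref.values())
```

Note that errors which the indexing pipeline would neutralise anyway (stop words, or inflections that stem to the same root) do not count, which is why such a measure tracks retrieval performance more closely than raw word error rate.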

November 1998 - Improving the Probabilistic IR Model
Partially ordered sets (posets) were introduced into the improved probabilistic retrieval framework to allow semantically related words to be added to the query, and were shown to increase IR performance. Work on relevance feedback, both on the test collection and on a (larger) parallel collection, showed that both these techniques can increase retrieval performance across a wide range of transcription error rates.

November 1998 - Hub4 evaluation
Systems were built for both the time-unlimited and 10x real-time 1998 DARPA/NIST Hub 4 evaluations, in conjunction with Entropic. These included vocal-tract length normalisation; cluster-based cepstral mean and variance normalisation; better acoustic and language models; and improved adaptation using a full-variance transformation in addition to standard MLLR. The final HTK unconstrained-compute system ran in about 300xRT and gave an overall word error rate of 13.8% on the complete evaluation data set (not statistically significantly different from the best system). The 10xRT system used a two-pass strategy with a highly optimised decoder, giving an error rate of 16.1% (the lowest for the 10xRT task by a statistically significant margin).

March 1999 - Modelling Speakers and Speaking Rates
Further work on speaker clustering showed that, by modifying the clustering system used in recognition, the automatically generated segments could be split into speaker groups quite successfully. This could allow, for example, all speech from the announcer in a news show to be extracted and presented to the user as a summary of the day's main news.

The correlation between an inter-frame distance measure and a phone-based concept of speaking rate was also investigated. It was shown that it is possible to build MAP estimators to distinguish between fast and slow phonemes and recognition was improved by incorporating this information into an HMM system.

July 1999 - The TREC-8 Evaluation
Work then focussed on the 1999 TREC-8 SDR evaluation. New models and parameter sets were made for the segmentation and clustering. A novel algorithm was developed to detect commercials by searching for repeating audio in the broadcasts. This rejected 42.3 hours of audio, of which 41.4 hours was labelled as non-story content in the reference. A larger vocabulary of 108k words was used to reduce the OOV problem, whilst the highly-optimised decoder from Entropic allowed the system to run in 13 times real time. Improved acoustic and language modelling also helped increase accuracy over the TREC-7 SDR system. The final WER on the 10 hour scored subset of the TREC-8 SDR corpus was 20.6% (the lowest in the evaluation).
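The repeating-audio idea behind the commercial detector can be illustrated with a toy fingerprinting sketch: hash fixed-length windows of acoustic features and flag any window whose hash recurs elsewhere in the broadcasts. The hashing scheme, window and hop sizes, and all names below are illustrative assumptions; the actual system will have used a more robust audio fingerprint.

```python
import hashlib

def fingerprint(frames):
    """Coarse hash of a window of quantised feature values (toy stand-in
    for a real audio fingerprint)."""
    quantised = bytes(min(255, max(0, int(v * 16))) for v in frames)
    return hashlib.md5(quantised).hexdigest()

def find_repeats(shows, win=100, hop=50):
    """Return (show_id, start_frame) pairs whose window fingerprint occurs
    more than once across the broadcasts -- candidate commercials."""
    seen = {}
    repeats = []
    for show_id, feats in shows.items():
        for start in range(0, len(feats) - win + 1, hop):
            fp = fingerprint(feats[start:start + win])
            if fp in seen and seen[fp] != (show_id, start):
                repeats.append((show_id, start))
                repeats.append(seen[fp])
            else:
                seen[fp] = (show_id, start)
    return sorted(set(repeats))
```

The attraction of this approach is that it needs no model of what a commercial sounds like: adverts are simply the material that is broadcast verbatim more than once.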

The IR system was augmented with:

We also built a system for the story-boundary-unknown evaluation, using a windowing scheme that exploited the broadcast structure information from the commercial detection and audio segmentation. Retrieval was the same as in the story-known case, but without document feedback; retrieved windows that were nearby in time were then combined to give the final retrieved documents.
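The final recombination step can be sketched as a simple interval merge: retrieved windows whose gap in time is below a threshold are fused into one passage. This is an illustrative sketch only; the gap threshold and the choice to keep the best score for a merged passage are assumptions, not the project's actual algorithm.

```python
def merge_windows(hits, max_gap=30.0):
    """Merge retrieved windows (start, end, score) that are close in time,
    keeping the best score for each merged passage."""
    merged = []
    for start, end, score in sorted(hits):
        if merged and start - merged[-1][1] <= max_gap:
            prev_start, prev_end, prev_score = merged.pop()
            merged.append((prev_start, max(prev_end, end),
                           max(prev_score, score)))
        else:
            merged.append((start, end, score))
    return merged
```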

November 1999 - Analysing Results from the TREC-8 evaluation
We achieved an average precision of 55.3% on our own transcriptions (the best in the evaluation) and 41.5% for the story-unknown evaluation. Work focussed on analysing the effects of the various components of the system and the average precision for the story-unknown case was increased to 46.5% by making minor modifications.

February 2000 - Investigating Out of Vocabulary (OOV) Effects
A key issue in SDR is the effect of OOV words. If OOV words occur in the spoken documents, recognition errors result; if there are OOV terms in written queries, those terms cannot match the documents in the collection. We studied the impact of OOV effects in the context of the TREC-8 SDR corpus and query set. Using a fast recogniser we transcribed the 500 hours of TREC-8 broadcast news material with 5 different recognisers which differed only in vocabulary size (55k, 27k, 13k, 6k, 3k), covering a large range of OOV rates. We then ran a series of IR experiments on these transcription sets with IR systems with and without query and document expansion. Query expansion used blind relevance feedback from the document collection and/or from a large parallel collection of newspaper texts. Document expansion used the spoken documents as queries to the parallel collection to add terms to the original documents, and can directly compensate for OOV effects. The experiments showed that the use of parallel corpora for query and document expansion can compensate for the effects of OOV words to a large extent at moderate OOV rates. These results imply that, at least for the type of queries and documents used in TREC-8, further compensation for OOV words using e.g. phone lattices is not needed.
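Document expansion of the kind described above can be sketched as follows: use the (possibly errorful) spoken document as a query against the parallel text corpus, then add the most distinctive terms from the best-matching parallel documents back into the document's index entry. The overlap scoring and term weighting below are deliberate simplifications of a full probabilistic ranking, and all names and parameters are illustrative.

```python
import math
from collections import Counter

def expand_document(doc_terms, parallel_docs, top_docs=5, add_terms=10):
    """Add terms from the best-matching parallel-corpus documents to a
    spoken document (blind-feedback document expansion, simplified)."""
    doc = Counter(doc_terms)
    n = len(parallel_docs)
    df = Counter(t for d in parallel_docs for t in set(d))
    # Rank parallel documents by (toy) term overlap with the document.
    ranked = sorted(parallel_docs,
                    key=lambda d: sum(doc[t] for t in set(d)), reverse=True)
    pool = Counter(t for d in ranked[:top_docs] for t in d)
    # Weight candidates by frequency in the feedback docs times rarity.
    scored = sorted(pool,
                    key=lambda t: pool[t] * math.log((n + 1) / (df[t] + 1)),
                    reverse=True)
    new_terms = [t for t in scored if t not in doc][:add_terms]
    return doc_terms + new_terms
```

Because the added terms come from clean parallel text rather than the recogniser output, an OOV or misrecognised word can still be restored to the document's index entry if it is characteristic of the matching parallel documents.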

March 2000 - Consolidating the MDR demo
The MDR demo system, which downloads RealAudio from the WWW, automatically transcribes it, and allows the user to search the resulting database, was expanded and improved. Both filtering with fixed queries and conventional retrieval with new queries were allowed, with the user able to browse the transcripts as well as listen to the original audio. Interactive query expansion using semantic posets and relevance feedback was also included, whilst keyword highlighting and displaying the best-matching extract allowed the user to find relevant passages more quickly. The system was also extended to form artificial story boundaries for convenient retrieval when the original audio was not pre-segmented by topic. The demo was presented at both RIAO 2000 and SIGIR 2000 and attracted considerable interest.

August 2000 - TREC-9
The TREC-9 SDR evaluation was similar to the TREC-8 evaluation, using the same document collection and main tasks, but had a few subtle differences. The story-unknown task became the main focus, and the use of non-lexical automatically derived information (such as the gender of the speaker, or the presence of music) was allowed in retrieval for the first time. We generated the following non-lexical tags: segment, gender, bandwidth, high-energy, no-speech (silence, noise or music), repeats, and commercials. We focussed on improving the performance of the system and obtained over 51% AveP for the story-unknown case and over 60% for the story-known case in trials using the TREC-8 queries (cf. 41.5% and 55.3% respectively in the TREC-8 evaluation).

The team's results in TREC-9 were very good, clearly demonstrating effective retrieval performance in the story-unknown task and showing that spoken document retrieval, even at non-trivial word error rates, is a perfectly practical proposition. The final project publications include the TREC-9 paper, 2 papers in the International Journal of Speech Technology and a comprehensive technical report.


Related Projects and Issues


Previous Work

This project follows on from a project on Video Mail Retrieval Using Voice, which was a collaboration between Cambridge University Computer Laboratory and Engineering Department, and ORL (now AT&T Laboratories). It explored spoken document retrieval for video mail messages using a small corpus of messages along with two query sets. This initial study of spoken document retrieval showed that probabilistic retrieval techniques could be successfully combined with classical speech recognition methods for acceptable system performance.

Future Work

Issues which might be appropriate for future work and grants are:
  • Reliable methods for characterising audio into categories such as speech, speech+music, music and noise.
  • Detecting and integrating video cues to improve audio segmentation and labelling, e.g. detecting show boundaries, speaker changes, etc.
  • Recognising text cues in the video signal.
  • Locating and tracking lip-movement to improve recognition.



Please see the Project Publications page for our references.



This work is funded by EPSRC grant GR/L49611

This Page is maintained by Sue Johnson,   sej28@eng.cam.ac.uk
Sun 7 Oct 2001