
Video Mail Retrieval Using Voice


Project Staff:

¹ Cambridge University Engineering Department
² Cambridge University Computer Laboratory
³ Olivetti Research Laboratory


Interest in video-based communications and multimedia is growing rapidly. Cambridge University, in collaboration with Olivetti Research Laboratory (ORL), is developing the Medusa networked multimedia system, which is now in regular use on a high-speed ATM network covering ORL, the Computer Laboratory and the Engineering Department's SVR Group. One of the most popular Medusa services is video mail, and users are now amassing large archives of stored video messages. However, unlike regular electronic mail, which is easily searched using conventional text retrieval methods, video mail consists only of images and sound. The goal of this project is to develop retrieval methods based on spotting keywords in the audio soundtrack. The project has successfully integrated speech recognition and information retrieval technology to yield a practical audio and video retrieval system.

This project was supported by the UK DTI Grant IED4/1/5804 and SERC Grant GR/H87629.


Project Objectives



The project is organised in three stages, each lasting one year and culminating in a prototype demonstration system. The first stage prototype was completed in September 1994 and successfully demonstrated message retrieval from known speakers using a set of 35 predefined keywords. The second stage, completed in September 1995, extended this to allow unknown speakers. In July 1996 the final stage demonstrated open-keyword video document retrieval from arbitrary speakers, as well as a video mail browser allowing random access to video documents.

Document Retrieval

Finding interesting material in a large collection of documents is often time-consuming and inefficient. Automated text retrieval systems apply statistical techniques to match a search query against the material in the document archive. The retrieval system outputs a list of potentially relevant documents for the user to inspect, ranked by a score reflecting the match between the query and each individual document. Effective retrieval systems for electronic text archives have been developed using this statistical approach, and similar methods are now being used in Web search engines. But while text documents may be readily indexed by their contents, determining the information content of audio documents is considerably more difficult. In the absence of manually generated transcriptions, spoken documents can be retrieved only if their contents can be indexed using an automatic speech recognition (ASR) system. Speech recognition is a non-trivial process and presents several challenges in the retrieval domain.


Speech Recognition

Ideally, a speech recogniser would generate an exact transcription of the document contents, regardless of speaking style, vocabulary, or the acoustic environment. However, despite recent advances in ASR technology, this ideal system is not yet practical. State-of-the-art ASR systems can recognise vocabularies of many thousands of words, but out-of-vocabulary (OOV) words, such as many proper nouns, cannot be recognised. This is a particular problem in retrieval applications, where users frequently wish to search using OOV words including names of people, products, places or jargon. The VMR system overcomes this problem using a novel approach. Spoken words can be decomposed into a sequence of phone units (of which there are about 45 in British English), and the ASR system generates a generalised sub-word or phone lattice for each message. Because speech recognition is computationally expensive, this recognition phase is performed in advance of retrieval. To search for a query word during retrieval, the pre-computed lattices can be rapidly scanned (many times faster than real-time) for phone strings corresponding to the query word. As with all ASR systems, lattice word spotting is imperfect and is prone to false alarms (hypothesising a word when it is not present) and misses (failing to hypothesise words which are present). Experiments have shown that statistical methods allow robust retrieval despite these search errors. ASR in the VMR prototype is implemented using the Cambridge/Entropic HTK toolkit with speaker-independent hidden Markov models.
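The lattice scan described above can be sketched in a few lines. This is a minimal illustration only: it assumes a simplified edge-list lattice (tuples of start time, end time, phone label and log probability) rather than the actual HTK lattice format, and the phone labels and `spot` function are hypothetical.

```python
from collections import defaultdict

def spot(edges, query):
    """Find occurrences of a phone string in a (simplified) phone lattice.

    edges: list of (t_start, t_end, phone, log_prob) tuples.
    query: list of phone labels to match in sequence.
    Returns (t_start, t_end, total_log_prob) for each hit where
    consecutive matching edges abut in time.
    """
    by_start = defaultdict(list)
    for e in edges:
        by_start[e[0]].append(e)

    hits = []

    def extend(t, i, acc, t0):
        # All query phones consumed: record a complete hit.
        if i == len(query):
            hits.append((t0, t, acc))
            return
        # Follow every edge leaving time t that matches the next phone.
        for (ts, te, ph, lp) in by_start[t]:
            if ph == query[i]:
                extend(te, i + 1, acc + lp, t0)

    # A hit may begin at any edge matching the first query phone.
    for (ts, te, ph, lp) in edges:
        if ph == query[0]:
            extend(te, 1, lp, ts)
    return hits
```

Because the lattice is precomputed, this scan touches only edge lists rather than audio, which is why searching can run many times faster than real time.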


The Retrieval System

Figure 1. Block diagram of video mail retrieval system

Figure 1 shows an overview of the VMR system. New mail is passed to the ASR system, which computes a phone lattice for the message. To search for interesting messages, the user enters words that indicate the information need. A match score is then computed between the query and each of the messages, and the user is presented with a ranked list of the potentially most interesting messages.

The match score does not require all the query words to be present in the message; instead it forms a query/message correlation score. Individual words in each message are assigned a weight that depends on the frequency of the word in the message, the number of messages in which the word appears, and the length of the message. Thus, high-scoring messages may actually contain fewer matching words than lower-scoring ones. The retrieval system's user interface presents the match score graphically so that interesting messages may be quickly identified.
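The weighting just described can be illustrated with a small tf-idf style sketch. This is not the project's exact formula: the function name, the log-based inverse document frequency, and the square-root length normalisation are illustrative assumptions standing in for the actual weighting scheme.

```python
import math

def match_score(query, message, archive):
    """Query/message correlation score (illustrative, not the VMR formula).

    Each matching word is weighted by its frequency in the message (tf)
    and its rarity across the archive (idf); the sum is normalised by
    message length. Messages are lists of words; archive is a list of
    all messages.
    """
    N = len(archive)
    total = 0.0
    for term in set(query):
        tf = message.count(term)
        if tf == 0:
            continue  # a missing query word simply contributes nothing
        df = sum(1 for m in archive if term in m)  # document frequency
        total += tf * math.log(N / df)             # rarer words weigh more
    return total / math.sqrt(len(message))         # length normalisation
```

Note that a message matching one rare query word can outscore a longer message matching several common ones, which is why the ranked list is not simply ordered by the number of matching words.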

Figure 2. Video mail retrieval GUI

Figure 2 shows the user interface displaying a ranked list of messages --- the result of the query "folk festival cambridge." Prior to a search, the message archive is shown ranked by date and time of sending. The list can be narrowed to show only messages from selected originators. When a query is entered the list is re-ranked by the query/message match score. The bars to the right of the messages graphically indicate the relative score of each message.


The Video Browser

While there are convenient methods for the graphical browsing of text, e.g. scroll bars, ``page-forward'' commands, and word-search functions, existing video and audio playback interfaces almost universally adopt the ``tape recorder'' metaphor. To scan an entire message, the user must listen to it from start to finish to ensure that no parts are missed. Even with a ``fast forward'' button, finding a desired section in a lengthy message is generally a hit-or-miss operation. In contrast, the transcription of a minute-long message is typically a paragraph of text, which may be scanned by eye in a matter of seconds. Clearly there must be more economical ways to access and review audio/video data.

Figure 3. The video mail browser

The video browser shown in Figure 3 attempts to represent a dynamic time-varying process (the video stream) by a static image that can be taken in at a glance. A message is represented as a horizontal timeline, and keyword events are displayed graphically along it. Time runs from left to right, and events are represented proportionally to when they occur in the message; for example, events at the beginning appear on the left side of the bar and short-duration events are short. In the browser shown above, the timeline is the black bar and the scale indicates time in seconds. During playback, or when pointed at with the mouse, a keyword hit is highlighted and its name is displayed. (In the figure, the keyword "FESTIVAL" has just been played.) The message may be played starting at any time simply by clicking at the desired time in the time bar; this lets the user selectively play regions of interest, rather than the entire message.
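The proportional layout of the timeline amounts to a simple time-to-pixel mapping, sketched below. The function name and the units (seconds in, pixels out) are assumptions for illustration, not part of the actual browser implementation.

```python
def timeline_layout(hits, duration_s, width_px):
    """Map keyword hits, given as (start_s, end_s, word) tuples, onto
    (x_offset_px, span_px, word) spans along a horizontal time bar.

    Position and width are proportional to when and for how long the
    word occurs; every hit keeps at least one pixel of width so that
    very short events remain visible and clickable.
    """
    scale = width_px / duration_s
    return [(round(start * scale), max(1, round((end - start) * scale)), word)
            for (start, end, word) in hits]
```

The same scale factor, inverted, converts a mouse click at pixel x back to a playback start time of x / scale seconds, which is all that is needed to support click-to-play on the time bar.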


Conclusion and Future Work

The VMR project has demonstrated automatic retrieval of spoken documents both experimentally and with a working prototype. Future work will focus on combining phone lattice information with automatic transcription and extending the retrieval techniques to handle larger and more diverse message sets.

A new project on Multimedia Document Retrieval, developing from this work, is now underway at Cambridge University.



Mon Nov 10 1997