Multimedia Document Retrieval
(1997 - 2000) - Progress


September 1997 -
Hub 4 Evaluation and Recogniser Improvement

Work initially focussed on the 1997 Hub 4 Broadcast News evaluation. Many experiments were carried out comparing data-type specific and non-specific models, differing amounts of training data, the use of gender-dependent modelling and the effects of automatic data-type classification (see [4]). An automatic segmenter, which divides the unrestricted audio into portions that are homogeneous in speaker and acoustic condition, was developed as a front end to the recognition process (see [2]). The final HTK system designed from the results of these experiments yielded an overall word error rate of 22.0% on the 1996 unpartitioned broadcast news development test data and just 16.2% on the evaluation test set (see NIST report). This was the lowest overall word error rate in the 1997 DARPA broadcast news evaluation, by a statistically significant margin (see [3] for a more detailed description).

In parallel with the Hub4 system development, a number of algorithmic improvements were made. Our adaptation software was completely revised and extended to include speaker adaptive training (SAT). A new scheme was developed for finding speaker clusters in found speech such as broadcast news; it has been shown to increase the data likelihood in MLLR-based adaptation and subsequently to reduce word error rates in recognition (see [5]). In addition, a tool for locating and analysing errors in found speech was developed (see [6]).
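As an illustration of the kind of bottom-up, covariance-based clustering involved, the sketch below groups audio segments into putative speaker clusters by repeatedly merging the pair whose pooled full-covariance model gives the smallest log-determinant-style cost. The cost function, stopping criterion and all names here are illustrative assumptions; the actual measure is described in [5].

import numpy as np

def merge_cost(seg_a, seg_b):
    """Hypothetical covariance-based cost: loss incurred when the frames of
    two segments are modelled by a single full-covariance Gaussian."""
    def logdet_cov(x):
        return np.linalg.slogdet(np.cov(x, rowvar=False))[1]
    pooled = np.vstack([seg_a, seg_b])
    return (len(pooled) * logdet_cov(pooled)
            - len(seg_a) * logdet_cov(seg_a)
            - len(seg_b) * logdet_cov(seg_b))

def cluster_segments(segments, target_clusters):
    """Greedy bottom-up clustering of (frames x dims) segments into speakers."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > target_clusters:
        costs = [(merge_cost(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(costs)                      # merge the cheapest pair
        clusters[i] = np.vstack([clusters[i], clusters[j]])
        del clusters[j]
    return clusters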

 
January 1998 -
Speech and IR development for TREC-7

Following on from the Hub4 recognition development, our focus changed to retrieval issues and preparation for the 1998 TREC-7 Spoken Document Retrieval (SDR) evaluation. This consisted of two parts: firstly, the automatic transcription of 100 hours of broadcast news data, and secondly, the retrieval of documents relevant to 23 natural-language queries. For the first stage, some work was done by Entropic to speed up the decoder without significantly increasing the errors incurred. Our new segmenter (see [2]) and covariance-based clustering strategy (see [5]) were used to produce relatively homogeneous clusters. Gender-dependent modelling, adaptation and 4-gram lattice rescoring were then used to obtain the final transcription. This system ran in approximately 50 times real time and gave a word error rate of around 25% on the 50 hours of TREC-6 SDR data, which we subsequently used for information retrieval development work.

Work on information retrieval began by creating a pool of 60 queries of our own and generating manual relevance assessments. This gave us the opportunity to try out different retrieval methods and examine their relative benefits and disadvantages. The basic system began by splitting the transcriptions into pre-marked "stories", then removing words given in a stop-word list; these are mainly function words such as "a" and "the". Several small features were added at this stage to remove unfinished words and to deal with abbreviations (e.g. "U.S.A."), double words (e.g. and/or) and words containing punctuation (e.g. Martha's). A mapping list was included to standardise the spellings of words which are frequently spelt incorrectly. A Porter stemmer was then applied to strip the standard suffixes from words (e.g. managing, manager, managed -> manage). This algorithm is well established, but known to make errors both of conflation (news -> new) and of omission (Californian/California). A dictionary of known problems was built to form a stemmer-exceptions list and was applied using the mapping functionality. The combined weight formula (see [1]) was used to score the documents.
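A minimal sketch of this pre-processing and scoring chain is given below. The stop list, mapping and exception entries are invented for illustration, the stemmer is passed in as a function, and the combined weight is written in the usual Robertson/Sparck Jones form with illustrative constants; the exact formula and tuning are those of [1] and [8].

import math
import re

STOP_WORDS = {"a", "an", "and", "of", "the", "to"}               # tiny illustrative stop list
SPELLING_MAP = {"allbright": "albright"}                          # hypothetical mapping entry
STEM_EXCEPTIONS = {"news": "news", "californian": "california"}   # stemmer-exceptions list

def preprocess(text, stem):
    """Stop, map and stem a transcription; `stem` is e.g. a Porter stemmer function."""
    terms = []
    for token in re.findall(r"[a-z']+", text.lower()):
        token = SPELLING_MAP.get(token, token)
        if token in STOP_WORDS:
            continue
        terms.append(STEM_EXCEPTIONS.get(token, stem(token)))
    return terms

def combined_weight(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Combined weight cw(term, doc); k1 and b are illustrative constants."""
    cfw = math.log(n_docs / doc_freq)           # collection frequency weight
    ndl = doc_len / avg_doc_len                 # normalised document length
    return cfw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)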

Additional work looked into weighting the query terms depending on their part of speech, found with a Brill tagger. Another development was statistical pre-search query expansion, where a term x is expanded with a term y if y occurs in more documents than x and x occurs more often when y is present than when it is absent; y can be seen as a statistical hyponym of x. The original combined weight, weighted by the part-of-speech (POS) weight, is replaced by an expanded combined weight:
ecw(x,j) = POS(x) * sum_y [ P(x|y) * cw(y,j) ] / sum_y P(x|y)
Finally, term-position indexing was implemented to allow phrasal terms to be added to the query. These were found by locating unstopped noun compounds or adjective-noun groups in the query and weighting them by a tuned bigram weight. All of these measures were shown to increase performance on our own development query set (see [8] for more details).
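The expanded combined weight defined above translates almost directly into code; a sketch is shown below, assuming the basic combined weight cw and the expansion probabilities P(x|y) are already available.

def expanded_combined_weight(pos_weight, expansions, cw, doc_id):
    """ecw(x, j) = POS(x) * sum_y P(x|y) cw(y, j) / sum_y P(x|y).

    pos_weight -- part-of-speech weight POS(x) of the original query term x
    expansions -- dict mapping each expansion term y (x itself can be included
                  with P(x|x) = 1) to P(x|y)
    cw         -- function cw(term, doc_id) giving the combined weight
    """
    numerator = sum(p * cw(y, doc_id) for y, p in expansions.items())
    denominator = sum(expansions.values())
    return pos_weight * numerator / denominator if denominator else 0.0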

 
May 1998 -
The TREC-7 evaluation system

We ran the system described above on the 100 hours of TREC-7 SDR data. The overall word error rate was 24.8%, the lowest in the TREC-7 SDR evaluation; the baseline transcriptions provided by NIST had error rates of 33% and 42%. Transcriptions from the other competing sites, with error rates ranging from 29% to 66%, were also available for us to run our retrieval system on.

We used all the retrieval techniques described above in the TREC-7 evaluation. A detailed breakdown of the effects of each of these techniques on the results can be found in [7]. Our retrieval system performed well, and the evaluation showed that for this relatively small task (in IR terms) the degradation of performance with word error was small (a drop of 6% in mean average precision at a 25% word error rate). We introduced the concept of a Term Error Rate (TER), which evaluates transcription errors from a retrieval-based perspective ([8]), and showed that the pre-processed term error rate (PTER) varies approximately linearly with mean average precision for the TREC-7 data using our retrieval system (see [7]).
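A plausible bag-of-terms reading of the measure is sketched below: word order is ignored and only per-term occurrence counts matter. This formulation is illustrative; the precise definition used in the project is given in [8].

from collections import Counter

def term_error_rate(reference_terms, hypothesis_terms):
    """Illustrative term error rate: mismatched term counts over reference length."""
    ref = Counter(reference_terms)
    hyp = Counter(hypothesis_terms)
    errors = sum(abs(ref[t] - hyp[t]) for t in set(ref) | set(hyp))
    return errors / sum(ref.values())

# PTER is the same quantity computed after both transcriptions have been
# stopped, mapped and stemmed, so that only retrieval-relevant terms count.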

 
November 1998 -
Improving the Probabilistic IR Model

After TREC-7, work in IR focussed on improving the probabilistic retrieval model. The foundations of the theory were expanded (see [9]) and a new concept of partially ordered sets (posets) was introduced (see [10]). This allowed terms semantically related to the query to be added automatically, using information about the relationships within the posets. For example, one set of posets was defined by extracting from a travel WWW server the names, states and countries of all cities with an international airport. If a query contained a word in this poset, then a new term was defined corresponding to the original word and all its sublocations. This allows, for example, documents about "Washington" to be retrieved when the query refers simply to the "U.S.". Another semantic poset was built using unambiguous nouns from WordNet. These posets were shown to increase mean average precision for all sets of transcriptions for the TREC-7 SDR evaluation task (see [10, 14]).
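The geographic poset lookup can be pictured as below; the place names and the flat dictionary encoding of the partial order are purely illustrative, and the actual posets were built from the travel server and WordNet data as described in [10].

# Each key is a location; its values are the locations immediately below it
# in the partial order (the contents here are invented for illustration).
GEO_POSET = {
    "u.s.": ["washington", "california"],
    "california": ["los angeles", "san francisco"],
}

def sublocations(term, poset):
    """All terms lying below `term` in the partial order (transitive closure)."""
    below = []
    for child in poset.get(term, []):
        below.append(child)
        below.extend(sublocations(child, poset))
    return below

def expand_query(query_terms, poset):
    """For each query term found in the poset, add a new compound term made of
    the original word and all of its sublocations."""
    expanded = list(query_terms)
    for term in query_terms:
        subs = sublocations(term, poset)
        if subs:
            expanded.append((term, *subs))   # e.g. ("u.s.", "washington", ...)
    return expanded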

Experiments using blind relevance feedback were also conducted. Results showed that if a (large) parallel source of data is used for the feedback, the average precision increases on all transcriptions, but if the (small) test collection itself is used, then average precision only increases for the more accurate transcriptions ([14]). With all these improvements, the difference in retrieval performance between the manual transcriptions and our own was reduced to 1% ([10]). A detailed set of experiments and subsequent analysis is presented in [17].
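A rough sketch of one round of blind relevance feedback is given below. The top-document and new-term counts are illustrative, and candidate terms are ranked by a Robertson-style offer weight rather than by whatever exact selection value was tuned in [14] and [17].

import math

def blind_relevance_feedback(query_terms, retrieve, doc_terms, doc_freq, n_docs,
                             top_docs=10, new_terms=5):
    """Expand a query with terms drawn from the top-ranked (assumed relevant) documents.

    retrieve(terms) -> ranked list of document ids from the feedback collection
    doc_terms[d]    -> set of terms in document d
    doc_freq[t]     -> number of documents in the feedback collection containing t
    """
    assumed_relevant = retrieve(query_terms)[:top_docs]
    scores = {}
    for t in {t for d in assumed_relevant for t in doc_terms[d]}:
        if t in query_terms:
            continue
        r = sum(t in doc_terms[d] for d in assumed_relevant)   # "relevant" docs containing t
        n = doc_freq[t]
        relevance_weight = math.log(
            ((r + 0.5) * (n_docs - n - top_docs + r + 0.5)) /
            ((n - r + 0.5) * (top_docs - r + 0.5)))
        scores[t] = r * relevance_weight          # offer weight used to rank candidates
    best = sorted(scores, key=scores.get, reverse=True)[:new_terms]
    return list(query_terms) + best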

 
November 1998 -
Hub-4 evaluation

We also took part in the November 1998 DARPA/NIST Hub4 evaluation of broadcast news transcription of US television and radio shows. For the first time there were two sub-tasks: the first was to derive the most accurate transcription of the data without regard to computational effort (the Hub task), and the second to maximise transcription accuracy subject to the constraint that processing take less than ten times real time. Systems were entered for both tasks, the 10xRT system being developed in conjunction with Entropic Ltd.

Additions to the unconstrained system relative to the one used in the 1997 evaluation included vocal-tract length normalisation; cluster-based cepstral mean and variance normalisation; the use of twice as much acoustic training data; an improved language model using a merged interpolated model and a more appropriate training data pool; and improved adaptation using a full variance transformation in addition to standard MLLR. The final HTK unconstrained-compute system gave an overall word error rate of 13.8% on the complete evaluation data set (the difference from the best system was not statistically significant) and 7.8% on the baseline F0 broadcast speech condition (the lowest error rate), representing a 13% relative reduction in error rate over the 1997 HTK Hub4 system. This unconstrained-compute system ran in about 300xRT on a Sun Ultra2. Further details of the systems developed can be found in [11] and [15].

The 10xRT system was based on the 1997 evaluation system but discarded the quinphone stage, while also using the enlarged training set and improved language modelling. The same overall two-pass strategy as in the full system was employed, with a highly optimised decoder supplied by Entropic which allowed the system to run in less than 10xRT on a 450MHz Pentium II based computer. The 1998 10xRT system gave the same error rate on the 1997 evaluation data (15.8%) as the full 1997 HTK system, and 16.1% error on the 1998 evaluation set, which was the lowest error rate for a system running in less than 10xRT by a statistically significant margin. Further details of the 10xRT system are given in [12] and [15].

The complete results from the evaluation can be browsed at ftp://jaguar.ncsl.nist.gov/csr98/h4e_98_official_scores_990119

 
March 99 -
Modelling Speakers and Speaking Rates

Further work in speaker clustering showed that by modifying the clustering system used in recognition, the automatically generated segments could be split into speaker groups quite successfully ([14]). This could allow, for example, all speech said by the announcer in a news show to be extracted and presented to the user as a summary of the main news of the day.

The correlation between an inter-frame distance measure and a phone-based notion of speaking rate was also investigated. It was shown that it is possible to build MAP estimators to distinguish between fast and slow phonemes. This speaking-rate information was then incorporated into a standard HMM structure by adding the distance to the feature vector and changing the topology of the HMM. A slight overall improvement in recognition was shown, due to a strong improvement on the spontaneous and non-native speaker conditions (see [16] for more details).
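As a rough illustration of the feature-level part of this, the fragment below appends a per-frame inter-frame distance to an existing feature matrix; the actual distance measure, the MAP estimators and the HMM topology changes are those described in [16].

import numpy as np

def add_interframe_distance(features):
    """Append a crude rate-of-speech cue (Euclidean distance between successive
    frames) as an extra dimension of a (frames x dims) feature matrix."""
    deltas = np.diff(features, axis=0, prepend=features[:1])
    distance = np.linalg.norm(deltas, axis=1, keepdims=True)
    return np.hstack([features, distance])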

 
July 99 -
The TREC-8 evaluation

Our TREC-8 SDR work built on our TREC-7 system, but significantly extended both the audio processing and the retrieval parts of the system. The ASR side was extended to provide extra information aimed at helping retrieval and the IR side was extended to be more focussed on the problems of retrieval from automatically transcribed broadcast news.

In ASR, the recogniser speed was greatly increased (from 50xRT to 13xRT) thanks to Entropic's highly optimised decoder. This also meant a larger vocabulary of 108k words could be used, reducing the problems from out-of-vocabulary information-bearing words in the larger 500+ hour TREC-8 corpus. The error rate was also reduced, by about 10% relative, to 15.7% on the 1998 Hub4 evaluation data, due to improved acoustic and language models and the much larger vocabulary. Our word error rate on the 10-hour scored subset of TREC-8 SDR was 20.6%, the lowest in the evaluation.

A novel algorithm was also developed for detecting commercials by looking for and rejecting repeated audio. The algorithm removes both whole commercials and within-broadcast jingles, and has the advantage not only of deleting material in which the user is (typically) not interested, but also of providing some structure for the broadcasts. It removed 65% of the commercials in A.B.C. news shows, whilst erroneously removing only 28 seconds (0.02%) of news stories. Since the commercial detection stage removed around 8% of the entire audio (of which 97.8% was marked as non-story information in the reference), it also conveniently reduced the amount of data the transcription engine needed to process, by 43 hours.
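The idea of rejecting repeated audio can be pictured as follows: each show is reduced to coarse window signatures, and any window whose signature has already been seen, in the same show or another, is marked as a putative commercial or jingle. The signature, window sizes and matching rule below are illustrative stand-ins for the scheme actually used.

import numpy as np

def window_signatures(features, win=500, step=250):
    """Coarse signature (a quantised mean vector) for each window of a show's
    (frames x dims) feature matrix; window and step sizes are illustrative."""
    for start in range(0, len(features) - win + 1, step):
        yield start, tuple(np.round(features[start:start + win].mean(axis=0), 1))

def find_repeated_audio(shows):
    """Return (show_id, frame_offset) pairs whose signature has occurred before,
    either in another show (commercials) or earlier in the same one (jingles)."""
    seen = set()
    repeats = []
    for show_id, features in shows.items():
        for start, signature in window_signatures(features):
            if signature in seen:
                repeats.append((show_id, start))
            else:
                seen.add(signature)
    return repeats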

In IR, we augmented our TREC-7 SDR baseline system with five different techniques.

Both the individual application and, more importantly, the combination of these five techniques gave excellent results on the TREC-7 SDR test data (mean average precision of 60% on HTK transcriptions).

We also built a system for the story-boundary-unknown evaluation. The structure within the broadcast supplied by the commercial detector, along with the audio segmentation, was used to force story breaks. We then applied a sliding-window technique and performed retrieval on the pseudo-stories defined by the windows, using all the methods listed apart from document feedback. Pseudo-stories nearby in time were then combined to form the retrieved documents.
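A rough sketch of the windowing and recombination steps is given below; the window length, step size and merging rule are illustrative guesses rather than the settings actually evaluated.

def pseudo_stories(timed_words, win=250, step=125):
    """Overlapping windows over a time-ordered list of (time, word) pairs,
    yielding (start_time, window_terms) pseudo-stories."""
    for start in range(0, max(1, len(timed_words) - win + 1), step):
        chunk = timed_words[start:start + win]
        yield chunk[0][0], [word for _, word in chunk]

def merge_nearby(scored_windows, max_gap=30.0):
    """Combine retrieved windows whose start times fall within max_gap seconds,
    keeping the best score, so that each news story is returned once."""
    merged = []
    for start_time, score in sorted(scored_windows):
        if merged and start_time - merged[-1][0] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], score))
        else:
            merged.append((start_time, score))
    return merged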

 
November 99 -
Analysing Results from the TREC-8 evaluation

The overall results for the TREC-8 SDR evaluation can be found in the TREC-8 proceedings. We achieved an average precision of 55.29% on our own transcriptions, the best in the evaluation; the corresponding figure for the case where story boundaries were unknown was 41.47%. Work then focussed on analysing the effects of all the individual components on the overall result. For the story-known case, certain devices, such as blind relevance feedback on both documents and queries using the parallel corpus, worked very well, whereas others did not perform quite as well on the much larger collection as had been hoped (see [18]). Slight modifications to the term error rate formulae introduced in [7] were made to allow the more complicated retrieval process to be modelled, and the fall-off of performance with recogniser error rate was again found to be gentle.

Work on the story-unknown case showed that the automatic elimination of commercials increased the average precision of the overall system, whilst also reducing the amount of data to be recognised by 8%. This improvement could also be achieved on the transcriptions from other sites by applying a filter to remove the "commercials" after retrieval but before scoring. Further experimentation reported in [21] showed that the performance of the story-unknown system could be increased to 46.5% by slight modification of the retrieval strategies and parameters.

Work on the direct audio search method developed for the evaluation showed that the technique was able to find exact matches of audio with 100% accuracy. The search can run hundreds of times faster than real time, and requires only a fraction of a second of cue audio (see [19] for more details).

 
February 2000 -
Investigating Out of Vocabulary Effects

A key issue in spoken document retrieval is the effect of words outside the speech recogniser's vocabulary (out-of-vocabulary, or OOV, words). If OOV words occur in the spoken documents, recognition errors result, while OOV terms in written queries cannot match any document in the collection. We studied the impact of OOV effects in the context of the TREC-8 SDR corpus and query set. We transcribed the 500 hours of TREC-8 broadcast news material with five fast recognisers which differed only in vocabulary size (55k, 27k, 13k, 6k and 3k words) and therefore covered a large range of OOV rates; the automatic transcription of this volume of material took a total of 10 CPU months.

We then ran a series of IR experiments on each of the resulting document collections, using both a baseline IR system and IR systems with query and document expansion. Query expansion used blind relevance feedback from the document collection itself and/or from a large parallel collection of newspaper texts. Document expansion used the spoken documents as queries to the parallel collection to add terms to the original documents, and can directly compensate for OOV effects. The experiments showed that the use of parallel corpora for query and document expansion can compensate for the effects of OOV words to a large extent at moderate OOV rates. These results imply that, at least for the type of queries and documents used in TREC-8, further compensation for OOV words using e.g. phone lattices is not needed. This work is described in a paper presented at SIGIR 2000 [23].
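A minimal sketch of the document expansion step is given below, assuming a retrieval function over the parallel newspaper collection is already available; the helper names and parameter values are illustrative.

from collections import Counter

def expand_document(doc_terms, parallel_retrieve, parallel_doc_terms,
                    top_docs=10, extra_terms=20):
    """Use a (possibly errorful) spoken document as a query against a parallel
    text collection and add the most common terms from its nearest neighbours.

    parallel_retrieve(terms) -> ranked ids of parallel (newspaper) documents
    parallel_doc_terms[d]    -> list of terms in parallel document d
    """
    neighbours = parallel_retrieve(doc_terms)[:top_docs]
    pool = Counter(t for d in neighbours for t in parallel_doc_terms[d])
    added = [t for t, _ in pool.most_common() if t not in doc_terms][:extra_terms]
    # Terms that were OOV for the recogniser, and so absent from the ASR output,
    # can re-enter the document representation here.
    return list(doc_terms) + added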

 
March 2000 -
Consolidating the MDR demo

The MDR demo system, which downloads RealAudio from the WWW, automatically transcribes it and allows the user to search the resulting database, was expanded and improved. Both filtering with fixed queries and conventional retrieval with new queries were allowed, with the user able to browse the transcripts as well as listen to the original audio. Interactive query expansion using semantic posets and relevance feedback was also included, whilst keyword highlighting and display of the best-matching extract allowed the user to find relevant passages more quickly. The system was also extended to form artificial story boundaries for convenient retrieval when the original audio was not pre-segmented by topic. The demo was presented at both RIAO 2000 [20] and SIGIR 2000 [22] and attracted considerable interest (see the RIAO poster for an example and some screen shots).

 
August 2000 -
The TREC-9 Evaluation

The TREC-9 SDR evaluation was similar to the TREC-8 evaluation, using the same document collection and main tasks, but with a few subtle differences. The story-unknown task became the main focus, and the use of non-lexical, automatically derived information (such as the gender of the speaker or the presence of music) was allowed in retrieval for the first time. We generated the following non-lexical tags: segment, gender, bandwidth, high-energy, no-speech (silence, noise or music), repeats, and commercials. We focussed on improving the performance of the system and obtained over 51% AveP for the story-unknown case and over 60% for the story-known case in trials using the TREC-8 queries (cf. 41.5% and 55.3% respectively in the TREC-8 evaluation).

The team's results in TREC-9 were very good, clearly demonstrating effective retrieval performance on the story-unknown task and showing that spoken document retrieval, even when the recognition word error rate is non-trivial, is a perfectly practical proposition. The final project publications include the TREC-9 paper [24], two papers in the International Journal of Speech Technology (one about the story-unknown retrieval system [26] and the other about the MDR demo system [25]), and a comprehensive technical report summarising the experiments [27].


-

This work is funded by EPSRC grant GR/L49611

-

This page is maintained by Sue Johnson, sej28@eng.cam.ac.uk
Sun 7 Oct 2001