
HTK Rich Audio Transcription
Project Summary and Aims

The HTK Rich Audio Transcription project is funded by the DARPA Effective, Affordable, Reusable Speech-to-Text (EARS) programme for 5 years starting in May 2002. The project aims to advance the state of the art significantly while tackling the hardest speech recognition challenges, including the transcription of broadcast news and telephone conversations. A wide range of research areas will be pursued, aimed both at reducing the word error rate of conventional speech recognition systems and at developing an enriched output format with additional acoustic and linguistic metadata.
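For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the recogniser output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the project's own scoring tool; the function name and the whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenisation is plain whitespace splitting, with no
    text normalisation of the kind a real scoring pipeline would apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1.0 when the hypothesis is much longer than the reference.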

Staff members associated with the project are Phil Woodland, Mark Gales, and Thomas Hain.

Research on this project may be split into three broad tasks detailed below.

Task 1: Core Algorithm Development

The objective of this task is to improve core speech recognition performance by improving existing techniques and developing new, generally applicable ones. The work will build on the current expertise of the Speech Group in research and development of large vocabulary continuous speech recognition (LVCSR) systems. Some of the technical areas to be addressed are:

Acoustic model training: techniques involving adaptive and discriminative training will be used to make better use of available training data;
Acoustic model adaptation: techniques for improved acoustic model adaptation, using adaptive training techniques and new methods of appropriately factoring adaptation transformations;
Language model adaptation: language model adaptation techniques to improve the representation of domain-specific language information for particular tasks;
Cost of training data: use of low-cost, partially correct training data transcriptions to reduce the cost of training data (lightly supervised training), and investigation of the interactions between discriminative training and lightly supervised training;
System building and resource allocation: approaches to determining appropriate combinations of techniques for specific domains and optimal use of decoder resources will be examined.

This is the major task in the project. It is expected that there will be 3 Research Associates and 3 PhD Research Students working on this task.
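To make the language model adaptation item above more concrete, one widely used approach is linear interpolation of a background model with a model estimated from in-domain text. The sketch below illustrates that general idea with add-one-smoothed unigram models; it is an assumption chosen for illustration and is not stated to be the method used in this project. All function names are hypothetical.

```python
from collections import Counter


def unigram_probs(text: str, vocab: list[str]) -> dict[str, float]:
    """Add-one-smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(text.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def interpolate(background: dict[str, float],
                domain: dict[str, float],
                domain_weight: float) -> dict[str, float]:
    """Mix two distributions over the same vocabulary.

    domain_weight is the interpolation weight given to the in-domain model;
    the background model receives 1 - domain_weight.
    """
    if not 0.0 <= domain_weight <= 1.0:
        raise ValueError("domain_weight must lie in [0, 1]")
    return {
        w: domain_weight * domain[w] + (1.0 - domain_weight) * background[w]
        for w in background
    }


if __name__ == "__main__":
    # Toy texts, invented purely for this example.
    background_text = "the news today covered the election and the economy"
    domain_text = "call me on the phone call me later"
    vocab = sorted(set(background_text.split()) | set(domain_text.split()))

    background = unigram_probs(background_text, vocab)
    domain = unigram_probs(domain_text, vocab)
    adapted = interpolate(background, domain, domain_weight=0.5)
    print(background["call"], adapted["call"], domain["call"])
```

In practice the interpolation weight would be tuned on held-out in-domain data rather than fixed by hand.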


Task 2: Metadata Generation

This task will examine the automatic generation of acoustic and linguistic metadata. This metadata will then be used to generate enriched transcriptions that contain the identity of the speaker, the acoustic environment and channel conditions, and topic information. Furthermore, the text will contain punctuation and capitalisation information. The approaches adopted will both extend existing techniques developed in Cambridge and introduce novel schemes. Some of the technical areas to be addressed are:

Acoustic metadata: techniques based on factoring the acoustic signal into individual sources of variability, allowing high-quality, robust metadata labels to be generated.
Linguistic metadata: capitalisation and punctuation of the hypothesised transcriptions, based on combinations of statistical language models and prosodic models, will be examined.
Topic tracking: topic boundaries and clusters based on optimally derived language model training data clusters will be examined.

It is expected that there will be 1 Research Associate and 1 PhD Research Student working on this task.


Task 3: Public HTK Development

This task aims to develop and enhance the core HTK software toolkit available via the HTK Website. The HTK software has thousands of active users worldwide and is widely used in both research and educational environments. The functionality of the existing code-base is to be extended to incorporate the latest developments in speech recognition research. This will involve integrating existing "internal" software into the publicly available code and writing additional code as new techniques are developed. New functionality that is expected to be added includes improved adaptation, a large vocabulary decoder and discriminative training.

Apart from new functionality, "recipes" and infrastructure will be developed for a number of standard tasks. Infrastructure support will include the distribution of word lattices, language models and acoustic models.

It is expected that there will be 1 Research Associate working on this task.


[University of Cambridge | CUED | SVR Group ]

Last Modified: Monday April 15 2002
