The HTK Rich Audio Transcription project is funded by the DARPA Effective, Affordable, Reusable Speech-to-Text (EARS) programme for five years starting in May 2002. The aim of the project is to advance the state of the art very significantly while tackling the hardest speech recognition challenges, including the transcription of broadcast news and telephone conversations. A wide range of research areas will be pursued, aimed both at improving the word error rate of conventional speech recognition systems and at developing an enriched output format with additional acoustic and linguistic metadata.
Staff members associated with the project are Phil Woodland, Mark Gales, and Thomas Hain.
Research on this project may be split into three broad tasks detailed below.
Task 1: Core Algorithm Development
The objective of this task is to improve core speech recognition performance by refining existing techniques and developing new, generally applicable ones. The work will build on the Speech Group's current expertise in the research and development of large-vocabulary continuous speech recognition (LVCSR) systems. Some of the technical areas to be addressed are:
Task 2: Metadata Generation
This task will examine the automatic generation of acoustic and linguistic metadata. This metadata will then be used to generate enriched transcriptions which contain the identity of the speaker, the acoustic environment and channel conditions, and topic information. Furthermore, the text will contain punctuation and capitalisation. The approaches adopted will be extensions of existing techniques developed in Cambridge as well as novel schemes. Some of the technical areas to be addressed are:
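The enriched transcription described above can be pictured as a structured record attached to each segment of the output. The sketch below is purely illustrative: the `EnrichedSegment` class and its field names are hypothetical and do not correspond to any actual HTK output specification; they simply mirror the kinds of metadata listed in this task.

```python
from dataclasses import dataclass

# Hypothetical record for one segment of an enriched transcription.
# Field names are illustrative only, mirroring the metadata described
# above (speaker identity, acoustic environment, channel conditions,
# topic, plus punctuated and capitalised text); they are not a real
# HTK format.
@dataclass
class EnrichedSegment:
    start: float       # segment start time in seconds
    end: float         # segment end time in seconds
    text: str          # punctuated, capitalised word string
    speaker: str       # speaker identity label
    environment: str   # acoustic environment, e.g. "studio"
    channel: str       # channel condition, e.g. "broadcast"
    topic: str         # topic label for the segment

seg = EnrichedSegment(
    start=12.48, end=15.90,
    text="Good evening, and welcome to the news.",
    speaker="spk01", environment="studio",
    channel="broadcast", topic="headlines",
)
print(seg.speaker, seg.text)
```

A downstream consumer (an indexer or browser, say) could then search on the metadata fields rather than on the raw word string alone.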
Task 3: Public HTK Development
This task aims to develop and enhance the core HTK software toolkit available via the HTK Website. The HTK software has thousands of active users worldwide and is widely used in both research and educational environments. The functionality of the existing code base is to be extended to incorporate the latest developments in speech recognition research. This will involve integrating existing "internal" software into the publicly available code and writing additional code as new techniques are developed. New functionality that is expected to be added includes improved adaptation, a large-vocabulary decoder and discriminative training.
Apart from new functionality, "recipes" and infrastructure will be developed for a number of standard tasks. Infrastructure support will include the distribution of word lattices, language models and acoustic models.
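For context, HTK word lattices are conventionally stored in HTK's Standard Lattice Format (SLF): a header of `field=value` pairs, followed by node (`I=`) and link (`J=`) definitions. A minimal illustrative two-word lattice might look roughly like the fragment below (the utterance, times and scores are invented for illustration):

```
VERSION=1.0
UTTERANCE=example
lmscale=12.0  wdpenalty=0.0
N=3   L=2
I=0  t=0.00
I=1  t=0.32
I=2  t=0.78
J=0  S=0  E=1  W=hello  a=-310.2  l=-2.41
J=1  S=1  E=2  W=world  a=-405.8  l=-3.07
```

Here `N` and `L` give the node and link counts, `t` is a node's time, `S` and `E` are a link's start and end nodes, and `a` and `l` are acoustic and language-model log scores for the link.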
It is expected that one Research Associate will work on this task.