HTK Rich Audio Transcription

High Level Summary

The HTK Rich Audio Transcription project is funded by the DARPA Effective, Affordable, Reusable Speech-to-text (EARS) programme for five years, starting in May 2002. The project aims to advance the state of the art significantly while tackling the hardest speech recognition challenges, including the transcription of broadcast news and telephone conversations. A wide range of research areas will be pursued, aimed both at reducing the word error rate of conventional speech recognition systems and at developing an enriched output format with additional acoustic and linguistic metadata.

Research is split into three broad tasks:

Task 1: Core Algorithm Development: To improve existing speech recognition techniques and develop new, generally applicable ones. This will build on previous work on Speech-To-Text transcription at CUED.
Task 2: Metadata Generation: To generate enriched transcriptions which contain the identity of the speaker, acoustic environment, channel conditions and some linguistic mark-up, such as the location of sentence-like boundaries or disfluent speech.
Task 3: Public HTK Development: To develop and enhance the core HTK software toolkit available via the HTK Website.

Personnel

Personnel working on this project are:

Staff:
Prof Phil Woodland (pcw@eng.cam.ac.uk) [Principal Investigator]
Dr. Mark Gales (mjfg@eng.cam.ac.uk) [University Lecturer]

RAs:
Gunnar Evermann (ge204@eng.cam.ac.uk) [LVCSR search and HTK development and maintenance]
Dr. Bin Jia (bj214@eng.cam.ac.uk) [Conversational Telephone Speech System, acoustic modelling (English/Chinese), language modelling (Chinese)]
Dr. Do Yeong Kim (dyk21@eng.cam.ac.uk) [Broadcast News Systems, acoustic modelling and adaptation]
Antti-Veikko Rosti (avir2@eng.cam.ac.uk) [CTS segmentation]
Dr. Marcus Tomalin (mt126@eng.cam.ac.uk) [Metadata, Slash-unit detection]
Sue Tranter (formerly Johnson) (sej28@eng.cam.ac.uk) [Metadata, segmentation, speaker diarisation]

PhD Students:
H.Y. (Ricky) Chan (hyc27@eng.cam.ac.uk) [Lightly supervised acoustic modeling for LVCSR]
Xunying Liu (Andrew) (xl207@eng.cam.ac.uk) [Model complexity control and subspace projection schemes]
David Mrva (dm312@eng.cam.ac.uk) [Language Modelling]
Khe Chai Sim (kcs23@eng.cam.ac.uk) [Extended Maximum Likelihood Linear Transform (E-MLLT)]
Lan Wang (lw256@eng.cam.ac.uk) [Discriminative adaptive training]
Kai Yu (ky219@eng.cam.ac.uk) [Acoustic/speaker factorization and segmentation]

Former Members of CUED EARS team:
Dr. Thomas Hain (th223@eng.cam.ac.uk) [University Lecturer] (now at Sheffield University)
Dan Povey (dp10006@eng.cam.ac.uk) [Discriminative training] (now at IBM, dpovey@us.ibm.com)
Dr. Srinivasan Umesh (su216@eng.cam.ac.uk) [VTLN/Acoustic Modelling] (returned to IIT)
Kit Thambiratnam (ajkt2@eng.cam.ac.uk) [Broadcast News segmentation and clustering] (returned to QUT)

This page is maintained by Sue Tranter (sej28@eng.cam.ac.uk).
Thurs 29th April 2004