Mark Gales - 4th Year Projects

Here are the projects that will be offered for the year 2011-2012. Please look at the papers and associated links for more information.

Compressive Sensing for Speech Recognition

Transcribing YouTube: Who Spoke What When?

Infinite Gaussian Mixture Models for Speech Recognition

If you are interested in any of these projects, it is important that you contact me so that we can discuss the work involved. For any queries, or requests for further details, please contact me by email: mjfg@eng.cam.ac.uk

There are some notes on statistical pattern processing on-line.

Compressive Sensing for Speech Recognition (F-MJFG-1)

the time-varying nature of speech data;
the large variability in speech across speakers, accents and noise conditions;
the need to train on large amounts of training data.

This project combines parametric and non-parametric approaches to speech recognition to address all of these problems. The work will use parametric models to map the variable-length speech data to a fixed-length feature vector, a score. This handles the time-varying aspects of the acoustic signal. By appropriately modifying the generative model, it is possible to handle changes in the noise conditions and speaker.
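As a rough illustration of mapping variable-length data to a fixed-length score vector, the sketch below averages per-frame log-likelihoods under one generative model per class. It is a minimal sketch, not the project's method: a single diagonal-covariance Gaussian stands in for each generative model, and the names `gaussian_log_pdf`, `score_vector`, `models`, `short_utt` and `long_utt` are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def score_vector(frames, class_models):
    """Map a variable-length list of frames to one score per class model.

    Each score is the average per-frame log-likelihood, so the output
    length depends only on the number of models, not on len(frames).
    """
    if not frames:
        raise ValueError("need at least one frame")
    return [
        sum(gaussian_log_pdf(x, mean, var) for x in frames) / len(frames)
        for mean, var in class_models
    ]


# Hypothetical two-class models: (mean, variance) per class, 2-D features.
models = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]

# Two utterances of different lengths (2 frames and 5 frames).
short_utt = [[0.1, -0.2], [0.0, 0.3]]
long_utt = [[2.9, 3.1], [3.2, 2.8], [3.0, 3.0], [2.7, 3.3], [3.1, 2.9]]

short_scores = score_vector(short_utt, models)
long_scores = score_vector(long_utt, models)
print(len(short_scores), len(long_scores))
```

Both utterances produce a vector with one entry per model, which is what allows a fixed-dimension classifier to be applied afterwards.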

In previous work, SVMs have been used to generate a sparse, fixed representation for the decision boundaries. This project will compare this fixed sparse representation with a sparse representation that depends on the current word or sentence being evaluated. This will make use of the recently proposed Bayesian Compressive Sensing approach.
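To make the idea of a test-dependent sparse representation concrete, the sketch below approximates each test vector as a weighted sum of a few training exemplars, so that different test items select different exemplars. This is only an illustration under stated assumptions: plain matching pursuit is swapped in for Bayesian Compressive Sensing, and the exemplars, test vectors and function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def matching_pursuit(target, atoms, n_nonzero):
    """Greedy sparse approximation of target as a weighted sum of atoms.

    atoms are assumed to be unit-norm. Returns one weight per atom,
    with at most n_nonzero weights different from zero.
    """
    weights = [0.0] * len(atoms)
    residual = list(target)
    for _ in range(n_nonzero):
        # Pick the atom most correlated with what is still unexplained.
        scores = [dot(atom, residual) for atom in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        weights[best] += scores[best]
        residual = [r - scores[best] * a for r, a in zip(residual, atoms[best])]
    return weights


# Hypothetical training exemplars, normalised to unit length.
exemplars = [
    normalise(v)
    for v in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0])
]

test_a = [2.0, 2.0, 0.0]  # best explained by the third exemplar
test_b = [0.0, 0.1, 3.0]  # best explained by the fourth exemplar

weights_a = matching_pursuit(test_a, exemplars, n_nonzero=1)
weights_b = matching_pursuit(test_b, exemplars, n_nonzero=1)
print(weights_a)
print(weights_b)
```

In contrast to a fixed set of support vectors, the non-zero weights here change with each test vector.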

The performance of the system will be evaluated against existing SVM systems and standard speech recognition systems.

M.J.F. Gales and F. Flego. (2010) Discriminative Classifiers with Adaptive Kernels for Noise Robust Speech Recognition. Computer Speech and Language, 2010.
T.N. Sainath, A. Carmi, D. Kanevsky and B. Ramabhadran. (2010) Bayesian Compressive Sensing for Phonetic Classification

Transcribing YouTube: Who Spoke What When? (F-MJFG-2)

This project will apply the state-of-the-art speech recognition systems developed in the Speech Group to YouTube data from more challenging sources. The data supplied by Google consists of audio from a number of election speeches from the 2008 US Presidential election. This data has a number of problems associated with it, including a wide range of background noise and a highly spontaneous speaking style.

The project aims to extract three forms of information from the audio stream:

transcription of the words spoken by an individual speaker;
the identity of the speaker;
timing information of when that speaker was talking.

This allows a more informative transcription to be generated.
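The three forms of information above can be combined into a speaker-attributed transcript. The following is a minimal sketch, assuming the recogniser returns word timings and a separate speaker segmentation returns speaker turns; the data layout, the midpoint rule and all names are hypothetical and are not taken from the existing system.

```python
def attribute_words(words, speaker_turns):
    """Attach a speaker label to each recognised word.

    words: list of (start_sec, end_sec, word) from the recogniser.
    speaker_turns: list of (start_sec, end_sec, speaker).
    A word is assigned to the turn that contains its midpoint.
    """
    labelled = []
    for start, end, word in words:
        midpoint = 0.5 * (start + end)
        speaker = "unknown"
        for turn_start, turn_end, name in speaker_turns:
            if turn_start <= midpoint < turn_end:
                speaker = name
                break
        labelled.append((start, end, speaker, word))
    return labelled


def merge_turns(labelled):
    """Group consecutive words from the same speaker into one line."""
    lines = []
    for start, end, speaker, word in labelled:
        if lines and lines[-1]["speaker"] == speaker:
            lines[-1]["end"] = end
            lines[-1]["words"].append(word)
        else:
            lines.append(
                {"speaker": speaker, "start": start, "end": end, "words": [word]}
            )
    return lines


# Hypothetical recogniser output and speaker turns.
words = [(0.0, 0.4, "good"), (0.4, 0.9, "evening"),
         (1.2, 1.5, "thank"), (1.5, 1.8, "you")]
turns = [(0.0, 1.0, "speaker_1"), (1.0, 2.0, "speaker_2")]

transcript = merge_turns(attribute_words(words, turns))
for line in transcript:
    text = " ".join(line["words"])
    print(f'{line["start"]:.1f}-{line["end"]:.1f} {line["speaker"]}: {text}')
```

Each output line then carries who spoke, what was said, and when.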

The project will take an existing Broadcast News Transcription system, updated to reflect the vocabulary of the 2008 elections, and compare its performance on the YouTube election data with that on simpler Broadcast News style data. A scheme for detecting names within the transcription will then be developed and evaluated for extracting the actual name of the speaker (where available). Using this additional information, the aim is to improve the performance of the system by incorporating, for example, information from previous speeches, both in the form of text and audio information.

S.E. Tranter. (2006) Who Really Spoke When? Finding Speaker Turns and Identities in Broadcast News Audio. ICASSP, May 2006.
M.J.F. Gales, D.Y. Kim, P.C. Woodland, D. Mrva, R. Sinha and S.E. Tranter. (2006) Progress in the CU-HTK Broadcast News Transcription System. IEEE Transactions on Audio, Speech and Language Processing, September 2006.

Infinite Gaussian Mixture Models for Speech Recognition (F-MJFG-3)

The aim of this project is to examine these forms of model for speech recognition. To simplify the process initially, standard generative models will be used to map the variable-length data to a fixed size and to handle any requirements for speaker and environment changes. The output from these generative sequence score-spaces will then be modelled using an infinite GMM. If time allows, the scheme will be extended to infinite hidden Markov models, where the complete observation sequence is modelled directly.
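In an infinite GMM the number of mixture components is not fixed in advance. One standard way to see this is the Chinese restaurant process view of the Dirichlet process prior over component assignments, sketched below. This is an illustrative sample from the prior only, not an inference procedure for the project, and the function name, the concentration values and the seed are hypothetical.

```python
import random


def chinese_restaurant_process(n_points, alpha, seed=0):
    """Sample component assignments from a Chinese restaurant process.

    Point i (0-based) joins existing component k with probability
    count_k / (i + alpha) and opens a new component with probability
    alpha / (i + alpha), so the number of components is not fixed
    in advance.
    """
    rng = random.Random(seed)
    counts = []       # number of points in each component so far
    assignments = []  # component index chosen for each point
    for i in range(n_points):
        threshold = rng.random() * (i + alpha)
        cumulative = 0.0
        chosen = len(counts)  # default: open a new component
        for k, count in enumerate(counts):
            cumulative += count
            if threshold < cumulative:
                chosen = k
                break
        if chosen == len(counts):
            counts.append(0)
        counts[chosen] += 1
        assignments.append(chosen)
    return assignments, counts


# Hypothetical concentration values; larger alpha tends to open more components.
for alpha in (0.5, 5.0):
    assignments, counts = chinese_restaurant_process(200, alpha, seed=1)
    print(alpha, len(counts), counts)
```

In a full infinite GMM each component would also carry its own Gaussian parameters, and the assignments would be inferred from the data rather than sampled from the prior.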

In previous work, SVMs have been used to generate a sparse, fixed representation for the decision boundaries. This project will compare this fixed sparse representation with the infinite GMM and other Hierarchical Dirichlet Process priors.

M.J.F. Gales and F. Flego. (2010) Discriminative Classifiers with Adaptive Kernels for Noise Robust Speech Recognition. Computer Speech and Language, 2010.
C.E. Rasmussen. (2000) The Infinite Gaussian Mixture Model
Y.W. Teh, M.I. Jordan, M.J. Beal and D.M. Blei. (2005) Hierarchical Dirichlet Processes
