[Univ of Cambridge] [Dept of Engineering]

Mark Gales - 4th Year Projects

Here are the projects that will be offered for the year 2011-2012. Please look at the papers and associated links for more information. If you are interested in any of these projects it is important that you contact me so that we can discuss the work involved. For any queries, or requests for further details, please contact me by email: mjfg@eng.cam.ac.uk

There are some notes on statistical pattern processing on-line.

Compressive Sensing for Speech Recognition (F-MJFG-1)

    In recent years the use of non-parametric techniques has become increasingly popular in a range of machine learning applications. However, to date there has been only limited work in this area for speech recognition. For many years acoustic modelling for automatic speech recognition has been dominated by Hidden Markov Models (HMMs). The reason for this is that HMMs provide a simple parametric model that handles the time-varying nature of speech (and they work well!). There are a number of reasons why non-parametric techniques have received little interest in speech. The most significant ones are:

    • the time-varying nature of speech data;
    • the large variability in speech across speakers, accents and noise conditions;
    • the need to train on large amounts of data.

    This project combines parametric and non-parametric approaches to speech recognition that address all of these problems. The work will use parametric models to map the variable-length speech data to a fixed-length feature vector, a score; this handles the time-varying aspects of the acoustic signal. By appropriately modifying the generative model it is possible to handle changes in the noise conditions and speaker.
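    As an illustration of this idea, the sketch below maps a variable-length sequence of frames to a fixed-length vector using the gradient of a GMM log-likelihood with respect to the component means (a simple score-space). The model parameters, dimensions and data are invented for the example and are not those of a real recogniser.

```python
import numpy as np

def fisher_score(frames, means, variances, weights):
    """Map a variable-length frame sequence (T, D) to a fixed-length vector.

    The score is the derivative of the GMM log-likelihood with respect to
    the component means, averaged over frames, so sequences of any length
    produce a vector of the same dimensionality (n_components * n_dims).
    """
    # Per-frame, per-component log densities (diagonal covariances)
    diff = frames[:, None, :] - means[None, :, :]                     # (T, K, D)
    log_dens = -0.5 * np.sum(diff**2 / variances
                             + np.log(2 * np.pi * variances), axis=2)  # (T, K)
    log_joint = np.log(weights) + log_dens
    # Component posteriors via a numerically stable softmax
    log_joint -= log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    post /= post.sum(axis=1, keepdims=True)
    # d log p / d mu_k = sum_t gamma_tk (x_t - mu_k) / var_k, normalised by T
    score = (post[:, :, None] * diff / variances).mean(axis=0)         # (K, D)
    return score.ravel()

# Two sequences of different lengths map to vectors of the same size
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 4))
variances = np.ones((3, 4))
weights = np.ones(3) / 3
v1 = fisher_score(rng.normal(size=(50, 4)), means, variances, weights)
v2 = fisher_score(rng.normal(size=(120, 4)), means, variances, weights)
```

    Any fixed-length representation of this form can then be fed to a static classifier, which is what makes the non-parametric stage possible.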

    In previous work SVMs have been used to generate a sparse, fixed representation for the decision boundaries. This project will compare this fixed sparse representation with a sparse representation dependent on the current word or sentence being evaluated. This will make use of the recently proposed Bayesian Compressive Sensing approach.
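    Bayesian Compressive Sensing itself places a sparsity-inducing prior over the coefficients; as a simpler, self-contained illustration of recovering a sparse representation from few measurements, the sketch below uses greedy orthogonal matching pursuit with a made-up random dictionary.

```python
import numpy as np

def omp(Phi, y, n_nonzero):
    """Recover a sparse x with y ~ Phi @ x by greedily selecting the
    dictionary atom most correlated with the current residual."""
    residual = y.copy()
    support = []
    x = np.zeros(Phi.shape[1])
    for _ in range(n_nonzero):
        corr = Phi.T @ residual
        support.append(int(np.argmax(np.abs(corr))))
        # Least-squares fit restricted to the selected atoms
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x[support] = coef
    return x

rng = np.random.default_rng(1)
Phi = rng.normal(size=(30, 100))          # 30 measurements, 100 atoms
x_true = np.zeros(100)
x_true[[5, 40, 77]] = [2.0, -1.5, 3.0]    # 3-sparse ground truth
y = Phi @ x_true
x_hat = omp(Phi, y, n_nonzero=3)
```

    The appeal for this project is that the sparse code, and hence the effective representation, can be re-estimated per word or sentence rather than being fixed once.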

    The performance of the system will be evaluated against existing SVM systems and standard speech recognition systems.


Transcribing YouTube: Who Spoke What When? (F-MJFG-2)

    Recently Google has added a feature to YouTube that generates captions. This makes use of automatic speech recognition technology. However, currently only a small number of channels, mainly featuring talks and interviews, are enabled with this option.

    This project will apply the state-of-the-art speech recognition systems developed in the Speech Group to YouTube data from more challenging sources. The data supplied by Google consists of audio from a number of election speeches from the 2008 US Presidential election. This data has a number of problems associated with it, including a wide range of background noise conditions and a highly spontaneous speaking style.

    The project aims to extract three forms of information from the audio stream:

    • transcription of the words spoken from an individual speaker;
    • the identity of the speaker;
    • timing information of when that speaker was talking.

    This allows a more informative transcription to be generated.
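    To make the combination of these three outputs concrete, the sketch below attaches a speaker label to each recognised word by intersecting word timings with diarisation turns. All speaker names, words and timings are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One diarisation turn: who spoke between start and end (seconds)."""
    speaker: str
    start: float
    end: float

def label_words(words, turns):
    """Attach a speaker label to each recognised word by finding the
    diarisation turn that overlaps the word's mid-point."""
    labelled = []
    for word, start, end in words:
        mid = 0.5 * (start + end)
        speaker = next((t.speaker for t in turns
                        if t.start <= mid < t.end), "unknown")
        labelled.append((speaker, word, start, end))
    return labelled

turns = [Turn("spk1", 0.0, 4.0), Turn("spk2", 4.0, 9.0)]
words = [("my", 0.2, 0.4), ("friends", 0.5, 1.0),
         ("thank", 4.1, 4.5), ("you", 4.6, 4.9)]
out = label_words(words, turns)
```

    The result is a transcript annotated with "who spoke what when", which is considerably more useful than a flat word stream.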

    The project will take an existing Broadcast News transcription system, updated to reflect the vocabulary of the 2008 elections, and examine its performance on the YouTube election data compared with simpler Broadcast News style data. A scheme for detecting names within the transcription will then be developed and evaluated for extracting the actual name of the speaker (where available). Using this additional information the aim is to improve the performance of the system by incorporating, for example, information from previous speeches, both in the form of text and audio.


Infinite Gaussian Mixture Models for Speech Recognition (F-MJFG-3)

    Gaussian mixture models are a standard approach for speech recognition. The flexibility of this "semi-parametric" approach, where a wide range of distributions can be modelled, is one of the reasons for their popularity. However, as part of the design process it is necessary to determine the appropriate number of Gaussian components to use. Schemes have been developed, based on standard penalised likelihood and variational approaches, for determining appropriate numbers. An alternative is to use Bayesian schemes, where it is unnecessary to limit the number of components as the use of an appropriate prior avoids the issue of "over-fitting". In the extreme an infinite number of components can be used. This form of infinite Gaussian mixture model is an example of a hierarchical Dirichlet process.
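    The underlying Dirichlet process idea can be seen in its stick-breaking construction: component weights are drawn without fixing the number of components in advance, and in practice the stick is truncated once negligible mass remains. The concentration parameter and truncation level below are arbitrary choices for the sketch.

```python
import numpy as np

def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a (truncated) stick-breaking construction.

    Each beta draw takes a fraction of the stick that remains; smaller
    alpha concentrates mass on the first few components, larger alpha
    spreads it over many.
    """
    betas = rng.beta(1.0, alpha, size=truncation)
    # Mass remaining before each break: 1, (1-b1), (1-b1)(1-b2), ...
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas)[:-1]])
    return betas * remaining

rng = np.random.default_rng(2)
w = stick_breaking_weights(alpha=2.0, truncation=10, rng=rng)
```

    The weights are non-negative and sum to just under one; the unassigned remainder is the mass that would fall on components beyond the truncation, which is exactly what lets the model grow its component count with the data.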

    The aim of this project is to examine these forms of model for speech recognition. In order to initially simplify the process, standard generative models will be used to map the variable-length data to a fixed size and handle any requirements for speaker and environment changes. The output from these generative sequence score-spaces will then be modelled using an infinite GMM. If time allows, the scheme will be extended to infinite hidden Markov models, where the complete observation sequence is directly modelled.

    In previous work SVMs have been used to generate a sparse, fixed representation for the decision boundaries. This project will compare this fixed sparse representation with the infinite GMM and other hierarchical Dirichlet process approaches.

[ Cambridge University | CUED | SVR Group | Home]