Generative Kernels and Score-Spaces for Classification of Speech

Project Description

The aim of this project is to significantly improve the performance of automatic speech recognition systems across a wide range of environments, speakers and speaking styles. The performance of state-of-the-art speech recognition systems is often acceptable under fairly controlled conditions where the level of background noise is low. However, many realistic situations involve high levels of background noise, for example in-car navigation, or widely ranging channel conditions and speaking styles, as observed in YouTube-style data. This fragility is one of the primary reasons that speech recognition systems are not more widely deployed and used. It limits the domains in which speech can be reliably used, and it increases the cost of developing applications, as systems must be tuned to limit its impact. This tuning includes collecting domain-specific data and significant adjustment of the application itself.

The vast majority of speech recognition research has concentrated on improving the performance of hidden Markov model (HMM) based systems. HMMs are an example of a generative model and are currently used in state-of-the-art speech recognition systems. A wide range of approaches has been developed to improve the performance of these systems under changes of speaker and noise. Despite these approaches, systems are not sufficiently robust for speech recognition to achieve the level of impact that the naturalness of the interface should allow.

This project will combine the current generative models developed in the speech community with discriminative classifiers used in both the speech and machine learning communities. An important, novel aspect of the proposed approach is that the generative models are used to define a score-space that can be used as features by the discriminative classifiers. This approach has a number of advantages. First, it is possible to use current state-of-the-art adaptation and robustness approaches to compensate the acoustic models for particular speakers and noise conditions. This means that any advances in these approaches can be incorporated into the scheme, and that it is not necessary to develop approaches that adapt the discriminative classifiers to speakers, style and noise. Second, one of the major problems with speech recognition is that variable-length data sequences must be classified. Using generative models allows the dynamic aspects of speech data to be handled without having to alter the discriminative classifier. The final advantage is the nature of the score-space obtained from the generative model. Generative models such as HMMs have underlying conditional independence assumptions that, whilst enabling them to represent data sequences efficiently, do not accurately represent the dependencies in data sequences such as speech. The score-space associated with a generative model does not have the same conditional independence assumptions as the original generative model. This allows more accurate modelling of the dependencies in the speech data.
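The idea of a score-space can be illustrated with a minimal sketch. It is not the project's implementation: in place of an HMM it assumes a hypothetical one-dimensional Gaussian model per class with independent frames, and the function names (`log_gauss`, `score_space`, `features`) and the example model parameters are invented for illustration. Each sequence, whatever its length, is mapped to a fixed-length vector made of the log-likelihood and its derivatives with respect to the model parameters, which a discriminative classifier could then use as features.

```python
import math


def log_gauss(x, mean, var):
    """Log-density of a 1-D Gaussian (stand-in for an HMM output distribution)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def score_space(seq, mean, var):
    """Map a variable-length sequence to a fixed-length score vector.

    The vector holds the log-likelihood of the sequence and its first
    derivatives with respect to the mean and the variance.
    """
    loglik = sum(log_gauss(x, mean, var) for x in seq)
    d_mean = sum((x - mean) / var for x in seq)
    d_var = sum(0.5 * ((x - mean) ** 2 / var ** 2 - 1.0 / var) for x in seq)
    return [loglik, d_mean, d_var]


def features(seq, models):
    """Concatenate the score vectors from every class model."""
    feats = []
    for mean, var in models:
        feats.extend(score_space(seq, mean, var))
    return feats


# Hypothetical (mean, variance) pairs for two classes.
models = [(0.0, 1.0), (3.0, 2.0)]

short = features([0.1, -0.2], models)
long_ = features([2.9, 3.4, 3.1, 2.7, 3.3], models)
# Both sequences yield vectors of the same dimension (3 scores per model).
print(len(short), len(long_))
```

Because the feature dimension depends only on the number of model parameters, the discriminative classifier never has to handle variable-length input directly. Adapting the generative models to a speaker or noise condition changes the features without changing the classifier.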

The combination of generative and discriminative classifiers will be investigated on two very difficult forms of data on which current systems perform badly. The first task is recognition of speech in adverse environments. In these situations there are very high levels of background noise, which cause severe degradation in system performance. Data of interest for this task will be specified in collaboration with Toshiba Research Europe Ltd. The second task of interest is large vocabulary speech recognition of data from a wide range of speaking styles and conditions. Google has supplied transcribed data from YouTube to allow evaluation of systems on highly diverse data. The project will yield significant performance gains over current state-of-the-art approaches for both tasks.

Active Research Areas

Structured discriminative models for speech recognition
Classification with non-parametric methods
Expectation semiring for fast feature extraction
Higher-order derivative features
Noise robust speech recognition

Personnel Associated/Linked with the Project

Prof. Mark Gales [Principal Investigator]
Dr Rogier van Dalen [Research Associate]
Justin Yang [Research Student]
Anton Ragni [Research Student (not funded)]
Austin Zhang [Research Student (not funded)]

Progress Reports

R. C. van Dalen, J. Yang, M. J. F. Gales, A. Ragni and S. X. Zhang (2012).
Generative Kernels and Score-Spaces for Classification of Speech: Progress Report
Technical Report CUED/F-INFENG/TR676, January 2012
R. C. van Dalen, J. Yang, M. J. F. Gales, and S. X. Zhang (2013).
Generative Kernels and Score-Spaces for Classification of Speech: Progress Report II
Technical Report CUED/F-INFENG/TR689, January 2013
R. C. van Dalen, J. Yang, and M. J. F. Gales (2015).
Generative Kernels and Score-Spaces for Classification of Speech: Progress Report III
Technical Report CUED/F-INFENG/TR699, May 2015

Publications (and related papers)

J. Yang, C. Zhang, A. Ragni, M. J. F. Gales and P. C. Woodland (2016),
System Combination with Log-Linear Models,
ICASSP 2016.
R.C. van Dalen, J. Yang, H. Wang, A. Ragni, C. Zhang and M. J. F. Gales (2015),
Structured Discriminative Models using Deep Neural Network Features,
ASRU 2015.
R.C. van Dalen and M. J. F. Gales (2015),
Annotating large lattices with the exact word error,
Interspeech 2015.
J. Yang, R.C. van Dalen, S. Zhang and M. J. F. Gales (2014),
Infinite Structured Support Vector Machines in Speech Recognition,
in Proc. ICASSP 2014.
J. Yang, R.C. van Dalen and M. J. F. Gales (2013),
Infinite Support Vector Machines in Speech Recognition,
Interspeech 2013.
R. C. van Dalen and M. J. F. Gales (2013).
Monoids: efficient segmental features for speech recognition.
Technical Report no. CUED/F-INFENG/TR.687.
R. C. van Dalen, A. Ragni, and M. J. F. Gales (2013).
Efficient Decoding with Generative Score-Spaces Using the Expectation Semiring.
ICASSP 2013.
S.-X. Zhang and M.J.F. Gales (2013).
Kernelized Log Linear Models For Continuous Speech Recognition.
ICASSP 2013.
S.-X. Zhang and M.J.F. Gales (2013).
Structured SVMs for Automatic Speech Recognition.
IEEE Trans. on Audio Speech and Language Processing 21(3), pp. 544-555.
M.J.F. Gales, S. Watanabe and E. Fosler-Lussier,
Structured Discriminative Models for Speech Recognition.
IEEE Signal Processing Magazine, Nov 2012.
A. Ragni, M.J.F. Gales (2012).
Inference algorithms for generative score-spaces.
ICASSP 2012.
R. C. van Dalen, A. Ragni, and M. J. F. Gales (2012).
Efficient decoding with continuous rational kernels using the expectation semiring.
Technical Report no. CUED/F-INFENG/TR.674.
R. C. van Dalen and M.J.F. Gales (2011)
A Variational Perspective on Noise-Robust Speech Recognition.
Automatic Speech Recognition and Understanding Workshop 2011.
A. Ragni and M.J.F. Gales (2011)
Derivative Kernels for Noise Robust ASR.
Automatic Speech Recognition and Understanding Workshop 2011 (Best Student Paper Award).
S.-X. Zhang and M.J.F. Gales (2011)
Extending Noise Robust Structured Support Vector Machines to Larger Vocabulary Tasks.
Automatic Speech Recognition and Understanding Workshop 2011.
S.-X. Zhang and M.J.F. Gales (2011)
Structured Support Vector Machines for Noise Robust Continuous Speech Recognition.
InterSpeech 2011 (Best Student Paper nominated).
A. Ragni and M.J.F. Gales (2011)
Structured Discriminative Models for Noise Robust Continuous Speech Recognition.
ICASSP 2011.
M.J.F. Gales (2010)
Model-Based Approaches to Handling Uncertainty.
Chapter, Robust Speech Recognition of Uncertain Data, Springer Verlag.
S.-X. Zhang, A. Ragni, and M.J.F. Gales. (2010)
Structured Log-Linear Models for Noise Robust Speech Recognition.
IEEE Signal Processing Letters.
M.J.F. Gales and F. Flego (2010).
Discriminative Classifiers with Generative Kernels for Noise Robust Speech Recognition.
Computer Speech and Language 2010
M. J. F. Gales, A. Ragni, H. AlDamarki and C. Gautier (2009)
Support Vector Machines for Noise Robust ASR.
ASRU 2009
Martin Layton (2006)
Kernel Methods for Classifying Variable Length Data.
PhD Thesis Cambridge University, September 2006.

Source Code

Flipsta library, GitHub: manipulate finite-state automata in C++ and Python.
Cross-entropy for model compensation: analyse methods for model compensation in Python.
