MIL Speech Seminars 2005-2006

The MIL Speech Seminar series schedule for Lent Term 2006 was as follows:

February 27th 2006	Khe Chai Sim (MIL PhD)	fMPE and pMPE - A Discriminative Semi-Parametric Trajectory Model	Hidden Markov Models (HMMs) are widely used in speech recognition. For reasons of efficiency, a series of assumptions are made about the speech data, some of which are poor. One of these is the "independence assumption", under which observations are assumed to be conditionally independent given the state. As a result, the output distribution associated with each HMM state is constant over time. Existing ways to overcome this limitation include the use of switching linear dynamical systems, stochastic segment models, polynomial segment models, buried Markov models and trajectory HMMs. To date, these models have had little success in improving the performance of large vocabulary continuous speech recognition systems. In this seminar, a discriminative semi-parametric trajectory model will be presented. This model represents the Gaussian mean vectors and covariance matrices as time varying parameters. These time dependent parameters are modelled as a function of the location of the current observation (and the neighbouring observations) in the acoustic space, which is represented by a series of centroids. Model parameters are discriminatively estimated using the Minimum Phone Error (MPE) criterion. One form of temporally varying mean vector is obtained by applying a time dependent bias to the static Gaussian mean. This time dependent bias is a weighted sum of the bias vectors associated with each centroid (which are estimated discriminatively). The weights are calculated as the posteriors of the centroids given the observation (and neighbouring observations). The resulting model yields an fMPE model. Alternatively, the variance of each dimension may be scaled by a positive time dependent factor to yield a temporally varying covariance matrix. This model is known as pMPE. As in fMPE, the time dependent scale factor is a weighted sum of the centroid-specific scales, where the weights are given by the same centroid posteriors. Experimental results are given for a large vocabulary conversational telephone speech recognition task. Both fMPE and pMPE were found to give gains over the MPE-only system. It was also found that combining fMPE and pMPE could be beneficial in some cases.
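The time dependent bias and scale described in the abstract above can be illustrated with a minimal sketch. This is not the speaker's implementation. It assumes spherical Gaussian centroids with equal priors and a shared variance, and it uses only the current observation (the abstract also uses neighbouring observations). All function and parameter names are hypothetical.

```python
import numpy as np

def centroid_posteriors(obs, centroids, var=1.0):
    """Posterior of each centroid given one observation.

    Assumption: centroids are equal-prior spherical Gaussians with a
    shared variance `var`; the abstract does not specify their form.
    """
    sq_dist = np.sum((centroids - obs) ** 2, axis=1)
    log_w = -0.5 * sq_dist / var
    log_w -= log_w.max()          # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def time_varying_gaussian(mean, variance, obs, centroids, biases, scales):
    """Return the time dependent mean and diagonal variance for one frame.

    biases: (num_centroids, dim) bias vectors, one per centroid
    scales: (num_centroids, dim) positive scale factors, one per centroid
    """
    post = centroid_posteriors(obs, centroids)
    mean_t = mean + post @ biases        # fMPE-style bias on the static mean
    var_t = variance * (post @ scales)   # pMPE-style per-dimension scaling
    return mean_t, var_t
```

In the seminar's setting the biases and scales would be estimated with the MPE criterion; here they are simply passed in as fixed arrays.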
March 13th 2006	Martin Layton (MIL PhD)	Augmented Statistical Models for Speech Recognition	Recently there has been significant interest in developing new acoustic models for speech recognition. One such model, which allows complex dependencies to be represented, is the augmented statistical model. This extends standard HMMs using a local exponential expansion of the HMM, allowing additional dependencies to be incorporated. Unfortunately, the resulting model often has an intractable normalisation term, rendering training difficult for all but binary classification tasks. In this paper, a maximum margin criterion is presented as a practical method of estimating augmented model parameters for binary classification tasks. For multi-class classification, conditional augmented (C-Aug) models are proposed as an attractive alternative. Instead of modelling utterance likelihoods and inferring decision boundaries, C-Aug models directly model the posterior probability of class labels, conditioned on the utterance. The resulting model is easy to normalise and can be trained using conditional maximum likelihood estimation. In addition, because the model is convex, the optimisation converges to a global maximum.
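To illustrate why a model of class posteriors is easy to normalise, the sketch below assumes a log-linear form, in which each class has a score that is linear in a per-class feature vector of the utterance. The abstract does not give this form or the features, so both are assumptions and all names are hypothetical. The point is only that the normalisation is a sum over the finite set of classes rather than over all possible utterances.

```python
import numpy as np

def class_posteriors(features, weights):
    """Posterior over class labels for one utterance.

    features: (num_classes, dim) hypothetical per-class feature vectors
    weights:  (num_classes, dim) hypothetical model parameters
    """
    scores = np.sum(features * weights, axis=1)
    scores -= scores.max()        # subtract the max for numerical stability
    p = np.exp(scores)
    return p / p.sum()            # normalise over the classes only

def conditional_log_likelihood(features, weights, label):
    """Log posterior of the correct label, the per-utterance term that
    conditional maximum likelihood training would sum over the data."""
    return np.log(class_posteriors(features, weights)[label])
```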