University of Cambridge, Department of Engineering


MIL Speech Seminars 2003-2004


The MIL Speech Seminar series schedule for Lent Term 2004 was as follows:

February 13th 2004: Nikki Mirghafori (ICSI, Berkeley)
A novel and computationally efficient approach to setting decision thresholds in an adaptive speaker verification system

Setting decision thresholds for real-world speaker verification systems is a challenging task. To set thresholds accurately, data collections are often performed for each deployment. Clearly, it is desirable to obviate the need for this expensive step and instead compute the decision threshold automatically at run time. The challenge of setting decision thresholds is even more formidable for online adaptive systems. This is partly because, after adaptation, verification scores for both true speakers and impostors increase, which in turn raises the false accept (FA) rate and hence the danger of speaker model corruption. Furthermore, as the thresholds must potentially be updated after every verification attempt, the algorithm needs to be computationally inexpensive. In this talk, I present a novel and computationally efficient strategy for run-time setting and updating of the decision threshold in an adaptive text-dependent speaker verification system. The approach entails calculating the score and its trajectory of increase as a function of the length of the password, the desired FA rate, and the number of training frames in the speaker model. Experimental results on 12 databases in four languages are presented.
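
The details of the speaker's algorithm are in the talk; as a point of reference, the simplest run-time scheme in this family fixes the threshold from an assumed impostor-score distribution so that the expected FA rate matches a target. A minimal sketch, assuming Gaussian impostor scores (the function and parameters below are illustrative, not from the talk):

    # Set a verification threshold from an assumed Gaussian impostor-score
    # model so that the expected false-accept rate matches a target.
    from statistics import NormalDist

    def threshold_for_fa(imp_mean, imp_std, target_fa):
        """Score threshold giving the desired false-accept rate, assuming
        impostor scores ~ Normal(imp_mean, imp_std)."""
        return NormalDist(imp_mean, imp_std).inv_cdf(1.0 - target_fa)

    # Example: impostor scores ~ N(-1.0, 0.5); accept at most 1% of impostors.
    print(threshold_for_fa(-1.0, 0.5, 0.01))

An adaptive system would additionally have to re-estimate the impostor-score statistics as the speaker model evolves, which is where the score-trajectory modelling described in the abstract comes in.
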
February 24th 2004: Khe Chai (MIL)
Precision Matrix Modelling for Large Vocabulary Continuous Speech Recognition (LVCSR)

Speech recognition is a pattern classification task that involves a large number of Gaussian components of high dimensionality. This motivates the use of efficient covariance models to achieve robust parameter estimation and low computational cost. In recent years, instead of modelling covariance matrices directly, several precision matrix modelling schemes have been reported to yield better performance in speech recognition than conventional diagonal covariance systems. These include Semi-tied Covariance (STC), Extended Maximum Likelihood Linear Transform (EMLLT) and Subspace for Precision and Mean (SPAM) models. Furthermore, Heteroscedastic Linear Discriminant Analysis (HLDA), which is commonly viewed as a dimension-reduction and feature-decorrelation technique, can also be viewed as a precision matrix model. In this talk, I present the implementation of these precision matrix models for LVCSR. In particular, issues concerning Maximum Likelihood (ML) and Minimum Phone Error (MPE) training are addressed. Experimental results are presented on the Switchboard task.
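
In the standard formulation from the literature (the talk may use different notation), all of these schemes expand the precision matrix of Gaussian component m in a shared basis of symmetric matrices:

    P_m = \Sigma_m^{-1} = \sum_{k=1}^{n} \lambda_m^{(k)} S_k

STC restricts the basis to rank-one terms S_k = a_k a_k^T built from the rows of a single nonsingular matrix (n = d, the feature dimension); EMLLT keeps the rank-one form but allows d <= n <= d(d+1)/2 basis vectors; SPAM allows arbitrary symmetric basis matrices S_k. Only the n component-specific weights \lambda_m^{(k)} are stored per Gaussian, which is the source of the compactness.
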
March 2nd 2004: Jerry Zhu (CMU)
Semi-supervised Learning with Gaussian Random Fields

When you train a classifier, you need a labeled training set. Labels, however, are often difficult to obtain. Unlabeled data, on the other hand, may be relatively easy to collect, but traditionally there have been few ways to use it in training a classifier. Semi-supervised learning attempts to use unlabeled data together with labeled data to help you build a better classifier. I start with a short survey of the field, then introduce a particular method involving random fields, electric networks and kernels.
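
The Gaussian-random-field method in this line of work computes a harmonic function on a similarity graph: labeled points are clamped to their labels, and every unlabeled point takes the weighted average of its neighbours' values. A minimal numpy sketch (the graph weights W are assumed given, e.g. from an RBF kernel):

    # Harmonic-function semi-supervised learning on a weighted graph.
    import numpy as np

    def harmonic_labels(W, labeled_idx, y_labeled):
        """W: (n, n) symmetric non-negative similarity matrix.
        labeled_idx: indices of labeled points; y_labeled: their 0/1 labels.
        Returns the unlabeled indices and their soft labels in [0, 1]."""
        n = W.shape[0]
        unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
        L = np.diag(W.sum(axis=1)) - W          # graph Laplacian D - W
        Luu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
        Wul = W[np.ix_(unlabeled_idx, labeled_idx)]
        f_u = np.linalg.solve(Luu, Wul @ np.asarray(y_labeled, float))
        return unlabeled_idx, f_u

Thresholding f_u at 0.5 gives hard labels. Interpreting the weights as conductances, f is the voltage configuration when the labeled nodes are clamped to 0 or 1 volts, which is the electric-network connection mentioned in the abstract.
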
March 9th 2004: Kai Yu (MIL PhD)
Adaptive Training Using Structured Transforms

Adaptive training is an important approach for training speech recognition systems on found, non-homogeneous data. The standard approach employs a single transform to represent unwanted acoustic variability. However, for found data there are commonly multiple acoustic factors affecting the speech signal. This talk investigates the use of multiple forms of transformation, structured transforms (ST), to represent the complex non-speech variabilities within an adaptive training framework. Two forms of transformation are considered, cluster mean interpolation and constrained MLLR; consequently, the canonical model is a multi-cluster HMM. Both maximum likelihood (ML) and minimum phone error (MPE) re-estimation formulae for the canonical model are presented. This multi-cluster MPE training is also applicable to eigenvoice systems. Experiments comparing ST to standard adaptive training schemes were performed on a conversational telephone speech task. ST was found to significantly reduce the word error rate.
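
In standard notation (not necessarily that of the talk), the two component transforms act at different points of the model: constrained MLLR is an affine transform of the observations, while cluster mean interpolation builds each Gaussian mean as a weighted combination of cluster means, with the weights estimated per homogeneous block s (e.g. per speaker):

    \hat{o}_t = A^{(s)} o_t + b^{(s)}
    \mu_m^{(s)} = \sum_{c=1}^{C} \lambda_c^{(s)} \mu_m^{(c)}

A structured-transform system applies both, so the canonical model must store the C sets of cluster means, which is why it takes the form of a multi-cluster HMM.
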
March 16th 2004: Hui (KK) Ye (MIL RA)
High Quality Voice Morphing

Voice morphing is a technique for modifying a source speaker's speech to sound as if it were spoken by a designated target speaker. Most recent approaches to voice morphing apply a linear transformation to the spectral envelope and pitch scaling to modify the prosody. Whilst these methods are effective, they also introduce artifacts arising from the effects of glottal coupling, phase incoherence, unnatural phase dispersion and the high spectral variance of unvoiced sounds. A practical voice morphing system must account for these if high audio quality is to be preserved. This seminar describes a complete voice morphing system and the enhancements needed to deal with the various artifacts, including a novel method for synthesising natural phase dispersion. Each technique is assessed individually, and the overall performance of the system is evaluated using listening tests. Overall, the enhancements are found to significantly improve speaker identification scores and perceived audio quality.
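
To make the baseline concrete, this is the basic transform-and-scale operation that the enhancements build on. A purely illustrative sketch (A, b and the pitch scale factor are hypothetical; a real system estimates them from aligned source/target training data, and the artifacts listed above are exactly what this simple scheme leaves behind):

    # Baseline voice conversion: linear transform of each spectral-envelope
    # frame plus global pitch scaling. Illustrative only.
    import numpy as np

    def morph_frames(env_frames, f0, A, b, pitch_scale):
        """env_frames: (T, d) spectral-envelope vectors; f0: (T,) pitch track
        with 0 marking unvoiced frames (an assumption of this sketch)."""
        converted = env_frames @ A.T + b          # per-frame linear conversion
        f0_out = np.where(f0 > 0, pitch_scale * f0, 0.0)
        return converted, f0_out
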
March 26th 2004: Stephanie Seneff (MIT)
Multimodal Spoken Dialogue Systems

The Spoken Language Systems group at MIT has been developing multimodal spoken conversational systems for over a decade. The typical interaction involves telephone access, optionally with a synchronized display on a Web page. We have recently begun to explore other options, such as display on a cell phone or pen-based interaction with a tablet. This talk will discuss several of the research topics we are currently exploring, including multimodal (pen + speech) interaction, multilingual systems, off-line delegation, incorporating speaker ID information for security, rapid development of systems in new domains, user simulation, and flexible vocabulary issues. The talk will include several audio and video clips of users interacting with various systems we have developed.