Machine Intelligence Laboratory

Cambridge University Department of Engineering

Speech Synthesis Seminar Series

To receive more information about the speech synthesis seminar series, please subscribe to the mailing list by sending an email to: with "subscribe" as the subject.
  • Next Seminar (confirmed)

  • [2011-06-21] Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization [slides]
Venue: Cambridge University Engineering Department, Lecture Room 11
Speaker: Heiga Zen (Toshiba Research Europe Ltd.)
Abstract: An increasingly common scenario when building hidden Markov model-based speech synthesis and recognition systems is training on inhomogeneous data, for example data drawn from multiple sources or of multiple types. This seminar introduces a new technique for training hidden Markov models on such inhomogeneous speech data, in this case data containing both speaker and language variation. The proposed technique, speaker and language factorization, attempts to factorize the speaker-specific and language-specific characteristics in the data and to model them by individual transforms. Language-specific factors are represented by transforms based on cluster mean interpolation with cluster-dependent decision trees, while acoustic variation caused by speaker characteristics is handled by transforms based on constrained maximum likelihood linear regression. This technique allows multi-speaker, multi-language adaptive training to be performed. Since each factor is represented by an individual transform, it is possible to factor in only one of them. Experimental results on statistical parametric speech synthesis show that the proposed technique enables speaker and language to be factorized, allowing a speaker transform estimated in one language to be used to synthesize speech in a different language while retaining the speaker's voice characteristics.
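The two transform types the abstract names can be sketched in a few lines. This is a toy illustration, not code from the talk; the function names, dimensions, and values below are assumptions made purely for exposition:

```python
# Toy sketch of the two factor transforms mentioned in the abstract.
# All names and shapes here are illustrative assumptions.

def interpolate_mean(cluster_means, weights):
    """Language-specific mean: weighted interpolation of cluster means."""
    dim = len(cluster_means[0])
    return [sum(w * m[d] for w, m in zip(weights, cluster_means))
            for d in range(dim)]

def apply_cmllr(vec, A, b):
    """Speaker transform: constrained MLLR is an affine map A*vec + b."""
    return [sum(A[i][j] * vec[j] for j in range(len(vec))) + b[i]
            for i in range(len(A))]

# Factorization means the two can be mixed freely: a speaker transform
# (A, b) estimated in one language can be paired with another language's
# interpolated means to synthesize that language in the same voice.
language_mean = interpolate_mean([[0.0, 2.0], [4.0, 6.0]], [0.5, 0.5])
speaker_obs = apply_cmllr([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0])
```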

  • [2011-04-13] Text Analysis for Speech Synthesis. Sabine Buchholz. [slides]
Venue: Cambridge University Engineering Department, Language Unit Meeting Room
Speaker: Sabine Buchholz (Toshiba Research Europe Ltd.)
Abstract: In the terminology of HMM-TTS, text analysis refers to the process that turns input text into a sequence of context-dependent labels. We will describe this process from several points of view (naive, linguistic, machine learning, system architecture) and try to review some evidence for the relative importance of the various sub-processes.
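To make "a sequence of context-dependent labels" concrete, one common convention encodes the current phone with its neighbours and some prosodic context in a single string. The exact format below (quinphone plus syllable position) is a hedged assumption for illustration, not necessarily the scheme discussed in the talk:

```python
# Hedged sketch of a full-context label as produced by text analysis.
# The format (quinphone + syllable position) is an illustrative assumption.

def full_context_label(ll, l, c, r, rr, syl_pos, n_syls):
    """Encode the current phone c with two neighbours on each side,
    plus its position within the syllable."""
    return f"{ll}^{l}-{c}+{r}={rr}@{syl_pos}_{n_syls}"

label = full_context_label("sil", "h", "e", "l", "ou", 1, 2)
```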
  • [2011-03-16] Context Modelling for HMM-Based Speech Synthesis. Kai Yu. [slides]

Speaker: Kai Yu (Cambridge University)
Date: Wednesday 16 March 2011, 12:30 - 14:00
Venue: Cambridge University Engineering Department, Lecture Room 4
Abstract: The use of rich contexts is one of the main differences between HMM-based speech synthesis and HMM-based speech recognition. The most widely used modelling approach is the context-dependent HMM with parameter sharing. This talk will discuss the nature of contexts and several structured approaches for modelling rich contexts.
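The standard parameter-sharing approach the abstract refers to can be sketched as a binary decision tree of context questions: each rich context is routed to a leaf, and all contexts reaching the same leaf share one set of state parameters. This is a minimal illustration of the idea, with invented questions and leaf names:

```python
# Minimal sketch of decision-tree parameter sharing for rich contexts.
# Questions and leaf names below are invented for illustration.

def tree_leaf(context, node):
    """Route a context (a dict of attributes) down the tree to a leaf."""
    while isinstance(node, tuple):          # internal node: (question, yes, no)
        question, yes, no = node
        node = yes if question(context) else no
    return node                              # leaf id = shared parameter set

tree = (lambda c: c["vowel"],
        (lambda c: c["stressed"], "leaf_vowel_str", "leaf_vowel_uns"),
        "leaf_consonant")
```

Because unseen context combinations still answer the same questions, they land in an existing leaf, which is how the tree handles the combinatorial explosion of rich contexts.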
  • [2011-03-02] Speech Production Mechanism And Vocoding Technique In Statistical Parametric Speech Synthesis. Yannis Agiomyrgiannakis and Ranniery Maia. [slides1] [slides2]

Speaker: Yannis Agiomyrgiannakis (Google) and Ranniery Maia (Toshiba Research Europe Ltd.)
Date: Wednesday 02 March 2011, 13:00 - 15:00
Venue: Cambridge University Engineering Department, Lecture Room 2
Abstract: Statistical parametric speech synthesis is based on the combination of a fully parametric speech representation and a generative statistical framework. The first part of this talk describes the speech production mechanism from a signal processing point of view and provides an introduction to modern vocoding techniques. The second part of this talk addresses vocoding in the context of statistical parametric speech synthesis and discusses modern techniques that alleviate the inherent unnaturalness of the sound produced by those systems.
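The source-filter view of speech production mentioned in the first part of the talk can be sketched very crudely: voiced excitation is a pulse train at the fundamental period, unvoiced excitation is noise, and a filter shapes the spectrum. The one-pole filter below is a stand-in assumption, far simpler than the spectral-envelope filters of a real vocoder:

```python
import random

# Crude source-filter sketch: pulse train or noise excitation, then a
# simple one-pole filter standing in for the spectral-envelope filter.

def excitation(n, f0, fs, voiced=True, seed=0):
    """Generate n samples of excitation at sampling rate fs."""
    if voiced:
        period = int(fs / f0)                # samples per pitch period
        return [1.0 if i % period == 0 else 0.0 for i in range(n)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def one_pole(x, a=0.9):
    """y[t] = x[t] + a * y[t-1]: a toy recursive smoothing filter."""
    y, prev = [], 0.0
    for s in x:
        prev = s + a * prev
        y.append(prev)
    return y

frame = one_pole(excitation(160, 100.0, 8000))   # 20 ms at 8 kHz, F0 = 100 Hz
```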
  • [2011-02-09] New and emerging applications of 'adaptive' speech synthesis. Junichi Yamagishi. [slides]

Speaker: Junichi Yamagishi (Centre for Speech Technology Research (CSTR), University of Edinburgh)
Date: Wednesday 09 February 2011, 13:00 - 15:00
Venue: Cambridge University Engineering Department, Lecture Room 4
Abstract: Until recently, text-to-speech was often just an 'optional extra' which allowed text to be read out loud. But now, thanks to HMM and speaker adaptation technologies, which were originally developed for ASR, speech synthesis can mean more than just the reading out of text in a predefined voice. New research areas and more interesting applications are emerging. In this talk, after a quick overview of the basic approaches to statistical speech synthesis including speaker adaptation, we consider some of these new applications of speech synthesis. We look behind each application at the underlying theoretical techniques used and describe the scientific advances that have made them possible. The applications we will examine include personalised speech-to-speech translation, clinical applications such as voice reconstruction for patients who have disordered speech, and noise-adaptive speech synthesis. The techniques we will examine include structural adaptation approaches, unsupervised adaptation, multi-pass architecture, and cross-lingual adaptation for speech synthesis.
  • [2011-01-26] Modelling Trajectories in Statistical Speech Synthesis. Matt Shannon and Heiga Zen. [slides1] [slides2]

Speaker: Matt Shannon (Cambridge) and Heiga Zen (Toshiba Research Europe Ltd.)
Date: Wednesday 26 January 2011, 13:00 - 15:00
Venue: Cambridge University Engineering Department, Lecture Room 2
Abstract: In statistical speech synthesis we build a probabilistic model of (processed) speech given (processed) text. The processed speech is in the form of a sequence of acoustic feature vectors, and the sequence over time of each component of this feature vector forms a trajectory. In this talk we'll discuss how to model these trajectories.
We will first review a few ways in which the standard HMM synthesis model is unsatisfactory. In particular the standard model is unnormalized, and we'll discuss the practical impact of this lack of normalization. We'll then look at normalized approaches, including the trajectory HMM (a globally normalized model) and the autoregressive HMM (a locally normalized model). Finally we'll discuss some other possible enhancements including minimum generation error (MGE) training.
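The locally normalized autoregressive idea above can be sketched in a few lines: within a state, the current frame's distribution is a Gaussian whose mean is a linear function of the preceding frames, so the model defines a proper distribution over whole trajectories. The coefficients and values below are arbitrary illustrative assumptions:

```python
# Toy sketch of the autoregressive trajectory idea: the predicted mean of
# the current frame is a linear function of the previous frames.
# Coefficients and initial values are arbitrary illustrations.

def ar_mean(history, coeffs, bias):
    """Predicted mean of o_t given the last len(coeffs) frames."""
    return bias + sum(c * h for c, h in zip(coeffs, reversed(history)))

def generate(n, coeffs, bias, init):
    """Mean trajectory: emit the predicted mean at every step."""
    traj = list(init)
    for _ in range(n):
        traj.append(ar_mean(traj[-len(coeffs):], coeffs, bias))
    return traj

traj = generate(3, [0.5], 1.0, [0.0])
```

Because each conditional is a normalized distribution, the product over frames is automatically a normalized trajectory model, in contrast with the unnormalized standard HMM synthesis model discussed above.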

  • [2011-01-11] Statistical speech synthesis. Heiga Zen. [slides]

Speaker: Heiga Zen (Toshiba Research Europe Ltd.)
Date: Tuesday 11 January 2011, 13:00 - 15:00
Venue: Cambridge University Engineering Department, Lecture Room 4
Abstract: This talk will formulate the speech synthesis problem in a statistical framework and discuss how it decomposes into sub-problems, including text analysis, spectral envelope estimation, and statistical parametric speech synthesis. The talk will also describe how these sub-problems are solved in real speech synthesis systems, and will conclude with some future challenges and directions in speech synthesis research.