CUED Research Project: Voice Morphing

Voice morphing, also referred to as voice transformation or voice conversion, is a technique for modifying a source speaker's utterance so that it sounds as if it were spoken by a target speaker. Many applications could benefit from this technology. For example, a text-to-speech (TTS) system with integrated voice morphing can produce many different voices. Where speaker identity plays a key role, such as in dubbing movies and TV shows, high-quality voice morphing would be very valuable, allowing the appropriate voice to be generated (possibly in different languages) without the original actors being present.

Three interdependent issues must be solved before a voice morphing system can be built. First, a mathematical model of the speech signal is needed so that synthetic speech can be regenerated and prosody can be manipulated without artifacts. Second, the acoustic cues which enable humans to identify speakers must be identified and extracted. Third, the type of conversion function, and the method of training and applying it, must be decided.

The aim of this research is to develop flexible, high-quality algorithms which can morph speech from one speaker to another. A system has been developed based on a pitch-synchronous sinusoidal model which uses line spectral frequency (LSF) feature encoding and linear transforms. To ensure high quality, a number of novel techniques have been developed to minimise the artifacts which typically result from loss of glottal source information, formant bandwidth broadening, phase incoherence and spectral colouring of unvoiced sounds. Full details are given in references [1] and [2], and some demonstration files are given below.
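To make the idea of a trained conversion function concrete, the following is a minimal sketch and not the project's algorithm (the actual transforms are described in [1] and [2]). It assumes that source and target feature vectors are already time-aligned frame by frame, and it deliberately simplifies a full linear transform to an independent scale and offset per feature dimension, fitted by least squares. The function names and the toy numbers are hypothetical.

```python
def fit_affine_per_dim(source_vecs, target_vecs):
    """Least-squares fit of y_d = a_d * x_d + b_d for each feature dimension d.

    Assumption: source_vecs[i] and target_vecs[i] are time-aligned frames.
    This is a diagonal simplification of a full linear transform.
    """
    n = len(source_vecs)
    dims = len(source_vecs[0])
    params = []
    for d in range(dims):
        xs = [v[d] for v in source_vecs]
        ys = [v[d] for v in target_vecs]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        # Fall back to an identity scale if the dimension never varies.
        a = cov_xy / var_x if var_x > 0 else 1.0
        b = mean_y - a * mean_x
        params.append((a, b))
    return params


def convert_frame(vec, params):
    """Apply the fitted per-dimension scale and offset to one feature vector."""
    return [a * x + b for x, (a, b) in zip(vec, params)]


# Toy, made-up vectors standing in for time-aligned LSF frames.
source_frames = [[0.10, 0.50], [0.20, 0.60], [0.30, 0.90]]
target_frames = [[2.0 * x0 + 0.05, 0.5 * x1 + 0.10] for x0, x1 in source_frames]

params = fit_affine_per_dim(source_frames, target_frames)
converted = convert_frame([0.25, 0.70], params)
```

A full linear transform would replace the per-dimension scale with a matrix, so that each converted feature can depend on every source feature.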

Current work is focussed on extending the techniques to allow the conversion of an unknown speaker's voice to sound like that of a known target speaker.

Staff members associated with the project are Steve Young and Hui YE (aka KK).

A voice conversion demo is available for downloading: KMorph.zip.

Demonstration Files

The table below shows some examples of the voice morphing technology. The "Source Speech" column contains utterances of the source speaker, and the "Target Speech" column contains the target speaker's utterances. The utterances in these two columns are NOT included in the training data used to estimate the conversion function. The next two columns, "Converted Speech 1" and "Converted Speech 2", are the results regenerated using the voice morphing technology. The difference between these two columns is that "Converted Speech 1" applies the target prosody extracted from the target utterance, whereas "Converted Speech 2" retains the original prosody of the source utterance. Converting with both prosodies makes it possible to evaluate the influence of prosody on speaker identification.

Columns: Source Speech | Target Speech | Converted Speech 1 | Converted Speech 2
Rows: Female to Male | Male to Female | Female to Female | Male to Male
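
The difference between "Converted Speech 1" and "Converted Speech 2" can be illustrated with a rough sketch. It assumes a single voiced frame, reduces prosody to the fundamental frequency (F0) alone, ignores phase, and models the frame as a plain sum of harmonics whose amplitudes are sampled from a spectral envelope. The envelope points, the F0 values and the function names are all hypothetical, and this is not the pitch-synchronous model used in the project.

```python
import math


def envelope_at(freq_hz, env_points):
    """Linearly interpolate an envelope given as sorted (freq_hz, amplitude) points."""
    if freq_hz <= env_points[0][0]:
        return env_points[0][1]
    for (f_lo, a_lo), (f_hi, a_hi) in zip(env_points, env_points[1:]):
        if freq_hz <= f_hi:
            t = (freq_hz - f_lo) / (f_hi - f_lo)
            return a_lo + t * (a_hi - a_lo)
    return env_points[-1][1]


def synth_voiced_frame(f0_hz, env_points, sample_rate=16000, n_samples=160):
    """Sum the harmonics of f0_hz below Nyquist, with zero phase (a simplification)."""
    nyquist = sample_rate / 2
    harmonics = []
    k = 1
    while k * f0_hz < nyquist:
        harmonics.append(k * f0_hz)
        k += 1
    # Sampling the envelope at each harmonic keeps the spectral shape
    # fixed while the pitch changes.
    amps = [envelope_at(h, env_points) for h in harmonics]
    frame = []
    for n in range(n_samples):
        t = n / sample_rate
        frame.append(sum(a * math.cos(2 * math.pi * h * t)
                         for a, h in zip(amps, harmonics)))
    return frame


# Made-up converted spectral envelope and F0 values for one frame.
converted_envelope = [(0.0, 1.0), (1000.0, 0.5), (4000.0, 0.1), (8000.0, 0.0)]
source_f0_hz = 220.0  # prosody of the source utterance
target_f0_hz = 110.0  # prosody extracted from the target utterance

converted_1 = synth_voiced_frame(target_f0_hz, converted_envelope)  # target prosody
converted_2 = synth_voiced_frame(source_f0_hz, converted_envelope)  # source prosody
```

Both frames share the same converted envelope; only the F0 that drives the harmonics differs, which is the contrast the two converted columns are meant to expose.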


[1] Ye, H. and S. Young (2003). "Perceptually Weighted Linear Transformations for Voice Conversion". Eurospeech 2003, Geneva. Gzipped Postscript (39kb)

[2] Ye, H. and S. Young (2004). "High Quality Voice Morphing". Int. Conference on Acoustics, Speech and Signal Processing (ICASSP 2004), Montreal, Canada. Gzipped Postscript (37kb)


This research project was sponsored from March 2002 to February 2004 by Anthropics Technology Ltd.

