Trainable Speech Synthesis

During my PhD I developed a speech synthesis system which used a set of decision-tree state-clustered HMMs to automatically select and segment a set of HMM-state-sized subword units from a one-hour single-speaker continuous-speech database, for use in a concatenation synthesiser. Duration and energy parameters were also estimated automatically. The selected segments were concatenated using a TD-PSOLA synthesiser. The system could synthesise highly intelligible, fluent speech from a word string of known phonetic pronunciation. It could be retrained on a new voice in less than 48 hours, and was trained on four different voices. The segmental intelligibility of the speech was measured using large-scale modified rhyme tests, and an error rate of 5.0% was obtained. For full details see my thesis, which is available from my home page.
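At the heart of any concatenative synthesiser of this kind is a dynamic-programming search that picks one recorded unit per target position, trading off how well a unit matches its target against how smoothly it joins its neighbour. The sketch below is not the thesis's actual algorithm, just a minimal illustration of that selection idea; the function names and cost functions are hypothetical.

```python
def select_units(targets, candidates, target_cost, join_cost):
    """Pick one candidate unit per target position via dynamic programming,
    minimising the summed target cost plus the join cost between neighbours.

    targets     -- list of target specifications, one per position
    candidates  -- list of candidate-unit lists, one list per position
    target_cost -- target_cost(target, unit) -> float
    join_cost   -- join_cost(left_unit, right_unit) -> float
    """
    n = len(targets)
    # best[i][j] = (cheapest cost of a path ending in candidates[i][j],
    #               index of the best predecessor in row i-1)
    best = [[(target_cost(targets[0], c), None) for c in candidates[0]]]
    for i in range(1, n):
        row = []
        for c in candidates[i]:
            tc = target_cost(targets[i], c)
            prev_cost, prev_k = min(
                (best[i - 1][k][0] + join_cost(candidates[i - 1][k], c), k)
                for k in range(len(candidates[i - 1]))
            )
            row.append((tc + prev_cost, prev_k))
        best.append(row)
    # Trace back the cheapest path from the last row.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(n)]
```

With trivial numeric "units" and absolute-difference costs, `select_units([1, 2], [[1, 5], [2, 9]], lambda t, c: abs(t - c), lambda a, b: 0)` returns `[1, 2]`.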

The following speech examples are either in 8 kHz 8-bit mu-law Sun/NeXT format (m) or 16 kHz 16-bit uncompressed PCM NIST format (N).
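The mu-law format compresses each sample logarithmically before quantising to 8 bits, which preserves more detail at low amplitudes than uniform quantisation would. A minimal sketch of the continuous companding formula (with mu = 255, as in the G.711-style encoding); real files use a segmented table-driven variant, and these function names are my own:

```python
import math

MU = 255.0

def mulaw_encode(x):
    """Compress a sample in [-1, 1] to an 8-bit code (0..255)
    using the continuous mu-law companding formula."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # Uniformly quantise the compressed value to 256 levels.
    return int(round((y + 1.0) / 2.0 * 255.0))

def mulaw_decode(code):
    """Expand an 8-bit code back to a sample in [-1, 1]."""
    y = code / 255.0 * 2.0 - 1.0
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)
```

Round-tripping a sample through encode and decode recovers it to within a fraction of a percent of full scale, with the quantisation error concentrated at high amplitudes.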

Speech examples are available in four voices: Rob (m,N), Phil (m,N), Patricia (m,N), and Tina (m,N).

This speech (m, N) was synthesised using a stylised pitch track. This female speech (m, N) was transformed into this male speech (m, N) by using the female speaker's models and the words of the utterance to obtain the original phone sequence and appropriate durations. These were used, along with a scaled version of the pitch track (obtained in this case from a laryngograph signal), to synthesise the new speech.
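The pitch-scaling step described above amounts to mapping one speaker's F0 contour into another's range. A minimal sketch of one common way to do this, scaling each voiced frame by the ratio of the two speakers' median pitches (the function name and the zero-means-unvoiced convention are assumptions, not details from the thesis):

```python
def rescale_pitch(f0_track, src_median, tgt_median):
    """Map an F0 contour (Hz, one value per frame) from a source speaker's
    range to a target speaker's by scaling every voiced frame by the ratio
    of median pitches. Frames marked 0.0 are treated as unvoiced and kept."""
    ratio = tgt_median / src_median
    return [f0 * ratio if f0 > 0 else 0.0 for f0 in f0_track]
```

For example, scaling a female contour with a 200 Hz median down to a 100 Hz male median: `rescale_pitch([200.0, 0.0, 220.0], 200.0, 100.0)` gives `[100.0, 0.0, 110.0]`.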

Thanks to Tina Burrows, Phil Woodland and Patricia for recording databases.

Follow this link to a list of related sites and pages of synthetic speech examples.