Q5.2: Performing speech synthesis

There are several algorithms, and the choice depends on the task they are used for. The easiest way is simply to record a person speaking the desired phrases. This is useful if only a restricted set of phrases and sentences is needed, e.g. messages in a train station, or schedule information via phone. The quality depends on how the recording is made.
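
As a rough illustration of the recorded-phrase approach, the sketch below maps each phrase to a pre-recorded file and returns the files to play in order. The phrase list, the file paths and the function name plan_announcement are made up for this example; actual playback is left out.

```python
# Hypothetical inventory: phrase text -> path of a pre-recorded file.
# Both the phrases and the paths are invented for illustration.
RECORDINGS = {
    "the train to": "rec/the_train_to.wav",
    "hamburg": "rec/hamburg.wav",
    "is delayed": "rec/is_delayed.wav",
}


def plan_announcement(parts):
    """Return the recordings to play, in order.

    Raises KeyError if a phrase was never recorded, which is the basic
    limitation of this method: nothing outside the inventory can be said.
    """
    missing = [p for p in parts if p not in RECORDINGS]
    if missing:
        raise KeyError(f"no recording for: {missing}")
    return [RECORDINGS[p] for p in parts]


playlist = plan_announcement(["the train to", "hamburg", "is delayed"])
```

The KeyError branch is the point of the example: the method works only as long as every message can be built from the recorded set.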

More sophisticated, but lower in quality, are algorithms which split the speech into smaller pieces. The smaller those units are, the fewer of them are needed, but the quality also decreases. An often used unit is the phoneme, the smallest unit of sound that distinguishes meaning in a language. Depending on the language used there are about 35-50 phonemes in western European languages, i.e. there are 35-50 single recordings. The problem lies in combining them, as fluent speech requires smooth transitions between the elements. The intelligibility is therefore lower, but the memory required is small.
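
To make the joining problem concrete, here is a minimal sketch that concatenates recorded units and blends a few samples at each joint with a linear crossfade. This is only one simple way to soften a joint and is not taken from any particular synthesizer; it does nothing about coarticulation, which is the real difficulty. The function name, the overlap parameter and the toy sample values are assumptions made for this example.

```python
def crossfade_concat(units, overlap):
    """Join lists of samples, blending `overlap` samples at each joint.

    units   -- list of sample lists (one per recorded unit, e.g. a phoneme)
    overlap -- number of samples shared by neighbouring units (0 = hard cut)
    """
    out = list(units[0])
    for unit in units[1:]:
        if overlap > len(out) or overlap > len(unit):
            raise ValueError("overlap is longer than a unit")
        if overlap:
            tail = out[-overlap:]
            head = unit[:overlap]
            blended = []
            for i in range(overlap):
                w = (i + 1) / (overlap + 1)  # weight of the incoming unit
                blended.append((1 - w) * tail[i] + w * head[i])
            out[-overlap:] = blended
        out.extend(unit[overlap:])
    return out


# Toy "recordings": constant sample values stand in for real audio.
unit_a = [1.0, 1.0, 1.0, 1.0]
unit_b = [0.0, 0.0, 0.0, 0.0]
joined = crossfade_concat([unit_a, unit_b], overlap=2)
```

With overlap=0 the units are simply butted together, which corresponds to the abrupt transitions described above.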

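A solution to this dilemma is using diphones. Instead of splitting at the transitions, the cut is made at the center of the phonemes, leaving the transitions themselves intact. With N phonemes this gives up to N*N elements, e.g. about 400 for an inventory of 20 phonemes (20*20) and correspondingly more for the 35-50 phonemes mentioned above, and the quality increases.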
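
The following sketch shows how a phoneme sequence might be turned into the diphone units to look up, and how the upper bound on the inventory size follows from the number of phonemes. The phoneme symbols, the "_" silence marker at the utterance edges and both function names are assumptions for illustration only.

```python
def to_diphones(phonemes, silence="_"):
    """Turn a phoneme sequence into diphone names.

    Each diphone runs from the middle of one phoneme to the middle of the
    next, so it is named after the pair. Padding with a silence symbol
    (an assumption of this sketch) covers the start and end of the utterance.
    """
    padded = [silence] + list(phonemes) + [silence]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def max_diphone_inventory(n_phonemes):
    """Upper bound on distinct diphones: every ordered pair of phonemes."""
    return n_phonemes * n_phonemes


units = to_diphones(["h", "e", "l", "ou"])
# units == ["_-h", "h-e", "e-l", "l-ou", "ou-_"]
```

Note that the bound counts every ordered pair; a real language does not use all of them, so an actual inventory can be smaller.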

The longer the units become, the more elements there are; the quality increases, but so does the memory required. Other widely used units are half-syllables, syllables, words, or combinations of them, e.g. word stems and inflectional endings.

The Museum of Speech Analysis and Synthesis has pictures of artificial speech systems going back over 150 years and is worth a visit (http://mambo.ucsc.edu/psl/smus/smus.html).

Last Revision: 14:32 08-Aug-1996