Reference #3


R. E. Donovan & P. C. Woodland (1995). Improvements in an HMM-Based Speech Synthesiser. Proc. Eurospeech '95, pp. 573-576, Madrid.

ABSTRACT Improvements are presented to the performance of a speech synthesis system which learns its parameters through training on a speech database. The system uses a set of cross-word decision-tree state-clustered triphone HMMs to segment the database into approximately 4000 clustered states, which are then used as the sub-word units for synthesis. The system is fully automatic, and can be retrained on a new voice in under 48 hours. The synthetic voice mimics the voice used in training. The improvements in segmentation presented in this paper, together with the adoption of the PSOLA synthesis technique, have enabled the system to produce speech with high levels of intelligibility and naturalness.