Cambridge University Engineering Department Technical Report CUED/F-INFENG/TR217


Dan Kershaw, Mike Hochberg & Tony Robinson

July 1995

A modular method for incorporating context-dependent phone classes into the CUED connectionist-HMM hybrid speech recognition system is introduced. The current CUED connectionist-HMM hybrid system performs well on large vocabulary speech recognition tasks. Although the recurrent framework does model acoustic context internally (mainly in the hidden state vector), the output targets are currently context independent. It is proposed that adding phonetic-context-dependent targets to the recurrent network would improve modelling, as is seen when comparing equivalent monophone and triphone HMM systems.

This report discusses the methods necessary to introduce context-dependent outputs into the hybrid system. It focusses on two main issues: first, which context classes should be modelled, and which would suit the recurrent framework best; second, given a set of context classes, which mechanism should be employed to model them. A decision-tree based approach was used to cluster the different context classes of each phone. The final training strategy involved a modular solution, whereby single-layer networks were trained on the state vector to discriminate between the different context classes, given the phone class.
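
The modular solution described above can be illustrated with a minimal sketch. It assumes that each phone has its own single-layer softmax module over that phone's context classes, and that the module outputs are combined with the recurrent network's phone posteriors by the chain rule, P(context, phone | x) = P(phone | x) P(context | phone, x). The function names, the toy state vector, the phone labels and all weights are hypothetical values chosen for illustration; they are not taken from the system itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_posteriors(state, weights, biases):
    """Single-layer softmax module: one weight row per context class.

    Returns an estimate of P(context | phone, state) for one phone.
    """
    scores = [
        sum(w * x for w, x in zip(row, state)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


def joint_posteriors(phone_post, state, modules):
    """Combine phone posteriors with per-phone context posteriors.

    phone_post: dict phone -> P(phone | x), as from the recurrent network.
    modules:    dict phone -> (weights, biases) of that phone's module.
    """
    joint = {}
    for phone, p_phone in phone_post.items():
        weights, biases = modules[phone]
        ctx_post = context_posteriors(state, weights, biases)
        for k, p_ctx in enumerate(ctx_post):
            joint[(phone, k)] = p_phone * p_ctx
    return joint


# Hypothetical toy values: a 3-dimensional state vector and two phones,
# with 2 and 3 context classes respectively.
state = [0.2, -0.5, 1.0]
phone_post = {"aa": 0.7, "t": 0.3}
modules = {
    "aa": ([[0.1, 0.3, -0.2], [0.0, -0.1, 0.4]], [0.0, 0.1]),
    "t": ([[0.5, 0.2, 0.1], [-0.3, 0.1, 0.0], [0.2, 0.2, 0.2]], [0.0, 0.0, 0.0]),
}
joint = joint_posteriors(phone_post, state, modules)
```

Because each module is normalised only over the context classes of its own phone, the joint values for a phone sum back to that phone's posterior, so the context-independent network can be left unchanged while the modules are trained separately.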

Initial experiments show an average reduction of around 16% in word error rate on some ARPA Wall Street Journal tasks. The new context-dependent system still has far fewer parameters than any equivalent HMM system and, owing to the improved modelling, decodes over twice as fast as the context-independent system.

