University of Cambridge, Department of Engineering, Machine Intelligence Lab

Abstract for witt_thesis

PhD Thesis


S. Witt

Nov 1999

Computer-assisted language learning (CALL) systems that can listen to a student's speech and judge its quality would be very valuable for foreign language teaching. At present, however, it is difficult to integrate pronunciation teaching and assessment into computer-assisted language learning systems. Two major problems need to be addressed. Firstly, without improved acoustic modelling of non-native speech, oral interaction between a language student and a CALL system is unduly limited by poor recognition performance. Secondly, there are no reliable methods to automatically score the pronunciation of a student and to localise pronunciation errors. The research presented in this thesis investigates solutions to these problems within the framework of hidden Markov model-based automatic speech recognition.

The thesis begins by outlining those aspects of pronunciation teaching that are important for computer-assisted language learning. This clarifies what types of pronunciation exist and how they could be taught in an automated system. Next, the characteristics of non-native speech that degrade speech recognition accuracy are discussed.

To improve the acoustic modelling of non-native speech, two adaptation algorithms have been developed, called Linear Model Combination and Model Merging. These algorithms are based on the assumption that the mother tongue of a non-native speaker is known. The basic idea underlying most findings of this thesis is that non-native speech can be modelled with a mixture of sounds of a speaker's native language and the target language. The newly developed speaker adaptation algorithms combine the acoustic models of the source and target language of a non-native speaker. The algorithms differ only in the details of how the model sets are combined.
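As a rough illustration of combining source- and target-language acoustic models, the sketch below pools the Gaussian components of two one-dimensional mixture densities into a single mixture, scaling the component weights by an interpolation factor. The abstract does not give the details of Linear Model Combination or Model Merging, so the `Gaussian` type, the `combine_mixtures` function and the `source_weight` parameter are hypothetical names chosen for illustration and should not be read as the thesis's algorithms.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gaussian:
    """One component of a 1-D Gaussian mixture (illustrative only)."""
    weight: float  # mixture weight; weights of one mixture sum to 1
    mean: float
    var: float


def combine_mixtures(target: List[Gaussian],
                     source: List[Gaussian],
                     source_weight: float) -> List[Gaussian]:
    """Pool two mixtures into one, assuming a simple weight interpolation.

    `source_weight` is a hypothetical parameter: the share of probability
    mass given to the speaker's native-language (source) model.
    """
    if not 0.0 <= source_weight <= 1.0:
        raise ValueError("source_weight must lie in [0, 1]")
    scaled_target = [Gaussian(g.weight * (1.0 - source_weight), g.mean, g.var)
                     for g in target]
    scaled_source = [Gaussian(g.weight * source_weight, g.mean, g.var)
                     for g in source]
    return scaled_target + scaled_source


# Made-up parameters for one target-language and one source-language phone.
english_phone = [Gaussian(0.6, 1.0, 0.5), Gaussian(0.4, 2.0, 0.7)]
japanese_phone = [Gaussian(1.0, 1.5, 0.4)]
combined = combine_mixtures(english_phone, japanese_phone, source_weight=0.25)
```

With `source_weight=0` the combined model reduces to the target-language model, so the factor can be read as how strongly the speaker's native language is assumed to colour their pronunciation.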

A database of non-native English was recorded for the purpose of testing these adaptation algorithms. This database mostly consists of utterances of Japanese and Latin-American Spanish accented English. The recordings were transcribed by trained phoneticians to obtain transcriptions corresponding to the actual phoneme sequence uttered by the student as opposed to canonical transcriptions obtained from a standard pronunciation dictionary. Evaluation of the two new speaker adaptation techniques based on speech from this database demonstrates the algorithms' capability to reduce the recognition error rate by up to 40% relative to the baseline and by up to 20% in comparison with standard adaptation methods.
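The phrase "relative to the baseline" refers to a relative rather than an absolute reduction in error rate. The short sketch below shows the arithmetic; the function name and the example error rates are hypothetical and are not figures from the thesis.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error rate removed by the new system."""
    if baseline_error <= 0.0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical example: an error rate falling from 50% to 30% is a
# 20-point absolute reduction but a 40% relative reduction.
reduction = relative_error_reduction(0.50, 0.30)
```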

Minimising the required amount of adaptation data is highly desirable in CALL. Indeed, the shorter the enrolment task, the better for commercial applications. For this reason, we developed a set of three algorithms to improve recognition of non-native speech. These algorithms are also based on combining the acoustic models of the source and target language of a foreign speaker, but they do not require any adaptation material. Evaluation of these algorithms with Japanese and Latin-American Spanish accented English showed that they can improve recognition accuracy of non-native speech by up to 27% relative to a speaker-independent system.

The second part of this thesis analyses the problem of automatic pronunciation assessment. Because pronunciation assessment is highly subjective, it was necessary to develop a set of four performance measures which compare human or computer-based judgements of pronunciation on a phone-by-phone basis. These measures were applied to compare the manually edited transcriptions of the non-native database produced by six different phoneticians. This analysis of human labelling characteristics showed that even though human assessment of pronunciation can vary considerably, a common level of judgement exists. Therefore, the averaged assessment similarity of different human judges is used as a benchmark against which the performance of any automatic assessment method is measured.
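The idea of averaging the similarity between human judges to obtain a benchmark can be sketched as follows. The abstract does not define the four performance measures, so the simple per-phone agreement rate used here is a stand-in chosen for illustration; the function names and the labels are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of phones on which two judges give the same label.

    Each label is True if the judge marked the phone as mispronounced.
    """
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and equally long")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_pairwise_agreement(judges: List[Sequence[bool]]) -> float:
    """Average agreement over all pairs of judges (the human benchmark)."""
    pairs = list(combinations(judges, 2))
    if not pairs:
        raise ValueError("need at least two judges")
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)


# Made-up labels from three judges for the same four phones.
judge_1 = [False, True, False, False]
judge_2 = [False, True, True, False]
judge_3 = [False, False, False, False]
benchmark = mean_pairwise_agreement([judge_1, judge_2, judge_3])
```

An automatic method could then be scored by its average agreement with each human judge and compared against `benchmark`.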

Based on this analysis of how phoneticians assess pronunciation, an automatic method of assessing pronunciation which we call Goodness of Pronunciation (GOP) was developed. This method calculates a score for each phone in an utterance to localise pronunciation errors. The baseline algorithm was refined in several ways, with the result that the optimal set of refinements yields an algorithm whose assessment capability is comparable to human assessment.
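A per-phone score of this kind can be sketched as below. The abstract does not state the GOP formula; the sketch assumes a duration-normalised log-likelihood ratio between the expected phone and the best-scoring phone over the same segment, a form commonly associated with GOP. The function names, log-likelihood values and threshold are all hypothetical.

```python
from typing import Dict, List, Tuple

# One aligned segment: (expected phone, log-likelihood of the segment
# under each phone model, number of acoustic frames in the segment).
Segment = Tuple[str, Dict[str, float], int]


def gop_score(phone_loglik: Dict[str, float],
              canonical: str,
              n_frames: int) -> float:
    """Assumed GOP form: |log p(O|canonical) - max_q log p(O|q)| / frames."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    best = max(phone_loglik.values())
    return abs(phone_loglik[canonical] - best) / n_frames


def flag_errors(segments: List[Segment], threshold: float) -> List[str]:
    """Return the expected phones whose score exceeds the threshold."""
    return [canonical
            for canonical, logliks, n_frames in segments
            if gop_score(logliks, canonical, n_frames) > threshold]


# Made-up log-likelihoods for the three phones of the word "cat".
segments = [
    ("k", {"k": -50.0, "g": -58.0}, 8),
    ("ae", {"ae": -120.0, "eh": -100.0}, 10),  # "eh" fits better than "ae"
    ("t", {"t": -40.0, "d": -44.0}, 5),
]
suspect_phones = flag_errors(segments, threshold=1.0)
```

A score of zero means the expected phone was also the best-matching one; larger scores indicate that some other phone explains the audio better, which is what allows errors to be localised to individual phones.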

The research findings of this thesis open a new direction for future research on improving CALL systems. The conclusions outline how these findings could be developed further. For example, the techniques could be applied to other languages and other types of recognition systems. The thesis concludes with a discussion of possible ways of integrating these new algorithms into a computer-assisted language learning system.

PDF (automatically generated from original PostScript document - may be badly aliased on screen):
  (ftp:) witt_thesis.pdf | (http:) witt_thesis.pdf

