LARGE VOCABULARY ASR FOR SPONTANEOUS CZECH IN THE MALACH PROJECT

Download: PDF.

“LARGE VOCABULARY ASR FOR SPONTANEOUS CZECH IN THE MALACH PROJECT” by J. Psutka, P. Ircing, J.V. Psutka, V. Radovic, W. Byrne, J. Hajic, Jiri Mirovsky, and Samuel Gustman. In Proc. of the European Conference on Speech Communication and Technology (EUROSPEECH), 2003.

Abstract

This paper describes LVCSR research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. This project attempts to provide improved access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automated speech recognition. We describe a baseline ASR system and discuss the problems in language modeling that arise from the nature of Czech as a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The difficulties of this task are compounded by heavily accented, emotional, and disfluent speech, along with frequent switching between languages. To overcome the limited amount of relevant language model data, we use statistical techniques for selecting an appropriate training corpus from a large unstructured text collection, resulting in significant reductions in word error rate.


BibTeX entry:

@inproceedings{malachczasr_eurospeech03,
   author = {J. Psutka and P. Ircing and J.V. Psutka and V. Radovic and W.
	Byrne and J. Hajic and Jiri Mirovsky and Samuel Gustman},
   title = {LARGE VOCABULARY {ASR} FOR SPONTANEOUS {C}ZECH IN THE {MALACH}
	PROJECT},
   booktitle = {Proc. of the European Conference on Speech Communication
	and Technology (EUROSPEECH)},
   pages = {(4 pages)},
   year = {2003}
}
