A Generative Probabilistic OCR Model for NLP Applications

“A Generative Probabilistic OCR Model for NLP Applications” by O. Kolak, W. Byrne, and P. Resnik. In Proceedings of HLT-NAACL, 2003, pp. 55-62 (8 pages).


In this paper we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make them more useful for NLP tasks. We present an implementation of the model based on finite-state models, demonstrate the model's ability to significantly reduce character and word error rate, and provide evaluation results involving automatic extraction of translation lexicons from printed text.

BibTeX entry:

   author = {O. Kolak and W. Byrne and P. Resnik},
   title = {A Generative Probabilistic {OCR} Model for {NLP} Applications},
   booktitle = {Proceedings of HLT-NAACL},
   pages = {55--62 (8 pages)},
   year = {2003},
   url = {http://www.aclweb.org/anthology/N/N03/N03-1018.pdf}
