Text-induced corpus correction and lexical assessment tool
Humanities research makes extensive use of digital archives. Most of these archives, including the KB newspaper data, consist of digitized text. One of the major challenges of using these collections for research is the fact that Optical Character Recognition (OCR) on scanned historical documents is far from perfect. Although it is hard to quantify the impact of OCR mistakes on humanities research, it is known that these mistakes have a negative impact on basic text processing techniques such as sentence boundary detection, tokenization, and part-of-speech tagging. As these basic techniques are often used prior to performing more advanced techniques and most advanced techniques use words as features, it is likely that OCR mistakes have a negative impact on more advanced text mining tasks humanities researchers are interested in, such as named entity recognition, topic modeling, and sentiment analysis.
The goal of this research is to bring the digitized text closer to the original newspaper articles by applying post-correction. Post-correction involves improving digitized text quality by manipulating the textual output of the OCR process directly. The idea is that better quality data boosts eHumantities research. Although the quality of the KB newspaper data would definitely benefit from improving the OCR process itself (improved image recognition), post-correction will still be necessary, because the quality of historical newspapers is suboptimal for OCR (for example, due to poor paper and print quality).
Existing approaches for OCR post-correction generally make use of extensive dictionaries to replace words in the OCRed text that do not occur in the dictionary with words that do. Based on the assumption that a number of characters in every word will be identified correctly, words not in the dictionary are replaced with alternatives that are as similar as possible to the text recognized, possibly taking into account word frequencies to solve ties. The main problem with these existing approaches is that they do not take into account the context in which words occur.
Deep learning techniques provide an opportunity to take this context into account. This project aims to learn a character based language model of Dutch newspaper articles. This is a model of the character sequences occurring in the text of a corpus. OCR mistakes can be viewed as deviations from this model. Mistakes can be fixed by intervening when text deviates too much from the model.