A tool to clean up text generated by OCR using individual words as well as their context.

306 commits | Last update: September 16, 2019

What ochre can do for you

  • Train character-based language models/LSTMs for OCR post-correction
  • Ready-to-use workflows for data preprocessing, training correction models, performing the post-correction, and analyzing the remaining errors
  • Compare OCR text (raw or corrected) to the gold standard based on character error rate (CER), word error rate (WER), and order-independent word error rate
  • Analyze OCR errors on the word level
  • Discover OCR post-correction data sets
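
To make the evaluation metrics in the list above concrete, here is a minimal sketch of how CER, WER, and an order-independent WER could be computed. It is not ochre's own implementation or API: the function names are hypothetical, and the bag-of-words definition of the order-independent variant is an assumption made for illustration.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(gold, ocr):
    """Character error rate: character edits per gold character.

    Assumes a non-empty gold standard text.
    """
    return edit_distance(gold, ocr) / len(gold)


def wer(gold, ocr):
    """Word error rate: word edits per gold word (whitespace tokenization)."""
    gold_words = gold.split()
    return edit_distance(gold_words, ocr.split()) / len(gold_words)


def order_independent_wer(gold, ocr):
    """Bag-of-words error rate that ignores word order (assumed definition)."""
    gold_counts = Counter(gold.split())
    ocr_counts = Counter(ocr.split())
    missing = sum((gold_counts - ocr_counts).values())  # gold words not in OCR
    extra = sum((ocr_counts - gold_counts).values())    # OCR words not in gold
    return max(missing, extra) / sum(gold_counts.values())


gold = "the quick brown fox"
ocr = "the qu1ck brovvn fox"
print(cer(gold, ocr), wer(gold, ocr), order_independent_wer(gold, ocr))
```

The order-independent variant is useful when OCR output contains the right words in a different reading order (for example, with multi-column layouts): under this sketch, a reordered but otherwise perfect text scores 0, while the ordinary WER penalizes it.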

Ochre is experimental software for cleaning up text with OCR mistakes. The software was developed to investigate whether character-based language models can be used to remove OCR mistakes. In addition, ochre provides functionality to analyze the kinds of OCR mistakes in a corpus. This enables researchers to compare different OCR post-correction methods and find out what kinds of mistakes they are good at solving.

  • Text analysis & natural language processing
  • Machine learning
Programming Language
  • Python
License
  • Apache-2.0
Source code

Participating organizations


  • Janneke van der Zwaan
    Netherlands eScience Center
Contact person
Janneke van der Zwaan
Netherlands eScience Center
