ochre

Clean up OCR text using words as well as their context.

1 mention | 1 contributor | 197 commits | Last update: August 10, 2018

What ochre can do for you

  • Train character-based language models/LSTMs for OCR post-correction
  • Ready-to-use workflows for data preprocessing, training correction models, performing the post-correction, and analyzing (remaining) errors
  • Compare (corrected) OCR text to the gold standard based on character error rate (CER), word error rate (WER), and order-independent word error rate
  • Analyze OCR errors on the word level
  • Discover OCR post-correction data sets
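
The error rates named above can be illustrated with a minimal sketch. This is not ochre's own code: the function names are hypothetical, CER and WER follow the usual edit-distance definitions, and the order-independent variant shown is one possible bag-of-words formulation that may differ from the one ochre reports. The gold standard text is assumed to be non-empty.

```python
from collections import Counter


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(gold, ocr):
    """Character error rate: character edits per gold character."""
    return edit_distance(gold, ocr) / len(gold)


def wer(gold, ocr):
    """Word error rate: word edits per gold word (whitespace tokens)."""
    gold_words = gold.split()
    return edit_distance(gold_words, ocr.split()) / len(gold_words)


def bag_of_words_wer(gold, ocr):
    """Order-independent word error rate (assumed bag-of-words variant)."""
    gold_bag, ocr_bag = Counter(gold.split()), Counter(ocr.split())
    missing = sum((gold_bag - ocr_bag).values())  # gold words absent from OCR
    extra = sum((ocr_bag - gold_bag).values())    # OCR words absent from gold
    return max(missing, extra) / sum(gold_bag.values())


if __name__ == "__main__":
    gold = "the quick brown fox"
    ocr = "the qu1ck brown f0x"
    print(cer(gold, ocr), wer(gold, ocr), bag_of_words_wer(gold, ocr))
```

Comparing these scores before and after post-correction gives a rough measure of how much a correction model helps. The order-independent variant ignores word position, so text that is correct but extracted in a different reading order is not penalized.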

Ochre is experimental software for cleaning up text that contains OCR mistakes. The software was developed to investigate whether character-based language models can be used to remove OCR mistakes. In addition, ochre provides functionality to analyze the kinds of OCR mistakes in a corpus. This enables researchers to compare different OCR post-correction methods and find out which kinds of mistakes each method corrects well.
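
To give an idea of what a word-level error analysis can look like, the sketch below aligns gold standard and OCR text with Python's standard `difflib` module and lists the mismatching spans. It is an illustration under assumptions, not ochre's implementation: `word_errors` is a hypothetical helper, and tokenization is plain whitespace splitting.

```python
from collections import Counter
from difflib import SequenceMatcher


def word_errors(gold, ocr):
    """Return (kind, gold_span, ocr_span) for each mismatching word span."""
    gold_words, ocr_words = gold.split(), ocr.split()
    matcher = SequenceMatcher(a=gold_words, b=ocr_words, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete', or 'insert'
            errors.append((tag,
                           " ".join(gold_words[i1:i2]),
                           " ".join(ocr_words[j1:j2])))
    return errors


if __name__ == "__main__":
    errors = word_errors("the quick brown fox jumps",
                         "the qu1ck brown fox ju mps")
    for kind, gold_span, ocr_span in errors:
        print(f"{kind}: {gold_span!r} -> {ocr_span!r}")
    # Tally how often each kind of mismatch occurs.
    print(Counter(kind for kind, _, _ in errors))
```

Tallying such spans over a corpus, once for the raw OCR output and once for each post-corrected version, shows which kinds of mistakes a given correction method removes and which it leaves behind.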

Tags
  • Text analysis & natural language processing
  • Machine learning
Programming Language
  • Python
License
  • Apache-2.0

Contributors

  • Janneke van der Zwaan
    Netherlands eScience Center
Contact person
Janneke van der Zwaan
Netherlands eScience Center