Ochre

Ochre is a toolbox for OCR post-correction. Please note that this software is experimental and very much a work in progress!

Overview of OCR post-correction data sets
Preprocess data sets
Train character-based language models/LSTMs for OCR post-correction
Do the post-correction
Assess the performance of OCR post-correction
Analyze OCR errors

Ochre contains ready-to-use data processing workflows (based on CWL). The software also allows you to create your own (OCR post-correction related) workflows. Examples of how to create these can be found in the notebooks directory (to be able to use those, make sure you have Jupyter Notebooks installed). This directory also contains notebooks that show how results can be analyzed and visualized.

Data sets

VU DNC corpus
- Language: nl
- Format: FoLiA
- ~3340 newspaper articles, different genres, 5 newspapers, 1950/1951
- Gold standard is noisy
ICDAR 2017 shared task on OCR post correction
- Language: en and fr
- Format: txt (more info on the website)
- Periodicals and monographs
Digitzed yearbooks of the Swiss Alpine Club (19th century)
- Paper: http://www.zora.uzh.ch/124786/
- Languages: de and fr
Sydney Morning Herald 1842-1954 (Overproof)
- Paper: http://dl.acm.org/citation.cfm?id=2595200
- Languages: en
byu synthetic ocr data
- Paper: http://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=2664&context=facpub
- Not (yet) available

Installation

git clone [email protected]:KBNLresearch/ochre.git
cd ochre
pip install -r requirements.txt
python setup.py develop

Using the CWL workflows requires (the development version of) nlppln and its requirements (see installation guidelines).
To run a CWL workflow type: cwltool|cwl-runner path/to/workflow.cwl <inputs> (if you run the command without inputs, the tool will tell you about what inputs are required and how to specify them). For more information on running CWL workflows, have a look at the nlppln documentation. This is especially relevant for Windows users.
Please note that some of the CWL workflows contain absolute paths, if you want to use them on your own machine, regenerate them using the associated Jupyter Notebooks.

Preprocessing

The software needs the data in the following formats:

ocr: text files containing the ocr-ed text, one file per unit (article, page, book, etc.)
gs: text files containing the gold standard (correct) text, one file per unit (article, page, book, etc.)
aligned: json files containing aligned character sequences:

{
    "ocr": ["E", "x", "a", "m", "p", "", "c"],
    "gs": ["E", "x", "a", "m", "p", "l", "e"]
}

Corresponding files in these directories should have the same name (or at least the same prefix), for example:

├── gs
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
├── ocr
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
└── aligned
    ├── 1.json
    ├── 2.json
    └── 3.json

To create data in these formats, CWL workflows are available. First run a preprocess workflow to create the gs and ocr directories containing the expected files. Next run an align workflow to create the align directory.

VU DNC corpus: vudnc-preprocess-pack.cwl (can be run as stand-alone; associated notebook vudnc-preprocess-workflow.ipynb)
ICDAR 2017 shared task on OCR post correction: icdar2017st-extract-data-all.cwl (cannot be run as stand-alone; regenerate with notebook ICDAR2017_shared_task_workflows.ipynb)

To create the alignments, run one of:

align-dir-pack.cwl to align all files in the gs and ocr directories
align-test-files-pack.cwl to align the test files in a data division

These workflows can be run as stand-alone; associated notebook align-workflow.ipynb.

Training networks for OCR post-correction

First, you need to divide the data into a train, validation and test set:

python -m ochre.create_data_division /path/to/aligned

The result of this command is a json file containing lists of file names, for example:

{
    "train": ["1.json", "2.json", "3.json", "4.json", "5.json", ...],
    "test": ["6.json", ...],
    "val": ["7.json", ...]
}

Script: lstm_synched.py

OCR post-correction

If you trained a model, you can use it to correct OCR text using the lstm_synced_correct_ocr command:

python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file

cwltool /path/to/ochre/cwl/lstm_synced_correct_ocr.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --txt /path/to/ocr/text/file

The command creates a text file containing the corrected text.

To generate corrected text for the test files of a dataset, do:

cwltool /path/to/ochre/cwl/post_correct_test_files.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --datadivision /path/to/data/division --in_dir /path/to/directory/with/ocr/text/files

To run it for a directory of text files, use:

cwltool /path/to/ochre/cwl/post_correct_dir.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --in_dir /path/to/directory/with/ocr/text/files

(these CWL workflows can be run as stand-alone; associated notebook post_correction_workflows.ipynb)

Explain merging of predictions

Performance

To calculate performance of the OCR (post-correction), the external tool ocrevalUAtion is used. More information about this tool can be found on the website and wiki.

Two workflows are available for calculating performance. The first calculates performance for all files in a directory. To use it type:

cwltool /path/to/ochre/cwl/ocrevaluation-performance-wf-pack.cwl#main --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]

The second calculates performance for all files in the test set:

cwltool /path/to/ochre/cwl/ocrevaluation-performance-test-files-wf-pack.cwl --datadivision /path/to/datadivision.json --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]

Both of these workflows are stand-alone (packed). The corresponding Jupyter notebook is ocr-evaluation-workflow.ipynb.

To use the ocrevalUAtion tool in your workflows, you have to add it to the WorkflowGenerator's steps library:

wf.load(step_file='https://raw.githubusercontent.com/nlppln/ocrevaluation-docker/master/ocrevaluation.cwl')

TODO: explain how to calculate performance with ignore case (or use lowercase-directory.cwl)

OCR error analysis

Different types of OCR errors exist, e.g., structural vs. random mistakes. OCR post-correction methods may be suitable for fixing different types of errors. Therefore, it is useful to gain insight into what types of OCR errors occur. We chose to approach this problem on the word level. In order to be able to compare OCR errors on the word level, words in the OCR text and gold standard text need to be mapped. CWL workflows are available to do this. To create word mappings for the test files of a dataset, use:

cwltool  /path/to/ochre/cwl/word-mapping-test-files.cwl --data_div /path/to/datadivision --gs_dir /path/to/directory/containing/the/gold/standard/texts --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv

To create word mappings for two directories of files, do:

cwltool  /path/to/ochre/cwl/word-mapping-wf.cwl --gs_dir /path/to/directory/containing/the/gold/standard/texts/ --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv

(These workflows can be regenerated using the notebook word-mapping-workflow.ipynb.)

The result is a csv-file containing mapped words. The first column contains a word id, the second column the gold standard text and the third column contains the OCR text of the word:

,gs,ocr
0,Hello,Hcllo
1,World,World
2,!,.

This csv file can be used to analyze the errors. See notebooks/categorize errors based on word mappings.ipynb for an example.

We use heuristics to categorize the following types of errors (ochre/ocrerrors.py):

TODO: add error types

OCR quality measure

Jupyter notebook

better (more balanced) training data is needed.

Generating training data

Scramble gold standard text

Ideas

Visualization of probabilities for each character (do the ocr mistakes have lower probability?) (probability=color)

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Toolbox for OCR post-correction

Related tags

Overview

Ochre

Data sets

Installation

Preprocessing

Training networks for OCR post-correction

OCR post-correction

Performance

OCR error analysis

OCR quality measure

Generating training data

Ideas

License

Owner

National Library of the Netherlands / Research

EAST for ICPR MTWI 2018 Challenge II (Text detection of network images)

This is the open source implementation of the ICLR2022 paper "StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis"

The code for CVPR2022 paper "Likert Scoring with Grade Decoupling for Long-term Action Assessment".

One Metrics Library to Rule Them All!

📷 This repository is focused on having various feature implementation of OpenCV in Python.

Python library to extract tabular data from images and scanned PDFs

Image Recognition Model Generator

A version of nrsc5-gui that merges the interface developed by cmnybo with the architecture developed by zefie in order to start a new baseline that is not heavily dependent upon Python processing.

computer vision, image processing and machine learning on the web browser or node.

Opencv-image-filters - A camera to capture videos in real time by placing filters using Python with the help of the Tkinter and OpenCV libraries

Textboxes_plusplus implementation with Tensorflow (python)

Code for the "Sensing leg movement enhances wearable monitoring of energy expenditure" paper.

LEARN OPENCV IN 3 HOURS USING PYTHON - INCLUDING EXAMPLE PROJECTS

1st place solution for SIIM-FISABIO-RSNA COVID-19 Detection Challenge

Generate text images for training deep learning ocr model

STEFANN: Scene Text Editor using Font Adaptive Neural Network

Code for CVPR'2022 paper ✨ "Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model"

【Auto】原神⭐钓鱼辅助工具 | 自动收竿、校准游标 | ✨您只需要抛出鱼竿，我们会帮你完成一切✨

A tensorflow implementation of EAST text detector

A bot that plays TFT using OCR. Keeps track of bench, board, items, and plays the user defined team comp.