Toolbox for OCR post-correction

Related tags

Computer Visionochre
Overview

Ochre

Ochre is a toolbox for OCR post-correction. Please note that this software is experimental and very much a work in progress!

  • Overview of OCR post-correction data sets
  • Preprocess data sets
  • Train character-based language models/LSTMs for OCR post-correction
  • Do the post-correction
  • Assess the performance of OCR post-correction
  • Analyze OCR errors

Ochre contains ready-to-use data processing workflows (based on CWL). The software also allows you to create your own (OCR post-correction related) workflows. Examples of how to create these can be found in the notebooks directory (to be able to use those, make sure you have Jupyter Notebooks installed). This directory also contains notebooks that show how results can be analyzed and visualized.

Data sets

Installation

git clone [email protected]:KBNLresearch/ochre.git
cd ochre
pip install -r requirements.txt
python setup.py develop
  • Using the CWL workflows requires (the development version of) nlppln and its requirements (see installation guidelines).
  • To run a CWL workflow type: cwltool|cwl-runner path/to/workflow.cwl <inputs> (if you run the command without inputs, the tool will tell you about what inputs are required and how to specify them). For more information on running CWL workflows, have a look at the nlppln documentation. This is especially relevant for Windows users.
  • Please note that some of the CWL workflows contain absolute paths, if you want to use them on your own machine, regenerate them using the associated Jupyter Notebooks.

Preprocessing

The software needs the data in the following formats:

  • ocr: text files containing the ocr-ed text, one file per unit (article, page, book, etc.)
  • gs: text files containing the gold standard (correct) text, one file per unit (article, page, book, etc.)
  • aligned: json files containing aligned character sequences:
{
    "ocr": ["E", "x", "a", "m", "p", "", "c"],
    "gs": ["E", "x", "a", "m", "p", "l", "e"]
}

Corresponding files in these directories should have the same name (or at least the same prefix), for example:

├── gs
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
├── ocr
│   ├── 1.txt
│   ├── 2.txt
│   └── 3.txt
└── aligned
    ├── 1.json
    ├── 2.json
    └── 3.json

To create data in these formats, CWL workflows are available. First run a preprocess workflow to create the gs and ocr directories containing the expected files. Next run an align workflow to create the align directory.

To create the alignments, run one of:

  • align-dir-pack.cwl to align all files in the gs and ocr directories
  • align-test-files-pack.cwl to align the test files in a data division

These workflows can be run as stand-alone; associated notebook align-workflow.ipynb.

Training networks for OCR post-correction

First, you need to divide the data into a train, validation and test set:

python -m ochre.create_data_division /path/to/aligned

The result of this command is a json file containing lists of file names, for example:

{
    "train": ["1.json", "2.json", "3.json", "4.json", "5.json", ...],
    "test": ["6.json", ...],
    "val": ["7.json", ...]
}
  • Script: lstm_synched.py

OCR post-correction

If you trained a model, you can use it to correct OCR text using the lstm_synced_correct_ocr command:

python -m ochre.lstm_synced_correct_ocr /path/to/keras/model/file /path/to/text/file/containing/the/characters/in/the/training/data /path/to/ocr/text/file

or

cwltool /path/to/ochre/cwl/lstm_synced_correct_ocr.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --txt /path/to/ocr/text/file

The command creates a text file containing the corrected text.

To generate corrected text for the test files of a dataset, do:

cwltool /path/to/ochre/cwl/post_correct_test_files.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --datadivision /path/to/data/division --in_dir /path/to/directory/with/ocr/text/files

To run it for a directory of text files, use:

cwltool /path/to/ochre/cwl/post_correct_dir.cwl --charset /path/to/text/file/containing/the/characters/in/the/training/data --model /path/to/keras/model/file --in_dir /path/to/directory/with/ocr/text/files

(these CWL workflows can be run as stand-alone; associated notebook post_correction_workflows.ipynb)

  • Explain merging of predictions

Performance

To calculate performance of the OCR (post-correction), the external tool ocrevalUAtion is used. More information about this tool can be found on the website and wiki.

Two workflows are available for calculating performance. The first calculates performance for all files in a directory. To use it type:

cwltool /path/to/ochre/cwl/ocrevaluation-performance-wf-pack.cwl#main --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]

The second calculates performance for all files in the test set:

cwltool /path/to/ochre/cwl/ocrevaluation-performance-test-files-wf-pack.cwl --datadivision /path/to/datadivision.json --gt /path/to/dir/containing/the/gold/standard/ --ocr /path/to/dir/containing/ocr/texts/ [--out_name name-of-output-file.csv]

Both of these workflows are stand-alone (packed). The corresponding Jupyter notebook is ocr-evaluation-workflow.ipynb.

To use the ocrevalUAtion tool in your workflows, you have to add it to the WorkflowGenerator's steps library:

wf.load(step_file='https://raw.githubusercontent.com/nlppln/ocrevaluation-docker/master/ocrevaluation.cwl')
  • TODO: explain how to calculate performance with ignore case (or use lowercase-directory.cwl)

OCR error analysis

Different types of OCR errors exist, e.g., structural vs. random mistakes. OCR post-correction methods may be suitable for fixing different types of errors. Therefore, it is useful to gain insight into what types of OCR errors occur. We chose to approach this problem on the word level. In order to be able to compare OCR errors on the word level, words in the OCR text and gold standard text need to be mapped. CWL workflows are available to do this. To create word mappings for the test files of a dataset, use:

cwltool  /path/to/ochre/cwl/word-mapping-test-files.cwl --data_div /path/to/datadivision --gs_dir /path/to/directory/containing/the/gold/standard/texts --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv

To create word mappings for two directories of files, do:

cwltool  /path/to/ochre/cwl/word-mapping-wf.cwl --gs_dir /path/to/directory/containing/the/gold/standard/texts/ --ocr_dir /path/to/directory/containing/the/ocr/texts/ --wm_name name-of-the-output-file.csv

(These workflows can be regenerated using the notebook word-mapping-workflow.ipynb.)

The result is a csv-file containing mapped words. The first column contains a word id, the second column the gold standard text and the third column contains the OCR text of the word:

,gs,ocr
0,Hello,Hcllo
1,World,World
2,!,.

This csv file can be used to analyze the errors. See notebooks/categorize errors based on word mappings.ipynb for an example.

We use heuristics to categorize the following types of errors (ochre/ocrerrors.py):

  • TODO: add error types

OCR quality measure

Jupyter notebook

  • better (more balanced) training data is needed.

Generating training data

  • Scramble gold standard text

Ideas

  • Visualization of probabilities for each character (do the ocr mistakes have lower probability?) (probability=color)

License

Copyright (c) 2017-2018, Koninklijke Bibliotheek, Netherlands eScience Center

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Owner
National Library of the Netherlands / Research
National Library of the Netherlands / Research
National Library of the Netherlands / Research
Awesome anomaly detection in medical images

A curated list of awesome anomaly detection works in medical imaging, inspired by the other awesome-* initiatives.

Kang Zhou 57 Dec 19, 2022
An OCR evaluation tool

dinglehopper dinglehopper is an OCR evaluation tool and reads ALTO, PAGE and text files. It compares a ground truth (GT) document page with a OCR resu

QURATOR-SPK 40 Dec 20, 2022
Scan the MRZ code of a passport and extract the firstname, lastname, passport number, nationality, date of birth, expiration date and personal numer.

PassportScanner Works with 2 and 3 line identity documents. What is this With PassportScanner you can use your camera to scan the MRZ code of a passpo

Edwin Vermeer 441 Dec 24, 2022
A python program to block out your face

Readme This is a small program I threw together in about 6 hours to block out your face. It probably doesn't work very well, so be warned. By default,

1 Oct 17, 2021
Sort By Face

Sort-By-Face This is an application with which you can either sort all the pictures by faces from a corpus of photos or retrieve all your photos from

0 Nov 29, 2021
Open Source Differentiable Computer Vision Library for PyTorch

Kornia is a differentiable computer vision library for PyTorch. It consists of a set of routines and differentiable modules to solve generic computer

kornia 7.6k Jan 04, 2023
Course material for the Multi-agents and computer graphics course

TC2008B Course material for the Multi-agents and computer graphics course. Setup instructions Strongly recommend using a custom conda environment. Ins

16 Dec 13, 2022
The virtual calculator will be above the live streaming from your camera

The virtual calculator is above the live streaming from my camera usb , the program first detect my hand and in each frame calculate the distance between two finger ,if the distance is lower than the

gasbaoui mohammed al amine 5 Jul 01, 2022
An official PyTorch implementation of the paper "Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences", ICCV 2021.

PyTorch implementation of Learning by Aligning (ICCV 2021) This is an official PyTorch implementation of the paper "Learning by Aligning: Visible-Infr

CV Lab @ Yonsei University 30 Nov 05, 2022
Memory tests solver with using OpenCV

Human Benchmark project This project is OpenCV based programs which are puzzle solvers for 7 different games for https://humanbenchmark.com/. made as

Bahadır Araz 24 Dec 27, 2022
Dirty, ugly, and hopefully useful OCR of Facebook Papers docs released by Gizmodo

Quick and Dirty OCR of Facebook Papers Gizmodo has been working through the Facebook Papers and releasing the docs that they process and review. As lu

Bill Fitzgerald 2 Oct 28, 2021
This is the implementation of the paper "Gated Recurrent Convolution Neural Network for OCR"

Gated Recurrent Convolution Neural Network for OCR This project is an implementation of the GRCNN for OCR. For details, please refer to the paper: htt

90 Dec 22, 2022
A python scripts that uses 3 different feature extraction methods such as SIFT, SURF and ORB to find a book in a video clip and project trailer of a movie based on that book, on to it.

A python scripts that uses 3 different feature extraction methods such as SIFT, SURF and ORB to find a book in a video clip and project trailer of a movie based on that book, on to it.

tooraj taraz 3 Feb 10, 2022
Web interface for browsing arXiv papers

Currently, arxivbox considers only major computer vision and machine learning conferences

Ankan Kumar Bhunia 12 Sep 11, 2022
Image augmentation library in Python for machine learning.

Augmentor is an image augmentation library in Python for machine learning. It aims to be a standalone library that is platform and framework independe

Marcus D. Bloice 4.8k Jan 04, 2023
Use Convolutional Recurrent Neural Network to recognize the Handwritten line text image without pre segmentation into words or characters. Use CTC loss Function to train.

Handwritten Line Text Recognition using Deep Learning with Tensorflow Description Use Convolutional Recurrent Neural Network to recognize the Handwrit

sushant097 224 Jan 07, 2023
3点クリックで円を指定し、極座標変換を行うサンプルプログラム

click-warpPolar 3点クリックで円を指定し、極座標変換を行うサンプルプログラムです。 Requirements OpenCV 3.4.2 or Later Usage 実行方法は以下です。 起動後、マウスで3点をクリックし円を指定してください。 python click-warpPol

KazuhitoTakahashi 17 Dec 30, 2022
Regions sanitàries (RS), Sectors Sanitàris (SS) i Àrees Bàsiques de Salut (ABS) de Catalunya

Regions sanitàries (RS), Sectors Sanitaris (SS), Àrees de Gestió Assistencial (AGA) i Àrees Bàsiques de Salut (ABS) de Catalunya Fitxers GeoJSON de le

Glòria Macià Muñoz 2 Jan 23, 2022
Generating .npy dataset and labels out of given image, containing numbers from 0 to 9, using opencv

basic-dataset-generator-from-image-of-numbers generating .npy dataset and labels out of given image, containing numbers from 0 to 9, using opencv inpu

1 Jan 01, 2022
A Joint Video and Image Encoder for End-to-End Retrieval

Frozen️ in Time ❄️ ️️️️ ⏳ A Joint Video and Image Encoder for End-to-End Retrieval (arXiv) Repository to contain the code, models, data for end-to-end

225 Dec 25, 2022