METS/ALTO OCR enhancing tool by the National Library of Luxembourg (BnL)

Overview

Nautilus-OCR

The National Library of Luxembourg (BnL) started its first newspaper digitization initiative, with layout recognition and OCR at article level, back in 2006. Service providers were asked to create images of excellent quality, to run an optical layout recognition process, to identify articles and to run OCR on them. The data was modeled according to the METS/ALTO standard. Since then, however, OCR software has improved considerably.

Developed by BnL in the context of its Open Data initiative, Nautilus-OCR uses these technological improvements, together with the already structured data, to rerun and enhance OCR. Nautilus-OCR can be used in two ways:

  1. Main purpose: Enhance the OCR quality of original (ori) METS/ALTO packages.

    Nautilus-OCR METS/ALTO to METS/ALTO pipeline:
    - Extracts all ori images/text pairs
    - Targets a specific set of block types
    - Runs enhancement prediction on every target block to decide whether OCR should be re-run
    - Integrates new outputs into an updated METS/ALTO package

  2. Alternatively: Use as a regular OCR engine applied to a set of images.

    Nautilus-OCR provides the possibility to visually compare ori (left) to new (right) outputs.

Key features:
  • Custom model training.
  • Included pre-trained OCR, font recognition and enhancement prediction models.
  • METS/ALTO to METS/ALTO using enhancement prediction.
  • Fast, multi-font OCR pipeline.

Nautilus-OCR is mainly built on open-source libraries, combined with some proprietary contributions. Please note that the project is a generalized version of an implementation originally tailored to the specific needs of BnL.


Quick Start

After following the installation instructions, Nautilus-OCR can be run using the included BnL models and example METS/ALTO data.

With nautilusocr/ as the current working directory, first copy the BnL models into models/final/:1

cp models/bnl/* models/final/

Next, run enhance on the examples/ directory, containing a single mets-alto-package/

python3 src/main.py enhance -d examples/ -r 0.02

to generate new ALTO files for every block with a minimum enhancement prediction of 2%. Finally, the newly generated files can be located in output/.

1 As explained in models/final/README.md, the models within models/final/ are automatically applied when executing the enhance, train-epr, ocr and test-ocr actions. Models outside of models/final/ are supposed to be stored for testing and comparison purposes.

Requirements

Nautilus-OCR requires:

  • Linux / macOS
    The software requires dependencies that only work on Linux and macOS. Windows is not supported at the moment.
  • Python 3.8+
    The software has been developed using Python 3.8.5.
  • Dependencies
    Access to the libraries listed in requirements.txt.
  • METS/ALTO
    METS/ALTO packages as data, or alternatively TextBlock images representing single-column snippets of text.

Installation

With Python3 (tested on version 3.8.5) installed, clone this repository and install the required dependencies:

git clone https://github.com/natliblux/nautilusocr
cd nautilusocr
pip3 install -r requirements.txt

The Hunspell dependency might additionally require (on Debian/Ubuntu or macOS, respectively):

apt-get install libhunspell-dev
brew install hunspell

The OpenCV dependency might additionally require (on Debian/Ubuntu):

apt install libgl1-mesa-glx
apt install libcudart10.1

You can test that all dependencies have been successfully installed by running

python3 src/main.py -h

and looking for the following output:

Starting Nautilus-OCR

usage: main.py [-h] {set-ocr,train-ocr,test-ocr,enhance,ocr,set-fcr,train-fcr,test-fcr,test-seg,train-epr,test-epr} ...

Nautilus-OCR Command Line Tool

positional arguments:
  {set-ocr,train-ocr,test-ocr,enhance,ocr,set-fcr,train-fcr,test-fcr,test-seg,train-epr,test-epr}
                        sub-command help

optional arguments:
  -h, --help            show this help message and exit

Workflow

The command-line tool consists of four different modules, with each one exposing a predefined set of actions:

  • ocr - optical character recognition
  • seg - text line segmentation
  • fcr - font class recognition
  • epr - enhancement prediction

To get started, one should take note of the options available in config.ini, most importantly setting the device (CPU/GPU) parameter and deciding on the set of font_classes and supported_languages. Next, a general workflow could look as follows:

  1. Test the seg algorithm using test-seg to see whether any parameters need to be adjusted.
  2. Create a fcr train set using set-fcr based on font ground truth information.
  3. Train a fcr model using train-fcr.
  4. Test the fcr model accuracy using test-fcr.
  5. Create an ocr train set using set-ocr based on ocr ground truth information.
  6. Train an ocr model for every font class using train-ocr.
  7. Test the ocr model for every font class using test-ocr.
  8. Train an epr model based on ground truth and ori data using train-epr.
  9. Test the epr model accuracy using test-epr.
  10. Enhance METS/ALTO packages using enhance.
  11. Alternatively: Run ocr on a set of images using ocr.

This is done by calling main.py followed by the desired action and options:

python3 src/main.py [action] [options]

The following module sections will list all available actions and options.
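As an illustration, one hypothetical end-to-end session covering the steps above could look as follows. All jsonl paths, set names and model names are placeholders, not files shipped with the repository:

python3 src/main.py test-seg -j seg-gt.jsonl                   # 1. check segmentation
python3 src/main.py set-fcr -j font-gt.jsonl -s my-fcr-set     # 2. build fcr train set
python3 src/main.py train-fcr -s my-fcr-set -m my-fcr-model    # 3. train fcr model
python3 src/main.py test-fcr -j font-gt.jsonl -m my-fcr-model  # 4. test fcr model
python3 src/main.py set-ocr -j ocr-gt.jsonl -s my-ocr-set      # 5. build ocr train set
python3 src/main.py train-ocr -s my-ocr-set -f fraktur -m my-ocr-fraktur  # 6. once per font class
# copy the trained models into models/final/ (see models/final/README.md) before the next steps
python3 src/main.py test-ocr -j ocr-test.jsonl                 # 7. test ocr models
python3 src/main.py train-epr -j epr-gt.jsonl -m my-epr-model  # 8. train epr model
python3 src/main.py test-epr -m my-epr-model                   # 9. test epr model
python3 src/main.py enhance -d packages/ -r 0.02               # 10. enhance METS/ALTO packages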

Modules

optical character recognition

set-ocr

Creates an ocr train set consisting of image/text line pairs. Every pair is of type New, Generated or Existing:

  • New: Extracted using an image and ALTO file.
  • Generated: Image part of pair is generated artificially based on given input text.
  • Existing: Pair exists already (has been prepared beforehand) and is included in the train set.
Option           Default           Explanation
-j --jsonl                         Path to jsonl file referencing image and ALTO files 1 2
-c --confidence  9 (max tolerant)  Highest tolerated confidence value for every character in line
-m --model       fcr-model         Name of fcr model to be used in absence of font class indication 3
-e --existing                      Path to directory containing existing pairs 4 5
-g --generated   0 (none)          Number of artificially generated pairs to be added per font class 6 7
-t --text                          Path to text file containing text for artificial pairs 8
-n --nlines      -1 (max)          Maximum number of pairs per font class
-s --set         ocr-train-set     Name of ocr train set

1 Example lines:

{"image": "/path/image1.png", "gt": "/path/alto1.xml"}
{"image": "/path/image2.png", "gt": "/path/alto2.xml", "gt-block-id": "TB1"}
{"image": "/path/image3.png", "gt": "/path/alto3.xml", "gt-block-id": "TB2", "font": "fraktur"}

2 Key gt-block-id can optionally reference a single block in a multi-block ALTO file.
3 Absence of font key means that -m option must be set to automatically determine the font class.
4 Naming convention for existing pairs: [pair-name].png/.tif & [pair-name].gt.txt.
5 Image part of existing pairs is supposed to be unbinarized.
6 Artificially generated lines represent lower quality examples for the model to learn from.
7 Fonts in fonts/artificial/ are being randomly used and can be adjusted per font class.
8 Text is given by a .txt file with individual words delimited by spaces and line breaks.
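
For example, a hypothetical invocation (jsonl path, text file, model name and set name are placeholders) that tolerates a character confidence of at most 8 and adds 500 artificial pairs per font class could be:

python3 src/main.py set-ocr -j lines.jsonl -c 8 -m my-fcr-model -g 500 -t corpus.txt -s my-ocr-set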

train-ocr

Trains an ocr model for a specific font using an ocr train set.

Option      Default    Explanation
-s --set               Name of ocr train set to be used
-f --font              Name of font that ocr model should be trained on
-m --model  ocr-model  Name of ocr model to be created
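
For example, assuming a train set my-ocr-set and the font class fraktur (both placeholders):

python3 src/main.py train-ocr -s my-ocr-set -f fraktur -m my-ocr-fraktur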

test-ocr

Tests models in models/final/ on a test set defined by a jsonl file.
A comparison to the original ocr data can optionally be drawn.

Option           Default  Explanation
-j --jsonl                Path to jsonl file referencing image and ground truth ALTO files 1 2
-i --image       False    Generate output image comparing ocr output with source image
-c --confidence  False    Add ocr confidence (through font greyscale level) to output image

1 Example lines:

{"id": "001", "image": "/path/image1.png", "gt": "/path/alto1.xml"}
{"id": "002", "image": "/path/image2.png", "gt": "/path/alto2.xml", "gt-block-id": "TB1"}
{"id": "003", "image": "/path/image3.png", "gt": "/path/alto3.xml", "gt-block-id": "TB2", "ori": "/path2/alto3.xml"}
{"id": "004", "image": "/path/image4.png", "gt": "/path/alto4.xml", "gt-block-id": "TB3", "ori": "/path2/alto4.xml", "ori-block-id": "TB4"}

2 Keys ori and ori-block-id can optionally reference original ocr output for comparison purposes.
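
For example, to test against a placeholder test set ocr-test.jsonl and generate comparison images with confidence shading:

python3 src/main.py test-ocr -j ocr-test.jsonl -i -c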

enhance

Applies ocr on a set of original METS/ALTO packages, while aiming to enhance ocr accuracy.1
An optional enhancement prediction model can prevent running ocr for some target blocks.
Models in models/final/ are automatically used for this action.2

Option          Default  Explanation
-d --directory           Path to directory containing all original METS/ALTO packages 3 4
-r --required   0.0      Value for minimum required enhancement prediction 5

1 Target text block types can be adjusted in config.ini.
2 The presence of an epr model is optional.
3 METS files need to end in -mets.xml.
4 Every package name should be unique and is defined as the directory name of the METS file.
5 Enhancement predictions are in the range [-1,1]; set to -1 to disable epr and automatically reprocess all target blocks.
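
For example, to reprocess all target blocks of the packages in a placeholder directory packages/, with epr disabled as per footnote 5:

python3 src/main.py enhance -d packages/ -r -1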

ocr

Applies ocr on a directory of images while using the models in models/final/.

Option           Default  Explanation
-d --directory            Path to directory containing target ocr source images 1
-a --alto        False    Output ocr in ALTO format
-i --image       False    Generate output image comparing ocr with source image
-c --confidence  False    Add ocr confidence (through font greyscale level) to output image

1 Subdirectories are possible; images should be in .png or .tif format.
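
For example, to ocr all images in a placeholder directory images/ and write both ALTO files and comparison images:

python3 src/main.py ocr -d images/ -a -i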

text line segmentation

test-seg

Tests the CombiSeg segmentation algorithm on a test set defined by a jsonl file. The correct functioning of the segmentation algorithm is essential for most other modules and actions.
The default parameters should generally work well, but they can be adjusted.1

Option      Default  Explanation
-j --jsonl           Path to jsonl file referencing image and ALTO files 2

1 Algorithm parameters can be adjusted in config.ini in case of unsatisfactory performance.
2 Example lines:

{"image": "/path/image1.png", "gt": "/path/alto1.xml"}
{"image": "/path/image2.png", "gt": "/path/alto2.xml", "gt-block-id": "TB1"}

font class recognition

set-fcr

Creates a fcr train set consisting of individual character images.

Option       Default        Explanation
-j --jsonl                  Path to jsonl file referencing image files and the respective font classes 1
-n --nchars  max            Maximum number of characters extracted from every image 2
-s --set     fcr-train-set  Name of fcr train set

1 Example line:

{"image": "/path/image.png", "font": "fraktur"}

2 Extracting fewer characters from a larger number of images generally leads to a more diverse train set.
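
For example, extracting at most 20 characters per image from a placeholder fonts.jsonl:

python3 src/main.py set-fcr -j fonts.jsonl -n 20 -s my-fcr-set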

train-fcr

Trains a fcr model using a fcr train set.

Option      Default    Explanation
-s --set               Name of fcr train set
-m --model  fcr-model  Name of fcr model to be created
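
For example, with placeholder set and model names:

python3 src/main.py train-fcr -s my-fcr-set -m my-fcr-model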

test-fcr

Tests a fcr model on a test set defined by a jsonl file.

Option      Default    Explanation
-j --jsonl             Path to jsonl file referencing image files and the respective font classes 1
-m --model  fcr-model  Name of fcr model to be tested

1 Example line:

{"image": "/path/image.png", "font": "fraktur"}

enhancement prediction

This module requires language dictionaries. For every language xx in supported_languages in config.ini, please add either a word list xx.txt or the Hunspell files xx.dic and xx.aff to dicts/.
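
For instance, assuming supported_languages covers de, fr and lb (as for the included BnL models), dicts/ could contain a mix of Hunspell files and plain word lists; the exact mix shown below is only an assumption:

ls dicts/
de.aff  de.dic  fr.txt  lb.txt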

train-epr

Trains an epr model (for use in enhance) that predicts the enhancement in ocr accuracy (from ori to new) and can hence be used to skip ocr for target blocks with low predicted gain.
Please take note of the parameters in config.ini before starting training.
This action uses the models in models/final/.

Option      Default    Explanation
-j --jsonl             Path to jsonl file referencing image, ground truth ALTO and original ALTO files 1
-m --model  epr-model  Name of epr model to be created

1 Example lines:

{"image": "/path/image1.png", "gt": "/path/alto1.xml", "ori": "/path/alto1.xml", "year": 1859}
{"image": "/path/image2.png", "gt": "/path/alto2.xml", "gt-block-id": "TB1", "ori": "/path/alto2.xml", "year": 1859}
{"image": "/path/image3.png", "gt": "/path/alto3.xml", "gt-block-id": "TB2", "ori": "/path/alto3.xml", "ori-block-id": "TB2", "year": 1859}

test-epr

Tests an epr model and returns the mean absolute error after applying leave-one-out cross-validation (kNN algorithm).

Option      Default    Explanation
-m --model  epr-model  Name of epr model to be tested
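
For example, with a placeholder model name:

python3 src/main.py test-epr -m my-epr-model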

Models

Nautilus-OCR includes four pre-trained models:

  • bnl-ocr-antiqua.mlmodel

OCR model built with kraken and trained on the antiqua data (70k pairs) of an extended version of bnl-ground-truth-newspapers-before-1878 that is not limited to the cut-off date of 1878.

  • bnl-ocr-fraktur.mlmodel

OCR model built with kraken and trained on the fraktur data (43k pairs) of an extended version of bnl-ground-truth-newspapers-before-1878 that is not limited to the cut-off date of 1878.

  • bnl-fcr.h5

Binary font recognition model built with TensorFlow and trained to perform classification using font classes [antiqua, fraktur]. Please note that the fcr module automatically extends the set of classes to [antiqua, fraktur, unknown], to cover the case where the neural network input preprocessing fails. The model has been trained on 50k individual character images and showed 100% accuracy on a 200 image test set.

  • bnl-epr-de-fr-lb.jsonl

Enhancement prediction model trained on more than 4.5k text blocks for the language set [de, fr, lb]. The training data was published between 1840 and 1960. Enhancement is predicted for the application of bnl-ocr-antiqua.mlmodel and bnl-ocr-fraktur.mlmodel, and is therefore based on the font class set [antiqua, fraktur]. The model makes use of the dictionaries for all three languages within dicts/. Using leave-one-out cross-validation (kNN algorithm), a mean absolute error of 0.024 was achieved.

Ground Truth

bnl-ground-truth-newspapers-before-1878

OCR ground truth dataset comprising more than 33k text line image/text pairs, split into antiqua (19k) and fraktur (14k) font classes. The set is based on Luxembourg historical newspapers in the public domain (published before 1878), written mostly in German, French and Luxembourgish. Transcription was done using a double-keying technique with a minimum accuracy of 99.95%. Font class was automatically determined using bnl-fcr.h5.

Libraries

Nautilus-OCR is mostly built on open-source libraries; the full list of dependencies can be found in requirements.txt.

License

License: GPL v3

See COPYING for the full text.

Credits

Thanks and credits go to the Lexicolux project, whose work is the basis for the generation of dicts/lb.txt.

Contact

If you want to get in touch, please contact us here.
