OCR Post Correction for Endangered Language Texts

Overview

📌 Coming soon: an update to the software including features from our paper on semi-supervised OCR post-correction, to be published in the Transactions of the Association for Computational Linguistics (TACL)!

Check out the paper here.

OCR Post Correction for Endangered Language Texts

This repository contains code for models and experiments from the paper "OCR Post Correction for Endangered Language Texts".

Textual data in endangered languages is often found in formats that are not machine-readable, including scanned images of paper books. Extracting the text is challenging because there is typically no annotated data to train an OCR system for each endangered language. Instead, we focus on post-correcting the OCR output from a general-purpose OCR system.

📌 In the paper, we present a dataset containing annotations for documents in three critically endangered languages: Ainu, Griko, Yakkha.

📌 Our model reduces the recognition error rate by 34% on average, over a state-of-the-art OCR system.

Learn more about the paper here!

OCR Post-Correction

The goal of OCR post-correction is to automatically correct errors in the text output from an existing OCR system.

The existing OCR system is used to obtain a first pass transcription of the input image (example below in the endangered language Griko):

First pass OCR transcription

The incorrectly recognized characters in the first pass are then corrected by the post-correction model.

Corrected transcription

Model

As seen in the example above, OCR post-correction is a text-based sequence-to-sequence task.

📌 We use a character-level encoder-decoder architecture with attention and add several adaptations for the low-resource setting. The paper has all the details!

📌 The model is trained in a supervised manner. The training data consists of first pass OCR outputs as the source with corresponding manually corrected transcriptions as the target.

📌 Some books that contain texts in endangered languages also contain translations of the text in another (usually high-resource) language. We incorporate an additional encoder in the model, with a multisource framework, to use the information from these translations if they are available.

We provide instructions for both single-source and multisource models:

  • The single-source model can be used for almost any document and is significantly easier to set up.

  • The multisource model can only be used if translations are available.

Dataset

This repository contains a sample from our dataset in sample_dataset, which you can use to train the post-correction model. Get the full dataset here!

However, this repository can be used to train OCR post-correction models for documents in any language!

🚀 If you want to use our model with a new set of documents, construct a dataset by following the steps here.

🚀 We'd love to hear about the new datasets and models you build: send us an email at [email protected]!

Running Experiments

Once you have a suitable dataset (e.g., sample_dataset or your own dataset), you can train a model and run experiments on OCR post-correction.

If you have your own dataset, you can use the utils/prepare_data.py script to create train, development, and test splits (see the last step here).

The steps are described below, illustrated with sample_dataset/postcorrection. If using another dataset, simply change the experiment settings to point to your dataset and run the same scripts.

Requirements

Python 3+ is required. Pip can be used to install the packages:

pip install -r postcorr_requirements.txt

Training

The process of training the post-correction model has two main steps:

  • Pretraining with first pass OCR outputs.
  • Training with manually corrected transcriptions in a supervised manner.

For a single-source model, modify the experimental settings in train_single-source.sh to point to the appropriate dataset and desired output folder. It is currently set up to use sample_dataset.

Then run

bash train_single-source.sh

For multisource, use train_multi-source.sh.

Log files and saved models are written to the user-specified experiment folder for both the pretraining and training steps. For a list of all available hyperparameters and options, look at postcorrection/constants.py and postcorrection/opts.py.

Testing

For testing with a single-source model, modify the experimental settings in test_single-source.sh. It is currently set up to use sample_dataset.

Then run

bash test_single-source.sh

For multisource, use test_multi-source.sh.

Citation

Please cite our paper if this repository was useful.

@inproceedings{rijhwani-etal-2020-ocr,
    title = "{OCR} {P}ost {C}orrection for {E}ndangered {L}anguage {T}exts",
    author = "Rijhwani, Shruti  and
      Anastasopoulos, Antonios  and
      Neubig, Graham",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.478",
    doi = "10.18653/v1/2020.emnlp-main.478",
    pages = "5931--5942",
}

License

Owner
Shruti Rijhwani
Ph.D. student at CMU, working on natural language processing.
Shruti Rijhwani
A curated list of programmatic weak supervision papers and resources

A curated list of programmatic weak supervision papers and resources

Jieyu Zhang 118 Jan 02, 2023
LSTMs (Long Short Term Memory) RNN for prediction of price trends

Price Prediction with Recurrent Neural Networks LSTMs BTC-USD price prediction with deep learning algorithm. Artificial Neural Networks specifically L

5 Nov 12, 2021
[CVPR 2020] 3D Photography using Context-aware Layered Depth Inpainting

[CVPR 2020] 3D Photography using Context-aware Layered Depth Inpainting [Paper] [Project Website] [Google Colab] We propose a method for converting a

Virginia Tech Vision and Learning Lab 6.2k Jan 01, 2023
PaddleBoBo是基于PaddlePaddle和PaddleSpeech、PaddleGAN等开发套件的虚拟主播快速生成项目

PaddleBoBo - 元宇宙时代,你也可以动手做一个虚拟主播。 PaddleBoBo是基于飞桨PaddlePaddle深度学习框架和PaddleSpeech、PaddleGAN等开发套件的虚拟主播快速生成项目。PaddleBoBo致力于简单高效、可复用性强,只需要一张带人像的图片和一段文字,就能

502 Jan 08, 2023
Python framework for Stochastic Differential Equations modeling

SDElearn: a Python package for SDE modeling This package implements functionalities for working with Stochastic Differential Equations models (SDEs fo

4 May 10, 2022
Repo for "Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions" https://arxiv.org/abs/2201.12296

Benchmarking Robustness of 3D Point Cloud Recognition against Common Corruptions This repo contains the dataset and code for the paper Benchmarking Ro

Jiachen Sun 168 Dec 29, 2022
A Deep Reinforcement Learning Framework for Stock Market Trading

DQN-Trading This is a framework based on deep reinforcement learning for stock market trading. This project is the implementation code for the two pap

61 Jan 01, 2023
Unofficial Implementation of MLP-Mixer, Image Classification Model

MLP-Mixer Unoffical Implementation of MLP-Mixer, easy to use with terminal. Train and test easly. https://arxiv.org/abs/2105.01601 MLP-Mixer is an arc

Oğuzhan Ercan 6 Dec 05, 2022
Code for "PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clouds", CVPR 2021

PV-RAFT This repository contains the PyTorch implementation for paper "PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clou

Yi Wei 43 Dec 05, 2022
🍷 Gracefully claim weekly free games and monthly content from Epic Store.

EPIC 免费人 🚀 优雅地领取 Epic 免费游戏 Introduction 👋 Epic AwesomeGamer 帮助玩家优雅地领取 Epic 免费游戏。 使用 「Epic免费人」可以实现如下需求: get:搬空游戏商店,获取所有常驻免费游戏与免费附加内容; claim:领取周免游戏及其免

571 Dec 28, 2022
Simple keras FCN Encoder/Decoder model for MS-COCO (food subset) segmentation

FCN_MSCOCO_Food_Segmentation Simple keras FCN Encoder/Decoder model for MS-COCO (food subset) segmentation Input data: [http://mscoco.org/dataset/#ove

Alexander Kalinovsky 11 Jan 08, 2019
DeepOBS: A Deep Learning Optimizer Benchmark Suite

DeepOBS - A Deep Learning Optimizer Benchmark Suite DeepOBS is a benchmarking suite that drastically simplifies, automates and improves the evaluation

Aaron Bahde 7 May 12, 2020
An Unbiased Learning To Rank Algorithms (ULTRA) toolbox

Unbiased Learning to Rank Algorithms (ULTRA) This is an Unbiased Learning To Rank Algorithms (ULTRA) toolbox, which provides a codebase for experiment

back 3 Nov 18, 2022
MapReader: A computer vision pipeline for the semantic exploration of maps at scale

MapReader A computer vision pipeline for the semantic exploration of maps at scale MapReader is an end-to-end computer vision (CV) pipeline designed b

Living with Machines 25 Dec 26, 2022
Code for NeurIPS2021 submission "A Surrogate Objective Framework for Prediction+Programming with Soft Constraints"

This repository is the code for NeurIPS 2021 submission "A Surrogate Objective Framework for Prediction+Programming with Soft Constraints". Edit 2021/

10 Dec 20, 2022
An end-to-end project on customer segmentation

End-to-end Customer Segmentation Project Note: This project is in progress. Tools Used in This Project Prefect: Orchestrate workflows hydra: Manage co

Ocelot Consulting 8 Oct 06, 2022
Active window border replacement for window managers.

xborder Active window border replacement for window managers. Usage git clone https://github.com/deter0/xborder cd xborder chmod +x xborders ./xborder

deter 250 Dec 30, 2022
No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency

This repository contains the implementation for the paper: No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consiste

Alireza Golestaneh 75 Dec 30, 2022
The code for our paper CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.

CrossFormer This repository is the code for our paper CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention. Introduction Existin

cheerss 238 Jan 06, 2023
Simple torch.nn.module implementation of Alias-Free-GAN style filter and resample

Alias-Free-Torch Simple torch module implementation of Alias-Free GAN. This repository including Alias-Free GAN style lowpass sinc filter @filter.py A

이준혁(Junhyeok Lee) 64 Dec 22, 2022