Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Last update: Dec 31, 2022

The official code of ABINet (CVPR 2021, Oral).

ABINet uses a vision model and an explicit language model to recognize text in the wild; the two are trained end-to-end. The language model (BCN) learns a bidirectional language representation by simulating a cloze test, and additionally uses an iterative correction strategy.
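
To make the two ideas above concrete, here is a minimal toy sketch in plain Python. It is not the repository's implementation: `cloze_attention_mask`, `iterative_correction`, the `correct` callback, the default of three iterations, and the early stop are all assumptions introduced for illustration. The mask lets each character position look at every other position but not at itself (the cloze setting), and the loop feeds the current prediction back into a corrector.

```python
from typing import Callable, List

NEG_INF = float("-inf")


def cloze_attention_mask(length: int) -> List[List[float]]:
    """Additive attention mask: position i may attend to all positions except i."""
    return [[NEG_INF if i == j else 0.0 for j in range(length)]
            for i in range(length)]


def iterative_correction(initial: str,
                         correct: Callable[[str], str],
                         max_iters: int = 3) -> str:
    """Feed the current prediction back into `correct` up to `max_iters` times.

    `correct` is a hypothetical stand-in for the language model plus fusion
    step; the early stop on a stable prediction is a choice of this sketch.
    """
    text = initial
    for _ in range(max_iters):
        refined = correct(text)
        if refined == text:
            break
        text = refined
    return text


if __name__ == "__main__":
    # Toy corrector that repairs one "0" -> "o" confusion per call.
    def toy_corrector(s: str) -> str:
        return s.replace("0", "o", 1)

    print(cloze_attention_mask(3))
    print(iterative_correction("h0ll0", toy_corrector))
```

In a real model the mask would be added to the attention logits before the softmax, so that the masked character has to be inferred from its left and right context.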

Runtime Environment

We provide a pre-built Docker image, built from docker/Dockerfile.

Running in Docker

$ git clone git@github.com:FangShancheng/ABINet.git
$ docker run --gpus all --rm -ti --ipc=host -v "$(pwd)"/ABINet:/app fangshancheng/fastai:torch1.1 /bin/bash

(Untested) Alternatively, install the dependencies directly:
```
pip install -r requirements.txt
```

Datasets

Training datasets
1. MJSynth (MJ):
  - Use tools/create_lmdb_dataset.py to convert images into LMDB dataset
  - LMDB dataset BaiduNetdisk(passwd:n23k)
2. SynthText (ST):
  - Use tools/crop_by_word_bb.py to crop images from original SynthText dataset, and convert images into LMDB dataset by tools/create_lmdb_dataset.py
  - LMDB dataset BaiduNetdisk(passwd:n23k)
3. WikiText103, which is only used for pre-training language models:
  - Use notebooks/prepare_wikitext103.ipynb to convert text into CSV format.
  - CSV dataset BaiduNetdisk(passwd:dk01)
Evaluation datasets (LMDB datasets can be downloaded from BaiduNetdisk (passwd:1dbv) or GoogleDrive):
1. ICDAR 2013 (IC13)
2. ICDAR 2015 (IC15)
3. IIIT5K Words (IIIT)
4. Street View Text (SVT)
5. Street View Text-Perspective (SVTP)
6. CUTE80 (CUTE)

The structure of the data directory is:

data
├── charset_36.txt
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k_3000
│   ├── SVT
│   └── SVTP
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   └── ST
├── WikiText-103.csv
└── WikiText-103_eval_d1.csv
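
Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch using only the standard library; the helper name `missing_entries` and the choice to check only the entries shown in the tree are assumptions of this example, not part of the repository.

```python
from pathlib import Path
from typing import List

# Relative paths copied from the directory tree above.
EXPECTED = [
    "charset_36.txt",
    "evaluation/CUTE80",
    "evaluation/IC13_857",
    "evaluation/IC15_1811",
    "evaluation/IIIT5k_3000",
    "evaluation/SVT",
    "evaluation/SVTP",
    "training/MJ/MJ_test",
    "training/MJ/MJ_train",
    "training/MJ/MJ_valid",
    "training/ST",
    "WikiText-103.csv",
    "WikiText-103_eval_d1.csv",
]


def missing_entries(data_root: str) -> List[str]:
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Data directory looks complete.")
```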

Pretrained Models

Get the pretrained models from BaiduNetdisk (passwd:kwck) or GoogleDrive. Performance (accuracy, %) of the pretrained models is summarized as follows:

Model	IC13	SVT	IIIT	IC15	SVTP	CUTE	AVG
ABINet-SV	97.1	92.7	95.2	84.0	86.7	88.5	91.4
ABINet-LV	97.0	93.4	96.4	85.9	89.5	89.2	92.7
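
The AVG column appears to be a sample-weighted mean rather than a plain mean of the six columns (the plain mean for ABINet-SV would be 90.7). The sketch below reproduces the arithmetic under stated assumptions: the IC13, IC15 and IIIT sizes are read from the directory names above (857, 1811, 3000), while the SVT, SVTP and CUTE sizes (647, 645, 288) are the commonly cited test-set sizes and are not stated in this document.

```python
from typing import Dict

# Accuracies (%) copied from the table above.
ACCURACY: Dict[str, Dict[str, float]] = {
    "ABINet-SV": {"IC13": 97.1, "SVT": 92.7, "IIIT": 95.2,
                  "IC15": 84.0, "SVTP": 86.7, "CUTE": 88.5},
    "ABINet-LV": {"IC13": 97.0, "SVT": 93.4, "IIIT": 96.4,
                  "IC15": 85.9, "SVTP": 89.5, "CUTE": 89.2},
}

# Number of test images per benchmark.
# SVT, SVTP and CUTE sizes are assumptions (not given in this document).
SAMPLES: Dict[str, int] = {"IC13": 857, "SVT": 647, "IIIT": 3000,
                           "IC15": 1811, "SVTP": 645, "CUTE": 288}


def weighted_average(acc: Dict[str, float], counts: Dict[str, int]) -> float:
    """Mean accuracy weighted by the number of samples in each benchmark."""
    total = sum(counts[name] for name in acc)
    return sum(acc[name] * counts[name] for name in acc) / total


if __name__ == "__main__":
    for model, acc in ACCURACY.items():
        print(f"{model}: {weighted_average(acc, SAMPLES):.1f}")
    # prints:
    # ABINet-SV: 91.4
    # ABINet-LV: 92.7
```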

Training

Pre-train vision model

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_vision_model.yaml

Pre-train language model

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/pretrain_language_model.yaml

Train ABINet

CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py --config=configs/train_abinet.yaml

Note:

You can set the checkpoint paths of the vision and language models separately to load specific pretrained models, or set them to None to train from scratch.
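
The three training stages can be scripted in the order listed above. This is a minimal sketch that only assembles the commands and environment; `stage_command` and `STAGE_CONFIGS` are names made up for this example, and the sketch deliberately prints the commands instead of launching them.

```python
import os
from typing import Dict, List, Tuple

# Config files in the order given in the Training section.
STAGE_CONFIGS = [
    "configs/pretrain_vision_model.yaml",
    "configs/pretrain_language_model.yaml",
    "configs/train_abinet.yaml",
]


def stage_command(config: str, gpus: str = "0,1,2,3") -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for one training stage on the given GPU ids."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpus)
    argv = ["python", "main.py", f"--config={config}"]
    return argv, env


if __name__ == "__main__":
    # Dry run: show what would be executed for each stage.
    for config in STAGE_CONFIGS:
        argv, env = stage_command(config)
        print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} " + " ".join(argv))
```

Passing each `(argv, env)` pair to `subprocess.run(argv, env=env, check=True)` from the repository root would run the stages one after another and stop at the first failure.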

Evaluation

CUDA_VISIBLE_DEVICES=0 python main.py --config=configs/train_abinet.yaml --phase test --image_only

Additional flags:

--checkpoint /path/to/checkpoint: set the path of the model to evaluate
--test_root /path/to/dataset: set the path of the evaluation dataset
--model_eval [alignment|vision]: select which sub-model to evaluate
--image_only: disable dumping visualization of attention masks
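
As an illustration of how these flags combine, the sketch below builds one evaluation command per benchmark. It uses only the flags listed above; the helper `eval_command`, the checkpoint path, and the assumption that `--test_root` can point at a single benchmark folder under data/evaluation are hypothetical and should be checked against the repository.

```python
from typing import List

# Benchmark folder names taken from the data directory tree.
EVAL_SETS = ["CUTE80", "IC13_857", "IC15_1811", "IIIT5k_3000", "SVT", "SVTP"]


def eval_command(checkpoint: str,
                 test_root: str,
                 model_eval: str = "alignment",
                 image_only: bool = True) -> List[str]:
    """Assemble the argv for one evaluation run using the documented flags."""
    if model_eval not in ("alignment", "vision"):
        raise ValueError("model_eval must be 'alignment' or 'vision'")
    argv = [
        "python", "main.py",
        "--config=configs/train_abinet.yaml",
        "--phase", "test",
        "--checkpoint", checkpoint,
        "--test_root", test_root,
        "--model_eval", model_eval,
    ]
    if image_only:
        argv.append("--image_only")
    return argv


if __name__ == "__main__":
    # Hypothetical checkpoint path, used only for illustration.
    checkpoint = "/path/to/checkpoint"
    for name in EVAL_SETS:
        print(" ".join(eval_command(checkpoint, f"data/evaluation/{name}")))
```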

Visualization

Successful and failure cases on low-quality images:

Citation

If you find our method useful for your research, please cite:

@inproceedings{fang2021read,
  title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
  author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}

License

This project is only free for academic research purposes, licensed under the 2-clause BSD License - see the LICENSE file for details.

Feel free to contact [email protected] if you have any questions.
