A PyTorch implementation of "From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network" (ICCV2021)

Overview

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

The official code of VisionLAN (ICCV2021). VisionLAN moves scene text recognition from two steps to one (from Two to One): it adaptively considers both visual and linguistic information in a unified structure, without the need for an extra language model.

ToDo List

  • Release code
  • Document for Installation
  • Trained models
  • Document for testing and training
  • Evaluation
  • re-organize and clean the parameters

Updates

2021/10/9 We uploaded the code, datasets, and trained models.
2021/10/9 Fix a bug in cfs_LF_1.py.

Requirements

Python2.7
Colour
LMDB
Pillow
opencv-python
torch==1.3.0
torchvision==0.4.1
editdistance
matplotlib==2.2.5

Step-by-step install

pip install -r requirements.txt

Data preparing

Training sets

SynthText: We use the tool to crop images from the original SynthText dataset and convert them into an LMDB dataset.

MJSynth: We use the tool to convert images into an LMDB dataset. (We only use the training set in this implementation.)

We have uploaded these LMDB datasets to RuiKe (password:x6si).

Testing sets

The evaluation datasets, in LMDB format, can be downloaded from BaiduYun (password:fjyy) or RuiKe:

IIIT5K Words (IIIT5K)
ICDAR 2013 (IC13)
Street View Text (SVT)
ICDAR 2015 (IC15)
Street View Text-Perspective (SVTP)
CUTE80 (CUTE)

The data directory is structured as follows:

datasets
├── evaluation
│   ├── Sumof6benchmarks
│   ├── CUTE
│   ├── IC13
│   ├── IC15
│   ├── IIIT5K
│   ├── SVT
│   └── SVTP
└── train
    ├── MJSynth
    └── SynthText
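
Before training or evaluating, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the released code: the helper name `missing_dataset_dirs` and the default root `datasets` are assumptions for illustration, and the check only looks at directory names, not at the LMDB contents.

```python
import os

# Sub-directories from the tree above, relative to the dataset root.
EXPECTED_DIRS = [
    "evaluation/Sumof6benchmarks",
    "evaluation/CUTE",
    "evaluation/IC13",
    "evaluation/IC15",
    "evaluation/IIIT5K",
    "evaluation/SVT",
    "evaluation/SVTP",
    "train/MJSynth",
    "train/SynthText",
]


def missing_dataset_dirs(root="datasets"):
    """Return the expected sub-directories that do not exist under root.

    Hypothetical helper: it only checks that the directories exist.
    """
    return [
        d for d in EXPECTED_DIRS
        if not os.path.isdir(os.path.join(root, *d.split("/")))
    ]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing dataset directories: {}".format(", ".join(missing)))
    else:
        print("All expected dataset directories are present.")
```

An empty list means every directory in the tree was found.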

Evaluation

Results on 6 benchmarks

Methods              IIIT5K  IC13  SVT   IC15  SVTP  CUTE
Paper                95.8    95.7  91.7  83.7  86.0  88.5
This implementation  95.9    96.3  90.7  84.1  85.3  88.9

Download our trained model from BaiduYun (password: e3kj) or RuiKe (password: cxqi), and place it at output/LA/final.pth.

CUDA_VISIBLE_DEVICES=0 python eval.py

Visualize character-wise mask map

Examples of the visualization of mask_c: [image]

   CUDA_VISIBLE_DEVICES=0 python visualize.py

You can modify the 'mask_id' in cfgs/cfgs_visualize to change the mask position for visualization.

Results on OST datasets

The Occlusion Scene Text (OST) dataset is proposed to measure how well a model recognizes text with missing visual cues. It is collected from 6 benchmarks (IC13, IC15, IIIT5K, SVT, SVTP and CUTE) and contains 4832 images. Images in this dataset are manually occluded to a weak or heavy degree, meaning that the character is occluded with one or two lines, respectively. For each image, we randomly choose one degree and cover only one character.
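
To make the weak/heavy distinction concrete, here is a toy sketch of the idea. It is not the procedure used to build OST (the dataset was occluded manually). The function names, the horizontal orientation of the lines, their one-pixel thickness, and their evenly spaced placement are all assumptions for illustration. The image is a plain list of pixel rows so the sketch needs no extra libraries.

```python
import random


def occlude_character(image, box, degree, fill=0):
    """Draw `degree` horizontal lines across one character box.

    degree 1 stands for "weak" and degree 2 for "heavy" occlusion.
    image is a list of rows of pixel values; box is
    (left, top, right, bottom) with right and bottom exclusive.
    Returns a new image and leaves the input untouched.
    """
    if degree not in (1, 2):
        raise ValueError("degree must be 1 (weak) or 2 (heavy)")
    left, top, right, bottom = box
    height = bottom - top
    out = [list(row) for row in image]
    # Assumption: lines are spaced evenly inside the character box.
    for k in range(1, degree + 1):
        y = top + k * height // (degree + 1)
        for x in range(left, right):
            out[y][x] = fill
    return out


def random_occlusion(image, char_boxes, rng=random):
    """Pick one character and one degree at random, then occlude it."""
    box = rng.choice(char_boxes)
    degree = rng.choice((1, 2))
    return occlude_character(image, box, degree), degree
```

For example, `random_occlusion(img, boxes)` covers exactly one of the given character boxes with either one or two lines, mirroring the "one degree, one character" rule described above.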

Examples of images in the OST dataset: [image] [image]

Methods              Average  Weak  Heavy
Paper                60.3     70.3  50.3
This implementation  60.3     70.8  49.8
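
The Average column in both rows is numerically consistent with the unweighted mean of the Weak and Heavy accuracies. The short check below shows the arithmetic. Whether the authors computed the average this way (rather than over all 4832 images) is an assumption, not something the README states.

```python
# Accuracies copied from the OST table above.
RESULTS = {
    "Paper": {"average": 60.3, "weak": 70.3, "heavy": 50.3},
    "This implementation": {"average": 60.3, "weak": 70.8, "heavy": 49.8},
}


def unweighted_mean(weak, heavy):
    """Mean of the two subset accuracies, rounded to one decimal."""
    return round((weak + heavy) / 2.0, 1)


for name, row in RESULTS.items():
    mean = unweighted_mean(row["weak"], row["heavy"])
    print("{}: mean of Weak and Heavy = {}".format(name, mean))
```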

The LMDB dataset is available from BaiduYun (password:yrrj) or RuiKe (password: vmzr).

Training

Four 2080Ti GPUs are used in this implementation.

Language-free (LF) process

Step 1: We first train the vision model without MLM. (Our trained LF_1 model: BaiduYun (password:avs5) or RuiKe (password:qwzn).)

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LF_1.py

Step 2: We fine-tune the MLM with the vision model. (Our trained LF_2 model: BaiduYun (password:04jg) or RuiKe (password:v67q).)

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LF_2.py

Language-aware (LA) process

Use the mask map to guide the linguistic learning in the vision model.

   CUDA_VISIBLE_DEVICES=0,1,2,3 python train_LA.py
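
The three training commands above (LF step 1, LF step 2, then LA) can be chained from a small driver script. This is a sketch under assumptions: `build_commands` and `run_all` are hypothetical helpers, not part of the repository, and the sketch assumes each stage's config already points at the checkpoint produced by the previous stage.

```python
import os
import subprocess

# Training scripts in the order given above.
STAGES = ["train_LF_1.py", "train_LF_2.py", "train_LA.py"]


def build_commands(gpus="0,1,2,3", python="python"):
    """Return (argv, extra_env) pairs, one per training stage."""
    return [
        ([python, script], {"CUDA_VISIBLE_DEVICES": gpus})
        for script in STAGES
    ]


def run_all(gpus="0,1,2,3"):
    """Run the stages one after another, stopping at the first failure."""
    for argv, extra_env in build_commands(gpus):
        env = dict(os.environ)
        env.update(extra_env)
        subprocess.check_call(argv, env=env)
```

Calling `run_all()` from the repository root would launch the stages sequentially. `subprocess.check_call` raises if a stage exits with a non-zero status, so a later stage never starts on top of a failed one.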

Tip: In the LA process, a model whose loss (Loss VisionLAN) is higher than 0.3 and whose training accuracy (Accuracy) is lower than 91.0 after the first 200 training iterations obtains better performance.
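
The tip can be written as a simple check on the values logged after the first 200 iterations. This is a sketch: the function name and argument names are hypothetical, and the loss and accuracy are assumed to be read by hand from the training log, with accuracy given in percent.

```python
def la_run_looks_promising(loss, accuracy,
                           loss_floor=0.3, accuracy_ceiling=91.0):
    """Apply the README tip to values logged after 200 LA iterations.

    loss: the logged "Loss VisionLAN" value.
    accuracy: the logged training "Accuracy", in percent.
    """
    return loss > loss_floor and accuracy < accuracy_ceiling


if __name__ == "__main__":
    # Example values only; substitute the numbers from your own log.
    print(la_run_looks_promising(loss=0.35, accuracy=90.2))
```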

Improvement

  1. Mask ids randomly generated according to the max length cannot adapt well to the occlusion of long text. Thus, evenly sampled mask ids can further improve the performance of MLM.
  2. In our later experiments, a heavier vision model is able to capture more robust linguistic information.
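
The README does not define "evenly sampled" precisely. One possible reading of the first point is that mask positions should cover every character of the actual word equally often, whatever its length. The sketch below illustrates only that reading. `even_mask_ids` is a hypothetical helper, not code from this repository.

```python
from collections import Counter


def even_mask_ids(text_len, num_samples):
    """Cycle through character positions so each is masked equally often.

    text_len: number of characters in the word.
    num_samples: how many mask ids to produce (e.g. one per epoch).
    """
    if text_len <= 0:
        raise ValueError("text_len must be positive")
    return [i % text_len for i in range(num_samples)]


if __name__ == "__main__":
    ids = even_mask_ids(text_len=12, num_samples=24)
    # Every position of a 12-character word is masked the same number of times.
    print(sorted(Counter(ids).items()))
```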

Citation

If you find our method useful for your research, please cite

 @article{wang2021two,
  title={From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network},
  author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
  journal={ICCV},
  year={2021}
}

Feedback

Suggestions and discussions are very welcome. Please contact the authors by email at [email protected]
