Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Overview

Traditional Chinese Text Recognition Dataset: Synthetic Dataset and Labeled Data

Authors: Yi-Chang Chen, Yu-Chuan Chang, Yen-Cheng Chang and Yi-Ren Yeh

Paper: https://arxiv.org/abs/2111.13327

Scene text recognition (STR) has been widely studied in academia and industry. Training a text recognition model often requires a large amount of labeled data, but data labeling can be difficult, expensive, or time-consuming, especially for Traditional Chinese text recognition. To the best of our knowledge, public datasets for Traditional Chinese text recognition are lacking.

We generated over 20 million synthetic data and collected over 7,000 manually labeled data TC-STR 7k-word as the benchmark. Experimental results show that a text recognition model can achieve much better accuracy either by training from scratch with our generated synthetic data or by further fine-tuning with TC-STR 7k-word.

Synthetic Dataset: TCSynth

Inspired by MJSynth, SynthText and Belval/TextRecognitionDataGenerator, we propose a framework for generating scene text images for Traditional Chinese. To produce synthetic text images similar to real-world ones, we use different kinds of mechanisms for rendering, including word sampling, character spacing, font types/sizes, text coloring, text stroking, text skewing/distorting, background rendering, text Location and noise.

synth_text_pipeline

TCSynth dataset includes 21,535,590 synthetic text images.

TCSynth-VAL dataset includes 6,000 synthetic text images for validation.

LMDB Format

After untaring,

TCSynth/
├── data.mdb
└── lock.mdb

Our data structure of LMDB follows the repo. clovaai/deep-text-recognition-benchmark. The value queried by key 'num-samples'.encode() gets total number of text images. The indexes of text images starts from 1. Given the index, we can query binary of the image and its label by key 'image-%09d'.encode() % index and 'label-%09d'.encode() % index. The implement details are shown in the class LmdbConnector in lmdb_tools/lmdb_connector.py.

We also provide several tools to manipulate the LMDB shown in lmdb_tools. Before using those tools, we should install some dependencies. (tested with python 3.6)

pip install -r lmdb_tools/requirements.txt
  • Insert images into LMDB
python lmdb_tools/prepare_lmdb.py \
  --input_dir IMG_FOLDER \
  --gt_file GT \
  --output_dir LMDB_FOLDER
  • Insert images into LMDB (asynchronous version)
python lmdb_tools/prepare_lmdb_async.py \
  --input_dir IMG_FOLDER \
  --gt_file GT \
  --output_dir LMDB_FOLDER \
  --workers WORKERS
  • Extract images from LMDB (asynchronous version) (convert LMDB Format to Raw Format)
python lmdb_tools/extract_to_files.py \
  --input_lmdb LMDB_FOLDER \
  --output_dir IMG_FOLDER \
  --workers WORKERS

Raw Format

After untaring,

TCSynth_raw/
├── labels.txt
├── 0000/
│   ├── 00000001.jpg
│   ├── 00000002.jpg
│   ├── 00000003.jpg
│   └── ...
├── 0001/
├── 0002/
└── ...

format of labels.txt: {imagepath}\t{label}\n, for example:

0000/00000001.jpg 㒓
...

Labeled Data: TC-STR 7k-word

Our TC-STR 7k-word dataset collects about 1,554 images from Google image search to produce 7,543 cropped text images. To increase the diversity in our collected scene text images, we search for images under different scenarios and query keywords. Since the collected scene text images are to be used in evaluating text recognition performance, we manually crop text from the collected images and assign a label to each cropped text box.

TC-STR_demo

TC-STR 7k-word dataset includes a training set of 3,837 text images and a testing set of 3,706 images.

After untaring,

TC-STR/
├── train_labels.txt
├── test_labels.txt
└── images/
    ├── xxx_1.jpg
    ├── xxx_2.jpg
    ├── xxx_3.jpg
    └── ...

format of xxx_labels.txt: {imagepath}\t{label}\n, for example:

images/billboard_00000_010_雜貨鋪.jpg 雜貨鋪
images/sign_02616_999_民生路.png 民生路
...

Citation

Please consider citing this work in your publications if it helps your research.

@article{chen2021traditional,
  title={Traditional Chinese Synthetic Datasets Verified with Labeled Data for Scene Text Recognition},
  author={Yi-Chang Chen and Yu-Chuan Chang and Yen-Cheng Chang and Yi-Ren Yeh},
  journal={arXiv preprint arXiv:2111.13327},
  year={2021}
}
Owner
Yi-Chang Chen
大家好!我是YC,是一名資料科學家,熟悉機器學習和深度學習的各類技術,以及大數據分散式系統; 同時,我也是一名街頭藝人和部落客。我總是嘗試各種生命的可能性,因為我深信:人生的意義在於體驗一切身為人的經驗。
Yi-Chang Chen
STonKGs is a Sophisticated Transformer that can be jointly trained on biomedical text and knowledge graphs

STonKGs STonKGs is a Sophisticated Transformer that can be jointly trained on biomedical text and knowledge graphs. This multimodal Transformer combin

STonKGs 27 Aug 11, 2022
Chinese version of GPT2 training code, using BERT tokenizer.

GPT2-Chinese Description Chinese version of GPT2 training code, using BERT tokenizer or BPE tokenizer. It is based on the extremely awesome repository

Zeyao Du 5.6k Jan 04, 2023
Machine Learning Course Project, IMDB movie review sentiment analysis by lstm, cnn, and transformer

IMDB Sentiment Analysis This is the final project of Machine Learning Courses in Huazhong University of Science and Technology, School of Artificial I

Daniel 0 Dec 27, 2021
CodeBERT: A Pre-Trained Model for Programming and Natural Languages.

CodeBERT This repo provides the code for reproducing the experiments in CodeBERT: A Pre-Trained Model for Programming and Natural Languages. CodeBERT

Microsoft 1k Jan 03, 2023
HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools

HuggingSound HuggingSound: A toolkit for speech-related tasks based on HuggingFace's tools. I have no intention of building a very complex tool here.

Jonatas Grosman 247 Dec 26, 2022
An implementation of the Pay Attention when Required transformer

Pay Attention when Required (PAR) Transformer-XL An implementation of the Pay Attention when Required transformer from the paper: https://arxiv.org/pd

7 Aug 11, 2022
A method to generate speech across multiple speakers

VoiceLoop PyTorch implementation of the method described in the paper VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop. VoiceLoop is a n

Facebook Archive 873 Dec 15, 2022
Toward Model Interpretability in Medical NLP

Toward Model Interpretability in Medical NLP LING380: Topics in Computational Linguistics Final Project James Cross ( 1 Mar 04, 2022

:id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.

Dedupe Python Library dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on

Dedupe.io 3.6k Jan 02, 2023
Stuff related to Ben Eater's 8bit breadboard computer

8bit breadboard computer simulator This is an assembler + simulator/emulator of Ben Eater's 8bit breadboard computer. For a version with its RAM upgra

Marijn van Vliet 29 Dec 29, 2022
Tokenizer - Module python d'analyse syntaxique et de grammaire, tokenization

Tokenizer Le Tokenizer est un analyseur lexicale, il permet, comme Flex and Yacc par exemple, de tokenizer du code, c'est à dire transformer du code e

Manolo 1 Aug 15, 2022
✨Rubrix is a production-ready Python framework for exploring, annotating, and managing data in NLP projects.

✨A Python framework to explore, label, and monitor data for NLP projects

Recognai 1.5k Jan 02, 2023
A workshop with several modules to help learn Feast, an open-source feature store

Workshop: Learning Feast This workshop aims to teach users about Feast, an open-source feature store. We explain concepts & best practices by example,

Feast 52 Jan 05, 2023
Built for cleaning purposes in military institutions

Ferramenta do AL Construído para fins de limpeza em instituições militares. Instalação Requer python = 3.2 pip install -r requirements.txt Usagem Exe

0 Aug 13, 2022
Simple Text-Generator with OpenAI gpt-2 Pytorch Implementation

GPT2-Pytorch with Text-Generator Better Language Models and Their Implications Our model, called GPT-2 (a successor to GPT), was trained simply to pre

Tae-Hwan Jung 775 Jan 08, 2023
Python wrapper for Stanford CoreNLP tools v3.4.1

Python interface to Stanford Core NLP tools v3.4.1 This is a Python wrapper for Stanford University's NLP group's Java-based CoreNLP tools. It can eit

Dustin Smith 610 Sep 07, 2022
DeepSpeech - Easy-to-use Speech Toolkit including SOTA ASR pipeline, influential TTS with text frontend and End-to-End Speech Simultaneous Translation.

(简体中文|English) Quick Start | Documents | Models List PaddleSpeech is an open-source toolkit on PaddlePaddle platform for a variety of critical tasks i

5.6k Jan 03, 2023
Ελληνικά νέα (Python script) / Greek News Feed (Python script)

Ελληνικά νέα (Python script) / Greek News Feed (Python script) Ελληνικά English Το 2017 είχα υλοποιήσει ένα Python script για να εμφανίζει τα τωρινά ν

Loren Kociko 1 Jun 14, 2022
Input english text, then translate it between languages n times using the Deep Translator Python Library.

mass-translator About Input english text, then translate it between languages n times using the Deep Translator Python Library. How to Use Install dep

2 Mar 04, 2022
Training and evaluation codes for the BertGen paper (ACL-IJCNLP 2021)

BERTGEN This repository is the implementation of the paper "BERTGEN: Multi-task Generation through BERT" (https://arxiv.org/abs/2106.03484). The codeb

<a href=[email protected]"> 9 Oct 26, 2022