A new video text spotting framework with Transformer

Last update: Jan 03, 2023

Overview

TransVTSpotter: End-to-end Video Text Spotter with Transformer

Introduction

A Multilingual, Open World Video Text Dataset and End-to-end Video Text Spotter with Transformer

Link to our MOVText: A Large-Scale, Multilingual Open World Dataset for Video Text Spotting

Updates

(08/04/2021) Refactoring the code.
(10/20/2021) The complete code has been released.

ICDAR2015(video) Tracking challenge

Methods	MOTA	MOTP	IDF1	Mostly Matched	Partially Matched	Mostly Lost
TransVTSpotter	45.75	73.58	57.56	658	611	647
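
For readers unfamiliar with the MOTA and IDF1 columns, the sketch below computes both from raw error counts using the standard CLEAR MOT and identity-based definitions. It is an illustration only, not the evaluation code used for this challenge, and every count in the usage lines is made up.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(id_true_pos, id_false_pos, id_false_neg):
    """Identity F1: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denominator = 2 * id_true_pos + id_false_pos + id_false_neg
    if denominator == 0:
        raise ValueError("at least one count must be non-zero")
    return 2 * id_true_pos / denominator


# Hypothetical counts, chosen only to show the arithmetic.
example_mota = mota(false_negatives=300, false_positives=150,
                    id_switches=50, num_gt=1000)
example_idf1 = idf1(id_true_pos=80, id_false_pos=20, id_false_neg=20)
print(f"MOTA = {100 * example_mota:.2f}, IDF1 = {100 * example_idf1:.2f}")
```

Both metrics are reported as percentages in the table, so the fractions returned here are multiplied by 100 before printing.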

Models are also available on Baidu Drive (access code: m4iv).

Notes

Models are trained on 8 NVIDIA V100 GPUs with a batch size of 16.
We use the models pre-trained on COCOTextV2.
We do not release the recognition code due to the company's regulations.

Demo

Installation

The codebase is built on top of Deformable DETR and TransTrack.

Requirements

Linux, CUDA>=9.2, GCC>=5.4
Python>=3.7
PyTorch>=1.5 and a torchvision version that matches the PyTorch installation. Installing both together from pytorch.org ensures they match.
OpenCV is optional and needed by demo and visualization
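
Before building, it can help to confirm that the interpreter and libraries meet the minimums listed above. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `check_environment` are hypothetical, and the check covers only Python, PyTorch, CUDA availability and the optional OpenCV import (it does not verify the GCC or CUDA toolkit versions).

```python
import importlib
import importlib.util
import re
import sys


def parse_version(text):
    """Turn a version string such as '1.13.1+cu117' into (1, 13, 1)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def check_environment():
    """Return a dict mapping each requirement to True/False."""
    report = {"python>=3.7": sys.version_info >= (3, 7)}
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        report["torch>=1.5"] = False
        report["cuda available"] = False
    else:
        report["torch>=1.5"] = parse_version(torch.__version__) >= (1, 5)
        report["cuda available"] = bool(torch.cuda.is_available())
    # OpenCV is optional: only the demo and visualization need it.
    report["opencv (optional)"] = importlib.util.find_spec("cv2") is not None
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```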

Steps

Install and build libs

git clone git@github.com:weijiawu/TransVTSpotter.git
cd TransVTSpotter
cd models/ops
python setup.py build install
cd ../..
pip install -r requirements.txt

Prepare datasets and annotations

# pretrain COCOTextV2
python3 track_tools/convert_COCOText_to_coco.py

# ICDAR15
python3 track_tools/convert_ICDAR15video_to_coco.py

The COCOTextV2 dataset is available at COCOTextV2.

python3 track_tools/convert_crowdhuman_to_coco.py

The ICDAR2015 dataset is available at icdar2015.

python3 track_tools/convert_mot_to_coco.py
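
The conversion scripts above turn text annotations into COCO-style JSON. As a rough illustration of that kind of conversion, the sketch below maps a quadrilateral text region to a COCO `[x, y, width, height]` box and wraps it in an annotation record. This is not the repository's script: the function names are hypothetical, and the `track_id` field is an assumption (it is not part of the standard COCO format) included only to show how a tracking identity could be carried along.

```python
def quad_to_coco_bbox(points):
    """Axis-aligned COCO box [x_min, y_min, width, height] around a polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


def make_annotation(ann_id, image_id, track_id, points):
    """Build one COCO-style annotation record for a text instance."""
    bbox = quad_to_coco_bbox(points)
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": 1,      # single "text" category (assumed)
        "track_id": track_id,  # assumed extension field for tracking
        "bbox": bbox,
        "area": bbox[2] * bbox[3],
        "iscrowd": 0,
    }


# Hypothetical quadrilateral, listed corner by corner.
quad = [(10, 20), (50, 22), (48, 40), (8, 38)]
annotation = make_annotation(ann_id=1, image_id=1, track_id=7, points=quad)
print(annotation["bbox"])
```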

Pre-train on COCOTextV2

python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/Pretrain_COCOTextV2 --dataset_file pretrain --coco_path ./Data/COCOTextV2 --batch_size 2  --with_box_refine --num_queries 500 --epochs 300 --lr_drop 100 --resume ./output/Pretrain_COCOTextV2/checkpoint.pth

python3 track_tools/Pretrain_model_to_mot.py

The pre-trained model is available as COCOTextV2_pretrain.pth (password: 59w8). The model with 44% MOTA can be found here (password: xnlw).

Train TransVTSpotter

python3 -m torch.distributed.launch --nproc_per_node=8 --use_env main_track.py  --output_dir ./output/ICDAR15 --dataset_file text --coco_path ./Data/ICDAR2015_video --batch_size 2  --with_box_refine  --num_queries 300 --epochs 80 --lr_drop 40 --resume ./output/Pretrain_COCOTextV2/pretrain_coco.pth
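
The two training commands share the same launch pattern, and the sketch below spells out the arithmetic behind their flags. It assumes that `main_track.py` follows Deformable DETR's training script, where `--batch_size` is per GPU and `--lr_drop` is the step size of a step decay with factor 0.1. The base learning rate of 2e-4 is Deformable DETR's default and is likewise an assumption here, since neither command sets it explicitly.

```python
def effective_batch_size(nproc_per_node, batch_size_per_gpu):
    """Total images per optimizer step across all GPUs."""
    return nproc_per_node * batch_size_per_gpu


def step_lr(base_lr, epoch, lr_drop, gamma=0.1):
    """Learning rate at a given epoch under step decay every lr_drop epochs."""
    return base_lr * gamma ** (epoch // lr_drop)


BASE_LR = 2e-4  # assumed default, not stated in this README

# 8 processes x 2 images each matches the batch size of 16 in the Notes.
print("effective batch size:", effective_batch_size(8, 2))

# Fine-tuning run: --epochs 80 --lr_drop 40
for epoch in (0, 39, 40, 79):
    print(f"fine-tune epoch {epoch}: lr = {step_lr(BASE_LR, epoch, 40):.1e}")

# Pre-training run: --epochs 300 --lr_drop 100
for epoch in (0, 100, 200):
    print(f"pre-train epoch {epoch}: lr = {step_lr(BASE_LR, epoch, 100):.1e}")
```

Under these assumptions the fine-tuning run spends its first 40 epochs at the base rate and the remaining 40 at one tenth of it, while pre-training drops twice, at epochs 100 and 200.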

Visualize TransVTSpotter

python3 track_tools/Evaluation_ICDAR15_video/vis_tracking.py

License

TransVTSpotter is released under the MIT License.

Citing

If you use TransVTSpotter in your research or wish to refer to the baseline results published here, please use the following BibTeX entries:

Owner

weijiawu
