A full-fledged version of Pix2Seq

Last update: Dec 27, 2022

Stable-Pix2Seq

What it is. This is a full-fledged version of Pix2Seq. Compared with unofficial-pix2seq, stable-pix2seq contains most of the tricks mentioned in Pix2Seq, such as sequence augmentation, batch repetition, warmup, linear learning-rate decay, and beam search (to be added later).

Differences from Pix2Seq. In sequence augmentation, we only add random bounding boxes, whereas the original paper also mixes in virtual boxes made from ground truth plus noise. Pix2Seq also uses input sequence dropout to regularize the training process.
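As a rough illustration of the random-box variant of sequence augmentation described above, the sketch below pads a list of ground-truth objects with randomly drawn boxes. It is not the code used in this repository: the names `augment_sequence`, `random_box`, and `NOISE_LABEL`, the normalized box format, and the fixed target length are all assumptions made for the example.

```python
import random

NOISE_LABEL = -1  # hypothetical class id marking synthetic "noise" boxes


def random_box(rng):
    """Draw a random box (x1, y1, x2, y2) in normalized [0, 1] coordinates."""
    x1, x2 = sorted((rng.random(), rng.random()))
    y1, y2 = sorted((rng.random(), rng.random()))
    return (x1, y1, x2, y2)


def augment_sequence(objects, max_objects, rng=None):
    """Pad a list of (box, label) pairs with random noise boxes.

    Ground-truth objects are kept first (truncated to max_objects);
    the remaining slots are filled with random boxes tagged NOISE_LABEL.
    """
    rng = rng or random.Random()
    augmented = list(objects[:max_objects])
    while len(augmented) < max_objects:
        augmented.append((random_box(rng), NOISE_LABEL))
    return augmented


if __name__ == "__main__":
    gt = [((0.1, 0.2, 0.5, 0.6), 3)]
    for box, label in augment_sequence(gt, 4, random.Random(0)):
        print(label, box)
```

The original paper additionally builds noise boxes by perturbing ground-truth boxes; that part is deliberately left out here, matching the difference noted above.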

Usage - Object detection

There are no extra compiled components in Stable-Pix2Seq and package dependencies are minimal, so the code is very simple to use. We provide instructions on how to install dependencies via conda. First, clone the repository locally:

git clone https://github.com/gaopengcuhk/Stable-Pix2Seq.git

Then, install PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

That's it; you should be ready to train and evaluate detection models.

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images
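Before launching training, it can help to confirm that the layout above is in place. The helper below is a small sketch for that purpose; `missing_coco_dirs` is a hypothetical name and is not part of this repository.

```python
from pathlib import Path

# Subdirectories expected under the COCO root, per the layout above.
EXPECTED_SUBDIRS = ("annotations", "train2017", "val2017")


def missing_coco_dirs(coco_root):
    """Return the expected subdirectories that are absent under coco_root."""
    root = Path(coco_root)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_coco_dirs("./coco/")
    if missing:
        print("Missing directories:", ", ".join(missing))
    else:
        print("COCO layout looks complete.")
```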

Training

To train the baseline Stable-Pix2Seq on a single node with 8 GPUs for 300 epochs, run:

python -m torch.distributed.launch --master_port=3141 --nproc_per_node 8 --use_env main.py --coco_path ./coco/ --batch_size 4 --lr 0.0005

A single epoch takes 50 minutes on 8 V100 GPUs, so training for 300 epochs takes around 10 days on a single machine with 8 V100 cards.
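The 10-day figure follows directly from the per-epoch time. The snippet below just reproduces that arithmetic so it can be redone for other epoch counts or hardware; `total_training_days` is a hypothetical helper name.

```python
def total_training_days(minutes_per_epoch, epochs):
    """Convert a per-epoch time in minutes into total training days."""
    return minutes_per_epoch * epochs / 60 / 24


if __name__ == "__main__":
    # 50 minutes per epoch for 300 epochs is 250 hours of training.
    print(f"{total_training_days(50, 300):.1f} days")
```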

Why is it slower than DETR and Unofficial-Pix2Seq? Stable-Pix2Seq uses batch repeat, which doubles the training time. In addition, Stable-Pix2Seq trains at 1333 image resolution, while the time reported for unofficial-pix2seq was measured at a low resolution of 512.

We train the model with AdamW, setting the learning rate with a linear warmup and decay schedule. Due to batch repeat, the real batch size is 64. Horizontal flips, scales and crops are used for augmentation. Images are rescaled to have min size 800 and max size 1333. The transformer is trained with dropout of 0.1, and the whole model is trained with grad clip of 0.1.
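To make the min size 800 / max size 1333 constraint concrete, here is a minimal sketch of how an output size could be computed under it. It assumes the simplest rule (scale the shorter side to 800 unless that pushes the longer side past 1333); the actual augmentation pipeline, which also uses random scales and crops, may differ. `rescaled_size` is a hypothetical name.

```python
def rescaled_size(width, height, min_size=800, max_size=1333):
    """Return (new_width, new_height) preserving aspect ratio.

    The shorter side is scaled to min_size, unless that would make the
    longer side exceed max_size, in which case the longer side is capped.
    """
    short_side, long_side = min(width, height), max(width, height)
    scale = min(min_size / short_side, max_size / long_side)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    print(rescaled_size(640, 480))   # shorter side reaches 800
    print(rescaled_size(1000, 300))  # longer side is capped at 1333
```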

Please use the learning rate 0.0005 with caution. It was tested with a batch size of 198.
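For reference, a linear warmup followed by linear decay can be written as a plain function of the step count, as in the sketch below. This is a generic illustration, not the scheduler used in this repository: the function names, the warmup length, and the choice to decay to zero are assumptions. The second helper reproduces the batch size of 64 from the training command (4 per GPU, 8 GPUs, batch repeated twice).

```python
def linear_warmup_decay(step, base_lr, warmup_steps, total_steps):
    """Learning rate at a given step.

    Rises linearly from 0 to base_lr over warmup_steps, then falls
    linearly back to 0 at total_steps (an assumed end point).
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


def effective_batch_size(per_gpu_batch, num_gpus, repeats):
    """Number of samples contributing to one optimizer step."""
    return per_gpu_batch * num_gpus * repeats


if __name__ == "__main__":
    for step in (0, 5, 10, 55, 100):
        print(step, linear_warmup_decay(step, 0.0005, 10, 100))
    print(effective_batch_size(per_gpu_batch=4, num_gpus=8, repeats=2))
```

When the batch size changes, the learning rate usually needs to be retuned as well, which is one reason the value above should be treated with care.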

Evaluation

To evaluate Stable-Pix2Seq R50 on COCO val5k with multiple GPUs, run:

python -m torch.distributed.launch --master_port=3142 --nproc_per_node 8 --use_env main.py --coco_path ./coco/ --batch_size 4 --eval --resume checkpoint.pth

Acknowledgement

DETR

Owner

peng gao
