The code for our paper CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.

Last update: Jan 06, 2023

Overview

CrossFormer

This repository is the code for our paper CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention.

Introduction

Existing vision transformers fail to build attention among objects/features of different scales (cross-scale attention), although this ability is very important for visual tasks. CrossFormer is a versatile vision transformer that solves this problem. Its core designs are the Cross-scale Embedding Layer (CEL) and Long-Short Distance Attention (L/SDA), which work together to enable cross-scale attention.

CEL blends every input embedding with multiple-scale features. L/SDA splits all embeddings into several groups, and self-attention is computed only within each group (embeddings with the same color border belong to the same group).
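The grouping step can be illustrated on a small square grid of embedding positions. This is a minimal sketch, assuming that short-distance groups are blocks of adjacent positions and that long-distance groups collect positions sampled at a fixed interval. The function names and the pure-Python form are illustrative and are not taken from this repository.

```python
def sda_groups(size, group):
    """Short-distance grouping: each group is a block of adjacent positions."""
    groups = {}
    for r in range(size):
        for c in range(size):
            # Positions in the same (group x group) block share a key.
            groups.setdefault((r // group, c // group), []).append((r, c))
    return list(groups.values())


def lda_groups(size, interval):
    """Long-distance grouping: positions sampled at a fixed interval share a group."""
    groups = {}
    for r in range(size):
        for c in range(size):
            # Positions with the same remainder modulo the interval share a key.
            groups.setdefault((r % interval, c % interval), []).append((r, c))
    return list(groups.values())


# On a 4 x 4 grid, both schemes give four groups of four positions each.
print(sda_groups(4, 2)[0])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(lda_groups(4, 2)[0])  # [(0, 0), (0, 2), (2, 0), (2, 2)]
```

Attention would then be computed independently inside each returned group, so the cost depends on the group size rather than on the full grid.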

Further, we propose a dynamic position bias (DPB) module, which makes the effective yet inflexible relative position bias applicable to variable image sizes.
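One way to picture DPB is as a small function from a relative offset to a bias value, so that a bias matrix can be generated for any group size instead of being read from a fixed-size table. The sketch below is framework-free and uses random, untrained weights. The names make_dpb and bias_matrix, the hidden width, and the single hidden layer are assumptions made for illustration, not the module implemented in this repository.

```python
import random


def make_dpb(hidden=8, seed=0):
    """Build a toy offset-to-bias function (one hidden ReLU layer, random weights)."""
    rng = random.Random(seed)
    w1 = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(hidden)]
    b1 = [rng.gauss(0, 1) for _ in range(hidden)]
    w2 = [rng.gauss(0, 1) for _ in range(hidden)]

    def dpb(d_row, d_col):
        # Hidden layer with ReLU, followed by a linear read-out to one scalar.
        act = [max(0.0, wr * d_row + wc * d_col + b) for (wr, wc), b in zip(w1, b1)]
        return sum(w * a for w, a in zip(w2, act))

    return dpb


def bias_matrix(dpb, group):
    """Bias between every pair of positions in a (group x group) window."""
    coords = [(r, c) for r in range(group) for c in range(group)]
    return [[dpb(ri - rj, ci - cj) for (rj, cj) in coords] for (ri, ci) in coords]


dpb = make_dpb()
small = bias_matrix(dpb, 2)  # 4 x 4 matrix
large = bias_matrix(dpb, 3)  # 9 x 9 matrix produced by the same function
```

Because the bias depends only on the offset, pairs with the same offset receive the same value in both matrices, which is what lets one set of weights serve different group sizes.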

Experiments are conducted on four representative visual tasks, i.e., image classification, object detection, and instance/semantic segmentation. Results show that CrossFormer outperforms existing vision transformers in these tasks, especially in dense prediction tasks (i.e., object detection and instance/semantic segmentation). We think this is because image classification only pays attention to one object and large-scale features, while dense prediction tasks rely more on cross-scale attention.

Prerequisites

Libraries (Python3.6-based)

pip3 install numpy scipy Pillow pyyaml torch==1.7.0 torchvision==0.8.1 timm==0.3.2

Dataset: ImageNet
Requirements for detection/instance segmentation and semantic segmentation are listed in detection/README.md and segmentation/README.md, respectively.

Getting Started

Training

## There should be two directories under the path_to_imagenet: train and validation

## CrossFormer-T
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-S
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/small_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-B
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/base_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output

## CrossFormer-L
python -u -m torch.distributed.launch --nproc_per_node 8 main.py --cfg configs/large_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --output ./output
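The directory requirement noted at the top of the training commands (a train and a validation directory under path_to_imagenet) can be checked before launching a job. This is a minimal sketch; the helper name missing_imagenet_dirs is hypothetical and not part of this repository.

```python
import os

# Subdirectory names taken from the note above the training commands.
REQUIRED_SUBDIRS = ("train", "validation")


def missing_imagenet_dirs(root):
    """Return the required subdirectories that are absent under root."""
    return [name for name in REQUIRED_SUBDIRS
            if not os.path.isdir(os.path.join(root, name))]
```

An empty list from missing_imagenet_dirs("path_to_imagenet") means both directories are present.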

Testing

## Take CrossFormer-T as an example
python -u -m torch.distributed.launch --nproc_per_node 1 main.py --cfg configs/tiny_patch4_group7_224.yaml \
--batch-size 128 --data-path path_to_imagenet --eval --resume path_to_crossformer-t.pth

Training scripts for object detection: detection/README.md.

Training scripts for semantic segmentation: segmentation/README.md.

Results

Image Classification

Models trained on ImageNet-1K and evaluated on its validation set. The input image size is 224 x 224.

Architectures	Params	FLOPs	Accuracy	Models
ResNet-50	25.6M	4.1G	76.2%	-
RegNetY-8G	39.0M	8.0G	81.7%	-
CrossFormer-T	27.8M	2.9G	81.5%	Google Drive/BaiduCloud, key: nkju
CrossFormer-S	30.7M	4.9G	82.5%	Google Drive/BaiduCloud, key: fgqj
CrossFormer-B	52.0M	9.2G	83.4%	Google Drive/BaiduCloud, key: 7md9
CrossFormer-L	92.0M	16.1G	84.0%	TBD

More results compared with other vision transformers can be seen in the paper.

Object Detection & Instance Segmentation

Models trained on COCO 2017. Backbones are initialized with weights pre-trained on ImageNet-1K.

Backbone	Detection Head	Learning Schedule	Params	FLOPs	box AP	mask AP
ResNet-101	RetinaNet	1x	56.7M	315.0G	38.5	-
CrossFormer-S	RetinaNet	1x	40.8M	282.0G	44.4	-
CrossFormer-B	RetinaNet	1x	62.1M	389.0G	46.2	-
ResNet-101	Mask-RCNN	1x	63.2M	336.0G	40.4	36.4
CrossFormer-S	Mask-RCNN	1x	50.2M	301.0G	45.4	41.4
CrossFormer-B	Mask-RCNN	1x	71.5M	407.9G	47.2	42.7

More results and pretrained models for object detection: detection/README.md.

Semantic Segmentation

Models trained on ADE20K. Backbones are initialized with weights pre-trained on ImageNet-1K.

Backbone	Segmentation Head	Iterations	Params	FLOPs	IOU	MS IOU
CrossFormer-S	FPN	80K	34.3M	209.8G	46.4	-
CrossFormer-B	FPN	80K	55.6M	320.1G	48.0	-
CrossFormer-L	FPN	80K	95.4M	482.7G	49.1	-
ResNet-101	UPerNet	160K	86.0M	1029.0G	44.9	-
CrossFormer-S	UPerNet	160K	62.3M	979.5G	47.6	48.4
CrossFormer-B	UPerNet	160K	83.6M	1089.7G	49.7	50.6
CrossFormer-L	UPerNet	160K	125.5M	1257.8G	50.4	51.4

MS IOU means IOU with multi-scale testing.
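For readers unfamiliar with the metric, the sketch below computes a mean intersection-over-union from flat lists of predicted and ground-truth class labels. It assumes the IOU columns report the usual per-class IoU averaged over classes. The function name is illustrative, and the evaluation code under segmentation/ may differ in details such as how absent classes are handled.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    ious = []
    for cls in range(num_classes):
        # Pixels labelled cls in both lists, and pixels labelled cls in either.
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Class 0: IoU = 1/2, class 1: IoU = 2/3, so the mean is 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```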

More results and pretrained models for semantic segmentation: segmentation/README.md.

Citing Us

@article{crossformer2021,
  title     = {CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention},
  author    = {Wenxiao Wang and Lu Yao and Long Chen and Deng Cai and Xiaofei He and Wei Liu},
  journal   = {CoRR},
  volume    = {abs/2108.00154},
  year      = {2021},
}

Acknowledgement

Part of the code in this repository is based on Swin Transformer.


Owner

cheerss
