This is an official implementation of CvT: Introducing Convolutions to Vision Transformers.

Last update: Dec 30, 2022

Overview

Introduction

We present a new architecture, named Convolutional vision Transformer (CvT), that improves Vision Transformer (ViT) in performance and efficiency by introducing convolutions into ViT to yield the best of both designs. This is accomplished through two primary modifications: a hierarchy of Transformers containing a new convolutional token embedding, and a convolutional Transformer block leveraging a convolutional projection. These changes introduce desirable properties of convolutional neural networks (CNNs) to the ViT architecture (e.g. shift, scale, and distortion invariance) while maintaining the merits of Transformers (e.g. dynamic attention, global context, and better generalization). We validate CvT by conducting extensive experiments, showing that this approach achieves state-of-the-art performance over other Vision Transformers and ResNets on ImageNet-1k, with fewer parameters and lower FLOPs. In addition, performance gains are maintained when the model is pre-trained on larger datasets (e.g. ImageNet-22k) and fine-tuned on downstream tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of 87.7% on the ImageNet-1k val set. Finally, our results show that the positional encoding, a crucial component in existing Vision Transformers, can be safely removed in our model, simplifying the design for higher-resolution vision tasks.
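To make the idea of a hierarchy built on convolutional token embeddings more concrete, the following sketch computes how the token grid shrinks when each stage starts with a strided convolution. The kernel sizes, strides, and paddings are illustrative assumptions rather than values stated in this README, and the helper names are hypothetical; only the output-size rule for a convolution is standard.

```python
# Sketch: side length of the token grid after each convolutional token embedding.
# The (kernel, stride, padding) triples below are assumed for illustration only.
from typing import List, Tuple


def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size rule for a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def token_grid_sizes(image_size: int, stages: List[Tuple[int, int, int]]) -> List[int]:
    """Return the side length of the square token grid after each stage."""
    sizes = []
    size = image_size
    for kernel, stride, padding in stages:
        size = conv_output_size(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


# Hypothetical three-stage layout: one 7x7 stride-4 embedding, then two 3x3 stride-2 ones.
ASSUMED_STAGES = [(7, 4, 2), (3, 2, 1), (3, 2, 1)]

if __name__ == "__main__":
    for stage, side in enumerate(token_grid_sizes(224, ASSUMED_STAGES), start=1):
        print(f"stage {stage}: {side}x{side} grid, {side * side} tokens")
```

Under these assumed settings, each stage reduces the number of tokens that attention has to process, which is one way a hierarchical design can lower FLOPs compared with a fixed-resolution token sequence.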

Main results

Models pre-trained on ImageNet-1k

Model	Resolution	Param	GFLOPs	Top-1
CvT-13	224x224	20M	4.5	81.6
CvT-21	224x224	32M	7.1	82.5
CvT-13	384x384	20M	16.3	83.0
CvT-21	384x384	32M	24.9	83.3

Models pre-trained on ImageNet-22k

Model	Resolution	Param	GFLOPs	Top-1
CvT-13	384x384	20M	16.3	83.3
CvT-21	384x384	32M	24.9	84.9
CvT-W24	384x384	277M	193.2	87.6

You can download all the models from our model zoo.

Quick start

Installation

This guide assumes that PyTorch and TorchVision are already installed; if not, follow the official instructions to install them first. Then install the dependencies with:

python -m pip install -r requirements.txt --user -q

The code was developed and tested with PyTorch 1.7.1. Other versions of PyTorch have not been fully tested.

Data preparation

Please prepare the data as follows:

|-DATASET
  |-imagenet
    |-train
    | |-class1
    | | |-img1.jpg
    | | |-img2.jpg
    | | |-...
    | |-class2
    | | |-img3.jpg
    | | |-...
    | |-class3
    | | |-img4.jpg
    | | |-...
    | |-...
    |-val
      |-class1
      | |-img5.jpg
      | |-...
      |-class2
      | |-img6.jpg
      | |-...
      |-class3
      | |-img7.jpg
      | |-...
      |-...
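Before starting a long training job, it can help to confirm that the folders match the layout above. The following is a minimal sketch using only the Python standard library; the function names and the set of accepted image extensions are assumptions made for this example and are not part of the repository.

```python
# Sketch: sanity-check a DATASET/imagenet/{train,val}/<class>/<image> layout.
from pathlib import Path
from typing import Dict

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed set of image extensions


def count_images_per_class(split_dir: Path) -> Dict[str, int]:
    """Map each class sub-directory of a split to its number of image files."""
    counts = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def check_imagenet_layout(root: Path) -> Dict[str, Dict[str, int]]:
    """Verify that root/imagenet/train and root/imagenet/val exist, then count images."""
    summary = {}
    for split in ("train", "val"):
        split_dir = root / "imagenet" / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"missing split directory: {split_dir}")
        summary[split] = count_images_per_class(split_dir)
    return summary
```

Calling check_imagenet_layout(Path("DATASET")) returns the per-class image counts for both splits, or raises FileNotFoundError if a split folder is missing. A class that reports zero images usually points to a misplaced or differently named folder.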

Run

Each experiment is defined by a YAML config file stored under the experiments directory. That directory has a tree structure like this:

experiments
|-{DATASET_A}
| |-{ARCH_A}
| |-{ARCH_B}
|-{DATASET_B}
| |-{ARCH_A}
| |-{ARCH_B}
|-{DATASET_C}
| |-{ARCH_A}
| |-{ARCH_B}
|-...
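Given the tree above, the available experiments can be enumerated with a few lines of standard-library Python. This is only a sketch: the function name is hypothetical, and it assumes every config file sits exactly two levels below the experiments directory and ends in .yaml.

```python
# Sketch: group config files under experiments/{DATASET}/{ARCH}/ by dataset and architecture.
from pathlib import Path
from typing import Dict, List


def list_experiment_configs(experiments_dir: Path) -> Dict[str, Dict[str, List[str]]]:
    """Return {dataset: {arch: [config file names]}} for all *.yaml files found."""
    tree: Dict[str, Dict[str, List[str]]] = {}
    for cfg in sorted(experiments_dir.glob("*/*/*.yaml")):
        dataset, arch = cfg.parts[-3], cfg.parts[-2]
        tree.setdefault(dataset, {}).setdefault(arch, []).append(cfg.name)
    return tree
```

For the config path used later in this README, experiments/imagenet/cvt/cvt-13-224x224.yaml, the result would contain the file name under the keys "imagenet" and "cvt".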

We provide a run.sh script for running jobs on a local machine.

Usage: run.sh [run_options]
Options:
  -g|--gpus <1> - number of gpus to be used
  -t|--job-type <aml> - job type (train|test)
  -p|--port <9000> - master port
  -i|--install-deps - If install dependencies (default: False)

Training on local machine

bash run.sh -g 8 -t train --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml

You can also modify config parameters from the command line. For example, to change the learning rate to 0.1, run:

bash run.sh -g 8 -t train --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml TRAIN.LR 0.1
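The trailing TRAIN.LR 0.1 in the command above is a KEY VALUE pair whose dotted key addresses a nested config entry. The sketch below illustrates that idea on a plain nested dictionary. It is not the repository's implementation; the function names and the default values in the example config are made up for illustration.

```python
# Sketch: apply "SECTION.KEY VALUE" command-line overrides to a nested dict.
import ast
from typing import Any, Dict, List


def parse_value(text: str) -> Any:
    """Interpret a command-line string as a Python literal when possible."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return text  # keep plain strings such as file paths unchanged


def apply_overrides(config: Dict[str, Any], opts: List[str]) -> Dict[str, Any]:
    """Apply KEY VALUE pairs with dotted keys to a nested dict, in place."""
    if len(opts) % 2 != 0:
        raise ValueError("overrides must come in KEY VALUE pairs")
    for key, raw in zip(opts[0::2], opts[1::2]):
        node = config
        *parents, leaf = key.split(".")
        for name in parents:
            node = node[name]  # raises KeyError if the section does not exist
        if leaf not in node:
            raise KeyError(f"unknown config key: {key}")
        node[leaf] = parse_value(raw)
    return config


if __name__ == "__main__":
    # Made-up defaults, standing in for values loaded from a YAML config file.
    cfg = {"TRAIN": {"LR": 0.00025, "BATCH_SIZE_PER_GPU": 256}}
    apply_overrides(cfg, ["TRAIN.LR", "0.1"])
    print(cfg["TRAIN"]["LR"])
```

Rejecting unknown keys, as this sketch does, is a deliberate choice: a misspelled key then fails loudly instead of silently leaving the default in place.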

Notes:

The checkpoint, model, and log files will be saved in OUTPUT/{dataset}/{training config} by default.

Testing pre-trained models

bash run.sh -t test --cfg experiments/imagenet/cvt/cvt-13-224x224.yaml TEST.MODEL_FILE ${PRETRAINED_MODEL_FILE}

Citation

If you find this work or code helpful in your research, please cite:

@article{wu2021cvt,
  title={Cvt: Introducing convolutions to vision transformers},
  author={Wu, Haiping and Xiao, Bin and Codella, Noel and Liu, Mengchen and Dai, Xiyang and Yuan, Lu and Zhang, Lei},
  journal={arXiv preprint arXiv:2103.15808},
  year={2021}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.
