yan-hao-tian/ConTNet

ConTNet

Introduction

ConTNet (Convolution-Transformer Network) is a neural network built by stacking convolutional layers and transformers alternately. The architecture is proposed in response to two issues: (1) the receptive field of convolution is limited by a local window (3x3), which potentially impairs the performance of ConvNets on downstream tasks; (2) Transformer-based models suffer from insufficient robustness, so training requires multiple training tricks and heavy regularization. ConTNet alleviates these drawbacks by combining convolution and transformer. The motivation can be understood from two perspectives. From the view of ConvNets, a transformer sub-layer is inserted between any two conv layers to enhance the non-local interactions of the ConvNet. From the view of Transformers, the convolution layers reintroduce the inductive bias whose absence is a cause of under-fitting. In numerical experiments, ConTNet achieves competitive performance on image recognition and downstream tasks. More notably, ConTNet can be optimized easily, even in the same way as ResNet.
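To make issue (1) concrete, the sketch below computes the receptive field of a plain stack of convolutions using the standard recurrence (each layer adds `(kernel_size - 1)` times the accumulated stride). The function name `receptive_field` and the layer configurations are illustrative assumptions, not code or settings from this repo.

```python
def receptive_field(layers):
    """Return the receptive field, in input pixels, of a stack of conv layers.

    `layers` is a sequence of (kernel_size, stride) pairs, first layer first.
    Hypothetical helper for illustration only; not part of the ConTNet code.
    """
    rf = 1    # before any layer, one unit sees exactly one input pixel
    jump = 1  # spacing, in input pixels, between neighbouring units of the current map
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Four stacked 3x3, stride-1 convolutions cover only a 9x9 input window.
print(receptive_field([(3, 1)] * 4))
```

The growth is linear in depth for stride-1 layers, which is why a sub-layer with non-local interactions can widen the context much faster than adding more 3x3 convolutions.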


Training & Validation with this Repo

The following is an example of single-machine, multi-GPU training.

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 main.py --arch ConT-M --batch_size 256 --save_path debug_trial_cont_m --save_best True 

To validate a model, add the --eval argument followed by the checkpoint path.

CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --master_port 29501 main.py --arch ConT-M --batch_size 256 --save_path debug_trial --eval ./debug_trial_cont_m/checkpoint_bestTop1.pth

To resume training from a checkpoint, add the --resume argument followed by the checkpoint path.

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m torch.distributed.launch --nproc_per_node=4 --master_port 29501 main.py --arch ConT-M --batch_size 256 --save_path debug_trial --save_best True --resume ./debug_trial_cont_m/checkpoint_bestTop1.pth
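When several runs are scripted, the three command lines above can be assembled programmatically. The helper below is a minimal sketch: `build_launch_command` and its parameter names are hypothetical, and it only builds the command string from the flags shown above without running anything.

```python
import shlex


def build_launch_command(gpu_ids, arch, batch_size, save_path,
                         master_port=29501, save_best=False,
                         eval_ckpt=None, resume_ckpt=None):
    """Build a launch command string using only the flags shown in this README.

    Hypothetical helper: one process is started per GPU listed in `gpu_ids`.
    """
    argv = [
        "python3", "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpu_ids)}",
        "--master_port", str(master_port),
        "main.py",
        "--arch", arch,
        "--batch_size", str(batch_size),
        "--save_path", save_path,
    ]
    if save_best:
        argv += ["--save_best", "True"]
    if eval_ckpt is not None:      # validation mode
        argv += ["--eval", eval_ckpt]
    if resume_ckpt is not None:    # continue training from a checkpoint
        argv += ["--resume", resume_ckpt]
    env = "CUDA_VISIBLE_DEVICES=" + ",".join(str(g) for g in gpu_ids)
    return env + " " + shlex.join(argv)


# Reproduces the multi-GPU training command above.
print(build_launch_command([0, 1, 2, 3], "ConT-M", 256,
                           "debug_trial_cont_m", save_best=True))
```

Passing `eval_ckpt="./debug_trial_cont_m/checkpoint_bestTop1.pth"` with `gpu_ids=[0]` yields the single-GPU validation form.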

Pretrained Weights on ImageNet

ImageNet-pretrained weights are available from Google Drive or Baidu Cloud (the extraction code is 3k3s).

Main Results on ImageNet

name resolution acc@1 #params(M) FLOPs(G) model
Res-18 224x224 71.5 11.7 1.8
ConT-S 224x224 74.9 10.1 1.5
Res-50 224x224 77.1 25.6 4.0
ConT-M 224x224 77.6 19.2 3.1
Res-101 224x224 78.2 44.5 7.6
ConT-B 224x224 77.9 39.6 6.4
DeiT-Ti* 224x224 72.2 5.7 1.3
ConT-Ti* 224x224 74.9 5.8 0.8
Res-18* 224x224 73.2 11.7 1.8
ConT-S* 224x224 76.5 10.1 1.5
Res-50* 224x224 78.6 25.6 4.0
DeiT-S* 224x224 79.8 22.1 4.6
ConT-M* 224x224 80.2 19.2 3.1
Res-101* 224x224 80.0 44.5 7.6
DeiT-B* 224x224 81.8 86.6 17.6
ConT-B* 224x224 81.8 39.6 6.4

Note: * indicates training with strong augmentations (auto-augmentation and mixup).
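For readers unfamiliar with mixup, the sketch below shows the usual formulation: two samples and their one-hot labels are blended with a weight drawn from a Beta(alpha, alpha) distribution. This is a generic illustration on plain Python lists. The function name, the `alpha=0.2` default, and the list-based inputs are assumptions and are not taken from this repo's training code.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (input, one-hot label) pairs with a Beta(alpha, alpha) weight.

    Generic mixup sketch; `alpha=0.2` is an assumed value, not the repo's setting.
    """
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy "images" of four pixels each, with one-hot labels for two classes.
x, y, lam = mixup([0.0, 0.0, 1.0, 1.0], [1.0, 0.0],
                  [1.0, 1.0, 0.0, 0.0], [0.0, 1.0])
print(lam, x, y)
```

The mixed label remains a valid probability vector, so it can be used directly as a soft target for a cross-entropy loss.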

Main Results on Downstream Tasks

Object detection results on COCO.

method backbone #params(M) FLOPs(G) AP APs APm APl
RetinaNet Res-50 32.0 235.6 36.5 20.4 40.3 48.1
RetinaNet ConTNet-M 27.0 217.2 37.9 23.0 40.6 50.4
FCOS Res-50 32.2 242.9 38.7 22.9 42.5 50.1
FCOS ConTNet-M 27.2 228.4 40.8 25.1 44.6 53.0
Faster-RCNN Res-50 41.5 241.0 37.4 21.2 41.0 48.1
Faster-RCNN ConTNet-M 36.6 225.6 40.0 25.4 43.0 52.0

Instance segmentation results on Cityscapes based on Mask-RCNN.

backbone APbb APsbb APmbb APlbb APmk APsmk APmmk APlmk
Res-50 38.2 21.9 40.9 49.5 34.7 18.3 37.4 47.2
ConT-M 40.5 25.1 44.4 52.7 38.1 20.9 41.0 50.3

Semantic segmentation results on Cityscapes.

model mIOU
PSP-Res50 77.12
PSP-ConTM 78.28
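The mIOU column is the mean intersection-over-union across classes, reported here as a percentage. A minimal sketch of the standard computation from a confusion matrix follows. The function name `mean_iou` and the toy matrix are illustrative assumptions, and the exact evaluation protocol used for the numbers above is not shown in this README.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[t][p] counts pixels of true class t predicted as class p.
    Classes that never appear (no TP, FP, or FN) are skipped.
    Illustrative helper only; not the evaluation code behind the table.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c pixels missed
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other pixels labelled c
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy two-class example; multiply by 100 to compare with the table's scale.
print(100 * mean_iou([[3, 1], [1, 5]]))
```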

Citation

@article{yan2021contnet,
    title={ConTNet: Why not use convolution and transformer at the same time?},
    author={Haotian Yan and Zhe Li and Weijian Li and Changhu Wang and Ming Wu and Chuang Zhang},
    year={2021},
    journal={arXiv preprint arXiv:2104.13497}
}

About

This repo contains the code of "ConTNet: Why not use convolution and transformer at the same time?"
