We will release the code of "ConTNet: Why not use convolution and transformer at the same time?" in this repo

Last update: Nov 08, 2022

ConTNet

Introduction

ConTNet (Convolution-Transformer Network) is proposed mainly in response to two issues: (1) ConvNets lack a large receptive field, which limits their performance on downstream tasks. (2) Transformer-based models are not robust enough and require special training settings or hundreds of millions of images for pretraining, which limits their adoption. ConTNet alternates convolution and transformer layers. It is robust and can be optimized like ResNet, unlike recently proposed transformer-based models (e.g., ViT, DeiT), which are sensitive to hyper-parameters and need many tricks when trained from scratch on a midsize dataset (e.g., ImageNet).
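The idea of alternating the two operators can be illustrated with a minimal PyTorch sketch. This is not the released ConTNet code: the class name `AlternatingBlock`, the layer sizes, the use of `nn.TransformerEncoderLayer` over all spatial positions, and the residual 3x3 convolution are assumptions made only for illustration.

```python
import torch
from torch import nn


class AlternatingBlock(nn.Module):
    """Hypothetical block: self-attention over spatial positions, then a 3x3 conv.

    Illustrative only; the layer layout is an assumption, not the paper's block.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        # Transformer encoder layer that treats each spatial position as a token.
        self.encoder = nn.TransformerEncoderLayer(
            d_model=channels,
            nhead=num_heads,
            dim_feedforward=2 * channels,
            dropout=0.0,
            batch_first=True,
        )
        # ResNet-style convolutional stage applied to the re-assembled feature map.
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        tokens = self.encoder(tokens)                   # global mixing via attention
        x = tokens.transpose(1, 2).reshape(b, c, h, w)  # back to (B, C, H, W)
        return x + self.conv(x)                         # local mixing with a residual


block = AlternatingBlock(channels=16, num_heads=4)
features = torch.randn(2, 16, 8, 8)
out = block(features)
print(out.shape)
```

The block keeps the input shape, so several such blocks could be stacked between downsampling stages in the way ResNet stacks its residual blocks.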

Main Results on ImageNet

name	resolution	acc@1	#params(M)	FLOPs(G)
Res-18	224x224	71.5	11.7	1.8
ConT-S	224x224	74.9	10.1	1.5
Res-50	224x224	77.1	25.6	4.0
ConT-M	224x224	77.6	19.2	3.1
Res-101	224x224	78.2	44.5	7.6
ConT-B	224x224	77.9	39.6	6.4
DeiT-Ti^*	224x224	72.2	5.7	1.3
ConT-Ti^*	224x224	74.9	5.8	0.8
Res-18^*	224x224	73.2	11.7	1.8
ConT-S^*	224x224	76.5	10.1	1.5
Res-50^*	224x224	78.6	25.6	4.0
DeiT-S^*	224x224	79.8	22.1	4.6
ConT-M^*	224x224	80.2	19.2	3.1
Res-101^*	224x224	80.0	44.5	7.6
DeiT-B^*	224x224	81.8	86.6	17.6
ConT-B^*	224x224	81.8	39.6	6.4

Note: ^* indicates training with strong augmentations.
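To read the table as accuracy/cost trade-offs, the short sketch below computes the accuracy difference and the relative savings in parameters and FLOPs for two pairs of rows trained with strong augmentations. The numbers are copied from the table; the helper name `compare` and the dictionary layout are hypothetical.

```python
# (top-1 accuracy %, params in M, FLOPs in G), copied from the ^* rows of the table
RESULTS = {
    "Res-50": (78.6, 25.6, 4.0),
    "ConT-M": (80.2, 19.2, 3.1),
    "DeiT-B": (81.8, 86.6, 17.6),
    "ConT-B": (81.8, 39.6, 6.4),
}


def compare(model: str, baseline: str) -> dict:
    """Return the accuracy gain and the relative cost savings of model vs. baseline."""
    acc, params, flops = RESULTS[model]
    base_acc, base_params, base_flops = RESULTS[baseline]
    return {
        "acc_gain": round(acc - base_acc, 1),
        "params_saved_pct": round(100 * (1 - params / base_params), 1),
        "flops_saved_pct": round(100 * (1 - flops / base_flops), 1),
    }


cont_m_vs_res50 = compare("ConT-M", "Res-50")
cont_b_vs_deit_b = compare("ConT-B", "DeiT-B")
print(cont_m_vs_res50)
print(cont_b_vs_deit_b)
```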

Main Results on Downstream Tasks

Object detection results on COCO.

method	backbone	#params(M)	FLOPs(G)	AP	APs	APm	APl
RetinaNet	Res-50	32.0	235.6	36.5	20.4	40.3	48.1
RetinaNet	ConTNet-M	27.0	217.2	37.9	23.0	40.6	50.4
FCOS	Res-50	32.2	242.9	38.7	22.9	42.5	50.1
FCOS	ConTNet-M	27.2	228.4	40.8	25.1	44.6	53.0
Faster R-CNN	Res-50	41.5	241.0	37.4	21.2	41.0	48.1
Faster R-CNN	ConTNet-M	36.6	225.6	40.0	25.4	43.0	52.0

Instance segmentation results on Cityscapes based on Mask R-CNN.

backbone	AP^bb	AP_s^bb	AP_m^bb	AP_l^bb	AP^mk	AP_s^mk	AP_m^mk	AP_l^mk
Res-50	38.2	21.9	40.9	49.5	34.7	18.3	37.4	47.2
ConT-M	40.5	25.1	44.4	52.7	38.1	20.9	41.0	50.3

Semantic segmentation results on Cityscapes.

model	mIOU
PSP-Res50	77.12
PSP-ConTM	78.28

Bib Citing

@article{yan2021contnet,
    title={ConTNet: Why not use convolution and transformer at the same time?},
    author={Haotian Yan and Zhe Li and Weijian Li and Changhu Wang and Ming Wu and Chuang Zhang},
    year={2021},
    journal={arXiv preprint arXiv:2104.13497}
}
