MPViT: Multi-Path Vision Transformer for Dense Prediction

Overview

This repository includes the official implementation and model weights for MPViT.

[arXiv] [BibTeX]

MPViT: Multi-Path Vision Transformer for Dense Prediction
🏛️ 🏫 Youngwan Lee, 🏛️ Jonghee Kim, 🏫 Jeff Willette, 🏫 Sung Ju Hwang
ETRI 🏛️, KAIST 🏫

Abstract

We explore multi-scale patch embedding and multi-path structure, constructing the Multi-Path Vision Transformer (MPViT). MPViT embeds features of the same size (i.e., sequence length) with patches of different scales simultaneously by using overlapping convolutional patch embedding. Tokens of different scales are then independently fed into the Transformer encoders via multiple paths, and the resulting features are aggregated, enabling both fine and coarse feature representations at the same feature level. Thanks to the diverse, multi-scale feature representations, our MPViTs, scaling from Tiny (5M) to Base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers on ImageNet classification, object detection, instance segmentation, and semantic segmentation. These extensive results demonstrate that MPViT can serve as a versatile backbone network for various vision tasks.
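The claim that patches of different scales yield the same sequence length can be checked with the standard convolution output-size formula. The sketch below is only an illustration, not the repository's code: the kernel sizes (3, 5, 7), the stride of 4, and the helper names `conv_output_size` and `token_counts` are assumptions chosen for the example. It shows that when every path uses the same stride and a padding of `(k - 1) // 2`, each kernel size produces the same number of tokens.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a convolution along one spatial axis (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def token_counts(image_size: int, kernels, stride: int) -> dict:
    """Map each kernel size to the number of tokens its patch embedding yields."""
    counts = {}
    for k in kernels:
        # Assumed padding rule: keeps the output grid identical for odd kernels.
        padding = (k - 1) // 2
        side = conv_output_size(image_size, k, stride, padding)
        counts[k] = side * side  # square input, so tokens = side * side
    return counts


if __name__ == "__main__":
    # Illustrative values only: a 224x224 input, three patch scales, stride 4.
    print(token_counts(224, kernels=(3, 5, 7), stride=4))
    # prints {3: 3136, 5: 3136, 7: 3136}
```

Because the token grids match, the per-path features can be aggregated position by position, which is what allows fine and coarse representations to coexist at one feature level.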

Main results on ImageNet-1K

🚀 All of these models are trained on ImageNet-1K with the same training recipe as DeiT and CoaT.

| model | resolution | acc@1 | #params | FLOPs | weight |
|---|---|---|---|---|---|
| MPViT-T | 224x224 | 78.2 | 5.8M | 1.6G | weight |
| MPViT-XS | 224x224 | 80.9 | 10.5M | 2.9G | weight |
| MPViT-S | 224x224 | 83.0 | 22.8M | 4.7G | weight |
| MPViT-B | 224x224 | 84.3 | 74.8M | 16.4G | weight |

Main results on COCO object detection

🚀 All models are trained using ImageNet-1K pretrained weights.

☀️ MS denotes the same multi-scale training augmentation as in Swin-Transformer, which in turn follows the MS augmentation used in DETR and Sparse-RCNN. We therefore also follow the official implementations of DETR and Sparse-RCNN, which are likewise based on Detectron2.

Please refer to detectron2/ for the details.

| Backbone | Method | lr Schd | box mAP | mask mAP | #params | FLOPs | weight |
|---|---|---|---|---|---|---|---|
| MPViT-T | RetinaNet | 1x | 41.8 | - | 17M | 196G | model / metrics |
| MPViT-XS | RetinaNet | 1x | 43.8 | - | 20M | 211G | model / metrics |
| MPViT-S | RetinaNet | 1x | 45.7 | - | 32M | 248G | model / metrics |
| MPViT-B | RetinaNet | 1x | 47.0 | - | 85M | 482G | model / metrics |
| MPViT-T | RetinaNet | MS+3x | 44.4 | - | 17M | 196G | model / metrics |
| MPViT-XS | RetinaNet | MS+3x | 46.1 | - | 20M | 211G | model / metrics |
| MPViT-S | RetinaNet | MS+3x | 47.6 | - | 32M | 248G | model / metrics |
| MPViT-B | RetinaNet | MS+3x | 48.3 | - | 85M | 482G | model / metrics |
| MPViT-T | Mask R-CNN | 1x | 42.2 | 39.0 | 28M | 216G | model / metrics |
| MPViT-XS | Mask R-CNN | 1x | 44.2 | 40.4 | 30M | 231G | model / metrics |
| MPViT-S | Mask R-CNN | 1x | 46.4 | 42.4 | 43M | 268G | model / metrics |
| MPViT-B | Mask R-CNN | 1x | 48.2 | 43.5 | 95M | 503G | model / metrics |
| MPViT-T | Mask R-CNN | MS+3x | 44.8 | 41.0 | 28M | 216G | model / metrics |
| MPViT-XS | Mask R-CNN | MS+3x | 46.6 | 42.3 | 30M | 231G | model / metrics |
| MPViT-S | Mask R-CNN | MS+3x | 48.4 | 43.9 | 43M | 268G | model / metrics |
| MPViT-B | Mask R-CNN | MS+3x | 49.5 | 44.5 | 95M | 503G | model / metrics |

Deformable-DETR

All models are trained using the same training recipe.

Please refer to deformable_detr/ for the details.

| backbone | box mAP | epochs | link |
|---|---|---|---|
| ResNet-50 | 44.5 | 50 | - |
| CoaT-lite S | 47.0 | 50 | link |
| CoaT-S | 48.4 | 50 | link |
| MPViT-S | 49.0 | 50 | link |

Main results on ADE20K semantic segmentation

All models are trained using ImageNet-1K pretrained weights.

Please refer to semantic_segmentation/ for the details.

| Backbone | Method | Crop Size | Lr Schd | mIoU | #params | FLOPs | weight |
|---|---|---|---|---|---|---|---|
| MPViT-S | UperNet | 512x512 | 160K | 48.3 | 52M | 943G | weight |
| MPViT-B | UperNet | 512x512 | 160K | 50.3 | 105M | 1185G | weight |

Getting Started

We use pytorch==1.7.0, torchvision==0.8.1, and cuda==10.1 on NVIDIA V100 GPUs. If you use a different version of cuda, you may obtain slightly different accuracies, but the differences are negligible.
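To see whether a local environment matches the versions listed above, a small check along these lines may help. It is a minimal sketch and not part of the repository: the helper names (`base_version`, `find_mismatches`, `installed_versions`) are hypothetical, and it assumes the packages were installed under the pip names `torch` and `torchvision`. CUDA is not a pip package, so it is not checked here.

```python
from importlib import metadata

# Versions stated in this README (pip package names are an assumption).
EXPECTED = {"torch": "1.7.0", "torchvision": "0.8.1"}


def base_version(version: str) -> str:
    """Drop a local build tag such as '+cu101' from a version string."""
    return version.split("+", 1)[0]


def find_mismatches(installed: dict, expected: dict = EXPECTED) -> dict:
    """Return {package: (installed, expected)} for every package that differs."""
    mismatches = {}
    for name, wanted in expected.items():
        found = installed.get(name)
        if found is None or base_version(found) != wanted:
            mismatches[name] = (found, wanted)
    return mismatches


def installed_versions(names) -> dict:
    """Look up installed versions; packages that are missing map to None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # An empty dict means the environment matches the versions listed above.
    print(find_mismatches(installed_versions(EXPECTED)))
```

A non-empty result does not mean the code will fail; it only flags where reproduced accuracies might deviate slightly from the reported ones.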

Acknowledgement

This repository is built on the Timm library and the DeiT, CoaT, Detectron2, and mmsegmentation repositories.

This work was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. 2020-0-00004, Development of Previsional Intelligence based on Long-term Visual Memory Network and No. 2014-3-00123, Development of High Performance Visual BigData Discovery Platform for Large-Scale Realtime Data Analysis).

License

Please refer to MPViT LSA.

Citing MPViT

@article{lee2021mpvit,
      title={MPViT: Multi-Path Vision Transformer for Dense Prediction}, 
      author={Youngwan Lee and Jonghee Kim and Jeff Willette and Sung Ju Hwang},
      year={2021},
      journal={arXiv preprint arXiv:2112.11010}
}