ConvMAE: Masked Convolution Meets Masked Autoencoders

Overview

ConvMAE

ConvMAE: Masked Convolution Meets Masked Autoencoders

Peng Gao1, Teli Ma1, Hongsheng Li2, Jifeng Dai3, Yu Qiao1,

1 Shanghai AI Laboratory, 2 MMLab, CUHK, 3 Sensetime Research.

This repo is the official implementation of ConvMAE: Masked Convolution Meets Masked Autoencoders. It currently concludes codes and models for the following tasks:

ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.

Updates

16/May/2022

The supported codes and models for COCO object detection and instance segmentation are available.

11/May/2022

  1. Pretrained models on ImageNet-1K for ConvMAE.
  2. The supported codes and models for ImageNet-1K finetuning and linear probing are provided.

08/May/2022

The preprint version is public at arxiv.

Introduction

ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme.

  • We present the strong and efficient self-supervised framework ConvMAE, which is easy to implement but show outstanding performances on downstream tasks.
  • ConvMAE naturally generates hierarchical representations and exhibit promising performances on object detection and segmentation.
  • ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On object detection with Mask-RCNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (48.1 vs. 51.7).

tenser

Pretrain on ImageNet-1K

The following table provides pretrained checkpoints and logs used in the paper.

ConvMAE-Base
pretrained checkpoints download
logs download

Main Results on ImageNet-1K

Models #Params(M) Supervision Encoder Ratio Pretrain Epochs FT [email protected](%) LIN [email protected](%) FT logs/weights LIN logs/weights
BEiT 88 DALLE 100% 300 83.0 37.6 - -
MAE 88 RGB 25% 1600 83.6 67.8 - -
SimMIM 88 RGB 100% 800 84.0 56.7 - -
MaskFeat 88 HOG 100% 300 83.6 N/A - -
data2vec 88 RGB 100% 800 84.2 N/A - -
ConvMAE-B 88 RGB 25% 1600 85.0 70.9 log/weight

Main Results on COCO

Mask R-CNN

Models Pretrain Pretrain Epochs Finetune Epochs #Params(M) FLOPs(T) box AP mask AP logs/weights
Swin-B IN21K w/ labels 300 36 109 0.7 51.4 45.4 -
Swin-L IN21K w/ labels 300 36 218 1.1 52.4 46.2 -
MViTv2-B IN21K w/ labels 300 36 73 0.6 53.1 47.4 -
MViTv2-L IN21K w/ labels 300 36 239 1.3 53.6 47.5 -
Benchmarking-ViT-B IN1K w/o labels 1600 100 118 0.9 50.4 44.9 -
Benchmarking-ViT-L IN1K w/o labels 1600 100 340 1.9 53.3 47.2 -
ViTDet IN1K w/o labels 1600 100 111 0.8 51.2 45.5 -
MIMDet-ViT-B IN1K w/o labels 1600 36 127 1.1 51.5 46.0 -
MIMDet-ViT-L IN1K w/o labels 1600 36 345 2.6 53.3 47.5 -
ConvMAE-B IN1K w/o lables 1600 25 104 0.9 53.2 47.1 log/weight

Main Results on ADE20K

UperNet

Models Pretrain Pretrain Epochs Finetune Iters #Params(M) FLOPs(T) mIoU logs/weights
DeiT-B IN1K w/ labels 300 16K 163 0.6 45.6 -
Swin-B IN1K w/ labels 300 16K 121 0.3 48.1 -
MoCo V3 IN1K 300 16K 163 0.6 47.3 -
DINO IN1K 400 16K 163 0.6 47.2 -
BEiT IN1K+DALLE 1600 16K 163 0.6 47.1 -
PeCo IN1K 300 16K 163 0.6 46.7 -
CAE IN1K+DALLE 800 16K 163 0.6 48.8 -
MAE IN1K 1600 16K 163 0.6 48.1 -
ConvMAE-B IN1K 1600 16K 153 0.6 51.7 soon

Main Results on Kinetics-400

Models Pretrain Epochs Finetune Epochs #Params(M) Top1 Top5 logs/weights
VideoMAE-B 200 100 87 77.8
VideoMAE-B 800 100 87 79.4
VideoMAE-B 1600 100 87 79.8
VideoMAE-B 1600 100 (w/ Repeated Aug) 87 80.7 94.7
SpatioTemporalLearner-B 800 150 (w/ Repeated Aug) 87 81.3 94.9
VideoConvMAE-B 200 100 86 80.1 94.3 Soon
VideoConvMAE-B 800 100 86 81.7 95.1 Soon
VideoConvMAE-B-MSD 800 100 86 82.7 95.5 Soon

Main Results on Something-Something V2

Models Pretrain Epochs Finetune Epochs #Params(M) Top1 Top5 logs/weights
VideoMAE-B 200 40 87 66.1
VideoMAE-B 800 40 87 69.3
VideoMAE-B 2400 40 87 70.3
VideoConvMAE-B 200 40 86 67.7 91.2 Soon
VideoConvMAE-B 800 40 86 69.9 92.4 Soon
VideoConvMAE-B-MSD 800 40 86 70.7 93.0 Soon

Getting Started

Prerequisites

  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+

Training and evaluation

Acknowledgement

The pretraining and finetuning of our project are based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation respectively. Thanks for their wonderful work.

License

ConvMAE is released under the MIT License.

Citation

@article{gao2022convmae,
  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.03892},
  year={2022}
}
Owner
Alpha VL Team of Shanghai AI Lab
Alpha VL Team of Shanghai AI Lab
Spam your friends and famly and when you do your famly will disown you and you will have no friends.

SpamBot9000 Spam your friends and family and when you do your family will disown you and you will have no friends. Terms of Use Disclaimer: Please onl

DJ15 0 Jun 09, 2022
Source code of SIGIR2021 Paper 'One Chatbot Per Person: Creating Personalized Chatbots based on Implicit Profiles'

DHAP Source code of SIGIR2021 Long Paper: One Chatbot Per Person: Creating Personalized Chatbots based on Implicit User Profiles . Preinstallation Fir

ZYMa 32 Dec 06, 2022
Learning To Have An Ear For Face Super-Resolution

Learning To Have An Ear For Face Super-Resolution [Project Page] This repository contains demo code of our CVPR2020 paper. Training and evaluation on

50 Nov 16, 2022
Replication Package for "An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Datasets"

Replication Package for "An Empirical Study of the Effectiveness of an Ensemble of Stand-alone Sentiment Detection Tools for Software Engineering Data

2 Oct 06, 2022
Speed-Test - You can check your intenet speed using this tool

Speed-Test Tool By Hez_X AVAILABLE ON : Termux & Kali linux & Ubuntu (Linux E

Hez-X 3 Feb 17, 2022
A C implementation for creating 2D voronoi diagrams

Branch OSX/Linux Windows master dev jc_voronoi A fast C/C++ header only implementation for creating 2D Voronoi diagrams from a point set Uses Fortune'

Mathias Westerdahl 481 Dec 29, 2022
Self-Supervised Pre-Training for Transformer-Based Person Re-Identification

Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [pdf] The official repository for Self-Supervised Pre-Training for Transfo

Hao Luo 116 Jan 04, 2023
Code for the bachelors-thesis flaky fault localization

Flaky_Fault_Localization Scripts for the Bachelors-Thesis: "Flaky Fault Localization" by Christian Kasberger. The thesis examines the usefulness of sp

Christian Kasberger 1 Oct 26, 2021
Multi-modal co-attention for drug-target interaction annotation and Its Application to SARS-CoV-2

CoaDTI Multi-modal co-attention for drug-target interaction annotation and Its Application to SARS-CoV-2 Abstract Environment The test was conducted i

Layne_Huang 7 Nov 14, 2022
Audio Visual Emotion Recognition using TDA

Audio Visual Emotion Recognition using TDA RAVDESS database with two datasets analyzed: Video and Audio dataset: Audio-Dataset: https://www.kaggle.com

Combinatorial Image Analysis research group 3 May 11, 2022
Official repository for Hierarchical Opacity Propagation for Image Matting

HOP-Matting Official repository for Hierarchical Opacity Propagation for Image Matting 🚧 🚧 🚧 Under Construction 🚧 🚧 🚧 🚧 🚧 🚧   Coming Soon   

Li Yaoyi 54 Dec 30, 2021
Lightweight Cuda Renderer with Python Wrapper.

pyRender Lightweight Cuda Renderer with Python Wrapper. Compile Change compile.sh line 5 to the glm library include path. This library can be download

Jingwei Huang 53 Dec 02, 2022
hySLAM is a hybrid SLAM/SfM system designed for mapping

HySLAM Overview hySLAM is a hybrid SLAM/SfM system designed for mapping. The system is based on ORB-SLAM2 with some modifications and refactoring. Raú

Brian Hopkinson 15 Oct 10, 2022
VOLO: Vision Outlooker for Visual Recognition

VOLO: Vision Outlooker for Visual Recognition, arxiv This is a PyTorch implementation of our paper. We present Vision Outlooker (VOLO). We show that o

Sea AI Lab 876 Dec 09, 2022
Pytorch implementation of paper "Efficient Nearest Neighbor Language Models" (EMNLP 2021)

Pytorch implementation of paper "Efficient Nearest Neighbor Language Models" (EMNLP 2021)

Junxian He 57 Jan 01, 2023
This repository contains part of the code used to make the images visible in the article "How does an AI Imagine the Universe?" published on Towards Data Science.

Generative Adversarial Network - Generating Universe This repository contains part of the code used to make the images visible in the article "How doe

Davide Coccomini 9 Dec 18, 2022
Continuous Security Group Rule Change Detection & Response at scale

Introduction Get notified of Security Group Changes across all AWS Accounts & Regions in an AWS Organization, with the ability to respond/revert those

Raajhesh Kannaa Chidambaram 3 Aug 13, 2022
Classify bird species based on their songs using SIamese Networks and 1D dilated convolutions.

The goal is to classify different birds species based on their songs/calls. Spectrograms have been extracted from the audio samples and used as features for classification.

Aditya Dutt 9 Dec 27, 2022
Stochastic Tensor Optimization for Robot Motion - A GPU Robot Motion Toolkit

STORM Stochastic Tensor Optimization for Robot Motion - A GPU Robot Motion Toolkit [Install Instructions] [Paper] [Website] This package contains code

NVIDIA Research Projects 101 Dec 12, 2022
The codes and related files to reproduce the results for Image Similarity Challenge Track 1.

ISC-Track1-Submission The codes and related files to reproduce the results for Image Similarity Challenge Track 1. Required dependencies To begin with

Wenhao Wang 115 Jan 02, 2023