PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

Last update: Oct 30, 2022

MAE for Self-supervised ViT

Introduction

This is an unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT.

This repo is mainly based on moco-v3, pytorch-image-models, and BEiT.

TODO

Main Results

The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning.

ViT-Base

pretrain epochs	with pixel-norm	linear acc	fine-tuning acc
100	False	--	75.58 [1]
100	True	--	77.19
800	True	--	--

On 8 NVIDIA GeForce RTX 3090 GPUs, pre-training for 100 epochs takes about 9 hours; a batch size of 4096 needs about 24 GB of GPU memory.

[1] Fine-tuned for 50 epochs.

ViT-Large

pretrain epochs	with pixel-norm	linear acc	fine-tuning acc
100	False	--	--
100	True	--	--

On 8 NVIDIA A40 GPUs, pre-training for 100 epochs takes about 34 hours; a batch size of 4096 needs about xx GB of GPU memory.

Usage: Preparation

The code has been tested with CUDA 11.4, PyTorch 1.8.2.

Notes:

The batch size specified by -b is the total batch size across all GPUs from all nodes.
The learning rate specified by --lr is the base lr (corresponding to a batch size of 256); the effective lr is obtained by the linear lr scaling rule: lr = base lr × total batch size / 256.
In this repo, only multi-gpu, DistributedDataParallel training is supported; single-gpu or DataParallel training is not supported. This code is improved to better suit the multi-node setting, and by default uses automatic mixed-precision for pre-training.
Only pretraining and finetuning have been tested.
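
The batch-size and learning-rate conventions in the notes above can be sketched as follows. This is a minimal illustration, not code taken from this repo: the function names scaled_lr and per_gpu_batch_size are hypothetical, the base lr of 1.5e-4 is only an example value, and the even split of the total batch across GPUs is an assumption.

```python
def scaled_lr(base_lr: float, total_batch_size: int) -> float:
    """Linear lr scaling rule: the base lr is defined for a batch size of 256."""
    return base_lr * total_batch_size / 256


def per_gpu_batch_size(total_batch_size: int, num_nodes: int, gpus_per_node: int) -> int:
    """Split the total batch size (the value passed via -b) evenly across all GPUs.

    Assumption: the total is divided evenly over every GPU on every node.
    """
    num_gpus = num_nodes * gpus_per_node
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_batch_size // num_gpus


if __name__ == "__main__":
    # Example values only: base lr 1.5e-4, total batch 4096, one node with 8 GPUs.
    print(scaled_lr(1.5e-4, 4096))
    print(per_gpu_batch_size(4096, num_nodes=1, gpus_per_node=8))
```

With these example values, each of the 8 GPUs would process 512 samples per step, and the effective lr is 16 times the base lr.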

Usage: Self-supervised Pre-Training

Below are examples of MAE pre-training.

ViT-Base with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, batch 4096

python main_mae.py \
  -c cfgs/ViT-B16_ImageNet1K_pretrain.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

sh train_mae.sh

ViT-Large with 1-node (8-GPU, NVIDIA A40) pre-training, batch 2048

python main_mae.py \
  -c cfgs/ViT-L16_ImageNet1K_pretrain.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

Usage: End-to-End Fine-tuning ViT

Below are examples of MAE fine-tuning.

ViT-Base with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, batch 1024

python main_fintune.py \
  -c cfgs/ViT-B16_ImageNet1K_finetune.yaml \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

ViT-Large with 2-node (16-GPU, 8 NVIDIA GeForce RTX 3090 + 8 NVIDIA A40) training, batch 512

python main_fintune.py \
  -c cfgs/ViT-B16_ImageNet1K_finetune.yaml \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On another node, run the same command with --rank 1.
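
How --world-size and --rank relate to the individual GPU processes can be sketched as below. This assumes the repo follows the convention of the official PyTorch ImageNet example (which moco-v3 builds on), where --multiprocessing-distributed starts one process per GPU, --world-size is the number of nodes, and --rank is the index of the node; the function names are hypothetical and do not come from this repo.

```python
def total_processes(num_nodes: int, gpus_per_node: int) -> int:
    """Number of distributed processes when one process is started per GPU."""
    return num_nodes * gpus_per_node


def global_rank(node_rank: int, gpus_per_node: int, local_gpu: int) -> int:
    """Global rank of the process driving GPU `local_gpu` on node `node_rank`."""
    return node_rank * gpus_per_node + local_gpu


if __name__ == "__main__":
    # Two nodes with 8 GPUs each, as in the 2-node example above.
    gpus_per_node = 8
    for node_rank in (0, 1):
        ranks = [global_rank(node_rank, gpus_per_node, g) for g in range(gpus_per_node)]
        print(f"node {node_rank}: global ranks {ranks}")
    print("processes in total:", total_processes(2, gpus_per_node))
```

Under this assumption, the node started with --rank 0 hosts global ranks 0 to 7 and the node started with --rank 1 hosts global ranks 8 to 15.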

Note:

We use --resume rather than the --finetune option from the DeiT repo, because its --finetune option trains in eval mode. When loading the pre-trained model, change the loading call to pass strict=False, i.e. model_without_ddp.load_state_dict(checkpoint['model'], strict=False).

[TODO] Usage: Linear Classification

By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

If you use the code of this repo, please cite the original paper and this repo:

@Article{he2021mae,
  author  = {Kaiming He* and Xinlei Chen* and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  journal = {arXiv preprint arXiv:2111.06377},
  year    = {2021},
}

@misc{yang2021maepriv,
  author       = {Lu Yang* and Pu Cao* and Yang Nie and Qing Song},
  title        = {MAE-priv},
  howpublished = {\url{https://github.com/BUPT-PRIV/MAE-priv}},
  year         = {2021},
}
