
MAE for Self-supervised ViT

Introduction

This is an unofficial PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners for self-supervised ViT. This repo is mainly based on moco-v3, pytorch-image-models, BEiT and MAE-pytorch.

[Figure: MAE framework]

TODO

  • visualization of reconstructed images
  • linear probing
  • k-NN classification
  • more results
  • more datasets
  • transfer learning for detection and segmentation
  • multi-node training
  • ...

Main Results

We support two representations (repre.) for classification: GAP (global average pooling) and cls-token. According to the paper, MAE works similarly well with both. In cls-token mode, the cls-token is trained in the MAE encoder.
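The two modes differ only in which encoder tokens feed the classification head. A minimal sketch of the difference (shapes are illustrative for ViT-S; this is not this repo's exact code):

```python
import torch

# Encoder output: (batch, 1 + num_patches, dim); index 0 is the cls-token.
tokens = torch.randn(8, 197, 384)  # e.g. ViT-S, 224x224 input, 16x16 patches

# GAP: average the patch tokens only.
gap_feature = tokens[:, 1:, :].mean(dim=1)  # (8, 384)

# Cls-token: take the first token as the image representation.
cls_feature = tokens[:, 0, :]               # (8, 384)
```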

For k-NN evaluation, we use k=10 by default.
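A sketch of such a k-NN evaluation over frozen encoder features (cosine similarity and a plain majority vote are assumptions about the protocol, not necessarily this repo's exact settings):

```python
import torch
import torch.nn.functional as F

def knn_classify(train_feats, train_labels, test_feats, k=10):
    # L2-normalize so the dot product equals cosine similarity.
    train_feats = F.normalize(train_feats, dim=1)
    test_feats = F.normalize(test_feats, dim=1)
    sims = test_feats @ train_feats.t()        # (num_test, num_train)
    nearest = sims.topk(k, dim=1).indices      # k most similar train samples
    neighbor_labels = train_labels[nearest]    # (num_test, k)
    return neighbor_labels.mode(dim=1).values  # majority vote

# Toy usage with random features.
preds = knn_classify(torch.randn(100, 384), torch.randint(0, 10, (100,)),
                     torch.randn(5, 384), k=10)
```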

ViT-Small

| pretrain epochs | repre. | ft. top-1 | lin. | k-NN | config | weight | log |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 100 | GAP | 76.58% | 34.65% | 19.7% | pretrain / finetune | pretrain / finetune | pretrain / finetune |
| 100 | Cls-token | 75.77% | 38.95% | 23.7% | pretrain / finetune | pretrain / finetune | pretrain / finetune |
| 200 | GAP | 76.86% | 36.46% | 19.8% | pretrain / finetune | pretrain / finetune | pretrain / finetune |
| 400 | GAP | 77.56% / 80.02% / 80.89% | 36.98% | 20.8% | pretrain / finetune | pretrain / finetune | pretrain / finetune |
| 800 | GAP | 77.93% / 80.87% / 81.11% | 36.88% | 20.7% | pretrain / finetune | pretrain / finetune | pretrain / finetune |
| 1600 | GAP | - | - | - | pretrain / finetune | pretrain / finetune | pretrain / finetune |
  • We fine-tune models for 50 epochs by default. For 400- and 800-epoch pre-training, we also report fine-tuning for 50 / 100 / 150 epochs (the provided logs and weights are for the 50-epoch runs).
  • BaiduNetdisk (extraction code: 2lt1)

ViT-Base

| pretrain epochs | repre. | ft. top-1 | k-NN | config | weight | log |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 400 | GAP | 83.08% | 28.9% | pretrain / finetune | pretrain / finetune | pretrain / finetune |
  • Following the paper, we fine-tune models for 100 epochs by default.
  • BaiduNetdisk (extraction code: k2ef)

ViT-Large

| pretrain epochs | repre. | ft. top-1 | lin. | k-NN | config | weight | log |
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
| 100 | GAP | 83.51% | 58.90% | 33.08% | pretrain / finetune | pretrain / finetune | pretrain / finetune |
  • Following the paper, we fine-tune models for 50 epochs by default.
  • BaiduNetdisk (extraction code: 825g)

Usage

Preparation

The code has been tested with CUDA 11.4 and PyTorch 1.8.2.

Notes:

  1. The batch size specified by -b is the batch size per GPU.
  2. The learning rate specified by --lr is the base lr (corresponding to a total batch size of 256); it is adjusted by the linear lr scaling rule, as the snippet after this list shows.
  3. Only multi-GPU DistributedDataParallel training is supported; single-GPU and DataParallel training are not.
  4. We support cls-token (token) and global average pooling (GAP) for classification. Make sure fine-tuning or linear probing uses the same representation as pre-training. In cls-token mode, the cls-token is trained in the encoder during pre-training.
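For example, the effective learning rate under the linear scaling rule (the values below are illustrative):

```python
# Linear lr scaling rule: the effective lr grows linearly with the
# total batch size, relative to a reference batch size of 256.
base_lr = 1.5e-4                  # passed via --lr (illustrative value)
batch_per_gpu, num_gpus = 512, 8  # -b is per GPU, so total = 512 * 8 = 4096
total_batch = batch_per_gpu * num_gpus
effective_lr = base_lr * total_batch / 256  # 1.5e-4 * 16 = 2.4e-3
```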

Self-supervised Pre-Training

Below are examples of MAE pre-training.

ViT-Small with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, batch size 4096, GAP.

sh run_pretrain.sh \
	--config cfgs/pretrain/Vit-S_100E_GAP.yaml \
	--data_path /path/to/train/data
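
For reference, the random masking at the core of MAE pre-training keeps only a small fraction of patch tokens for the encoder. A minimal sketch following the paper's shuffle-and-keep scheme (not this repo's exact implementation):

```python
import torch

def random_masking(x, mask_ratio=0.75):
    # x: patch tokens of shape (batch, num_patches, dim).
    n, num_patches, dim = x.shape
    len_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(n, num_patches)       # one random score per patch
    ids_shuffle = noise.argsort(dim=1)       # random permutation per sample
    ids_keep = ids_shuffle[:, :len_keep]     # the first len_keep survive
    x_visible = torch.gather(
        x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))
    return x_visible, ids_shuffle            # encoder sees x_visible only

visible, perm = random_masking(torch.randn(2, 196, 384))
print(visible.shape)  # torch.Size([2, 49, 384]) at the default 75% ratio
```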

End-to-End Fine-tuning

ViT-Small with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, 50 epochs, batch size 4096, GAP.

sh run_finetune.sh \
	--config cfgs/finetune/ViT-S_50E_GAP.yaml \
	--data_path /path/to/data \
	--finetune /path/to/pretrain/model

Linear Classification

Following the paper, we support two training modes: SGD with batch size 4096, and LARS with batch size 16384.
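
In both modes the encoder stays frozen and only a linear head is trained. A minimal sketch of that setup (the toy backbone stands in for the pre-trained ViT; all names are hypothetical):

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(768, 384), nn.GELU())  # toy encoder
for p in backbone.parameters():
    p.requires_grad = False  # linear probing: encoder weights stay fixed
backbone.eval()

head = nn.Linear(384, 1000)  # embed_dim -> 1000 ImageNet classes
optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)

logits = head(backbone(torch.randn(4, 768)))  # only `head` gets gradients
```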

ViT-Small with 1-node (8-GPU, NVIDIA GeForce RTX 3090) training, 50 epochs, SGD + batch size 4096, GAP.

sh run_lincls.sh \
	--config cfgs/lincls/ViT-S_SGD_GAP.yaml \
	--data_path /path/to/data \
	--finetune /path/to/pretrain/model

k-NN Evaluation of the Pre-trained Model

ViT-Small with 1-node (8-GPU, NVIDIA GeForce RTX 3090), GAP.

sh run_knn.sh \
	--config cfgs/finetune/ViT-S_50E_GAP.yaml \
	--data_path /path/to/data \
	--finetune /path/to/pretrain/model \
	--save_path /path/to/save/result

Visualization of Reconstruction

ViT-Base pre-trained for 400 epochs.

python tools/run_mae_vis.py \
	--config cfgs/pretrain/ViT-B_400E_Norm_GAP.yaml \
	--save_path output/restruct/ \
	--model_path /path/to/pretrain/model \
	--img_path /path/to/image
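
The decoder predicts one flattened pixel patch per token; folding those predictions back into an image is a reshape. A sketch of the standard unpatchify step (not necessarily this script's exact code):

```python
import torch

def unpatchify(patches, patch_size=16):
    # patches: (batch, num_patches, patch_size**2 * 3) predicted pixels.
    n, num_patches, _ = patches.shape
    h = w = int(num_patches ** 0.5)        # assume a square patch grid
    x = patches.reshape(n, h, w, patch_size, patch_size, 3)
    x = torch.einsum('nhwpqc->nchpwq', x)  # interleave grid and pixel axes
    return x.reshape(n, 3, h * patch_size, w * patch_size)

img = unpatchify(torch.randn(1, 196, 16 * 16 * 3))
print(img.shape)  # torch.Size([1, 3, 224, 224])
```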

Visualization of Attention

ViT-Small with cls-token, pre-trained for 100 epochs.

python tools/vit_explain.py \
	--config cfgs/finetune/ViT-S_50E_CLS-Token.yaml \
	--finetune /path/to/pretrain/model \
	--image_path /path/to/image \
	--head_fusion max \
	--discard_ratio 0.9
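
The head_fusion and discard_ratio flags suggest attention-rollout-style aggregation: per-layer attention maps are fused across heads, the weakest links are discarded, and the layer maps are multiplied together. A sketch of one rollout step under that assumption (not necessarily what tools/vit_explain.py does internally):

```python
import torch

def rollout_step(rollout, attn, head_fusion="max", discard_ratio=0.9):
    # attn: (num_heads, num_tokens, num_tokens) attention of one layer.
    if head_fusion == "max":
        fused = attn.max(dim=0).values
    else:
        fused = attn.mean(dim=0)
    flat = fused.flatten()                 # zero the weakest links in-place
    _, weakest = flat.topk(int(flat.numel() * discard_ratio), largest=False)
    flat[weakest] = 0.0
    eye = torch.eye(fused.size(-1))        # account for residual connections
    a = (fused + eye) / 2
    a = a / a.sum(dim=-1, keepdim=True)    # renormalize rows
    return a @ rollout                     # rollout starts as the identity

rollout = torch.eye(197)
rollout = rollout_step(rollout, torch.rand(6, 197, 197))  # one layer
```

Applying this step across all layers and reading the cls-token row of the result gives a per-patch relevance map.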

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

If you use the code of this repo, please cite the original paper and this repo:

@Article{he2021mae,
  author  = {Kaiming He* and Xinlei Chen* and Saining Xie and Yanghao Li and Piotr Dollár and Ross Girshick},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  journal = {arXiv preprint arXiv:2111.06377},
  year    = {2021},
}
@misc{yang2021maepriv,
  author       = {Lu Yang* and Pu Cao* and Yang Nie and Qing Song},
  title        = {MAE-priv},
  howpublished = {\url{https://github.com/BUPT-PRIV/MAE-priv}},
  year         = {2021},
}
