PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

Last update: Jan 08, 2023

Related tags

Deep Learning moco-v3

Overview

MoCo v3 for Self-supervised ResNet and ViT

Introduction

This is a PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

The original MoCo v3 was implemented in Tensorflow and run in TPUs. This repo re-implements in PyTorch and GPUs. Despite the library and numerical differences, this repo reproduces the results and observations in the paper.

Main Results

The following results are based on ImageNet-1k self-supervised pre-training, followed by ImageNet-1k supervised training for linear evaluation or end-to-end fine-tuning. All results in these tables are based on a batch size of 4096.

ResNet-50, linear classification

pretrain epochs	pretrain crops	linear acc
100	2x224	68.9
300	2x224	72.8
1000	2x224	74.6

ViT, linear classification

model	pretrain epochs	pretrain crops	linear acc
ViT-Small	300	2x224	73.2
ViT-Base	300	2x224	76.7

ViT, end-to-end fine-tuning

model	pretrain epochs	pretrain crops	e2e acc
ViT-Small	300	2x224	81.4
ViT-Base	300	2x224	83.2

The end-to-end fine-tuning results are obtained using the DeiT repo, using all the default DeiT configs. ViT-B is fine-tuned for 150 epochs (vs DeiT-B's 300ep, which has 81.8% accuracy).

Usage: Preparation

Install PyTorch and download the ImageNet dataset following the official PyTorch ImageNet training code. Similar to MoCo v1/2, this repo contains minimal modifications on the official PyTorch ImageNet code. We assume the user can successfully run the official PyTorch ImageNet code. For ViT models, install timm (timm==0.4.9).

The code has been tested with CUDA 10.2/CuDNN 7.6.5, PyTorch 1.9.0 and timm 0.4.9.

Usage: Self-supervised Pre-Training

Below are three examples for MoCo v3 pre-training.

ResNet-50 with 2-node (16-GPU) training, batch 4096

On the first node, run:

python main_moco.py \
  --moco-m-cos --crop-min=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 2 --rank 0 \
  [your imagenet-folder with train and val folders]

On the second node, run the same command with --rank 1. With a batch size of 4096, the training can fit into 2 nodes with a total of 16 Volta 32G GPUs.

ViT-Small with 1-node (8-GPU) training, batch 1024

python main_moco.py \
  -a vit_small -b 1024 \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  [your imagenet-folder with train and val folders]

ViT-Base with 8-node training, batch 4096

With a batch size of 4096, ViT-Base is trained with 8 nodes:

python main_moco.py \
  -a vit_base \
  --optimizer=adamw --lr=1.5e-4 --weight-decay=.1 \
  --epochs=300 --warmup-epochs=40 \
  --stop-grad-conv1 --moco-m-cos --moco-t=.2 \
  --dist-url 'tcp://[your first node address]:[specified port]' \
  --multiprocessing-distributed --world-size 8 --rank 0 \
  [your imagenet-folder with train and val folders]

On other nodes, run the same command with --rank 1, ..., --rank 7 respectively.

Notes:

The batch size specified by -b is the total batch size across all GPUs.
The learning rate specified by --lr is the base lr, and is adjusted by the linear lr scaling rule in this line.
Using a smaller batch size has a more stable result (see paper), but has lower speed. Using a large batch size is critical for good speed in TPUs (as we did in the paper).
In this repo, only multi-gpu, DistributedDataParallel training is supported; single-gpu or DataParallel training is not supported. This code is improved to better suit the multi-node setting, and by default uses automatic mixed-precision for pre-training.

Usage: Linear Classification

By default, we use momentum-SGD and a batch size of 1024 for linear classification on frozen features/weights. This can be done with a single 8-GPU node.

python main_lincls.py \
  -a [architecture] --lr [learning rate] \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed --world-size 1 --rank 0 \
  --pretrained [your checkpoint path]/[your checkpoint file].pth.tar \
  [your imagenet-folder with train and val folders]

Usage: End-to-End Fine-tuning ViT

To perform end-to-end fine-tuning for ViT, use our script to convert the pre-trained ViT checkpoint to DEiT format:

python convert_to_deit.py \
  --input [your checkpoint path]/[your checkpoint file].pth.tar \
  --output [target checkpoint file].pth

Then run the training (in the DeiT repo) with the converted checkpoint:

python $DEIT_DIR/main.py \
  --resume [target checkpoint file].pth \
  --epochs 150

This gives us 83.2% accuracy for ViT-Base with 150-epoch fine-tuning.

Note:

We use --resume rather than --finetune in the DeiT repo, as its --finetune option trains under eval mode. When loading the pre-trained model, revise model_without_ddp.load_state_dict(checkpoint['model']) with strict=False.
Our ViT-Small is with heads=12 in the Transformer block, while by default in DeiT it is heads=6. Please modify the DeiT code accordingly when fine-tuning our ViT-Small model.

Model Configs

See the commands listed in CONFIG.md.

Transfer Learning

See the instruction in the transfer dir.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

Citation

@Article{chen2021mocov3,
  author  = {Xinlei Chen* and Saining Xie* and Kaiming He},
  title   = {An Empirical Study of Training Self-Supervised Vision Transformers},
  journal = {arXiv preprint arXiv:2104.02057},
  year    = {2021},
}

PyTorch implementation of MoCo v3 for self-supervised ResNet and ViT.

Related tags

Overview

MoCo v3 for Self-supervised ResNet and ViT

Introduction

Main Results

ResNet-50, linear classification

ViT, linear classification

ViT, end-to-end fine-tuning

Usage: Preparation

Usage: Self-supervised Pre-Training

ResNet-50 with 2-node (16-GPU) training, batch 4096

ViT-Small with 1-node (8-GPU) training, batch 1024

ViT-Base with 8-node training, batch 4096

Notes:

Usage: Linear Classification

Usage: End-to-End Fine-tuning ViT

Model Configs

Transfer Learning

License

Citation

Owner

Facebook Research

Lama-cleaner: Image inpainting tool powered by LaMa

This repository is the official implementation of Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning (NeurIPS21).

An example of semantic segmentation using tensorflow in eager execution.

[CVPR 2021] MetaSAug: Meta Semantic Augmentation for Long-Tailed Visual Recognition

This repository compare a selfie with images from identity documents and response if the selfie match.

RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching

Pytorch implementation for our ICCV 2021 paper "TRAR: Routing the Attention Spans in Transformers for Visual Question Answering".

Code release for The Devil is in the Channels: Mutual-Channel Loss for Fine-Grained Image Classification (TIP 2020)

Mixed Transformer UNet for Medical Image Segmentation

A list of all papers and resoureces on Semantic Segmentation

functorch is a prototype of JAX-like composable function transforms for PyTorch.

Official implementation of SynthTIGER (Synthetic Text Image GEneratoR) ICDAR 2021

Subdivision-based Mesh Convolutional Networks

InDuDoNet+: A Model-Driven Interpretable Dual Domain Network for Metal Artifact Reduction in CT Images

Official implementation of deep Gaussian process (DGP)-based multi-speaker speech synthesis with PyTorch.

Official Repository for the ICCV 2021 paper "PixelSynth: Generating a 3D-Consistent Experience from a Single Image"

DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation

A Probabilistic End-To-End Task-Oriented Dialog Model with Latent Belief States towards Semi-Supervised Learning

Code release for NeX: Real-time View Synthesis with Neural Basis Expansion

CLOOB training (JAX) and inference (JAX and PyTorch)