A PyTorch implementation of Masked Autoencoders Are Scalable Vision Learners

Last update: Dec 29, 2022

This is a rough first version of MAE. Only the pretraining model is implemented so far; finetuning and linear probing are coming soon.

1. Introduction

This repo implements the MAE-ViT model in PyTorch. It was written without consulting any reference code, so it is an unofficial version. Because of limited time and hardware, I have only trained the ViT-tiny model as the encoder.

2. Environment

python 3.7+
pytorch 1.7.1
pillow
timm
opencv-python

3. Model Config

Pretrain Config

BaseConfig
```
img_size = 224,
patch_size = 16,
```
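As a quick check of what these two values imply, the sketch below works out the patch grid for one image. The 0.75 mask ratio is an assumption taken from the default in the MAE paper; this README does not state the ratio used here.

```python
# Patch arithmetic for the base config above.
img_size = 224
patch_size = 16

grid = img_size // patch_size              # patches per side
num_patches = grid * grid                  # total patches per image
patch_dim = patch_size * patch_size * 3    # flattened RGB patch length

# Assumption: 75% mask ratio (the MAE paper's default, not stated in this README).
mask_ratio = 0.75
num_masked = int(num_patches * mask_ratio)
num_visible = num_patches - num_masked

print(grid, num_patches, patch_dim, num_masked, num_visible)
```

Under that assumption the encoder only sees about a quarter of the patches, which is where most of the pretraining speedup in MAE comes from.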

Encoder: the encoder follows the ViT-tiny model config.

encoder_dim = 192,
encoder_depth = 12,
encoder_heads = 3,

Decoder: the decoder follows the config from Kaiming He's paper.

decoder_dim = 512,
decoder_depth = 8,
decoder_heads = 16,
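One way to keep these settings together is a small config object. The following is only a sketch: `MAEPretrainConfig` and `head_dim` are hypothetical names that mirror the values listed in this README, not the classes used in the repo. It also checks that each embedding width divides evenly across its attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MAEPretrainConfig:
    # Hypothetical container; field names mirror this README, not the repo's code.
    img_size: int = 224
    patch_size: int = 16
    encoder_dim: int = 192
    encoder_depth: int = 12
    encoder_heads: int = 3
    decoder_dim: int = 512
    decoder_depth: int = 8
    decoder_heads: int = 16

    def head_dim(self, part: str) -> int:
        """Per-head width for 'encoder' or 'decoder'."""
        dim = getattr(self, f"{part}_dim")
        heads = getattr(self, f"{part}_heads")
        if dim % heads != 0:
            raise ValueError(f"{part}_dim must be divisible by {part}_heads")
        return dim // heads


cfg = MAEPretrainConfig()
print(cfg.head_dim("encoder"), cfg.head_dim("decoder"))
```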

Mask
1. Shuffle the patches after adding the sin-cos position embedding for the encoder.
2. Mask the shuffled patches and keep the mask indices.
3. Unshuffle the mask patches and combine them with the encoder embedding before adding the position embedding for the decoder.
4. Reconstruct the image from the decoder embedding with a transposed convolution (ConvTranspose).
5. Build the mask map from the mask indices to calculate the loss (only the masked patches are considered).
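The index bookkeeping in steps 1, 2, 3 and 5 can be sketched in plain Python. This is an illustration under assumptions, not the repo's code: the function names are hypothetical, the 0.75 mask ratio comes from the MAE paper, patches are toy lists of floats, and the position embeddings, the transformer blocks and the ConvTranspose step are left out.

```python
import random


def random_masking(num_patches, mask_ratio, rng):
    """Shuffle patch indices and split them into kept and masked sets."""
    num_keep = int(num_patches * (1 - mask_ratio))
    shuffle_idx = list(range(num_patches))
    rng.shuffle(shuffle_idx)
    return shuffle_idx[:num_keep], shuffle_idx[num_keep:]


def unshuffle(encoded, keep_idx, mask_idx, mask_token):
    """Put encoder outputs and mask tokens back in the original patch order."""
    full = [None] * (len(keep_idx) + len(mask_idx))
    for token, i in zip(encoded, keep_idx):
        full[i] = token
    for i in mask_idx:
        full[i] = mask_token
    return full


def masked_mse(pred, target, mask_idx):
    """Mean squared error averaged over the masked patches only."""
    total = 0.0
    for i in mask_idx:
        sq = [(p - t) ** 2 for p, t in zip(pred[i], target[i])]
        total += sum(sq) / len(sq)
    return total / len(mask_idx)


rng = random.Random(0)
num_patches = 196
patches = [[float(i)] * 4 for i in range(num_patches)]  # toy 4-value patches
keep_idx, mask_idx = random_masking(num_patches, 0.75, rng)

encoded = [patches[i] for i in keep_idx]  # stand-in for the encoder output
mask_token = [0.0] * 4                    # stand-in for a learned mask token
decoder_input = unshuffle(encoded, keep_idx, mask_idx, mask_token)

# Stand-in "prediction": the loss only looks at the masked positions.
loss = masked_mse(decoder_input, patches, mask_idx)
print(len(keep_idx), len(mask_idx), loss)
```

Keeping `mask_idx` around is what makes step 5 possible: the visible patches contribute nothing to the loss, so the model is only scored on what it had to reconstruct.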

Finetune Config

Waiting for the results.

TODO:

Finetune training
Linear training

4. Results

Reconstructions of ImageNet validation images from the pretrained model. Compared with Kaiming He's results, the reconstruction quality is worse, possibly because the encoder model is too small.

The MAE-ViT-tiny pretrained model is here; you can download it to test the reconstruction results. Put the checkpoint in the weights folder.

5. Training & Inference

Dataset preparation

Prepare a list file with one image path and its class label per line:

/data/home/imagenet/xxx.jpeg, 0
/data/home/imagenet/xxx.jpeg, 1
...
/data/home/imagenet/xxx.jpeg, 999
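A list in this format is straightforward to read back. The sketch below assumes one comma-separated `path, label` pair per line, as in the example above; how `train_mae.py` actually parses the file is not described here, and `parse_annotation_lines` is a hypothetical helper.

```python
def parse_annotation_lines(lines):
    """Parse 'path, label' lines into (path, int_label) pairs."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last comma so commas inside a path do not break parsing.
        path, label = line.rsplit(",", 1)
        samples.append((path.strip(), int(label)))
    return samples


example = [
    "/data/home/imagenet/xxx.jpeg, 0",
    "/data/home/imagenet/xxx.jpeg, 1",
    "/data/home/imagenet/xxx.jpeg, 999",
]
samples = parse_annotation_lines(example)
print(samples[0], samples[-1])
```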

Training

Pretrain

#!/bin/bash
OMP_NUM_THREADS=1
MKL_NUM_THREADS=1
export OMP_NUM_THREADS
export MKL_NUM_THREADS
cd MAE-Pytorch;
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -W ignore -m torch.distributed.launch --nproc_per_node 8 train_mae.py \
--batch_size 256 \
--num_workers 32 \
--lr 1.5e-4 \
--optimizer_name "adamw" \
--cosine 1 \
--max_epochs 300 \
--warmup_epochs 40 \
--num-classes 1000 \
--crop_size 224 \
--patch_size 16 \
--color_prob 0.0 \
--calculate_val 0 \
--weight_decay 5e-2 \
--lars 0 \
--mixup 0.0 \
--smoothing 0.0 \
--train_file $train_file \
--val_file $val_file \
--checkpoints-path $ckpt_folder \
--log-dir $log_folder
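To get a feel for what `--lr 1.5e-4`, `--warmup_epochs 40`, `--max_epochs 300` and `--cosine 1` mean together, here is a sketch of a common warmup-plus-cosine schedule. It assumes linear warmup followed by a half-cosine decay to zero, evaluated once per epoch; the repo may update per iteration or scale the learning rate by batch size, so treat `lr_at_epoch` as a hypothetical illustration.

```python
import math


def lr_at_epoch(epoch, base_lr=1.5e-4, warmup_epochs=40, max_epochs=300):
    """Learning rate under linear warmup then cosine decay (assumed schedule)."""
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr at the last warmup epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (max_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))


for e in (0, 39, 40, 170, 299):
    print(e, f"{lr_at_epoch(e):.3e}")
```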

Finetune TODO:
- training
Linear TODO:
- training

Inference

Pretrain

python mae_test.py --test_image xxx.jpg --ckpt weights.pth

Classification TODO:
- training

6. TODO

ViT-Base model training.
Swin Transformer for MAE.
Finetune & Linear training.

Finetuning is in progress; the weights may be coming soon.


Owner

FlyEgle
