This is an official implementation for "Self-Supervised Learning with Swin Transformers".

Overview

Self-Supervised Learning with Vision Transformers

By Zhenda Xie*, Yutong Lin*, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao and Han Hu

This repo is the official implementation of "Self-Supervised Learning with Swin Transformers".

A important feature of this codebase is to include Swin Transformer as one of the backbones, such that we can evaluate the transferring performance of the learnt representations on down-stream tasks of object detection and semantic segmentation. This evaluation is usually not included in previous works due to the use of ViT/DeiT, which has not been well tamed for down-stream tasks.

It currently includes code and models for the following tasks:

Self-Supervised Learning and Linear Evaluation: Included in this repo. See get_started.md for a quick start.

Transferring Performance on Object Detection/Instance Segmentation: See Swin Transformer for Object Detection.

Transferring Performance on Semantic Segmentation: See Swin Transformer for Semantic Segmentation.

Highlights

  • Include down-stream evaluation: the first work to evaluate the transferring performance on down-stream tasks for SSL using Transformers
  • Small tricks: significantly less tricks than previous works, such as MoCo v3 and DINO
  • High accuracy on ImageNet-1K linear evaluation: 72.8 vs 72.5 (MoCo v3) vs 72.5 (DINO) using DeiT-S/16 and 300 epoch pre-training

Updates

05/13/2021

  1. Self-Supervised models with DeiT-Small on ImageNet-1K (MoBY-DeiT-Small-300Ep-Pretrained, MoBY-DeiT-Small-300Ep-Linear) are provided.
  2. The supporting code and config for self-supervised learning with DeiT-Small are provided.

05/11/2021

Initial Commits:

  1. Self-Supervised Pre-training models on ImageNet-1K (MoBY-Swin-T-300Ep-Pretrained, MoBY-Swin-T-300Ep-Linear) are provided.
  2. The supported code and models for self-supervised pre-training and ImageNet-1K linear evaluation, COCO object detection and ADE20K semantic segmentation are provided.

Introduction

MoBY: a self-supervised learning approach by combining MoCo v2 and BYOL

MoBY (the name MoBY stands for MoCo v2 with BYOL) is initially described in arxiv, which is a combination of two popular self-supervised learning approaches: MoCo v2 and BYOL. It inherits the momentum design, the key queue, and the contrastive loss used in MoCo v2, and inherits the asymmetric encoders, asymmetric data augmentations and the momentum scheduler in BYOL.

MoBY achieves reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.3% top-1 accuracy using DeiT and Swin-T, respectively, by 300-epoch training. The performance is on par with recent works of MoCo v3 and DINO which adopt DeiT as the backbone, but with much lighter tricks.

teaser_moby

Swin Transformer as a backbone

Swin Transformer (the name Swin stands for Shifted window) is initially described in arxiv, which capably serves as a general-purpose backbone for computer vision. It achieves strong performance on COCO object detection (58.7 box AP and 51.1 mask AP on test-dev) and ADE20K semantic segmentation (53.5 mIoU on val), surpassing previous models by a large margin.

We involve Swin Transformer as one of backbones to evaluate the transferring performance on down-stream tasks such as object detection. This differentiate this codebase with other approaches studying SSL on Transformer architectures.

ImageNet-1K linear evaluation

Method Architecture Epochs Params FLOPs img/s Top-1 Accuracy Pre-trained Checkpoint Linear Checkpoint
Supervised Swin-T 300 28M 4.5G 755.2 81.2 Here
MoBY Swin-T 100 28M 4.5G 755.2 70.9 TBA
MoBY1 Swin-T 100 28M 4.5G 755.2 72.0 TBA
MoBY DeiT-S 300 22M 4.6G 940.4 72.8 GoogleDrive/GitHub/Baidu GoogleDrive/GitHub/Baidu
MoBY Swin-T 300 28M 4.5G 755.2 75.3 GoogleDrive/GitHub/Baidu GoogleDrive/GitHub/Baidu
  • 1 denotes the result of MoBY which has adopted a trick from MoCo v3 that replace theLayerNorm layers before the MLP blocks by BatchNorm.

  • Access code for baidu is moby.

Transferring to Downstream Tasks

COCO Object Detection (2017 val)

Backbone Method Model Schd. box mAP mask mAP Params FLOPs
Swin-T Mask R-CNN Sup. 1x 43.7 39.8 48M 267G
Swin-T Mask R-CNN MoBY 1x 43.6 39.6 48M 267G
Swin-T Mask R-CNN Sup. 3x 46.0 41.6 48M 267G
Swin-T Mask R-CNN MoBY 3x 46.0 41.7 48M 267G
Swin-T Cascade Mask R-CNN Sup. 1x 48.1 41.7 86M 745G
Swin-T Cascade Mask R-CNN MoBY 1x 48.1 41.5 86M 745G
Swin-T Cascade Mask R-CNN Sup. 3x 50.4 43.7 86M 745G
Swin-T Cascade Mask R-CNN MoBY 3x 50.2 43.5 86M 745G

ADE20K Semantic Segmentation (val)

Backbone Method Model Crop Size Schd. mIoU mIoU (ms+flip) Params FLOPs
Swin-T UPerNet Sup. 512x512 160K 44.51 45.81 60M 945G
Swin-T UPerNet MoBY 512x512 160K 44.06 45.58 60M 945G

Citing MoBY and Swin

MoBY

@article{xie2021moby,
  title={Self-Supervised Learning with Swin Transformers}, 
  author={Zhenda Xie and Yutong Lin and Zhuliang Yao and Zheng Zhang and Qi Dai and Yue Cao and Han Hu},
  journal={arXiv preprint arXiv:2105.04553},
  year={2021}
}

Swin Transformer

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Getting Started

Owner
Swin Transformer
This organization maintains repositories built on Swin Transformers. The pretrained models locate at https://github.com/microsoft/Swin-Transformer
Swin Transformer
DiSECt: Differentiable Simulator for Robotic Cutting

DiSECt: Differentiable Simulator for Robotic Cutting Website | Paper | Dataset | Video | Blog post DiSECt is a simulator for the cutting of deformable

NVIDIA Research Projects 73 Oct 29, 2022
Source code of D-HAN: Dynamic News Recommendation with Hierarchical Attention Network

D-HAN The source code of D-HAN This is the source code of D-HAN: Dynamic News Recommendation with Hierarchical Attention Network. However, only the co

30 Sep 22, 2022
data/code repository of "C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer"

C2F-FWN data/code repository of "C2F-FWN: Coarse-to-Fine Flow Warping Network for Spatial-Temporal Consistent Motion Transfer" (https://arxiv.org/abs/

EKILI 46 Dec 14, 2022
PyTorch(Geometric) implementation of G^2GNN in "Imbalanced Graph Classification via Graph-of-Graph Neural Networks"

This repository is an official PyTorch(Geometric) implementation of G^2GNN in "Imbalanced Graph Classification via Graph-of-Graph Neural Networks". Th

Yu Wang (Jack) 13 Nov 18, 2022
TorchX is a library containing standard DSLs for authoring and running PyTorch related components for an E2E production ML pipeline.

TorchX is a library containing standard DSLs for authoring and running PyTorch related components for an E2E production ML pipeline

193 Dec 22, 2022
MEDS: Enhancing Memory Error Detection for Large-Scale Applications

MEDS: Enhancing Memory Error Detection for Large-Scale Applications Prerequisites cmake and clang Build MEDS supporting compiler $ make Build Using Do

Secomp Lab at Purdue University 34 Dec 14, 2022
Convnet transfer - Code for paper How transferable are features in deep neural networks?

How transferable are features in deep neural networks? This repository contains source code necessary to reproduce the results presented in the follow

Jason Yosinski 143 Sep 13, 2022
Learning Features with Parameter-Free Layers (ICLR 2022)

Learning Features with Parameter-Free Layers (ICLR 2022) Dongyoon Han, YoungJoon Yoo, Beomyoung Kim, Byeongho Heo | Paper NAVER AI Lab, NAVER CLOVA Up

NAVER AI 65 Dec 07, 2022
A curated list of awesome resources combining Transformers with Neural Architecture Search

A curated list of awesome resources combining Transformers with Neural Architecture Search

Yash Mehta 173 Jan 03, 2023
Semantic Segmentation in Pytorch

PyTorch Semantic Segmentation Introduction This repository is a PyTorch implementation for semantic segmentation / scene parsing. The code is easy to

Hengshuang Zhao 1.2k Jan 01, 2023
vit for few-shot classification

Few-Shot ViT Requirements PyTorch (= 1.9) TorchVision timm (latest) einops tqdm numpy scikit-learn scipy argparse tensorboardx Pretrained Checkpoints

Martin Dong 26 Nov 30, 2022
Constrained Language Models Yield Few-Shot Semantic Parsers

Constrained Language Models Yield Few-Shot Semantic Parsers This repository contains tools and instructions for reproducing the experiments in the pap

Microsoft 43 Nov 23, 2022
"Inductive Entity Representations from Text via Link Prediction" @ The Web Conference 2021

Inductive entity representations from text via link prediction This repository contains the code used for the experiments in the paper "Inductive enti

Daniel Daza 45 Jan 09, 2023
Replication attempt for the Protein Folding Model

RGN2-Replica (WIP) To eventually become an unofficial working Pytorch implementation of RGN2, an state of the art model for MSA-less Protein Folding f

Eric Alcaide 36 Nov 29, 2022
A semantic segmentation toolbox based on PyTorch

Introduction vedaseg is an open source semantic segmentation toolbox based on PyTorch. Features Modular Design We decompose the semantic segmentation

407 Dec 15, 2022
FinRL­-Meta: A Universe for Data­-Driven Financial Reinforcement Learning. 🔥

FinRL-Meta: A Universe of Market Environments. FinRL-Meta is a universe of market environments for data-driven financial reinforcement learning. Users

AI4Finance Foundation 543 Jan 08, 2023
StorSeismic: An approach to pre-train a neural network to store seismic data features

StorSeismic: An approach to pre-train a neural network to store seismic data features This repository contains codes and resources to reproduce experi

Seismic Wave Analysis Group 11 Dec 05, 2022
Segmentation-Aware Convolutional Networks Using Local Attention Masks

Segmentation-Aware Convolutional Networks Using Local Attention Masks [Project Page] [Paper] Segmentation-aware convolution filters are invariant to b

144 Jun 29, 2022
Cross-modal Deep Face Normals with Deactivable Skip Connections

Cross-modal Deep Face Normals with Deactivable Skip Connections Victoria Fernández Abrevaya*, Adnane Boukhayma*, Philip H. S. Torr, Edmond Boyer (*Equ

72 Nov 27, 2022
Official implementation of the paper ``Unifying Nonlocal Blocks for Neural Networks'' (ICCV'21)

Spectral Nonlocal Block Overview Official implementation of the paper: Unifying Nonlocal Blocks for Neural Networks (ICCV'21) Spectral View of Nonloca

91 Dec 14, 2022