Usable Implementation of "Bootstrap Your Own Latent" self-supervised learning, from Deepmind, in Pytorch

Last update: Dec 29, 2022

Overview

Bootstrap Your Own Latent (BYOL), in Pytorch

Practical implementation of an astoundingly simple method for self-supervised learning that achieves a new state of the art (surpassing SimCLR) without contrastive learning or having to designate negative pairs.

This repository offers a module that can easily wrap any image-based neural network (residual network, discriminator, policy network) so that you can immediately start benefitting from unlabelled image data.

Update 1: There is now new evidence that batch normalization is key to making this technique work well.

Update 2: A new paper has successfully replaced batch norm with group norm + weight standardization, refuting the claim that batch statistics are needed for BYOL to work.

Update 3: Finally, we have some analysis of why this works.

Yannic Kilcher's excellent explanation

Now go save your organization from having to pay for labels :)

Install

$ pip install byol-pytorch

Usage

Simply plug in your neural network, specifying (1) the image dimensions and (2) the name (or index) of the hidden layer whose output is used as the latent representation for self-supervised training.

import torch
from byol_pytorch import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool'
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    learner.update_moving_average() # update moving average of target encoder

# save your improved network
torch.save(resnet.state_dict(), './improved-net.pt')

That's pretty much it. After much training, the residual network should now perform better on its downstream supervised tasks.

BYOL → SimSiam

A new paper from Kaiming He suggests that BYOL does not even need the target encoder to be an exponential moving average of the online encoder. I've decided to build in this option so that you can easily use that variant for training, simply by setting the use_momentum flag to False. If you go this route, you will no longer need to invoke update_moving_average, as shown in the example below.

import torch
from byol_pytorch import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool',
    use_momentum = False       # turn off momentum in the target encoder
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(100):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()

# save your improved network
torch.save(resnet.state_dict(), './improved-net.pt')
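
The two variants above differ in how the target encoder is maintained. For intuition, here is a minimal sketch of a standard exponential moving average update, the kind of step that update_moving_average performs conceptually. It is not the library's code: the helper name ema_update and the toy nn.Linear modules are illustrative assumptions.

```python
import copy

import torch
from torch import nn


def ema_update(target: nn.Module, online: nn.Module, decay: float) -> None:
    """In-place update: target <- decay * target + (1 - decay) * online."""
    with torch.no_grad():
        for t, o in zip(target.parameters(), online.parameters()):
            t.mul_(decay).add_(o, alpha=1 - decay)


# Toy stand-ins for the online and target encoders.
online = nn.Linear(4, 2)
target = copy.deepcopy(online)

# Pretend an optimizer step moved the online weights.
with torch.no_grad():
    online.weight.add_(1.0)

before = target.weight.clone()
ema_update(target, online, decay=0.99)

# The target moves only 1% of the way toward the online weights.
print((target.weight - before).abs().max())
```

With a decay of 0.99 the target lags well behind the online network, which is the behaviour the use_momentum = False variant gives up.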

Advanced

While the hyperparameters have already been set to what the paper has found optimal, you can change them with extra keyword arguments to the base wrapper class.

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool',
    projection_size = 256,           # the projection size
    projection_hidden_size = 4096,   # the hidden dimension of the MLP for both the projection and prediction
    moving_average_decay = 0.99      # the moving average decay factor for the target encoder, already set at what paper recommends
)
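
To make the two size parameters concrete, here is a minimal sketch of the kind of MLP used for projection and prediction. It assumes the layout described in the BYOL paper (linear, batch norm, ReLU, linear) and is not copied from this library. The helper name mlp and the input width of 2048 (the usual resnet50 'avgpool' width) are assumptions.

```python
import torch
from torch import nn


def mlp(dim_in: int, projection_size: int = 256, hidden_size: int = 4096) -> nn.Sequential:
    """Hypothetical helper: two-layer MLP with batch norm on the hidden layer."""
    return nn.Sequential(
        nn.Linear(dim_in, hidden_size),
        nn.BatchNorm1d(hidden_size),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_size, projection_size),
    )


projector = mlp(2048)  # maps the hidden-layer representation to the projection
predictor = mlp(256)   # maps the online projection to a prediction of the target projection

representation = torch.randn(8, 2048)  # dummy batch of backbone outputs
projection = projector(representation)
prediction = predictor(projection)
print(projection.shape, prediction.shape)
```

Here projection_hidden_size corresponds to hidden_size and projection_size to the output width of both MLPs.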

By default, this library will use the augmentations from the SimCLR paper (which are also used in the BYOL paper). However, if you would like to specify your own augmentation pipeline, you can simply pass in your own custom augmentation function with the augment_fn keyword.

import kornia
from torch import nn

augment_fn = nn.Sequential(
    kornia.augmentation.RandomHorizontalFlip()
)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = -2,
    augment_fn = augment_fn
)

In the paper, they seem to ensure that one of the augmentations has a higher Gaussian blur probability than the other. You can also adjust this to your heart's delight.

augment_fn = nn.Sequential(
    kornia.augmentation.RandomHorizontalFlip()
)

augment_fn2 = nn.Sequential(
    kornia.augmentation.RandomHorizontalFlip(),
    kornia.filters.GaussianBlur2d((3, 3), (1.5, 1.5))
)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = -2,
    augment_fn = augment_fn,
    augment_fn2 = augment_fn2,
)
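
The pipelines above apply the blur unconditionally. To give the two views different blur probabilities, one option is a small wrapper that applies a transform only some of the time. This is a sketch under stated assumptions: RandomApply is a hypothetical helper written here for illustration, box_blur is a simple average blur standing in for a Gaussian blur, and the probabilities 1.0 and 0.1 are believed to match the paper's two views but should be verified before relying on them.

```python
import random

import torch
import torch.nn.functional as F
from torch import nn


class RandomApply(nn.Module):
    """Hypothetical helper: apply `fn` with probability `p`, otherwise pass through."""

    def __init__(self, fn, p: float):
        super().__init__()
        self.fn = fn
        self.p = p

    def forward(self, x):
        # random.random() is in [0, 1), so p = 1.0 always applies fn.
        return self.fn(x) if random.random() < self.p else x


def box_blur(x: torch.Tensor) -> torch.Tensor:
    # 3x3 average blur that preserves the spatial size (stand-in for a Gaussian blur).
    return F.avg_pool2d(x, kernel_size=3, stride=1, padding=1)


view1_aug = nn.Sequential(RandomApply(box_blur, p=1.0))  # always blurred
view2_aug = nn.Sequential(RandomApply(box_blur, p=0.1))  # rarely blurred

images = torch.randn(4, 3, 32, 32)
out1 = view1_aug(images)
out2 = view2_aug(images)
print(out1.shape, out2.shape)
```

Note that this wrapper makes one decision per batch rather than per image. Pipelines built this way could then be passed as augment_fn and augment_fn2.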

To fetch the embeddings or the projections, you simply have to pass a return_embedding = True flag to the BYOL learner instance.

import torch
from byol_pytorch import BYOL
from torchvision import models

resnet = models.resnet50(pretrained=True)

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool'
)

imgs = torch.randn(2, 3, 256, 256)
projection, embedding = learner(imgs, return_embedding = True)
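
Once embeddings are available, a common use is similarity search. The following minimal sketch finds each image's nearest neighbour by cosine similarity. It uses a random tensor as a stand-in for the returned embedding, and the (num_images, 2048) shape is an assumption for a resnet50 'avgpool' layer.

```python
import torch
import torch.nn.functional as F

# Stand-in for `embedding` returned by learner(imgs, return_embedding = True).
embedding = torch.randn(5, 2048)

# Cosine similarity between every pair of images.
normed = F.normalize(embedding, dim=-1)
similarity = normed @ normed.t()

# Exclude each image's similarity with itself, then take the best match.
similarity.fill_diagonal_(float("-inf"))
nearest = similarity.argmax(dim=-1)
print(nearest)
```

The same normalized embeddings could also be fed to any off-the-shelf clustering routine.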

Alternatives

If your downstream task involves segmentation, please look at the following repository, which extends BYOL to 'pixel'-level learning.

https://github.com/lucidrains/pixel-level-contrastive-learning

Citation

@misc{grill2020bootstrap,
    title = {Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning},
    author = {Jean-Bastien Grill and Florian Strub and Florent Altché and Corentin Tallec and Pierre H. Richemond and Elena Buchatskaya and Carl Doersch and Bernardo Avila Pires and Zhaohan Daniel Guo and Mohammad Gheshlaghi Azar and Bilal Piot and Koray Kavukcuoglu and Rémi Munos and Michal Valko},
    year = {2020},
    eprint = {2006.07733},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}

@misc{chen2020exploring,
    title = {Exploring Simple Siamese Representation Learning},
    author = {Xinlei Chen and Kaiming He},
    year = {2020},
    eprint = {2011.10566},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

Negative Loss, Transfer Learning/Fine-Tuning Question

Hi! Thanks for sharing this repo -- really clean and easy to use.

When training using the PyTorch Lightning script from the repo, my loss is negative (and gets more negative over time). Is this expected?

I'm curious to know if you've fine-tuned a pretrained model using this BYOL as the README example suggested. If yes, how were the results? Any intuition regarding how many epochs to fine-tune for?

Thanks!

opened by rsomani95 13

AssertionError: hidden layer never emitted an output with multi-gpu training

I tried your library with a WideResnet40-2 model and used layer_index=-2.

The Lightning example works fine on a single GPU, but I got the error with multiple GPUs.

opened by reactivetype 7

How to transfer the trained ckpt to pytorch.pth model?

I used the example script to train a model and got a ckpt file, but how can I extract the trained resnet50.pth instead of the whole SelfSupervisedLearner? Sorry, I am new to the PyTorch Lightning library. What I want is the self-supervised resnet50.pth, because I want to use it to replace the original ImageNet-pretrained one. Thank you a lot.
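
One way to approach this is to filter the checkpoint's state dict down to the backbone's keys and strip their prefix. The sketch below uses a toy dictionary in place of a real checkpoint. The prefix "learner.online_encoder.net." is an assumption about how the wrapped network is nested inside the Lightning module, so inspect the keys of your own checkpoint before using it. The helper name extract_backbone is hypothetical.

```python
def extract_backbone(state_dict: dict, prefix: str) -> dict:
    """Keep only entries under `prefix` and strip it from their keys."""
    return {k[len(prefix):]: v for k, v in state_dict.items() if k.startswith(prefix)}


# Toy stand-in for the dictionary that loading a Lightning .ckpt file would give.
# Real values would be tensors; the key names here are assumed, not verified.
checkpoint = {
    "state_dict": {
        "learner.online_encoder.net.conv1.weight": [0.1],
        "learner.online_encoder.net.fc.bias": [0.2],
        "learner.online_predictor.net.0.weight": [0.3],
    }
}

backbone_state = extract_backbone(checkpoint["state_dict"], "learner.online_encoder.net.")
print(sorted(backbone_state))
```

With a real checkpoint, the resulting dictionary could be passed to resnet.load_state_dict and then saved with torch.save as in the README example.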

opened by knaffe 5

Training loss decreased and then increased

Hi, I used your example on my own data. The training loss decreased and then increased after 100 epochs, which is weird. Have you run into similar situations? Is the model hard to train? The batch size is 128/256, lr is 0.1/0.2, and weight_decay is 1e-6.

opened by easonyang1996 4

Can't load ckpt

I use byol-pytorch-master/examples/lightning/train.py to generate ckpt locally after training, but when I load ckpt, there will be the following errors. How should I load it? Thanks a lot!

opened by AndrewTal 4

BYOL uses different augmentations for view1 and view2

In your implementation, you use the same data augmentation pipeline for both views (https://github.com/lucidrains/byol-pytorch/blob/master/byol_pytorch/byol_pytorch.py#L153) like it was done in SimCLR and MoCo.

In the paper, they use different augmentations for the two views:

Is this voluntary?

Love your work btw, the implementations are always very clean :)

opened by OlivierDehaene 4

Transferring results on Cifar and other datasets

Thanks for your open sourcing!

I notice that BYOL shows a large gap over SimCLR when transferring to downstream datasets: e.g., SimCLR reaches 71.6% on CIFAR-100, while BYOL reaches 78.4%.

I understand that this might depend on the downstream training protocols. Could you provide sample code for that, especially for the L-BFGS-optimized logistic regressor?

opened by jacobswan1 4

The saved network is same as the initial one?

Firstly, thank you so much for this clean implementation!!

The self-supervised training process looks good, but the saved (i.e. improved) model is exactly the same as the initial one on my side. Have you observed the same problem?

The code I tested:

import torch
from net.byol import BYOL
from torchvision import models
 
       
resnet = models.resnet50(pretrained=True)
param_1 = resnet.parameters()

learner = BYOL(
    resnet,
    image_size = 256,
    hidden_layer = 'avgpool'
)

opt = torch.optim.Adam(learner.parameters(), lr=3e-4)

def sample_unlabelled_images():
    return torch.randn(20, 3, 256, 256)

for _ in range(2):
    images = sample_unlabelled_images()
    loss = learner(images)
    opt.zero_grad()
    loss.backward()
    opt.step()
    learner.update_moving_average() # update moving average of target encoder

# save your improved network
torch.save(resnet.state_dict(), './checkpoints/improved-net.pt')

# restore the model      
resnet2 = models.resnet50()
resnet2.load_state_dict(torch.load('./checkpoints/improved-net.pt'))
param_2 = resnet2.parameters()

# test whether two models are the same 
for p1, p2 in zip(param_1, param_2):
    if p1.data.ne(p2.data).sum() > 0:
        print('They are different.')
print('They are same.')

opened by KimMeen 3
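
Two details in the test above are worth noting: resnet.parameters() yields references to the live parameter tensors, which the optimizer updates in place, and the final print runs unconditionally. A minimal sketch of a comparison that avoids both, using a toy nn.Linear model as a stand-in for the wrapped network (an assumption to keep the example small):

```python
import copy

import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 2)

live_params = list(model.parameters())        # references: these track training
snapshot = copy.deepcopy(model.state_dict())  # independent copy of the initial weights

# One training step on dummy data.
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss = model(torch.randn(8, 4)).pow(2).mean()
opt.zero_grad()
loss.backward()
opt.step()

# Comparing against live references always reports "same".
same_as_live = all(torch.equal(p, q) for p, q in zip(live_params, model.parameters()))

# Comparing against the deep-copied snapshot detects the update.
changed = any(not torch.equal(snapshot[k], v) for k, v in model.state_dict().items())

print("They are different." if changed else "They are same.")
```

Taking the snapshot with copy.deepcopy before training is what makes the before/after comparison meaningful.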

the maximum batch size can only be set to 32

When I run the code with a 2080ti GPU with 10G memory, the maximum batch size can only be set to 32. Is there any place in the code that takes up a lot of video memory?

opened by cuixianheng 3

Pretrained network

Hi, thanks for sharing the code and making it so easy to use. I see in the example you set resnet = models.resnet50(pretrained=True). Is this what is done in the paper? Shouldn't self-supervised-learned networks be trained from scratch?

Thanks again, P.

opened by pmorerio 3

Singleton Class Members

Forgive me for my unfamiliarity with software design, but I'm wondering why it is necessary to write a singleton wrapper for projector and target_encoder. Is there any disadvantage of initializing them in __init__?

opened by wentaoyuan 3

Increase EMA-parameter during training

Hi, I noticed that the EMA-parameter (called beta in the code, τ in the paper) is not updated during training. In the paper they describe that they increase τ from the start value to 1 during training: "Specifically, we set τ = 1 − (1 − τbase) · (cos(πk/K) + 1)/2 with k the current training step and K the maximum number of training steps." This makes a huge difference to the validation loss at the end of the training.
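
The quoted schedule can be sketched directly in code. The function name ema_decay is hypothetical, and the base value of 0.99 is chosen to match the moving_average_decay default shown in the README rather than the paper's setting. How the value would be fed back into the learner depends on library internals that are not shown here.

```python
import math


def ema_decay(step: int, max_steps: int, base_tau: float = 0.99) -> float:
    """tau = 1 - (1 - tau_base) * (cos(pi * k / K) + 1) / 2"""
    return 1 - (1 - base_tau) * (math.cos(math.pi * step / max_steps) + 1) / 2


# tau starts at base_tau and rises smoothly to 1 by the final step.
for k in (0, 50, 100):
    print(k, ema_decay(k, max_steps=100))
```

The cosine term equals 1 at step 0 and -1 at the last step, so the schedule interpolates from the base value to exactly 1.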

opened by Benjamin-Hansson 1

Why the loss is different from BYOL authors'

I found that the loss is different from the one described in the BYOL paper, which should be an L2 loss, and I didn't find an explanation... The loss in this repo is a cosine loss, and I just want to know why. BTW, thanks for this great repo!
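
For unit-normalized vectors the two forms coincide, since ||p - z||^2 = 2 - 2 (p · z) when ||p|| = ||z|| = 1. A small numerical check of that identity, using random tensors as stand-ins for the prediction and the target projection (this does not reproduce the repository's loss function itself):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
pred = torch.randn(4, 256)    # stand-in for the online prediction
target = torch.randn(4, 256)  # stand-in for the target projection

# L2 form: squared distance between the normalized vectors.
p = F.normalize(pred, dim=-1)
z = F.normalize(target, dim=-1)
l2_form = (p - z).pow(2).sum(dim=-1)

# Cosine form: 2 - 2 * cosine similarity of the raw vectors.
cosine_form = 2 - 2 * F.cosine_similarity(pred, target, dim=-1)

print(torch.allclose(l2_form, cosine_form, atol=1e-5))
```

Both forms lie between 0 and 4, so under this formulation the per-sample loss cannot be negative.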

opened by Jing-XING 2

How to cluster/predict images?

Hi, I have trained using examples given with pytorch-lightning. I couldn't find code to do clustering of images after training. How can I find which image falls in which cluster? Is there any predictor API? I want to do something like this

opened by laxmimerit 1

BN layer weights and biases are not updated

Thanks for sharing this repo, great work!

I trained BYOL on my data and noticed that the weights and biases for BN layers are not updated on the saved model. I used resnet18 without pretrained weights resnet = models.resnet50(pretrained=False). After training for multiple epochs, the saved model has bn1.weight all equal to 1.0 and bn1.bias all equal to 0.0 .

Is this the expected behavior or am I missing something? Appreciate your response!

opened by kregmi 1

Warning: grad and param do not obey the gradient layout contract.

Has anybody gotten a similar warning when using it?

Warning: grad and param do not obey the gradient layout contract. This is not an error, but may impair performance. grad.sizes() = [512, 256, 1, 1], strides() = [256, 1, 1, 1] param.sizes() = [512, 256, 1, 1], strides() = [256, 1, 256, 256] (function operator())

opened by mohaEs 3

Releases (0.6.0)

0.6.0 (Apr 6, 2022)
0.5.7 (Jul 15, 2021)
0.5.6 (Apr 12, 2021)
0.5.5 (Mar 31, 2021)
0.5.4 (Feb 13, 2021)
0.5.3 (Feb 6, 2021)
0.5.2 (Jan 12, 2021)
0.5.1 (Dec 15, 2020)
0.5.0 (Dec 8, 2020)
0.4.0 (Nov 23, 2020)
0.3.2 (Nov 1, 2020)
0.3.1 (Oct 14, 2020)
0.3.0 (Oct 13, 2020)
0.2.0 (Aug 10, 2020)
0.1.5 (Jul 24, 2020)
0.1.4 (Jun 28, 2020)
0.1.3 (Jun 24, 2020)
0.1.2 (Jun 22, 2020): Average loss rather than sum across batch
0.1.1 (Jun 18, 2020)
0.1.0 (Jun 18, 2020)
0.0.5 (Jun 17, 2020)
0.0.4 (Jun 17, 2020)
0.0.2 (Jun 17, 2020): First release

Owner

Phil Wang

Working with Attention. It's all we need

