Adabelief-Optimizer - Repository for NeurIPS 2020 Spotlight "AdaBelief Optimizer: Adapting stepsizes by the belief in observed gradients"

Overview

AdaBelief Optimizer

NeurIPS 2020 Spotlight. AdaBelief trains as fast as Adam, generalizes as well as SGD, and is stable enough to train GANs.

Release of package

We have released adabelief-pytorch==0.2.0 and adabelief-tf==0.2.0. Please use the latest version from pip. Source code is available under folder pypi_packages/adabelief_pytorch0.2.0 and pypi_packages/adabelief_tf0.2.0.

External Links

Project Page, arXiv, Reddit, Twitter, BiliBili (Chinese), BiliBili (English), Youtube

Link to code for extra experiments with AdaBelief

Update for adabelief-pytorch==0.2.0 (Crucial)

In the next release of adabelief-pytorch, we will modify the defaults of several arguments to fit the needs of general tasks such as GAN and Transformer. When upgrading from version 0.0.5 to a higher version, please check whether you specify these arguments explicitly or rely on the defaults.

| Version | epsilon | weight_decouple | rectify |
|---|---|---|---|
| adabelief-pytorch=0.0.5 | 1e-8 | False | False |
| latest version 0.2.0>0.0.5 | 1e-16 | True | True |

Update for adabelief-tf==0.2.0 (Crucial)

In adabelief-tf==0.1.0, we modified adabelief-tf to have the same features as adabelief-pytorch, including decoupled weight decay and learning rate rectification. Furthermore, we added support for TensorFlow>=2.0 and Keras. The source code is in pypi_packages/adabelief_tf0.1.0. We tested with a text classification task and a word embedding task. The default values are updated; when upgrading from version 0.0.1 to a higher version, please check whether you specify these arguments explicitly or rely on the defaults:

| Version | epsilon | weight_decouple | rectify |
|---|---|---|---|
| adabelief-tf=0.0.1 | 1e-8 | Not supported | Not supported |
| latest version 0.2.0>0.0.1 | 1e-14 | Supported (Not an option in arguments) | default: True |

Quick Guide

  • Check that the code is from the latest official implementation (adabelief-pytorch==0.1.0, adabelief-tf==0.1.0). Default hyper-parameters are different from the old version.

  • Check all hyper-parameters; DO NOT simply use the defaults.

    Epsilon in AdaBelief is different from Adam (typically eps_adabelief = eps_adam*eps_adam).
    (The default eps of Adam is 1e-7 in Tensorflow and 1e-8 in PyTorch; take this into account when using AdaBelief in Tensorflow.)

    If SGD is better than Adam -> Set a large eps (1e-8) in AdaBelief-pytorch (1e-7 in Tensorflow).
    If SGD is worse than Adam -> Set a small eps (1e-16) in AdaBelief-pytorch (1e-14 in Tensorflow; rectify=True often helps).

    If AdamW is better than Adam -> Turn on “weight_decouple” in AdaBelief-pytorch (this is always on in adabelief-tf==0.1.0 and cannot be turned off).
    Note that the default weight decay is very different for Adam and AdamW; you might need to consider this when using AdaBelief with and without decoupled weight decay.

  • Check ALL hyper-parameters. Refer to our github page for a list of recommended hyper-parameters.
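
The epsilon rules of thumb in this guide can be written down as a small helper. This is only a sketch: the function names adabelief_eps_from_adam and suggest_adabelief_eps are hypothetical and are not part of either package, and the returned values are the ones quoted in this guide, meant as starting points for tuning.

```python
def adabelief_eps_from_adam(adam_eps):
    """Rule of thumb from this guide: eps_adabelief = eps_adam * eps_adam."""
    return adam_eps * adam_eps


def suggest_adabelief_eps(sgd_beats_adam, framework="pytorch"):
    """Return a starting epsilon for AdaBelief (hypothetical helper).

    sgd_beats_adam: True if SGD works better than Adam on your task.
    framework: "pytorch" or "tensorflow".
    """
    starting_points = {
        ("pytorch", True): 1e-8,       # large eps, less adaptive
        ("pytorch", False): 1e-16,     # small eps, more adaptive
        ("tensorflow", True): 1e-7,
        ("tensorflow", False): 1e-14,
    }
    return starting_points[(framework, sgd_beats_adam)]
```

For example, a task tuned with Adam at eps=1e-8 in PyTorch would start from adabelief_eps_from_adam(1e-8), which is roughly 1e-16.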

Table of Hyper-parameters

Please check that you have specified all arguments and that your version is the latest. The defaults might not be suitable for every task; see the tables below.

Hyper-parameters in PyTorch

  • Note that weight decay varies with tasks. For each task, the weight decay is left untuned from the original repository (only the optimizer and other hyper-parameters were changed).

| Task | lr | beta1 | beta2 | epsilon | weight_decay | weight_decouple | rectify | fixed_decay | amsgrad |
|---|---|---|---|---|---|---|---|---|---|
| Cifar | 1e-3 | 0.9 | 0.999 | 1e-8 | 5e-4 | False | False | False | False |
| ImageNet | 1e-3 | 0.9 | 0.999 | 1e-8 | 1e-2 | True | False | False | False |
| Object detection (PASCAL) | 1e-4 | 0.9 | 0.999 | 1e-8 | 1e-4 | False | False | False | False |
| LSTM-1layer | 1e-3 | 0.9 | 0.999 | 1e-16 | 1.2e-6 | False | False | False | False |
| LSTM 2,3 layer | 1e-2 | 0.9 | 0.999 | 1e-12 | 1.2e-6 | False | False | False | False |
| GAN (small) | 2e-4 | 0.5 | 0.999 | 1e-12 | 0 | True=False (decay=0) | False | False | False |
| SN-GAN (large) | 2e-4 | 0.5 | 0.999 | 1e-16 | 0 | True=False (decay=0) | True | False | False |
| Transformer | 5e-4 | 0.9 | 0.999 | 1e-16 | 1e-4 | True | True | False | False |
| Reinforcement (Rainbow) | 1e-4 | 0.9 | 0.999 | 1e-10 | 0.0 | True=False (decay=0) | True | False | False |
| Reinforcement (HalfCheetah-v2) | 1e-3 | 0.9 | 0.999 | 1e-12 | 0.0 | True=False (decay=0) | True | False | False |

Hyper-parameters in Tensorflow (eps in Tensorflow might need to be larger than in PyTorch)

epsilon is used in a different way in Tensorflow (default 1e-7) compared to PyTorch (default 1e-8), so eps in Tensorflow might need to be larger than in PyTorch (perhaps 100 times larger, e.g. eps=1e-16 in PyTorch vs. eps=1e-14 in Tensorflow). Personally I don't have much experience with Tensorflow, so you will likely need to tune eps slightly.

Installation and usage

1. PyTorch implementations

(Results in the paper are all generated using the PyTorch implementation in the adabelief-pytorch package, which is the ONLY package that I have extensively tested for now.)

AdaBelief

Please install the latest version (0.2.0); the previous version (0.0.5) uses different default arguments.

pip install adabelief-pytorch==0.2.0
from adabelief_pytorch import AdaBelief
optimizer = AdaBelief(model.parameters(), lr=1e-3, eps=1e-16, betas=(0.9,0.999), weight_decouple = True, rectify = False)

Adabelief with Ranger optimizer

pip install ranger-adabelief==0.1.0
from ranger_adabelief import RangerAdaBelief
optimizer = RangerAdaBelief(model.parameters(), lr=1e-3, eps=1e-12, betas=(0.9,0.999))

2. Tensorflow implementation (eps of AdaBelief in Tensorflow is larger than in PyTorch, same for Adam)

pip install adabelief-tf==0.2.0
from adabelief_tf import AdaBeliefOptimizer
optimizer = AdaBeliefOptimizer(learning_rate=1e-3, epsilon=1e-14, rectify=False)

A quick look at the algorithm

Adam and AdaBelief are summarized in Algo.1 and Algo.2, where all operations are element-wise, with differences marked in blue. Note that no extra parameters are introduced in AdaBelief. For simplicity, we omit the bias correction step. Specifically, in Adam, the update direction is $m_t/\sqrt{v_t}$, where $v_t$ is the EMA (Exponential Moving Average) of $g_t^2$; in AdaBelief, the update direction is $m_t/\sqrt{s_t}$, where $s_t$ is the EMA of $(g_t-m_t)^2$. Intuitively, viewing $m_t$ as the prediction of $g_t$, AdaBelief takes a large step when observation $g_t$ is close to prediction $m_t$, and a small step when the observation greatly deviates from the prediction.
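
The difference between the two denominators can be illustrated with a scalar sketch in plain Python. This is not the packaged implementation: it omits bias correction (as the description above does), works on a single number instead of a tensor, and adds eps only to the square root, which is an assumption of this sketch. The function names are hypothetical.

```python
import math


def adam_direction(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update direction m_t / sqrt(v_t) after seeing all `grads` (no bias correction)."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g        # EMA of g_t
        v = beta2 * v + (1.0 - beta2) * g * g    # EMA of g_t^2
    return m / (math.sqrt(v) + eps)


def adabelief_direction(grads, beta1=0.9, beta2=0.999, eps=1e-16):
    """Update direction m_t / sqrt(s_t) after seeing all `grads` (no bias correction)."""
    m, s = 0.0, 0.0
    for g in grads:
        m = beta1 * m + (1.0 - beta1) * g               # EMA of g_t (the "prediction")
        s = beta2 * s + (1.0 - beta2) * (g - m) ** 2    # EMA of (g_t - m_t)^2
    return m / (math.sqrt(s) + eps)


# Steady gradients: observation matches prediction, so AdaBelief steps further.
steady = [1.0] * 200
# Noisy gradients with the same mean: observation deviates from prediction.
noisy = [3.0, -1.0] * 100

ratio_steady = adabelief_direction(steady) / adam_direction(steady)
ratio_noisy = adabelief_direction(noisy) / adam_direction(noisy)
print(ratio_steady, ratio_noisy)
```

With the steady sequence the AdaBelief direction is several times larger than Adam's, while with the noisy sequence the two are close, which matches the intuition of trusting gradients that agree with their prediction.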

Reproduce results in the paper

(Comparison with 8 other optimizers: SGD, Adam, AdaBound, RAdam, AdamW, Yogi, MSVAG, Fromage)

See folder PyTorch_Experiments. In each subfolder, execute sh run.sh. See readme.txt in each subfolder for visualization, or refer to the jupyter notebook.

Results on Image Recognition

Results on GAN training

Results on a small GAN with vanilla CNN generator

Results on Spectral Normalization GAN with a ResNet generator

Results on LSTM

Results on Transformer

Results on Toy Example

Discussions

Installation

Please install the latest version from pip; old versions might suffer from bugs. Source code for the up-to-date package is available in folder pypi_packages.

Discussion on hyper-parameters

AdaBelief uses a different denominator from Adam, and is orthogonal to other techniques such as rectification, decoupled weight decay, weight averaging, etc. This implies that if you use such techniques with Adam, you might still need them to get a good result with AdaBelief.

  • epsilon in AdaBelief plays a different role than in Adam: typically, when you use epsilon=x in Adam, using epsilon=x*x will give similar results in AdaBelief. The default value epsilon=1e-8 is not a good option in many cases; in version >0.1.0 the default eps is set as 1e-16.

  • If your task needs a "non-adaptive" optimizer, which means SGD performs much better than Adam(W), such as on image recognition, you need to set a large epsilon (e.g. 1e-8) for AdaBelief to make it more non-adaptive; if your task needs a really adaptive optimizer, which means Adam is much better than SGD, such as GAN and Transformer, then the recommended epsilon for AdaBelief is small (1e-12, 1e-16 ...).

  • If decoupled weight decay is very important for your task, which means AdamW is much better than Adam, then you need to set weight_decouple as True to turn on decoupled decay in AdaBelief. Note that many optimizers use decoupled weight decay without exposing it as an option, e.g. RAdam, but we provide it as an option so users are aware of which technique is actually used.

  • Don't use "gradient threshold" (clamp each element independently) in AdaBelief; it could result in division by 0 and explosion in the update. "Gradient clip" (shrink the amplitude of the gradient vector but keep its direction) is fine, though from my limited experience the clip range sometimes needs to be the same as or larger than for Adam.

Discussion on algorithms

1. Weight Decay:
  • Decoupling (argument weight_decouple appears in AdaBelief and RangerAdaBelief):
    Currently there are two ways to perform weight decay for adaptive optimizers: directly apply it to the gradient (Adam), or decouple weight decay from gradient descent (AdamW). This is passed to the optimizer by the argument weight_decouple (default in adabelief-pytorch: False in version 0.0.5, True in version 0.2.0).

  • Fixed ratio (argument fixed_decay (default: False) appears in AdaBelief):
    (1) If weight_decouple == False, then this argument does not affect optimization.
    (2) If weight_decouple == True:

      If fixed_decay == False, the weight is multiplied by 1 - lr x weight_decay.
      If fixed_decay == True, the weight is multiplied by 1 - weight_decay. This is implemented as an option but not used to produce results in the paper.

  • What is the actual weight decay we are using?
    This is seldom discussed in the literature, but personally I think it's very important. When we set weight_decay=1e-4 for SGD, the weight is scaled by 1 - lr x weight_decay. Two points need to be emphasized: (1) lr in SGD is typically larger than in Adam (0.1 vs 0.001), so the weight decay in Adam needs to be set to a larger number to compensate. (2) lr decays, which means we typically use a larger weight decay in early phases and a smaller weight decay in late phases.
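
The three weight-decay variants described in this list can be sketched for a single scalar weight as follows. This is an illustrative sketch only, not the code from the package; the function name apply_weight_decay is hypothetical, and the gradient-based part of the update is left out.

```python
def apply_weight_decay(weight, grad, lr, weight_decay,
                       weight_decouple, fixed_decay=False):
    """Return (weight, grad) after only the weight-decay part of one step."""
    if weight_decouple:
        # Decoupled (AdamW-style): shrink the weight directly.
        if fixed_decay:
            factor = 1.0 - weight_decay          # fixed ratio
        else:
            factor = 1.0 - lr * weight_decay     # ratio follows the learning rate
        return weight * factor, grad
    # Coupled (Adam-style): fold the decay into the gradient;
    # fixed_decay has no effect in this branch.
    return weight, grad + weight_decay * weight


# Effective per-step shrink 1 - lr * weight_decay for typical settings:
sgd_shrink = 1.0 - 0.1 * 1e-4      # SGD-like lr=0.1
adam_shrink = 1.0 - 0.001 * 1e-4   # Adam-like lr=0.001, same weight_decay
print(sgd_shrink, adam_shrink)
```

The two printed factors show point (1) above: with the same weight_decay, the smaller learning rate yields a much weaker per-step shrink, which is why a larger weight_decay is often needed with Adam-style learning rates.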

2. Epsilon:

AdaBelief seems to require a different epsilon from Adam. In the CV tasks in the paper, epsilon is set as 1e-8. For GAN training it's set as 1e-16. We recommend trying different epsilon values in practice, sweeping over a large range. We recommend eps=1e-8 when SGD outperforms Adam, such as in many CV tasks, and eps=1e-16 when Adam outperforms SGD, such as for GAN and Transformer. Sometimes you might need to try eps=1e-12, such as in some reinforcement learning tasks.

3. Rectify (argument rectify in AdaBelief):

Whether to turn on the rectification as in RAdam. The rectification basically uses SGD in early phases for warmup, then switches to Adam. Rectification is implemented as an option, but is never used to produce results in the paper.

4. AMSgrad (argument amsgrad (default: False) in AdaBelief):

Whether to take the max (over history) of the denominator, same as AMSGrad. It's set as False for all experiments.
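
"Taking the max over history" can be sketched on a sequence of scalar denominator values. This is a simplified illustration with a hypothetical function name; a real optimizer would keep the running maximum per parameter element rather than per scalar.

```python
def amsgrad_denominators(s_values):
    """Replace each denominator term by the largest value seen so far."""
    running_max = 0.0
    result = []
    for s in s_values:
        running_max = max(running_max, s)
        result.append(running_max)
    return result


print(amsgrad_denominators([0.1, 0.5, 0.2, 0.7, 0.3]))
```

Because the denominator can never decrease, the effective step size can never grow again after a large value has been seen.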

5. Details to reproduce results
  • Results in the paper are generated using the PyTorch implementation in the adabelief-pytorch package. This is the ONLY package that I have extensively tested for now.
  • We also provide a modification of the ranger optimizer in ranger-adabelief, which combines RAdam + LookAhead + Gradient Centralization + AdaBelief, but this is not used in the paper and is not extensively tested.
  • The original adabelief-tf (0.0.1) is a naive implementation in Tensorflow. It lacks many features such as decoupled weight decay, and is not extensively tested. Currently I don't have plans to improve it since I seldom use Tensorflow; please contact me if you want to collaborate and improve it.
  • adabelief-tf==0.1.0 supports the same features as adabelief-pytorch==0.1.0, including decoupled weight decay and rectification. But personally I have not had the chance to perform extensive tests as with the PyTorch version.
6. Learning rate schedule

The experiments on Cifar are the same as the demo in AdaBound, with the only difference being the optimizer. The ImageNet experiment uses a different learning rate schedule: typically the learning rate is decayed by 1/10 at epochs 30 and 60, and training ends at epoch 90. For reasons I have not extensively investigated, AdaBelief performs well when the learning rate is decayed at epochs 70 and 80, with training ending at epoch 90; using the default lr schedule produces a slightly worse result. If you have any ideas on this, please open an issue here or email me.
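
The two ImageNet schedules can be compared with a small step-decay function. This is a sketch with a hypothetical function name; the base learning rate 1e-3 is the ImageNet value from the hyper-parameter table, and the decay factor 0.1 follows the "decayed by 1/10" description.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at `epoch` when lr is multiplied by `gamma` at each milestone."""
    milestones_passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** milestones_passed


default_schedule = [step_lr(e, 1e-3, (30, 60)) for e in range(90)]
late_schedule = [step_lr(e, 1e-3, (70, 80)) for e in range(90)]

# At epoch 45 the default schedule has already decayed once; the late one has not.
print(default_schedule[45], late_schedule[45])
```

In PyTorch, the same shape of schedule is available as torch.optim.lr_scheduler.MultiStepLR with milestones=[70, 80] and gamma=0.1.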

7. Some experience with RNN

I got some feedback on RNNs from the reddit discussion. Here are a few tips:

  • For RNNs, it is suggested to set epsilon to a smaller value (e.g. 1e-12, 1e-16). Please try different epsilon values; it varies from task to task.
  • I might have confused "gradient threshold" with "gradient clip" in the previous readme, so I clarify below:
    (1) By "gradient threshold" I refer to an element-wise operation that restricts values to a certain range [a,b]. Values below a are set to a, and values above b are set to b.
    (2) By "gradient clip" I refer to the operation on a vector or tensor. Suppose X is a tensor; if ||X|| > thres, then X <- X/||X|| * thres. Taking X as a vector, "gradient clip" shrinks the amplitude but keeps the direction.
    (3) "Gradient threshold" is incompatible with AdaBelief, because if g_t is thresholded for a long time, then |g_t-m_t|~=0, and the division will explode; however, "gradient clip" is fine for AdaBelief, yet the clip range still needs tuning (perhaps AdaBelief needs a larger range than Adam).
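
The two operations defined in (1) and (2) can be sketched on a plain list of floats. The function names are hypothetical and the code is only an illustration of the definitions, not what any training script in this repository uses.

```python
import math


def gradient_threshold(grad, low, high):
    """Element-wise: force every entry into [low, high] independently."""
    return [min(max(g, low), high) for g in grad]


def gradient_clip(grad, max_norm):
    """Vector-wise: rescale the whole vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


grad = [3.0, -4.0]
print(gradient_threshold(grad, -1.0, 1.0))  # direction changes
print(gradient_clip(grad, 1.0))             # direction is kept, norm becomes 1
```

Thresholding turns every large entry into the same constant, so successive gradients look identical and g_t - m_t shrinks toward zero, which is the failure mode described in (3). In PyTorch, the closest built-in counterparts are torch.nn.utils.clip_grad_value_ (element-wise) and torch.nn.utils.clip_grad_norm_ (norm-based).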
8. Contact

Please contact me at [email protected] or open an issue here if you would like to help improve it, especially the tensorflow version, discuss the theory, or explore combinations with other methods to create a better optimizer. Any thoughts are welcome!

Update Plan

To do

Done

  • Updated results on an SN-GAN are in https://github.com/juntang-zhuang/SNGAN-AdaBelief. AdaBelief achieves 12.36 FID (lower is better) on Cifar10, while Adam achieves 13.25 (number taken from the log of the official repository PyTorch-studioGAN).
  • LSTM experiments uploaded to PyTorch_Experiments/LSTM
  • Identified the problem of Transformer with PyTorch 1.4: an old version of fairseq is incompatible with the new version of PyTorch; it works fine with the latest fairseq.
    Code on Transformer to work with PyTorch 1.6 is at: https://github.com/juntang-zhuang/fairseq-adabelief
    Code for transformer to work with PyTorch 1.1 and CUDA9.0 is at: https://github.com/juntang-zhuang/transformer-adabelief
  • Tested on a toy example of reinforcement learning.
  • Released adabelief-pytorch==0.1.0 and adabelief-tf==0.1.0. The Tensorflow version now supports TF>=2.0 and Keras, with the same features as in the PyTorch version, including decoupled weight decay and rectification.
  • Released adabelief-pytorch==0.2.0. Fixed the error with coupled weight decay in adabelief-pytorch==0.1.0, and fixed the amsgrad update in adabelief-pytorch==0.1.0. Added an option to disable the message printing, by specifying print_change_log=False when initiating the optimizer.
  • Released adabelief-tf==0.2.0. Added an option to disable the message printing, by specifying print_change_log=False when initiating the optimizer. Deleted redundant computations, so 0.2.0 should be faster than 0.1.0. Removed dependencies on tensorflow-addons.
  • adabelief-pytorch==0.2.1 is compatible with mixed-precision training.

Citation

@article{zhuang2020adabelief,
  title={AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients},
  author={Zhuang, Juntang and Tang, Tommy and Ding, Yifan and Tatikonda, Sekhar and Dvornek, Nicha and Papademetris, Xenophon and Duncan, James},
  journal={Conference on Neural Information Processing Systems},
  year={2020}
}
Owner
Juntang Zhuang