ESGD-M - A stochastic non-convex second order optimizer, suitable for training deep learning models, for PyTorch

Last update: Dec 29, 2022

Related tags

Overview

ESGD-M

ESGD-M is a stochastic non-convex second order optimizer, suitable for training deep learning models. It is based on ESGD (Equilibrated adaptive learning rates for non-convex optimization) and incorporates quasi-hyperbolic momentum (Quasi-hyperbolic momentum and Adam for deep learning) to accelerate convergence, which considerably improves its performance over plain ESGD.

ESGD-M obtains Hessian information through occasional Hessian-vector products (by default, every ten optimizer steps; each Hessian-vector product is approximately the same cost as a gradient evaluation) and uses it to adapt per-parameter learning rates. It estimates the diagonal of the absolute Hessian, diag(|H|), to use as a diagonal preconditioner.

To use this optimizer you must call .backward() with the create_graph=True option. Gradient accumulation steps and distributed training are currently not supported.

Learning rates

ESGD-M learning rates have a different meaning from SGD and Adagrad/Adam/etc. You may need to try learning rates in the range 1e-3 to 1.

SGD class optimizers:

If you rescale your parameters by a factor of n, you must scale your learning rate by a factor of n^2.
If you rescale your loss by a factor of n, you must scale your learning rate by a factor of 1 / n.

Adagrad/Adam class optimizers:

If you rescale your parameters by a factor of n, you must scale your learning rate by a factor of n.
If you rescale your loss by a factor of n, you do not have to scale your learning rate.

Second order optimizers (including ESGD-M):

You do not have to scale your learning rate if you rescale either your parameters or your loss.

Momentum

The default configuration is Nesterov momentum (if v is not specified then it will default to the value of beta_1, producing Nesterov momentum):

opt = ESGD(model.parameters(), lr=1, betas=(0.9, 0.999), v=0.9)

The Quasi-Hyperbolic Momentum recommended defaults can be obtained using:

opt = ESGD(model.parameters(), lr=1, betas=(0.999, 0.999), v=0.7)

Setting v equal to 1 will do normal (non-Nesterov) momentum.

The ESGD-M decay coefficient beta_2 refers not to the squared gradient as in Adam but to the squared Hessian diagonal estimate, which it uses in place of the squared gradient to provide per-parameter adaptive learning rates.

Hessian-vector products

The absolute Hessian diagonal diag(|H|) is estimated every update_d_every steps. The default is 10. Also, for the first d_warmup steps the diagonal will be estimated regardless, to obtain a lower variance estimate of diag(|H|) quickly. The estimation uses a Hessian-vector product, which takes around the same amount of time as a gradient evaluation to compute. You must explicitly signal to PyTorch that you want to do a double backward pass by:

opt.zero_grad(set_to_none=True)
loss = loss_fn(model(inputs), targets)
loss.backward(create_graph=True)
opt.step()

Weight decay

Weight decay is performed separately from the Hessian-vector product and the preconditioner, similar to AdamW except that the weight decay value provided by the user is multiplied by the current learning rate to determine the factor to decay the weights by.

Learning rate warmup

Because the diag(|H|) estimates are high variance, the adaptive learning rates are not very reliable before many steps have been taken and many estimates have been averaged together. To deal with this ESGD-M has a short exponential learning rate warmup by default (it is combined with any external learning rate schedulers). On each step (starting from 1) the learning rate will be:

lr * (1 - lr_warmup**step)

The default value for lr_warmup is 0.99, which reaches 63% of the specified learning rate in 100 steps and 95% in 300 steps.

ESGD-M - A stochastic non-convex second order optimizer, suitable for training deep learning models, for PyTorch

Related tags

Overview

ESGD-M

Learning rates

Momentum

Hessian-vector products

Weight decay

Learning rate warmup

Owner

Katherine Crowson

2D Time independent Schrodinger equation solver for arbitrary shape of well

An Approach to Explore Logistic Regression Models

ImageNet-CoG is a benchmark for concept generalization. It provides a full evaluation framework for pre-trained visual representations which measure how well they generalize to unseen concepts.

Few-Shot Graph Learning for Molecular Property Prediction

This is the offical website for paper ''Category-consistent deep network learning for accurate vehicle logo recognition''

SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

A python package simulating the quasi-2D pseudospin-1/2 Gross-Pitaevskii equation with NVIDIA GPU acceleration.

Instant Real-Time Example-Based Style Transfer to Facial Videos

NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions (CVPR2021)

Repo for flood prediction using LSTMs and HAND

A Confidence-based Iterative Solver of Depths and Surface Normals for Deep Multi-view Stereo

A highly modular PyTorch framework with a focus on Neural Architecture Search (NAS).

This is a pytorch implementation for the BST model from Alibaba https://arxiv.org/pdf/1905.06874.pdf

Object DGCNN and DETR3D, Our implementations are built on top of MMdetection3D.

Testing and Estimation of structural breaks in Stata

Pytorch implemenation of Stochastic Multi-Label Image-to-image Translation (SMIT)

Single-Stage Instance Shadow Detection with Bidirectional Relation Learning (CVPR 2021 Oral)

Complementary Patch for Weakly Supervised Semantic Segmentation, ICCV21 (poster)

Alias-Free Generative Adversarial Networks (StyleGAN3) Official PyTorch implementation

The GitHub repository for the paper: “Time Series is a Special Sequence: Forecasting with Sample Convolution and Interaction“.