This is an early, in-development codebase for training CLIP models with hivemind.

Overview

A transformer that does not hog your GPU memory

This is an early, in-development codebase: if you want a stable and documented hivemind codebase, look at CALM or dalle-hivemind.

Readme under construction

LeanTransformer implements a specific flavor of the transformer architecture with two goals in mind:

  • using as little GPU memory as possible
  • stable training for very large models

The core philosophy of LeanTransformer is to replace torch.autograd with grad students. Automatic differentiation is great if you want to test ideas quickly, less so if a single training run can cost over $4 million (or >1000 years in grad school).

Related work: GSO

Our implementation partially replaces automatic differentiation with Grad Student Optimization (GSO) - a biologically inspired black-box optimization algorithm. In the past, GSO has seen widespread adoption thanks to its strong theoretical foundations and unparalleled cost efficiency (Chom et al). Previous work successfully applied GSO to hyperparameter tuning and natural language generation. To the best of our knowledge, ours is the first work to successfully apply distributed, fault-tolerant GSO to optimizing the memory footprint of transformers. We summarize our findings below:

Memory-saving features:

Other features:

Not implemented:

  • In reversible mode, one can further save memory by computing the backward pass in chunks (see the first sketch after this list):
    • a few tokens at a time for feedforward layers, since grad(concat(mlp(x1), mlp(x2))) = concat(grad(mlp(x1)), grad(mlp(x2)))
    • a few heads at a time for self-attention, since grad(head1 + head2) = grad(head1) + grad(head2), where head1 and head2 are attention outputs after the linear projection
  • Attention could be computed in O(sqrt(n)) memory (Rabe & Staats, 2021) - see the second sketch after this list
  • No sparse or linear attention: they are great for very long sequences, but for large models, attention is not a bottleneck in typical NLP and vision tasks (we tested GPT-3-sized configurations up to sequence length 4096).
  • Per-block gradient scaling as described in (Ramesh et al, 2021) - we rely on Sandwich Norm to maintain stability up to 96 layers (we did not test deeper models). Still, it would be nice to have per-block scaling to avoid the need for an extra LayerNorm.
  • Something else that we missed - please find us on Discord.
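
To make the first point concrete, here is a minimal PyTorch sketch (an illustration of the principle, not LeanTransformer internals; `mlp`, `chunk_size`, and the tensor shapes are invented for the example). Because a feedforward layer acts on each token independently, its forward and backward can be recomputed a few tokens at a time, so only one chunk's activations are alive at any moment:

```python
# Illustrative sketch only; all names here are hypothetical.
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(16, 64), nn.GELU(), nn.Linear(64, 16))
x = torch.randn(8, 16)         # 8 tokens, 16 features each
grad_out = torch.randn(8, 16)  # upstream gradient w.r.t. mlp(x)

# Reference: one big backward pass (keeps activations for all 8 tokens at once).
x_full = x.clone().requires_grad_(True)
mlp(x_full).backward(grad_out)

# Chunked: since grad(concat(mlp(x1), mlp(x2))) = concat(grad(mlp(x1)), grad(mlp(x2))),
# we can re-run forward + backward two tokens at a time; only one chunk's
# activations exist at any moment.
mlp.zero_grad()
x_chunked = x.clone().requires_grad_(True)
chunk_size = 2
for i in range(0, len(x), chunk_size):
    mlp(x_chunked[i:i + chunk_size]).backward(grad_out[i:i + chunk_size])

assert torch.allclose(x_full.grad, x_chunked.grad, atol=1e-6)
```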
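
Similarly, a single-head sketch of chunked attention in the spirit of Rabe & Staats (2021): scores are processed one key/value chunk at a time with a running softmax, so the full n×n score matrix is never materialized. (Their O(sqrt(n)) result additionally chunks the queries and uses checkpointing; this toy, with invented names and shapes, skips that.)

```python
# Illustrative sketch only; not an implemented LeanTransformer feature.
import torch

def chunked_attention(q, k, v, chunk_size=128):
    """Single-head softmax attention computed one key/value chunk at a time.

    Keeps a running max, weighted sum, and normalizer per query, so peak
    memory is O(n_queries * chunk_size) instead of O(n_queries * n_keys).
    """
    n_q, d = q.shape
    out = torch.zeros(n_q, v.shape[1])
    normalizer = torch.zeros(n_q, 1)
    running_max = torch.full((n_q, 1), float("-inf"))
    for start in range(0, k.shape[0], chunk_size):
        k_c = k[start:start + chunk_size]
        v_c = v[start:start + chunk_size]
        scores = q @ k_c.T / d ** 0.5  # (n_q, chunk), never (n_q, n_keys)
        new_max = torch.maximum(running_max, scores.max(-1, keepdim=True).values)
        correction = torch.exp(running_max - new_max)  # rescale old terms
        weights = torch.exp(scores - new_max)
        out = out * correction + weights @ v_c
        normalizer = normalizer * correction + weights.sum(-1, keepdim=True)
        running_max = new_max
    return out / normalizer

# sanity check against the naive implementation that materializes all scores
q, k, v = torch.randn(3, 64, 32).unbind(0)
naive = torch.softmax(q @ k.T / 32 ** 0.5, dim=-1) @ v
assert torch.allclose(chunked_attention(q, k, v, chunk_size=16), naive, atol=1e-5)
```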

A day will come when we explain all of these modifications and provide instructions on how to tune them. But it is not this day! Until then, we'll happily answer any questions on our Discord.

Running the code

[under construction] - use the instructions from the CALM readme

Acknowledgements:

  • Most of the architecture and stability optimizations were learned through the BigScience research workshop
  • YSDA community helped us survive the early, messy versions of this code
  • NeuroPark trained the first practical model (SahajBERT-XL, SoTA in Bengali, details here)
  • TODO DALLE community: at least mention the demo, maybe we end up training something even cooler
  • TODO NCAI community: ask them how best to acknowledge them
  • TODO Hugging Face: ask them how best to acknowledge them
  • TODO Personal: stas00, samyam, jared, more? (this does not include co-authors: Tim,Lucile,Quentin,Denis,Gennady,etc; also, this does not include hivemind contributors)