A tutorial on "Bayesian Compression for Deep Learning" published at NIPS (2017).

Overview

Code release for "Bayesian Compression for Deep Learning"

In "Bayesian Compression for Deep Learning" we adopt a Bayesian view for the compression of neural networks. By revisiting the connection between the minimum description length principle and variational inference we are able to achieve up to 700x compression and up to 50x speed up (CPU to sparse GPU) for neural networks.

We visualize the learning process in the following figures for a dense network with 300 and 100 connections. White color represents redundancy whereas red and blue represent positive and negative weights respectively.

First layer weights Second Layer weights
alt text alt text

For dense networks it is also simple to reconstruct input feature importance. We show this for a mask and 5 randomly chosen digits. alt text

Results

Model Method Error [%] Compression
after pruning
Compression after
precision reduction
LeNet-5-Caffe DC 0.7 6* -
DNS 0.9 55* -
SWS 1.0 100* -
Sparse VD 1.0 63* 228
BC-GNJ 1.0 108* 361
BC-GHS 1.0 156* 419
VGG BC-GNJ 8.6 14* 56
BC-GHS 9.0 18* 59

Usage

We provide an implementation in PyTorch for fully connected and convolutional layers for the group normal-Jeffreys prior (aka Group Variational Dropout) via:

import BayesianLayers

The layers can be then straightforwardly included eas follows:

    class Net(nn.Module):
        def __init__(self):
            super(Net, self).__init__()
            # activation
            self.relu = nn.ReLU()
            # layers
            self.fc1 = BayesianLayers.LinearGroupNJ(28 * 28, 300, clip_var=0.04)
            self.fc2 = BayesianLayers.LinearGroupNJ(300, 100)
            self.fc3 = BayesianLayers.LinearGroupNJ(100, 10)
            # layers including kl_divergence
            self.kl_list = [self.fc1, self.fc2, self.fc3]

        def forward(self, x):
            x = x.view(-1, 28 * 28)
            x = self.relu(self.fc1(x))
            x = self.relu(self.fc2(x))
            return self.fc3(x)

        def kl_divergence(self):
            KLD = 0
            for layer in self.kl_list:
                KLD += layer.kl_divergence()
            return KLD

The only additional effort is to include the KL-divergence in the objective. This is necessary if we want to the optimize the variational lower bound that leads to sparse solutions:

N = 60000.
discrimination_loss = nn.functional.cross_entropy

def objective(output, target, kl_divergence):
    discrimination_error = discrimination_loss(output, target)
    return discrimination_error + kl_divergence / N

Run an example

We provide a simple example, the LeNet-300-100 trained with the group normal-Jeffreys prior:

python example.py

Retraining a regular neural network

Instead of training a network from scratch we often need to compress an already existing network. In this case we can simply initialize the weights with those of the pretrained network:

    BayesianLayers.LinearGroupNJ(28*28, 300, init_weight=pretrained_weight, init_bias=pretrained_bias)

Reference

The paper "Bayesian Compression for Deep Learning" has been accepted to NIPS 2017. Please cite us:

@article{louizos2017bayesian,
  title={Bayesian Compression for Deep Learning},
  author={Louizos, Christos and Ullrich, Karen and Welling, Max},
  journal={Conference on Neural Information Processing Systems (NIPS)},
  year={2017}
}
Owner
Karen Ullrich
Research scientist (s/h) at FAIR NY + collab. w/ Vector Institute. <3 Deep Learning + Information Theory. Previously, Machine Learning PhD at UoAmsterdam.
Karen Ullrich
An implementation of Performer, a linear attention-based transformer, in Pytorch

Performer - Pytorch An implementation of Performer, a linear attention-based transformer variant with a Fast Attention Via positive Orthogonal Random

Phil Wang 900 Dec 22, 2022
A tiny package to compare two neural networks in PyTorch

Compare neural networks by their feature similarity

Anand Krishnamoorthy 180 Dec 30, 2022
A simple way to train and use PyTorch models with multi-GPU, TPU, mixed-precision

🤗 Accelerate was created for PyTorch users who like to write the training loop of PyTorch models but are reluctant to write and maintain the boilerplate code needed to use multi-GPUs/TPU/fp16.

Hugging Face 3.5k Jan 08, 2023
Fast and Easy-to-use Distributed Graph Learning for PyTorch Geometric

Fast and Easy-to-use Distributed Graph Learning for PyTorch Geometric

Quiver Team 221 Dec 22, 2022
PyTorch implementations of normalizing flow and its variants.

PyTorch implementations of normalizing flow and its variants.

Tatsuya Yatagawa 55 Dec 01, 2022
PyTorch Implementation of [1611.06440] Pruning Convolutional Neural Networks for Resource Efficient Inference

PyTorch implementation of [1611.06440 Pruning Convolutional Neural Networks for Resource Efficient Inference] This demonstrates pruning a VGG16 based

Jacob Gildenblat 836 Dec 26, 2022
Pytorch implementation of Distributed Proximal Policy Optimization

Pytorch-DPPO Pytorch implementation of Distributed Proximal Policy Optimization: https://arxiv.org/abs/1707.02286 Using PPO with clip loss (from https

Alexis David Jacq 164 Jan 05, 2023
TorchSSL: A PyTorch-based Toolbox for Semi-Supervised Learning

TorchSSL: A PyTorch-based Toolbox for Semi-Supervised Learning

1k Dec 28, 2022
A PyTorch implementation of Learning to learn by gradient descent by gradient descent

Intro PyTorch implementation of Learning to learn by gradient descent by gradient descent. Run python main.py TODO Initial implementation Toy data LST

Ilya Kostrikov 300 Dec 11, 2022
higher is a pytorch library allowing users to obtain higher order gradients over losses spanning training loops rather than individual training steps.

higher is a library providing support for higher-order optimization, e.g. through unrolled first-order optimization loops, of "meta" aspects of these

Facebook Research 1.5k Jan 03, 2023
You like pytorch? You like micrograd? You love tinygrad! ❤️

For something in between a pytorch and a karpathy/micrograd This may not be the best deep learning framework, but it is a deep learning framework. Due

George Hotz 9.7k Jan 05, 2023
3D-RETR: End-to-End Single and Multi-View3D Reconstruction with Transformers

3D-RETR: End-to-End Single and Multi-View 3D Reconstruction with Transformers (BMVC 2021) Zai Shi*, Zhao Meng*, Yiran Xing, Yunpu Ma, Roger Wattenhofe

Zai Shi 36 Dec 21, 2022
270 Dec 24, 2022
PyTorch framework A simple and complete framework for PyTorch, providing a variety of data loading and simple task solutions that are easy to extend and migrate

PyTorch framework A simple and complete framework for PyTorch, providing a variety of data loading and simple task solutions that are easy to extend and migrate

Cong Cai 12 Dec 19, 2021
A very simple and small path tracer written in pytorch meant to be run on the GPU

MentisOculi Pytorch Path Tracer A very simple and small path tracer written in pytorch meant to be run on the GPU Why use pytorch and not some other c

Matthew B. Mirman 222 Dec 01, 2022
TorchShard is a lightweight engine for slicing a PyTorch tensor into parallel shards

TorchShard is a lightweight engine for slicing a PyTorch tensor into parallel shards. It can reduce GPU memory and scale up the training when the model has massive linear layers (e.g., ViT, BERT and

Kaiyu Yue 275 Nov 22, 2022
Tacotron 2 - PyTorch implementation with faster-than-realtime inference

Tacotron 2 (without wavenet) PyTorch implementation of Natural TTS Synthesis By Conditioning Wavenet On Mel Spectrogram Predictions. This implementati

NVIDIA Corporation 4.1k Jan 03, 2023
An optimizer that trains as fast as Adam and as good as SGD.

AdaBound An optimizer that trains as fast as Adam and as good as SGD, for developing state-of-the-art deep learning models on a wide variety of popula

LoLo 2.9k Dec 27, 2022
Code snippets created for the PyTorch discussion board

PyTorch misc Collection of code snippets I've written for the PyTorch discussion board. All scripts were testes using the PyTorch 1.0 preview and torc

461 Dec 26, 2022
On the Variance of the Adaptive Learning Rate and Beyond

RAdam On the Variance of the Adaptive Learning Rate and Beyond We are in an early-release beta. Expect some adventures and rough edges. Table of Conte

Liyuan Liu 2.5k Dec 27, 2022