PyTorch Implementation of Vector Quantized Variational AutoEncoders.

Last update: Oct 06, 2021

Related tags

Deep Learning VQVAE-PyTorch

Overview

PyTorch implementation of VQVAE (Vector Quantized Variational AutoEncoder).

The VQVAE paper combines two tricks:

Vector Quantization (check out this amazing blog for better understanding).
Straight-Through estimator (it makes back-propagation possible through the discrete latent variables, whose selection step is not differentiable).
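
The straight-through trick above can be sketched in a few lines. This is a minimal illustration, not necessarily the code used in this repository: the function name `straight_through` and the toy tensors are assumptions made for the example. The forward pass returns the quantized value, while the backward pass copies the gradient to the encoder output as if quantization were the identity.

```python
import torch


def straight_through(z_e: torch.Tensor, z_q: torch.Tensor) -> torch.Tensor:
    """Return z_q in the forward pass, but route gradients to z_e.

    (z_q - z_e).detach() is treated as a constant, so d(out)/d(z_e) = 1.
    """
    return z_e + (z_q - z_e).detach()


# Toy tensors (hypothetical values, for illustration only).
z_e = torch.tensor([[0.2, 0.9]], requires_grad=True)  # encoder output
z_q = torch.tensor([[0.0, 1.0]])  # pretend this came from a codebook lookup

out = straight_through(z_e, z_q)
out.sum().backward()

print(out)       # numerically equal to z_q
print(z_e.grad)  # all ones: the gradient passed straight through
```

Without this trick, the nearest-neighbour selection would block every gradient from the decoder, and the encoder would receive no reconstruction signal.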

This model has a neural network encoder and decoder, and a prior, just like the vanilla Variational AutoEncoder (VAE). But it also has a latent embedding space called the codebook (size: K x D). Here, K is the number of embeddings in the codebook (the size of the discrete latent space) and D is the dimension of each embedding e.
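
One common way to hold a K x D codebook in PyTorch is an `nn.Embedding` layer. The sketch below is an assumption about how such a codebook could be stored, not a description of this repository's code. The sizes `K = 512` and `D = 64` and the uniform initialisation range are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

K, D = 512, 64  # hypothetical: number of embeddings, dimension of each one

# The codebook is a learnable K x D matrix; row i is embedding e_i.
codebook = nn.Embedding(K, D)
# Assumed initialisation (a common choice, not taken from this project).
nn.init.uniform_(codebook.weight, -1.0 / K, 1.0 / K)

# Looking up embeddings by their discrete indices.
indices = torch.tensor([0, 3, 511])
embeddings = codebook(indices)

print(codebook.weight.shape)  # torch.Size([512, 64])
print(embeddings.shape)       # torch.Size([3, 64])
```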

In vanilla variational autoencoders, the encoder output ze(x) is used to parameterize a Normal (Gaussian) distribution, and a latent representation z of the input x is sampled from it using the 'reparameterization trick'. This latent representation is then passed to the decoder. In VQVAEs, however, ze(x) is used as a "key" for a nearest-neighbour lookup into the embedding codebook c, which returns zq(x), the closest embedding in the codebook. This is called the Vector Quantization (VQ) operation. Then, zq(x) is passed to the decoder, which reconstructs the input x. The decoder can either parameterize p(x|z) as the mean of a Normal distribution using transposed convolution layers, as in a vanilla VAE, or it can autoregressively generate a categorical distribution over the 256 pixel values [0, 255], like PixelCNN. In this project, the first approach is used.
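
The nearest-neighbour lookup described above can be sketched as follows. This is a minimal example under stated assumptions: the function name `quantize`, the flattened (N, D) input shape, and the toy values are all hypothetical, and the repository's own implementation may differ.

```python
import torch


def quantize(z_e: torch.Tensor, codebook: torch.Tensor):
    """Map each encoder output vector to its closest codebook embedding.

    z_e:      (N, D) encoder outputs, one row per latent position
    codebook: (K, D) embedding vectors
    """
    # Squared Euclidean distance between every z_e row and every embedding.
    diff = z_e[:, None, :] - codebook[None, :, :]  # (N, K, D)
    distances = (diff ** 2).sum(dim=-1)            # (N, K)
    indices = distances.argmin(dim=1)              # (N,) discrete codes
    z_q = codebook[indices]                        # (N, D) quantized vectors
    return z_q, indices


# Toy codebook with K = 3 embeddings of dimension D = 2.
codebook = torch.tensor([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = torch.tensor([[0.9, 1.2], [-0.8, 1.7], [0.1, -0.2]])

z_q, indices = quantize(z_e, codebook)
print(indices)  # tensor([1, 2, 0])
```

For image inputs, the encoder output of shape (B, D, H, W) would typically be permuted and reshaped to (B*H*W, D) before a lookup like this one.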

The loss function is composed of 3 components:

Regular Reconstruction loss
Vector Quantization loss
Commitment loss

The Vector Quantization loss, ||sg[ze(x)] - e||^2, moves the selected codebook embeddings closer to the encoder output. The Commitment loss, ||ze(x) - sg[e]||^2, encourages the encoder output to stay close to the embedding it picked, so that it commits to its codebook embedding. The commitment loss is multiplied by a constant beta, which is 1.0 for this project. Here, sg means "stop-gradient": the term inside sg[.] is treated as a constant, so no gradients are propagated through it.
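
The three loss terms can be sketched as below, with stop-gradient written as `.detach()`. This is an illustration under assumptions rather than the repository's exact code: the function name `vqvae_loss` is hypothetical, the reconstruction term is assumed to be mean squared error, and `F.mse_loss` averages over elements instead of summing the squared norm.

```python
import torch
import torch.nn.functional as F


def vqvae_loss(x, x_recon, z_e, e, beta=1.0):
    """Reconstruction + vector quantization + beta * commitment loss."""
    recon_loss = F.mse_loss(x_recon, x)            # assumed MSE reconstruction
    vq_loss = F.mse_loss(e, z_e.detach())          # ||sg[ze(x)] - e||^2
    commitment_loss = F.mse_loss(z_e, e.detach())  # ||ze(x) - sg[e]||^2
    total = recon_loss + vq_loss + beta * commitment_loss
    return total, recon_loss, vq_loss, commitment_loss


# Toy values (hypothetical): a perfect reconstruction, so only the
# quantization and commitment terms contribute.
x = torch.zeros(1, 4)
x_recon = torch.zeros(1, 4)
z_e = torch.tensor([[1.0, 2.0]], requires_grad=True)  # encoder output
e = torch.tensor([[0.0, 0.0]], requires_grad=True)    # selected embedding

total, recon_loss, vq_loss, commitment_loss = vqvae_loss(x, x_recon, z_e, e)
total.backward()

print(total.item())  # 5.0
print(e.grad)        # pulls the embedding toward the encoder output
print(z_e.grad)      # pulls the encoder output toward the embedding
```

The two middle terms have the same value but different gradients: the first updates only the codebook, the second only the encoder.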

Results:

The model is trained on the MNIST and CIFAR10 datasets.

Target 👉 Reconstructed Image

Details:

Trained models for MNIST and CIFAR10 are in the Trained models directory.
The hidden size of the bottleneck (z) is 128 for MNIST and 256 for CIFAR10.

Owner

Vrushank Changawala
