Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes

Overview

This is the repository containing code used for the Unleashing Transformers paper.

front_page_sample

Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes
Sam Bond-Taylor*, Peter Hessey*, Hiroshi Sasaki, Toby P. Breckon, Chris G. Willcocks
* Authors contributed equally

Abstract

Whilst diffusion probabilistic models can generate high quality image content, key limitations remain in terms of both generating high-resolution imagery and their associated high computational requirements. Recent Vector-Quantized image models have overcome this limitation of image resolution but are prohibitively slow and unidirectional as they generate tokens via element-wise autoregressive sampling from the prior. By contrast, in this paper we propose a novel discrete diffusion probabilistic model prior which enables parallel prediction of Vector-Quantized tokens by using an unconstrained Transformer architecture as the backbone. During training, tokens are randomly masked in an order-agnostic manner and the Transformer learns to predict the original tokens. This parallelism of Vector-Quantized token prediction in turn facilitates unconditional generation of globally consistent high-resolution and diverse imagery at a fraction of the computational expense. In this manner, we can generate image resolutions exceeding that of the original training set samples whilst additionally provisioning per-image likelihood estimates (in a departure from generative adversarial approaches). Our approach achieves state-of-the-art results in terms of Density (LSUN Bedroom: 1.51; LSUN Churches: 1.12; FFHQ: 1.20) and Coverage (LSUN Bedroom: 0.83; LSUN Churches: 0.73; FFHQ: 0.80), and performs competitively on FID (LSUN Bedroom: 3.64; LSUN Churches: 4.07; FFHQ: 6.11) whilst offering advantages in terms of both computation and reduced training set requirements.
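The training objective described above is simple to sketch. The following minimal PyTorch snippet illustrates one absorbing-diffusion training step on a sequence of VQ token indices; `model`, `mask_id`, and the timestep count `T` are placeholders, and this is an illustration of the idea rather than the repository's implementation.

```python
# Minimal sketch of one absorbing-diffusion training step (hypothetical
# names, not the repository's code). A random fraction of the VQ token
# sequence is replaced with a MASK token and the Transformer is trained
# to recover the original tokens at the masked positions.
import torch
import torch.nn.functional as F

def absorbing_diffusion_step(model, tokens, mask_id, T=256):
    # tokens: LongTensor (batch, seq_len) of VQ code indices
    b, n = tokens.shape
    t = torch.randint(1, T + 1, (b,), device=tokens.device)  # timestep per sample
    mask = torch.rand(b, n, device=tokens.device) < (t.float() / T)[:, None]
    x_t = tokens.masked_fill(mask, mask_id)                  # absorbed sequence
    logits = model(x_t)                                      # (b, n, vocab_size)
    # Order-agnostic denoising loss: cross-entropy on masked positions only
    return F.cross_entropy(logits[mask], tokens[mask])
```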

arXiv | BibTeX | Project Page

Table of Contents

- Setup
  - Set up conda environment
  - Dataset Setup
- Commands
  - Set up visdom server
  - Train a Vector-Quantized autoencoder on LSUN Churches
  - Train an Absorbing Diffusion sampler using the above Vector-Quantized autoencoder
- Experiments on trained Absorbing Diffusion Sampler
  - Calculate FID
  - Calculate PRDC Scores
  - Calculate ELBO Estimates
  - Find Nearest Neighbours
  - Generate Higher Resolution Samples
- Related Work
- BibTeX

Setup

Currently, a dedicated graphics card capable of running CUDA is required to run the code used in this repository. All models used for the paper were trained on a single NVIDIA RTX 2080 Ti using CUDA version 11.1.

Set up conda environment

To run the code in this repository we recommend you set up a virtual environment using conda. To get set up quickly, use miniconda.

Run the following command to clone this repo using git and create and activate the conda environment unleashing:

git clone https://github.com/samb-t/unleashing-transformers.git && cd unleashing-transformers
conda create --name unleashing --file requirements.yml
conda activate unleashing  

You should now be able to run all commands available in the following sections.

Dataset Setup

To configure the default paths for the datasets used to train the models in this repo, edit datasets.yaml, changing the paths attribute of each dataset you wish to use to the location where that dataset is saved locally.
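As an illustration only, a datasets.yaml entry might look like the snippet below; the exact schema is defined by the file in this repo, so treat the field names here as assumptions.

```yaml
# Illustrative only -- see datasets.yaml in this repo for the real schema.
churches:
  paths: /data/lsun/churches   # local path to your copy of LSUN Churches
ffhq:
  paths: /data/ffhq            # local path to your copy of FFHQ
```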

| Dataset | Official Link | Academic Torrents Link |
| ------- | ------------- | ---------------------- |
| FFHQ    | Official FFHQ | Academic Torrents FFHQ |
| LSUN    | Official LSUN | Academic Torrents LSUN |

Commands

This section contains details on the basic commands for training and calculating metrics on the Absorbing Diffusion models. All training was completed on a single NVIDIA RTX 2080 Ti and these commands presume the same level of hardware. If your GPU has less VRAM than a 2080 Ti then you may need to train using smaller batch sizes and/or smaller models than the defaults.

For a detailed list of all command options, including those for altering model architecture, logging output, checkpointing frequency, etc., add the --help flag to the end of your command.

All commands should be run from the head directory, i.e. the directory containing the README file.

Set up visdom server

Before training, you'll need to start a visdom server in order to easily view model output (loss graphs, reconstructions, etc.). To do this, run the following command:

visdom -p 8097

This starts a visdom server listening on port 8097, the default used by our models. If you navigate to localhost:8097 you will be able to view the live server.

To specify a different port when training any models, use the --visdom_port flag.
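If you want to confirm the server is reachable before starting a long training run, the standard visdom Python client can be used for a quick check; this snippet is a convenience, not part of the training scripts.

```python
# Optional sanity check that the visdom server is up (not part of the
# training scripts).
import visdom

vis = visdom.Visdom(port=8097)  # match the port given to `visdom -p`
assert vis.check_connection(), "No visdom server found on port 8097"
vis.text("unleashing-transformers: visdom connection OK")
```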

Train a Vector-Quantized autoencoder on LSUN Churches

The following command starts the training for a VQGAN on LSUN Churches:

python3 train_vqgan.py --dataset churches --log_dir vqae_churches --amp --batch_size 4

As specified with the --log_dir flag, results will be saved to the directory logs/vqae_churches. This includes all logs, model checkpoints and saved outputs. The --amp flag enables mixed-precision training, necessary for training using a batch size of 4 (the default) on a single 2080 Ti.
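For reference, mixed-precision training with torch.cuda.amp follows the standard autocast/GradScaler pattern sketched below; `model`, `batch`, and `optimizer` are placeholders, and this is not the repository's training loop.

```python
# Standard torch.cuda.amp training-step pattern (illustrative only;
# model, batch, and optimizer are placeholders).
import torch

scaler = torch.cuda.amp.GradScaler()

def amp_step(model, batch, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass in fp16 where safe
        loss = model(batch)
    scaler.scale(loss).backward()     # scale loss to avoid fp16 underflow
    scaler.step(optimizer)            # unscale gradients, then step
    scaler.update()
    return loss.item()
```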

Train an Absorbing Diffusion sampler using the above Vector-Quantized autoencoder

After training the VQ model using the previous command, you'll be able to run the following command to train a discrete diffusion prior on the latent space of the Vector-Quantized model:

python3 train_sampler.py --sampler absorbing --dataset churches --log_dir absorbing_churches --ae_load_dir vqae_churches --ae_load_step 2200000 --amp 

The sampler needs to load the trained Vector-Quantized autoencoder in order to generate the latents it uses for training (and validation). Latents are cached after the first run to speed up subsequent training.
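Conceptually, caching means encoding each training image once with the frozen autoencoder and saving the resulting code grids; a rough sketch follows, where `vqgan.encode` is an assumed interface returning code indices.

```python
# Sketch of latent caching; vqgan.encode is an assumed interface that
# returns a grid of code indices per image.
import torch

@torch.no_grad()
def cache_latents(vqgan, dataloader, out_path):
    vqgan.eval()
    all_codes = []
    for images, _ in dataloader:
        codes = vqgan.encode(images.cuda())      # (batch, h, w) code indices
        all_codes.append(codes.cpu())
    torch.save(torch.cat(all_codes), out_path)   # loaded on subsequent runs
```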

Experiments on trained Absorbing Diffusion Sampler

This section contains simple template commands for calculating metrics and other experiments on trained samplers.

Calculate FID

python experiments/calc_FID.py --sampler absorbing --dataset churches --log_dir FID_log --ae_load_dir vqae_churches --ae_load_step 2200000 --load_dir absorbing_churches --load_step 2000000 --n_samples 50000
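For reference, FID is the Fréchet distance between Gaussians fitted to Inception features of real and generated samples; given precomputed means and covariances it reduces to a few lines. This is a generic sketch, unrelated to calc_FID.py's interface.

```python
# Generic FID from precomputed Inception feature statistics; not the
# interface of calc_FID.py.
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix sqrt
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop numerical imaginary residue
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```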

Calculate PRDC Scores

python experiments/calc_PRDC.py --sampler absorbing --dataset churches --log_dir PRDC_log --ae_load_dir vqae_churches --ae_load_step 2200000 --load_dir absorbing_churches --load_step 2000000 --n_samples 50000
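Density and Coverage, the metrics highlighted in the abstract, follow Naeem et al. (2020): each real feature vector defines a ball whose radius is its k-th nearest-neighbour distance, and generated features are scored against those balls. Below is a compact numpy sketch of the two scores, not calc_PRDC.py's code.

```python
# Sketch of Density/Coverage (Naeem et al., 2020) on feature arrays
# real: (N, d) and fake: (M, d); not calc_PRDC.py's code.
import numpy as np

def density_coverage(real, fake, k=5):
    d_rr = np.linalg.norm(real[:, None] - real[None], axis=-1)  # (N, N)
    radii = np.sort(d_rr, axis=1)[:, k]      # k-th NN distance (index 0 is self)
    d_rf = np.linalg.norm(real[:, None] - fake[None], axis=-1)  # (N, M)
    inside = d_rf <= radii[:, None]          # fake j inside real i's ball
    density = inside.sum() / (k * fake.shape[0])
    coverage = inside.any(axis=1).mean()     # real balls containing >= 1 fake
    return density, coverage
```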

Calculate ELBO Estimates

The following command fine-tunes a Vector-Quantized autoencoder to compute reconstruction likelihood, and then evaluates the ELBO of the overall model.

python experiments/calc_approximate_ELBO.py --sampler absorbing --dataset churches --log_dir nll_churches --ae_load_dir vqae_churches --ae_load_step 2200000 --load_dir absorbing_churches --load_step 2000000 --steps_per_eval 5000 --train_steps 10000

NOTE: the --steps_per_eval flag is required for this script, as a validation dataset is used.

Find Nearest Neighbours

Produces a random batch of samples and finds the nearest neighbour images in the training set based on LPIPS distance.

python experiments/calc_nearest_neighbours.py --sampler absorbing --dataset churches --log_dir nearest_neighbours_churches --ae_load_dir vqae_churches --ae_load_step 2200000 --load_dir absorbing_churches --load_step 2000000
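The LPIPS metric itself is available as the standalone lpips package; purely as an illustration of the distance being used (the choice of VGG backbone here is an assumption):

```python
# Standalone LPIPS distance between two image batches (illustrative;
# the VGG backbone here is an assumption).
import lpips
import torch

loss_fn = lpips.LPIPS(net='vgg')
img0 = torch.rand(1, 3, 256, 256) * 2 - 1  # LPIPS expects inputs in [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
print(loss_fn(img0, img1).item())
```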

Generate Higher Resolution Samples

By applying the absorbing diffusion model at multiple spatial locations at once and aggregating the denoising probabilities, samples larger than those observed during training can be generated (see Figures 4 and 11); a sketch of this windowed aggregation follows the commands below.

python experiments/generate_big_samples.py --sampler absorbing --dataset churches --log_dir big_samples_churches --ae_load_dir vqae_churches --ae_load_step 2200000 --load_dir absorbing_churches --load_step 2000000 --shape 32 16

Use the --shape flag to specify the dimensions of the latents to generate.
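Here is a minimal sketch of the windowed aggregation idea referenced above: slide training-sized windows over a larger latent grid and average the per-position denoising probabilities. The `model` interface and its `vocab_size` attribute are assumptions; the actual logic lives in experiments/generate_big_samples.py.

```python
# Sketch of windowed aggregation for larger-than-training latent grids.
# model and its vocab_size attribute are assumptions; the actual logic
# lives in experiments/generate_big_samples.py.
import torch

@torch.no_grad()
def aggregate_probs(model, x_t, train_hw=(16, 16), stride=8):
    H, W = x_t.shape[-2:]
    h, w = train_hw  # latent shape seen during training
    probs = torch.zeros(*x_t.shape, model.vocab_size, device=x_t.device)
    counts = torch.zeros(*x_t.shape, 1, device=x_t.device)
    for i in range(0, H - h + 1, stride):        # assumes stride divides H - h
        for j in range(0, W - w + 1, stride):    # and W - w, so edges are covered
            window = x_t[..., i:i + h, j:j + w]
            logits = model(window.flatten(-2))   # (..., h * w, vocab_size)
            probs[..., i:i + h, j:j + w, :] += logits.view(*window.shape, -1).softmax(-1)
            counts[..., i:i + h, j:j + w, :] += 1
    return probs / counts                        # averaged denoising probabilities
```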

Related Work

The following papers were particularly helpful when developing this work:

BibTeX

@article{bond2021unleashing,
  title     = {Unleashing Transformers: Parallel Token Prediction with Discrete Absorbing Diffusion for Fast High-Resolution Image Generation from Vector-Quantized Codes},
  author    = {Sam Bond-Taylor and Peter Hessey and Hiroshi Sasaki and Toby P. Breckon and Chris G. Willcocks},
  journal   = {arXiv preprint arXiv:2111.12701},
  year      = {2021}
}