Taming Transformers for High-Resolution Image Synthesis

CVPR 2021 (Oral)

Patrick Esser*, Robin Rombach*, Björn Ommer
* equal contribution

tl;dr We combine the efficiency of convolutional approaches with the expressivity of transformers by introducing a convolutional VQGAN, which learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer.
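
To make the codebook idea concrete, here is a minimal sketch of nearest-neighbour vector quantization in plain Python: each feature vector is replaced by the index of its closest codebook entry, and the resulting sequence of indices is the kind of input an autoregressive transformer would model. The function name quantize, the 2-D toy vectors and the three-entry codebook are illustrative assumptions, not the repository's implementation.

```python
def quantize(features, codebook):
    """Return, for each feature vector, the index of its nearest codebook entry."""
    indices = []
    for f in features:
        # Squared Euclidean distance from this feature to every code vector.
        dists = [sum((a - b) ** 2 for a, b in zip(f, code)) for code in codebook]
        indices.append(dists.index(min(dists)))
    return indices

# Toy data (hypothetical): three code vectors and four encoder outputs in 2-D.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 2.0)]
features = [(0.1, -0.2), (0.9, 1.2), (-0.8, 1.7), (0.4, 0.4)]

indices = quantize(features, codebook)
quantized = [codebook[i] for i in indices]  # vectors a decoder would receive
print(indices)
```

In a real model the features would come from a convolutional encoder and the codebook would be learned jointly with it; the lookup step itself is the same.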

arXiv | BibTeX | Project Page

News

  • Thanks to rom1504 it is now easy to train a VQGAN on your own datasets.
  • Included a bugfix for the quantizer. For backward compatibility it is disabled by default (which corresponds to always training with beta=1.0). Use legacy=False in the quantizer config to enable it. Thanks richcmwang and wcshin-git!
  • Our paper received an update: See https://arxiv.org/abs/2012.09841v3 and the corresponding changelog.
  • Added a pretrained, 1.4B transformer model trained for class-conditional ImageNet synthesis, which obtains state-of-the-art FID scores among autoregressive approaches and outperforms BigGAN.
  • Added pretrained, unconditional models on FFHQ and CelebA-HQ.
  • Added accelerated sampling via caching of keys/values in the self-attention operation, used in scripts/sample_fast.py.
  • Added a checkpoint of a VQGAN trained with f8 compression and Gumbel-Quantization. See also our updated reconstruction notebook.
  • We added a colab notebook which compares two VQGANs and OpenAI's DALL-E. See also this section.
  • We now include an overview of pretrained models in Tab.1. We added models for COCO and ADE20k.
  • The streamlit demo now supports image completions.
  • We now include a couple of examples from the D-RIN dataset so you can run the D-RIN demo without preparing the dataset first.
  • You can now jump right into sampling with our Colab quickstart notebook.

Requirements

A suitable conda environment named taming can be created and activated with:

conda env create -f environment.yaml
conda activate taming

Overview of pretrained models

The following table provides an overview of all models that are currently available. FID scores were evaluated using torch-fidelity. For reference, we also include a link to the recently released autoencoder of the DALL-E model. See the corresponding colab notebook for a comparison and discussion of reconstruction capabilities.

Dataset | FID vs train | FID vs val | Link | Samples (256x256) | Comments
FFHQ (f=16) | 9.6 | -- | ffhq_transformer | ffhq_samples |
CelebA-HQ (f=16) | 10.2 | -- | celebahq_transformer | celebahq_samples |
ADE20K (f=16) | -- | 35.5 | ade20k_transformer | ade20k_samples.zip [2k] | evaluated on val split (2k images)
COCO-Stuff (f=16) | -- | 20.4 | coco_transformer | coco_samples.zip [5k] | evaluated on val split (5k images)
ImageNet (cIN) (f=16) | 15.98/15.78/6.59/5.88/5.20 | -- | cin_transformer | cin_samples | different decoding hyperparameters
FacesHQ (f=16) | -- | -- | faceshq_transformer | |
S-FLCKR (f=16) | -- | -- | sflckr | |
D-RIN (f=16) | -- | -- | drin_transformer | |
VQGAN ImageNet (f=16), 1024 | 10.54 | 7.94 | vqgan_imagenet_f16_1024 | reconstructions | Reconstruction-FIDs.
VQGAN ImageNet (f=16), 16384 | 7.41 | 4.98 | vqgan_imagenet_f16_16384 | reconstructions | Reconstruction-FIDs.
VQGAN OpenImages (f=8), 8192, GumbelQuantization | 3.24 | 1.49 | vqgan_gumbel_f8 | --- | Reconstruction-FIDs.
DALL-E dVAE (f=8), 8192, GumbelQuantization | 33.88 | 32.01 | https://github.com/openai/DALL-E | reconstructions | Reconstruction-FIDs.

Running pretrained models

The commands below will start a streamlit demo which supports sampling at different resolutions and image completions. To run a non-interactive version of the sampling process, replace streamlit run scripts/sample_conditional.py -- by python scripts/make_samples.py --outdir <path_to_write_samples_to> and keep the remaining command line arguments.

To sample from unconditional or class-conditional models, run python scripts/sample_fast.py -r <path/to/config_and_checkpoint>. We describe below how to use this script to sample from the ImageNet, FFHQ, and CelebA-HQ models, respectively.

S-FLCKR

You can also run this model in a Colab notebook, which includes all necessary steps to start sampling.

Download the 2020-11-09T13-31-51_sflckr folder and place it into logs. Then, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-09T13-31-51_sflckr/

ImageNet

Download the 2021-04-03T19-39-50_cin_transformer folder and place it into logs. Sampling from the class-conditional ImageNet model does not require any data preparation. To produce 50 samples for each of the 1000 classes of ImageNet, with k=600 for top-k sampling, p=0.92 for nucleus sampling and temperature t=1.0, run

python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25   

To restrict the model to certain classes, provide them via the --classes argument, separated by commas. For example, to sample 50 ostriches, border collies and whiskey jugs, run

python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25 --classes 9,232,901   

We recommend experimenting with the autoregressive decoding parameters (top-k, top-p and temperature) for best results.
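
As a rough guide to what these three parameters do, the following sketch applies temperature scaling, top-k filtering and nucleus (top-p) filtering to a list of logits and then draws one token. It is a plain-Python illustration; the function names filter_probs and sample and the toy logits are assumptions, and scripts/sample_fast.py may combine the filters differently.

```python
import math
import random

def filter_probs(logits, top_k=None, top_p=1.0, temperature=1.0):
    """Return next-token probabilities after temperature, top-k and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        order = order[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative mass reaches p.
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    # Renormalize over the surviving tokens; all other tokens get probability 0.
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out

def sample(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    probs = filter_probs(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary
probs = filter_probs(logits, top_k=3, top_p=0.92, temperature=1.0)
token = sample(logits, random.Random(0), top_k=3, top_p=0.92)
print(probs, token)
```

Smaller k or p restricts sampling to the most likely codebook entries, which tends to trade diversity for fidelity; a larger temperature does the opposite.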

FFHQ/CelebA-HQ

Download the 2021-04-23T18-19-01_ffhq_transformer and 2021-04-23T18-11-19_celebahq_transformer folders and place them into logs. Again, sampling from these unconditional models does not require any data preparation. To produce 50000 samples, with k=250 for top-k sampling, p=1.0 for nucleus sampling and temperature t=1.0, run

python scripts/sample_fast.py -r logs/2021-04-23T18-19-01_ffhq_transformer/   

for FFHQ and

python scripts/sample_fast.py -r logs/2021-04-23T18-11-19_celebahq_transformer/   

to sample from the CelebA-HQ model. For both models it can be advantageous to vary the top-k/top-p parameters for sampling.

FacesHQ

Download 2020-11-13T21-41-45_faceshq_transformer and place it into logs. Follow the data preparation steps for CelebA-HQ and FFHQ. Run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-13T21-41-45_faceshq_transformer/

D-RIN

Download 2020-11-20T12-54-32_drin_transformer and place it into logs. To run the demo on a couple of example depth maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.imagenet.DRINExamples}}}"

To run the demo on the complete validation set, first follow the data preparation steps for ImageNet and then run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/

COCO

Download 2021-01-20T16-04-20_coco_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2021-01-20T16-04-20_coco_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.coco.Examples}}}"

ADE20k

Download 2020-11-20T21-45-44_ade20k_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T21-45-44_ade20k_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.ade20k.Examples}}}"

Training on custom data

Training on your own dataset can yield better tokens, and hence better images, for your domain. Follow these steps:

  1. install the repo with conda env create -f environment.yaml, conda activate taming and pip install -e .
  2. put your .jpg files in a folder your_folder
  3. create two text files, xx_train.txt and xx_test.txt, that list the files in your training and test sets respectively (for example, find $(pwd)/your_folder -name "*.jpg" > train.txt)
  4. adapt configs/custom_vqgan.yaml to point to these 2 files
  5. run python main.py --base configs/custom_vqgan.yaml -t True --gpus 0,1 to train on two GPUs. Use --gpus 0, (with a trailing comma) to train on a single GPU.
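
Steps 2 and 3 above can also be scripted. The sketch below collects all .jpg files under a folder, shuffles them with a fixed seed and writes the two list files with one absolute path per line. The helper name write_file_lists, the 10% test fraction and the seed are assumptions chosen for illustration; the demo runs on a throwaway temporary folder.

```python
import random
import tempfile
from pathlib import Path

def write_file_lists(image_dir, out_dir, test_fraction=0.1, seed=0):
    """Write xx_train.txt and xx_test.txt with one absolute .jpg path per line."""
    paths = sorted(str(p.resolve()) for p in Path(image_dir).rglob("*.jpg"))
    random.Random(seed).shuffle(paths)  # fixed seed keeps the split reproducible
    n_test = max(1, int(len(paths) * test_fraction))
    test, train = paths[:n_test], paths[n_test:]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "xx_train.txt").write_text("\n".join(train) + "\n")
    (out_dir / "xx_test.txt").write_text("\n".join(test) + "\n")
    return train, test

# Demo on a temporary folder holding ten empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "your_folder"
    folder.mkdir()
    for i in range(10):
        (folder / f"img_{i}.jpg").touch()
    train, test = write_file_lists(folder, tmp)
    print(len(train), len(test))
```

The two output files are the ones configs/custom_vqgan.yaml should point to in step 4.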

Data Preparation

ImageNet

The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:

${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...

If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar/ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ / ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, which will then be extracted into the above structure without downloading it again. Note that this will only happen if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exists. Remove them if you want to force running the dataset preparation again.
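
The rule described above (preparation only runs when neither the data folder nor the .ready marker exists) can be checked before starting a long job. This is a small sketch, not the repository's code: the helper names imagenet_dirs and preparation_would_run are hypothetical, and the cache root is passed in explicitly, falling back to ~/.cache as stated above.

```python
import tempfile
from pathlib import Path

def imagenet_dirs(split, cache_root=None):
    """Return (root, data_dir, ready_marker) for one ImageNet split."""
    if split not in ("train", "validation"):
        raise ValueError("split must be 'train' or 'validation'")
    cache_root = Path(cache_root) if cache_root else Path.home() / ".cache"
    root = cache_root / "autoencoders" / "data" / f"ILSVRC2012_{split}"
    return root, root / "data", root / ".ready"

def preparation_would_run(split, cache_root=None):
    """True if neither the data folder nor the .ready marker exists."""
    _, data_dir, ready = imagenet_dirs(split, cache_root)
    return not data_dir.exists() and not ready.exists()

# Demo with a temporary cache root instead of the real cache directory.
with tempfile.TemporaryDirectory() as tmp:
    before = preparation_would_run("train", tmp)
    _, data_dir, _ = imagenet_dirs("train", tmp)
    (data_dir / "n01440764").mkdir(parents=True)  # simulate already-extracted data
    after = preparation_would_run("train", tmp)
    print(before, after)
```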

You will then need to prepare the depth data using MiDaS. Create a symlink data/imagenet_depth pointing to a folder with two subfolders train and val, each mirroring the structure of the corresponding ImageNet folder described above and containing a png file for each of ImageNet's JPEG files. The png encodes float32 depth values obtained from MiDaS as RGBA images. We provide the script scripts/extract_depth.py to generate this data. Please note that this script uses MiDaS via PyTorch Hub. When we prepared the data, the hub provided the MiDaS v2.0 version, but now it provides a v2.1 version. We haven't tested our models with depth maps obtained via v2.1. If you want to make sure that things work as expected, adjust the script so that it explicitly uses v2.0.
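
To illustrate what "float32 depth values as RGBA" means, the sketch below packs a single float32 value into four 8-bit channel values and unpacks it again: a float32 occupies exactly four bytes, one per channel. The helper names and the little-endian byte order are assumptions made for this example; check scripts/extract_depth.py for the layout actually used.

```python
import struct

def depth_to_rgba(depth):
    """Pack one float32 depth value into four 8-bit channel values (R, G, B, A)."""
    # "<f" means little-endian 32-bit float; the byte order is an assumption here.
    return tuple(struct.pack("<f", depth))

def rgba_to_depth(rgba):
    """Reinterpret four 8-bit channel values as one float32 depth value."""
    return struct.unpack("<f", bytes(rgba))[0]

pixel = depth_to_rgba(1.5)
restored = rgba_to_depth(pixel)
print(pixel, restored)
```

Because the bytes are reinterpreted rather than rescaled, the round trip is lossless for any value representable as float32.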

CelebA-HQ

Create a symlink data/celebahq pointing to a folder containing the .npy files of CelebA-HQ (instructions to obtain them can be found in the PGGAN repository).

FFHQ

Create a symlink data/ffhq pointing to the images1024x1024 folder obtained from the FFHQ repository.

S-FLCKR

Unfortunately, we are not allowed to distribute the images we collected for the S-FLCKR dataset and can therefore only describe how it was produced. There are many resources on collecting images from the web to get started. We collected sufficiently large images from flickr (see data/flickr_tags.txt for a full list of tags used to find images) and various subreddits (see data/subreddits.txt for all subreddits that were used). Overall, we collected 107625 images, and split them randomly into 96861 training images and 10764 validation images. We then obtained segmentation masks for each image using DeepLab v2 trained on COCO-Stuff. We used a PyTorch reimplementation and include an example script for this process in scripts/extract_segmentation.py.

COCO

Create a symlink data/coco containing the images from the 2017 split in train2017 and val2017, and their annotations in annotations. Files can be obtained from the COCO webpage. In addition, we use the Stuff+thing PNG-style annotations on COCO 2017 trainval annotations from COCO-Stuff, which should be placed under data/cocostuffthings.

ADE20k

Create a symlink data/ade20k_root containing the contents of ADEChallengeData2016.zip from the MIT Scene Parsing Benchmark.

Training models

FacesHQ

Train a VQGAN with

python main.py --base configs/faceshq_vqgan.yaml -t True --gpus 0,

Then, adjust the checkpoint path of the config key model.params.first_stage_config.params.ckpt_path in configs/faceshq_transformer.yaml (or download 2020-11-09T13-33-36_faceshq_vqgan and place it into logs, which corresponds to the preconfigured checkpoint path), then run

python main.py --base configs/faceshq_transformer.yaml -t True --gpus 0,
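
The dotted config key mentioned above addresses a nested entry of the YAML config: each dot descends one level. The sketch below shows that mapping on a plain nested dict standing in for the loaded config. The helper set_by_dotted_key and both checkpoint paths are placeholders for illustration, not files or functions from the repository.

```python
import copy

def set_by_dotted_key(config, dotted_key, value):
    """Return a copy of a nested dict with the entry at 'a.b.c' set to value."""
    updated = copy.deepcopy(config)
    node = updated
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Descend one level per key, creating intermediate dicts if missing.
        node = node.setdefault(key, {})
    node[leaf] = value
    return updated

# Stand-in for the relevant part of the transformer config (placeholder path).
config = {
    "model": {
        "params": {
            "first_stage_config": {"params": {"ckpt_path": "path/to/old.ckpt"}}
        }
    }
}

new_config = set_by_dotted_key(
    config,
    "model.params.first_stage_config.params.ckpt_path",
    "path/to/your_vqgan.ckpt",  # placeholder for your own checkpoint
)
print(new_config["model"]["params"]["first_stage_config"]["params"]["ckpt_path"])
```

In practice, editing the corresponding line in the YAML file by hand achieves the same result.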

D-RIN

Train a VQGAN on ImageNet with

python main.py --base configs/imagenet_vqgan.yaml -t True --gpus 0,

or download a pretrained one from 2020-09-23T17-56-33_imagenet_vqgan and place it under logs. If you trained your own, adjust the path in the config key model.params.first_stage_config.params.ckpt_path of configs/drin_transformer.yaml.

Train a VQGAN on Depth Maps of ImageNet with

python main.py --base configs/imagenetdepth_vqgan.yaml -t True --gpus 0,

or download a pretrained one from 2020-11-03T15-34-24_imagenetdepth_vqgan and place it under logs. If you trained your own, adjust the path in the config key model.params.cond_stage_config.params.ckpt_path of configs/drin_transformer.yaml.

To train the transformer, run

python main.py --base configs/drin_transformer.yaml -t True --gpus 0,

More Resources

Comparing Different First Stage Models

The reconstruction and compression capabilities of different first stage models can be analyzed in this colab notebook. In particular, the notebook compares two VQGANs with a downsampling factor of f=16 each and codebook sizes of 1024 and 16384 entries, a VQGAN with f=8 and 8192 codebook entries, and the discrete autoencoder of OpenAI's DALL-E (which has f=8 and 8192 codebook entries).
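
To see what these settings imply for compression, the following sketch computes the size of the token grid and the number of bits needed to store the token indices of one image. The 256x256 input size is an assumption (it matches the sample resolution listed in the model table), and latent_stats is a hypothetical helper.

```python
import math

def latent_stats(image_size, f, codebook_entries):
    """Return (grid side, number of tokens, total bits) for one square image."""
    side = image_size // f              # spatial side length of the token grid
    tokens = side * side                # one codebook index per grid cell
    bits_per_token = math.log2(codebook_entries)
    return side, tokens, tokens * bits_per_token

models = [
    ("VQGAN f=16, 1024 entries", 16, 1024),
    ("VQGAN f=16, 16384 entries", 16, 16384),
    ("f=8, 8192 entries", 8, 8192),
]
for name, f, entries in models:
    side, tokens, bits = latent_stats(256, f, entries)
    print(f"{name}: {side}x{side} grid, {tokens} tokens, {bits:.0f} bits")
```

Under these assumptions, halving f from 16 to 8 quadruples the sequence length (256 versus 1024 tokens), which matters because the transformer's cost grows with the number of tokens it has to model.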

Other

Text-to-Image Optimization via CLIP

VQGAN has been successfully used as an image generator guided by the CLIP model, both for pure image generation from scratch and image-to-image translation. We recommend the following notebooks/videos/resources:

[Figure: txt2img example. Text prompt: 'A bird drawn by a child']

Shout-outs

Thanks to everyone who makes their code and models available.

BibTeX

@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Owner
CompVis Heidelberg
Computer Vision research group at the Ruprecht-Karls-University Heidelberg