Taming Transformers for High-Resolution Image Synthesis

Overview

Taming Transformers for High-Resolution Image Synthesis

CVPR 2021 (Oral)

teaser

Taming Transformers for High-Resolution Image Synthesis
Patrick Esser*, Robin Rombach*, Björn Ommer
* equal contribution

tl;dr We combine the efficiancy of convolutional approaches with the expressivity of transformers by introducing a convolutional VQGAN, which learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer.

teaser arXiv | BibTeX | Project Page

News

  • Thanks to rom1504 it is now easy to train a VQGAN on your own datasets.
  • Included a bugfix for the quantizer. For backward compatibility it is disabled by default (which corresponds to always training with beta=1.0). Use legacy=False in the quantizer config to enable it. Thanks richcmwang and wcshin-git!
  • Our paper received an update: See https://arxiv.org/abs/2012.09841v3 and the corresponding changelog.
  • Added a pretrained, 1.4B transformer model trained for class-conditional ImageNet synthesis, which obtains state-of-the-art FID scores among autoregressive approaches and outperforms BigGAN.
  • Added pretrained, unconditional models on FFHQ and CelebA-HQ.
  • Added accelerated sampling via caching of keys/values in the self-attention operation, used in scripts/sample_fast.py.
  • Added a checkpoint of a VQGAN trained with f8 compression and Gumbel-Quantization. See also our updated reconstruction notebook.
  • We added a colab notebook which compares two VQGANs and OpenAI's DALL-E. See also this section.
  • We now include an overview of pretrained models in Tab.1. We added models for COCO and ADE20k.
  • The streamlit demo now supports image completions.
  • We now include a couple of examples from the D-RIN dataset so you can run the D-RIN demo without preparing the dataset first.
  • You can now jump right into sampling with our Colab quickstart notebook.

Requirements

A suitable conda environment named taming can be created and activated with:

conda env create -f environment.yaml
conda activate taming

Overview of pretrained models

The following table provides an overview of all models that are currently available. FID scores were evaluated using torch-fidelity. For reference, we also include a link to the recently released autoencoder of the DALL-E model. See the corresponding colab notebook for a comparison and discussion of reconstruction capabilities.

Dataset FID vs train FID vs val Link Samples (256x256) Comments
FFHQ (f=16) 9.6 -- ffhq_transformer ffhq_samples
CelebA-HQ (f=16) 10.2 -- celebahq_transformer celebahq_samples
ADE20K (f=16) -- 35.5 ade20k_transformer ade20k_samples.zip [2k] evaluated on val split (2k images)
COCO-Stuff (f=16) -- 20.4 coco_transformer coco_samples.zip [5k] evaluated on val split (5k images)
ImageNet (cIN) (f=16) 15.98/15.78/6.59/5.88/5.20 -- cin_transformer cin_samples different decoding hyperparameters
FacesHQ (f=16) -- -- faceshq_transformer
S-FLCKR (f=16) -- -- sflckr
D-RIN (f=16) -- -- drin_transformer
VQGAN ImageNet (f=16), 1024 10.54 7.94 vqgan_imagenet_f16_1024 reconstructions Reconstruction-FIDs.
VQGAN ImageNet (f=16), 16384 7.41 4.98 vqgan_imagenet_f16_16384 reconstructions Reconstruction-FIDs.
VQGAN OpenImages (f=8), 8192, GumbelQuantization 3.24 1.49 vqgan_gumbel_f8 --- Reconstruction-FIDs.
DALL-E dVAE (f=8), 8192, GumbelQuantization 33.88 32.01 https://github.com/openai/DALL-E reconstructions Reconstruction-FIDs.

Running pretrained models

The commands below will start a streamlit demo which supports sampling at different resolutions and image completions. To run a non-interactive version of the sampling process, replace streamlit run scripts/sample_conditional.py -- by python scripts/make_samples.py --outdir <path_to_write_samples_to> and keep the remaining command line arguments.

To sample from unconditional or class-conditional models, run python scripts/sample_fast.py -r <path/to/config_and_checkpoint>. We describe below how to use this script to sample from the ImageNet, FFHQ, and CelebA-HQ models, respectively.

S-FLCKR

teaser

You can also run this model in a Colab notebook, which includes all necessary steps to start sampling.

Download the 2020-11-09T13-31-51_sflckr folder and place it into logs. Then, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-09T13-31-51_sflckr/

ImageNet

teaser

Download the 2021-04-03T19-39-50_cin_transformer folder and place it into logs. Sampling from the class-conditional ImageNet model does not require any data preparation. To produce 50 samples for each of the 1000 classes of ImageNet, with k=600 for top-k sampling, p=0.92 for nucleus sampling and temperature t=1.0, run

python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25   

To restrict the model to certain classes, provide them via the --classes argument, separated by commas. For example, to sample 50 ostriches, border collies and whiskey jugs, run

python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25 --classes 9,232,901   

We recommended to experiment with the autoregressive decoding parameters (top-k, top-p and temperature) for best results.

FFHQ/CelebA-HQ

Download the 2021-04-23T18-19-01_ffhq_transformer and 2021-04-23T18-11-19_celebahq_transformer folders and place them into logs. Again, sampling from these unconditional models does not require any data preparation. To produce 50000 samples, with k=250 for top-k sampling, p=1.0 for nucleus sampling and temperature t=1.0, run

python scripts/sample_fast.py -r logs/2021-04-23T18-19-01_ffhq_transformer/   

for FFHQ and

python scripts/sample_fast.py -r logs/2021-04-23T18-11-19_celebahq_transformer/   

to sample from the CelebA-HQ model. For both models it can be advantageous to vary the top-k/top-p parameters for sampling.

FacesHQ

teaser

Download 2020-11-13T21-41-45_faceshq_transformer and place it into logs. Follow the data preparation steps for CelebA-HQ and FFHQ. Run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-13T21-41-45_faceshq_transformer/

D-RIN

teaser

Download 2020-11-20T12-54-32_drin_transformer and place it into logs. To run the demo on a couple of example depth maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.imagenet.DRINExamples}}}"

To run the demo on the complete validation set, first follow the data preparation steps for ImageNet and then run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/

COCO

Download 2021-01-20T16-04-20_coco_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2021-01-20T16-04-20_coco_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.coco.Examples}}}"

ADE20k

Download 2020-11-20T21-45-44_ade20k_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T21-45-44_ade20k_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.ade20k.Examples}}}"

Training on custom data

Training on your own dataset can be beneficial to get better tokens and hence better images for your domain. Those are the steps to follow to make this work:

  1. install the repo with conda env create -f environment.yaml, conda activate taming and pip install -e .
  2. put your .jpg files in a folder your_folder
  3. create 2 text files a xx_train.txt and xx_test.txt that point to the files in your training and test set respectively (for example find $(pwd)/your_folder -name "*.jpg" > train.txt)
  4. adapt configs/custom_vqgan.yaml to point to these 2 files
  5. run python main.py --base configs/custom_vqgan.yaml -t True --gpus 0,1 to train on two GPUs. Use --gpus 0, (with a trailing comma) to train on a single GPU.

Data Preparation

ImageNet

The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:

${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...

If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar/ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ / ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, which will then be extracted into above structure without downloading it again. Note that this will only happen if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exist. Remove them if you want to force running the dataset preparation again.

You will then need to prepare the depth data using MiDaS. Create a symlink data/imagenet_depth pointing to a folder with two subfolders train and val, each mirroring the structure of the corresponding ImageNet folder described above and containing a png file for each of ImageNet's JPEG files. The png encodes float32 depth values obtained from MiDaS as RGBA images. We provide the script scripts/extract_depth.py to generate this data. Please note that this script uses MiDaS via PyTorch Hub. When we prepared the data, the hub provided the MiDaS v2.0 version, but now it provides a v2.1 version. We haven't tested our models with depth maps obtained via v2.1 and if you want to make sure that things work as expected, you must adjust the script to make sure it explicitly uses v2.0!

CelebA-HQ

Create a symlink data/celebahq pointing to a folder containing the .npy files of CelebA-HQ (instructions to obtain them can be found in the PGGAN repository).

FFHQ

Create a symlink data/ffhq pointing to the images1024x1024 folder obtained from the FFHQ repository.

S-FLCKR

Unfortunately, we are not allowed to distribute the images we collected for the S-FLCKR dataset and can therefore only give a description how it was produced. There are many resources on collecting images from the web to get started. We collected sufficiently large images from flickr (see data/flickr_tags.txt for a full list of tags used to find images) and various subreddits (see data/subreddits.txt for all subreddits that were used). Overall, we collected 107625 images, and split them randomly into 96861 training images and 10764 validation images. We then obtained segmentation masks for each image using DeepLab v2 trained on COCO-Stuff. We used a PyTorch reimplementation and include an example script for this process in scripts/extract_segmentation.py.

COCO

Create a symlink data/coco containing the images from the 2017 split in train2017 and val2017, and their annotations in annotations. Files can be obtained from the COCO webpage. In addition, we use the Stuff+thing PNG-style annotations on COCO 2017 trainval annotations from COCO-Stuff, which should be placed under data/cocostuffthings.

ADE20k

Create a symlink data/ade20k_root containing the contents of ADEChallengeData2016.zip from the MIT Scene Parsing Benchmark.

Training models

FacesHQ

Train a VQGAN with

python main.py --base configs/faceshq_vqgan.yaml -t True --gpus 0,

Then, adjust the checkpoint path of the config key model.params.first_stage_config.params.ckpt_path in configs/faceshq_transformer.yaml (or download 2020-11-09T13-33-36_faceshq_vqgan and place into logs, which corresponds to the preconfigured checkpoint path), then run

python main.py --base configs/faceshq_transformer.yaml -t True --gpus 0,

D-RIN

Train a VQGAN on ImageNet with

python main.py --base configs/imagenet_vqgan.yaml -t True --gpus 0,

or download a pretrained one from 2020-09-23T17-56-33_imagenet_vqgan and place under logs. If you trained your own, adjust the path in the config key model.params.first_stage_config.params.ckpt_path of configs/drin_transformer.yaml.

Train a VQGAN on Depth Maps of ImageNet with

python main.py --base configs/imagenetdepth_vqgan.yaml -t True --gpus 0,

or download a pretrained one from 2020-11-03T15-34-24_imagenetdepth_vqgan and place under logs. If you trained your own, adjust the path in the config key model.params.cond_stage_config.params.ckpt_path of configs/drin_transformer.yaml.

To train the transformer, run

python main.py --base configs/drin_transformer.yaml -t True --gpus 0,

More Resources

Comparing Different First Stage Models

The reconstruction and compression capabilities of different fist stage models can be analyzed in this colab notebook. In particular, the notebook compares two VQGANs with a downsampling factor of f=16 for each and codebook dimensionality of 1024 and 16384, a VQGAN with f=8 and 8192 codebook entries and the discrete autoencoder of OpenAI's DALL-E (which has f=8 and 8192 codebook entries). firststages1 firststages2

Other

Text-to-Image Optimization via CLIP

VQGAN has been successfully used as an image generator guided by the CLIP model, both for pure image generation from scratch and image-to-image translation. We recommend the following notebooks/videos/resources:

txt2img

Text prompt: 'A bird drawn by a child'

Shout-outs

Thanks to everyone who makes their code and models available. In particular,

BibTeX

@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
Owner
CompVis Heidelberg
Computer Vision research group at the Ruprecht-Karls-University Heidelberg
CompVis Heidelberg
Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference This repo is the implementation for SD

Microsoft 36 Nov 28, 2022
Cl datasets - PyTorch image dataloaders and utility functions to load datasets for supervised continual learning

Continual learning datasets Introduction This repository contains PyTorch image

berjaoui 5 Aug 28, 2022
Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners

DART Implementation for ICLR2022 paper Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners. Environment

ZJUNLP 83 Dec 27, 2022
PyTorch implementation of Progressive Growing of GANs for Improved Quality, Stability, and Variation.

PyTorch implementation of Progressive Growing of GANs for Improved Quality, Stability, and Variation. Warning: the master branch might collapse. To ob

559 Dec 14, 2022
Intrusion Test Tool with Python

P3ntsT00L Uma ferramenta escrita em Python, feita para Teste de intrusão. Requisitos ter o python 3.9.8 instalado em sua máquina. ter a git instalada

josh washington 2 Dec 27, 2021
🔮 Execution time predictions for deep neural network training iterations across different GPUs.

Habitat: A Runtime-Based Computational Performance Predictor for Deep Neural Network Training Habitat is a tool that predicts a deep neural network's

Geoffrey Yu 44 Dec 27, 2022
Imposter-detector-2022 - HackED 2022 Team 3IQ - 2022 Imposter Detector

HackED 2022 Team 3IQ - 2022 Imposter Detector By Aneeljyot Alagh, Curtis Kan, Jo

Joshua Ji 3 Aug 20, 2022
Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis Implementation

Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis Implementation This project attempted to implement the paper Putting NeRF on a

254 Dec 27, 2022
STMTrack: Template-free Visual Tracking with Space-time Memory Networks

STMTrack This is the official implementation of the paper: STMTrack: Template-free Visual Tracking with Space-time Memory Networks. Setup Prepare Anac

Zhihong Fu 62 Dec 21, 2022
Code for our ACL 2021 paper "One2Set: Generating Diverse Keyphrases as a Set"

One2Set This repository contains the code for our ACL 2021 paper “One2Set: Generating Diverse Keyphrases as a Set”. Our implementation is built on the

Jiacheng Ye 63 Jan 05, 2023
House-GAN++: Generative Adversarial Layout Refinement Network towards Intelligent Computational Agent for Professional Architects

House-GAN++ Code and instructions for our paper: House-GAN++: Generative Adversarial Layout Refinement Network towards Intelligent Computational Agent

122 Dec 28, 2022
This GitHub repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.'

About Repository This repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.' About Code

Arun Verma 1 Nov 09, 2021
The pytorch implementation of the paper "text-guided neural image inpainting" at MM'2020

TDANet: Text-Guided Neural Image Inpainting, MM'2020 (Oral) MM | ArXiv This repository implements the paper "Text-Guided Neural Image Inpainting" by L

LisaiZhang 75 Dec 22, 2022
Finding Donors for CharityML

Finding-Donors-for-CharityML - Investigated factors that affect the likelihood of charity donations being made based on real census data.

Moamen Abdelkawy 1 Dec 30, 2021
PrimitiveNet: Primitive Instance Segmentation with Local Primitive Embedding under Adversarial Metric (ICCV 2021)

PrimitiveNet Source code for the paper: Jingwei Huang, Yanfeng Zhang, Mingwei Sun. [PrimitiveNet: Primitive Instance Segmentation with Local Primitive

Jingwei Huang 47 Dec 06, 2022
YOLO-v5 기반 단안 카메라의 영상을 활용해 차간 거리를 일정하게 유지하며 주행하는 Adaptive Cruise Control 기능 구현

자율 주행차의 영상 기반 차간거리 유지 개발 Table of Contents 프로젝트 소개 주요 기능 시스템 구조 디렉토리 구조 결과 실행 방법 참조 팀원 프로젝트 소개 YOLO-v5 기반으로 단안 카메라의 영상을 활용해 차간 거리를 일정하게 유지하며 주행하는 Adap

14 Jun 29, 2022
Data reduction pipeline for KOALA on the AAT.

KOALA KOALA, the Kilofibre Optical AAT Lenslet Array, is a wide-field, high efficiency, integral field unit used by the AAOmega spectrograph on the 3.

4 Sep 26, 2022
Developed an optimized algorithm which finds the most optimal path between 2 points in a 3D Maze using various AI search techniques like BFS, DFS, UCS, Greedy BFS and A*

Developed an optimized algorithm which finds the most optimal path between 2 points in a 3D Maze using various AI search techniques like BFS, DFS, UCS, Greedy BFS and A*. The algorithm was extremely

1 Mar 28, 2022
Points2Surf: Learning Implicit Surfaces from Point Clouds (ECCV 2020 Spotlight)

Points2Surf: Learning Implicit Surfaces from Point Clouds (ECCV 2020 Spotlight)

Philipp Erler 329 Jan 06, 2023
Analyzing basic network responses to novel classes

novelty-detection Analyzing how AlexNet responds to novel classes with varying degrees of similarity to pretrained classes from ImageNet. If you find

Noam Eshed 34 Oct 02, 2022