Bi-directional Image and Text Generation

UMT-BITG (image & text generator)

Unifying Multimodal Transformer for Bi-directional Image and Text Generation,
Yupan Huang, Bei Liu, Yutong Lu, in ACM MM 2021 (Industrial Track).

UMT-DBITG (diverse image & text generator)

A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation,
Yupan Huang, Bei Liu, Jianlong Fu, Yutong Lu, in ACM MM 2021 (Video and Demo Track).

The poster and slides are available in the assets folder on OneDrive.

Data & Pre-trained Models

Download the preprocessed data and our pre-trained models from OneDrive. We suggest keeping our data structure, which is consistent with the paths in config.py. You may need to modify root_path in config.py. In addition, please follow the instructions below to prepare some other data:

Download the grid features provided by X-LXMERT into data/grid_features, or follow the feature extraction instructions to extract these features yourself.

wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_train_grid8.h5 -P data/grid_features
wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_valid_grid8.h5 -P data/grid_features
wget https://ai2-vision-x-lxmert.s3-us-west-2.amazonaws.com/butd_features/COCO/maskrcnn_test_grid8.h5 -P data/grid_features
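Before training or evaluation, it may help to confirm that the three downloads completed. The snippet below is a minimal sketch rather than part of the repository: the helper name missing_grid_features is hypothetical, and it assumes that data/grid_features sits directly under the directory you pass in (for example, the root_path you set in config.py). The file names are taken from the wget commands above.

```python
from pathlib import Path

# File names taken from the wget commands above.
GRID_FEATURE_FILES = (
    "maskrcnn_train_grid8.h5",
    "maskrcnn_valid_grid8.h5",
    "maskrcnn_test_grid8.h5",
)


def missing_grid_features(root_path="."):
    """Return the expected grid feature files that are absent or empty.

    Assumption: root_path is the directory that contains data/grid_features.
    Adjust it to match the root_path used in config.py.
    """
    feature_dir = Path(root_path) / "data" / "grid_features"
    missing = []
    for name in GRID_FEATURE_FILES:
        path = feature_dir / name
        # An interrupted download can leave a zero-byte file behind.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_grid_features()
    if missing:
        print("Missing grid feature files:", ", ".join(missing))
    else:
        print("All grid feature files found.")
```

The check only looks at presence and size; it does not open the HDF5 files or validate their contents.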

For text-to-image evaluation on the MSCOCO dataset, the real images are needed to compute the FID metric. For UMT-DBITG, we use the MSCOCO Karpathy split, which is included in the OneDrive folder (images/imgs_karpathy). For UMT-BITG, please download the MSCOCO validation set to images/coco_val2014.
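For reference, the real images are required because FID compares feature statistics of real and generated images. In the standard definition, written here with our own symbols, \mu_r and \Sigma_r are the mean and covariance of Inception features of the real images, and \mu_g and \Sigma_g are those of the generated images. The exact feature layer and preprocessing are determined by the evaluation code, not by this formula.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the generated images are statistically closer to the real ones. The value is 0 when both means and both covariances coincide.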

Citation

If you find our paper or code useful, please cite us:

@inproceedings{huang2021unifying,
  author    = {Yupan Huang and Bei Liu and Yutong Lu},
  title     = {Unifying Multimodal Transformer for Bi-directional Image and Text Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

@inproceedings{huang2021diverse,
  author    = {Yupan Huang and Bei Liu and Jianlong Fu and Yutong Lu},
  title     = {A Picture is Worth a Thousand Words: A Unified System for Diverse Captions and Rich Images Generation},
  booktitle = {Proceedings of the 29th ACM International Conference on Multimedia},
  year      = {2021}
}

Acknowledgement

Our code is based on LaBERT and X-LXMERT. Our evaluation code comes from pytorch-fid and inception_score. We sincerely thank the authors for their contributions!

Feel free to open an issue or email me if you need help using this code. Any feedback is welcome!

A collection of models for image<->text generation in ACM MM 2021.
