[CVPR 2020] Transform and Tell: Entity-Aware News Image Captioning

Overview

Transform and Tell: Entity-Aware News Image Captioning

Teaser

This repository contains the code to reproduce the results in our CVPR 2020 paper Transform and Tell: Entity-Aware News Image Captioning. We propose an end-to-end model which generates captions for images embedded in news articles. News images present two key challenges: they rely on real-world knowledge, especially about named entities; and they typically have linguistically rich captions that include uncommon words. We address the first challenge by associating words in the caption with faces and objects in the image, via a multi-modal, multi-head attention mechanism. We tackle the second challenge with a state-of-the-art transformer language model that uses byte-pair-encoding to generate captions as a sequence of word parts.

On the GoodNews dataset, our model outperforms the previous state of the art by a factor of four in CIDEr score (13 to 54). This performance gain comes from a unique combination of language models, word representation, image embeddings, face embeddings, object embeddings, and improvements in neural network design. We also introduce the NYTimes800k dataset which is 70% larger than GoodNews, has higher article quality, and includes the locations of images within articles as an additional contextual cue.

A live demo can be accessed here. In the demo, you can provide the URL to a New York Times article. The server will then scrape the web page, extract the article and image, and feed them into our model to generate a caption.

Please cite with the following BibTeX:

@InProceedings{Tran_2020_CVPR,
  author = {Tran, Alasdair and Mathews, Alexander and Xie, Lexing},
  title = {Transform and Tell: Entity-Aware News Image Captioning},
  booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2020}
}

Requirements

# Install Anaconda for Python and then create a dedicated environment.
# This will make it easier to reproduce our experimental numbers.
conda env create -f environment.yml
conda activate tell

# This step is only needed if you want to use the Jupyter notebook
python -m ipykernel install --user --name tell --display-name "tell"

# Our Pytorch uses CUDA 10.2. Ensure that CUDA_HOME points to the right
# CUDA version. Chagne this depending on where you installed CUDA.
export CUDA_HOME=/usr/local/cuda-10.2

# We also pin the apex version, which is used for mixed precision training
cd libs/apex
git submodule init && git submodule update .
pip install -v --no-cache-dir --global-option="--pyprof" --global-option="--cpp_ext" --global-option="--cuda_ext" ./

# Install our package
cd ../.. && python setup.py develop

# Spacy is used to calcuate some of the evaluation metrics
spacy download en_core_web_lg

# We use nltk to tokenize the generated text to compute linguistic metrics
python -m nltk.downloader punkt

Getting Data

The quickest way to get the data is to send an email to [email protected] (where first is alasdair and last is tran) to request the MongoDB dump that contains the dataset. Alternatively, see here for instructions on how to get the data from scratch, which will take a few days.

Once we have obtained the data from the authors, which consists of two directories expt and data, you can simply put them at the root of this repo.

# If the data is download from our Cloudstor server, then you might need
# to first unzip the archives using either tar or 7z.

# First, let's start an empty local MongoDB server on port 27017. Below
# we set the cache size to 10GB of RAM. Change it depending on your system.
mkdir data/mongodb
mongod --bind_ip_all --dbpath data/mongodb --wiredTigerCacheSizeGB 10

# Next let's restore the NYTimes200k and GoodNews datasets
mongorestore --db nytimes --host=localhost --port=27017 --drop --gzip --archive=data/mongobackups/nytimes-2020-04-21.gz
mongorestore --db goodnews --host=localhost --port=27017 --drop --gzip --archive=data/mongobackups/goodnews-2020-04-21.gz

# Next we unarchive the image directories. For each dataset, you can see two
# directories: `images` and `images_processed`. The files in `images` are
# the orignal files scraped from the New York Times. You only need this
# if you want to recompute the face and object embeddings. Otherwise, all
# the experiments will use the images in `images_processed`, which have
# already been cropped and resized.
tar -zxf data/nytimes/images_processed.tar.gz -C data/nytimes/
tar -zxf data/goodnews/images_processed.tar.gz -C data/goodnews/

# We are now ready to train the models!

You can see an example of how we read the NYTimes800k samples from the MongoDB database here. Here's a minimum working example in Python:

import os
from PIL import Image
from pymongo import MongoClient

# Assume that you've already restored the database and the mongo server is running
client = MongoClient(host='localhost', port=27017)

# All of our NYTimes800k articles sit in the database `nytimes`
db = client.nytimes

# Here we select a random article in the training set.
article = db.articles.find_one({'split': 'train'})

# You can visit the original web page where this article came from
url = article['web_url']

# Each article contains a lot of fields. If you want the title, then
title = article['headline']['main'].strip()

# If you want the article text, then you will need to manually merge all
# paragraphs together.
sections = article['parsed_section']
paragraphs = []
for section in sections:
    if section['type'] == 'paragraph':
        paragraphs.append(section['text'])
article_text = '\n'.join(paragraphs)

# To get the caption of the first image in the article
pos = article['image_positions'][0]
caption = sections[pos]['text'].strip()

# If you want to load the actual image into memory
image_dir = 'data/nytimes/images_processed' # change this accordingly
image_path = os.path.join(image_dir, f"{sections[pos]['hash']}.jpg")
image = Image.open(image_path)

# You can also load the pre-computed FaceNet embeddings of the faces in the image
facenet_embeds = sections[pos]['facenet_details']['embeddings']

# Object embeddings are stored in a separate collection due to a size limit in mongo
obj = db.objects.find_one({'_id': sections[pos]['hash']})
object_embeds = obj['object_features']

Training and Evaluation

# Train the full model on NYTimes800k. This takes around 4 days on a Titan V GPU.
# The training will populate the directory expt/nytimes/9_transformer_objects/serialization
CUDA_VISIBLE_DEVICES=0 tell train expt/nytimes/9_transformer_objects/config.yaml -f

# Once training is finished, the best model weights are stored in
#   expt/nytimes/9_transformer_objects/serialization/best.th
# We can use this to generate captions on the NYTimes800k test set. This
# takes about one hour.
CUDA_VISIBLE_DEVICES=0 tell evaluate expt/nytimes/9_transformer_objects/config.yaml -m expt/nytimes/9_transformer_objects/serialization/best.th

# Compute the evaluation metrics on the test set
python scripts/compute_metrics.py -c data/nytimes/name_counters.pkl expt/nytimes/9_transformer_objects/serialization/generations.jsonl

There are also other model variants which are ablation studies. Check our paper for more details, but here's a summary:

Experiment Word Embedding Language Model Image Attention Weighted RoBERTa Location-Aware Face Attention Object Attention
1_lstm_glove GloVe LSTM
2_transformer_glove GloVe Transformer
3_lstm_roberta RoBERTa LSTM
4_no_image RoBERTa Transformer
5_transformer_roberta RoBERTa Transformer
6_transformer_weighted_roberta RoBERTa Transformer
7_trasnformer_location_aware RoBERTa Transformer
8_transformer_faces RoBERTa Transformer
9_transformer_objects RoBERTa Transformer

Acknowledgement

Owner
Alasdair Tran
Just another collection of fermions and bosons.
Alasdair Tran
CTC segmentation python package

CTC segmentation CTC segmentation can be used to find utterances alignments within large audio files. This repository contains the ctc-segmentation py

Ludwig Kürzinger 217 Jan 04, 2023
[NeurIPS 2020] Official Implementation: "SMYRF: Efficient Attention using Asymmetric Clustering".

SMYRF: Efficient attention using asymmetric clustering Get started: Abstract We propose a novel type of balanced clustering algorithm to approximate a

Giannis Daras 46 Dec 22, 2022
Faster RCNN pytorch windows

Faster-RCNN-pytorch-windows Faster RCNN implementation with pytorch for windows Open cmd, compile this comands: cd lib python setup.py build develop T

Hwa-Rang Kim 1 Nov 11, 2022
The official PyTorch code implementation of "Human Trajectory Prediction via Counterfactual Analysis" in ICCV 2021.

Human Trajectory Prediction via Counterfactual Analysis (CausalHTP) The official PyTorch code implementation of "Human Trajectory Prediction via Count

46 Dec 03, 2022
A self-supervised 3D representation learning framework named viewpoint bottleneck.

Pointly-supervised 3D Scene Parsing with Viewpoint Bottleneck Paper Created by Liyi Luo, Beiwen Tian, Hao Zhao and Guyue Zhou from Institute for AI In

63 Aug 11, 2022
The implemention of Video Depth Estimation by Fusing Flow-to-Depth Proposals

Flow-to-depth (FDNet) video-depth-estimation This is the implementation of paper Video Depth Estimation by Fusing Flow-to-Depth Proposals Jiaxin Xie,

32 Jun 14, 2022
Unofficial PyTorch implementation of Attention Free Transformer (AFT) layers by Apple Inc.

aft-pytorch Unofficial PyTorch implementation of Attention Free Transformer's layers by Zhai, et al. [abs, pdf] from Apple Inc. Installation You can i

Rishabh Anand 184 Dec 12, 2022
Code for "PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clouds", CVPR 2021

PV-RAFT This repository contains the PyTorch implementation for paper "PV-RAFT: Point-Voxel Correlation Fields for Scene Flow Estimation of Point Clou

Yi Wei 43 Dec 05, 2022
BLEURT is a metric for Natural Language Generation based on transfer learning.

BLEURT: a Transfer Learning-Based Metric for Natural Language Generation BLEURT is an evaluation metric for Natural Language Generation. It takes a pa

Google Research 492 Jan 05, 2023
Official implementation for: Blended Diffusion for Text-driven Editing of Natural Images.

Blended Diffusion for Text-driven Editing of Natural Images Blended Diffusion for Text-driven Editing of Natural Images Omri Avrahami, Dani Lischinski

328 Dec 30, 2022
Trajectory Extraction of road users via Traffic Camera

Traffic Monitoring Citation The associated paper for this project will be published here as soon as possible. When using this software, please cite th

Julian Strosahl 14 Dec 17, 2022
Custom TensorFlow2 implementations of forward and backward computation of soft-DTW algorithm in batch mode.

Batch Soft-DTW(Dynamic Time Warping) in TensorFlow2 including forward and backward computation Custom TensorFlow2 implementations of forward and backw

19 Aug 30, 2022
这是一个yolox-pytorch的源码,可以用于训练自己的模型。

YOLOX:You Only Look Once目标检测模型在Pytorch当中的实现 目录 性能情况 Performance 实现的内容 Achievement 所需环境 Environment 小技巧的设置 TricksSet 文件下载 Download 训练步骤 How2train 预测步骤

Bubbliiiing 613 Jan 05, 2023
Recall Loss for Semantic Segmentation (This repo implements the paper: Recall Loss for Semantic Segmentation)

Recall Loss for Semantic Segmentation (This repo implements the paper: Recall Loss for Semantic Segmentation) Download Synthia dataset The model uses

32 Sep 21, 2022
Code for the Convolutional Vision Transformer (ConViT)

ConViT : Vision Transformers with Convolutional Inductive Biases This repository contains PyTorch code for ConViT. It builds on code from the Data-Eff

Facebook Research 418 Jan 06, 2023
When are Iterative GPs Numerically Accurate?

When are Iterative GPs Numerically Accurate? This is a code repository for the paper "When are Iterative GPs Numerically Accurate?" by Wesley Maddox,

Wesley Maddox 1 Jan 06, 2022
PyTorch implementation of PSPNet

PSPNet with PyTorch Unofficial implementation of "Pyramid Scene Parsing Network" (https://arxiv.org/abs/1612.01105). This repository is just for caffe

Kazuto Nakashima 52 Nov 16, 2022
Code for our CVPR 2021 Paper "Rethinking Style Transfer: From Pixels to Parameterized Brushstrokes".

Rethinking Style Transfer: From Pixels to Parameterized Brushstrokes (CVPR 2021) Project page | Paper | Colab | Colab for Drawing App Rethinking Style

CompVis Heidelberg 153 Jan 04, 2023
Transferable Unrestricted Attacks, which won 1st place in CVPR’21 Security AI Challenger: Unrestricted Adversarial Attacks on ImageNet.

Transferable Unrestricted Adversarial Examples This is the PyTorch implementation of the Arxiv paper: Towards Transferable Unrestricted Adversarial Ex

equation 16 Dec 29, 2022
Contrastive Language-Image Pretraining

CLIP [Blog] [Paper] [Model Card] [Colab] CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pair

OpenAI 11.5k Jan 08, 2023