PyTorch code for MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Last update: Jan 06, 2023

Overview

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

PyTorch code for our ACL 2020 paper "MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning" by Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, and Mohit Bansal

Generating multi-sentence descriptions for videos is one of the most challenging captioning tasks due to its high requirements for not only visual relevance but also discourse-based coherence across the sentences in the paragraph. Towards this goal, we propose a new approach called Memory-Augmented Recurrent Transformer (MART), which uses a memory module to augment the transformer architecture. The memory module generates a highly summarized memory state from the video segments and the sentence history so as to help better prediction of the next sentence (w.r.t. coreference and repetition aspects), thus encouraging coherent paragraph generation. Extensive experiments, human evaluations, and qualitative analyses on two popular datasets ActivityNet Captions and YouCookII show that MART generates more coherent and less repetitive paragraph captions than baseline methods, while maintaining relevance to the input video events.

Related works:

Getting started

Prerequisites

Clone this repository

# no need to add --recursive as all dependencies are copied into this repo.
git clone https://github.com/jayleicn/recurrent-transformer.git
cd recurrent-transformer

Prepare feature files

Download features from Google Drive: rt_anet_feat.tar.gz (39GB) and rt_yc2_feat.tar.gz (12GB). These features are repacked from features provided by densecap.

mkdir video_feature && cd video_feature
tar -xf path/to/rt_anet_feat.tar.gz 
tar -xf path/to/rt_yc2_feat.tar.gz

Install dependencies

Python 2.7
PyTorch 1.1.0
nltk
easydict
tqdm
tensorboardX

Add project root to PYTHONPATH

source setup.sh

Note that you need to do this each time you start a new session.

Training and Inference

We give examples on how to perform training and inference with MART.

Build Vocabulary

bash scripts/build_vocab.sh DATASET_NAME

DATASET_NAME can be anet for ActivityNet Captions or yc2 for YouCookII.

MART training

The general training command is:

bash scripts/train.sh DATASET_NAME MODEL_TYPE

MODEL_TYPE can be one of [mart, xl, xlrg, mtrans, mart_no_recurrence], see details below.

MODEL_TYPE	Description
mart	Memory Augmented Recurrent Transformer
xl	Transformer-XL
xlrg	Transformer-XL with recurrent gradient
mtrans	Vanilla Transformer
mart_no_recurrence	mart with recurrence disabled

To train our MART model on ActivityNet Captions:

bash scripts/train.sh anet mart

Training log and model will be saved at results/anet_re_*.
Once you have a trained model, you can follow the instructions below to generate captions.

Generate captions

bash scripts/translate_greedy.sh anet_re_* val

Replace anet_re_* with your own model directory name. The generated captions are saved at results/anet_re_*/greedy_pred_val.json

Evaluate generated captions

bash scripts/eval.sh anet val results/anet_re_*/greedy_pred_val.json

The results should be comparable with the results we present at Table 2 of the paper. E.g., [email protected] 10.33; [email protected] 5.18.

Citations

If you find this code useful for your research, please cite our paper:

@inproceedings{lei2020mart,
  title={MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning},
  author={Lei, Jie and Wang, Liwei and Shen, Yelong and Yu, Dong and Berg, Tamara L and Bansal, Mohit},
  booktitle={ACL},
  year={2020}
}

Others

This code used resources from the following projects: transformers, transformer-xl, densecap, OpenNMT-py.

Contact

jielei [at] cs.unc.edu

PyTorch code for MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Related tags

Overview

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Related works:

Getting started

Prerequisites

Training and Inference

Citations

Others

Contact

Owner

Jie Lei 雷杰

Generative Adversarial Text to Image Synthesis

A Python multilingual toolkit for Sentiment Analysis and Social NLP tasks

chainladder - Property and Casualty Loss Reserving in Python

NitroFE is a Python feature engineering engine which provides a variety of modules designed to internally save past dependent values for providing continuous calculation.

🐤 Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

StrongSORT: Make DeepSORT Great Again

Allele-specific pipeline for unbiased read mapping(WIP), QTL discovery(WIP), and allelic-imbalance analysis

Deep Q-network learning to play flappybird.

AAAI-22 paper: SimSR: Simple Distance-based State Representationfor Deep Reinforcement Learning

DrQ-v2: Improved Data-Augmented Reinforcement Learning

Source code for Transformer-based Multi-task Learning for Disaster Tweet Categorisation (UCD's participation in TREC-IS 2020A, 2020B and 2021A).

Reliable probability face embeddings

Udacity Suse Cloud Native Foundations Scholarship Course Walkthrough

Pomodoro timer that acknowledges the inexorable, infinite passage of time

Stable Neural ODE with Lyapunov-Stable Equilibrium Points for Defending Against Adversarial Attacks

JupyterNotebook - C/C++, Javascript, HTML, LaTex, Shell scripts in Jupyter Notebook Also run them on remote computer

Official implementation of CATs: Cost Aggregation Transformers for Visual Correspondence NeurIPS'21

Official PyTorch implementation of "RMGN: A Regional Mask Guided Network for Parser-free Virtual Try-on" (IJCAI-ECAI 2022)

Traffic4D: Single View Reconstruction of Repetitious Activity Using Longitudinal Self-Supervision

Rule based classification A hotel s customers dataset