Towards Diverse Paragraph Captioning for Untrimmed Videos

This repository contains the PyTorch implementation of our paper Towards Diverse Paragraph Captioning for Untrimmed Videos (CVPR 2021).

Requirements

Python 3.6
Java 15.0.2
PyTorch 1.2
numpy, tqdm, h5py, scipy, six
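
A quick way to confirm that the Python packages listed above are importable is sketched below. The helper name `missing_packages` is hypothetical and not part of this repository. It assumes PyTorch is importable as `torch`, and it only checks that each package can be found, not which version is installed.

```python
import importlib.util

# Import names for the packages listed in the requirements above.
# Assumption: PyTorch is installed under its usual import name "torch".
REQUIRED = ["torch", "numpy", "tqdm", "h5py", "scipy", "six"]


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

An empty result means every listed package was found in the current environment.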

Training & Inference

Data preparation

Download the pre-extracted video features for the ActivityNet Captions or Charades Captions dataset from BaiduNetdisk (code: he21).
Decompress the downloaded files into the corresponding dataset folder under the ordered_feature/ directory.
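
Before training, it may help to check that the features were decompressed into the expected place. The following is a minimal sketch, assuming the dataset folders under ordered_feature/ are named activitynet and charades (the same names used for results/ in the commands below). The helper `feature_dir_ready` is hypothetical, and the check says nothing about whether the files inside are complete.

```python
from pathlib import Path


def feature_dir_ready(root, dataset):
    """Return True if <root>/<dataset> exists and contains at least one entry."""
    folder = Path(root) / dataset
    return folder.is_dir() and any(folder.iterdir())


if __name__ == "__main__":
    # Assumption: folder names mirror the dataset names used under results/.
    for dataset in ("activitynet", "charades"):
        ready = feature_dir_ready("ordered_feature", dataset)
        status = "ok" if ready else "missing or empty"
        print("ordered_feature/{}: {}".format(dataset, status))
```

Run it from the repository root so that the relative path ordered_feature/ resolves correctly.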

Start training

1. Train the model without reinforcement learning. In the commands below, * stands for either activitynet or charades.

$ cd driver
$ CUDA_VISIBLE_DEVICES=0 python transformer.py ../results/*/dm.token/model.json ../results/*/dm.token/path.json --is_train

To train the model with key-frame selection, run the following commands instead.

$ cd driver
$ CUDA_VISIBLE_DEVICES=0 python transformer.py ../results/*/key_frames/model.json ../results/*/key_frames/path.json --is_train --resume_file ../results/*/key_frames/pretrained.th

This variant gives slightly worse results but uses only half of the video features at inference time, which makes decoding faster. Download the pretrained.th model first and place it at the path passed to --resume_file, since key-frame selection training resumes from it.

2. Fine-tune the model trained in step 1 with reinforcement learning, resuming from one of its saved checkpoints.

$ cd driver
$ CUDA_VISIBLE_DEVICES=0 python transformer.py ../results/*/dm.token.rl/model.json ../results/*/dm.token.rl/path.json --is_train --resume_file ../results/*/dm.token/model/epoch.*.th
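
The training commands above differ only in the dataset name, the configuration folder, and the optional --resume_file argument. The sketch below assembles the same argument list with the * placeholder filled in. The function `build_train_command` is a hypothetical helper, not part of this repository, and it uses only the flags that appear in the commands above.

```python
DATASETS = ("activitynet", "charades")


def build_train_command(dataset, folder, resume_file=None):
    """Assemble the transformer.py training arguments shown in the README.

    Paths are relative to the driver/ directory, as in the commands above.
    """
    if dataset not in DATASETS:
        raise ValueError("dataset must be one of {}".format(DATASETS))
    base = "../results/{}/{}".format(dataset, folder)
    cmd = ["python", "transformer.py", base + "/model.json", base + "/path.json", "--is_train"]
    if resume_file is not None:
        # Used for key-frame selection and for RL fine-tuning.
        cmd += ["--resume_file", resume_file]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_train_command("activitynet", "dm.token")))
```

The returned list can be passed to `subprocess.run` with `cwd="driver"`, with CUDA_VISIBLE_DEVICES set in the environment as in the shell commands.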

Evaluation

Trained checkpoints are saved in the results/*/folder/model/ directory, where folder is the configuration folder used for training (for example, dm.token). After evaluation, the generated captions (matching the name file in public_split) and the evaluation scores are saved in results/*/folder/pred/tst/.

$ cd driver
$ CUDA_VISIBLE_DEVICES=0 python transformer.py ../results/*/folder/model.json ../results/*/folder/path.json --eval_set tst --resume_file ../results/*/folder/model/epoch.*.th
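
The --resume_file argument above takes a single checkpoint, so the epoch.*.th pattern has to be narrowed down to one file. The sketch below lists the checkpoints in a model directory in epoch order. It assumes the * in epoch.*.th is an integer epoch number, which the commands above do not confirm, and `list_checkpoints` is a hypothetical helper. It does not decide which epoch performs best.

```python
import re
from pathlib import Path

# Assumption: checkpoints are named epoch.<integer>.th.
# The README only shows the pattern "epoch.*.th".
_EPOCH_RE = re.compile(r"^epoch\.(\d+)\.th$")


def list_checkpoints(model_dir):
    """Return (epoch, path) pairs found in model_dir, sorted by epoch number."""
    found = []
    for path in Path(model_dir).glob("epoch.*.th"):
        match = _EPOCH_RE.match(path.name)
        if match:  # skip files whose middle part is not an integer
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda pair: pair[0])


if __name__ == "__main__":
    for epoch, path in list_checkpoints("results/activitynet/dm.token/model"):
        print(epoch, path)
```

Sorting by the parsed integer avoids the lexicographic ordering in which epoch.10.th would come before epoch.2.th.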

We also provide pretrained models for the ActivityNet dataset here and the Charades dataset here. These models come from a re-run and achieve results similar to those reported in the paper.

Reference

If you find this repo helpful, please consider citing:

@inproceedings{song2021paragraph,
  title={Towards Diverse Paragraph Captioning for Untrimmed Videos},
  author={Song, Yuqing and Chen, Shizhe and Jin, Qin},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}
