Videocaptioning.pytorch - A simple implementation of video captioning

Last update: Jan 01, 2022

Related tags

Deep Learning videocaptioning.pytorch

Overview

pytorch implementation of video captioning

recommend installing pytorch and python packages using Anaconda

This code is based on video-caption.pytorch

requirements (my environment, other versions of pytorch and torchvision should also support this code (not been verified!))

cuda
pytorch 1.7.1
torchvision 0.8.2
python 3
ffmpeg (can install using anaconda)

python packages

tqdm
pillow
nltk

Data

MSR-VTT. Download and put them in ./data/msr-vtt-data directory

|-data
  |-msr-vtt-data
    |-train-video
    |-test-video
    |-annotations
      |-train_val_videodatainfo.json
      |-test_videodatainfo.json

MSVD. Download and put them in ./data/msvd-data directory

|-data
  |-msvd-data
    |-YouTubeClips
    |-annotations
      |-AllVideoDescriptions.txt

Options

all default options are defined in opt.py or corresponding code file, change them for your like.

Acknowledgements

Some code refers to ImageCaptioning.pytorch

Usage

(Optional) c3d features (not verified)

you can use video-classification-3d-cnn-pytorch to extract features from video.

Steps

preprocess MSVD annotations (convert txt file to json file)

refer to data/msvd-data/annotations/prepro_annotations.ipynb

preprocess videos and labels

# For MSR-VTT dataset
# Train and Validata set
CUDA_VISIBLE_DEVICES=0 python prepro_feats.py \
    --video_path ./data/msr-vtt-data/train-video \
    --video_suffix mp4 \
    --output_dir ./data/msr-vtt-data/resnet152 \
    --model resnet152 \
    --n_frame_steps 40

# Test set
CUDA_VISIBLE_DEVICES=0 python prepro_feats.py \
    --video_path ./data/msr-vtt-data/test-video \
    --video_suffix mp4 \
    --output_dir ./data/msr-vtt-data/resnet152 \
    --model resnet152 \
    --n_frame_steps 40

python prepro_vocab.py \
    --input_json data/msr-vtt-data/annotations/train_val_videodatainfo.json data/msr-vtt-data/annotations/test_videodatainfo.json \
    --info_json data/msr-vtt-data/info.json \
    --caption_json data/msr-vtt-data/caption.json \
    --word_count_threshold 4

# For MSVD dataset
CUDA_VISIBLE_DEVICES=0 python prepro_feats.py \
    --video_path ./data/msvd-data/YouTubeClips \
    --video_suffix avi \
    --output_dir ./data/msvd-data/resnet152 \
    --model resnet152 \
    --n_frame_steps 40

python prepro_vocab.py \
    --input_json data/msvd-data/annotations/MSVD_annotations.json \
    --info_json data/msvd-data/info.json \
    --caption_json data/msvd-data/caption.json \
    --word_count_threshold 2

Training a model

# For MSR-VTT dataset
CUDA_VISIBLE_DEVICES=0 python train.py \
    --epochs 1000 \
    --batch_size 300 \
    --checkpoint_path data/msr-vtt-data/save \
    --input_json data/msr-vtt-data/annotations/train_val_videodatainfo.json \
    --info_json data/msr-vtt-data/info.json \
    --caption_json data/msr-vtt-data/caption.json \
    --feats_dir data/msr-vtt-data/resnet152 \
    --model S2VTAttModel \
    --with_c3d 0 \
    --dim_vid 2048

# For MSVD dataset
CUDA_VISIBLE_DEVICES=0 python train.py \
    --epochs 1000 \
    --batch_size 300 \
    --checkpoint_path data/msvd-data/save \
    --input_json data/msvd-data/annotations/train_val_videodatainfo.json \
    --info_json data/msvd-data/info.json \
    --caption_json data/msvd-data/caption.json \
    --feats_dir data/msvd-data/resnet152 \
    --model S2VTAttModel \
    --with_c3d 0 \
    --dim_vid 2048

test

opt_info.json will be in same directory as saved model.

# For MSR-VTT dataset
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --input_json data/msr-vtt-data/annotations/test_videodatainfo.json \
    --recover_opt data/msr-vtt-data/save/opt_info.json \
    --saved_model data/msr-vtt-data/save/model_xxx.pth \
    --batch_size 100

# For MSVD dataset
CUDA_VISIBLE_DEVICES=0 python eval.py \
    --input_json data/msvd-data/annotations/test_videodatainfo.json \
    --recover_opt data/msvd-data/save/opt_info.json \
    --saved_model data/msvd-data/save/model_xxx.pth \
    --batch_size 100

NOTE

This code is just a simple implementation of video captioning. And I have not verify whether the SCST training process and C3D feature are useful!

Acknowledgements

Some code refers to ImageCaptioning.pytorch

Videocaptioning.pytorch - A simple implementation of video captioning

Related tags

Overview

pytorch implementation of video captioning

requirements (my environment, other versions of pytorch and torchvision should also support this code (not been verified!))

python packages

Data

Options

Acknowledgements

Usage

(Optional) c3d features (not verified)

Steps

NOTE

Acknowledgements

Owner

Yiyu Wang

OneShot Learning-based hotword detection.

Introduction to Statistics and Basics of Mathematics for Data Science - The Hacker's Way

PyTorch implementation of "A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement."

TensorFlow2 Classification Model Zoo playing with TensorFlow2 on the CIFAR-10 dataset.

Unofficial implementation of "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (https://arxiv.org/abs/2103.14030)

Simulation-based performance analysis of server-less Blockchain-enabled Federated Learning

PyTorch implementation of the paper Dynamic Data Augmentation with Gating Networks

abess: Fast Best-Subset Selection in Python and R

A simple log parser and summariser for IIS web server logs

The 2nd place solution of 2021 google landmark retrieval on kaggle.

Human segmentation models, training/inference code, and trained weights, implemented in PyTorch

[CVPR 2020] Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation

This is the official implementation for the paper "(Almost) Free Incentivized Exploration from Decentralized Learning Agents" in NeurIPS 2021.

Official Pytorch implementation for video neural representation (NeRV)

NeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-based Simulation (ACL-IJCNLP 2021)

Voice control for Garry's Mod

Official Pytorch implementation of "Beyond Static Features for Temporally Consistent 3D Human Pose and Shape from a Video", CVPR 2021

Code for Paper: Self-supervised Learning of Motion Capture

Official implementation of NLOS-OT: Passive Non-Line-of-Sight Imaging Using Optimal Transport (IEEE TIP, accepted)

Pytorch implement of 'Unmixing based PAN guided fusion network for hyperspectral imagery'