[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Last update: Dec 18, 2022

Overview

Disentangled Representation Learning for Text-Video Retrieval

This is a PyTorch implementation of the paper Disentangled Representation Learning for Text-Video Retrieval:

@Article{DRLTVR2022,
  author  = {Qiang Wang and Yanhao Zhang and Yun Zheng and Pan Pan and Xian-Sheng Hua},
  journal = {arXiv:2203.07111},
  title   = {Disentangled Representation Learning for Text-Video Retrieval},
  year    = {2022},
}

Catalog

Setup
Fine-tuning code
Visualization demo

Setup

Setup code environment

git clone https://github.com/foolwood/DRL.git
cd DRL
conda create -n drl python=3.9
conda activate drl
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple/
pip install torch==1.8.1+cu102 torchvision==0.9.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html

Download CLIP Model (as pretraining)

cd tvr/models
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
# wget https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt
# wget https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt

Download Datasets

cd data/MSR-VTT
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip ; unzip MSRVTT.zip
mv MSRVTT/videos/all ./videos ; mv MSRVTT/annotation/MSR_VTT.json ./anns/MSRVTT_data.json

Fine-tuning code

Train on MSR-VTT 1k.

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 \
main.py --do_train 1 --workers 8 --n_display 50 \
--epochs 5 --lr 1e-4 --coef_lr 1e-3 --batch_size 128 --batch_size_val 128 \
--anno_path data/MSR-VTT/anns --video_path data/MSR-VTT/videos --datatype msrvtt \
--max_words 32 --max_frames 12 --video_framerate 1 \
--base_encoder ViT-B/32 --agg_module seqTransf \
--interaction wti --wti_arch 2 --cdcr 3 --cdcr_alpha1 0.11 --cdcr_alpha2 0.0 --cdcr_lambda 0.001 \
--output_dir ckpts/ckpt_msrvtt_wti_cdcr

Reproduce the ablation experiments scripts

configs	feature	gpus	Text-Video					Video-Text					train time (h)
configs	feature	gpus	[email protected]	[email protected]	[email protected]	MdR	MnR	[email protected]	[email protected]	[email protected]	MdR	MnR	train time (h)
CLIP4Clip	ViT/B-32	4	42.8	72.1	81.4	2.0	16.3	44.1	70.5	80.5	2.0	11.8	10.5
zero-shot	ViT/B-32	4	31.1	53.7	63.4	4.0	41.6	26.5	50.1	61.7	5.0	39.9	-
Interaction
DP+None	ViT/B-32	4	42.9	70.6	81.4	2.0	15.4	43.0	71.1	81.1	2.0	11.8	2.5
DP+seqTransf	ViT/B-32	4	42.8	71.1	81.1	2.0	15.6	44.1	70.9	80.9	2.0	11.7	2.6
XTI+None	ViT/B-32	4	40.5	71.1	82.6	2.0	13.6	42.7	70.8	80.2	2.0	12.5	14.3
XTI+seqTransf	ViT/B-32	4	42.4	71.3	80.9	2.0	15.2	40.1	69.2	79.6	2.0	15.8	16.8
TI+seqTransf	ViT/B-32	4	44.8	73.0	82.2	2.0	13.4	42.6	72.7	82.8	2.0	9.1	2.6
WTI+seqTransf	ViT/B-32	4	46.6	73.4	83.5	2.0	13.0	45.4	73.4	81.9	2.0	9.2	2.6
Channel DeCorrelation Regularization
DP+seqTransf+CDCR	ViT/B-32	4	43.9	71.1	81.2	2.0	15.3	42.3	70.3	81.1	2.0	11.4	2.6
TI+seqTransf+CDCR	ViT/B-32	4	45.8	73.0	81.9	2.0	12.8	43.3	71.8	82.7	2.0	8.9	2.6
WTI+seqTransf+CDCR	ViT/B-32	4	47.6	73.4	83.3	2.0	12.8	45.1	72.9	83.5	2.0	9.2	2.6

Note: the performances are slight boosts due to new hyperparameters.

Visualization demo

Run our visualization demo using matplotlib (no GPU needed):

License

See LICENSE for details.

Acknowledgments

Our code is partly based on CLIP4Clip.

[arXiv22] Disentangled Representation Learning for Text-Video Retrieval

Related tags

Overview

Disentangled Representation Learning for Text-Video Retrieval

Catalog

Setup

Setup code environment

Download CLIP Model (as pretraining)

Download Datasets

Fine-tuning code

Visualization demo

License

Acknowledgments

Owner

Qiang Wang

Deep-Learning-Book-Chapter-Summaries - Attempting to make the Deep Learning Book easier to understand.

PGPortfolio: Policy Gradient Portfolio, the source code of "A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem"(https://arxiv.org/pdf/1706.10059.pdf).

[WACV 2020] Reducing Footskate in Human Motion Reconstruction with Ground Contact Constraints

Evaluating AlexNet features at various depths

Deep Learning Pipelines for Apache Spark

A library for graph deep learning research

Pytorch implementation for "Open Compound Domain Adaptation" (CVPR 2020 ORAL)

Predicting Axillary Lymph Node Metastasis in Early Breast Cancer Using Deep Learning on Primary Tumor Biopsy Slides

A framework to train language models to learn invariant representations.

StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation

ICS 4u HD project, start before-wards. A curtain shooting game using python.

A2LP for short, ECCV2020 spotlight, Investigating SSL principles for UDA problems

PyTorch implementation of SIFT descriptor

A Factor Model for Persistence in Investment Manager Performance

🦙 LaMa Image Inpainting, Resolution-robust Large Mask Inpainting with Fourier Convolutions, WACV 2022

An implementation of chunked, compressed, N-dimensional arrays for Python.

FEMDA: Robust classification with Flexible Discriminant Analysis in heterogeneous data

Pytorch domain adaptation package

A Pytorch Implementation of Source Data-free Domain Adaptation for a Faster R-CNN