[NeurIPS'20] Self-supervised Co-Training for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.

Last update: Jan 02, 2023

Overview

CoCLR: Self-supervised Co-Training for Video Representation Learning

This repository contains the implementation of:

InfoNCE (MoCo on videos)
UberNCE (supervised contrastive learning on videos)
CoCLR

Link:

[Project Page] [PDF] [Arxiv]

News

[2021.01.29] Upload both RGB and optical flow dataset for UCF101 (links).
[2021.01.11] Update our paper for the NeurIPS 2020 final version: corrected the InfoNCE-RGB-linearProbe baseline result in Table 1 from 52.3% (pretrained for 800 epochs, unnecessary and unfair) to 46.8% (pretrained for 500 epochs, fair comparison). Thanks @liuhualin333 for pointing this out.
[2020.12.08] Update instructions.
[2020.11.17] Upload pretrained weights for UCF101 experiments.
[2020.10.30] Update "draft" dataloader files, CoCLR code, evaluation code as requested by some researchers. Will check and add detailed instructions later.

Pretrain Instruction

InfoNCE pretrain on UCF101-RGB

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_nce.py --net s3d --model infonce --moco-k 2048 \
--dataset ucf101-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 300 --schedule 250 280 -j 16

InfoNCE pretrain on UCF101-Flow

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_nce.py --net s3d --model infonce --moco-k 2048 \
--dataset ucf101-f-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 300 --schedule 250 280 -j 16

CoCLR pretrain on UCF101 for one cycle

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 2048 \
--dataset ucf101-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 100 --schedule 80 --name_prefix Cycle1-FlowMining_ -j 8 \
--pretrain {rgb_infoNCE_checkpoint.pth.tar} {flow_infoNCE_checkpoint.pth.tar}

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 2048 --reverse \
--dataset ucf101-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 100 --schedule 80 --name_prefix Cycle1-RGBMining_ -j 8 \
--pretrain {flow_infoNCE_checkpoint.pth.tar} {rgb_cycle1_checkpoint.pth.tar}
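
The two commands above make up one cycle: the second stage adds --reverse and starts from the RGB checkpoint produced by the first. As a minimal sketch, the pair can be generated from three checkpoint paths. The helper name coclr_cycle_commands and its keyword arguments are hypothetical (they are not part of this repository); only the flags are taken from the commands above.

```python
import shlex


def coclr_cycle_commands(rgb_ckpt, flow_ckpt, rgb_cycle_ckpt, cycle=1,
                         dataset="ucf101-2stream-2clip", moco_k=2048,
                         epochs=100, schedule=80, gpus="0,1"):
    """Return the two shell commands of one CoCLR cycle as strings.

    Hypothetical helper: it only assembles the command lines shown in
    this README and does not launch any training itself.
    """
    nproc = len(gpus.split(","))  # one process per visible GPU
    common = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}", "main_coclr.py",
        "--net", "s3d", "--topk", "5", "--moco-k", str(moco_k),
        "--dataset", dataset, "--seq_len", "32", "--ds", "1",
        "--batch_size", "32", "--epochs", str(epochs),
        "--schedule", str(schedule), "-j", "8",
    ]
    # Stage 1: initialised from the RGB and Flow InfoNCE checkpoints.
    stage1 = common + ["--name_prefix", f"Cycle{cycle}-FlowMining_",
                       "--pretrain", rgb_ckpt, flow_ckpt]
    # Stage 2: --reverse, initialised from Flow and the stage-1 RGB checkpoint.
    stage2 = common + ["--reverse",
                       "--name_prefix", f"Cycle{cycle}-RGBMining_",
                       "--pretrain", flow_ckpt, rgb_cycle_ckpt]
    prefix = f"CUDA_VISIBLE_DEVICES={gpus} "
    return [prefix + shlex.join(cmd) for cmd in (stage1, stage2)]


if __name__ == "__main__":
    for line in coclr_cycle_commands("rgb_infoNCE.pth.tar",
                                     "flow_infoNCE.pth.tar",
                                     "rgb_cycle1.pth.tar"):
        print(line)
```

For K400, the same sketch would be called with dataset="k400-2stream-2clip", moco_k=16384, epochs=50 and schedule=40, matching the K400 commands in this README.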

InfoNCE pretrain on K400-RGB

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 main_nce.py --net s3d --model infonce --moco-k 16384 \
--dataset k400-2clip --lr 1e-3 --seq_len 32 --ds 1 --batch_size 32 \
--epochs 300 --schedule 250 280 -j 16

InfoNCE pretrain on K400-Flow

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 main_nce.py --net s3d --model infonce --moco-k 16384 \
--dataset k400-f-2clip --lr 1e-3 --seq_len 32 --ds 1 --batch_size 32 \
--epochs 300 --schedule 250 280 -j 16

CoCLR pretrain on K400 for one cycle

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 16384 \
--dataset k400-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 50 --schedule 40 --name_prefix Cycle1-FlowMining_ -j 8 \
--pretrain {rgb_infoNCE_checkpoint.pth.tar} {flow_infoNCE_checkpoint.pth.tar}

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 16384 --reverse \
--dataset k400-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 50 --schedule 40 --name_prefix Cycle1-RGBMining_ -j 8 \
--pretrain {flow_infoNCE_checkpoint.pth.tar} {rgb_cycle1_checkpoint.pth.tar}

Finetune Instruction

Run from the eval/ directory (cd eval/). For example, to finetune on UCF101-RGB:

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --batch_size 32 --train_what ft --epochs 500 --schedule 400 450 \
--pretrain {selected_rgb_pretrained_checkpoint.pth.tar}

Then run the test with 10-crop (test-time augmentation helps: 10-crop gives a better result than center-crop):

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --batch_size 32 --train_what ft --epochs 500 --schedule 400 450 \
--test {selected_rgb_finetuned_checkpoint.pth.tar} --ten_crop

Nearest-neighbour Retrieval Instruction

Run from the eval/ directory (cd eval/). For example, nearest-neighbour retrieval for UCF101-RGB:

CUDA_VISIBLE_DEVICES=0 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --test {selected_rgb_pretrained_checkpoint.pth.tar} --retrieval

Linear-probe Instruction

cd eval/

from extracted feature

The code supports two methods for linear probing: either feed the data end-to-end with the backbone frozen, or train a linear layer on extracted features. Both methods gave similar best results in our experiments.

For example, on extracted features (after running the nearest-neighbour retrieval command above, the features are saved in os.path.dirname(checkpoint)):

CUDA_VISIBLE_DEVICES=0 python feature_linear_probe.py --dataset ucf101 \
--test {feature_dirname} --final_bn --lr 1.0 --wd 1e-3

Note that the default settings should give reasonable performance, perhaps 1-2% lower than the figures in our paper. For different datasets, lr and wd need to be tuned: lr from 0.1 to 1.0, wd from 1e-4 to 1e-1.
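
One way to run this tuning is a small grid over lr and wd. The sketch below only builds the command lines. The helper name linear_probe_grid and the particular grid points are assumptions chosen to span the ranges given above, and it assumes {feature_dirname} is the directory of the checkpoint used for retrieval, as described earlier.

```python
import itertools
import os
import shlex


def linear_probe_grid(checkpoint, dataset="ucf101",
                      lrs=(0.1, 0.3, 1.0),
                      wds=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Build feature_linear_probe.py commands for every (lr, wd) pair.

    Hypothetical helper; the grid values are illustrative only.
    """
    # Features are saved next to the checkpoint used for retrieval.
    feature_dir = os.path.dirname(checkpoint)
    commands = []
    for lr, wd in itertools.product(lrs, wds):
        commands.append(shlex.join([
            "python", "feature_linear_probe.py", "--dataset", dataset,
            "--test", feature_dir, "--final_bn",
            "--lr", str(lr), "--wd", str(wd),
        ]))
    return commands


if __name__ == "__main__":
    # Placeholder path: substitute the checkpoint used for retrieval.
    for cmd in linear_probe_grid("/path/to/run/checkpoint.pth.tar"):
        print("CUDA_VISIBLE_DEVICES=0 " + cmd)
```

Each printed line can be run as-is from eval/; comparing the resulting accuracies across the grid then picks the best setting for the dataset.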

load data and freeze backbone

Alternatively, feed the data end-to-end and freeze the backbone:

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --batch_size 32 --train_what last --epochs 100 --schedule 60 80 \
--optim sgd --lr 1e-1 --wd 1e-3 --final_bn --pretrain {selected_rgb_pretrained_checkpoint.pth.tar}

Similarly, lr and wd need to be tuned for different datasets for best performance.

Dataset

RGB for UCF101: [download-from-server] [download-from-gdrive] (tar file, 29GB, packed with lmdb)
TVL1 optical flow for UCF101: [download-from-server] [download-from-gdrive] (tar file, 20.5GB, packed with lmdb)
Note: I created these lmdb files with msgpack==0.6.2. When loading them with msgpack>=1.0.0, use msgpack.loads(raw_data, raw=True) (issue #32).

Result

Finetune entire network for action classification on UCF101:

Pretrained Weights

Our models:

UCF101-RGB-CoCLR: [download] [R@1=51.8 on UCF101-RGB]
UCF101-Flow-CoCLR: [download] [R@1=48.4 on UCF101-Flow]

Baseline models:

UCF101-RGB-InfoNCE: [download] [R@1=33.1 on UCF101-RGB]
UCF101-Flow-InfoNCE: [download] [R@1=45.2 on UCF101-Flow]

Kinetics400-pretrained models:

K400-RGB-CoCLR: [download] [R@1=45.6, Acc@1=87.89 on UCF101-RGB]
K400-Flow-CoCLR: [download] [R@1=44.4, Acc@1=85.27 on UCF101-Flow]
Two-stream result from averaging the class probabilities of the two models: 0.8789 (RGB) + 0.8527 (Flow) => 0.9061
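
The two-stream figure above is an accuracy obtained by fusing the two models' predictions, not a sum of the two accuracies. A minimal sketch of averaging class probabilities follows. It assumes each stream outputs per-class logits that are turned into probabilities with a softmax; the function names and the toy logits are illustrative and not taken from this repository.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def two_stream_predict(rgb_logits, flow_logits):
    """Average the RGB and Flow class probabilities and pick the top class."""
    rgb_prob = softmax(rgb_logits)
    flow_prob = softmax(flow_logits)
    avg = [(r + f) / 2 for r, f in zip(rgb_prob, flow_prob)]
    best = max(range(len(avg)), key=avg.__getitem__)
    return best, avg


if __name__ == "__main__":
    # Toy 3-class example: RGB alone narrowly prefers class 0,
    # Flow strongly prefers class 1, so the fused prediction is class 1.
    label, probs = two_stream_predict([2.0, 1.9, 0.0], [0.0, 3.0, 0.5])
    print(label, [round(p, 3) for p in probs])
```

Averaging probabilities rather than raw logits keeps the two streams on a common scale, so a confident stream can override an uncertain one.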
