We present a framework for training multi-modal deep learning models on unlabelled video data by forcing the network to learn invariances to transformations applied to both the audio and video streams.

Multi-Modal Self-Supervision using GDT and STiCA

This is the official PyTorch implementation of the papers Multi-modal Self-Supervision from Generalized Data Transformations (GDT) and Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning (STiCA). This repository provides PyTorch code for pretraining and testing the proposed GDT and STiCA models.

If you find GDT and STiCA useful in your research, please use the following BibTeX entries for citation.

@misc{patrick2020multimodal,
      title={Multi-modal Self-Supervision from Generalized Data Transformations}, 
      author={Mandela Patrick and Yuki M. Asano and Polina Kuznetsova and Ruth Fong and João F. Henriques and Geoffrey Zweig and Andrea Vedaldi},
      year={2021},
      booktitle={International Conference on Computer Vision (ICCV)},
}

@misc{m2021spacetime,
    title={Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning},
    author={Mandela Patrick and Yuki M. Asano and Bernie Huang and Ishan Misra and Florian Metze and Joao Henriques and Andrea Vedaldi},
    year={2021},
    booktitle={International Conference on Computer Vision (ICCV)},
}

Highlights

(1) GDT: Formulate and generalize most pretext tasks in a noise-contrastive estimation (NCE) objective.

Using this formulation, we test various pretext tasks previously unexplored and achieve SOTA downstream performance.
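
As a rough illustration of what an NCE-style objective looks like in the cross-modal setting, the sketch below computes an InfoNCE loss in plain Python. It is not the repository's implementation: the function name `cross_modal_nce`, the temperature of 0.1, and the toy embeddings are assumptions made for illustration only.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_modal_nce(video_emb, audio_emb, temperature=0.1):
    """Mean InfoNCE loss where video i and audio i form the positive pair.

    All other audio embeddings in the batch act as negatives for video i.
    """
    video = [_normalize(v) for v in video_emb]
    audio = [_normalize(a) for a in audio_emb]
    losses = []
    for i, v in enumerate(video):
        # Cosine similarity of video i with every audio clip, scaled by temperature.
        logits = [sum(x * y for x, y in zip(v, a)) / temperature for a in audio]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two clips with 2-D features.
video_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = cross_modal_nce(video_batch, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = cross_modal_nce(video_batch, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss is close to zero when each video embedding matches its own audio embedding and large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.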

(2) STiCA: Importance of incorporating within-modal invariance in cross-modal learning

We show how to efficiently incorporate within-modal invariance learning using feature crops and achieve SOTA downstream performance.

Model Zoo

We provide GDT models pretrained on Kinetics-400 (K400), HowTo100M (HT100M), and Instagram-65M (IG65M) datasets, and STiCA models pretrained on Kinetics-400 (K400).

name   dataset  # of frames  spatial crop  HMDB51 Top1  UCF101 Top1  url
GDT    K400     30           112           62.3         90.9         model
GDT    HT100M   30           112           67.4         94.1         model
GDT    IG65M    30           112           72.8         95.2         model
STiCA  K400     60           112           67.0         93.1         Coming Soon

Installation

This repo was tested with Ubuntu 16.04.5 LTS, Python 3.7.5, PyTorch 1.3.1, Torchvision 0.4.1, and CUDA 10.0.

Step 1

  • Clone this repo to your local machine

Step 2

  • Install required packages using conda env create -f environment.yml

Step 3

  • Activate conda environment using conda activate GDT

Step 4

  • Install the kornia library using pip install kornia==0.1.4

Step 5

  • See below for how to pretrain GDT / STiCA or benchmark pretrained models

Data Preparation

For Kinetics-400/600, HMDB-51 and UCF-101 datasets:

  1. Ensure all datasets are in the format:
     $ROOT_DIR/$SPLIT/$CLASS/*
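
Before launching a long job, it can help to confirm that a dataset follows the `$ROOT_DIR/$SPLIT/$CLASS/*` layout. The sketch below is not part of the repository; the helper names `summarize_dataset` and `find_empty_classes` are hypothetical, and it only counts files without checking that they are valid videos.

```python
from pathlib import Path


def summarize_dataset(root_dir):
    """Return {split: {class_name: number_of_files}} for ROOT_DIR/SPLIT/CLASS/*."""
    summary = {}
    for split_dir in sorted(p for p in Path(root_dir).iterdir() if p.is_dir()):
        classes = {}
        for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            # Count regular files only; nested directories are ignored.
            classes[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
        summary[split_dir.name] = classes
    return summary


def find_empty_classes(summary):
    """List (split, class_name) pairs whose class directory holds no files."""
    return [
        (split, name)
        for split, classes in summary.items()
        for name, count in classes.items()
        if count == 0
    ]
```

Calling `find_empty_classes(summarize_dataset(root))` on the dataset root returns an empty list when every class directory contains at least one file.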
    

To prepare the HowTo100M dataset, do the following:

  1. Download the word2vec matrix and dictionary, unzip the file, and place it in the datasets/data folder.
     wget https://www.rocq.inria.fr/cluster-willow/amiech/word2vec.zip
     unzip word2vec.zip
     mv word2vec.pth datasets/data/word2vec.pth
  2. Download the csv files of captions.
     wget https://www.rocq.inria.fr/cluster-willow/amiech/howto100m/howto100m_captions.zip
     unzip howto100m_captions.zip
  3. Download the preprocessed HowTo100M videos (12TB in total) by filling in this Google form: https://forms.gle/hztrfnFQUJWBtiki8.

Usage

GDT pretraining

To pretrain audio-visual GDT on K-400

Multi-node distributed training with SLURM cluster:

sbatch pretraining_scripts/pretrain_gdt_k400.sh ${HYPOTHESIS_DESC} ${HYPOTHESIS} 

Single-node distributed training:

python -m torch.distributed.launch --master_port=$RANDOM --nproc_per_node=2 --use_env main_gdt.py --batch_size $BS --lr $LR --hypothesis {1,2,3,4,5,6,7,8,9}

To pretrain video-text GDT on HT100M

Multi-node training with SLURM cluster:

sbatch pretraining_scripts/pretrain_gdt_ht100m.sh ${HYPOTHESIS_DESC} ${HYPOTHESIS} 

Single-node distributed training:

python -m torch.distributed.launch --master_port=$RANDOM --nproc_per_node=2 --use_env main_gdt.py --batch_size $BS --lr $LR --hypothesis {1,2,3,4,5,6,7,8,9} --dataset ht100m --decode_audio False --model vid_text_gdt --sample_rate 2

$HYPOTHESIS refers to the hypotheses explored in GDT. We experiment with the following:

1 - cross-modal baseline (cross_modal_baseline)
2 - variant to time reversal (v_reversal)
3 - invariant to time reversal (i_reversal)
4 - variant to time shift (v_shift)
5 - invariant to time shift (i_shift)
6 - variant to time reversal, variant to time shift (v_reversal_v_shift)
7 - invariant to time reversal, variant to time shift (i_reversal_v_shift)
8 - variant to time reversal, invariant to time shift (v_reversal_i_shift)
9 - invariant to time reversal, invariant to time shift (i_reversal_i_shift)
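
When scripting several runs, the list above can be kept as a small lookup table. The sketch below is an illustration, not code from the repository. It assumes that `${HYPOTHESIS_DESC}` is the short name shown in parentheses and `${HYPOTHESIS}` is the number; the helper `slurm_args` is hypothetical.

```python
# Hypothesis ids and short names, copied from the list above.
HYPOTHESES = {
    1: "cross_modal_baseline",
    2: "v_reversal",
    3: "i_reversal",
    4: "v_shift",
    5: "i_shift",
    6: "v_reversal_v_shift",
    7: "i_reversal_v_shift",
    8: "v_reversal_i_shift",
    9: "i_reversal_i_shift",
}


def slurm_args(hypothesis_id):
    """Return (HYPOTHESIS_DESC, HYPOTHESIS) as strings for the sbatch call.

    Assumption: the description argument is the short name in parentheses.
    """
    if hypothesis_id not in HYPOTHESES:
        raise ValueError(f"unknown hypothesis id: {hypothesis_id}")
    return HYPOTHESES[hypothesis_id], str(hypothesis_id)


desc, hypothesis = slurm_args(7)
command = f"sbatch pretraining_scripts/pretrain_gdt_k400.sh {desc} {hypothesis}"
```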

Please modify the following in the SLURM script:

  • SBATCH directives (e.g. partition, nodes, constraint)
  • SAV_FOLDER
  • --root_dir (path of K-400 / HT100M train directory)

All experiments were run with 8 nodes (64 GPUs, volta32). Please scale batch-size and learning-rate appropriately.
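
The source does not say which scaling rule to use. One common heuristic is the linear scaling rule, where the learning rate is scaled in proportion to the total batch size. The sketch below assumes that rule; the reference learning rate and batch sizes are hypothetical placeholders, not values taken from the training scripts.

```python
def total_batch_size(per_gpu_batch, num_gpus):
    """Effective batch size across all GPUs in a distributed run."""
    return per_gpu_batch * num_gpus


def scale_lr(ref_lr, ref_total_batch, new_total_batch):
    """Linear scaling rule (an assumption): lr grows in proportion to batch size."""
    return ref_lr * new_total_batch / ref_total_batch


# Hypothetical reference setup: 64 GPUs with 16 clips each and lr 0.01.
ref_batch = total_batch_size(per_gpu_batch=16, num_gpus=64)
# Local setup: 2 GPUs with the same per-GPU batch size.
local_batch = total_batch_size(per_gpu_batch=16, num_gpus=2)
local_lr = scale_lr(ref_lr=0.01, ref_total_batch=ref_batch, new_total_batch=local_batch)
```

The resulting values would then be passed as `$BS` and `$LR` in the single-node command. Whether linear scaling is appropriate for a given optimizer and warm-up schedule should be checked empirically.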

STiCA pretraining

To pretrain audio-visual STiCA on K-400

Multi-node training with SLURM cluster:

sbatch scripts/pretrain_stica.sh $NUM_FRAMES $AUD_NUM_SEC $NUM_LARGE_CROPS $NUM_SMALL_CROPS $NUM_SMALL_TCROPS $NUM_LARGE_TCROPS $NUM_LAYER

Single-node distributed training:

python -m torch.distributed.launch --master_port=$RANDOM --nproc_per_node=2 --use_env main_stica.py --batch_size $BS --base_lr $LR

Hyper-parameters:

NUM_FRAMES - number of video frames (e.g. 30)
AUD_NUM_SEC - number of seconds of audio (30 frames: 1 s, 60 frames: 2 s)
NUM_LARGE_CROPS - number of large feature spatial crops (e.g. 2)
NUM_SMALL_CROPS - number of small feature spatial crops (e.g. 4)
NUM_SMALL_TCROPS - number of small feature temporal crops (e.g. 1)
NUM_LARGE_TCROPS - number of large feature temporal crops (e.g. 2)
NUM_LAYER - number of transformer pooling layers (0 == global average pooling (GAP), >1 is num. of transformer layers)
e.g. sbatch scripts/pretrain_stica.sh 30 1 2 4 1 2 0
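
Because the script takes seven positional arguments, it is easy to pass them in the wrong order (the small temporal crops come before the large ones). The sketch below builds the command from named values. The function `stica_sbatch_command` is a hypothetical helper, and its defaults simply mirror the example values above.

```python
def stica_sbatch_command(
    num_frames=30,
    aud_num_sec=1,
    num_large_crops=2,
    num_small_crops=4,
    num_small_tcrops=1,
    num_large_tcrops=2,
    num_layer=0,
    script="scripts/pretrain_stica.sh",
):
    """Build the sbatch command with arguments in the order the script expects."""
    positional = [
        num_frames,
        aud_num_sec,
        num_large_crops,
        num_small_crops,
        num_small_tcrops,  # small temporal crops precede large ones
        num_large_tcrops,
        num_layer,
    ]
    return " ".join(["sbatch", script] + [str(value) for value in positional])


default_command = stica_sbatch_command()
two_second_command = stica_sbatch_command(num_frames=60, aud_num_sec=2)
```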

Please modify the following in the SLURM script:

  • SBATCH directives (e.g. partition, nodes, constraint)
  • SAV_FOLDER
  • --root_dir (path of K-400 / HT100M train directory)

All experiments were run with 8 nodes (64 GPUs, volta32). Please scale batch-size and learning-rate appropriately.

Benchmarking

To evaluate pretrained models on video action recognition on the UCF-101 and HMDB-51 datasets:

Locally:

python3 eval_video.py --dataset {ucf101, hmdb51} --fold {1,2,3} --weights-path {WEIGHTS_PATH} --model ${vid_text_gdt, stica, av_gdt}

On SLURM:

bash scripts/eval.sh ${WEIGHTS_PATH} ${OUTPUT_DIR} ${CKPT_NUM} ${CLIP_LEN} ${vid_text_gdt, stica, av_gdt} ${1, 2, 3}

Modify --root_dir, --ucf101-annotation-path, and --hmdb51-annotation-path in eval_video.py.
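
Evaluating on both datasets and all three folds means six local runs. The sketch below generates those commands using only the flags shown in the local command above. The helper `eval_commands` and the example weights path are hypothetical, and the commands are built as strings rather than executed.

```python
import itertools
import shlex

DATASETS = ("ucf101", "hmdb51")
FOLDS = (1, 2, 3)
MODELS = ("vid_text_gdt", "stica", "av_gdt")


def eval_commands(weights_path, model):
    """Return one eval_video.py command line per (dataset, fold) pair."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    commands = []
    for dataset, fold in itertools.product(DATASETS, FOLDS):
        argv = [
            "python3", "eval_video.py",
            "--dataset", dataset,
            "--fold", str(fold),
            "--weights-path", weights_path,
            "--model", model,
        ]
        # shlex.join quotes arguments that contain spaces or shell metacharacters.
        commands.append(shlex.join(argv))
    return commands


commands = eval_commands("weights.pth", "av_gdt")
```

Each string can be run in a shell, or the argument lists can be passed to `subprocess.run` instead if the runs are to be launched from Python.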

License

The majority of this work is licensed under the CC-NC 4.0 International license.

Contributing

We actively welcome your pull requests. Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

Owner
Facebook Research