Code release for "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound"

Last update: Dec 11, 2022

merlot_reserve

MERLOT Reserve (in submission) is a model for learning joint representations of vision, language, and sound from YouTube. The learned model can be used in a zero-shot or finetuned setting, where it does well on tasks like VCR and TVQA.

Visit our project page at rowanzellers.com/merlotreserve or read the full paper to learn more.

What's here

We are releasing the following:

JAX code and model checkpoints for the MERLOT Reserve model
Code for pretraining the model
Code for finetuning the model on VCR and TVQA
Code for doing zero-shot inference with the model

Environment and setup

There are three different ways to run MERLOT Reserve:

Pretraining on videos: You'll need a TPU Pod VM for this. This step shouldn't be necessary for most people, as we have released model checkpoints.
Finetuning on VCR or TVQA: I've done this on a TPU v3-8 VM. It should also be possible on GPU(s), but I haven't tested that.
Zero-shot inference: I've run this on a GPU (even an older Titan X from 2016 works).

Installation on a GPU Machine

Install CUDA 11.4 and cuDNN 8.2. You might have to add something like this to your shell environment:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
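
If it is unclear whether the CUDA libraries are already visible, a small check like the one below may help. This is only a sketch using the Python standard library: the function names `cuda_lib_on_path` and `export_line` are hypothetical and not part of this repo, and `/usr/local/cuda/lib64` is assumed to be the install location, as in the line above.

```python
import os


def cuda_lib_on_path(lib_dir="/usr/local/cuda/lib64", env=None):
    """Return True if lib_dir is one of the entries in LD_LIBRARY_PATH."""
    env = os.environ if env is None else env
    raw = env.get("LD_LIBRARY_PATH", "")
    # Ignore empty entries and trailing slashes when comparing directories.
    entries = [p.rstrip("/") for p in raw.split(":") if p]
    return lib_dir.rstrip("/") in entries


def export_line(lib_dir="/usr/local/cuda/lib64", env=None):
    """Build an export line that prepends lib_dir and keeps existing entries."""
    env = os.environ if env is None else env
    current = env.get("LD_LIBRARY_PATH", "")
    value = f"{lib_dir}:{current}" if current else lib_dir
    return f"export LD_LIBRARY_PATH={value}"


if __name__ == "__main__":
    if cuda_lib_on_path():
        print("CUDA library directory is already on LD_LIBRARY_PATH")
    else:
        print("Consider adding this to your shell profile:")
        print(export_line())
```

The helper only inspects the environment variable; it does not verify that the CUDA or cuDNN files actually exist in that directory.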

Create the environment:

conda create --name mreserve python=3.8 && conda activate mreserve
conda install -y python=3.8 tqdm numpy pyyaml scipy ipython cython typing h5py pandas matplotlib

# Install jax
pip install "jax[cuda11_cudnn82]" -f https://storage.googleapis.com/jax-releases/jax_releases.html
# If doing this on TPUs instead of locally...
# pip install "jax[tpu]>=0.2.18" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html

# This is sometimes needed; see https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp
pip uninstall -y numpy
pip install numpy==1.19.5

pip install -r requirements.txt

You can then try out the interactive script at demo/demo_video.py. It will handle downloading the model checkpoint for you.
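
Before launching the demo, it can be useful to confirm that the installation steps above took effect. The sketch below is not part of the repo; `check_environment` is a hypothetical helper. It relies only on `importlib`, `sys`, and the public `jax.devices()` call, and it degrades gracefully if JAX or NumPy is missing.

```python
import importlib.util
import sys


def check_environment():
    """Collect basic facts about the current Python environment as a dict."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "jax_installed": importlib.util.find_spec("jax") is not None,
        "numpy_version": None,
        "devices": [],
    }
    if importlib.util.find_spec("numpy") is not None:
        import numpy

        report["numpy_version"] = numpy.__version__
    if report["jax_installed"]:
        try:
            import jax

            # jax.devices() lists the accelerators (or the CPU) that JAX can see.
            report["devices"] = [str(d) for d in jax.devices()]
        except Exception as err:  # e.g. a CUDA/cuDNN mismatch at backend start-up
            report["devices"] = [f"error: {err}"]
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

On a working GPU install, the `devices` entry would be expected to list a GPU device rather than only the CPU; the exact strings depend on the JAX version.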

Installation on a Cloud TPU VM

See the instructions in pretrain/ to set up your environment on a TPU v3-8 VM.

Checkpoints

These should be downloaded automatically if you use PretrainedMerlotReserve in mreserve/modeling.py. All are Flax checkpoint files:

# pretrained checkpoints
gs://merlotreserve/ckpts/base
gs://merlotreserve/ckpts/base_resadapt
gs://merlotreserve/ckpts/large
gs://merlotreserve/ckpts/large_resadapt

# finetuned checkpoints
gs://merlotreserve/vcr_ckpts/vcr_finetune_base
gs://merlotreserve/vcr_ckpts/vcr_finetune_large

gs://merlotreserve/tvqa_ckpts/tvqa_finetune_base
gs://merlotreserve/tvqa_ckpts/tvqa_finetune_large

# TVQA Data
gs://merlotreserve/finetune_data/tvqa/

# VCR data
gs://merlotreserve/finetune_data/vcr/
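
The checkpoint paths above follow a regular naming pattern, which the sketch below spells out. The helper `checkpoint_uri` and its parameters are hypothetical and are not part of the repo; it only reproduces the strings in the list above and does not download anything.

```python
# Path templates copied from the checkpoint list above.
PRETRAINED = "gs://merlotreserve/ckpts/{name}"
FINETUNED = "gs://merlotreserve/{task}_ckpts/{task}_finetune_{size}"

SIZES = ("base", "large")
TASKS = ("vcr", "tvqa")


def checkpoint_uri(size, task=None, resadapt=False):
    """Return the gs:// path for a pretrained or finetuned checkpoint."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    if task is None:
        # Pretrained checkpoints come with and without the _resadapt suffix.
        name = size + ("_resadapt" if resadapt else "")
        return PRETRAINED.format(name=name)
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    if resadapt:
        # The list above has no resadapt variants of the finetuned checkpoints.
        raise ValueError("no resadapt finetuned checkpoints are listed")
    return FINETUNED.format(task=task, size=size)


if __name__ == "__main__":
    print(checkpoint_uri("base"))
    print(checkpoint_uri("large", resadapt=True))
    print(checkpoint_uri("large", task="tvqa"))
```

The returned string could then be passed to whichever Google Cloud Storage tool you already use to copy the files.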

Owner

Rowan Zellers
