MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

This repository contains code and links to pre-trained models for MDETR (Modulated DETR). It covers pre-training on data that has aligned text and images with box annotations, as well as fine-tuning on tasks that require fine-grained understanding of image and text.

We show big gains on the phrase grounding task (Flickr30k), Referring Expression Comprehension (RefCOCO, RefCOCO+ and RefCOCOg) as well as Referring Expression Segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).

TL;DR. We depart from the fixed frozen object detector approach of several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we only detect objects that are relevant to the given text query, where the class labels for the objects are just the relevant words in the text query. This allows us to expand our vocabulary to anything found in free-form text, making it possible to detect and reason over novel combinations of object classes and attributes.
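
To make the idea of "class labels are just the relevant words in the text query" concrete, here is a minimal sketch of how a box annotation could be turned into a target over caption tokens. It assumes whitespace tokenization and a uniform target over the tokens a box refers to. The helpers `tokenize_with_spans` and `token_target` are hypothetical names used for illustration only; they are not part of this codebase, which relies on a real subword tokenizer.

```python
from typing import List, Tuple


def tokenize_with_spans(text: str) -> List[Tuple[str, int, int]]:
    """Split on whitespace and record each token's [start, end) character span."""
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end
    return tokens


def token_target(text: str, char_spans: List[Tuple[int, int]]) -> List[float]:
    """Uniform target over the tokens that overlap any of the box's character spans."""
    tokens = tokenize_with_spans(text)
    hits = [
        any(start < span_end and span_start < end for span_start, span_end in char_spans)
        for _, start, end in tokens
    ]
    n_hits = sum(hits)
    if n_hits == 0:
        return [0.0] * len(tokens)
    return [1.0 / n_hits if hit else 0.0 for hit in hits]


caption = "a pink elephant next to a blue ball"
# Hypothetical annotation: one box refers to the phrase "pink elephant".
phrase = "pink elephant"
start = caption.index(phrase)
target = token_target(caption, [(start, start + len(phrase))])
print(target)  # [0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Because the target is defined over positions in the query rather than over a fixed class list, any phrase that appears in the text can act as a label.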

For details, please see the paper: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.

Aishwarya Kamath and Nicolas Carion made equal contributions to this codebase.

Usage

The requirements file has all the dependencies that are needed by MDETR.

We provide instructions for installing the dependencies via conda. First, clone the repository locally:

git clone https://github.com/ashkamath/mdetr.git

Make a new conda env and activate it:

conda create -n mdetr_env python=3.8
conda activate mdetr_env

Install the packages listed in requirements.txt:

pip install -r requirements.txt

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

Pre-training

The links to data, steps for data preparation and script for running finetuning can be found in Pretraining Instructions. We also provide the pre-trained model weights for MDETR trained on our combined aligned dataset of 1.3 million images paired with text.

The models are summarized in the following table. Note that the performance reported is "raw", without any fine-tuning. For each dataset, we report the class-agnostic box AP@0.5, which measures how well the model finds the boxes mentioned in the text. All performances are reported on the respective validation sets of each dataset.

| # | Backbone | GQA AP | Flickr AP | Flickr R@1 | Refcoco AP | Refcoco R@1 | Refcoco+ R@1 | Refcocog R@1 | Url | Size |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | R101 | 58.9 | 75.6 | 82.5 | 60.3 | 72.1 | 58.0 | 55.7 | model | 3GB |
| 2 | ENB3 | 59.5 | 76.6 | 82.9 | 57.6 | 70.2 | 56.7 | 53.8 | model | 2.4GB |
| 3 | ENB5 | 59.9 | 76.4 | 83.7 | 61.8 | 73.4 | 58.8 | 57.1 | model | 2.7GB |

Downstream tasks

Phrase grounding on Flickr30k

Instructions for data preparation and script to run evaluation can be found at Flickr30k Instructions

AnyBox protocol

| Backbone | Pre-training Image Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 | url | size |
|---|---|---|---|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 82.5 | 92.9 | 94.9 | 83.4 | 93.5 | 95.3 | model | 3GB |
| EfficientNet-B3 | COCO+VG+Flickr | 82.9 | 93.2 | 95.2 | 84.0 | 93.8 | 95.6 | model | 2.4GB |
| EfficientNet-B5 | COCO+VG+Flickr | 83.6 | 93.4 | 95.1 | 84.3 | 93.9 | 95.8 | model | 2.7GB |

MergedBox protocol

| Backbone | Pre-training Image Data | Val R@1 | Val R@5 | Val R@10 | Test R@1 | Test R@5 | Test R@10 | url | size |
|---|---|---|---|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 82.3 | 91.8 | 93.7 | 83.8 | 92.7 | 94.4 | model | 3GB |
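
As a rough illustration of how recall-at-k numbers of this kind can be computed, the sketch below scores ranked box predictions per phrase. It rests on assumptions that should be checked against the paper and the evaluation script: a prediction counts as a hit when its IoU with a ground-truth box is at least 0.5, the AnyBox protocol accepts a match with any ground-truth box of the phrase, and the MergedBox protocol first merges the ground-truth boxes into their smallest enclosing box. The function names are hypothetical and do not come from this repository.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(boxes: List[Box]) -> Box:
    """Smallest box enclosing all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def recall_at_k(
    ranked_preds: List[List[Box]],
    gt_boxes: List[List[Box]],
    k: int,
    merged: bool = False,
    iou_thresh: float = 0.5,
) -> float:
    """Fraction of phrases for which one of the top-k predictions is a hit."""
    hits = 0
    for preds, gts in zip(ranked_preds, gt_boxes):
        candidates = [merge_boxes(gts)] if merged else gts
        if any(box_iou(p, g) >= iou_thresh for p in preds[:k] for g in candidates):
            hits += 1
    return hits / len(gt_boxes) if gt_boxes else 0.0


# One phrase (e.g. "two dogs") with two ground-truth boxes and two ranked predictions.
ranked_preds = [[(0, 0, 10, 10), (0, 0, 30, 10)]]
gt_boxes = [[(0, 0, 10, 10), (20, 0, 30, 10)]]

print(recall_at_k(ranked_preds, gt_boxes, k=1))               # 1.0
print(recall_at_k(ranked_preds, gt_boxes, k=1, merged=True))  # 0.0
print(recall_at_k(ranked_preds, gt_boxes, k=2, merged=True))  # 1.0
```

The toy example shows why the two protocols can disagree: a box around a single dog matches one ground-truth box exactly, but covers only a third of the merged box.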

Referring expression comprehension on RefCOCO, RefCOCO+, RefCOCOg

Instructions for data preparation and script to run finetuning and evaluation can be found at Referring Expression Instructions

RefCOCO

| Backbone | Pre-training Image Data | Val | TestA | TestB | url | size |
|---|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 86.75 | 89.58 | 81.41 | model | 3GB |
| EfficientNet-B3 | COCO+VG+Flickr | 87.51 | 90.40 | 82.67 | model | 2.4GB |

RefCOCO+

| Backbone | Pre-training Image Data | Val | TestA | TestB | url | size |
|---|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 79.52 | 84.09 | 70.62 | model | 3GB |
| EfficientNet-B3 | COCO+VG+Flickr | 81.13 | 85.52 | 72.96 | model | 2.4GB |

RefCOCOg

| Backbone | Pre-training Image Data | Val | Test | url | size |
|---|---|---|---|---|---|
| Resnet-101 | COCO+VG+Flickr | 81.64 | 80.89 | model | 3GB |
| EfficientNet-B3 | COCO+VG+Flickr | 83.35 | 83.31 | model | 2.4GB |

Referring expression segmentation on PhraseCut

Instructions for data preparation and script to run finetuning and evaluation can be found at PhraseCut Instructions

| Backbone | M-IoU | Precision @0.5 | Precision @0.7 | Precision @0.9 | url | size |
|---|---|---|---|---|---|---|
| Resnet-101 | 53.1 | 56.1 | 38.9 | 11.9 | model | 1.5GB |
| EfficientNet-B3 | 53.7 | 57.5 | 39.9 | 11.9 | model | 1.2GB |
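
For readers unfamiliar with these segmentation metrics, the following is a minimal sketch of how a mean IoU and precision at several IoU thresholds can be computed from binary masks. It assumes one predicted mask per expression, masks given as nested lists of 0/1, and a sample counting as correct when its IoU is at least the threshold. The official PhraseCut evaluation may differ in details, and `mask_iou` and `segmentation_metrics` are hypothetical names, not functions from this repository.

```python
from typing import Dict, List, Sequence

Mask = Sequence[Sequence[int]]  # binary mask as rows of 0/1 values


def mask_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of the same shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0


def segmentation_metrics(
    preds: List[Mask],
    gts: List[Mask],
    thresholds: Sequence[float] = (0.5, 0.7, 0.9),
) -> Dict[str, float]:
    """Mean IoU plus the fraction of samples whose IoU reaches each threshold."""
    ious = [mask_iou(p, g) for p, g in zip(preds, gts)]
    if not ious:
        raise ValueError("need at least one prediction/ground-truth pair")
    metrics = {"mean_iou": sum(ious) / len(ious)}
    for t in thresholds:
        metrics[f"precision@{t}"] = sum(iou >= t for iou in ious) / len(ious)
    return metrics


# Two toy samples: a perfect prediction and one that covers a quarter of the target.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
metrics = segmentation_metrics(preds, gts)
print(metrics["mean_iou"])       # 0.625
print(metrics["precision@0.5"])  # 0.5
```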

Visual question answering on GQA

Instructions for data preparation and scripts to run finetuning and evaluation can be found at GQA Instructions

| Backbone | Test-dev | Test-std | url | size |
|---|---|---|---|---|
| Resnet-101 | 62.48 | 61.99 | model | 3GB |
| EfficientNet-B5 | 62.95 | 62.45 | model | 2.7GB |

Long-tailed few-shot object detection

Instructions for data preparation and scripts to run finetuning and evaluation can be found at LVIS Instructions

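| Data | AP | AP50 | APr | APc | APf | url | size |
|---|---|---|---|---|---|---|---|
| 1% | 16.7 | 25.8 | 11.2 | 14.6 | 19.5 | model | 3GB |
| 10% | 24.2 | 38.0 | 20.9 | 24.9 | 24.3 | model | 3GB |
| 100% | 22.5 | 35.2 | 7.4 | 22.7 | 25.0 | model | 3GB |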
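
The rare, common and frequent AP columns split categories by how often they occur in the training set. The sketch below illustrates that split under the LVIS convention that rare categories appear in at most 10 training images, common ones in 11 to 100, and frequent ones in more than 100. It then averages per-category AP within each bucket. The per-category AP values and image counts are made-up illustration data, and `frequency_bucket` and `bucketed_ap` are hypothetical helpers rather than part of the LVIS API or this repository.

```python
from typing import Dict, List


def frequency_bucket(num_train_images: int) -> str:
    """Map a category's training-image count to 'r', 'c' or 'f'."""
    if num_train_images <= 10:
        return "r"
    if num_train_images <= 100:
        return "c"
    return "f"


def bucketed_ap(per_class_ap: Dict[str, float], image_counts: Dict[str, int]) -> Dict[str, float]:
    """Average per-category AP separately over rare, common and frequent categories."""
    buckets: Dict[str, List[float]] = {"r": [], "c": [], "f": []}
    for name, ap in per_class_ap.items():
        buckets[frequency_bucket(image_counts[name])].append(ap)
    return {
        f"AP{key}": sum(values) / len(values)
        for key, values in buckets.items()
        if values
    }


# Made-up numbers, for illustration only.
per_class_ap = {"unicycle": 0.25, "kite": 0.5, "cup": 0.5, "car": 0.75}
image_counts = {"unicycle": 5, "kite": 50, "cup": 500, "car": 2000}
result = bucketed_ap(per_class_ap, image_counts)
print(result)  # {'APr': 0.25, 'APc': 0.5, 'APf': 0.625}
```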

Synthetic datasets

Instructions to reproduce our results on CLEVR-based datasets are available at CLEVR instructions

| Overall Accuracy | Count | Exist | Compare Number | Query Attribute | Compare Attribute | Url | Size |
|---|---|---|---|---|---|---|---|
| 99.7 | 99.3 | 99.9 | 99.4 | 99.9 | 99.9 | model | 446MB |

License

MDETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Citation

If you find this repository useful please give it a star and cite as follows! :) :

    @article{kamath2021mdetr,
      title={MDETR--Modulated Detection for End-to-End Multi-Modal Understanding},
      author={Kamath, Aishwarya and Singh, Mannat and LeCun, Yann and Misra, Ishan and Synnaeve, Gabriel and Carion, Nicolas},
      journal={arXiv preprint arXiv:2104.12763},
      year={2021}
    }