Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".

Last update: Jan 04, 2023

Related tags

Overview

Detecting Twenty-thousand Classes using Image-level Supervision

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
arXiv technical report (arXiv 2201.02605)

Features

Detects any class given class names (using CLIP).
We train the detector on ImageNet-21K dataset with 21K classes.
Cross-dataset generalization to OpenImages and Objects365 without finetuning.
State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.
Works for DETR-style detectors.

Installation

See installation instructions.

Demo

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the web demo:

Run our demo using Colab (no GPU needed):

We use the default detectron2 demo interface. For example, to run our 21K model on a messy desk image (image credit David Fouhey) with the lvis vocabulary, run

mkdir models
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth -O models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
wget https://web.eecs.umich.edu/~fouhey/fun/desk/desk.jpg
python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out.jpg --vocabulary lvis --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

If setup correctly, the output should look like:

The same model can run with other vocabularies (COCO, OpenImages, or Objects365), or a custom vocabulary. For example:

python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out2.jpg --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffe --confidence-threshold 0.3 --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

The output should look like:

Note that headphone, paper and coffe (typo intended) are not LVIS classes. Despite the misspelled class name, our detector can produce a reasonable detection for coffe.

Benchmark evaluation and training

Please first prepare datasets, then check our MODEL ZOO to reproduce results in our paper. We highlight key results below:

Open-vocabulary LVIS

mask mAP mask mAP_novel

Box-Supervised 30.2 16.4

Detic 32.4 24.9

	mask mAP	mask mAP_novel
Box-Supervised	30.2	16.4
Detic	32.4	24.9

Standard LVIS

	Detector/ Backbone	mask mAP	mask mAP_rare
Box-Supervised	CenterNet2-ResNet50	31.5	25.6
Detic	CenterNet2-ResNet50	33.2	29.7
Box-Supervised	CenterNet2-SwinB	40.7	35.9
Detic	CenterNet2-SwinB	41.7	41.7

	Detector/ Backbone	box mAP	box mAP_rare
Box-Supervised	DeformableDETR-ResNet50	31.7	21.4
Detic	DeformableDETR-ResNet50	32.5	26.2

Cross-dataset generalization

Backbone Objects365 box mAP OpenImages box mAP50

Box-Supervised SwinB 19.1 46.2

Detic SwinB 21.4 55.2

	Backbone	Objects365 box mAP	OpenImages box mAP50
Box-Supervised	SwinB	19.1	46.2
Detic	SwinB	21.4	55.2

License

The majority of Detic is licensed under the Apache 2.0 license, however portions of the project are available under separate license terms: SWIN-Transformer, CLIP, and TensorFlow Object Detection API are licensed under the MIT license; UniDet is licensed under the Apache 2.0 license; and the LVIS API is licensed under a custom license (https://github.com/lvis-dataset/lvis-api/blob/master/LICENSE)” If you later add other third party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0

Ethical Considerations

Detic's wide range of detection capabilities may introduce similar challenges to many other visual recognition and open-set recognition methods. As the user can define arbitrary detection classes, class design and semantics may impact the model output.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{zhou2021detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={arXiv preprint arXiv:2201.02605},
  year={2021}
}

Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".

Related tags

Overview

Detecting Twenty-thousand Classes using Image-level Supervision

Features

Installation

Demo

Benchmark evaluation and training

License

Ethical Considerations

Citation

Owner

Meta Research

Code I use to automatically update my videos' metadata on YouTube

Rest API Written In Python To Classify NSFW Images.

PyTorch implementation of SQN based on CloserLook3D's encoder

Show Me the Whole World: Towards Entire Item Space Exploration for Interactive Personalized Recommendations

A full-fledged version of Pix2Seq

MIM: MIM Installs OpenMMLab Packages

Global Pooling, More than Meets the Eye: Position Information is Encoded Channel-Wise in CNNs, ICCV 2021

Noether Networks: meta-learning useful conserved quantities

Accuracy Aligned. Concise Implementation of Swin Transformer

Source code for "MusCaps: Generating Captions for Music Audio" (IJCNN 2021)

A Fast and Stable GAN for Small and High Resolution Imagesets - pytorch

Offcial repository for the IEEE ICRA 2021 paper Auto-Tuned Sim-to-Real Transfer.

Unofficial implementation of MLP-Mixer: An all-MLP Architecture for Vision

The official implementation of CircleNet: Anchor-free Detection with Circle Representation, MICCAI 2030

A PyTorch implementation of EfficientNet and EfficientNetV2 (coming soon!)

Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning, CVPR 2021

Yolo algorithm for detection + centroid tracker to track vehicles

CARLA: A Python Library to Benchmark Algorithmic Recourse and Counterfactual Explanation Algorithms

This is a deep learning-based method to segment deep brain structures and a brain mask from T1 weighted MRI.