GLIP: Grounded Language-Image Pre-training

Updates

12/06/2021: The GLIP paper is on arXiv: https://arxiv.org/abs/2112.03857. The code and models are under internal review and will be released soon. Stay tuned!

11/23/2021: Project page built.

Introduction

This repository is the project page for GLIP and contains the instructions needed to reproduce the results presented in the paper. The paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data, improving both tasks and bootstrapping a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks.
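One way to picture the unification of detection and phrase grounding is to treat the detection label set as a text prompt, so that each class name becomes a phrase to be grounded. The following is a minimal sketch under that assumption, not the released GLIP code: the function name `build_grounding_prompt`, the `". "` separator, and the example class names are all hypothetical choices made for illustration.

```python
def build_grounding_prompt(class_names, separator=". "):
    """Join class names into one prompt and record each name's character span.

    Hypothetical helper for illustration only; it is not part of GLIP.
    """
    spans = {}
    cursor = 0
    for name in class_names:
        # Each class name occupies [cursor, cursor + len(name)) in the prompt.
        spans[name] = (cursor, cursor + len(name))
        cursor += len(name) + len(separator)
    prompt = separator.join(class_names)
    return prompt, spans


# Example class names are assumptions, chosen only to show the mapping.
prompt, spans = build_grounding_prompt(["person", "bicycle", "hair drier"])
for name, (start, end) in spans.items():
    print(name, "->", repr(prompt[start:end]))
```

With the spans in hand, a box labeled with a class can be expressed as a box aligned to a phrase in the prompt, which is the same form of supervision that grounding data provides.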

When directly evaluated on COCO and LVIS (without seeing any COCO images during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA.
When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals a fully supervised Dynamic Head.

Supervised baselines on COCO object detection (AP): Faster R-CNN with ResNet50 (40.2) or ResNet101 (42.0) from Detectron2, and DyHead with Swin-Tiny (49.7).
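To make the comparison concrete, the short sketch below computes how far the zero-shot COCO result quoted above (49.8 AP) sits above each listed supervised baseline. It only restates numbers from this page; the variable names are illustrative.

```python
# AP values copied from the text above; nothing here is newly measured.
glip_zero_shot_coco_ap = 49.8
supervised_baselines = {
    "Faster R-CNN, ResNet50": 40.2,
    "Faster R-CNN, ResNet101": 42.0,
    "DyHead, Swin-Tiny": 49.7,
}

# Round to one decimal place to avoid floating-point noise in the differences.
margins = {
    name: round(glip_zero_shot_coco_ap - ap, 1)
    for name, ap in supervised_baselines.items()
}

for name, margin in margins.items():
    print(f"{name}: +{margin} AP")
```

The margin over the two Faster R-CNN baselines is large, while the margin over DyHead with Swin-Tiny is only 0.1 AP.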

Citations

Please consider citing this paper if you use the code:

@inproceedings{harold_GLIP2021,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2021},
      booktitle={arXiv preprint arXiv:2112.03857},
}

Owner

Microsoft
