GLIP: Grounded Language-Image Pre-training

Updates

12/06/2021: The GLIP paper is on arXiv: https://arxiv.org/abs/2112.03857. The code and models are under internal review and will be released soon. Stay tuned!

11/23/2021: Project page built.

Introduction

This repository is the project page for GLIP, containing the instructions needed to reproduce the results presented in the paper. The paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data, which improves both tasks and bootstraps a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representations semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks.
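
To make the unification of detection and grounding more concrete, here is a minimal sketch of the idea as the paper describes it: the detection class names are joined into a single text prompt, and each candidate region is scored against each phrase by a dot product between their features. This is not the released GLIP code. The function names `build_detection_prompt` and `alignment_scores` are hypothetical, and the feature vectors are toy values standing in for the outputs of the image and text encoders.

```python
from typing import Dict, List, Tuple


def build_detection_prompt(class_names: List[str]) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Join class names into one text prompt and record each name's character span.

    Hypothetical helper: it mimics treating a detection label set as a
    grounding query such as "person. bicycle. car".
    """
    spans: Dict[str, Tuple[int, int]] = {}
    cursor = 0
    for name in class_names:
        spans[name] = (cursor, cursor + len(name))
        cursor += len(name) + 2  # skip over the ". " separator
    prompt = ". ".join(class_names)
    return prompt, spans


def alignment_scores(
    region_feats: List[List[float]], phrase_feats: List[List[float]]
) -> List[List[float]]:
    """Return a regions-by-phrases matrix of dot-product alignment scores."""
    return [
        [sum(r * p for r, p in zip(region, phrase)) for phrase in phrase_feats]
        for region in region_feats
    ]


if __name__ == "__main__":
    prompt, spans = build_detection_prompt(["person", "bicycle", "car"])
    print(prompt)
    print(spans)

    # Toy 2-d features (assumed values, not real encoder outputs).
    regions = [[1.0, 0.0], [0.0, 2.0]]
    phrases = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(alignment_scores(regions, phrases))
```

In this sketch, a region is assigned to whichever phrase scores highest, so the same scoring step serves both a fixed detection vocabulary and free-form grounding text.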

When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing prior SoTA.
When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals a fully supervised Dynamic Head.

Supervised baselines on COCO object detection: Faster-RCNN w/ ResNet50 (40.2) or ResNet101 (42.0) from Detectron2, and DyHead w/ Swin-Tiny (49.7).

Citations

Please consider citing this paper if you use the code:

@inproceedings{harold_GLIP2021,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      year={2021},
      booktitle={arXiv preprint arXiv:2112.03857},
}

Owner

Microsoft
