Code repository for our paper "Learning to Generate Scene Graph from Natural Language Supervision" in ICCV 2021

Last update: Dec 24, 2022

Related tags

Overview

Scene Graph Generation from Natural Language Supervision

This repository includes the Pytorch code for our paper "Learning to Generate Scene Graph from Natural Language Supervision" accepted in ICCV 2021.

Top (our setting): Our goal is learning to generate localized scene graphs from image-text pairs. Once trained, our model takes an image and its detected objects as inputs and outputs the image scene graph. Bottom (our results): A comparison of results from our method and state-of-the-art (SOTA) with varying levels of supervision.

Overview
Qualitative Results
Installation
Data
Metrics
Pretrained Object Detector
Pretrained Scene Graph Generation Models
Model Training
Model Evaluation
Acknowledgement
Reference

Overview

Learning from image-text data has demonstrated recent success for many recognition tasks, yet is currently limited to visual features or individual visual concepts such as objects. In this paper, we propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as scene graph. To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph. Further, we design a Transformer-based model to predict these "pseudo" labels via a masked token prediction task. Learning from only image-sentence pairs, our model achieves 30% relative gain over a latest method trained with human-annotated unlocalized scene graphs. Our model also shows strong results for weakly and fully supervised scene graph generation. In addition, we explore an open-vocabulary setting for detecting scene graphs, and present the first result for open-set scene graph generation.

Qualitative Results

Our generated scene graphs learned from image descriptions

Partial visualization of Figure 3 in our paper: Our model trained by image-sentence pairs produces scene graphs with a high quality (e.g. "man-on-motorcycle" and "man-wearing-helmet" in first example). More comparison with other models trained by stronger supervision (e.g. unlocalized scene graph labels, localized scene graph labels) can be viewed in the Figure 3 of paper.

Our generated scene graphs in open-set and closed-set settings

Figure 4 in our paper: We explored open-set setting where the categories of target concepts (objects and predicates) are unknown during training. Compared to our closed-set model, our open-set model detects more concepts outside the evaluation dataset, Visual Genome (e.g. "swinge", "mouse", "keyboard"). Our results suggest an exciting avenue of large-scale training of open-set scene graph generation using image captioning dataset such as Conceptual Caption.

Installation

Check INSTALL.md for installation instructions.

Data

Check DATASET.md for instructions of data downloading.

Metrics

Explanation of metrics in this toolkit are given in METRICS.md

Pretrained Object Detector

In this project, we primarily use the detector Faster RCNN pretrained on Open Images dataset. To use this repo, you don't need to run this detector. You can directly download the extracted detection features, as the instruction in DATASET.md. If you're interested in this detector, the pretrained model can be found in TensorFlow 1 Detection Model Zoo: faster_rcnn_inception_resnet_v2_atrous_oidv4.

Additionally, to compare with previous fully supervised models, we also use the detector pretrained by Scene-Graph-Benchmark. You can download this Faster R-CNN model and extract all the files to the directory checkpoints/pretrained_faster_rcnn.

Pretrained Scene Graph Generation Models

Our pretrained SGG models can be downloaded on Google Drive. The details of these models can be found in Model Training section below. After downloading, please put all the folders to the directory checkpoints/. More pretrained models will be released. Stay tuned!

Model Training

To train our scene graph generation models, run the script

bash train.sh MODEL_TYPE

where MODEL_TYPE specifies the training supervision, the training dataset and the scene graph generation model. See details below.

Language supervised models: trained by image-text pairs
- Language_CC-COCO_Uniter: train our Transformer-based model on Conceptual Caption (CC) and COCO Caption (COCO) datasets
- Language_*_Uniter: train our Transformer-based model on single dataset. * represents the dataset name and can be CC, COCO, and VG
- Language_OpensetCOCO_Uniter: train our Transformer-based model on COCO dataset in open-set setting
- Language_CC-COCO_MotifNet: train Motif-Net model with language supervision from CC and COCO datasets
Weakly supervised models: trained by unlocalized scene graph labels
- Weakly_Uniter: train our Transformer-based model
Fully supervised models: trained by localized scene graph labels
- Sup_Uniter: train our Transformer-based model
- Sup_OnlineDetector_Uniter: train our Transformer-based model by using the object detector from Scene-Graph-Benchmark.

You can set CUDA_VISIBLE_DEVICES in train.sh to specify which GPUs are used for model training (e.g., the default script uses 2 GPUs).

Model Evaluation

To evaluate the trained scene graph generation model, you can reuse the commands in train.sh by simply changing WSVL.SKIP_TRAIN to True and setting OUTPUT_DIR as the path to your trained model. One example can be found in test.sh and just run bash test.sh.

Acknowledgement

This repository was built based on Scene-Graph-Benchmark for scene graph generation and UNITER for image-text representation learning.

We specially would like to thank Pengchuan Zhang for providing the object detector pretrained on Objects365 dataset which was used in our ablation study.

Reference

If you are using our code, please consider citing our paper.

@inproceedings{zhong2021SGGfromNLS,
  title={Learning to Generate Scene Graph from Natural Language Supervision},
  author={Zhong, Yiwu and Shi, Jing and Yang, Jianwei and Xu, Chenliang and Li, Yin},
  booktitle={ICCV},
  year={2021}
}

Code repository for our paper "Learning to Generate Scene Graph from Natural Language Supervision" in ICCV 2021

Related tags

Overview

Scene Graph Generation from Natural Language Supervision

Contents

Overview

Qualitative Results

Our generated scene graphs learned from image descriptions

Our generated scene graphs in open-set and closed-set settings

Installation

Data

Metrics

Pretrained Object Detector

Pretrained Scene Graph Generation Models

Model Training

Model Evaluation

Acknowledgement

Reference

Owner

Yiwu Zhong

ML models and internal tensors 3D visualizer

Web-interface + rest API for classification and regression (https://jeff1evesque.github.io/machine-learning.docs)

NeuralForecast is a Python library for time series forecasting with deep learning models

Implementation of the paper "Language-agnostic representation learning of source code from structure and context".

Public repository of the 3DV 2021 paper "Generative Zero-Shot Learning for Semantic Segmentation of 3D Point Clouds"

Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

A simple python module to generate anchor (aka default/prior) boxes for object detection tasks.

Official implementation of "Learning Not to Reconstruct" (BMVC 2021)

3D Human Pose Machines with Self-supervised Learning

This GitHub repo consists of Code and Some results of project- Diabetes Treatment using Gold nanoparticles. These Consist of ML Models used for prediction Diabetes and further the basic theory and working of Gold nanoparticles.

This repo includes our code for evaluating and improving transferability in domain generalization (NeurIPS 2021)

[NeurIPS-2021] Slow Learning and Fast Inference: Efficient Graph Similarity Computation via Knowledge Distillation

A curated list of awesome resources combining Transformers with Neural Architecture Search

scikit-learn: machine learning in Python

Contour-guided image completion with perceptual grouping (BMVC 2021 publication)

Seeing if I can put together an interactive version of 3b1b's Manim in Streamlit

Code for the paper "Improved Techniques for Training GANs"

Official implementation of Unfolded Deep Kernel Estimation for Blind Image Super-resolution.

Testbed of AI Systems Quality Management

OpenMatch: Open-set Consistency Regularization for Semi-supervised Learning with Outliers (NeurIPS 2021)