A weakly-supervised scene graph generation codebase: the implementation of our CVPR2021 paper "Linguistic Structures as Weak Supervision for Visual Scene Graph Generation".


WSSGG

0 Overview

Our model uses the image's paired caption as weak supervision to learn the entities in the image and the relations among them. At inference time, it generates scene graphs without help from text. To learn our model, we first allow context information to propagate on the text graph to enrich the entity word embeddings (Sec. 3.1); we found this enrichment provides better localization of the visual objects. Then, we optimize a text-query-guided attention model (Sec. 3.2) to provide image-level entity predictions and associate the text entities with the visual regions that best describe them. We use the joint probability to choose boxes associated with both subject and object (Sec. 3.3), then use the top-scoring boxes to learn better grounding (Sec. 3.4). Finally, we use an RNN (Sec. 3.5) to capture vision-language common sense and refine our predictions.
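
To make the joint scoring in Sec. 3.3 concrete, here is a minimal numpy sketch (the variable names and toy scores are ours, not the repo's API): given per-proposal attention scores for a subject and an object, the outer product scores every (subject box, object box) pair, and the top-scoring pair serves as the pseudo ground-truth for grounding (Sec. 3.4).

```python
import numpy as np

# Toy per-proposal attention scores for one subject-object pair,
# e.g. produced by the text-query-guided attention model (Sec. 3.2).
subj_scores = np.array([0.1, 0.7, 0.2])  # P(box | subject), 3 proposals
obj_scores = np.array([0.6, 0.1, 0.3])   # P(box | object)

# Joint probability over (subject box, object box) pairs (Sec. 3.3).
joint = np.outer(subj_scores, obj_scores)

# Top-scoring pair, used to learn better grounding (Sec. 3.4).
subj_box, obj_box = np.unravel_index(joint.argmax(), joint.shape)
print(subj_box, obj_box)  # -> 1 0
```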

1 Installation

git clone "https://github.com/yekeren/WSSGG.git" && cd "WSSGG"

We use Tensorflow 1.5 and Python 3.6.4. Before continuing, please ensure that at least the correct Python version is installed (a quick sanity check follows the commands below). requirements.txt lists the Python packages we installed; simply run pip install -r requirements.txt to install them after setting up Python. Next, run protoc protos/*.proto --python_out=. to compile the required protobuf files, which are used for storing configurations.

pip install -r requirements.txt
protoc protos/*.proto --python_out=.
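
Since the code targets TF 1.x, a quick environment sanity check (ours, not part of the repo) may save trouble later:

```bash
python --version                                            # expect Python 3.6.x
python -c "import tensorflow as tf; print(tf.__version__)"  # expect 1.x, e.g. 1.5
```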

1.1 Faster-RCNN

Our Faster-RCNN implementation relies on the Tensorflow object detection API. Users can run git clone "https://github.com/tensorflow/models.git" "tensorflow_models" && ln -s "tensorflow_models/research/object_detection" to set it up. Also, don't forget to use protoc to compile the protos required by the detection API.

The specific Faster-RCNN model we use is faster_rcnn_inception_resnet_v2_atrous_lowproposals_oidv2, to stay consistent with VSPNet. More information can be found in the Tensorflow object detection model zoo. A sketch for loading the downloaded frozen graph follows the commands below.

git clone "https://github.com/tensorflow/models.git" "tensorflow_models" 
ln -s "tensorflow_models/research/object_detection"
cd tensorflow_models/research/; protoc object_detection/protos/*.proto --python_out=.; cd -

mkdir -p "zoo"
wget -P "zoo" "http://download.tensorflow.org/models/object_detection/faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28.tar.gz"
tar xzvf zoo/faster_rcnn_inception_resnet_v2_atrous_lowproposals_oid_2018_01_28.tar.gz -C "zoo"
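
The tarball unpacks to a standard detection-zoo export containing a frozen_inference_graph.pb. If you want to probe it outside of our scripts, the usual TF 1.x loading pattern looks like the sketch below (the tensor names follow the detection API's export convention; nothing here is specific to this repo):

```python
import tensorflow as tf  # TF 1.x

path = ("zoo/faster_rcnn_inception_resnet_v2_atrous_"
        "lowproposals_oid_2018_01_28/frozen_inference_graph.pb")

# Load the frozen graph definition.
graph_def = tf.GraphDef()
with tf.gfile.GFile(path, "rb") as f:
    graph_def.ParseFromString(f.read())

# Import it and look up the standard detection API tensors.
with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    image_tensor = graph.get_tensor_by_name("image_tensor:0")
    boxes = graph.get_tensor_by_name("detection_boxes:0")
    scores = graph.get_tensor_by_name("detection_scores:0")
```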

1.2 Language Parser

Though we list the spacy dependency in requirements.txt, we still need to run python -m spacy download en to fetch the English model. Then, we check out the SceneGraphParser tool by running git clone "https://github.com/vacancy/SceneGraphParser.git" && ln -s "SceneGraphParser/sng_parser". A quick parser check follows the commands below.

python -m spacy download en
git clone "https://github.com/vacancy/SceneGraphParser.git"
ln -s "SceneGraphParser/sng_parser"

1.3 GloVe Embeddings

We use the pre-trained 300-D GloVe embeddings; a loading example follows the commands below.

wget -P "zoo" "http://nlp.stanford.edu/data/glove.6B.zip"
unzip "zoo/glove.6B.zip" -d "zoo"

python "dataset-tools/export_glove_words_and_embeddings.py" \
  --glove_file "zoo/glove.6B.300d.txt" \
  --output_vocabulary_file "zoo/glove_word_tokens.txt" \
  --output_vocabulary_word_embedding_file "zoo/glove_word_vectors.npy"
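
The two exported files can be read back as a token list plus an aligned embedding matrix (a small usage sketch, assuming the .npy rows align with the lines of the token file, which is what the export script is for):

```python
import numpy as np

with open("zoo/glove_word_tokens.txt") as f:
    vocab = [line.strip() for line in f]
embeddings = np.load("zoo/glove_word_vectors.npy")

# Map each token to its 300-D vector.
word2vec = dict(zip(vocab, embeddings))
print(embeddings.shape)     # expect (vocab_size, 300)
print(word2vec["dog"][:5])  # first 5 dimensions of the 'dog' vector
```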

2 Settings

To avoid the time-consuming Faster-RCNN processes in 2.1 and 2.2, users can directly download the features we provide at the following URLs. The scripts create_vg_settings.sh and create_coco_settings.sh will then check for the existence of the Faster-RCNN features and skip the extraction process if they are present. Please note that in the following table, we assume the directories holding the VG and COCO data to be vg-gt-cap and coco-cap, respectively. Example download commands follow the table.

| Name | URL | Please extract to |
| --- | --- | --- |
| VG Faster-RCNN features | https://storage.googleapis.com/weakly-supervised-scene-graphs-generation/vg_frcnn_proposals.zip | vg-gt-cap/frcnn_proposals/ |
| COCO Faster-RCNN features | https://storage.googleapis.com/weakly-supervised-scene-graphs-generation/coco_frcnn_proposals.zip | coco-cap/frcnn_proposals/ |
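
For example, the VG features can be fetched and unpacked as follows (the COCO features work the same way; depending on the archive layout, you may need to move files so they sit directly under the target directory):

```bash
mkdir -p "vg-gt-cap/frcnn_proposals"
wget "https://storage.googleapis.com/weakly-supervised-scene-graphs-generation/vg_frcnn_proposals.zip"
unzip "vg_frcnn_proposals.zip" -d "vg-gt-cap/frcnn_proposals/"
```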

2.1 VG-GT-Graph and VG-Cap-Graph

Running sh dataset-tools/create_vg_settings.sh "vg-gt-cap" will generate the VG-related files under the folder "vg-gt-cap" (for both the VG-GT-Graph and VG-Cap-Graph settings). Basically, it downloads the datasets and launches the following programs under the dataset-tools directory; a spot-check example for the generated records follows the table.

| Name | Desc. |
| --- | --- |
| create_vg_frcnn_proposals.py | Extract VG visual proposals using Faster-RCNN |
| create_vg_text_graphs.py | Extract VG text graphs using the language parser |
| create_vg_vocabulary | Gather the VG vocabulary |
| create_vg_gt_graph_tf_record.py | Generate TF record files for the VG-GT-Graph setting |
| create_vg_cap_graph_tf_record.py | Generate TF record files for the VG-Cap-Graph setting |
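
As mentioned above, the generated TF record files can be spot-checked with a TF 1.x record iterator; in the sketch below the shard path is hypothetical (substitute a file actually produced under "vg-gt-cap"), and the feature keys depend on the create_*_tf_record.py scripts:

```python
import tensorflow as tf  # TF 1.x

# Hypothetical shard path; substitute a file produced by the scripts.
path = "vg-gt-cap/some_output_shard.record"

for record in tf.python_io.tf_record_iterator(path):
    example = tf.train.Example.FromString(record)
    print(sorted(example.features.feature.keys()))  # inspect the feature keys
    break
```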

2.2 COCO-Cap-Graph

Running sh dataset-tools/create_coco_settings.sh "coco-cap" "vg-gt-cap" will generate the COCO-related files under the folder "coco-cap" (for the COCO-Cap-Graph setting). Basically, it downloads the datasets and launches the following programs under the dataset-tools directory. Please note that the "vg-gt-cap" directory should be created first, since we need it for the split information (either Zareian et al. or Xu et al.).

| Name | Desc. |
| --- | --- |
| create_coco_frcnn_proposals.py | Extract COCO visual proposals using Faster-RCNN |
| create_coco_text_graphs.py | Extract COCO text graphs using the language parser |
| create_coco_vocabulary | Gather the COCO vocabulary |
| create_coco_cap_graph_tf_record.py | Generate TF record files for the COCO-Cap-Graph setting |

3 Training and Evaluation

Multi-GPU training (5 GPUs in our case) takes less than 2.5 hours to train a single model, while the single-GPU strategy requires more than 8 hours.

3.1 Multi-GPU training

We use TF distributed training to train the models shown in our paper. For example, the following command creates and trains a model specified by the proto config file configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt and saves the trained model to a directory named "logs/base_phr_ite_seq". In train.sh, we create 1 ps, 1 chief, 3 workers, and 1 evaluator; these 6 instances are distributed on 5 GPUs (4 for training and 1 for evaluation). An illustrative TF_CONFIG for one instance follows the command below.

sh train.sh \
  "configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt" \
  "logs/base_phr_ite_seq"

3.2 Single-GPU training

Our model can also be trained with a single-GPU strategy, as follows. However, we suggest halving the learning rate or exploring other hyper-parameters for better results.

python "modeling/trainer_main.py" \
  --pipeline_proto "configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt" \
  --model_dir ""logs/base_phr_ite_seq""

3.3 Performance on test set

During training, an evaluator measures the model's performance on the validation set and saves the best checkpoint. Finally, we use the following command to evaluate the saved model on the test set. This evaluation lasts 2-3 hours depending on the post-processing parameters (e.g., see here). Currently, much of the post-processing is written in pure Python; we plan to optimize it to better utilize the GPU and reduce the final evaluation time.

python "modeling/trainer_main.py" \
  --pipeline_proto "configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt" \
  --model_dir ""logs/base_phr_ite_seq"" \
  --job test

3.4 Primary configs and implementations

Taking configs/GT-Graph-Zareian/base_phr_ite_seq.pbtxt as an example, the following configs control the model's behavior; an illustrative skeleton of the file follows the table.

| Name | Desc. | Impl. |
| --- | --- | --- |
| linguistic_options | Specify the phrasal context modeling; remove the section to disable it. | models/cap2sg_linguistic.py |
| grounding_options | Specify the grounding options. | models/cap2sg_grounding.py |
| detection_options | Specify the WSOD model; num_iterations controls the iterative process. | models/cap2sg_detection.py |
| relation_options | Specify the relation detection modeling. | models/cap2sg_relation.py |
| common_sense_options | Specify the sequential context modeling; remove the section to disable it. | models/cap2sg_common_sense.py |
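
For orientation, these options appear as sections of the protobuf text config. The skeleton below is illustrative only (we assume they sit inside the model block of the pipeline proto; the ellipses stand for fields we do not reproduce here):

```
model {
  linguistic_options { ... }               # phrasal context modeling
  grounding_options { ... }
  detection_options { num_iterations: 2 }  # iterative WSOD refinement (illustrative value)
  relation_options { ... }
  common_sense_options { ... }             # sequential context modeling
}
```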

4 Visualization

Please see cap2sg.ipynb.

5 Reference

If you find this project helpful, please cite our CVPR2021 paper :)

@InProceedings{Ye_2021_CVPR,
  author = {Ye, Keren and Kovashka, Adriana},
  title = {Linguistic Structures as Weak Supervision for Visual Scene Graph Generation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2021}
}

Also, please take a look at our earlier work from ICCV2019.

@InProceedings{Ye_2019_ICCV,
  author = {Ye, Keren and Zhang, Mingda and Kovashka, Adriana and Li, Wei and Qin, Danfeng and Berent, Jesse},
  title = {Cap2Det: Learning to Amplify Weak Caption Supervision for Object Detection},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = {October},
  year = {2019}
}