Codes and scripts for "Explainable Semantic Space by Grounding Languageto Vision with Cross-Modal Contrastive Learning"

Related tags

Deep LearningVG-Bert
Overview

Visually Grounded Bert Language Model

This repository is the official implementation of Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning.

To cite this work: Zhang, Y., Choi, M., Han, K., & Liu, Z. Explainable Semantic Space by Grounding Language toVision with Cross-Modal Contrastive Learning. (accepted by Neurips 2021).

Abstract

In natural language processing, most models try to learn semantic representa- tions merely from texts. The learned representations encode the “distributional semantics” but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action and the brain encodes “grounded semantics” for cognition. Inspired by this notion and recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a Bert-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator with the Visual Genome dataset. After training, the model’s language stream is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. This semantic space manifests principal dimensions explainable with human intuition and neurobiological knowl- edge. Word embeddings in this semantic space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model also enables compositional language understanding based on visual knowledge and multimodal image search with queries based on images, texts, or their combinations.

Requirements

The model was trained with Python version: 3.7.4.

numpy==1.17.2, scipy==1.3.1, torch==1.7.1, torchvision==0.8.2, transformers==4.2.1, Pillow==6.2.0, tokenizers==0.9.4

Training

To train the model(s) in the paper, run the following commands:

stage-1: visual stream pretraining:

python visual_stream_pretraining.py \
-a vgg16_attention \
--pretrained \
--batch-size 200 \
--use-position learn \
--lr 0.01 \
/directory-to-ImageNet-dataset/ImageNet2012/

stage-2: two-stream grounding on MS COCO dataset with corss-modal contrastive loss:

python run.py \
--stage two_stream_pretraining \
--data-train /directory-to-COCO-dataset/COCO_train2017.json \
--data-val /directory-to-COCO-dataset/COCO_val2017.json \
--optim adam \
--learning-rate 5e-5 \
--batch-size 180 \
--n_epochs 100 \
--pretrained-vgg \
--image-model VGG16_Attention \
--use-position learn \
--language-model Bert_base \
--embedding-dim 768 \
--sigma 0.1 \
--dropout-rate 0.3 \
--base-model-learnable-layers 8 \
--load-pretrained /directory-to-pretrain-image-model/ \
--exp-dir /output-directory/two-stream-pretraining/

stage-3: visual relational grounding on Visual Genome dataset:

python run.py \
--stage relational_grounding \
--data-train /directory-to-Visual-Genome-dataset/VG_train_dataset_finalized.json \
--data-val /directory-to-Visual-Genome-dataset/VG_val_dataset_finalized.json \
--optim adam \
--learning-rate 1e-5 \
--batch-size 180 \
--n_epochs 150 \
--pretrained-vgg \
--image-model VGG16_Attention \
--use-position learn \
--language-model Bert_object \
--num-heads 8 \
--embedding-dim 768 \
--subspace-dim 32 \
--relation-num 115 \
--temperature 1 \
--dropout-rate 0.1 \
--base-model-learnable-layers 2 \
--load-pretrained /directory-to-pretrain-two-stream-model/ \
--exp-dir /output-directory/relational-grounding/

Transfer learning for cross-modal image search:

python transfer_cross_modal_retrieval.py \
--data-train /directory-to-COCO-dataset/COCO_train2017.json \
--data-val /directory-to-COCO-dataset/COCO_val2017.json \
--optim adam \
--learning-rate 5e-5 \
--batch-size 300 \
--n_epochs 150 \
--pretrained-vgg \
--image-model VGG16_Attention \
--use-position absolute_learn \
--language-model Bert_object \
--num-heads 8 \
--embedding-dim 768 \
--subspace-dim 32 \
--relation-num 115 \
--sigma 0.1 \
--dropout-rate 0.1 \
--load-pretrained /directory-to-pretrain-two-stream-model/ \
--exp-dir /output-directory/cross-modal-retrieval/

Evaluation

We include the jupyter notebook scripts for running evaluation tasks in our paper. See README in evaluation/.

Dataset Condensation with Contrastive Signals

Dataset Condensation with Contrastive Signals This repository is the official implementation of Dataset Condensation with Contrastive Signals (DCC). T

3 May 19, 2022
This is an open source library implementing hyperbox-based machine learning algorithms

hyperbox-brain is a Python open source toolbox implementing hyperbox-based machine learning algorithms built on top of scikit-learn and is distributed

Complex Adaptive Systems (CAS) Lab - University of Technology Sydney 21 Dec 14, 2022
ShuttleNet: Position-aware Fusion of Rally Progress and Player Styles for Stroke Forecasting in Badminton (AAAI'22)

ShuttleNet: Position-aware Rally Progress and Player Styles Fusion for Stroke Forecasting in Badminton (AAAI 2022) Official code of the paper ShuttleN

Wei-Yao Wang 11 Nov 30, 2022
[IJCAI'21] Deep Automatic Natural Image Matting

Deep Automatic Natural Image Matting [IJCAI-21] This is the official repository of the paper Deep Automatic Natural Image Matting. Introduction | Netw

Jizhizi_Li 316 Jan 06, 2023
Residual Dense Net De-Interlace Filter (RDNDIF)

Residual Dense Net De-Interlace Filter (RDNDIF) Work in progress deep de-interlacer filter. It is based on the architecture proposed by Bernasconi et

Louis 7 Feb 15, 2022
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 08, 2023
Research Artifact of USENIX Security 2022 Paper: Automated Side Channel Analysis of Media Software with Manifold Learning

Automated Side Channel Analysis of Media Software with Manifold Learning Official implementation of USENIX Security 2022 paper: Automated Side Channel

Yuanyuan Yuan 175 Jan 07, 2023
Offical implementation of Shunted Self-Attention via Multi-Scale Token Aggregation

Shunted Transformer This is the offical implementation of Shunted Self-Attention via Multi-Scale Token Aggregation by Sucheng Ren, Daquan Zhou, Shengf

156 Dec 27, 2022
A mini library for Policy Gradients with Parameter-based Exploration, with reference implementation of the ClipUp optimizer from NNAISENSE.

PGPElib A mini library for Policy Gradients with Parameter-based Exploration [1] and friends. This library serves as a clean re-implementation of the

NNAISENSE 56 Jan 01, 2023
EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation (CVPR'21)

EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation (CVPR'21) Citation If y

addisonwang 18 Nov 11, 2022
Anime Face Detector using mmdet and mmpose

Anime Face Detector This is an anime face detector using mmdetection and mmpose. (To avoid copyright issues, I use generated images by the TADNE model

198 Jan 07, 2023
Testbed of AI Systems Quality Management

qunomon Description A testbed for testing and managing AI system qualities. Demo Sorry. Not deployment public server at alpha version. Requirement Ins

AIST AIRC 15 Nov 27, 2021
Good Classification Measures and How to Find Them

Good Classification Measures and How to Find Them This repository contains supplementary materials for the paper "Good Classification Measures and How

Yandex Research 7 Nov 13, 2022
Code for Robust Contrastive Learning against Noisy Views

Robust Contrastive Learning against Noisy Views This repository provides a PyTorch implementation of the Robust InfoNCE loss proposed in paper Robust

Ching-Yao Chuang 53 Jan 08, 2023
Official Implementation for the paper DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover’s Distance Improves Out-Of-Distribution Face Identification

DeepFace-EMD: Re-ranking Using Patch-wise Earth Mover’s Distance Improves Out-Of-Distribution Face Identification Official Implementation for the pape

Anh M. Nguyen 36 Dec 28, 2022
NALSM: Neuron-Astrocyte Liquid State Machine

NALSM: Neuron-Astrocyte Liquid State Machine This package is a Tensorflow implementation of the Neuron-Astrocyte Liquid State Machine (NALSM) that int

Computational Brain Lab 4 Nov 28, 2022
The tl;dr on a few notable transformer/language model papers + other papers (alignment, memorization, etc).

The tl;dr on a few notable transformer/language model papers + other papers (alignment, memorization, etc).

Will Thompson 166 Jan 04, 2023
GeneGAN: Learning Object Transfiguration and Attribute Subspace from Unpaired Data

GeneGAN: Learning Object Transfiguration and Attribute Subspace from Unpaired Data By Shuchang Zhou, Taihong Xiao, Yi Yang, Dieqiao Feng, Qinyao He, W

Taihong Xiao 141 Apr 16, 2021
Deep Learning and Logical Reasoning from Data and Knowledge

Logic Tensor Networks (LTN) Logic Tensor Network (LTN) is a neurosymbolic framework that supports querying, learning and reasoning with both rich data

171 Dec 29, 2022
DSTC10 Track 2 - Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations

DSTC10 Track 2 - Knowledge-grounded Task-oriented Dialogue Modeling on Spoken Conversations This repository contains the data, scripts and baseline co

Alexa 51 Dec 17, 2022