Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021

Overview

License: MIT

Official PyTorch implementation of the paper Learning the Best Pooling Strategy for Visual Semantic Embedding (CVPR 2021 Oral).

Please cite this paper with the following BibTeX entry if you use any resources from this repo.

@inproceedings{chen2021vseinfty,
     title={Learning the Best Pooling Strategy for Visual Semantic Embedding},
     author={Chen, Jiacheng and Hu, Hexiang and Wu, Hao and Jiang, Yuning and Wang, Changhu},
     booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
     year={2021}
} 

We referred to the implementations of VSE++ and SCAN to build up our codebase.

Introduction

Illustration of the standard Visual Semantic Embedding (VSE) framework with the proposed pooling-based aggregator, i.e., the Generalized Pooling Operator (GPO). GPO is simple and effective: it automatically adapts to the appropriate pooling strategy for different data modalities and feature extractors, and improves VSE models at negligible extra computation cost.
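
As a rough illustration of the idea, pooling can be written as a weighted sum over the sorted values of each feature dimension: putting all the weight on the largest value gives max pooling, and uniform weights give average pooling. The sketch below uses plain Python and hand-picked weights. The function name generalized_pool is hypothetical, and the weights here are fixed rather than learned, so this is a simplification and not the repository's implementation.

```python
from typing import List, Sequence


def generalized_pool(features: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Pool N feature vectors into one vector.

    For every feature dimension, the N values are sorted in descending
    order and combined with a weighted sum; weights[k] is the weight of
    the k-th largest value. (Hypothetical helper, for illustration only.)
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    pooled = []
    for dim_values in zip(*features):  # iterate over feature dimensions
        ranked = sorted(dim_values, reverse=True)
        pooled.append(sum(w * v for w, v in zip(weights, ranked)))
    return pooled


# Three region (or word) features with two dimensions each.
feats = [[1.0, 6.0], [3.0, 2.0], [5.0, 4.0]]

max_pooled = generalized_pool(feats, [1.0, 0.0, 0.0])        # max pooling
avg_pooled = generalized_pool(feats, [1 / 3, 1 / 3, 1 / 3])  # mean pooling
```

Any other weight vector gives a pooling strategy between these two extremes, which is the space a learned operator can search over.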

Image-text Matching Results

The following tables show partial results of image-to-text retrieval on the COCO and Flickr30K datasets. In these experiments, we use BERT-base as the text encoder for our methods. This branch provides our code and pre-trained models that use BERT as the text backbone; please check out the bigru branch for the code and pre-trained models that use BiGRU as the text backbone.

Note that the VSE++ entries in the following tables are VSE++ models trained with the specified feature backbones, so the results differ from those in the original VSE++ paper.

Results of 5-fold evaluation on COCO 1K Test Split

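| Method | Visual Backbone | Text Backbone | R1 | R5 | R1 | R5 | Link |
|---|---|---|---|---|---|---|---|
| VSE++ | BUTD region | BERT-base | 67.9 | 91.9 | 54.0 | 85.6 | - |
| VSEInfty | BUTD region | BERT-base | 79.7 | 96.4 | 64.8 | 91.4 | Here |
| VSEInfty | BUTD grid | BERT-base | 80.4 | 96.8 | 66.4 | 92.1 | Here |
| VSEInfty | WSL grid | BERT-base | 84.5 | 98.1 | 72.0 | 93.9 | Here |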

Results on Flickr30K Test Split

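| Method | Visual Backbone | Text Backbone | R1 | R5 | R1 | R5 | Link |
|---|---|---|---|---|---|---|---|
| VSE++ | BUTD region | BERT-base | 63.4 | 87.2 | 45.6 | 76.4 | - |
| VSEInfty | BUTD region | BERT-base | 81.7 | 95.4 | 61.4 | 85.9 | Here |
| VSEInfty | BUTD grid | BERT-base | 81.5 | 97.1 | 63.7 | 88.3 | Here |
| VSEInfty | WSL grid | BERT-base | 88.4 | 98.3 | 74.2 | 93.7 | Here |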

Results (in R@1) on the Crisscrossed Caption benchmark (trained on COCO)

| Method | Visual Backbone | Text Backbone | I2T | T2I | T2T | I2I |
|---|---|---|---|---|---|---|
| VSRN | BUTD region | BiGRU | 52.4 | 40.1 | 41.0 | 44.2 |
| DE | EfficientNet-B4 grid | BERT-base | 55.9 | 41.7 | 42.6 | 38.5 |
| VSEInfty | BUTD grid | BERT-base | 60.6 | 46.2 | 45.9 | 44.4 |
| VSEInfty | WSL grid | BERT-base | 67.9 | 53.6 | 46.7 | 51.3 |
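
The R1 and R5 columns in the tables above report Recall@K, the percentage of queries whose ground-truth match appears among the top K retrieved items. A minimal sketch of the metric follows. It assumes a square similarity matrix in which query i matches only item i, which is a simplification (COCO and Flickr30K pair each image with several captions), and recall_at_k is a hypothetical helper, not a function from this repo.

```python
from typing import Sequence


def recall_at_k(sims: Sequence[Sequence[float]], k: int) -> float:
    """Recall@K in percent.

    Row i holds the similarities of query i to all items, and item i is
    assumed to be the only correct match for query i.
    """
    hits = 0
    for i, row in enumerate(sims):
        # Zero-based rank of the correct item: items scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sims)


sims = [
    [0.9, 0.1, 0.3],  # query 0: correct item is ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct item is ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct item is ranked 3rd
]
r1 = recall_at_k(sims, 1)
r2 = recall_at_k(sims, 2)
```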

Preparation

Environment

We trained and evaluated our models with the following key dependencies:

  • Python 3.7.3

  • PyTorch 1.2.0

  • Transformers 2.1.0

Run pip install -r requirements.txt to install exactly the same dependencies as in our experiments. We also verified that the newer PyTorch 1.8.0 and Transformers 4.4.2 produce similar results.
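
To confirm which versions ended up in an environment, a small check like the following can help. It is only a sketch: it relies on importlib.metadata, which requires Python 3.8 or newer (later than the 3.7.3 listed above), and the helper name installed_versions is hypothetical.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["torch", "transformers"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```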

Data

We organize all data used in the experiments in the following manner:

data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── images   # raw coco images
│   │      ├── train2014
│   │      └── val2014
│   │
│   ├── cxc_annots # annotations for evaluating COCO-trained models on the CxC benchmark
│   │
│   └── id_mapping.json  # mapping from coco-id to image's file name
│   
│
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── flickr30k-images   # raw Flickr30K images
│   │      ├── xxx.jpg
│   │      └── ...
│   └── id_mapping.json  # mapping from f30k index to image's file name
│   
├── weights
│      └── original_updown_backbone.pth # the BUTD CNN weights
│
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)

The download links for the original COCO/F30K images, the precomputed BUTD features, and the corresponding vocabularies are available in the official SCAN repo. The precomp folders contain pre-computed BUTD region features, data/coco/images contains raw MS-COCO images, and data/f30k/flickr30k-images contains raw Flickr30K images.

The id_mapping.json files map each image index (i.e., the COCO id for COCO images) to the corresponding filename. We generated these mappings to eliminate the need for the pycocotools package.
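
A sketch of how such a mapping could be used to locate an image on disk is shown below. The JSON layout assumed here (a flat object from id to relative filename), the sample entry, and the helpers load_id_mapping and image_path are all assumptions for illustration; check the downloaded file for its actual structure.

```python
import json
import os
import tempfile


def load_id_mapping(path):
    """Load an id -> filename mapping from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def image_path(image_root, mapping, image_id):
    """Resolve an image's file path from its id.

    JSON object keys are always strings, so the id is converted with str().
    """
    return os.path.join(image_root, mapping[str(image_id)])


# Demo with a made-up mapping written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    mapping_file = os.path.join(tmp, "id_mapping.json")
    with open(mapping_file, "w", encoding="utf-8") as f:
        json.dump({"42": "val2014/example_42.jpg"}, f)
    mapping = load_id_mapping(mapping_file)
    resolved = image_path("/tmp/data/coco/images", mapping, 42)
```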

weights/original_updown_backbone.pth contains the pre-trained ResNet-101 weights from the Bottom-up Attention Model; we converted the original Caffe weights to PyTorch. Please download it from this link.

The data/coco/cxc_annots directory contains the data files needed to run the Crisscrossed Caption (CxC) evaluation. Since there is no official evaluation protocol in the CxC repo, we processed their raw data files and generated these data files to implement our own evaluation. We verified our implementation by checking that the evaluation results of the official VSRN model match the ones reported in the CxC paper. Please download the data files at this link.

Please download all necessary data files and organize them in the above manner. The path to the data directory is passed as an argument to the training script, as shown below.
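
Before training, it can be useful to check that the data root matches the layout above. The following is a minimal sketch: the list of expected paths is copied from the directory tree in this README, while the helper missing_entries is hypothetical and not part of the repo.

```python
import tempfile
from pathlib import Path

# Paths from the directory tree above, relative to the data root.
EXPECTED = [
    "coco/precomp",
    "coco/images",
    "coco/cxc_annots",
    "coco/id_mapping.json",
    "f30k/precomp",
    "f30k/flickr30k-images",
    "f30k/id_mapping.json",
    "weights/original_updown_backbone.pth",
    "vocab",
]


def missing_entries(data_root):
    """Return the expected paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Demo on a temporary directory in which only one folder has been created.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "coco" / "precomp").mkdir(parents=True)
    missing = missing_entries(tmp)
```

On a real setup, missing_entries("/tmp/data") returning an empty list would indicate that every path from the tree is present.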

Training

Assuming the data root is /tmp/data, we provide example training scripts for:

  1. Grid feature with BUTD CNN for the image feature, BERT-base for the text feature. See train_grid.sh

  2. BUTD Region feature for the image feature, BERT-base for the text feature. See train_region.sh

To use other CNN initializations for the grid image feature, set the --backbone_source argument to one of the following values:

  • detector (the default): use the BUTD ResNet-101; we converted the original Caffe weights to PyTorch and provide the download link above;
  • wsl: use the backbones from large-scale weakly supervised learning;
  • imagenet_res152: use the ResNet-152 pre-trained on ImageNet.
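
For reference, a flag with a fixed set of values like this one could be declared with argparse roughly as follows. This is a hypothetical sketch built only from the three values listed above; the repository's actual argument parser may be defined differently.

```python
import argparse

# The three values listed in this README, with a short description each.
BACKBONE_SOURCES = {
    "detector": "BUTD ResNet-101 (converted from the original Caffe weights)",
    "wsl": "backbone from large-scale weakly supervised learning",
    "imagenet_res152": "ResNet-152 pre-trained on ImageNet",
}

# Hypothetical parser; only --backbone_source is modeled here.
parser = argparse.ArgumentParser(description="illustrative option parser")
parser.add_argument("--backbone_source", default="detector",
                    choices=sorted(BACKBONE_SOURCES),
                    help="CNN initialization for the grid image feature")

args = parser.parse_args(["--backbone_source", "wsl"])
description = BACKBONE_SOURCES[args.backbone_source]
```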

Evaluation

Run eval.py to evaluate specified models on either COCO or Flickr30K. To evaluate pre-trained models on COCO, use the following command (assuming there are 4 GPUs and the local data path is /tmp/data):

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset coco --data_path /tmp/data/coco

To evaluate pre-trained models on Flickr30K, use the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset f30k --data_path /tmp/data/f30k

For evaluating pre-trained COCO models on the CxC dataset, use the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset coco --data_path /tmp/data/coco --evaluate_cxc

To evaluate a two-model ensemble, first run the single-model evaluation commands above with the argument --save_results, then use eval_ensemble.py to get the results (the paths to the saved results must be specified manually).
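
One common way to ensemble two retrieval models is to average their image-text similarity matrices before ranking. The sketch below illustrates that idea in plain Python. Whether eval_ensemble.py combines the saved results in exactly this way is an assumption, and the names average_similarities and top1 are hypothetical.

```python
def average_similarities(sims_a, sims_b):
    """Element-wise mean of two equally shaped similarity matrices."""
    if len(sims_a) != len(sims_b) or any(
            len(ra) != len(rb) for ra, rb in zip(sims_a, sims_b)):
        raise ValueError("similarity matrices must have the same shape")
    return [[(a + b) / 2.0 for a, b in zip(ra, rb)]
            for ra, rb in zip(sims_a, sims_b)]


def top1(sims):
    """Index of the best-scoring item for every query (row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


model_a = [[0.6, 0.7], [0.2, 0.9]]  # model A prefers item 1 for query 0
model_b = [[0.9, 0.4], [0.1, 0.8]]  # model B prefers item 0 for query 0
ensemble = average_similarities(model_a, model_b)
predictions = top1(ensemble)
```

In this toy example the two models disagree on query 0, and the averaged scores settle the ranking in favor of item 0.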
