Cross-modal Retrieval using Transformer Encoder Reasoning Networks (TERN). With use of Metric Learning and FAISS for fast similarity search on GPU

Overview

Cross-modal Retrieval using Transformer Encoder Reasoning Networks

This project reimplements the idea from "Transformer Reasoning Network for Image-Text Matching and Retrieval". To solve the task of cross-modal retrieval, representative features from both modal are extracted using distinctive pipeline and then projected into the same embedding space. Because the features are sequence of vectors, Transformer-based model can be utilised to work best. In this repo, my highlight contribution is:

  • Reimplement TERN module, which exploits the effectiveness of using Transformer on bottom-up attention features and bert features.
  • Take advantage of facebookresearch's FAISS for efficient similarity search and clustering of dense vectors.
  • Experiment various metric learning loss objectives from KevinMusgrave's Pytorch Metric Learning

The figure below shows the overview of the architecture

screen

Datasets

  • I trained TERN on Flickr30k dataset which contains 31,000 images collected from Flickr, together with 5 reference sentences provided by human annotators for each image. For each sample, visual and text features are pre-extracted as numpy files

  • Some samples from the dataset:

Images Captions
screen 1. An elderly man is setting the table in front of an open door that leads outside to a garden.
2. The guy in the black sweater is looking onto the table below.
3. A man in a black jacket picking something up from a table.
4. An old man wearing a black jacket is looking on the table.
5. The gray-haired man is wearing a sweater.
screen 1. Two men are working on a bicycle on the side of the road.
2. Three men working on a bicycle on a cobblestone street.
3. Two men wearing shorts are working on a blue bike.
4. Three men inspecting a bicycle on a street.
5. Three men examining a bicycle.

Execution

  • Installation
pip install -r requirements.txt
apt install libomp-dev
pip install faiss-gpu
  • Specify dataset paths and configuration in the config file

  • For training

PYTHONPATH=. python tools/train.py 
  • For evaluation
PYTHONPATH=. python tools/eval.py \
                --top_k= <top k similarity> \
                --weight= <model checkpoint> \

Notebooks

  • Notebook Inference TERN on Flickr30k dataset
  • Notebook Use FasterRCNN to extract Bottom Up embeddings
  • Notebook Use BERT to extract text embeddings

Results

  • Validation m on Flickr30k dataset (trained for 100 epochs):
Model Weights i2t/[email protected] t2i/[email protected]
TERN link 0.5174 0.7496
  • Some visualization
Query text: Two dogs are running along the street
screen
Query text: The woman is holding a violin
screen
Query text: Young boys are playing baseball
screen
Query text: A man is standing, looking at a lake
screen

Paper References

@misc{messina2021transformer,
      title={Transformer Reasoning Network for Image-Text Matching and Retrieval}, 
      author={Nicola Messina and Fabrizio Falchi and Andrea Esuli and Giuseppe Amato},
      year={2021},
      eprint={2004.09144},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@misc{anderson2018bottomup,
      title={Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering}, 
      author={Peter Anderson and Xiaodong He and Chris Buehler and Damien Teney and Mark Johnson and Stephen Gould and Lei Zhang},
      year={2018},
      eprint={1707.07998},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@article{JDH17,
  title={Billion-scale similarity search with GPUs},
  author={Johnson, Jeff and Douze, Matthijs and J{\'e}gou, Herv{\'e}},
  journal={arXiv preprint arXiv:1702.08734},
  year={2017}
}

Code References

Owner
Minh-Khoi Pham
Passionate Machine Learner
Minh-Khoi Pham
AdamW optimizer and cosine learning rate annealing with restarts

AdamW optimizer and cosine learning rate annealing with restarts This repository contains an implementation of AdamW optimization algorithm and cosine

Maksym Pyrozhok 133 Dec 20, 2022
A package related to building quasi-fibration symmetries

qf A package related to building quasi-fibration symmetries. If you'd like to learn more about how it works, see the brief explanation and References

Paolo Boldi 1 Dec 01, 2021
Code release for DS-NeRF (Depth-supervised Neural Radiance Fields)

Depth-supervised NeRF: Fewer Views and Faster Training for Free Project | Paper | YouTube Pytorch implementation of our method for learning neural rad

524 Jan 08, 2023
GoodNews Everyone! Context driven entity aware captioning for news images

This is the code for a CVPR 2019 paper, called GoodNews Everyone! Context driven entity aware captioning for news images. Enjoy! Model preview: Huge T

117 Dec 19, 2022
ISNAS-DIP: Image Specific Neural Architecture Search for Deep Image Prior [CVPR 2022]

ISNAS-DIP: Image-Specific Neural Architecture Search for Deep Image Prior (CVPR 2022) Metin Ersin Arican*, Ozgur Kara*, Gustav Bredell, Ender Konukogl

Özgür Kara 24 Dec 18, 2022
Repo for our ICML21 paper Unsupervised Learning of Visual 3D Keypoints for Control

Unsupervised Learning of Visual 3D Keypoints for Control [Project Website] [Paper] Boyuan Chen1, Pieter Abbeel1, Deepak Pathak2 1UC Berkeley 2Carnegie

Boyuan Chen 34 Jul 22, 2022
Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite.

TFlite Ultra Fast Lane Detection Inference Example scripts for the detection of lanes using the ultra fast lane detection model in Tensorflow Lite. So

Ibai Gorordo 12 Aug 27, 2022
Template repository to build PyTorch projects from source on any version of PyTorch/CUDA/cuDNN.

The Ultimate PyTorch Source-Build Template Translations: 한국어 TL;DR PyTorch built from source can be x4 faster than a naïve PyTorch install. This repos

Joonhyung Lee/이준형 651 Dec 12, 2022
Implementation of the 😇 Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones

HaloNet - Pytorch Implementation of the Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones. This re

Phil Wang 189 Nov 22, 2022
Cmsc11 arcade - Final Project for CMSC11

cmsc11_arcade Final Project for CMSC11 Developers: Limson, Mark Vincent Peñafiel

Gregory 1 Jan 18, 2022
Code for the paper titled "Generalized Depthwise-Separable Convolutions for Adversarially Robust and Efficient Neural Networks" (NeurIPS 2021 Spotlight).

Generalized Depthwise-Separable Convolutions for Adversarially Robust and Efficient Neural Networks This repository contains the code and pre-trained

Hassan Dbouk 7 Dec 05, 2022
Raindrop strategy for Irregular time series

Graph-Guided Network For Irregularly Sampled Multivariate Time Series Overview This repository contains processed datasets and implementation code for

Zitnik Lab @ Harvard 74 Jan 03, 2023
Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

DeCLIP Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. Our paper is available in arxiv Updates ** Ou

Sense-GVT 470 Dec 30, 2022
Public Implementation of ChIRo from "Learning 3D Representations of Molecular Chirality with Invariance to Bond Rotations"

Learning 3D Representations of Molecular Chirality with Invariance to Bond Rotations This directory contains the model architectures and experimental

35 Dec 05, 2022
Pytorch implementation of Generative Models as Distributions of Functions 🌿

Generative Models as Distributions of Functions This repo contains code to reproduce all experiments in Generative Models as Distributions of Function

Emilien Dupont 117 Dec 29, 2022
Adversarial Attacks are Reversible via Natural Supervision

Adversarial Attacks are Reversible via Natural Supervision ICCV2021 Citation @InProceedings{Mao_2021_ICCV, author = {Mao, Chengzhi and Chiquier

Computer Vision Lab at Columbia University 20 May 22, 2022
AttGAN: Facial Attribute Editing by Only Changing What You Want (IEEE TIP 2019)

News 11 Jan 2020: We clean up the code to make it more readable! The old version is here: v1. AttGAN TIP Nov. 2019, arXiv Nov. 2017 TensorFlow impleme

Zhenliang He 568 Dec 14, 2022
Official PyTorch implementation of "AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks"

AASIST This repository provides the overall framework for training and evaluating audio anti-spoofing systems proposed in 'AASIST: Audio Anti-Spoofing

Clova AI Research 56 Jan 02, 2023
ManiSkill-Learn is a framework for training agents on SAPIEN Open-Source Manipulation Skill Challenge (ManiSkill Challenge), a large-scale learning-from-demonstrations benchmark for object manipulation.

ManiSkill-Learn ManiSkill-Learn is a framework for training agents on SAPIEN Open-Source Manipulation Skill Challenge, a large-scale learning-from-dem

Hao Su's Lab, UCSD 48 Dec 30, 2022
AOT-GAN for High-Resolution Image Inpainting (codebase for image inpainting)

AOT-GAN for High-Resolution Image Inpainting Arxiv Paper | AOT-GAN: Aggregated Contextual Transformations for High-Resolution Image Inpainting Yanhong

Multimedia Research 214 Jan 03, 2023