This repo provides code for QB-Norm (Cross Modal Retrieval with Querybank Normalisation)

Last update: Dec 29, 2022

Usage example

python dynamic_inverted_softmax.py --sims_train_test_path msrvtt/tt-ce-train-captions-test-videos-seed0.pkl --sims_test_path msrvtt/tt-ce-test-captions-test-videos-seed0.pkl --test_query_masks_path msrvtt/tt-ce-test-query_masks.pkl

To test QB-Norm on your own data, you need to:

1. Extract the similarity matrix between the captions from the training split and the videos from the testing split (path/to/sims/train/test).
2. Extract the testing split similarity matrix, i.e. the similarities between the testing captions and the testing videos (path/to/sims/test).
3. Run QB-Norm:

python dynamic_inverted_softmax.py --sims_train_test_path path/to/sims/train/test --sims_test_path path/to/sims/test
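The script name points to the Dynamic Inverted Softmax variant described in the QB-Norm paper. As a rough illustration of how the two similarity matrices interact, here is a minimal NumPy sketch of that idea. It is not the repository's implementation: the function name, the (queries × gallery) orientation, and the `beta` and `k` defaults are assumptions made for this example, and the real script may store or orient its inputs differently.

```python
import numpy as np


def dynamic_inverted_softmax_sketch(sims_bank, sims_test, beta=20.0, k=1):
    """Illustrative querybank normalisation (hypothetical helper, not the repo API).

    sims_bank: (n_bank_queries, n_gallery) similarities between querybank
               (training) captions and the test gallery.
    sims_test: (n_test_queries, n_gallery) similarities between test
               captions and the same test gallery.
    beta, k:   assumed temperature and top-k values, chosen for illustration.
    """
    # Gallery items that are a top-k match for at least one querybank query.
    topk_bank = np.argsort(-sims_bank, axis=1)[:, :k]
    activation_set = np.unique(topk_bank)

    # Per-gallery normalising constant accumulated over the querybank.
    norm = np.exp(beta * sims_bank).sum(axis=0)

    # Inverted-softmax scores for every test query.
    normalised = np.exp(beta * sims_test) / norm

    # Normalise only the queries whose current top-1 item is in the activation set;
    # all other rows keep their original similarities.
    top1_test = sims_test.argmax(axis=1)
    use_norm = np.isin(top1_test, activation_set)
    out = sims_test.astype(float)  # astype returns a copy
    out[use_norm] = normalised[use_norm]
    return out
```

Because retrieval ranks gallery items separately for each query, mixing normalised and unnormalised rows in the returned matrix does not affect the per-query rankings.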

Data

The similarity matrices for each method were extracted using the official repositories: CE+, TT-CE+, CLIP2Video, CLIP4Clip, CLIP, MMT, and Audio-Retrieval. For CLIP4Clip, we used the official repo to train new models from scratch, since pre-trained weights are not provided.

You can download the extracted similarity matrices for training and testing here: MSRVTT, MSVD, DiDeMo, LSMDC.

Text-Video retrieval results

QB-Norm Results on MSRVTT Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CE+	Full	t2v	14.4 (0.1)	37.4 (0.1)	50.2 (0.1)	10.0 (0.0)	30.0 (0.1)
CE+ (+QB-Norm)	Full	t2v	16.4 (0.0)	40.3 (0.1)	52.9 (0.1)	9.0 (0.0)	32.7 (0.1)
TT-CE+	Full	t2v	14.9 (0.1)	38.3 (0.1)	51.5 (0.1)	10.0 (0.0)	30.9 (0.1)
TT-CE+ (+QB-Norm)	Full	t2v	17.3 (0.0)	42.1 (0.2)	54.9 (0.1)	8.0 (0.0)	34.2 (0.1)
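The result tables report recall at rank k (R@1, R@5, R@10), median rank (MdR), and Geom. Geom appears to be the geometric mean of the three recall values; for example, the CE+ row on MSRVTT gives (14.4 × 37.4 × 50.2)^(1/3) ≈ 30.0. The sketch below shows one way such metrics could be computed from a similarity matrix. It assumes a single correct gallery item per query, which is a simplification, and the function and argument names are hypothetical rather than taken from the repository.

```python
import numpy as np


def retrieval_metrics_sketch(sims, gt):
    """Compute R@1/5/10, MdR and Geom (hypothetical helper for illustration).

    sims: (n_queries, n_gallery) similarity matrix.
    gt:   (n_queries,) index of the single correct gallery item per query.
    """
    gt = np.asarray(gt)
    # Gallery indices sorted from most to least similar for each query.
    order = np.argsort(-sims, axis=1)
    # 1-based rank at which the ground-truth item appears.
    ranks = np.argmax(order == gt[:, None], axis=1) + 1

    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in (1, 5, 10)}
    metrics["MdR"] = float(np.median(ranks))
    # Geometric mean of the three recall values.
    metrics["Geom"] = float(
        np.cbrt(metrics["R@1"] * metrics["R@5"] * metrics["R@10"])
    )
    return metrics
```

Datasets with several captions per video, or with query masks such as the one passed to the MSRVTT example command, need extra handling that this sketch leaves out.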

QB-Norm Results on MSVD Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	25.4 (0.3)	56.9 (0.4)	71.3 (0.2)	4.0 (0.0)	46.9 (0.3)
TT-CE+ (+QB-Norm)	Full	t2v	26.6 (1.0)	58.6 (1.3)	71.8 (1.1)	4.0 (0.0)	48.2 (1.2)
CLIP2Video	Full	t2v	47.0	76.8	85.9	2.0	67.7
CLIP2Video (+QB-Norm)	Full	t2v	48.0	77.9	86.2	2.0	68.5

QB-Norm Results on DiDeMo Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	21.6 (0.7)	48.6 (0.4)	62.9 (0.6)	6.0 (0.0)	40.4 (0.4)
TT-CE+ (+QB-Norm)	Full	t2v	24.2 (0.7)	50.8 (0.7)	64.4 (0.1)	5.3 (0.5)	43.0 (0.2)
CLIP4Clip	Full	t2v	43.0	70.5	80.0	2.0	62.4
CLIP4Clip (+QB-Norm)	Full	t2v	43.5	71.4	80.9	2.0	63.1

QB-Norm Results on LSMDC Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	17.2 (0.4)	36.5 (0.6)	46.3 (0.3)	13.7 (0.5)	30.7 (0.3)
TT-CE+ (+QB-Norm)	Full	t2v	17.8 (0.4)	37.7 (0.5)	47.6 (0.6)	12.7 (0.5)	31.7 (0.3)
CLIP4Clip	Full	t2v	21.3	40.0	49.5	11.0	34.8
CLIP4Clip (+QB-Norm)	Full	t2v	22.4	40.1	49.5	11.0	35.4

QB-Norm Results on VaTeX Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
TT-CE+	Full	t2v	53.2 (0.2)	87.4 (0.1)	93.3 (0.0)	1.0 (0.0)	75.7 (0.1)
TT-CE+ (+QB-Norm)	Full	t2v	54.8 (0.1)	88.2 (0.1)	93.8 (0.1)	1.0 (0.0)	76.8 (0.0)
CLIP2Video	Full	t2v	57.4	87.9	93.6	1.0	77.9
CLIP2Video (+QB-Norm)	Full	t2v	58.8	88.3	93.8	1.0	78.7

QB-Norm Results on QuerYD Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CE+	Full	t2v	13.2 (2.0)	37.1 (2.9)	50.5 (1.9)	10.3 (1.2)	29.1 (2.2)
CE+ (+QB-Norm)	Full	t2v	14.1 (1.8)	38.6 (1.3)	51.1 (1.6)	10.0 (0.8)	30.2 (1.7)
TT-CE+	Full	t2v	14.4 (0.5)	37.7 (1.7)	50.9 (1.6)	9.8 (1.0)	30.3 (0.9)
TT-CE+ (+QB-Norm)	Full	t2v	15.1 (1.6)	38.3 (2.4)	51.2 (2.8)	10.3 (1.7)	30.9 (2.3)

Text-Image retrieval results

QB-Norm Results on MSCoCo Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
CLIP	5k	t2i	30.3	56.1	67.1	4.0	48.5
CLIP (+QB-Norm)	5k	t2i	34.8	59.9	70.4	3.0	52.8
MMT-Oscar	5k	t2i	52.2	80.2	88.0	1.0	71.7
MMT-Oscar (+QB-Norm)	5k	t2i	53.9	80.5	88.1	1.0	72.6

Text-Audio retrieval results

QB-Norm Results on AudioCaps Benchmark

Model	Split	Task	R@1	R@5	R@10	MdR	Geom
AR-CE	Full	t2a	23.1 (0.6)	55.1 (0.7)	70.7 (0.6)	4.7 (0.5)	44.8 (0.7)
AR-CE (+QB-Norm)	Full	t2a	23.9 (0.2)	57.1 (0.3)	71.6 (0.4)	4.0 (0.0)	46.0 (0.3)

References

If you find this code useful or use the extracted similarity matrices, please consider citing:

@misc{bogolin2021cross,
      title={Cross Modal Retrieval with Querybank Normalisation}, 
      author={Simion-Vlad Bogolin and Ioana Croitoru and Hailin Jin and Yang Liu and Samuel Albanie},
      year={2021},
      eprint={2112.12777},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
