Optimizing DR with hard negatives and achieving SOTA first-stage retrieval performance on TREC DL Track (SIGIR 2021 Full Paper).

Overview

Optimizing Dense Retrieval Model Training with Hard Negatives

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma

This repo provides code, retrieval results, and trained models for our SIGIR Full paper Optimizing Dense Retrieval Model Training with Hard Negatives. The previous version is Learning To Retrieve: How to Train a Dense Retrieval Model Effectively and Efficiently.

We achieve very impressive retrieval results on both passage and document retrieval bechmarks. The proposed two algorithms (STAR and ADORE) are very efficient. IMHO, they are well worth trying and most likely improve your retriever's performance by a large margin.

The following figure shows the pros and cons of different training methods. You can train an effective Dense Retrieval model in three steps. Firstly, warmup your model using random negatives or BM25 top negatives. Secondly, use our proposed STAR to train the query encoder and document encoder. Thirdly, use our proposed ADORE to train the query encoder. image

Retrieval Results and Trained Models

Passage Retrieval Dev [email protected] Dev [email protected] Test [email protected] Files
Inbatch-Neg 0.264 0.837 0.583 Model
Rand-Neg 0.301 0.853 0.612 Model
STAR 0.340 0.867 0.642 Model Train Dev TRECTest
ADORE (Inbatch-Neg) 0.316 0.860 0.658 Model
ADORE (Rand-Neg) 0.326 0.865 0.661 Model
ADORE (STAR) 0.347 0.876 0.683 Model Train Dev TRECTest Leaderboard
Doc Retrieval Dev [email protected] Dev [email protected] Test [email protected] Files
Inbatch-Neg 0.320 0.864 0.544 Model
Rand-Neg 0.330 0.859 0.572 Model
STAR 0.390 0.867 0.605 Model Train Dev TRECTest
ADORE (Inbatch-Neg) 0.362 0.884 0.580 Model
ADORE (Rand-Neg) 0.361 0.885 0.585 Model
ADORE (STAR) 0.405 0.919 0.628 Model Train Dev TRECTest Leaderboard

If you want to use our first-stage leaderboard runs, contact me and I will send you the file.

If any links fail or the files go wrong, please contact me or open a issue.

Requirements

To install requirements, run the following commands:

git clone [email protected]:jingtaozhan/DRhard.git
cd DRhard
python setup.py install

However, you need to set up a new python enverionment for data preprocessing (see below).

Data Download

To download all the needed data, run:

bash download_data.sh

Data Preprocess

You need to set up a new environment with transformers==2.8.0 to tokenize the text. This is because we find the tokenizer behaves differently among versions 2, 3 and 4. To replicate the results in our paper with our provided trained models, it is necessary to use version 2.8.0 for preprocessing. Otherwise, you may need to re-train the DR models.

Run the following codes.

python preprocess.py --data_type 0; python preprocess.py --data_type 1

Inference

With our provided trained models, you can easily replicate our reported experimental results. Note that minor variance may be observed due to environmental difference.

STAR

The following codes use the provided STAR model to compute query/passage embeddings and perform similarity search on the dev set. (You can use --faiss_gpus option to use gpus for much faster similarity search.)

python ./star/inference.py --data_type passage --max_doc_length 256 --mode dev   
python ./star/inference.py --data_type doc --max_doc_length 512 --mode dev   

Run the following code to evaluate on MSMARCO Passage dataset.

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/star/dev.rank.tsv
Eval Started
#####################
MRR @10: 0.3404237731386721
QueriesRanked: 6980
#####################

Run the following code to evaluate on MSMARCO Document dataset.

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/star/dev.rank.tsv 100
Eval Started
#####################
MRR @100: 0.3903422772218344
QueriesRanked: 5193
#####################

ADORE

ADORE computes the query embeddings. The document embeddings are pre-computed by other DR models, like STAR. The following codes use the provided ADORE(STAR) model to compute query embeddings and perform similarity search on the dev set. (You can use --faiss_gpus option to use gpus for much faster similarity search.)

python ./adore/inference.py --model_dir ./data/passage/trained_models/adore-star --output_dir ./data/passage/evaluate/adore-star --preprocess_dir ./data/passage/preprocess --mode dev --dmemmap_path ./data/passage/evaluate/star/passages.memmap
python ./adore/inference.py --model_dir ./data/doc/trained_models/adore-star --output_dir ./data/doc/evaluate/adore-star --preprocess_dir ./data/doc/preprocess --mode dev --dmemmap_path ./data/doc/evaluate/star/passages.memmap

Evaluate ADORE(STAR) model on dev passage dataset:

python ./msmarco_eval.py ./data/passage/preprocess/dev-qrel.tsv ./data/passage/evaluate/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################

Evaluate ADORE(STAR) model on dev document dataset:

python ./msmarco_eval.py ./data/doc/preprocess/dev-qrel.tsv ./data/doc/evaluate/adore-star/dev.rank.tsv 100

You will get

Eval Started
#####################
MRR @100: 0.4049777020859768
QueriesRanked: 5193
#####################

Convert QID/PID Back

Our data preprocessing reassigns new ids for each query and document. Therefore, you may want to convert the ids back. We provide a script for this.

The following code shows an example to convert ADORE-STAR's ranking results on the dev passage dataset.

python ./cvt_back.py --input_dir ./data/passage/evaluate/adore-star/ --preprocess_dir ./data/passage/preprocess --output_dir ./data/passage/official_runs/adore-star --mode dev --dataset passage
python ./msmarco_eval.py ./data/passage/dataset/qrels.dev.small.tsv ./data/passage/official_runs/adore-star/dev.rank.tsv

You will get

Eval Started
#####################
MRR @10: 0.34660697230181425
QueriesRanked: 6980
#####################

Train

Instructions will be ready this weekend (7.18).

Owner
Jingtao Zhan
Ph.D at THUIR.
Jingtao Zhan
face_recognization (FaceNet) + TFHE (HNP) + hand_face_detection (Mediapipe)

SuperControlSystem Face_Recognization (FaceNet) 面部识别 (FaceNet) Fully Homomorphic Encryption over the Torus (HNP) 环面全同态加密 (TFHE) Hand_Face_Detection (M

liziyu0104 2 Dec 30, 2021
Using Streamlit to host a multi-page tool with model specs and classification metrics, while also accepting user input values for prediction.

Predicitng_viability Using Streamlit to host a multi-page tool with model specs and classification metrics, while also accepting user input values for

Gopalika Sharma 1 Nov 08, 2021
PyTorch ,ONNX and TensorRT implementation of YOLOv4

PyTorch ,ONNX and TensorRT implementation of YOLOv4

4.2k Jan 01, 2023
Official implementation of "A Shared Representation for Photorealistic Driving Simulators" in PyTorch.

A Shared Representation for Photorealistic Driving Simulators The official code for the paper: "A Shared Representation for Photorealistic Driving Sim

VITA lab at EPFL 7 Oct 13, 2022
People log into different sites every day to get information and browse through these sites one by one

HyperLink People log into different sites every day to get information and browse through these sites one by one. And they are exposed to advertisemen

0 Feb 17, 2022
The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate.

The lightweight PyTorch wrapper for high-performance AI research. Scale your models, not the boilerplate. Website • Key Features • How To Use • Docs •

Pytorch Lightning 21.1k Jan 08, 2023
Learning Representations that Support Robust Transfer of Predictors

Transfer Risk Minimization (TRM) Code for Learning Representations that Support Robust Transfer of Predictors Prepare the Datasets Preprocess the Scen

Yilun Xu 15 Dec 07, 2022
This repository contains all data used for writing a research paper Multiple Object Trackers in OpenCV: A Benchmark, presented in ISIE 2021 conference in Kyoto, Japan.

OpenCV-Multiple-Object-Tracking Python is version 3.6.7 to install opencv: pip uninstall opecv-python pip uninstall opencv-contrib-python pip install

6 Dec 19, 2021
Accelerating BERT Inference for Sequence Labeling via Early-Exit

Sequence-Labeling-Early-Exit Code for ACL 2021 paper: Accelerating BERT Inference for Sequence Labeling via Early-Exit Requirement: Please refer to re

李孝男 23 Oct 14, 2022
LibFewShot: A Comprehensive Library for Few-shot Learning.

LibFewShot Make few-shot learning easy. Supported Methods Meta MAML(ICML'17) ANIL(ICLR'20) R2D2(ICLR'19) Versa(NeurIPS'18) LEO(ICLR'19) MTL(CVPR'19) M

<a href=[email protected]&L"> 603 Jan 05, 2023
Subpopulation detection in high-dimensional single-cell data

PhenoGraph for Python3 PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph ("network") repr

Dana Pe'er Lab 42 Sep 05, 2022
Implementing Vision Transformer (ViT) in PyTorch

Lightning-Hydra-Template A clean and scalable template to kickstart your deep learning project 🚀 ⚡ 🔥 Click on Use this template to initialize new re

2 Dec 24, 2021
Converting CPT to bert form for use

cpt-encoder 将CPT转成bert形式使用 说明 刚刚刷到又出了一种模型:CPT,看论文显示,在很多中文任务上性能比mac bert还好,就迫不及待想把它用起来。 根据对源码的研究,发现该模型在做nlu建模时主要用的encoder部分,也就是bert,因此我将这部分权重转为bert权重类型

黄辉 1 Oct 14, 2021
Asymmetric metric learning for knowledge transfer

Asymmetric metric learning This is the official code that enables the reproduction of the results from our paper: Asymmetric metric learning for knowl

20 Dec 06, 2022
Python package to add text to images, textures and different backgrounds

nider Python package for text images generation and watermarking Free software: MIT license Documentation: https://nider.readthedocs.io. nider is an a

Vladyslav Ovchynnykov 131 Dec 30, 2022
Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Transformer in Transformer Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image c

Phil Wang 272 Dec 23, 2022
Fantasy Points Prediction and Dream Team Formation

Fantasy-Points-Prediction-and-Dream-Team-Formation Collected Data from open source resources that have over 100 Parameters for predicting cricket play

Akarsh Singh 2 Sep 13, 2022
Pytorch Implementation of Auto-Compressing Subset Pruning for Semantic Image Segmentation

Pytorch Implementation of Auto-Compressing Subset Pruning for Semantic Image Segmentation Introduction ACoSP is an online pruning algorithm that compr

Merantix 8 Dec 07, 2022
Galileo library for large scale graph training by JD

近年来,图计算在搜索、推荐和风控等场景中获得显著的效果,但也面临超大规模异构图训练,与现有的深度学习框架Tensorflow和PyTorch结合等难题。 Galileo(伽利略)是一个图深度学习框架,具备超大规模、易使用、易扩展、高性能、双后端等优点,旨在解决超大规模图算法在工业级场景的落地难题,提

JD Galileo Team 128 Nov 29, 2022
Deep Federated Learning for Autonomous Driving

FADNet: Deep Federated Learning for Autonomous Driving Abstract Autonomous driving is an active research topic in both academia and industry. However,

AIOZ AI 12 Dec 01, 2022