Train Dense Passage Retriever (DPR) with a single GPU

Overview

Gradient Cached Dense Passage Retrieval

Gradient Cached Dense Passage Retrieval (GC-DPR) - is an extension of the original DPR library.
We introduce Gradient Cache technique which enables scaling batch size of the contrastive form loss and therefore the number of in-batch negatives far beyond GPU RAM limitation. With GC-DPR , you can reproduce the state-of-the-art open Q&A system trained on 8 x 32GB V100 GPUs with a single 11 GB GPU.

To use Gradient Cache in your own project, checkout our GradCache package.

Gradient Cache Technique

  • Retriever training quality depends on its effective batch size.
  • The contrastive form loss (NLL with in-batch negatives) conditions on the entire batch and requires fitting the entire batch into GPU memory.
  • Gradient Cache technique separates back propagation into loss to representation part and representation to encoder model part. It uses an extra forward pass without gradient tracking to compute representations and then store representations' gradients in cache.
  • The cached representation gradients remove data dependency in encoder back propagation and allows updating encoders one sub-batch at a time to fit GPU memory.

Details can be found in our paper Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup , to appeaer at The 6th Workshop on Representation Learning for NLP (RepL4NLP) 2021. Talk to us at ACL if you are interested!

@inproceedings{gao2021scaling,
     title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
     author={Luyu Gao, Yunyi Zhang, Jiawei Han, Jamie Callan},
     booktitle ={Proceedings of the 6th Workshop on Representation Learning for NLP},
     year={2021},
}

Understanding the Code

One of the goals of this repo is to help clarify our gradient cache technique. The initial commit of the repo is a snapshot of DPR. You can run diff against it to see all changes made in GC-DPR. In general, we have modified train and inference codes. The model code is kept unchanged, and therefore users should be able to use trained checkpoints interchangeably between DPR and GC-DPR.

Additional Features

Several changes were made to speed and/or facilitate easy reproduction of the original result. You may get some speed up using GC-DPR even when not using gradient.

  • Automatic Mix Precision (AMP): GC-DPR replaces apex AMP with Pytorch native AMP. It is supported for both training and inference.
  • Multi-process Data Loading: batches are loaded using Pytorch's DataLoader during training and inference in loader processes, which largely avoids GPU idle waiting for text tokenization and batch creation.
  • Gradient Compensation: Pytorch's Distributed Data Parallel (DDP) average reduces model parameter gradients across GPUs while dense retriever training all-gather representations to compute loss. This means the actual gradient will be divided by the number of GPUs. For reproducibility, we scale/unscale gradient according to the number of available devices to replicate this effect. Another option is to use _register_comm_hook on the DDP object. Scaling loss is opted as it is easier to understand.

The follwoing instructions are adapted from the orignial DPR's.

Installation

You can clone this repo and install with pip.

git clone https://github.com/luyug/GC-DPR.git
cd GC-DPR
pip install .

GC-DPR is tested on Python 3.8.5, PyTorch 1.6.0 and Huggingface transformers 3.0.2.

Resources & Data formats

You can follow the original DPR instructions:

First, you need to prepare data for either retriever or reader training. Each of the DPR components has its own input/output data formats. You can see format descriptions below. DPR provides NQ & Trivia preprocessed datasets (and model checkpoints) to be downloaded from the cloud using our data/download_data.py tool. One needs to specify the resource name to be downloaded. Run 'python data/download_data.py' to see all options.

python data/download_data.py \
   --resource {key from download_data.py's RESOURCES_MAP}  \
   [optional --output_dir {your location}]  

The resource name matching is prefix-based. So if you need to download all data resources, just use --resource data.

Retriever input data format

The data format of the Retriever training data is JSON.
It contains pools of 2 types of negative passages per question, as well as positive passages and some additional information.

[  
  {  
   "question": "....",  
   "answers": ["...", "...", "..."],  
   "positive_ctxs": [{  
      "title": "...",  
      "text": "...."  
   }],  
   "negative_ctxs": ["..."],  
   "hard_negative_ctxs": ["..."]  
  },  
  ...  
]  

Elements' structure for negative_ctxs & hard_negative_ctxs is exactly the same as for positive_ctxs. The preprocessed data available for downloading also contains some extra attributes which may be useful for model modifications (like bm25 scores per passage). Still, they are not currently in use by DPR.

You can download prepared NQ dataset used in the paper by using 'data.retriever.nq' key prefix. Only dev & train subsets are available in this format. We also provide question & answers only CSV data files for all train/dev/test splits. Those are used for the model evaluation since our NQ preprocessing step looses a part of original samples set. Use 'data.retriever.qas.*' resource keys to get respective sets for evaluation.

python data/download_data.py  
   --resource data.retriever  
   [optional --output_dir {your location}]  

Retriever training

In order to start training on one machine:

python train_dense_encoder.py  
   --encoder_model_type {hf_bert | pytext_bert | fairseq_roberta}  
   --pretrained_model_cfg {bert-base-uncased| roberta-base}  
   --train_file {train files glob expression}  
   --dev_file {dev files glob expression}  
   --output_dir {dir to save checkpoints}  
   --grad_cache
   --q_chunk_size {question sub-batch sizes}
   --ctx_chunk_size {context sub-batch sizes}
   --fp16  # if train with mixed precision

We introduce the following parameters to for gradient cached training,

  • --grad_cache activates gradient cached training
  • --q_chunk_size sub-batch size for updating the question encoder, default to 16
  • --ctx_chunk_size sub-batch size for updating context encoder, default to 8

The default settings work for 11GB RTX 2080 ti.

Retriever inference

You can follow the original DPR instructions:

Generating representation vectors for the static documents dataset is a highly parallelizable process which can take up to a few days if computed on a single GPU. You might want to use multiple available GPU servers by running the script on each of them independently and specifying their own shards.

python generate_dense_embeddings.py \
   --model_file {path to biencoder checkpoint} \
   --ctx_file {path to psgs_w100.tsv file} \
   --shard_id {shard_num, 0-based} --num_shards {total number of shards} \
   --out_file ${out files location + name PREFX}  \
   --fp16  # if use mixed precision inference

In addition, we provide the --fp16 flag which will run encoder inference with mixed precision using the autocast context manager. Mixed precision inference gives roughly 2x speed up. The generated embeddings will be converted to full precision before saving for index search on x86 CPU.

Retriever validation against the entire set of documents:

python dense_retriever.py \
   --model_file ${path to biencoder checkpoint} \
   --ctx_file  {path to all documents .tsv file} \
   --qa_file {path to test|dev .csv file} \
   --encoded_ctx_file "{encoded document files glob expression}" \
   --out_file {path to output json file with results} \
  --n-docs 200  

We provide in addition flags --encode_q_and_save, --q_encoding_path, --re_encode_q which allows separation of question encoding and index search. One could therefore opt to run question encoding on a GPU server and search on a CPU server.

The tool writes retrieved results for subsequent reader model training into specified out_file.
It is a json with the following format:

[  
    {  
        "question": "...",  
        "answers": ["...", "...", ... ],  
        "ctxs": [  
            {  
                "id": "...", # passage id from database tsv file  
                "title": "",  
                "text": "....",  
                "score": "...",  # retriever score  
                "has_answer": true|false  
     },  
]  

Results are sorted by their similarity score, from most relevant to least relevant.

Optional reader model input data pre-processing.

Since the reader model uses a specific combination of positive and negative passages for each question and also needs to know the answer span location in the bpe-tokenized form, it is recommended to preprocess and serialize the output from the retriever model before starting the reader training. This saves hours at train time.
If you don't run this preprocessing, the Reader training pipeline checks if the input file(s) extension is .pkl and if not, preprocesses and caches results automatically in the same folder as the original files.

python preprocess_reader_data.py \
   --retriever_results {path to a file with results from dense_retriever.py} \
   --gold_passages {path to gold passages info} \
   --do_lower_case \
   --pretrained_model_cfg {pretrained_cfg} \
   --encoder_model_type {hf_bert | pytext_bert | fairseq_roberta} \
   --out_file {path to for output files} \
   --is_train_set  

Reader model training

python train_reader.py \
   --encoder_model_type {hf_bert | pytext_bert | fairseq_roberta} \
   --pretrained_model_cfg {bert-base-uncased| roberta-base} \
   --train_file "{globe expression for train files from #5 or #6 above}" \
   --dev_file "{globe expression for train files}" \
   --output_dir {path to output dir}  

Notes:

  • if you use pytext_bert or fairseq_roberta, you need to download pre-trained weights and specify --pretrained_file parameter. Specify the dir location of the downloaded files for 'pretrained.fairseq.roberta-base' resource prefix for RoBERTa model or the file path for pytext BERT (resource name 'pretrained.pytext.bert-base.model').
  • Reader training pipeline does model validation every --eval_step batches
  • As the bi-encoder, it saves model checkpoints on every validation
  • Like the bi-encoder, there is no stop condition besides a specified amount of epochs to train.
  • Like the bi-encoder, there is no best checkpoint selection logic, so one needs to select that based on dev set validation performance which is logged in the train process output.
  • Our current code only calculates the Exact Match metric.

Reader model inference

In order to make an inference, run train_reader.py without specifying train_file. Make sure to specify model_file with the path to the checkpoint, passages_per_question_predict with number of passages per question (being used when saving the prediction file), and eval_top_docs with a list of top passages threshold values from which to choose question's answer span (to be printed as logs). The example command line is as follows.

python train_reader.py \
  --prediction_results_file {some dir}/results.json \
  --eval_top_docs 10 20 40 50 80 100 \
  --dev_file {path to data.retriever_results.nq.single.test file} \
  --model_file {path to the reader checkpoint} \
  --dev_batch_size 80 \
  --passages_per_question_predict 100 \
  --sequence_length 350  

Best hyperparameter settings

e2e example with the best settings for NQ dataset. Most of this section follows DPR's instructions. The retriever training session is adjusted for a single 11 GB GPU.

1. Download all retriever training and validation data:

python data/download_data.py --resource data.wikipedia_split.psgs_w100  
python data/download_data.py --resource data.retriever.nq  
python data/download_data.py --resource data.retriever.qas.nq  

2. Biencoder(Retriever) training in single set mode.

Train on a single 11GB RTX 2080 ti GPU,

python train_dense_encoder.py \
   --max_grad_norm 2.0 \
   --encoder_model_type hf_bert \
   --pretrained_model_cfg bert-base-uncased \
   --seed 12345 \
   --sequence_length 256 \
   --warmup_steps 1237 \
   --batch_size 128 \
   --do_lower_case \
   --train_file "{glob expression to train files downloaded as 'data.retriever.nq-train' resource}" \
   --dev_file "{glob expression to dev files downloaded as 'data.retriever.nq-dev' resource}" \
   --output_dir {your output dir} \
   --learning_rate 2e-05 \
   --num_train_epochs 40 \
   --dev_batch_size 16 \
   --val_av_rank_start_epoch 30 \
   --fp16 \
   --grad_cache \
   --global_loss_buf_sz 2097152 \
   --val_av_rank_max_qs 1000

This takes less than 2 days to complete the training for 40 epochs. The best checkpoint for bi-encoder is usually the last, but it should not be so different if you take any after epoch ~ 25. We use smaller value for --val_av_rank_max_qs to speed up validation.

3. Generate embeddings for Wikipedia.

Just use instructions for "Generating representations for large documents set". It takes about 40 minutes to produce 21 mln passages representation vectors on 50 2 GPU servers.

4. Evaluate retrieval accuracy and generate top passage results for each of the train/dev/test datasets.

python dense_retriever.py \
   --model_file {path to checkpoint file from step 1} \
   --ctx_file {path to psgs_w100.tsv file} \
   --qa_file {path to test/dev qas file} \
   --encoded_ctx_file "{glob expression for generated files from step 3}" \
   --out_file {path for output json files} \
   --n-docs 100 \
   --validation_workers 32 \
   --batch_size 64  

Adjust batch_size based on the available number of GPUs, 64 should work for 2 GPU server.

5. Reader training

Here's the DPR's original instruction for reader training. We trained reader model for large datasets using a single 8 GPU x 32 GB server.

python train_reader.py \
   --seed 42 \
   --learning_rate 1e-5 \
   --eval_step 2000 \
   --do_lower_case \
   --eval_top_docs 50 \
   --encoder_model_type hf_bert \
   --pretrained_model_cfg bert-base-uncased \
   --train_file "{glob expression for train output files from step 4}" \
   --dev_file {glob expression for dev output file from step 4} \
   --warmup_steps 0 \
   --sequence_length 350 \
   --batch_size 16 \
   --passages_per_question 24 \
   --num_train_epochs 100000 \
   --dev_batch_size 72 \
   --passages_per_question_predict 50 \
   --output_dir {your save dir path}  

We found that using the learning rate above works best with static schedule, so one needs to stop training manually based on evaluation performance dynamics.
Our best results were achieved on 16-18 training epochs or after ~60k model updates.

We provide all input and intermediate results for e2e pipeline for NQ dataset and most of the similar resources for Trivia.

Misc.

  • TREC validation requires regexp based matching. We support only retriever validation in the regexp mode. See --match parameter option.
  • WebQ validation requires entity normalization, which is not included as of now.

Reference

If you find GC-DPR helpful, please consider citing our paper:

@inproceedings{gao2021scaling,
      title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
      author={Luyu Gao, Yunyi Zhang, Jiawei Han, Jamie Callan},
      booktitle ={Proceedings of the 6th Workshop on Representation Learning for NLP},
      year={2021},
}

Also consider citing the original DPR paper:

@misc{karpukhin2020dense,  
    title={Dense Passage Retrieval for Open-Domain Question Answering},  
    author={Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},  
    year={2020},  
    eprint={2004.04906},  
    archivePrefix={arXiv},  
    primaryClass={cs.CL}  
}  

License

GC-DPR inherits DPR's CC-BY-NC 4.0 licensed as of now.

Owner
Luyu Gao
NLP PhD [email protected], CMU
Luyu Gao
Poisson Surface Reconstruction for LiDAR Odometry and Mapping

Poisson Surface Reconstruction for LiDAR Odometry and Mapping Surfels TSDF Our Approach Table: Qualitative comparison between the different mapping te

Photogrammetry & Robotics Bonn 305 Dec 21, 2022
🌾 PASTIS 🌾 Panoptic Agricultural Satellite TIme Series

🌾 PASTIS 🌾 Panoptic Agricultural Satellite TIme Series (optical and radar) The PASTIS Dataset Dataset presentation PASTIS is a benchmark dataset for

86 Jan 04, 2023
Implementation of Nyström Self-attention, from the paper Nyströmformer

Nyström Attention Implementation of Nyström Self-attention, from the paper Nyströmformer. Yannic Kilcher video Install $ pip install nystrom-attention

Phil Wang 95 Jan 02, 2023
This is the repository for paper NEEDLE: Towards Non-invertible Backdoor Attack to Deep Learning Models.

This is the repository for paper NEEDLE: Towards Non-invertible Backdoor Attack to Deep Learning Models.

1 Oct 25, 2021
An algorithm study of the 6th iOS 10 set of Boost Camp Web Mobile

알고리즘 스터디 🔥 부스트캠프 웹모바일 6기 iOS 10조의 알고리즘 스터디 입니다. 개인적인 사정 등으로 S034, S055만 참가하였습니다. 스터디 목적 상진: 코테 합격 + 부캠끝나고 아침에 일어나기 위해 필요한 사이클 기완: 꾸준하게 자리에 앉아 공부하기 +

2 Jan 11, 2022
Code for Greedy Gradient Ensemble for Visual Question Answering (ICCV 2021, Oral)

Greedy Gradient Ensemble for De-biased VQA Code release for "Greedy Gradient Ensemble for Robust Visual Question Answering" (ICCV 2021, Oral). GGE can

21 Jun 29, 2022
official implemntation for "Contrastive Learning with Stronger Augmentations"

CLSA CLSA is a self-supervised learning methods which focused on the pattern learning from strong augmentations. Copyright (C) 2020 Xiao Wang, Guo-Jun

Lab for MAchine Perception and LEarning (MAPLE) 47 Nov 29, 2022
Geometry-Aware Learning of Maps for Camera Localization (CVPR2018)

Geometry-Aware Learning of Maps for Camera Localization This is the PyTorch implementation of our CVPR 2018 paper "Geometry-Aware Learning of Maps for

NVIDIA Research Projects 321 Nov 26, 2022
Official Matlab Implementation for "Tiny Obstacle Discovery by Occlusion-aware Multilayer Regression", TIP 2020

Tiny Obstacle Discovery by Occlusion-aware Multilayer Regression Official Matlab Implementation for "Tiny Obstacle Discovery by Occlusion-aware Multil

Xuefeng 5 Jan 15, 2022
LF-YOLO (Lighter and Faster YOLO) is used to detect defect of X-ray weld image.

This project is based on ultralytics/yolov3. LF-YOLO (Lighter and Faster YOLO) is used to detect defect of X-ray weld image. Download $ git clone http

26 Dec 13, 2022
ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information

ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information This repository contains code, model, dataset for ChineseBERT at ACL2021. Ch

413 Dec 01, 2022
PyTorch implementation of the wavelet analysis from Torrence & Compo

Continuous Wavelet Transforms in PyTorch This is a PyTorch implementation for the wavelet analysis outlined in Torrence and Compo (BAMS, 1998). The co

Tom Runia 262 Dec 21, 2022
Experiments for Neural Flows paper

Neural Flows: Efficient Alternative to Neural ODEs [arxiv] TL;DR: We directly model the neural ODE solutions with neural flows, which is much faster a

54 Dec 07, 2022
Self-Supervised Pre-Training for Transformer-Based Person Re-Identification

Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [pdf] The official repository for Self-Supervised Pre-Training for Transfo

Hao Luo 116 Jan 04, 2023
Object detection evaluation metrics using Python.

Object detection evaluation metrics using Python.

Louis Facun 2 Sep 06, 2022
Answering Open-Domain Questions of Varying Reasoning Steps from Text

This repository contains the authors' implementation of the Iterative Retriever, Reader, and Reranker (IRRR) model in the EMNLP 2021 paper "Answering Open-Domain Questions of Varying Reasoning Steps

26 Dec 22, 2022
Trying to understand alias-free-gan.

alias-free-gan-explanation Trying to understand alias-free-gan in my own way. [Chinese Version 中文版本] CC-BY-4.0 License. Tzu-Heng Lin motivation of thi

Tzu-Heng Lin 12 Mar 17, 2022
An example to implement a new backbone with OpenMMLab framework.

Backbone example on OpenMMLab framework English | 简体中文 Introduction This is an template repo about how to use OpenMMLab framework to develop a new bac

Ma Zerun 22 Dec 29, 2022
Implementation for paper "Towards the Generalization of Contrastive Self-Supervised Learning"

Contrastive Self-Supervised Learning on CIFAR-10 Paper "Towards the Generalization of Contrastive Self-Supervised Learning", Weiran Huang, Mingyang Yi

Weiran Huang 13 Nov 30, 2022
Face Detection and Alignment using Multi-task Cascaded Convolutional Networks (MTCNN)

Face-Detection-with-MTCNN Face detection is a computer vision problem that involves finding faces in photos. It is a trivial problem for humans to sol

Chetan Hirapara 3 Oct 07, 2022