SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Last update: Nov 09, 2022

Overview

Introduction

This codebase contains the source code of ARES, the Python implementation of our SIGIR 2022 paper.

Chen, Jia, et al. "Axiomatically Regularized Pre-training for Ad hoc Search." To Appear in the Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2022.

Requirements

python 3.7
torch==1.9.0
transformers==4.9.2
tqdm, nltk, numpy, boto3
trec_eval for evaluation on TREC DL 2019
anserini for generating "RANK" axiom scores

Why this repo?

In this repo, you can pre-train ARES_simple and Transformer_ICT models, and fine-tune any pre-trained model that has the same architecture as BERT. The papers are listed as follows:

You can download the pre-trained ARES_simple checkpoint from Google Drive and extract it.

Pre-training Data

Download data

Download the MS MARCO corpus from the official website.
Download the ADORE+STAR Top100 Candidates files from this repo.

Pre-process data

To save memory, we store most files using the numpy memmap or jsonl format in the ./preprocess directory.

Document files:

doc_token_ids.memmap: each line is the token ids for a document
docid2idx.json: {docid: memmap_line_id}
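
A minimal sketch of reading one document back from these two files. The dtype (int32), the fixed row width (`MAX_LEN`), the use of 0 as padding, and the toy docids are assumptions made for illustration; the description above only states that each memmap line holds one document's token ids and that docid2idx.json maps a docid to its line.

```python
import json
import os
import tempfile

import numpy as np

MAX_LEN = 8        # assumed fixed row width (hypothetical)
DTYPE = np.int32   # assumed storage dtype (hypothetical)


def load_doc_tokens(memmap_path, idx_path, docid, max_len=MAX_LEN):
    """Return the token ids stored for `docid`, dropping the assumed 0-padding."""
    with open(idx_path) as f:
        docid2idx = json.load(f)
    # Infer the number of rows from the file size, since the memmap stores no header.
    n_rows = os.path.getsize(memmap_path) // (max_len * np.dtype(DTYPE).itemsize)
    docs = np.memmap(memmap_path, dtype=DTYPE, mode="r", shape=(n_rows, max_len))
    row = np.array(docs[docid2idx[docid]])  # copy one row out of the mapped file
    return row[row != 0].tolist()


def build_toy_files(tmp_dir):
    """Write a two-document toy corpus in the assumed layout."""
    memmap_path = os.path.join(tmp_dir, "doc_token_ids.memmap")
    idx_path = os.path.join(tmp_dir, "docid2idx.json")
    docs = np.memmap(memmap_path, dtype=DTYPE, mode="w+", shape=(2, MAX_LEN))
    docs[0, :3] = [101, 2023, 102]
    docs[1, :4] = [101, 7592, 2088, 102]
    docs.flush()
    del docs
    with open(idx_path, "w") as f:
        json.dump({"D100": 0, "D200": 1}, f)
    return memmap_path, idx_path


with tempfile.TemporaryDirectory() as tmp_dir:
    mm_path, idx_path = build_toy_files(tmp_dir)
    toy_tokens = load_doc_tokens(mm_path, idx_path, "D200")
```

Because the memmap is opened read-only and only one row is copied, the full corpus never has to fit in memory, which matches the stated goal of saving memory.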

Query files:

queries.doctrain.jsonl: MS MARCO training queries, {"id": qid, "ids": token_ids} for each line
queries.docdev.jsonl: MS MARCO validation queries, {"id": qid, "ids": token_ids} for each line
queries.dl2019.jsonl: TREC DL 2019 queries, {"id": qid, "ids": token_ids} for each line
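
A small sketch of loading a query file in this one-JSON-object-per-line layout. The qids and token ids below are made up, and storing the qid as a string is an assumption.

```python
import io
import json


def read_queries(lines):
    """Map qid -> token ids from an iterable of JSON lines."""
    queries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        queries[record["id"]] = record["ids"]
    return queries


# Toy stand-in for an open file handle on a queries.*.jsonl file.
toy_file = io.StringIO(
    '{"id": "1001", "ids": [101, 2054, 102]}\n'
    '{"id": "1002", "ids": [101, 2129, 2079, 102]}\n'
)
toy_queries = read_queries(toy_file)
```

With a real file, `read_queries(open(path))` would work the same way, since the function only iterates over lines.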

Human label files:

msmarco-doctrain-qrels.tsv: qid 0 docid 1 for the training set
dev-qrels.txt: qid relevant_docid for the validation set
2019qrels-docs.txt: qid relevant_docid for the TREC DL 2019 set
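
The two label layouts listed above can be read with one helper. This is a sketch under assumptions: fields are taken to be whitespace-separated, two-column lines are given an implied label of 1, and the ids in the example are invented.

```python
from collections import defaultdict


def read_qrels(lines):
    """Map qid -> {docid: label} for 4-column or 2-column label lines."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) == 4:      # qid 0 docid label
            qid, _, docid, label = parts
            qrels[qid][docid] = int(label)
        elif len(parts) == 2:    # qid relevant_docid
            qid, docid = parts
            qrels[qid][docid] = 1
        elif parts:
            raise ValueError(f"unexpected qrels line: {line!r}")
    return dict(qrels)


train_qrels = read_qrels(["3 0 D312959 1", "5 0 D140227 1"])
dev_qrels = read_qrels(["174249 D3126539"])
```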

Top 100 candidate files:

train.rank.tsv, dev.rank.tsv, test.rank.tsv: qid docid rank for each line
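
A sketch of grouping candidate lines by query. It assumes the three fields are separated by whitespace (tabs in a .tsv file) and that a smaller rank means a better position; the ids are toy values.

```python
from collections import defaultdict


def read_candidates(lines):
    """Map qid -> docids ordered by ascending rank."""
    ranked = defaultdict(list)
    for line in lines:
        parts = line.split()
        if not parts:
            continue
        qid, docid, rank = parts
        ranked[qid].append((int(rank), docid))
    # Sort by rank so the result does not depend on the order of lines in the file.
    return {qid: [d for _, d in sorted(pairs)] for qid, pairs in ranked.items()}


toy_candidates = read_candidates([
    "q1\tD3\t2",
    "q1\tD7\t1",
    "q2\tD5\t1",
])
```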

Pseudo queries and axiomatic features:

doc2qs.jsonl: {"docid": docid, "queries": [qids]} for each line
sample_qs_token_ids.memmap: each line is the token ids for a pseudo query
sample_qid2id.json: {qid: memmap_line_id}
<axiom>.memmap: <axiom> is one of ['rank', 'prox-1', 'prox-2', 'rep-ql', 'rep-tfidf', 'reg', 'stm-1', 'stm-2', 'stm-3']; each line is the axiomatic score for a query

Quick Start

To accelerate training, we use distributed (multi-GPU) parallel training. The scripts for pre-training and fine-tuning are as follows:

Pre-training

export BERT_DIR=/path/to/bert-base/
export XGB_DIR=/path/to/xgboost.model

cd pretrain

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=6 --nnodes=1 train.py \
        --model_type ARES \
        --PRE_TRAINED_MODEL_NAME "$BERT_DIR" \
        --gpu_num 6 --world_size 6 \
        --MLM --axiom REP RANK REG PROX STM \
        --clf_model "$XGB_DIR"

Here, --model_type can be ARES or ICT.

Zero-shot evaluation (based on AS top100)

export MODEL_DIR=/path/to/ares-simple/
export CKPT_NAME=ares.ckpt

cd finetune

CUDA_VISIBLE_DEVICES=0 python train.py \
        --test \
        --PRE_TRAINED_MODEL_NAME "$MODEL_DIR" \
        --model_type ARES \
        --model_name ARES_simple \
        --load_ckpt \
        --model_path "$CKPT_NAME"

You can get:

#####################
<----- MS Dev ----->
MRR @10: 0.2991
MRR @100: 0.3130
QueriesRanked: 5193
#####################

on MS MARCO dev set and:

#############################
<--------- DL 2019 --------->
QueriesRanked: 43
nDCG @10: 0.5955
nDCG @100: 0.4863
#############################

on DL 2019 set.
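
For readers unfamiliar with the two metrics in this output, here is a sketch of their standard definitions. It is not the evaluation code of this repo or of trec_eval: the linear gain in nDCG, averaging MRR over all ranked queries, and the toy rankings are assumptions.

```python
import math


def mrr_at_k(ranking, qrels, k=10):
    """Mean over queries of 1/position of the first relevant doc in the top k."""
    total = 0.0
    for qid, docids in ranking.items():
        relevant = qrels.get(qid, {})
        for pos, docid in enumerate(docids[:k], start=1):
            if relevant.get(docid, 0) > 0:
                total += 1.0 / pos
                break  # only the first relevant document counts
    return total / len(ranking) if ranking else 0.0


def ndcg_at_k(docids, labels, k=10):
    """nDCG@k for one query, using linear gains (an assumed gain function)."""
    dcg = sum(labels.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(docids[:k], start=1))
    ideal = sorted(labels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


toy_ranking = {"q1": ["D7", "D3"], "q2": ["D5", "D9"]}
toy_qrels = {"q1": {"D3": 1}, "q2": {"D5": 1}}
toy_mrr = mrr_at_k(toy_ranking, toy_qrels)      # (1/2 + 1/1) / 2
toy_ndcg = ndcg_at_k(["D7", "D3"], {"D3": 1})   # (1 / log2(3)) / 1
```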

Fine-tuning

export MODEL_DIR=/path/to/ares-simple/

cd finetune

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py \
        --model_type ARES \
        --distributed_train \
        --PRE_TRAINED_MODEL_NAME "$MODEL_DIR" \
        --gpu_num 4 --world_size 4 \
        --model_name ARES_simple

Visualization

export MODEL_DIR=/path/to/ares-simple/
export SAVE_DIR=/path/to/output/
export CKPT_NAME=ares.ckpt

cd visualization

CUDA_VISIBLE_DEVICES=0 python visual.py \
    --PRE_TRAINED_MODEL_NAME "$MODEL_DIR" \
    --model_name ARES_simple \
    --visual_q_num 1 \
    --visual_d_num 5 \
    --save_path "$SAVE_DIR" \
    --model_path "$CKPT_NAME"

Results

Zero-shot performance:

Model Name	MS MARCO MRR@10	MS MARCO MRR@100	DL nDCG@10	DL nDCG@100	COVID	EQ
BM25	0.2962	0.3107	0.5776	0.4795	0.4857	0.6690
BERT	0.1820	0.2012	0.4059	0.4198	0.4314	0.6055
PROP_wiki	0.2429	0.2596	0.5088	0.4525	0.4857	0.5991
PROP_marco	0.2763	0.2914	0.5317	0.4623	0.4829	0.6454
ARES_strict	0.2630	0.2785	0.4942	0.4504	0.4786	0.6923
ARES_hard	0.2627	0.2780	0.5189	0.4613	0.4943	0.6822
ARES_simple	0.2991	0.3130	0.5955	0.4863	0.4957	0.6916

Few-shot performance:

Visualization (attribution values have been normalized within a document):

Citation

If you find our work useful, please consider starring this repo and citing our paper:

@inproceedings{chen2022axiomatically,
  title={Axiomatically Regularized Pre-training for Ad hoc Search},
  author={Chen, Jia and Liu, Yiqun and Fang, Yan and Mao, Jiaxin and Fang, Hui and Yang, Shenghao and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

Notice

Please make sure that all the pre-trained model parameters have been loaded correctly, or the zero-shot and the fine-tuning performance will be greatly impacted.
We welcome anyone who would like to contribute to this repo. 🤗
If you have any other questions, please feel free to contact me via [email protected] or open an issue.
Code for data preprocessing will come soon. Please stay tuned~
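
Regarding the first point in this notice, one way to check that parameters were loaded is to compare the parameter names in the checkpoint with those the model expects. The sketch below is an illustration only: the parameter names are hypothetical, and stripping a `module.` prefix reflects the common case of checkpoints saved from a `DistributedDataParallel` wrapper, which is not something stated about the ARES checkpoints. In PyTorch, `model.load_state_dict(state, strict=False)` reports the same information as `missing_keys` and `unexpected_keys`.

```python
def strip_prefix(keys, prefix="module."):
    """Remove a wrapper prefix from parameter names where present."""
    return [k[len(prefix):] if k.startswith(prefix) else k for k in keys]


def compare_keys(model_keys, ckpt_keys):
    """Report names the model expects but the checkpoint lacks, and vice versa."""
    model_set = set(model_keys)
    ckpt_set = set(strip_prefix(ckpt_keys))
    return {
        "missing": sorted(model_set - ckpt_set),
        "unexpected": sorted(ckpt_set - model_set),
    }


# Hypothetical parameter names, for illustration only.
model_keys = ["bert.embeddings.word_embeddings.weight", "cls.weight", "cls.bias"]
ckpt_keys = ["module.bert.embeddings.word_embeddings.weight", "module.cls.weight"]
key_report = compare_keys(model_keys, ckpt_keys)
```

A non-empty "missing" list means some weights would stay at their random initialization, which is the situation the notice warns about.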
