ARES

Introduction

This codebase contains the source code of ARES, the Python-based implementation of our SIGIR 2022 paper, Axiomatically Regularized Pre-training for Ad hoc Search.

Requirements

  • python 3.7
  • torch==1.9.0
  • transformers==4.9.2
  • tqdm, nltk, numpy, boto3
  • trec_eval for evaluation on TREC DL 2019
  • anserini for generating "RANK" axiom scores

Why this repo?

In this repo, you can pre-train the ARESsimple and TransformerICT models, and fine-tune any pre-trained model that shares BERT's architecture.

You can download the pre-trained ARESsimple checkpoint from Google Drive and extract it.

Pre-training Data

Download data

Download the MS MARCO corpus from the official website.
Download the ADORE+STAR Top100 Candidates files from this repo.

Pre-process data

To save memory, we store most files in numpy memmap or jsonl format in the ./preprocess directory; a minimal loading sketch follows the file lists below.

Document files:

  • doc_token_ids.memmap: each line is the token ids for a document
  • docid2idx.json: {docid: memmap_line_id}

Query files:

  • queries.doctrain.jsonl: MS MARCO training queries, one {"id": qid, "ids": token_ids} object per line
  • queries.docdev.jsonl: MS MARCO validation queries, one {"id": qid, "ids": token_ids} object per line
  • queries.dl2019.jsonl: TREC DL 2019 queries, one {"id": qid, "ids": token_ids} object per line

Human label files:

  • msmarco-doctrain-qrels.tsv: qid 0 docid 1 per line (TREC qrels format) for the training set
  • dev-qrels.txt: qid relevant_docid per line for the validation set
  • 2019qrels-docs.txt: qid relevant_docid per line for the TREC DL 2019 set

Top 100 candidate files:

  • train.rank.tsv, dev.rank.tsv, test.rank.tsv: qid docid rank for each line

Pseudo queries and axiomatic features:

  • doc2qs.jsonl: {"docid": docid, "queries": [qids]} for each line
  • sample_qs_token_ids.memmap: each line is the token ids for a pseudo query
  • sample_qid2id.json: {qid: memmap_line_id}
  • axiom.memmap: axiom is one of ['rank', 'prox-1', 'prox-2', 'rep-ql', 'rep-tfidf', 'reg', 'stm-1', 'stm-2', 'stm-3']; each line is the axiomatic score for a pseudo query
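
For illustration, here is a minimal sketch of loading these files in Python. The file names follow the lists above; the dtype and the maximum document length are assumptions, so use the values chosen when the memmaps were created:

import json
import numpy as np

MAX_DOC_LEN = 512  # assumed; must match the length used at preprocessing time

# Each memmap row holds the token ids of one document.
doc_ids = np.memmap("preprocess/doc_token_ids.memmap",
                    dtype=np.int32, mode="r").reshape(-1, MAX_DOC_LEN)

with open("preprocess/docid2idx.json") as f:
    docid2idx = json.load(f)  # {docid: memmap_line_id}

def doc_tokens(docid):
    """Token ids of one document, looked up by its MS MARCO docid."""
    return doc_ids[docid2idx[docid]]

# Query files are jsonl: one {"id": qid, "ids": token_ids} object per line.
queries = {}
with open("preprocess/queries.docdev.jsonl") as f:
    for line in f:
        record = json.loads(line)
        queries[record["id"]] = record["ids"]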

Quick Start

Example Usage

import torch
from model.modeling import ARESReranker

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_path = "/path/to/ares-simple/"  # the extracted checkpoint directory

model = ARESReranker.from_pretrained(model_path).to(device)

query1 = "What is the best way to get to the airport"
query2 = "what do you like to eat?"

doc1 = "The best way to get to the airport is to take the bus"
doc2 = "I like to eat apples"

# Score every query-document pair.
qd_pairs = [
        (query1, doc1), (query1, doc2),
        (query2, doc1), (query2, doc2)
]

scores = model.score(qd_pairs)

You will get:

scores: [ 41.60 -33.66
          -38.00  30.03 ]
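
Matched pairs receive high scores while mismatched pairs score low, so reranking a candidate list is simply a sort by score. A small usage sketch building on the example above (assuming score() returns one float per input pair, in order):

# Rerank candidate documents for query1 by their ARES scores.
docs = [doc1, doc2]
doc_scores = model.score([(query1, d) for d in docs])
ranked = sorted(zip(docs, doc_scores), key=lambda x: x[1], reverse=True)
for doc, s in ranked:
    print(f"{s:8.2f}  {doc}")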

Note that to accelerate the training process, we adopt distributed data-parallel training. The scripts for pre-training and fine-tuning are as follows:

Pre-training

export BERT_DIR=/path/to/bert-base/
export XGB_DIR=/path/to/xgboost.model

cd pretrain

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=6 --nnodes=1 train.py \
        --model_type ARES \
        --PRE_TRAINED_MODEL_NAME $BERT_DIR \
        --gpu_num 6 --world_size 6 \
        --MLM --axiom REP RANK REG PROX STM \
        --clf_model $XGB_DIR

Here --model_type can be ARES or ICT.
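
The --MLM and --axiom flags select the pre-training objectives. As a rough conceptual sketch only (not the repository's actual loss code), the axiomatic regularizer can be thought of as a pairwise objective that pushes the model to order two pseudo queries of the same document the way the axiomatic scores do:

import torch
import torch.nn.functional as F

def axiomatic_pairwise_loss(model_scores, axiom_scores, margin=1.0):
    """Conceptual sketch of an axiom-based pairwise hinge loss.

    model_scores, axiom_scores: (batch, 2) tensors holding the model's and
    the axiom's scores for two pseudo queries of the same document.
    """
    # +1 if the axiom prefers query 0, -1 if it prefers query 1
    preference = torch.sign(axiom_scores[:, 0] - axiom_scores[:, 1])
    score_diff = model_scores[:, 0] - model_scores[:, 1]
    return F.relu(margin - preference * score_diff).mean()

# The full pre-training loss would combine this with MLM, e.g.:
# loss = mlm_loss + reg_weight * axiomatic_pairwise_loss(m_scores, a_scores)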

Zero-shot evaluation (based on the ADORE+STAR top 100 candidates)

export MODEL_DIR=/path/to/ares-simple/
export CKPT_NAME=ares.ckpt

cd finetune

CUDA_VISIBLE_DEVICES=0 python train.py \
        --test \
--PRE_TRAINED_MODEL_NAME $MODEL_DIR \
        --model_type ARES \
        --model_name ARES_simple \
        --load_ckpt \
--model_path $CKPT_NAME

You can get:

#####################
<----- MS Dev ----->
MRR @10: 0.2991
MRR @100: 0.3130
QueriesRanked: 5193
#####################

on the MS MARCO dev set, and:

#############################
<--------- DL 2019 --------->
QueriesRanked: 43
nDCG @10: 0.5955
nDCG @100: 0.4863
#############################

on the DL 2019 set.
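
For reference, MRR@k over a run file such as dev.rank.tsv and the dev qrels can be computed in a few lines. This is an illustrative sketch that follows the file formats listed above, not the repository's evaluation script:

from collections import defaultdict

# dev-qrels.txt: "qid relevant_docid" per line
qrels = defaultdict(set)
with open("preprocess/dev-qrels.txt") as f:
    for line in f:
        qid, docid = line.split()[:2]
        qrels[qid].add(docid)

# dev.rank.tsv: "qid docid rank" per line
run = defaultdict(list)
with open("preprocess/dev.rank.tsv") as f:
    for line in f:
        qid, docid, rank = line.split()
        run[qid].append((int(rank), docid))

def mrr_at_k(k):
    """Mean reciprocal rank of the first relevant document within top k."""
    total = 0.0
    for qid, ranking in run.items():
        for rank, docid in sorted(ranking):
            if rank > k:
                break
            if docid in qrels[qid]:
                total += 1.0 / rank
                break
    return total / len(run)

print(f"MRR @10: {mrr_at_k(10):.4f}")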

Fine-tuning

export MODEL_DIR=/path/to/ares-simple/

cd finetune

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py \
        --model_type ARES \
        --distributed_train \
--PRE_TRAINED_MODEL_NAME $MODEL_DIR \
        --gpu_num 4 --world_size 4 \
        --model_name ARES_simple
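
Under torch.distributed.launch, one process is started per GPU and each process wraps its model in DistributedDataParallel, which is what the --distributed_train and --world_size flags configure. A generic sketch of that setup (not the repo's train.py; the Linear model is a stand-in):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torch.distributed.launch starts one process per GPU and sets LOCAL_RANK
# (older versions pass --local_rank instead); each process binds its own GPU.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = nn.Linear(10, 1).cuda(local_rank)  # stand-in for the real model
model = DDP(model, device_ids=[local_rank])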

Visualization

export MODEL_DIR=/path/to/ares-simple/
export SAVE_DIR=/path/to/output/
export CKPT_NAME=ares.ckpt

cd visualization

CUDA_VISIBLE_DEVICES=0 python visual.py \
    --PRE_TRAINED_MODEL_NAME $MODEL_DIR \
    --model_name ARES_simple \
    --visual_q_num 1 \
    --visual_d_num 5 \
    --save_path $SAVE_DIR \
    --model_path $CKPT_NAME
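
In the visualization figures, attribution values are normalized within each document. A generic sketch of such a per-document min-max normalization (not the actual visual.py code):

import numpy as np

def normalize_per_doc(attributions):
    """Min-max normalize token attributions within one document so that
    values are comparable across tokens of the same document."""
    a = np.asarray(attributions, dtype=np.float64)
    span = a.max() - a.min()
    return (a - a.min()) / span if span > 0 else np.zeros_like(a)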

Results

Zero-shot performance:

Model Name   MS MARCO MRR@10  MS MARCO MRR@100  DL NDCG@10  DL NDCG@100  COVID   EQ
BM25         0.2962           0.3107            0.5776      0.4795       0.4857  0.6690
BERT         0.1820           0.2012            0.4059      0.4198       0.4314  0.6055
PROPwiki     0.2429           0.2596            0.5088      0.4525       0.4857  0.5991
PROPmarco    0.2763           0.2914            0.5317      0.4623       0.4829  0.6454
ARESstrict   0.2630           0.2785            0.4942      0.4504       0.4786  0.6923
AREShard     0.2627           0.2780            0.5189      0.4613       0.4943  0.6822
ARESsimple   0.2991           0.3130            0.5955      0.4863       0.4957  0.6916

Few-shot performance: see the few-shot results figure in the repository.

Visualization (attribution values are normalized within each document): see the visualization figure in the repository.

Citation

If you find our work useful, please star this repo and cite our work:

@inproceedings{chen2022axiomatically,
  title={Axiomatically Regularized Pre-training for Ad hoc Search},
  author={Chen, Jia and Liu, Yiqun and Fang, Yan and Mao, Jiaxin and Fang, Hui and Yang, Shenghao and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

Notice

  • Please make sure that all pre-trained model parameters are loaded correctly, otherwise both zero-shot and fine-tuning performance will degrade significantly.
  • We welcome anyone who would like to contribute to this repo. 🤗
  • If you have any other questions, please feel free to contact me via chenjia0831@gmail.com or open an issue.
  • Code for data preprocessing will be released soon. Please stay tuned~