SIGIR'22 paper: Axiomatically Regularized Pre-training for Ad hoc Search

Overview

img

THUIR License made-with-python code-size

Introduction

This codebase contains source-code of the Python-based implementation (ARES) of our SIGIR 2022 paper.

Requirements

  • python 3.7
  • torch==1.9.0
  • transformers==4.9.2
  • tqdm, nltk, numpy, boto3
  • trec_eval for evaluation on TREC DL 2019
  • anserini for generating "RANK" axiom scores

Why this repo?

In this repo, you can pre-train ARESsimple and TransformerICT models, and fine-tune all pre-trained models with the same architecture as BERT. The papers are listed as follows:

You can download the pre-trained ARES checkpoint ARESsimple from Google drive and extract it.

Pre-training Data

Download data

Download the MS MARCO corpus from the official website.
Download the ADORE+STAR Top100 Candidates files from this repo.

Pre-process data

To save memory, we store most files using the numpy memmap or jsonl format in the ./preprocess directory.

Document files:

  • doc_token_ids.memmap: each line is the token ids for a document
  • docid2idx.json: {docid: memmap_line_id}

Query files:

  • queries.doctrain.jsonl: MS MARCO training queries {"id" qid, "ids": token_ids} for each line
  • queries.docdev.jsonl: MS MARCO validating queries {"id" qid, "ids": token_ids} for each line
  • queries.dl2019.jsonl: TREC DL 2019 queries {"id" qid, "ids": token_ids} for each line

Human label files:

  • msmarco-doctrain-qrels.tsv: qid 0 docid 1 for training set
  • dev-qrels.txt: qid relevant_docid for validating set
  • 2019qrels-docs.txt: qid relevant_docid for TREC DL 2019 set

Top 100 candidate files:

  • train.rank.tsv, dev.rank.tsv, test.rank.tsv: qid docid rank for each line

Pseudo queries and axiomatic features:

  • doc2qs.jsonl: {"docid": docid, "queries": [qids]} for each line
  • sample_qs_token_ids.memmap: each line is the token ids for a pseudo query
  • sample_qid2id.json: {qid: memmap_line_id}
  • axiom.memmap: axiom can be one of the ['rank', 'prox-1', 'prox-2', 'rep-ql', 'rep-tfidf', 'reg', 'stm-1', 'stm-2', 'stm-3'], each line is an axiomatic score for a query

Quick Start

Note that to accelerate the training process, we adopt the parallel training technique. The scripts for pre-training and fine-tuning are as follow:

Pre-training

export BERT_DIR=/path/to/bert-base/
export XGB_DIR=/path/to/xgboost.model

cd pretrain

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 NCCL_BLOCKING_WAIT=1 \
python  -m torch.distributed.launch --nproc_per_node=6 --nnodes=1 train.py \
        --model_type ARES \
        --PRE_TRAINED_MODEL_NAME BERT_DIR \
        --gpu_num 6 --world_size 6 \
        --MLM --axiom REP RANK REG PROX STM \
        --clf_model XGB_DIR

Here model type can be ARES or ICT.

Zero-shot evaluation (based on AS top100)

export MODEL_DIR=/path/to/ares-simple/
export CKPT_NAME=ares.ckpt

cd finetune

CUDA_VISIBLE_DEVICES=0 python train.py \
        --test \
        --PRE_TRAINED_MODEL_NAME MODEL_DIR \
        --model_type ARES \
        --model_name ARES_simple \
        --load_ckpt \
        --model_path CKPT_NAME

You can get:

#####################
<----- MS Dev ----->
MRR @10: 0.2991
MRR @100: 0.3130
QueriesRanked: 5193
#####################

on MS MARCO dev set and:

#############################
<--------- DL 2019 --------->
QueriesRanked: 43
nDCG @10: 0.5955
nDCG @100: 0.4863
#############################

on DL 2019 set.

Fine-tuning

export MODEL_DIR=/path/to/ares-simple/

cd finetune

CUDA_VISIBLE_DEVICES=0,1,2,3 NCCL_BLOCKING_WAIT=1 \
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 train.py \
        --model_type ARES \
        --distributed_train \
        --PRE_TRAINED_MODEL_NAME MODEL_DIR \
        --gpu_num 4 --world_size 4 \
        --model_name ARES_simple

Visualization

export MODEL_DIR=/path/to/ares-simple/
export SAVE_DIR=/path/to/output/
export CKPT_NAME=ares.ckpt

cd visualization

CUDA_VISIBLE_DEVICES=0 python visual.py \
    --PRE_TRAINED_MODEL_NAME MODEL_DIR \
    --model_name ARES_simple \
    --visual_q_num 1 \
    --visual_d_num 5 \
    --save_path SAVE_DIR \
    --model_path CKPT_NAME

Results

Zero-shot performance:

Model Name MS MARCO [email protected] MS MARCO [email protected] DL [email protected] DL [email protected] COVID EQ
BM25 0.2962 0.3107 0.5776 0.4795 0.4857 0.6690
BERT 0.1820 0.2012 0.4059 0.4198 0.4314 0.6055
PROPwiki 0.2429 0.2596 0.5088 0.4525 0.4857 0.5991
PROPmarco 0.2763 0.2914 0.5317 0.4623 0.4829 0.6454
ARESstrict 0.2630 0.2785 0.4942 0.4504 0.4786 0.6923
AREShard 0.2627 0.2780 0.5189 0.4613 0.4943 0.6822
ARESsimple 0.2991 0.3130 0.5955 0.4863 0.4957 0.6916

Few-shot performance: img

Visualization (attribution values have been normalized within a document): img

Citation

If you find our work useful, please do not save your star and cite our work:

@inproceedings{chen2022axiomatically,
  title={Axiomatically Regularized Pre-training for Ad hoc Search},
  author={Chen, Jia and Liu, Yiqun and Fang, Yan and Mao, Jiaxin and Fang, Hui and Yang, Shenghao and Xie, Xiaohui and Zhang, Min and Ma, Shaoping},
  booktitle={Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  year={2022}
}

Notice

  • Please make sure that all the pre-trained model parameters have been loaded correctly, or the zero-shot and the fine-tuning performance will be greatly impacted.
  • We welcome anyone who would like to contribute to this repo. 🤗
  • If you have any other questions, please feel free to contact me via [email protected] or open an issue.
  • Code for data preprocessing will come soon. Please stay tuned~
Owner
Jia Chen
My life is a beauty. 🦋
Jia Chen
[ICCV 2021] Instance-level Image Retrieval using Reranking Transformers

Instance-level Image Retrieval using Reranking Transformers Fuwen Tan, Jiangbo Yuan, Vicente Ordonez, ICCV 2021. Abstract Instance-level image retriev

UVA Computer Vision 86 Dec 28, 2022
A fast and easy implementation of Transformer with PyTorch.

FasySeq FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which

宁羽 7 Jul 18, 2022
Simple bots or Simbots is a library designed to create simple bots using the power of python. This library utilises Intent, Entity, Relation and Context model to create bots .

Simple bots or Simbots is a library designed to create simple chat bots using the power of python. This library utilises Intent, Entity, Relation and

14 Dec 15, 2021
Utilize Korean BERT model in sentence-transformers library

ko-sentence-transformers 이 프로젝트는 KoBERT 모델을 sentence-transformers 에서 보다 쉽게 사용하기 위해 만들어졌습니다. Ko-Sentence-BERT-SKTBERT 프로젝트에서는 KoBERT 모델을 sentence-trans

Junghyun 40 Dec 20, 2022
BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

BROS (BERT Relying On Spatiality) is a pre-trained language model focusing on text and layout for better key information extraction from documents. Given the OCR results of the document image, which

Clova AI Research 94 Dec 30, 2022
超轻量级bert的pytorch版本,大量中文注释,容易修改结构,持续更新

bert4pytorch 2021年8月27更新: 感谢大家的star,最近有小伙伴反映了一些小的bug,我也注意到了,奈何这个月工作上实在太忙,更新不及时,大约会在9月中旬集中更新一个只需要pip一下就完全可用的版本,然后会新添加一些关键注释。 再增加对抗训练的内容,更新一个完整的finetune

muqiu 317 Dec 18, 2022
Image2pcl - Enter the metaverse with 2D image to 3D projections

Image2PCL Enter the metaverse with 2D image to 3D projections! This is an implem

Benjamin Ho 0 Feb 05, 2022
SpeechBrain is an open-source and all-in-one speech toolkit based on PyTorch.

The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition, speaker recognit

SpeechBrain 5.1k Jan 09, 2023
Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch

Jupyter Notebook tutorials on solving real-world problems with Machine Learning & Deep Learning using PyTorch. Topics: Face detection with Detectron 2, Time Series anomaly detection with LSTM Autoenc

Venelin Valkov 1.8k Dec 31, 2022
A programming language with logic of Python, and syntax of all languages.

Pytov The idea was to take all well known syntaxes, and combine them into one programming language with many posabilities. Installation Install using

Yuval Rosen 14 Dec 07, 2022
Ecco is a python library for exploring and explaining Natural Language Processing models using interactive visualizations.

Visualize, analyze, and explore NLP language models. Ecco creates interactive visualizations directly in Jupyter notebooks explaining the behavior of Transformer-based language models (like GPT2, BER

Jay Alammar 1.6k Dec 25, 2022
Indonesia spellchecker with python

indonesia-spellchecker Ganti kata yang terdapat pada file teks.txt untuk diperiksa kebenaran kata. Run on local machine python3 main.py

Rahmat Agung Julians 1 Sep 14, 2022
Deep Learning Topics with Computer Vision & NLP

Deep learning Udacity Course Deep Learning Topics with Computer Vision & NLP for the AWS Machine Learning Engineer Nanodegree Program Tasks are mostly

Simona Mircheva 1 Jan 20, 2022
A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenizatio

Computation for Indian Language Technology (CFILT) 9 Oct 13, 2022
Proquabet - Convert your prose into proquints and then you essentially have Vogon poetry

Proquabet Turn your prose into a constant stream of encrypted and meaningless-so

Milo Fultz 2 Oct 10, 2022
test

Lidar-data-decode In this project, you can decode your lidar data frame(pcap file) and make your own datasets(test dataset) in Windows without any hug

46 Dec 05, 2022
Unsupervised Abstract Reasoning for Raven’s Problem Matrices

Unsupervised Abstract Reasoning for Raven’s Problem Matrices This code is the implementation of our TIP paper. This is the first unsupervised abstract

Tao Zhuo 9 Dec 17, 2022
Syntax-aware Multi-spans Generation for Reading Comprehension (TASLP 2022)

SyntaxGen Syntax-aware Multi-spans Generation for Reading Comprehension (TASLP 2022) In this repo, we upload all the scripts for this work. Due to siz

Zhuosheng Zhang 3 Jun 13, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
Python package for performing Entity and Text Matching using Deep Learning.

DeepMatcher DeepMatcher is a Python package for performing entity and text matching using deep learning. It provides built-in neural networks and util

461 Dec 28, 2022