The offcial repository for 'CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos', SIGIR2022

Overview

CharacterBERT-DR

The offcial repository for CharacterBERT and Self-Teaching for Improving the Robustness of Dense Retrievers on Queries with Typos, Shengyao Zhuang and Guido Zuccon, SIGIR2022

Installation

Our code is developed based on Tevatron DR training toolkit (v0.0.1).

First clone this repository and then install with pip: pip install --editable .

Note: The current code base has been tested with, torch==1.8.1, faiss-cpu==1.7.1, transformers==4.9.2, datasets==1.11.0, textattack=0.3.4

Preparing data and model

Typo queries and dataset

All the queries and qrels used in our paper are in the /data folder.

Download CharacterBERT

Download CharacterBERT trained with general domain with this link.

Train

CharacterBERT-DR + ST training

python -m tevatron.driver.train \
--model_name_or_path bert-base-uncased \
--character_bert_path ./general_character_bert \
--output_dir model_msmarco_characterbert_st \
--passage_field_separator [SEP] \
--save_steps 40000 \
--dataset_name Tevatron/msmarco-passage \
--fp16 \
--per_device_train_batch_size 16 \
--learning_rate 5e-6 \
--max_steps 150000 \
--dataloader_num_workers 10 \
--cache_dir ./cache \
--logging_steps 150 \
--character_query_encoder True \
--self_teaching True

If you want to do typo augmentation training introduced in our previous paper. Replace --self_teaching True with --typo_augmentation True.

If you want to train a standard BERT DR intead of CharacterBERT DR, remove --character_bert_path and --character_query_encoder arguments.

If you do not want to train the model, we provide our trained model checkpoints for you to download:

Model Checkpoints
StandardBERT-DR
StandardBERT-DR + Aug
StandardBERT-DR + ST
CharacterBERT-DR
CharacterBERT-DR + Aug
CharacterBERT-DR + ST

Inference

Encode queries and corpus

After you have the trained model, you can run the following command to encode queries and corpus into dense vectors:

mkdir msmarco_charcterbert_st_embs
# encode query
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_msmarco_characterbert_st/checkpoint-final \
  --fp16 \
  --per_device_eval_batch_size 128 \
  --encode_in_path data/dl-typo/query.typo.tsv \
  --encoded_save_path msmarco_charcterbert_st_embs/query_dltypo_typo_emb.pkl \
  --q_max_len 32 \
  --encode_is_qry \
  --character_query_encoder True


# encode corpus
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path model_msmarco_characterbert_st/checkpoint-final \
  --fp16 \
  --per_device_eval_batch_size 128 \
  --p_max_len 128 \
  --dataset_name Tevatron/msmarco-passage-corpus \
  --encoded_save_path model_msmarco_characterbert_st/corpus_emb.${s}.pkl \
  --encode_num_shard 20 \
  --encode_shard_index ${s} \
  --cache_dir cache \
  --character_query_encoder True \
  --passage_field_separator [SEP]
done

If you are using our provided model checkpoints, change --model_name_or_path to the downloaded model path. If you running inference with standard BERT, remove --character_query_encoder True argument.

Retrieval

Run the following commands to generate ranking file and convert it to TREC format:

python -m tevatron.faiss_retriever \
--query_reps model_msmarco_characterbert_st/query_dltypo_typo_emb.pkl \
--passage_reps model_msmarco_characterbert_st/'corpus_emb.*.pkl' \
--depth 1000 \
--batch_size -1 \
--save_text \
--save_ranking_to character_bert_st_dltypo_typo_rank.txt


python -m tevatron.utils.format.convert_result_to_trec \
              --input character_bert_st_dltypo_typo_rank.txt \
              --output character_bert_st_dltypo_typo_rank.txt.trec

Evaluation

We use trec_eval to evaluate the results:

trec_eval -l 2 -m ndcg_cut.10 -m map -m recip_rank data/dl-typo/qrels.txt character_bert_st_dltypo_typo_rank.txt.trec

If you use our provided CharacterBERT-DR + ST checkpoint, you will get:

map                     all     0.3483
recip_rank              all     0.6154
ndcg_cut_10             all     0.4730

We note that if you train the model by yourself, you may get slightly different results due to the randomness of dataloader and Tevatron self-contained msmarco-passage training dataset has been updated. We also note that, for our DL-typo dataset, the top10 passages are all judged in the ranking file generated by our provided checkpoints. Hence newly trained model may not comparable as there will be unjudged top10 passages in its ranking file.

Owner
ielab
The Information Engineering Lab
ielab
Code for the ICCV 2021 paper "Pixel Difference Networks for Efficient Edge Detection" (Oral).

Microsoft365_devicePhish Abusing Microsoft 365 OAuth Authorization Flow for Phishing Attack This is a simple proof-of-concept script that allows an at

Alex 236 Dec 21, 2022
Yggdrasil - A simplistic bot designed to streamline your server experience

Ygggdrasil A simplistic bot designed to streamline your server experience. Desig

Sntx_ 1 Dec 14, 2022
EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos.

EasyMocap is an open-source toolbox for markerless human motion capture from RGB videos. In this project, we provide the basic code for fitt

ZJU3DV 2.2k Jan 05, 2023
Unofficial keras(tensorflow) implementation of MAE model from Masked Autoencoders Are Scalable Vision Learners

MAE-keras Unofficial keras(tensorflow) implementation of MAE model described in 'Masked Autoencoders Are Scalable Vision Learners'. This work has been

Yewon 11 Jun 12, 2022
Code Repo for the ACL21 paper "Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning"

Common Sense Beyond English: Evaluating and Improving Multilingual LMs for Commonsense Reasoning This is the Github repository of our paper, "Common S

INK Lab @ USC 19 Nov 30, 2022
CC-GENERATOR - A python script for generating CC

CC-GENERATOR A python script for generating CC NOTE: This tool is for Educationa

Lêkzï 6 Oct 14, 2022
RoMa: A lightweight library to deal with 3D rotations in PyTorch.

RoMa: A lightweight library to deal with 3D rotations in PyTorch. RoMa (which stands for Rotation Manipulation) provides differentiable mappings betwe

NAVER 90 Dec 27, 2022
3D-aware GANs based on NeRF (arXiv).

CIPS-3D This repository will contain the code of the paper, CIPS-3D: A 3D-Aware Generator of GANs Based on Conditionally-Independent Pixel Synthesis.

Peterou 563 Dec 31, 2022
Official PyTorch Implementation of Convolutional Hough Matching Networks, CVPR 2021 (oral)

Convolutional Hough Matching Networks This is the implementation of the paper "Convolutional Hough Matching Network" by J. Min and M. Cho. Implemented

Juhong Min 70 Nov 22, 2022
PyTorch Implementation for Fracture Detection in Wrist Bone X-ray Images

wrist-d PyTorch Implementation for Fracture Detection in Wrist Bone X-ray Images note: Paper: Under Review at MPDI Diagnostics Submission Date: Novemb

Fatih UYSAL 5 Oct 12, 2022
MvtecAD unsupervised Anomaly Detection

MvtecAD unsupervised Anomaly Detection This respository is the unofficial implementations of DFR: Deep Feature Reconstruction for Unsupervised Anomaly

0 Feb 25, 2022
Python implementation of "Elliptic Fourier Features of a Closed Contour"

PyEFD An Python/NumPy implementation of a method for approximating a contour with a Fourier series, as described in [1]. Installation pip install pyef

Henrik Blidh 71 Dec 09, 2022
Generalized Data Weighting via Class-level Gradient Manipulation

Generalized Data Weighting via Class-level Gradient Manipulation This repository is the official implementation of Generalized Data Weighting via Clas

18 Nov 12, 2022
Code for paper "Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation" EMNLP 2021

The repo provides the code for paper "Extract, Denoise and Enforce: Evaluating and Improving Concept Preservation for Text-to-Text Generation" EMNLP 2

Yuning Mao 18 May 24, 2022
Analysis of rationale selection in neural rationale models

Neural Rationale Interpretability Analysis We analyze the neural rationale models proposed by Lei et al. (2016) and Bastings et al. (2019), as impleme

Yiming Zheng 3 Aug 31, 2022
Run object detection model on the Raspberry Pi

Using TensorFlow Lite with Python is great for embedded devices based on Linux, such as Raspberry Pi.

Dimitri Yanovsky 6 Oct 08, 2022
Search and filter videos based on objects that appear in them using convolutional neural networks

Thingscoop: Utility for searching and filtering videos based on their content Description Thingscoop is a command-line utility for analyzing videos se

Anastasis Germanidis 354 Dec 04, 2022
🔅 Shapash makes Machine Learning models transparent and understandable by everyone

🎉 What's new ? Version New Feature Description Tutorial 1.6.x Explainability Quality Metrics To help increase confidence in explainability methods, y

MAIF 2.1k Dec 27, 2022
Repositorio oficial del curso IIC2233 Programación Avanzada 🚀✨

IIC2233 - Programación Avanzada Evaluación Las evaluaciones serán efectuadas por medio de actividades prácticas en clases y tareas. Se calculará la no

IIC2233 @ UC 0 Dec 15, 2022
Object detection GUI based on PaddleDetection

PP-Tracking GUI界面测试版 本项目是基于飞桨开源的实时跟踪系统PP-Tracking开发的可视化界面 在PaddlePaddle中加入pyqt进行GUI页面研发,可使得整个训练过程可视化,并通过GUI界面进行调参,模型预测,视频输出等,通过多种类型的识别,简化整体预测流程。 GUI界面

杨毓栋 68 Jan 02, 2023