NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

Last update: Apr 07, 2022

Related tags

Text Data & NLP pretrain4ir_tutorial

Overview

pretrain4ir_tutorial

NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

用作NLPIR实验室, Pre-training for IR方向入门.

代码包括了如下部分:

tasks/ : 生成预训练数据
pretrain/: 在生成的数据上Pre-training (MLM + NSP)
finetune/: Fine-tuning on MS MARCO

Preinstallation

First, prepare a Python3 environment, and run the following commands:

  git clone [email protected]:zhengyima/pretrain4ir_tutorial.git pretrain4ir_tutorial
  cd pretrain4ir_tutorial
  pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Besides, you should download the BERT model checkpoint in format of huggingface transformers, and save them in a directory BERT_MODEL_PATH. In our paper, we use the version of bert-base-uncased. you can download it from the huggingface official model zoo, or Tsinghua mirror.

生成预训练数据

代码库提供了最简单易懂的预训练任务 rand。该任务随机从文档中选取1~5个词作为query, 用来demo面向IR的预训练。

生成rand预训练任务数据命令: cd tasks/rand && bash gen.sh

你可以自己编写脚本, 仿照rand任务, 生成你自己认为合理的预训练任务的数据。

Notes: 运行rand任务的shell之前, 你需要先将 gen.sh 脚本中的 msmarco_docs_path 参数改为MSMARCO数据集的文档tsv 路径; 将bert_model参数改为下载好的bert模型目录;

模型预训练

代码库提供了模型预训练的相关代码, 见pretrain。该代码完成了MLM+NSP两个任务的预训练。

模型预训练命令: cd pretrain && bash train_bert.sh

Notes: 注意要修改train_bert中的相应参数：将bert_model参数改为下载好的bert模型目录; train_file改为你上一步生成好的预训练数据文件路径。

模型Fine-tune

代码库提供了在MSMARCO Document Ranking任务上进行Fine-tune的相关代码。见finetune。该代码完成了在MSMARCO上通过point-wise进行fine-tune的流程。

模型fine-tune命令: cd finetune && bash train_bert.sh

Leaderboard

Tasks	[email protected] on dev set
PROP-MARCO	0.4201
PROP-WIKI	0.4188
BERT-Base	0.4184
rand	0.4123

Homework

设计一个你认为合理的预训练任务, 并对BERT模型进行预训练, 并在MSMARCO上完成fine-tune, 在Leaderboard上更新你在dev set上的结果。

你需要做的是:

编写你自己的预训练数据生成脚本, 放到 tasks/yourtask 目录下。
使用以上脚本, 生成自己的预训练数据。
运行代码库提供的pre-train与fine-tune脚本, 跑出结果, 更新Leaderboard。

NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

Related tags

Overview

pretrain4ir_tutorial

Preinstallation

生成预训练数据

模型预训练

模型Fine-tune

Leaderboard

Homework

Links

Owner

ZYMa

apple's universal binaries BUT MUCH WORSE (PRACTICAL SHITPOST) (NOT PRODUCTION READY)

SDL: Synthetic Document Layout dataset

Pytorch NLP library based on FastAI

Seq2seq attn - Use the Seq2Seq method to implement machine translation and introduce Attention mechanism to improve the results

A natural language processing model for sequential sentence classification in medical abstracts.

Modular and extensible speech recognition library leveraging pytorch-lightning and hydra.

Implementation of paper Does syntax matter? A strong baseline for Aspect-based Sentiment Analysis with RoBERTa.

Stand-alone language identification system

IMDB film review sentiment classification based on BERT's supervised learning model.

Grover is a model for Neural Fake News -- both generation and detectio

Persian-lexicon - A lexicon of 70K unique Persian (Farsi) words

A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Code release for "COTR: Correspondence Transformer for Matching Across Images"

VMD Audio/Text control with natural language

nlp基础任务

BERT Attention Analysis

Predict an emoji that is associated with a text

A CSRankings-like index for speech researchers

Open source code for AlphaFold.

CYGNUS, the Cynical AI, combines snarky responses with uncanny aggression.