Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models.

Overview

Tevatron

Tevatron is a simple and efficient toolkit for training and running dense retrievers with deep language models. The toolkit has a modularized design for easy research; a set of command line tools are also provided for fast development and testing. A set of easy-to-use interfaces to Huggingfac's state-of-the-art pre-trained transformers ensures Tevatron's superior performance.

Tevatron is currently under initial development stage. We will be actively adding new features and API changes may happen. Suggestions, feature requests and PRs are welcomed.

Features

  • Command line interface for dense retriever training/encoding and dense index search.
  • Flexible and extendable Pytorch retriever models.
  • Highly efficient Trainer, a subclass of Huggingface Trainer, that naively support training performance features like mixed precision and distributed data parallel.
  • Fast and memory-efficient train/inference data access based on memory mapping with Apache Arrow through Huggingface datasets.

Installation

First install neural network and similarity search backends, namely Pytorch and FAISS. Check out the official installation guides for Pytorch and for FAISS.

Then install Tevatron with pip,

pip install tevatron

Or typically for develoment/research, clone this repo and install as editable,

git https://github.com/texttron/tevatron
cd tevatron
pip install --editable .

Note: The current code base has been tested with, torch==1.8.2, faiss-cpu==1.7.1, transformers==4.9.2, datasets==1.11.0

Data Format

Training: Each line of the the Train file is a training instance,

{'query': TEXT_TYPE, 'positives': List[TEXT_TYPE], 'negatives': List[TEXT_TYPE]}
...

Inference/Encoding: Each line of the the encoding file is a piece of text to be encoded,

{text_id: "xxx", 'text': TEXT_TYPE}
...

Here TEXT_TYPE can be either raw string or pre-tokenized ids, i.e. List[int]. Using the latter can help lower data processing latency during training to reduce/eliminate GPU wait. Note: the current code requires text_id of passages/contexts to be convertible to integer, e.g. integers or string of integers.

Training (Simple)

To train a simple dense retriever, call the tevatron.driver.train module,

python -m tevatron.driver.train \  
  --output_dir $OUTDIR \  
  --model_name_or_path bert-base-uncased \  
  --do_train \  
  --save_steps 20000 \  
  --train_dir $TRAIN_DIR \
  --fp16 \  
  --per_device_train_batch_size 8 \  
  --learning_rate 5e-6 \  
  --num_train_epochs 2 \  
  --dataloader_num_workers 2

Here we picked bert-base-uncased BERT weight from Huggingface Hub and turned on AMP with --fp16 to speed up training. Several command flags are provided in addition to configure the learned model, e.g. --add_pooler which adds an linear projection. A full list command line arguments can be found in tevatron.arguments.

Training (Research)

Check out the run.py in examples directory for a fully configurable train/test loop. Typically you will do,

from tevatron.modeling import DenseModel
from tevatron.trainer import DenseTrainer as Trainer

...
model = DenseModel.build(
        model_args,
        data_args,
        training_args,
        config=config,
        cache_dir=model_args.cache_dir,
    )
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        data_collator=collator,
    )
...
trainer.train()

Encoding

To encode, call the tevatron.driver.encode module. For large corpus, split the corpus into shards to parallelize.

for s in shard1 shar2 shard3
do
python -m tevatron.driver.encode \  
  --output_dir=$OUTDIR \  
  --tokenizer_name $TOK \  
  --config_name $CONFIG \  
  --model_name_or_path $MODEL_DIR \  
  --fp16 \  
  --per_device_eval_batch_size 128 \  
  --encode_in_path $CORPUS_DIR/$s.json \  
  --encoded_save_path $ENCODE_DIR/$s.pt
done

Index Search

Call the tevatron.faiss_retriever module,

python -m tevatron.faiss_retriever \  
--query_reps $ENCODE_QRY_DIR/qry.pt \  
--passage_reps $ENCODE_DIR/'*.pt' \  
--depth $DEPTH \
--batch_size -1 \
--save_text \
--save_ranking_to rank.tsv

Encoded corpus or corpus shards are loaded based on glob pattern matching of argument --passage_reps. Argument --batch_size controls number of queries passed to the FAISS index each search call and -1 will pass all queries in one call. Larger batches typically run faster (due to better memory access patterns and hardware utilization.) Setting flag --save_text will save the ranking to a tsv file with each line being qid pid score.

Alternatively paralleize search over the shards,

for s in shard1 shar2 shard3
do
python -m tevatron.faiss_retriever \  
--query_reps $ENCODE_QRY_DIR/qry.pt \  
--passage_reps $ENCODE_DIR/$s.pt \  
--depth $DEPTH \  
--save_ranking_to $INTERMEDIATE_DIR/$s
done

Then combine the results using the reducer module,

python -m tevatron.faiss_retriever.reducer \  
--score_dir $INTERMEDIATE_DIR \  
--query $ENCODE_QRY_DIR/qry.pt \  
--save_ranking_to rank.txt  

Contacts

If you have a toolkit specific question, feel free to open an issue.

You can also reach out to us for general comments/suggestions/questions through email.

Owner
texttron
texttron
Korean Simple Contrastive Learning of Sentence Embeddings using SKT KoBERT and kakaobrain KorNLU dataset

KoSimCSE Korean Simple Contrastive Learning of Sentence Embeddings implementation using pytorch SimCSE Installation git clone https://github.com/BM-K/

34 Nov 24, 2022
Klexikon: A German Dataset for Joint Summarization and Simplification

Klexikon: A German Dataset for Joint Summarization and Simplification Dennis Aumiller and Michael Gertz Heidelberg University Under submission at LREC

Dennis Aumiller 8 Jan 03, 2023
An implementation of WaveNet with fast generation

pytorch-wavenet This is an implementation of the WaveNet architecture, as described in the original paper. Features Automatic creation of a dataset (t

Vincent Herrmann 858 Dec 27, 2022
Simple program that translates the name of files into English

Simple program that translates the name of files into English. Useful for when editing/inspecting programs that were developed in a foreign language.

0 Dec 22, 2021
My implementation of Safaricom Machine Learning Codility test. The code has bugs, logical I guess I made errors and any correction will be appreciated.

Safaricom_Codility Machine Learning 2022 The test entails two questions. Question 1 was on Machine Learning. Question 2 was on SQL I ran out of time.

Lawrence M. 1 Mar 03, 2022
Maha is a text processing library specially developed to deal with Arabic text.

An Arabic text processing library intended for use in NLP applications Maha is a text processing library specially developed to deal with Arabic text.

Mohammad Al-Fetyani 184 Nov 27, 2022
Natural Language Processing

NLP Natural Language Processing apps Multilingual_NLP.py start #This script is demonstartion of Mul

Ritesh Sharma 1 Oct 31, 2021
PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit.

PyKaldi is a Python scripting layer for the Kaldi speech recognition toolkit. It provides easy-to-use, low-overhead, first-class Python wrappers for t

922 Dec 31, 2022
simpleT5 is built on top of PyTorch-lightning⚡️ and Transformers🤗 that lets you quickly train your T5 models.

Quickly train T5 models in just 3 lines of code + ONNX support simpleT5 is built on top of PyTorch-lightning ⚡️ and Transformers 🤗 that lets you quic

Shivanand Roy 220 Dec 30, 2022
A simple chatbot based on chatterbot that you can use for anything has basic features

Chatbotium A simple chatbot based on chatterbot that you can use for anything has basic features. I have some errors Read the paragraph below: Known b

Herman 1 Feb 16, 2022
Khandakar Muhtasim Ferdous Ruhan 1 Dec 30, 2021
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

Sapienza NLP group 16 Sep 09, 2022
A NLP program: tokenize method, PoS Tagging with deep learning

IRIS NLP SYSTEM A NLP program: tokenize method, PoS Tagging with deep learning Report Bug · Request Feature Table of Contents About The Project Built

Zakaria 7 Dec 13, 2022
A method for cleaning and classifying text using transformers.

NLP Translation and Classification The repository contains a method for classifying and cleaning text using NLP transformers. Overview The input data

Ray Chamidullin 0 Nov 15, 2022
apple's universal binaries BUT MUCH WORSE (PRACTICAL SHITPOST) (NOT PRODUCTION READY)

hyperuniversality investment opportunity: what if we could run multiple architectures in a single file, again apple universal binaries, but worse how

luna 2 Oct 19, 2021
In this Notebook I've build some machine-learning and deep-learning to classify corona virus tweets, in both multi class classification and binary classification.

Hello, This Notebook Contains Example of Corona Virus Tweets Multi Class Classification. - Classes is: Extremely Positive, Positive, Extremely Negativ

Khaled Tofailieh 3 Dec 06, 2022
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 169 Dec 21, 2022
A repository to run gpt-j-6b on low vram machines (4.2 gb minimum vram for 2000 token context, 3.5 gb for 1000 token context). Model loading takes 12gb free ram.

Basic-UI-for-GPT-J-6B-with-low-vram A repository to run GPT-J-6B on low vram systems by using both ram, vram and pinned memory. There seem to be some

90 Dec 25, 2022