Code and data of the ACL 2021 paper: Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision

Overview

MetaAdaptRank

This repository provides the implementation of meta-learning to reweight synthetic weak supervision data described in the paper Few-Shot Text Ranking with Meta Adapted Synthetic Weak Supervision.

CONTACT

For any questions, please contact Si Sun by email at [email protected] (emails get a quicker response); we will do our best to help :)

QUICKSTART

Python 3.7
PyTorch 1.5.0

0/ Data Preparation

First download and prepare the following data into the data folder:

1 Contrastive Supervision Synthesis

1.1 Source-domain NLG training

  • We train two query generators (QG & ContrastQG) on the MS MARCO dataset using train_nlg.sh in the run_shells folder:

    bash train_nlg.sh
    
  • Optional arguments:

    --generator_mode            choices=['qg', 'contrastqg']
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --train_file                The path to the source-domain NLG training dataset
    --save_dir                  The path to save the checkpoints; default: ../results
    

1.2 Target-domain NLG inference

  • The whole NLG inference pipeline contains five steps:

    • 1.2.1/ Data preprocess
    • 1.2.2/ Seed query generation
    • 1.2.3/ BM25 subset retrieval
    • 1.2.4/ Contrastive doc pairs sampling
    • 1.2.5/ Contrastive query generation
  • 1.2.1/ Data preprocess. Convert the target-domain documents into the NLG format using prepro_nlg_dataset.sh in the preprocess folder:

    bash prepro_nlg_dataset.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --input_path            The path to the target dataset
    --output_path           The path to save the preprocessed data; default: ../data/prepro_target_data
    
  • 1.2.2/ Seed query generation. Use the trained QG model to generate seed queries for each target document using nlg_inference.sh in the run_shells folder:

    bash nlg_inference.sh
    
  • Optional arguments:

    --generator_mode            choices='qg'
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --target_dataset_name       choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_load_dir        The path to the pretrained QG checkpoints.
    
  • 1.2.3/ BM25 subset retrieval. Use BM25 to retrieve a document subset for the seed queries using do_subset_retrieve.sh in the bm25_retriever folder:

    bash do_subset_retrieve.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_folder      choices=['t5-small', 't5-base']
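
To make the retrieval step concrete, here is a minimal, self-contained sketch of BM25 scoring over tokenized documents. It is only an illustration under stated assumptions: the function names (bm25_scores, retrieve_subset), the parameters k1 and b, and the toy documents are hypothetical and are not taken from do_subset_retrieve.sh, which may rely on a different BM25 implementation and index.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against a tokenized `query`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed idf variant that is never negative.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def retrieve_subset(query, docs, top_k=2):
    """Return the indices of the top_k documents by BM25 score, best first."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy corpus (illustrative only): one seed query against three documents.
docs = [
    "covid vaccine trial results".split(),
    "stock market news today".split(),
    "vaccine side effects study".split(),
]
subset = retrieve_subset("vaccine trial".split(), docs, top_k=2)
```

In this toy corpus the document that matches both query terms ranks first, and the document with no matching term is left out of the subset.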
    
  • 1.2.4/ Contrastive doc pairs sampling. Sample contrastive document pairs from the BM25-retrieved subset using sample_contrast_pairs.sh in the preprocess folder:

    bash sample_contrast_pairs.sh
    
  • Optional arguments:

    --dataset_name          choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_folder      choices=['t5-small', 't5-base']
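
One possible way to picture pairwise sampling is sketched below. This is an assumption for illustration, not the logic of sample_contrast_pairs.sh: the function name, the uniform sampling, and the convention that the first document of each pair ranks higher in the BM25 list are all hypothetical.

```python
import random


def sample_contrast_pairs(ranked_doc_ids, num_pairs, seed=0):
    """Sample (doc_a, doc_b) pairs where doc_a ranks above doc_b in the BM25 list."""
    if len(ranked_doc_ids) < 2:
        return []
    rng = random.Random(seed)  # fixed seed so the sampling is reproducible
    pairs = []
    for _ in range(num_pairs):
        # Draw two distinct rank positions and order them by rank.
        i, j = sorted(rng.sample(range(len(ranked_doc_ids)), 2))
        pairs.append((ranked_doc_ids[i], ranked_doc_ids[j]))
    return pairs


# Hypothetical document ids, ordered by BM25 rank for one seed query.
ranked = ["d3", "d7", "d1", "d9"]
pairs = sample_contrast_pairs(ranked, num_pairs=3)
```

Each sampled pair would then be passed to the ContrastQG model in step 1.2.5.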
    
  • 1.2.5/ Contrastive query generation. Use the trained ContrastQG model to generate new queries from the contrastive document pairs using nlg_inference.sh in the run_shells folder:

    bash nlg_inference.sh
    
  • Optional arguments:

    --generator_mode            choices='contrastqg'
    --pretrain_generator_type   choices=['t5-small', 't5-base']
    --target_dataset_name       choices=['clueweb09', 'robust04', 'trec-covid']
    --generator_load_dir        The path to the pretrained ContrastQG checkpoints.
    

2 Meta Learning to Reweight

2.1 Data Preprocess

  • Place the contrastive synthetic supervision data (CTSyncSup) in the data/synthetic_data folder.

    • CTSyncSup_clueweb09
    • CTSyncSup_robust04
    • CTSyncSup_trec-covid

    >> example data format

  • Preprocess the target-domain datasets into the 5-fold cross-validation format using run_cv_preprocess.sh in the preprocess folder:

    bash run_cv_preprocess.sh
    
  • Optional arguments:

    --dataset_class         choices=['clueweb09', 'robust04', 'trec-covid']
    --input_path            The path to the target dataset
    --output_path           The path to save the preprocessed data; default: ../data/prepro_target_data
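
As a rough picture of what a 5-fold cross-validation layout looks like, the sketch below splits query ids into five folds and builds one train/test split per cv_number. The function name, the round-robin assignment, and the query ids are assumptions; run_cv_preprocess.sh may assign folds differently or hold out an extra development split.

```python
def make_folds(query_ids, num_folds=5):
    """Build one train/test split per fold from a collection of query ids."""
    ordered = sorted(query_ids)
    # Round-robin assignment keeps the fold sizes within one of each other.
    folds = [ordered[i::num_folds] for i in range(num_folds)]
    splits = []
    for k in range(num_folds):
        test = folds[k]
        train = [q for j, fold in enumerate(folds) if j != k for q in fold]
        splits.append({"cv_number": k, "train": train, "test": test})
    return splits


# Hypothetical query ids for illustration.
splits = make_folds([str(q) for q in range(301, 311)])
```

Each query appears in the test set of exactly one split, so the five models trained in section 2.2 together cover every query once at test time.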
    

2.2 Train and Test Models

  • The whole process of training and testing MetaAdaptRank contains three stages, covered by steps 2.2.1 to 2.2.5 below:

    • 2.2.1/ Meta-pretraining. The model is trained on synthetic weak supervision data, where the synthetic data are reweighted using meta-learning. The training folds of the target dataset serve as the target data that guides meta-reweighting.

    • 2.2.2/ Fine-tuning. The meta-pretrained model is further fine-tuned on the training folds of the target dataset.

    • 2.2.3-2.2.5/ Testing, Ensemble and Coor-Ascent. The fine-tuned models are tested to collect their last representation layers as LeToR features, and Coordinate Ascent combines these features with the retrieval scores from the base retriever.

  • 2.2.1/ Meta-pretraining using train_meta_bert.sh in the run_shells folder:

    bash train_meta_bert.sh
    

    Optional arguments for meta-pretraining:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --train_dir             The path to the synthetic weak supervision data
    --target_dir            The path to the target dataset
    --save_dir              The path to save the output files and checkpoints; default: ../results
    

    The complete list of optional arguments is in config.py in the scripts folder.
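
The idea behind meta-reweighting can be sketched on a toy linear model. The code below uses the common one-step approximation from the learning-to-reweight line of work: a synthetic example gets a positive weight only if its gradient points in the same direction as the mean gradient on the target data. Everything here is an assumption for illustration (the squared loss, the linear scorer, and the names grad and meta_weights); the repository trains BERT rankers with PyTorch, and its exact procedure is defined in train_meta_bert.sh and the scripts folder.

```python
def grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def meta_weights(w, synthetic, target):
    """Weight each synthetic example by its gradient alignment with the target data."""
    # Mean gradient of the loss on the target (few-shot) examples.
    target_grad = [0.0] * len(w)
    for x, y in target:
        g = grad(w, x, y)
        target_grad = [t + gi / len(target) for t, gi in zip(target_grad, g)]
    raw = []
    for x, y in synthetic:
        g = grad(w, x, y)
        alignment = sum(gi * ti for gi, ti in zip(g, target_grad))
        # Examples whose update would increase the target loss get weight 0.
        raw.append(max(0.0, alignment))
    total = sum(raw)
    # Normalize so the weights sum to 1 (left as zeros if nothing aligns).
    return [r / total for r in raw] if total > 0 else raw


# Toy data: the first synthetic example agrees with the target label, the second contradicts it.
w = [0.0, 0.0]
target = [([1.0, 0.0], 1.0)]
synthetic = [([1.0, 0.0], 1.0), ([1.0, 0.0], -1.0)]
weights = meta_weights(w, synthetic, target)
```

In this toy case the contradicting synthetic example is filtered out entirely, which is the behavior the reweighting is meant to produce on noisy synthetic supervision.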

  • 2.2.2/ Fine-tuning using train_metafine_bert.sh in the run_shells folder:

    bash train_metafine_bert.sh
    

    Optional arguments for fine-tuning:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --train_dir             The path to the target dataset
    --checkpoint_folder     The path to the checkpoint of the meta-pretrained model
    --save_dir              The path to save output files and checkpoint; default: ../results
    
  • 2.2.3/ Testing the fine-tuned model to collect LeToR features through test.sh in the run_shells folder:

    bash test.sh
    

    Optional arguments for testing:

    --cv_number             choices=[0, 1, 2, 3, 4]
    --pretrain_model_type   choices=['bert-base-cased', 'BiomedNLP-PubMedBERT-base-uncased-abstract']
    --target_dir            The path to the target evaluation dataset
    --checkpoint_folder     The path to the checkpoint of the fine-tuned model
    --save_dir              The path to save output files and the features file; default: ../results
    
  • 2.2.4/ Ensemble. Train and test five models, one per fold of the target dataset (5-fold cross-validation), then ensemble their output features and convert them to the coor-ascent format using combine_features.sh in the ensemble folder:

    bash combine_features.sh
    

    Optional arguments for ensemble:

    --qrel_path             The path to the qrels of the target dataset
    --result_fold_1         The path to the testing result folder of the first fold model
    --result_fold_2         The path to the testing result folder of the second fold model
    --result_fold_3         The path to the testing result folder of the third fold model
    --result_fold_4         The path to the testing result folder of the fourth fold model
    --result_fold_5         The path to the testing result folder of the fifth fold model
    --save_dir              The path to save the ensembled `features.txt` file; default: ../combined_features
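
For orientation, RankLib reads features in the LETOR/SVMlight-style text format, one line per query-document pair: a relevance label, a qid field, numbered feature values, and an optional trailing comment. The helper below is a hypothetical sketch of writing one such line; the function name, the six-decimal formatting, and the example ids are assumptions, and the exact features written by combine_features.sh are not shown here.

```python
def to_ranklib_line(label, qid, features, doc_id):
    """Format one query-document pair as a RankLib feature line."""
    # Feature indices start at 1 in this format.
    feats = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(features, start=1))
    return f"{label} qid:{qid} {feats} # {doc_id}"


# Illustrative values only: label 1, query "301", two features, a made-up doc id.
line = to_ranklib_line(1, "301", [0.25, -1.5], "doc-001")
print(line)
```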
    
  • 2.2.5/ Coor-Ascent. Run coordinate ascent using run_ranklib.sh in the ensemble folder:

    bash run_ranklib.sh
    

    Optional arguments for coor-ascent:

    --qrel_path             The path to the qrels of the target dataset
    --ranklib_path          The path to the ensembled features.
    

    The final evaluation results are written to the ranklib_path.
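
Coordinate ascent itself can be illustrated with a simplified sketch: a linear combination of features is tuned one weight at a time, keeping a change only when the ranking objective improves. This is not RankLib's implementation, which uses its own step search and random restarts; the candidate grid, the function names, and the pair-counting objective below are assumptions for illustration.

```python
def make_pair_objective(rows):
    """rows: list of (feature_vector, relevance_label) for a single query."""
    def objective(weights):
        scored = [(sum(w * f for w, f in zip(weights, feats)), label)
                  for feats, label in rows]
        # Count pairs where the more relevant document gets the strictly higher score.
        return sum(1 for s_a, l_a in scored for s_b, l_b in scored
                   if l_a > l_b and s_a > s_b)
    return objective


def coordinate_ascent(objective, num_features,
                      candidates=(-1.0, -0.5, 0.0, 0.5, 1.0), sweeps=3):
    """Optimize one weight at a time over a fixed grid of candidate values."""
    weights = [0.0] * num_features
    best = objective(weights)
    for _ in range(sweeps):
        for i in range(num_features):
            for value in candidates:
                trial = weights[:i] + [value] + weights[i + 1:]
                score = objective(trial)
                if score > best:  # keep a change only if it strictly improves
                    best, weights = score, trial
    return weights, best


# Toy query: feature 1 is high for the relevant document, feature 2 for the non-relevant one.
rows = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
weights, best = coordinate_ascent(make_pair_objective(rows), num_features=2)
```

In the actual pipeline the features would be the ensembled LeToR features together with the base retriever score, and the objective would be a ranking metric computed against the qrels.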

Results

All TREC files reported in the paper can be found on Tsinghua Cloud.

Owner
THUNLP
Natural Language Processing Lab at Tsinghua University