A multilingual version of MS MARCO passage ranking dataset

Last update: Dec 27, 2022

mMARCO

This repository presents a neural machine translation-based method for translating the MS MARCO passage ranking dataset. The code available here is the same code used in our paper, "mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset".

Translated Datasets

As described in our work, we made available 8 translated versions of the MS MARCO passage ranking dataset. The translated passage collection and the query sets (training and validation) are available at:

Released Model Checkpoints

Our available fine-tuned models are:

Model	Description	MRR@10*
ptT5-base-pt-msmarco	a PTT5 model fine-tuned on Portuguese MS MARCO	0.188
ptT5-base-en-pt-msmarco	a PTT5 model fine-tuned on English and Portuguese MS MARCO	0.343
mT5-base-en-pt-msmarco	an mT5 model fine-tuned on both English and Portuguese MS MARCO	0.375
mT5-base-multi-msmarco	an mT5 model fine-tuned on mMARCO	0.366
mMiniLM-pt-msmarco	an mMiniLM model fine-tuned on Portuguese MS MARCO	-
mMiniLM-en-pt-msmarco	an mMiniLM model fine-tuned on both English and Portuguese MS MARCO	0.375
mMiniLM-multi-msmarco	an mMiniLM model fine-tuned on mMARCO	0.363

* MRR@10 on English MS MARCO
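
Assuming the metric in the table above is MRR@10 (mean reciprocal rank at cutoff 10), the sketch below shows how such a score can be computed from ranked passage lists. The function name `mrr_at_k`, the toy query and passage ids, and the choice to average over every query in the run are assumptions made for illustration; this is not the official MS MARCO evaluation script.

```python
from typing import Dict, List, Set


def mrr_at_k(ranked: Dict[str, List[str]],
             relevant: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, passages in ranked.items():
        gold = relevant.get(qid, set())
        for rank, pid in enumerate(passages[:k], start=1):
            if pid in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts for a query
    # Queries with no relevant passage in the top k contribute 0.
    return total / len(ranked) if ranked else 0.0


# Toy run with hypothetical ids: q1 hits at rank 2, q2 at rank 1, q3 misses.
ranked = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"], "q3": ["p4"]}
relevant = {"q1": {"p7"}, "q2": {"p9"}, "q3": {"p8"}}
score = mrr_at_k(ranked, relevant)
print(score)  # (1/2 + 1 + 0) / 3 = 0.5
```

Only the highest-ranked relevant passage per query matters, so a model that places the right passage second for every query would score 0.5.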

Dataset

We translate the MS MARCO passage ranking dataset, a large-scale IR dataset comprising more than half a million anonymized questions sampled from Bing's search query logs.

Translation Model

To translate the MS MARCO dataset, we use MarianNMT, an open-source neural machine translation framework originally written in C++ for fast training and translation. The Language Technology Research Group at the University of Helsinki has made available models for more than a thousand language pairs, all supported by the HuggingFace framework.
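
As a rough illustration of how one of these Helsinki-NLP checkpoints can be used through the HuggingFace `transformers` library, here is a minimal sketch. The helper names `opus_mt_name` and `translate` are hypothetical and are not taken from this repository's scripts. The sketch assumes `transformers` and PyTorch are installed and that a checkpoint exists for the requested pair. Some checkpoints cover several target languages and expect a target-language prefix in the input, which this sketch does not handle.

```python
from typing import List


def opus_mt_name(src: str, tgt: str) -> str:
    """Build a checkpoint name following the Helsinki-NLP/opus-mt-{src}-{tgt} pattern."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate(sentences: List[str], src: str, tgt: str) -> List[str]:
    """Translate a small batch of sentences with a MarianMT checkpoint."""
    # Imported lazily so the name helper works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    # Pad to the longest sentence and truncate inputs that exceed the model limit.
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)


if __name__ == "__main__":
    # Calling translate() downloads the checkpoint on first use.
    print(opus_mt_name("en", "de"))
```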

How To Translate

To let other users translate the MS MARCO passage ranking dataset into other languages (or translate a dataset of their own), we provide the translate.py script. The script expects a .tsv file in which each line follows the format document_id \t document_text.

python translate.py --model_name_or_path Helsinki-NLP/opus-mt-{src}-{tgt} --target_language tgt_code --input_file collection.tsv --output_dir translated_data/
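
For a dataset of your own, the input file has to follow the document_id \t document_text layout described above. The following sketch shows one way to write and read such a file. The helpers `write_tsv` and `read_tsv` are hypothetical, and collapsing internal whitespace is an assumption made here to keep each document on a single line.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def write_tsv(rows: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write (document_id, document_text) pairs, one document per line."""
    for doc_id, text in rows:
        # Tabs or newlines inside the text would break the two-column layout.
        clean = " ".join(text.split())
        handle.write(f"{doc_id}\t{clean}\n")


def read_tsv(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (document_id, document_text) pairs from a tab-separated file."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, text = line.split("\t", 1)
        yield doc_id, text


# Round trip through an in-memory file; a real run would use open("collection.tsv", "w").
buffer = io.StringIO()
write_tsv([("0", "First passage\twith a tab."), ("1", "Second\npassage.")], buffer)
buffer.seek(0)
docs = list(read_tsv(buffer))
print(docs)
```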

After translating, it is necessary to reassemble the file, as the documents were split into sentences.

python create_translated_collection.py --input_file translated_data/translated_file --output_file translated_{tgt}_collection
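
The split-and-reassemble idea can be sketched as follows. This is not the logic of create_translated_collection.py: the regex-based sentence splitter, the function names, and the assumption that each translated sentence is stored next to its document id are all illustrative choices.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Naive boundary: whitespace that follows ".", "!" or "?".
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def split_into_sentences(doc_id: str, text: str) -> List[Tuple[str, str]]:
    """Split one document into (doc_id, sentence) pairs."""
    return [(doc_id, s) for s in _SENT_END.split(text.strip()) if s]


def reassemble(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Join sentences that share a doc_id, keeping their original order."""
    docs: Dict[str, List[str]] = {}
    for doc_id, sentence in pairs:
        docs.setdefault(doc_id, []).append(sentence)
    return {doc_id: " ".join(parts) for doc_id, parts in docs.items()}


original = {"0": "MS MARCO is large. It has many passages!", "1": "One sentence only."}
pairs = [p for doc_id, text in original.items() for p in split_into_sentences(doc_id, text)]
# In a real pipeline each sentence would be translated before this step.
rebuilt = reassemble(pairs)
print(rebuilt == original)  # True
```

Keeping the document id attached to every sentence is what makes the reassembly step possible after the sentences have been translated independently.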

Translating the entire passages collection of MS MARCO took about 80 hours using a Tesla V100.

How to Cite

If you extend or use this work, please cite the paper where it was introduced:

@misc{bonifacio2021mmarco,
      title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset}, 
      author={Luiz Henrique Bonifacio and Israel Campiotti and Roberto Lotufo and Rodrigo Nogueira},
      year={2021},
      eprint={2108.13897},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
