DaReCzech is a dataset for text relevance ranking in Czech

Overview

DaReCzech Dataset

DaReCzech is a dataset for text relevance ranking in Czech. The dataset consists of more than 1.6M annotated query-documents pairs, which makes it one of the largest available datasets for this task.

The dataset was introduced in paper Siamese BERT-based Model for Web Search Relevance RankingEvaluated on a New Czech Dataset which has been accepted at the IAAI 2022 (Innovative Application Award).

Obtaining the Annotated Data

Please, first read a disclaimer that contains the terms of use. If you comply with them, send an email to [email protected] and the link to the dataset will be sent to you.

Overview

DaReCzech is divided into four parts:

  • Train-big (more than 1.4M records) – intended for training of a (neural) text relevance model
  • Train-small (97k records) – intended for GBRT training (with a text relevance feature trained on Train-big)
  • Dev (41k records)
  • Test (64k records)

Each set is distributed as a .tsv file with 6 columns:

  • ID – unique record ID
  • query – user query
  • url – URL of annotated document
  • doc – representation of the document under the URL, each document is represented using its title, URL and Body Text Extract (BTE) that was obtained using the internal module of our search engine
  • title: document title
  • label – the annotated relevance of the document to the query. There are 5 relevance labels ranging from 0 (the document is not useful for given query) to 1 (document is for given query useful)

The files are UTF-8 encoded. The values never contain a tab and are not quoted nor escaped – to load the dataset in pandas, use

import csv
import pandas as pd
pd.read_csv(path, sep='\t', quoting=csv.QUOTE_NONE)

Baselines

We provide code to train two BERT-based baseline models: a query-doc model (train_querydoc_model.py) and a siamese model (train_siamese_model.py).

Before running the scripts, install requirements that are listed in requirements.txt. The scripts were tested with Python 3.6.

pip install -r requirements.txt

Model Training

To train a query-doc model with default settings, run:

python train_querydoc_model.py train_big.tsv dev.tsv outputs

To train a siamese model without a teacher, run:

python train_siamese_model.py train_big.tsv dev.tsv outputs

To train a siamese model with a trained query-doc teacher, run:

python train_siamese_model.py train_big.tsv dev.tsv outputs --teacher path_to_query_doc_checkpoint

Note that example scripts run training with our (unsupervisedly) pretrained Small-E-Czech model.

Model Evaluation

To evaluate the trained query-doc model on test data, run:

python evaluate_model.py model_path test.tsv --is_querydoc

To evaluate the trained siamese model on test data, run:

python evaluate_model.py model_path test.tsv --is_siamese

Acknowledgements

If you use the dataset in your work, please cite the original paper:

@article{kocian2021siamese,
  title={Siamese BERT-based Model for Web Search Relevance RankingEvaluated on a New Czech Dataset},
  author={Kocián, Matěj and Náplava, Jakub and Štancl, Daniel and Kadlec, Vladimír},
  journal={arXiv preprint arXiv:2112.01810},
  year={2021}
}
Owner
Seznam.cz a.s.
Seznam.cz a.s.
[NeurIPS 2020] Semi-Supervision (Unlabeled Data) & Self-Supervision Improve Class-Imbalanced / Long-Tailed Learning

Rethinking the Value of Labels for Improving Class-Imbalanced Learning This repository contains the implementation code for paper: Rethinking the Valu

Yuzhe Yang 656 Dec 28, 2022
Near-Optimal Sparse Allreduce for Distributed Deep Learning (published in PPoPP'22)

Near-Optimal Sparse Allreduce for Distributed Deep Learning (published in PPoPP'22) Ok-Topk is a scheme for distributed training with sparse gradients

Shigang Li 9 Oct 29, 2022
RCT-ART is an NLP pipeline built with spaCy for converting clinical trial result sentences into tables through jointly extracting intervention, outcome and outcome measure entities and their relations.

Randomised controlled trial abstract result tabulator RCT-ART is an NLP pipeline built with spaCy for converting clinical trial result sentences into

2 Sep 16, 2022
official implementation for the paper "Simplifying Graph Convolutional Networks"

Simplifying Graph Convolutional Networks Updates As pointed out by #23, there was a subtle bug in our preprocessing code for the reddit dataset. After

Tianyi 727 Jan 01, 2023
Official PyTorch implementation of the ICRA 2021 paper: Adversarial Differentiable Data Augmentation for Autonomous Systems.

Adversarial Differentiable Data Augmentation This repository provides the official PyTorch implementation of the ICRA 2021 paper: Adversarial Differen

Manli 3 Oct 15, 2022
ByteTrack: Multi-Object Tracking by Associating Every Detection Box

ByteTrack ByteTrack is a simple, fast and strong multi-object tracker. ByteTrack: Multi-Object Tracking by Associating Every Detection Box Yifu Zhang,

Yifu Zhang 2.9k Jan 04, 2023
Exploiting a Zoo of Checkpoints for Unseen Tasks

Exploiting a Zoo of Checkpoints for Unseen Tasks This repo includes code to reproduce all results in the above Neurips paper, authored by Jiaji Huang,

Baidu Research 8 Sep 06, 2022
Multi-Target Adversarial Frameworks for Domain Adaptation in Semantic Segmentation

Multi-Target Adversarial Frameworks for Domain Adaptation in Semantic Segmentation Paper Multi-Target Adversarial Frameworks for Domain Adaptation in

Valeo.ai 20 Jun 21, 2022
Weakly-supervised object detection.

Wetectron Wetectron is a software system that implements state-of-the-art weakly-supervised object detection algorithms. Project CVPR'20, ECCV'20 | Pa

NVIDIA Research Projects 342 Jan 05, 2023
Loopy belief propagation for factor graphs on discrete variables, in JAX!

PGMax implements general factor graphs for discrete probabilistic graphical models (PGMs), and hardware-accelerated differentiable loopy belief propagation (LBP) in JAX.

Vicarious 62 Dec 23, 2022
Plato: A New Framework for Federated Learning Research

a new software framework to facilitate scalable federated learning research.

System <a href=[email protected] Lab"> 192 Jan 05, 2023
BASH - Biomechanical Animated Skinned Human

We developed a method animating a statistical 3D human model for biomechanical analysis to increase accessibility for non-experts, like patients, athletes, or designers.

Machine Learning and Data Analytics Lab FAU 66 Nov 19, 2022
Neural Turing Machine (NTM) & Differentiable Neural Computer (DNC) with pytorch & visdom

Neural Turing Machine (NTM) & Differentiable Neural Computer (DNC) with pytorch & visdom Sample on-line plotting while training(avg loss)/testing(writ

Jingwei Zhang 269 Nov 15, 2022
Pytorch implementation of the popular Improv RNN model originally proposed by the Magenta team.

Pytorch Implementation of Improv RNN Overview This code is a pytorch implementation of the popular Improv RNN model originally implemented by the Mage

Sebastian Murgul 3 Nov 11, 2022
GeDML is an easy-to-use generalized deep metric learning library

GeDML is an easy-to-use generalized deep metric learning library

Borui Zhang 32 Dec 05, 2022
Pytorch implementation for the Temporal and Object Quantification Networks (TOQ-Nets).

TOQ-Nets-PyTorch-Release Pytorch implementation for the Temporal and Object Quantification Networks (TOQ-Nets). Temporal and Object Quantification Net

Zhezheng Luo 9 Jun 30, 2022
PyTorch code for MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning PyTorch code for our ACL 2020 paper "MART: Memory-Augmented Recur

Jie Lei 雷杰 151 Jan 06, 2023
Official code repository for the EMNLP 2021 paper

Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization PyTorch code for the EMNLP 2021 paper "Integrating Visuospatia

Adyasha Maharana 23 Dec 19, 2022
A python script to convert images to animated sus among us crewmate twerk jifs as seen on r/196

img_sussifier A python script to convert images to animated sus among us crewmate twerk jifs as seen on r/196 Examples How to use install python pip i

41 Sep 30, 2022
Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation

DistMIS Distributing Deep Learning Hyperparameter Tuning for 3D Medical Image Segmentation. DistriMIS Distributing Deep Learning Hyperparameter Tuning

HiEST 2 Sep 09, 2022