Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

Last update: Nov 28, 2022

Overview

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference

This repository contains the implementation of SDR.

Tested environment

Python 3.7
PyTorch 1.7
CUDA 11.0

Lower CUDA and PyTorch versions should work as well.

Installation
Datasets
Train with our datasets
Hierarchical Inference
Cite

License, security, support, and code of conduct specifications are in the instructions directory.

Installation

Run

bash instructions/installation.sh

Datasets

The published datasets are:

Video games
- 21,935 articles
- Expert-annotated test set: 90 articles, each with 12 ground-truth recommendations.
- Examples:
  - Grand Theft Auto - Mafia
  - Burnout Paradise - Forza Horizon 3
Wines
- 1,635 articles
- Test set crafted by a human sommelier: 92 articles, each with ~10 ground-truth recommendations.
- Examples:
  - Pinot Meunier - Chardonnay
  - Dom Pérignon - Moët & Chandon

For more details and direct download see Wines and Video Games.
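
The ground-truth lists described above can be used to score a ranked list of recommendations. The sketch below computes two common ranking measures, hit rate at k and reciprocal rank, on placeholder document IDs. The choice of metrics and the function names are assumptions made for illustration; they are not necessarily the evaluation implemented in this repository.

```python
def hit_rate_at_k(ranked, ground_truth, k):
    """Fraction of ground-truth items that appear in the top-k of `ranked`."""
    top_k = set(ranked[:k])
    return sum(1 for item in ground_truth if item in top_k) / len(ground_truth)


def reciprocal_rank(ranked, ground_truth):
    """1 / rank of the first ground-truth item in `ranked` (0.0 if none appear)."""
    relevant = set(ground_truth)
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


if __name__ == "__main__":
    # Placeholder IDs, not taken from the published datasets.
    ranked = ["doc_c", "doc_a", "doc_d", "doc_b"]
    ground_truth = ["doc_a", "doc_b"]
    print(hit_rate_at_k(ranked, ground_truth, k=2))
    print(reciprocal_rank(ranked, ground_truth))
```

Averaging either value over all annotated source articles gives a single score per dataset.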

Training

The training process downloads the datasets automatically.

python sdr_main.py --dataset_name video_games

The code is based on PyTorch Lightning, so all PL hyperparameters are supported (limit_train/val/test_batches, check_val_every_n_epoch, etc.).
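
For scripted runs, the training command above can be assembled programmatically. The sketch below uses a hypothetical helper, build_train_command, which is not part of the repository. It assumes that sdr_main.py accepts each PyTorch Lightning option as a `--<name> <value>` flag, which the README implies but does not show.

```python
import shlex


def build_train_command(dataset_name, **pl_args):
    """Build an argv list for sdr_main.py (hypothetical helper, not in the repo).

    Assumes each PyTorch Lightning option is accepted as `--<name> <value>`.
    """
    cmd = ["python", "sdr_main.py", "--dataset_name", dataset_name]
    # Sort the options so the generated command is deterministic.
    for name, value in sorted(pl_args.items()):
        cmd += ["--" + name, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_train_command(
        "video_games",
        limit_train_batches=0.1,    # use 10% of the training batches per epoch
        check_val_every_n_epoch=5,  # run validation every 5 epochs
    )
    # Print the command; pass `cmd` to subprocess.run(cmd, check=True) to launch it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Limiting the number of batches this way is handy for a quick smoke test before a full training run.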

Tensorboard support

All metrics are logged automatically and stored in

SDR/output/document_similarity/SDR/arch_SDR/dataset_name_<dataset>/<time_of_run>

Run tensorboard --logdir=<path> to see the logs.
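
When several runs have accumulated, it can help to locate the newest run directory before launching TensorBoard. The sketch below assumes the directory layout shown above. The helper name latest_run_dir is hypothetical, and because the format of <time_of_run> is not documented here, runs are ordered by modification time rather than by name.

```python
import os


def latest_run_dir(output_root, dataset):
    """Return the most recently modified run directory for `dataset`, or None.

    Assumes the layout <output_root>/dataset_name_<dataset>/<time_of_run>.
    """
    dataset_dir = os.path.join(output_root, "dataset_name_" + dataset)
    if not os.path.isdir(dataset_dir):
        return None
    runs = [
        os.path.join(dataset_dir, name)
        for name in os.listdir(dataset_dir)
        if os.path.isdir(os.path.join(dataset_dir, name))
    ]
    # Order by modification time, since the run-name format is an unknown here.
    return max(runs, key=os.path.getmtime, default=None)


if __name__ == "__main__":
    root = "SDR/output/document_similarity/SDR/arch_SDR"
    run = latest_run_dir(root, "video_games")
    print("tensorboard --logdir=" + run if run else "no runs found")
```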

Inference

The hierarchical inference described in the paper is implemented as a stand-alone service and can be used with any backbone algorithm (models/reco/hierarchical_reco.py).

python sdr_main.py --dataset_name <name> --resume_from_checkpoint <checkpoint> --test_only
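
To give a feel for what a hierarchical, backbone-agnostic scoring step can look like, the sketch below aggregates sentence-level cosine similarities into paragraph scores and then into a document score. This README does not spell out the procedure, so the max-then-mean aggregation, the function names, and the input layout (a document as a list of paragraphs, each a list of sentence embedding vectors) are assumptions for illustration. It omits any score normalization and is not the code in models/reco/hierarchical_reco.py.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def paragraph_similarity(src_par, cand_par):
    """Mean over source sentences of the best-matching candidate sentence."""
    best = [max(cosine(s, c) for c in cand_par) for s in src_par]
    return sum(best) / len(best)


def document_similarity(src_doc, cand_doc):
    """Mean over source paragraphs of the best-matching candidate paragraph."""
    best = [max(paragraph_similarity(p, q) for q in cand_doc) for p in src_doc]
    return sum(best) / len(best)


def rank_candidates(src_doc, candidates):
    """Return candidate names sorted from most to least similar to src_doc."""
    scores = {name: document_similarity(src_doc, doc) for name, doc in candidates.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy 2-d "embeddings"; any sentence encoder could supply real vectors.
    source = [[[1.0, 0.0]], [[0.0, 1.0]]]
    candidates = {
        "near": [[[1.0, 0.0], [0.0, 1.0]]],
        "far": [[[-1.0, 0.0]]],
    }
    print(rank_candidates(source, candidates))
```

Because the functions only consume embedding vectors, the sentence encoder can be swapped without touching the aggregation logic.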

Results

Citing & Authors

If you find this repository or the annotated datasets helpful, feel free to cite our publication:

SDR: Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

@misc{ginzburg2021selfsupervised,
    title={Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference},
    author={Dvir Ginzburg and Itzik Malkiel and Oren Barkan and Avi Caciularu and Noam Koenigstein},
    year={2021},
    eprint={2106.01186},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Contact: Dvir Ginzburg, Itzik Malkiel.

Owner

Microsoft
