Self-Supervised Document-to-Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference

Related tags

Deep LearningSDR
Overview

Self-Supervised Document Similarity Ranking (SDR) via Contextualized Language Models and Hierarchical Inference

This repo is the implementation for SDR.

 

Tested environment

  • Python 3.7
  • PyTorch 1.7
  • CUDA 11.0

Lower CUDA and PyTorch versions should work as well.

 

Contents

License, Security, support and code of conduct specifications are under the Instructions directory.  

Installation

Run

bash instructions/installation.sh 

 

Datasets

The published datasets are:

  • Video games
    • 21,935 articles
    • Expert annotated test set. 90 articles with 12 ground-truth recommendations.
    • Examples:
      • Grand Theft Auto - Mafia
      • Burnout Paradise - Forza Horizon 3
  • Wines
    • 1635 articles
    • Crafted by a human sommelier, 92 articles with ~10 ground-truth recommendations.
    • Examples:
      • Pinot Meunier - Chardonnay
      • Dom Pérignon - Moët & Chandon

For more details and direct download see Wines and Video Games.

 

Training

The training process downloads the datasets automatically.

python sdr_main.py --dataset_name video_games

The code is based on PyTorch-Lightning, all PL hyperparameters are supported. (limit_train/val/test_batches, check_val_every_n_epoch etc.)

Tensorboard support

All metrics are being logged automatically and stored in

SDR/output/document_similarity/SDR/arch_SDR/dataset_name_<dataset>/<time_of_run>

Run tesnroboard --logdir=<path> to see the the logs.

 

Inference

The hierarchical inference described in the paper is implemented as a stand-alone service and can be used with any backbone algorithm (models/reco/hierarchical_reco.py).

 

python sdr_main.py --dataset_name <name> --resume_from_checkpoint <checkpoint> --test_only

Results

Citing & Authors

If you find this repository or the annotated datasets helpful, feel free to cite our publication -

SDR: Self-Supervised Document-to-Document Similarity Ranking viaContextualized Language Models and Hierarchical Inference

 @misc{ginzburg2021selfsupervised,
     title={Self-Supervised Document Similarity Ranking via Contextualized Language Models and Hierarchical Inference}, 
     author={Dvir Ginzburg and Itzik Malkiel and Oren Barkan and Avi Caciularu and Noam Koenigstein},
     year={2021},
     eprint={2106.01186},
     archivePrefix={arXiv},
     primaryClass={cs.CL}
}

Contact: Dvir Ginzburg, Itzik Malkiel.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Torch-ngp - A pytorch implementation of the hash encoder proposed in instant-ngp

HashGrid Encoder (WIP) A pytorch implementation of the HashGrid Encoder from ins

hawkey 1k Jan 01, 2023
Official Code Implementation of the paper : XAI for Transformers: Better Explanations through Conservative Propagation

Official Code Implementation of The Paper : XAI for Transformers: Better Explanations through Conservative Propagation For the SST-2 and IMDB expermin

Ameen Ali 23 Dec 30, 2022
A high-performance anchor-free YOLO. Exceeding yolov3~v5 with ONNX, TensorRT, NCNN, and Openvino supported.

YOLOX is an anchor-free version of YOLO, with a simpler design but better performance! It aims to bridge the gap between research and industrial communities. For more details, please refer to our rep

7.7k Jan 06, 2023
Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

An Image is Worth 16x16 Words, What is a Video Worth? paper Official PyTorch Implementation Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor DAMO Academy, Al

213 Nov 12, 2022
catch-22: CAnonical Time-series CHaracteristics

catch22 - CAnonical Time-series CHaracteristics About catch22 is a collection of 22 time-series features coded in C that can be run from Python, R, Ma

Carl H Lubba 229 Oct 21, 2022
Relative Human dataset, CVPR 2022

Relative Human (RH) contains multi-person in-the-wild RGB images with rich human annotations, including: Depth layers (DLs): relative depth relationsh

Yu Sun 112 Dec 02, 2022
An efficient PyTorch implementation of the evaluation metrics in recommender systems.

recsys_metrics An efficient PyTorch implementation of the evaluation metrics in recommender systems. Overview • Installation • How to use • Benchmark

Xingdong Zuo 12 Dec 02, 2022
FinEAS: Financial Embedding Analysis of Sentiment 📈

FinEAS: Financial Embedding Analysis of Sentiment 📈 (SentenceBERT for Financial News Sentiment Regression) This repository contains the code for gene

LHF Labs 31 Dec 13, 2022
A CNN implementation using only numpy. Supports multidimensional images, stride, etc.

A CNN implementation using only numpy. Supports multidimensional images, stride, etc. Speed up due to heavy use of slicing and mathematical simplification..

2 Nov 30, 2021
Python scripts form performing stereo depth estimation using the HITNET model in ONNX.

ONNX-HITNET-Stereo-Depth-estimation Python scripts form performing stereo depth estimation using the HITNET model in ONNX. Stereo depth estimation on

Ibai Gorordo 30 Nov 08, 2022
This code is for eCaReNet: explainable Cancer Relapse Prediction Network.

eCaReNet This code is for eCaReNet: explainable Cancer Relapse Prediction Network. (Towards Explainable End-to-End Prostate Cancer Relapse Prediction

Institute of Medical Systems Biology 2 Jul 28, 2022
A bunch of random PyTorch models using PyTorch's C++ frontend

PyTorch Deep Learning Models using the C++ frontend Gettting started Clone the repo 1. https://github.com/mrdvince/pytorchcpp 2. cd fashionmnist or

Vince 0 Jul 13, 2021
[UNMAINTAINED] Automated machine learning for analytics & production

auto_ml Automated machine learning for production and analytics Installation pip install auto_ml Getting started from auto_ml import Predictor from au

Preston Parry 1.6k Jan 02, 2023
Image Segmentation with U-Net Algorithm on Carvana Dataset using AWS Sagemaker

Image Segmentation with U-Net Algorithm on Carvana Dataset using AWS Sagemaker This is a full project of image segmentation using the model built with

Htin Aung Lu 1 Jan 04, 2022
SimulLR - PyTorch Implementation of SimulLR

PyTorch Implementation of SimulLR There is an interesting work[1] about simultan

11 Dec 22, 2022
An implementation of the AdaOPS (Adaptive Online Packing-based Search), which is an online POMDP Solver used to solve problems defined with the POMDPs.jl generative interface.

AdaOPS An implementation of the AdaOPS (Adaptive Online Packing-guided Search), which is an online POMDP Solver used to solve problems defined with th

9 Oct 05, 2022
A flexible submap-based framework towards spatio-temporally consistent volumetric mapping and scene understanding.

Panoptic Mapping This package contains panoptic_mapping, a general framework for semantic volumetric mapping. We provide, among other, a submap-based

ETHZ ASL 194 Dec 20, 2022
Code for our paper A Transformer-Based Feature Segmentation and Region Alignment Method For UAV-View Geo-Localization,

FSRA This repository contains the dataset link and the code for our paper A Transformer-Based Feature Segmentation and Region Alignment Method For UAV

Dmmm 32 Dec 18, 2022
Data Preparation, Processing, and Visualization for MoVi Data

MoVi-Toolbox Data Preparation, Processing, and Visualization for MoVi Data, https://www.biomotionlab.ca/movi/ MoVi is a large multipurpose dataset of

Saeed Ghorbani 51 Nov 27, 2022
This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

TransFG: A Transformer Architecture for Fine-grained Recognition Official PyTorch code for the paper: TransFG: A Transformer Architecture for Fine-gra

Ju He 307 Jan 03, 2023