An Open-Source Package for Information Retrieval.

Last update: Dec 27, 2022

OpenMatch

😃 What's New

Top Spot on the TREC-COVID Challenge (May 2020, Round 2)

The twin goals of the challenge are to evaluate search algorithms and systems for helping scientists, clinicians, policy makers, and others manage the existing and rapidly growing corpus of scientific literature related to COVID-19, and to discover methods that will assist with managing scientific information in future global biomedical crises.
>> Reproduce Our Submission >> About the COVID-19 Dataset >> Our Paper

Overview

OpenMatch integrates neural methods and technologies into a complete solution for deep text matching and understanding. The documentation and tutorials for OpenMatch are available here.

1/ Document Retrieval

Document retrieval selects a set of documents related to a user query from a large-scale document collection.

* Sparse Retrieval

A sparse retriever is a sparse bag-of-words retrieval model, such as BM25.
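As a rough illustration of bag-of-words scoring, the sketch below implements one common BM25 variant in plain Python. It is not OpenMatch's code: the function name `bm25_scores`, the toy documents, and the defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "lymphocyte count predicts covid-19 prognosis".split(),
    "neural ranking models for web search".split(),
    "covid-19 treatment classification study".split(),
]
scores = bm25_scores("covid-19 treatment".split(), docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

The third document matches both query terms and is shorter than average, so it ranks first; the second document shares no terms with the query and scores zero.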

* Dense Retrieval

A dense retriever encodes documents and queries into dense low-dimensional vectors and selects the documents that have the highest inner product with the query vector.
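The inner-product selection step can be sketched as follows. This is a minimal exhaustive search over toy vectors, not OpenMatch's implementation; the helper names and the 3-dimensional embeddings are hypothetical stand-ins for real encoder outputs, and large collections would typically use an approximate nearest neighbor index instead of scoring every document.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_retrieve(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, score) pairs for the top_k documents by inner product."""
    scored = [(i, inner_product(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings (hypothetical); a real system would obtain these from an encoder.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.5, 0.5, 0.0],
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 0.0],
]
top = dense_retrieve(query_vec, doc_vecs, top_k=2)
print(top)
```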

2/ Document Reranking

Document reranking rescores the documents retrieved in the previous step against the user query to produce a ranked list of relevant documents.

* Neural Ranker

A neural ranker uses a neural network to reorder the retrieved documents.
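The reordering step itself is independent of the model that produces the scores. The sketch below shows it with a placeholder scoring function; `rerank` and `overlap_score` are hypothetical names, and the term-overlap score only stands in for a neural model's output so that the example runs without any trained weights.

```python
from typing import Callable, List, Tuple

def rerank(query: str,
           candidates: List[str],
           score_fn: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Score every candidate against the query and sort by descending score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

def overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer: fraction of query terms that occur in the document."""
    q_terms = set(query.lower().split())
    d_terms = set(doc.lower().split())
    return len(q_terms & d_terms) / len(q_terms)

# Candidates as they might come back from a first-stage retriever (toy data).
candidates = [
    "lymphocyte count in covid-19 patients",
    "treatment and classification of covid-19",
    "web search ranking",
]
reranked = rerank("classification treatment covid-19", candidates, overlap_score)
print(reranked[0])
```

In practice `score_fn` would wrap a trained ranker and return its ranking score for each query-document pair.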

* Feature Ensemble

Feature ensemble fuses the neural features learned by a neural ranker with the features of non-neural methods to obtain more robust performance.
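One way to picture this fusion is a linear model whose weights are tuned one coordinate at a time to maximize a ranking metric. The sketch below is a heavily simplified coordinate-ascent loop over a fixed grid, assuming two features per document (a sparse score and a neural score) and reciprocal rank as the objective. All names, the grid, and the toy data are assumptions, and this is not the implementation used by OpenMatch.

```python
def fuse(features, weights):
    """Linear fusion: weighted sum of one document's feature values."""
    return sum(w * f for w, f in zip(weights, features))

def reciprocal_rank(scores, labels):
    """1 / rank of the first relevant document when sorted by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for rank, i in enumerate(order, start=1):
        if labels[i] > 0:
            return 1.0 / rank
    return 0.0

def coordinate_ascent(docs, labels, n_features,
                      grid=(0.0, 0.25, 0.5, 0.75, 1.0), sweeps=2):
    """Tune one weight at a time, keeping a change only if the metric improves."""
    weights = [1.0] * n_features

    def quality(w):
        return reciprocal_rank([fuse(d, w) for d in docs], labels)

    best = quality(weights)
    for _ in range(sweeps):
        for j in range(n_features):
            for value in grid:
                trial = weights[:j] + [value] + weights[j + 1:]
                q = quality(trial)
                if q > best:
                    best, weights = q, trial
    return weights, best

# Toy features per document: [sparse_score, neural_score] (hypothetical values).
docs = [[2.0, 0.1], [1.0, 0.9], [0.5, 0.2]]
labels = [0, 1, 0]  # only the second document is relevant
weights, best = coordinate_ascent(docs, labels, n_features=2)
print(weights, best)
```

With equal weights the non-relevant first document wins on its sparse score; the search lowers that weight until the relevant document ranks first.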

3/ Domain Transfer Learning

Domain transfer learning leverages external knowledge graphs or weak supervision data to help the ranker overcome data scarcity.

* Knowledge Enhancement

Knowledge enhancement incorporates entity semantics from external knowledge graphs to enhance the neural ranker.

* Data Augmentation

Data augmentation leverages weak supervision data to improve ranking accuracy in domains that lack large-scale relevance labels.

Stage	Model	Reference
1/ Sparse Retrieval	BM25	Best Matching 25 (tool)
1/ Dense Retrieval	ANN	Approximate nearest neighbor (tool)
2/ Neural Ranker	K-NRM	End-to-End Neural Ad-hoc Ranking with Kernel Pooling (paper)
2/ Neural Ranker	Conv-KNRM	Convolutional Neural Networks for Soft-Matching N-Grams in Ad-hoc Search (paper)
2/ Neural Ranker	TK	Interpretable & Time-Budget-Constrained Contextualization for Re-Ranking (paper)
2/ Neural Ranker	BERT	BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (paper)
2/ Feature Ensemble	Coordinate Ascent	Linear Feature-Based Models for Information Retrieval, Information Retrieval (paper)
3/ Knowledge Enhancement	EDRM	Entity-Duet Neural Ranking: Understanding the Role of Knowledge Graph Semantics in Neural Information Retrieval (paper)
3/ Data Augmentation	ReInfoSelect	Selective Weak Supervision for Neural Information Retrieval (paper)

Note that the BERT model follows Hugging Face's transformers implementation, so other BERT-like models, e.g. ELECTRA and SciBERT, are also available in the toolkit.

Installation

* With pip (from GitHub)

pip install git+https://github.com/thunlp/OpenMatch.git

* From Source

git clone https://github.com/thunlp/OpenMatch.git
cd OpenMatch
python setup.py install

* From Docker

To build an OpenMatch Docker image from the Dockerfile:

docker build -t <image_name> .

To run the image you just built as a container:

docker run --gpus all --name=<container_name> -it -v /:/all/ --rm <image_name>:<TAG>

Quick Start

* Detailed examples are available here.

import torch
import OpenMatch as om

query = "Classification treatment COVID-19"
doc = "By retrospectively tracking the dynamic changes of LYM% in death cases and cured cases, this study suggests that lymphocyte count is an effective and reliable indicator for disease classification and prognosis in COVID-19 patients."

* For bert-like models:

from transformers import AutoTokenizer

# Encode the query and document together as a single input sequence.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
input_ids = tokenizer.encode(query, doc)
model = om.models.Bert("allenai/scibert_scivocab_uncased")
# unsqueeze(0) adds the batch dimension.
ranking_score, ranking_features = model(torch.tensor(input_ids).unsqueeze(0))

* For other models:

# Word-level tokenizer backed by pretrained GloVe embeddings.
tokenizer = om.data.tokenizers.WordTokenizer(pretrained="./data/glove.6B.300d.txt")
# Query and document are tokenized separately, each with its own length limit.
query_ids, query_masks = tokenizer.process(query, max_len=16)
doc_ids, doc_masks = tokenizer.process(doc, max_len=128)
model = om.models.KNRM(vocab_size=tokenizer.get_vocab_size(),
                       embed_dim=tokenizer.get_embed_dim(),
                       embed_matrix=tokenizer.get_embed_matrix())
# unsqueeze(0) adds the batch dimension to every input.
ranking_score, ranking_features = model(torch.tensor(query_ids).unsqueeze(0),
                                        torch.tensor(query_masks).unsqueeze(0),
                                        torch.tensor(doc_ids).unsqueeze(0),
                                        torch.tensor(doc_masks).unsqueeze(0))

* The GloVe embeddings can be downloaded with:

wget http://nlp.stanford.edu/data/glove.6B.zip -P ./data
unzip ./data/glove.6B.zip -d ./data

* Evaluation

# qrels (relevance judgments) and ranking_list (ranked results) must be defined beforehand.
metric = om.Metric()
ndcg = metric.get_metric(qrels, ranking_list, 'ndcg_cut_20')
mrr = metric.get_mrr(qrels, ranking_list, 'mrr_cut_10')
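For readers unfamiliar with the two metric names used above, the sketch below computes MRR and NDCG at a cutoff for a single query in plain Python. It is a standalone illustration, not the `om.Metric` implementation; the function names, the dictionary-based inputs, and the linear-gain NDCG convention are assumptions, and other tools may use a different gain function.

```python
import math

def mrr_at_k(ranked_doc_ids, relevance, k=10):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_doc_ids, relevance, k=20):
    """NDCG with linear gain and a log2(rank + 1) discount."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy judgments and ranking for one query (hypothetical document ids).
relevance = {"d1": 1, "d3": 2}
ranking = ["d2", "d3", "d1"]

mrr = mrr_at_k(ranking, relevance, k=10)
ndcg = ndcg_at_k(ranking, relevance, k=20)
print(mrr, round(ndcg, 4))
```

Corpus-level numbers like those in the tables below are typically the mean of such per-query values.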

Experiments

* Ad-hoc Search

Retriever	Reranker	Coor-Ascent	ClueWeb09	Robust04	ClueWeb12
SDM	KNRM	-	0.1880	0.3016	0.0968
SDM	Conv-KNRM	-	0.1894	0.2907	0.0896
SDM	EDRM	-	0.2015	0.2993	0.0937
SDM	TK	-	0.2306	0.2822	0.0966
SDM	BERT Base	-	0.2701	0.4168	0.1183
SDM	ELECTRA Base	-	0.2861	0.4668	0.1078

* MS MARCO Passage Ranking

Retriever	Reranker	Coor-Ascent	dev	eval
BM25	BERT Base	-	0.349	0.345
BM25	ELECTRA Base	-	0.352	0.344
BM25	RoBERTa Large	-	0.386	0.375
BM25	ELECTRA Large	-	0.388	0.376

* MS MARCO Document Ranking

Retriever	Reranker	Coor-Ascent	dev	eval
ANCE FirstP	-	-	0.373	0.334
ANCE MaxP	-	-	0.383	0.342
ANCE FirstP+BM25	BERT Base FirstP	+	0.431	0.380
ANCE MaxP	BERT Base MaxP	+	0.432	0.391

* Classic Features

Methods	ClueWeb09-B		Robust04		TREC-COVID	
Methods	NDCG@20	ERR@20	NDCG@20	ERR@20	NDCG@20	P@20
BM25 (Anserini)	0.2773	0.1426	0.4129	0.1117	0.6979	0.7670
RankSVM (Dai et al.)	0.289	n.a.	0.420	n.a.	n.a.	n.a.
RankSVM (OpenMatch)	0.2825	0.1476	0.4309	0.1173	0.6995	0.7570
Coor-Ascent (Dai et al.)	0.295	n.a.	0.427	n.a.	n.a.	n.a.
Coor-Ascent (OpenMatch)	0.2969	0.1581	0.4340	0.1171	0.7041	0.7770

Contribution

Thanks to all the people who contributed to OpenMatch!

Kaitao Zhang, Si Sun, Zhenghao Liu, Aowei Lu

Project Organizers

Zhiyuan Liu
- Tsinghua University
- Homepage
Chenyan Xiong
- Microsoft Research AI
- Homepage
Maosong Sun
- Tsinghua University
- Homepage

Citation

@inproceedings{openmatch,
  author = {Liu, Zhenghao and Zhang, Kaitao and Xiong, Chenyan and Liu, Zhiyuan and Sun, Maosong},
  title = {OpenMatch: An Open Source Library for Neu-IR Research},
  booktitle = {Proceedings of SIGIR},
  year = {2021},
  url = {https://doi.org/10.1145/3404835.3462789},
  pages = {2531--2535}
}

Owner

THUNLP
