Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Overview

Cherche

Neural search



Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium sized corpora. Cherche's main strength is its ability to build diverse and end-to-end pipelines.

Alt text

Installation 🤖

pip install cherche

To install the development version:

pip install git+https://github.com/raphaelsty/cherche

Documentation 📜

Documentation is available here. It provides details about retrievers, rankers, pipelines, question answering, summarization, and examples.

QuickStart 💨

Documents 📑

Cherche allows findings the right document within a list of objects. Here is an example of a corpus.

from cherche import data

documents = data.load_towns()

documents[:3]
[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris is the capital and most populous city of France.'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'
  }]

Retriever ranker 🔍

Here is an example of a neural search pipeline composed of a TfIdf that quickly retrieves documents, followed by a ranking model. The ranking model sorts the documents produced by the retriever based on the semantic similarity between the query and the documents.

from cherche import data, retrieve, rank
from sentence_transformers import SentenceTransformer

# List of dicts
documents = data.load_towns()

# Retrieve on fields title and article
retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=30)

# Rank on fields title and article
ranker = rank.Encoder(
    key = "id",
    on = ["title", "article"],
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k = 3,
    path = "encoder.pkl"
)

# Pipeline creation
search = retriever + ranker

search.add(documents=documents)

search("Bordeaux")
[{'id': 57, 'similarity': 0.69513476},
 {'id': 63, 'similarity': 0.6214991},
 {'id': 65, 'similarity': 0.61809057}]

Map the index to the documents to access their contents.

search += documents
search("Bordeaux")
[{'id': 57,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux ( bor-DOH, French: [bɔʁdo] (listen); Gascon Occitan: Bordèu [buɾˈðɛw]) is a port city on the river Garonne in the Gironde department, Southwestern France.',
  'similarity': 0.69513476},
 {'id': 63,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'The term "Bordelais" may also refer to the city and its surrounding region.',
  'similarity': 0.6214991},
 {'id': 65,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': "Bordeaux is a world capital of wine, with its castles and vineyards of the Bordeaux region that stand on the hillsides of the Gironde and is home to the world's main wine fair, Vinexpo.",
  'similarity': 0.61809057}]

Retrieve 👻

Cherche provides different retrievers that filter input documents based on a query.

  • retrieve.Elastic
  • retrieve.TfIdf
  • retrieve.Lunr
  • retrieve.BM25Okapi
  • retrieve.BM25L
  • retrieve.Flash
  • retrieve.Encoder

Rank 🤗

Cherche rankers are compatible with SentenceTransformers models, Hugging Face sentence similarity models, Hugging Face zero shot classification models, and of course with your own models.

Summarization and question answering

Cherche provides modules dedicated to summarization and question answering. These modules are compatible with Hugging Face's pre-trained models and can be fully integrated into neural search pipelines.

Acknowledgements 👏

The BM25 models available in Cherche are wrappers around rank_bm25. Elastic retriever is a wrapper around Python Elasticsearch Client. TfIdf retriever is a wrapper around scikit-learn's TfidfVectorizer. Lunr retriever is a wrapper around Lunr.py. Flash retriever is a wrapper around FlashText. DPR and Encode rankers are wrappers dedicated to the use of the pre-trained models of SentenceTransformers in a neural search pipeline. ZeroShot ranker is a wrapper dedicated to the use of the zero-shot sequence classifiers of Hugging Face in a neural search pipeline.

See also 👀

Cherche is a minimalist solution and meets a need for modularity. Cherche is the way to go if you start with a list of documents as JSON with multiple fields to search on and want to create pipelines. Also ,Cherche is well suited for middle sized corpora.

Do not hesitate to look at Haystack, Jina, or TxtAi which offer very advanced solutions for neural search and are great.

Dev Team 💾

The Cherche dev team is made up of Raphaël Sourty and François-Paul Servant 🥳

Comments
  • Added spelling corrector object

    Added spelling corrector object

    Hello ! I added a spelling corrector base class as well as the original implementation of the Norvig spelling corrector. The spelling corrector can be fitted directly on the pipeline's documents with the '.add(documents)' method. I also provided an optional (defaults to False) external dictionary, the one originally used by Norvig.

    I have no issue updating my code for improvements, so feel free to suggest any modification !

    opened by NicolasBizzozzero 4
  • 0.0.5

    0.0.5

    Pull request for Cherche version 0.0.5

    • RAG: add RAG generator for open domain question answering
    • RapidFuzzy: New blazzing fast retriever
    • Retrievers: Provide similarities for each retriever
    • Union & Intersection: Keep similarity scores
    opened by raphaelsty 1
  • Batch processing

    Batch processing

    Retrieving documents with batch of queries can significantly speed up things. It is now available for few models using the development version via the batch method.

    Models involved are:

    • TfIdf retriever
    • Encoder retriever (milvus + faiss)
    • Encoder ranker (milvus)
    • DPR retriever (milvus + faiss)
    • DPR ranker (milvus)
    • Recommend retriever

    Batch is not yet compatible with pipelines.

    enhancement 
    opened by raphaelsty 0
  • Cherche 1.0.0

    Cherche 1.0.0

    Here is an essential update for Cherche. The update retains the previous API and is compatible with previous versions. 🥳

    Main additions:

    • Added compatibility with two new open-source retrievers: Meilisearch and TypeSense.
    • Compatibility with the Milvus index to use the retriever.Encoder and retriever.DPR models on massive corpora.
    • Compatibility with the Milvus index to store ranker embeddings in a database rather than in memory.
    • Progress bar when pre-computing embeddings by Encoder, DPR retrievers and Encoder, DPR rankers.
    • All pipelines (voting, intersection, concatenation) produce a similarity score. To do so, the pipeline object applies a softmax to normalize the scores, thus allowing us to "compare" the scores of two distinct models.
    • Integration of collaborative filtering models via adding a Recommend retriever and a Recommend ranker (indexation via Faiss and compatible with Milvus) to consider users' preferences in the search.
    opened by raphaelsty 0
  • "IndexError: index out of range in self "While adding documents to cherche pipeline

    I'm using a cherche pipline built of a tfidf retriever with a sentencetransformer ranker as follows : search = (retriever + ranker) While trying to add documents to the pipeline (search.add(documents=documents), I got this error :

    """/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse) 2181 # remove once script supports set_grad_enabled 2182 no_grad_embedding_renorm(weight, input, max_norm, norm_type) -> 2183 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) 2184 2185

    IndexError: index out of range in self"""

    opened by delmetni 0
  • incomplete doc about metrics

    incomplete doc about metrics

    opened by fpservant 0
Releases(1.0.1)
  • 1.0.1(Oct 27, 2022)

  • 1.0.0(Oct 26, 2022)

    What's Changed

    Here is an essential update for Cherche! 🥳

    • Added compatibility with two new open-source retrievers: Meilisearch and TypeSense.
    • Compatibility with the Milvus index to use the retriever.Encoder and retriever.DPR models on massive corpora.
    • Compatibility with the Milvus index to store ranker embeddings in a database rather than in memory.
    • Progress bar when pre-computing embeddings by Encoder, DPR retrievers and Encoder, DPR rankers.
    • The path parameter is no longer used.
    • All pipelines (voting, intersection, concatenation) produce a similarity score. To do so, the pipeline object applies a softmax to normalize the scores, thus allowing us to "compare" the scores of two distinct models.
    • Integration of collaborative filtering models via adding a Recommend retriever and a Recommend ranker (indexation via Faiss and compatible with Milvus) to consider users' preferences in the search.

    Cherche is now fully compatible with large-scale corpora and deeply integrates collaborative filtering. Updates retains the previous API and is compatible with previous versions.

    Source code(tar.gz)
    Source code(zip)
  • 0.1.0(Jun 16, 2022)

    Added compatibility with the ONNX environment and quantization to significantly speed up sentence transformers and question answering models. 🏎

    It is now possible to choose the type of index for the Encoder and DPR retrievers in order to process the largest corpora while using the GPU.

    Source code(tar.gz)
    Source code(zip)
  • 0.0.9(Apr 13, 2022)

  • 0.0.8(Mar 7, 2022)

  • 0.0.7(Mar 7, 2022)

  • 0.0.6(Mar 3, 2022)

    • Update documentation
    • Update retriever Encoder and DPR, path is optionnal
    • Add deployment documentation
    • Update similarity type
    • Avoid round similarity
    Source code(tar.gz)
    Source code(zip)
  • 0.0.5(Feb 8, 2022)

    • Loading and Saving tutorial
    • Fuzzy retriever
    • Similarities everywhere (retrievers, union, intersection provide similarity scores)
    • RAG generation
    Source code(tar.gz)
    Source code(zip)
  • 0.0.4(Jan 20, 2022)

    Update of the encoder retriever and the DPR retriever. Documents in the Faiss index will not be duplicated. Query embeddings can now be pre-computed for ranker Encoder and ranker DPR to speed up evaluation without having to compute it again.

    Source code(tar.gz)
    Source code(zip)
  • 0.0.3(Jan 13, 2022)

  • 0.0.2(Jan 12, 2022)

    Update of the Cherche dependencies. The previous dependencies were too strict and restrictive as they were limited to a specific version for each package.

    Source code(tar.gz)
    Source code(zip)
Owner
Raphael Sourty
PhD Student @ IRIT and Renault
Raphael Sourty
华为商城抢购手机的Python脚本 Python script of Huawei Store snapping up mobile phones

HUAWEI STORE GO 2021 说明 基于Python3+Selenium的华为商城抢购爬虫脚本,修改自近两年没更新的项目BUY-HW,为女神抢Nova 8(什么时候华为开始学小米玩饥饿营销了?) 原项目的登陆以及抢购部分已经不可用,本项目对原项目进行了改正以适应新华为商城,并增加一些功能

ZhangLiang 111 Dec 22, 2022
Sequence-to-Sequence Framework in PyTorch

nmtpytorch allows training of various end-to-end neural architectures including but not limited to neural machine translation, image captioning and au

LIUM 395 Nov 21, 2022
A python script to prefab your scripts/text files, and re create them with ease and not have to open your browser to copy code or write code yourself

Scriptfab - What is it? A python script to prefab your scripts/text files, and re create them with ease and not have to open your browser to copy code

DevNugget 3 Jul 28, 2021
Transformers and related deep network architectures are summarized and implemented here.

Transformers: from NLP to CV This is a practical introduction to Transformers from Natural Language Processing (NLP) to Computer Vision (CV) Introduct

Ibrahim Sobh 138 Dec 27, 2022
Fidibo.com comments Sentiment Analyser

Fidibo.com comments Sentiment Analyser Introduction This project first asynchronously grab Fidibo.com books comment data using grabber.py and then sav

Iman Kermani 3 Apr 15, 2022
Python bindings to the dutch NLP tool Frog (pos tagger, lemmatiser, NER tagger, morphological analysis, shallow parser, dependency parser)

Frog for Python This is a Python binding to the Natural Language Processing suite Frog. Frog is intended for Dutch and performs part-of-speech tagging

Maarten van Gompel 46 Dec 14, 2022
Code associated with the Don't Stop Pretraining ACL 2020 paper

dont-stop-pretraining Code associated with the Don't Stop Pretraining ACL 2020 paper Citation @inproceedings{dontstoppretraining2020, author = {Suchi

AI2 449 Jan 04, 2023
The SVO-Probes Dataset for Verb Understanding

The SVO-Probes Dataset for Verb Understanding This repository contains the SVO-Probes benchmark designed to probe for Subject, Verb, and Object unders

DeepMind 20 Nov 30, 2022
The aim of this task is to predict someone's English proficiency based on a text input.

English_proficiency_prediction_NLP The aim of this task is to predict someone's English proficiency based on a text input. Using the The NICT JLE Corp

1 Dec 13, 2021
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
Open-source offline translation library written in Python. Uses OpenNMT for translations

Open source neural machine translation in Python. Designed to be used either as a Python library or desktop application. Uses OpenNMT for translations and PyQt for GUI.

Argos Open Tech 1.6k Jan 01, 2023
Prithivida 690 Jan 04, 2023
NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking

pretrain4ir_tutorial NLPIR tutorial: pretrain for IR. pre-train on raw textual corpus, fine-tune on MS MARCO Document Ranking 用作NLPIR实验室, Pre-training

ZYMa 12 Apr 07, 2022
A simple version of DeTR

DeTR-Lite A simple version of DeTR Before you enjoy this DeTR-Lite The purpose of this project is to allow you to learn the basic knowledge of DeTR. P

Jianhua Yang 11 Jun 13, 2022
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
Language-Agnostic SEntence Representations

LASER Language-Agnostic SEntence Representations LASER is a library to calculate and use multilingual sentence embeddings. NEWS 2019/11/08 CCMatrix is

Facebook Research 3.2k Jan 04, 2023
This is a modification of the OpenAI-CLIP repository of moein-shariatnia

This is a modification of the OpenAI-CLIP repository of moein-shariatnia

Sangwon Beak 2 Mar 04, 2022
SDL: Synthetic Document Layout dataset

SDL is the project that synthesizes document images. It facilitates multiple-level labeling on document images and can generate in multiple languages.

Sơn Nguyễn 0 Oct 07, 2021
Top2Vec is an algorithm for topic modeling and semantic search.

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors.

Dimo Angelov 2.4k Jan 06, 2023