Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers.

Last update: Nov 29, 2022

Overview

Cherche

Neural search

Cherche (search in French) allows you to create a neural search pipeline using retrievers and pre-trained language models as rankers. Cherche is meant to be used with small to medium-sized corpora. Its main strength is its ability to build diverse, end-to-end pipelines.

Installation 🤖

pip install cherche

To install the development version:

pip install git+https://github.com/raphaelsty/cherche

Documentation 📜

Documentation is available here. It provides details about retrievers, rankers, pipelines, question answering, summarization, and examples.

QuickStart 💨

Documents 📑

Cherche finds the right document within a list of objects. Here is an example of a corpus.

from cherche import data

documents = data.load_towns()

documents[:3]
[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris is the capital and most populous city of France.'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France.'
  }]

Retriever ranker 🔍

Here is an example of a neural search pipeline composed of a TfIdf retriever that quickly retrieves documents, followed by a ranking model. The ranking model sorts the documents returned by the retriever according to the semantic similarity between the query and each document.

from cherche import data, retrieve, rank
from sentence_transformers import SentenceTransformer

# List of dicts
documents = data.load_towns()

# Retrieve on fields title and article
retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=30)

# Rank on fields title and article
ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k=3,
    path="encoder.pkl",  # the 1.0.0 release notes state that the path parameter is no longer used
)

# Pipeline creation
search = retriever + ranker

search.add(documents=documents)

search("Bordeaux")
[{'id': 57, 'similarity': 0.69513476},
 {'id': 63, 'similarity': 0.6214991},
 {'id': 65, 'similarity': 0.61809057}]
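To make the two-stage idea concrete without installing anything, here is a minimal, standard-library-only sketch of a retrieve-then-rank pipeline. It is not Cherche's implementation: the function names (`retrieve`, `rank`, `search`), the tiny corpus, the term-overlap retriever, and the character-trigram cosine used as a stand-in for a neural encoder are all assumptions made for illustration.

```python
import math
from collections import Counter

# Tiny illustrative corpus (not the output of data.load_towns()).
documents = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital and most populous city of France."},
    {"id": 1, "title": "Bordeaux", "article": "Bordeaux is a port city on the river Garonne."},
    {"id": 2, "title": "Bordeaux", "article": "Bordeaux is a world capital of wine."},
    {"id": 3, "title": "Lyon", "article": "Lyon is a city in France known for its cuisine."},
]


def text_of(doc, on):
    """Concatenate the fields a stage should look at."""
    return " ".join(doc[field] for field in on)


def tokenize(text):
    return text.lower().replace(".", " ").replace(",", " ").split()


def retrieve(query, docs, on, k):
    """Stage 1 (cheap): keep the k documents with the most query-term hits."""
    query_terms = set(tokenize(query))
    scored = []
    for doc in docs:
        counts = Counter(tokenize(text_of(doc, on)))
        hits = sum(counts[term] for term in query_terms)
        if hits > 0:  # documents without any query term are filtered out
            scored.append((hits, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def trigrams(text):
    """Character trigram counts: a toy stand-in for a sentence embedding."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def rank(query, candidates, on, k):
    """Stage 2 (costlier): re-score only the retrieved candidates."""
    query_vector = trigrams(query)
    scored = [
        {"id": doc["id"], "similarity": cosine(query_vector, trigrams(text_of(doc, on)))}
        for doc in candidates
    ]
    scored.sort(key=lambda result: result["similarity"], reverse=True)
    return scored[:k]


def search(query):
    candidates = retrieve(query, documents, on=["title", "article"], k=3)
    return rank(query, candidates, on=["title", "article"], k=2)


results = search("Bordeaux wine")
print(results)
```

The point of the split is cost: the expensive similarity function only ever sees the handful of candidates that survive the cheap first stage.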

Map the index to the documents to access their contents.

search += documents
search("Bordeaux")
[{'id': 57,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux ( bor-DOH, French: [bɔʁdo] (listen); Gascon Occitan: Bordèu [buɾˈðɛw]) is a port city on the river Garonne in the Gironde department, Southwestern France.',
  'similarity': 0.69513476},
 {'id': 63,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'The term "Bordelais" may also refer to the city and its surrounding region.',
  'similarity': 0.6214991},
 {'id': 65,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': "Bordeaux is a world capital of wine, with its castles and vineyards of the Bordeaux region that stand on the hillsides of the Gironde and is home to the world's main wine fair, Vinexpo.",
  'similarity': 0.61809057}]
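Conceptually, mapping the index to the documents is a join on the key field. The sketch below shows that idea in plain Python; `map_documents` is a hypothetical helper written for illustration and is not how Cherche implements `search += documents`.

```python
def map_documents(ranked, documents, key="id"):
    """Attach each ranked result to the full document that shares its key."""
    index = {doc[key]: doc for doc in documents}
    # Keep the ranking order; the result's fields (e.g. similarity) are merged in.
    return [{**index[result[key]], **result} for result in ranked if result[key] in index]


documents = [
    {"id": 57, "title": "Bordeaux", "url": "https://en.wikipedia.org/wiki/Bordeaux"},
    {"id": 63, "title": "Bordeaux", "url": "https://en.wikipedia.org/wiki/Bordeaux"},
]
ranked = [{"id": 57, "similarity": 0.69513476}, {"id": 63, "similarity": 0.6214991}]

enriched = map_documents(ranked, documents)
print(enriched)
```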

Retrieve 👻

Cherche provides different retrievers that filter input documents based on a query.

retrieve.Elastic
retrieve.TfIdf
retrieve.Lunr
retrieve.BM25Okapi
retrieve.BM25L
retrieve.Flash
retrieve.Encoder
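As an illustration of what "filtering documents based on a query" means for a lexical retriever, here is a small pure-Python sketch of Okapi BM25 scoring. It is an assumption-laden sketch, not the code behind retrieve.BM25Okapi: the function names, the whitespace tokenizer, the default `k1` and `b` values, and the non-negative idf variant are choices made here, and libraries differ in the exact idf formula they use.

```python
import math
from collections import Counter


def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in corpus against the query terms."""
    if not corpus:
        return []
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def retrieve(query, documents, on, k):
    """Return at most k documents with a positive BM25 score, best first."""
    corpus = [" ".join(doc[field] for field in on).lower().split() for doc in documents]
    scores = bm25_scores(query.lower().split(), corpus)
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return [{"id": doc["id"], "similarity": score} for score, doc in ranked[:k] if score > 0]


documents = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France"},
    {"id": 1, "title": "Bordeaux", "article": "Bordeaux is a port city on the river Garonne"},
    {"id": 2, "title": "Bordeaux", "article": "Bordeaux is a world capital of wine"},
]

candidates = retrieve("capital wine", documents, on=["title", "article"], k=30)
print(candidates)
```

Documents that share no term with the query score zero and are dropped, which is the filtering behaviour a downstream ranker relies on.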

Rank 🤗

Cherche rankers are compatible with SentenceTransformers models, Hugging Face sentence similarity models, Hugging Face zero-shot classification models, and of course with your own models.

Summarization and question answering

Cherche provides modules dedicated to summarization and question answering. These modules are compatible with Hugging Face's pre-trained models and can be fully integrated into neural search pipelines.

Acknowledgements 👏

The BM25 models available in Cherche are wrappers around rank_bm25. The Elastic retriever is a wrapper around the Python Elasticsearch Client. The TfIdf retriever is a wrapper around scikit-learn's TfidfVectorizer. The Lunr retriever is a wrapper around Lunr.py. The Flash retriever is a wrapper around FlashText. The DPR and Encoder rankers are wrappers dedicated to using the pre-trained models of SentenceTransformers in a neural search pipeline. The ZeroShot ranker is a wrapper dedicated to using the zero-shot sequence classifiers of Hugging Face in a neural search pipeline.

Dev Team 💾

The Cherche dev team is made up of Raphaël Sourty and François-Paul Servant 🥳

Comments

Added spelling corrector object

Hello! I added a spelling corrector base class as well as an implementation of the original Norvig spelling corrector. The spelling corrector can be fitted directly on the pipeline's documents with the '.add(documents)' method. I also provided an optional external dictionary (disabled by default), the one originally used by Norvig.

I have no issue updating my code for improvements, so feel free to suggest any modification!
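For readers unfamiliar with the approach, the following is a compact sketch of a Norvig-style spelling corrector fitted on a list of documents. It is not the code from this pull request: the class name `NorvigCorrector`, the `on` argument, and the ASCII-only tokenizer are assumptions made for this example.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


class NorvigCorrector:
    """Hypothetical corrector: picks the most frequent known word within two edits."""

    def __init__(self):
        self.counts = Counter()

    def add(self, documents, on):
        """Fit word frequencies on the chosen fields (ASCII letters only)."""
        for doc in documents:
            for field in on:
                self.counts.update(re.findall(r"[a-z]+", doc[field].lower()))
        return self

    def _edits1(self, word):
        """All strings one delete, transpose, replace, or insert away from word."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [left + right[1:] for left, right in splits if right]
        transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
        replaces = [left + c + right[1:] for left, right in splits if right for c in LETTERS]
        inserts = [left + c + right for left, right in splits for c in LETTERS]
        return set(deletes + transposes + replaces + inserts)

    def _known(self, words):
        return {word for word in words if word in self.counts}

    def correct(self, word):
        word = word.lower()
        candidates = (
            self._known([word])
            or self._known(self._edits1(word))
            or self._known(e2 for e1 in self._edits1(word) for e2 in self._edits1(e1))
            or {word}  # nothing close enough: leave the word unchanged
        )
        return max(candidates, key=lambda candidate: self.counts[candidate])


documents = [
    {"title": "Bordeaux", "article": "Bordeaux is a world capital of wine"},
    {"title": "Paris", "article": "Paris is the capital of France"},
]
corrector = NorvigCorrector().add(documents, on=["title", "article"])
print(corrector.correct("bordaux"), corrector.correct("wnie"))
```

Fitting on the pipeline's own documents means the corrector only ever proposes words that can actually be found in the corpus.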

opened by NicolasBizzozzero 4
0.0.5
Pull request for Cherche version 0.0.5

RAG: add RAG generator for open domain question answering

RapidFuzzy: New blazing-fast retriever

Retrievers: Provide similarities for each retriever

Union & Intersection: Keep similarity scores
opened by raphaelsty 1
Batch processing
Retrieving documents for a batch of queries can significantly speed things up. Batch retrieval is now available for a few models in the development version via the batch method.

Models involved are:

TfIdf retriever

Encoder retriever (milvus + faiss)

Encoder ranker (milvus)

DPR retriever (milvus + faiss)

DPR ranker (milvus)

Recommend retriever

Batch is not yet compatible with pipelines.
opened by raphaelsty 0
Cherche 1.0.0
Here is an essential update for Cherche. The update retains the previous API and is compatible with previous versions. 🥳

Main additions:

Added compatibility with two new open-source retrievers: Meilisearch and TypeSense.

Compatibility with the Milvus index to use the retriever.Encoder and retriever.DPR models on massive corpora.

Compatibility with the Milvus index to store ranker embeddings in a database rather than in memory.

Progress bar when pre-computing embeddings by Encoder, DPR retrievers and Encoder, DPR rankers.

All pipelines (voting, intersection, concatenation) produce a similarity score. To do so, the pipeline object applies a softmax to normalize the scores, thus allowing us to "compare" the scores of two distinct models.

Integration of collaborative filtering models via adding a Recommend retriever and a Recommend ranker (indexation via Faiss and compatible with Milvus) to consider users' preferences in the search.
opened by raphaelsty 0
"IndexError: index out of range in self" while adding documents to a Cherche pipeline

I'm using a Cherche pipeline built from a TfIdf retriever and a SentenceTransformer ranker, as follows: search = retriever + ranker. While trying to add documents to the pipeline with search.add(documents=documents), I got this error:

"""
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2181         # remove once script supports set_grad_enabled
   2182         no_grad_embedding_renorm(weight, input, max_norm, norm_type)
-> 2183     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2184
   2185

IndexError: index out of range in self
"""

opened by delmetni 0
incomplete doc about metrics

At https://raphaelsty.github.io/cherche/examples/eval_pipeline/ you say that an explanation of the metrics can be found at https://amitness.com/2020/08/information-retrieval-evaluation/, but that page does not say what "precision" and "r-precision" are.
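For context, both names have standard definitions in information retrieval: precision at k is the fraction of the top k results that are relevant, and R-precision is the precision at rank R, where R is the number of relevant documents for the query. The sketch below implements these textbook definitions; whether Cherche's evaluation computes them in exactly this way is an assumption, and the function names are chosen here for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def r_precision(retrieved, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:r] if doc_id in relevant) / r


retrieved = [57, 12, 63, 8, 65]  # ids returned by a pipeline, best first (made-up values)
relevant = {57, 63, 65}          # ids judged relevant for the query (made-up values)

print(precision_at_k(retrieved, relevant, k=5))  # 3 relevant among the top 5
print(r_precision(retrieved, relevant))          # 2 relevant among the top 3
```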

opened by fpservant 0

Releases(1.0.1)

1.0.1(Oct 27, 2022)

Removed the dependency on grpcio, which can cause problems during installation.
1.0.0(Oct 26, 2022)
What's Changed

Here is an essential update for Cherche! 🥳

Added compatibility with two new open-source retrievers: Meilisearch and TypeSense.

Compatibility with the Milvus index to use the retriever.Encoder and retriever.DPR models on massive corpora.

Compatibility with the Milvus index to store ranker embeddings in a database rather than in memory.

Progress bar when pre-computing embeddings by Encoder, DPR retrievers and Encoder, DPR rankers.

The path parameter is no longer used.

All pipelines (voting, intersection, concatenation) produce a similarity score. To do so, the pipeline object applies a softmax to normalize the scores, thus allowing us to "compare" the scores of two distinct models.

Integration of collaborative filtering models via adding a Recommend retriever and a Recommend ranker (indexation via Faiss and compatible with Milvus) to consider users' preferences in the search.
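The softmax normalization mentioned above can be sketched as follows. This is an illustration of the general technique, not the pipeline object's actual code: the `normalize` helper and the raw scores are made up, and applying the softmax per model and per query is an assumption.

```python
import math


def softmax(scores):
    """Map arbitrary real scores to positive values that sum to 1."""
    if not scores:
        return []
    highest = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def normalize(results):
    """Replace each result's similarity with its softmax-normalized value."""
    probabilities = softmax([result["similarity"] for result in results])
    return [{**result, "similarity": p} for result, p in zip(results, probabilities)]


# Raw scores from two hypothetical models living on very different scales.
lexical_results = [{"id": 57, "similarity": 12.4}, {"id": 63, "similarity": 9.1}]
encoder_results = [{"id": 63, "similarity": 0.71}, {"id": 65, "similarity": 0.62}]

lexical_normalized = normalize(lexical_results)
encoder_normalized = normalize(encoder_results)
print(lexical_normalized)
print(encoder_normalized)
```

After normalization both lists live on the same 0 to 1 scale and keep their internal order, which is what makes a rough comparison across models possible.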

Cherche is now fully compatible with large-scale corpora and deeply integrates collaborative filtering. The update retains the previous API and is compatible with previous versions.
0.1.0(Jun 16, 2022)

Added compatibility with the ONNX environment and quantization to significantly speed up sentence transformers and question answering models. 🏎

It is now possible to choose the type of index for the Encoder and DPR retrievers in order to process the largest corpora while using the GPU.
0.0.9(Apr 13, 2022)

Voting operator dedicated to retrievers and rankers.
0.0.8(Mar 7, 2022)

Avoid checking similarities in TF-IDF retrievers while filtering documents.
0.0.7(Mar 7, 2022)
Significant improvement in the speed of the TF-IDF retriever using a sparse CSC matrix.

The setup.py file loads the readme file as UTF-8.

0.0.6(Mar 3, 2022)
Update documentation

Update the Encoder and DPR retrievers: path is optional

Add deployment documentation

Update similarity type

Avoid rounding similarity scores

0.0.5(Feb 8, 2022)
Loading and Saving tutorial

Fuzzy retriever

Similarities everywhere (retrievers, union, intersection provide similarity scores)

RAG generation

0.0.4(Jan 20, 2022)

Update of the encoder retriever and the DPR retriever. Documents in the Faiss index will not be duplicated. Query embeddings can now be pre-computed for ranker Encoder and ranker DPR to speed up evaluation without having to compute it again.
0.0.3(Jan 13, 2022)
Adding:

Translation

DPR retriever

0.0.2(Jan 12, 2022)

Update of the Cherche dependencies. The previous dependencies were too strict and restrictive as they were limited to a specific version for each package.
0.0.1(Jan 8, 2022)


Owner

Raphael Sourty

PhD Student @ IRIT and Renault

