The following links briefly explain semantic search and how search systems work using retrieve and rerank.

Last update: Jan 28, 2022

Related tags

Text Data & NLP, information_retrieval

Overview

Main Idea

The following links briefly explain the idea behind semantic search and how search systems use retrieve and rerank: a first stage retrieves candidate documents, and a second stage reorders them by relevance.

Setup

Download trained models

Two models have been trained for Spanish: a bi-encoder and a cross-encoder. Together they form the retrieval system, following the retrieve and rerank idea:

make setup
pip install -r requirements.txt
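
To illustrate how the two models divide the work, here is a minimal, self-contained sketch of retrieve and rerank. It does not use the trained models or this package: `embed` is a toy bag-of-words stand-in for the bi-encoder, `pair_score` is a toy stand-in for the cross-encoder, and all function names are hypothetical, chosen only for this example.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_score(query, doc):
    """Toy stand-in for a cross-encoder: looks at query and document together.

    It counts shared word bigrams, so word order matters here,
    unlike in embed().
    """
    q = query.lower().split()
    d = doc.lower().split()
    return len(set(zip(q, q[1:])) & set(zip(d, d[1:])))


def retrieve(query, doc_vectors, top_k=3):
    """Stage 1: rank precomputed document vectors by similarity to the query."""
    q_vec = embed(query)
    scored = [(doc_id, cosine(q_vec, vec)) for doc_id, vec in doc_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]


def rerank(query, candidate_ids, docs):
    """Stage 2: reorder only the retrieved candidates with the pair scorer."""
    return sorted(candidate_ids, key=lambda i: pair_score(query, docs[i]), reverse=True)


docs = {
    "a": "bella es la vida",
    "b": "la vida es bella",
    "c": "el perro come",
}
# Document vectors are computed once, ahead of query time.
doc_vectors = {doc_id: embed(text) for doc_id, text in docs.items()}

query = "la vida es bella"
candidates = retrieve(query, doc_vectors, top_k=2)
ranked = rerank(query, candidates, docs)
print(candidates, ranked)
```

The design point is that the first stage compares the query against vectors computed in advance, which keeps it cheap enough to run over the whole index, while the second stage scores each query and document pair jointly and is therefore applied only to the few retrieved candidates.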

Basic usage

Set up an Elasticsearch index with semantic vectors. This step assumes that a set of JSON files is stored in a folder. Each JSON file may contain several optional fields, but it must contain the id and text fields.

from information_retrieval import SemanticEmbedder, CrossEncoder, Prepare, Search

data_folder = 'data/'
text_field = "texto_parrafo"
id_field = "id_parrafo"
elastic_index_name = "sentencias_2.0"

# Read the files, compute embeddings, and upload them to Elasticsearch
P = Prepare(data_folder, text_field, id_field, elastic_index_name)
P.prepare()
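
Before indexing, it may help to check that every file has the two required fields. The sketch below is not part of the package: `find_invalid_files` is a hypothetical helper, and it assumes that each file holds a single JSON object, which the description above does not confirm.

```python
import json
import tempfile
from pathlib import Path


def find_invalid_files(data_folder, text_field, id_field):
    """Return the names of JSON files that lack the id or text field."""
    invalid = []
    for path in sorted(Path(data_folder).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            record = json.load(f)
        # Assumption: one JSON object (dict) per file.
        if not isinstance(record, dict) or id_field not in record or text_field not in record:
            invalid.append(path.name)
    return invalid


# Small demonstration on a temporary folder with one valid and one invalid file.
with tempfile.TemporaryDirectory() as tmp:
    good = {"id_parrafo": 1, "texto_parrafo": "la vida es bella", "extra": "optional"}
    bad = {"id_parrafo": 2}  # missing the text field
    Path(tmp, "ok.json").write_text(json.dumps(good), encoding="utf-8")
    Path(tmp, "bad.json").write_text(json.dumps(bad), encoding="utf-8")
    invalid = find_invalid_files(tmp, "texto_parrafo", "id_parrafo")

print(invalid)
```

If the files instead hold a list of records, the check would need to loop over the list.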

Make queries to retrieve documents:

from information_retrieval import SearchEngine

query = "la vida es bella"
S = SearchEngine(elastic_index_name)
S.retrieve(query) # Only semantic search

S.rerank(query) # Retrieve and rerank

Owner

Sergio Arnaud Gomez
