Search with BERT vectors in Solr and Elasticsearch

Overview

BERT models with Solr and Elasticsearch

streamlit-search_demo_solr-2021-05-13-10-05-91.mp4
streamlit-search_demo_elasticsearch-2021-05-14-22-05-55.mp4

This code is described in the following Medium stories, taking one step at a time:

Neural Search with BERT and Solr (August 18,2020)

Fun with Apache Lucene and BERT Embeddings (November 15, 2020)

Speeding up BERT Search in Elasticsearch (March 15, 2021)

Ask Me Anything about Vector Search (June 20, 2021) This blog post gives the answers to the 3 most interesting questions asked during the AMA session at Berlin Buzzwords 2021. The video recording is available here: https://www.youtube.com/watch?v=blFe2yOD1WA

Bert in Solr hat Bert with_es burger


Tech stack:

  • bert-as-service
  • Hugging Face
  • solr / elasticsearch
  • streamlit
  • Python 3.7

Code for dealing with Solr has been copied from the great (and highly recommended) https://github.com/o19s/hello-ltr project.

Install tensorflow

pip install tensorflow==1.15.3

If you try to install tensorflow 2.3, bert service will fail to start, there is an existing issue about it.

If you encounter issues with the above installation, consider installing full list of packages:

pip install -r requirements_freeze.txt

Let's install bert-as-service components

pip install bert-serving-server

pip install bert-serving-client

Download a pre-trained BERT model

into the bert-model/ directory in this project. I have chosen uncased_L-12_H-768_A-12.zip for this experiment. Unzip it.

Now let's start the BERT service

bash start_bert_server.sh

Run a sample bert client

python src/bert_client.py

to compute vectors for 3 sample sentences:

    Bert vectors for sentences ['First do it', 'then do it right', 'then do it better'] : [[ 0.13186474  0.32404128 -0.82704437 ... -0.3711958  -0.39250174
      -0.31721866]
     [ 0.24873531 -0.12334424 -0.38933852 ... -0.44756213 -0.5591355
      -0.11345179]
     [ 0.28627345 -0.18580122 -0.30906814 ... -0.2959366  -0.39310536
       0.07640187]]

This sets up the stage for our further experiment with Solr.

Dataset

This is by far the key ingredient of every experiment. You want to find an interesting collection of texts, that are suitable for semantic level search. Well, maybe all texts are. I have chosen a collection of abstracts from DBPedia, that I downloaded from here: https://wiki.dbpedia.org/dbpedia-version-2016-04 and placed into data/dbpedia directory in bz2 format. You don't need to extract this file onto disk: the provided code will read directly from the compressed file.

Preprocessing and Indexing: Solr

Before running preprocessing / indexing, you need to configure the vector plugin, which allows to index and query the vector data. You can find the plugin for Solr 8.x here: https://github.com/DmitryKey/solr-vector-scoring

After the plugin's jar has been added, configure it in the solrconfig.xml like so:

">

  

Schema also requires an addition: field of type VectorField is required in order to index vector data:

">

  

Find ready-made schema and solrconfig here: https://github.com/DmitryKey/bert-solr-search/tree/master/solr_conf

Let's preprocess the downloaded abstracts, and index them in Solr. First, execute the following command to start Solr:

bin/solr start -m 2g

If during processing you will notice:

<...>/bert-solr-search/venv/lib/python3.7/site-packages/bert_serving/client/__init__.py:299: UserWarning: some of your sentences have more tokens than "max_seq_len=500" set on the server, as consequence you may get less-accurate or truncated embeddings.
here is what you can do:
- disable the length-check by create a new "BertClient(check_length=False)" when you do not want to display this warning
- or, start a new server with a larger "max_seq_len"
  '- or, start a new server with a larger "max_seq_len"' % self.length_limit)

The index_dbpedia_abstracts_solr.py script will output statistics:

Maximum tokens observed per abstract: 697
Flushing 100 docs
Committing changes
All done. Took: 82.46466588973999 seconds

We know how many abstracts there are:

bzcat data/dbpedia/long_abstracts_en.ttl.bz2 | wc -l
5045733

Preprocessing and Indexing: Elasticsearch

This project implements several ways to index vector data:

  • src/index_dbpedia_abstracts_elastic.py vanilla Elasticsearch: using dense_vector data type
  • src/index_dbpedia_abstracts_elastiknn.py Elastiknn plugin: implements own data type. I used elastiknn_dense_float_vector
  • src/index_dbpedia_abstracts_opendistro.py OpenDistro for Elasticsearch: uses nmslib to build Hierarchical Navigable Small World (HNSW) graphs during indexing

Each indexer relies on ready-made Elasticsearch mapping file, that can be found in es_conf/ directory.

Preprocessing and Indexing: GSI APU

In order to use GSI APU solution, a user needs to produce two files: numpy 2D array with vectors of desired dimension (768 in my case) a pickle file with document ids matching the document ids of the said vectors in Elasticsearch.

After these data files get uploaded to the GSI server, the same data gets indexed in Elasticsearch. The APU powered search is performed on up to 3 Leda-G PCIe APU boards. Since I’ve run into indexing performance with bert-as-service solution, I decided to take SBERT approach from Hugging Face to prepare the numpy and pickle array files. This allowed me to index into Elasticsearch freely at any time, without waiting for days. You can use this script to do this on DBPedia data, which allows choosing between:

EmbeddingModel.HUGGING_FACE_SENTENCE (SBERT)
EmbeddingModel.BERT_UNCASED_768 (bert-as-service)

To generate the numpy and pickle files, use the following script: scr/create_gsi_files.py. This script produces two files:

data/1000000_EmbeddingModel.HUGGING_FACE_SENTENCE_vectors.npy
data/1000000_EmbeddingModel.HUGGING_FACE_SENTENCE_vectors_docids.pkl

Both files are perfectly suitable for indexing with Solr and Elasticsearch.

To test the GSI plugin, you will need to upload these files to GSI server for loading them both to Elasticsearch and APU.

Running the BERT search demo

There are two streamlit demos for running BERT search for Solr and Elasticsearch. Each demo compares to BM25 based search. The following assumes that you have bert-as-service up and running (if not, laucnh it with bash start_bert_server.sh) and either Elasticsearch or Solr running with the index containing field with embeddings.

To run a demo, execute the following on the command line from the project root:

# for experiments with Elasticsearch
streamlit run src/search_demo_elasticsearch.py

# for experiments with Solr
streamlit run src/search_demo_solr.py
Owner
Dmitry Kan
I build search engines. Host of the Vector Podcast: https://www.youtube.com/channel/UCCIMPfR7TXyDvlDRXjVhP1g
Dmitry Kan
Various capabilities for static malware analysis.

Malchive The malchive serves as a compendium for a variety of capabilities mainly pertaining to malware analysis, such as scripts supporting day to da

MITRE Cybersecurity 64 Nov 22, 2022
This project converts your human voice input to its text transcript and to an automated voice too.

Human Voice to Automated Voice & Text Introduction: In this project, whenever you'll speak, it will turn your voice into a robot voice and furthermore

Hassan Shahzad 3 Oct 15, 2021
Code for Emergent Translation in Multi-Agent Communication

Emergent Translation in Multi-Agent Communication PyTorch implementation of the models described in the paper Emergent Translation in Multi-Agent Comm

Facebook Research 75 Jul 15, 2022
無料で使える中品質なテキスト読み上げソフトウェア、VOICEVOXの音声合成エンジン

VOICEVOX ENGINE VOICEVOXの音声合成エンジン。 実態は HTTP サーバーなので、リクエストを送信すればテキスト音声合成できます。 API ドキュメント VOICEVOX ソフトウェアを起動した状態で、ブラウザから

Hiroshiba 3 Jul 05, 2022
Neural text generators like the GPT models promise a general-purpose means of manipulating texts.

Boolean Prompting for Neural Text Generators Neural text generators like the GPT models promise a general-purpose means of manipulating texts. These m

Jeffrey M. Binder 20 Jan 09, 2023
Text editor on python tkinter to convert english text to other languages with the help of ployglot.

Transliterator Text Editor This is a simple transliteration program which is used to convert english word to phonetically matching word in another lan

Merin Rose Tom 1 Jan 16, 2022
AI_Assistant - This is a Python based Voice Assistant.

This is a Python based Voice Assistant. This was programmed to increase my understanding of python and also how the in-general Voice Assistants work.

1 Jan 06, 2022
Yet another Python binding for fastText

pyfasttext Warning! pyfasttext is no longer maintained: use the official Python binding from the fastText repository: https://github.com/facebookresea

Vincent Rasneur 230 Nov 16, 2022
Modified GPT using average pooling to reduce the softmax attention memory constraints.

NLP-GPT-Upsampling This repository contains an implementation of Open AI's GPT Model. In particular, this implementation takes inspiration from the Ny

WD 1 Dec 03, 2021
KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정한 코드입니다.

KoBERTopic 모델 소개 KoBERTopic은 BERTopic을 한국어 데이터에 적용할 수 있도록 토크나이저와 BERT를 수정했습니다. 기존 BERTopic : https://github.com/MaartenGr/BERTopic/tree/05a6790b21009d

Won Joon Yoo 26 Jan 03, 2023
State-of-the-art NLP through transformer models in a modular design and consistent APIs.

Trapper (Transformers wRAPPER) Trapper is an NLP library that aims to make it easier to train transformer based models on downstream tasks. It wraps h

Open Business Software Solutions 42 Sep 21, 2022
Abhijith Neil Abraham 2 Nov 05, 2021
Codename generator using WordNet parts of speech database

codenames Codename generator using WordNet parts of speech database References: https://possiblywrong.wordpress.com/2021/09/13/code-name-generator/ ht

possiblywrong 27 Oct 30, 2022
A single model that parses Universal Dependencies across 75 languages.

A single model that parses Universal Dependencies across 75 languages. Given a sentence, jointly predicts part-of-speech tags, morphology tags, lemmas, and dependency trees.

Dan Kondratyuk 189 Nov 29, 2022
Unlimited Call - Text Bombing Tool

FastBomber Unlimited Call - Text Bombing Tool Installation On Termux

Aryan 6 Nov 10, 2022
✔👉A Centralized WebApp to Ensure Road Safety by checking on with the activities of the driver and activating label generator using NLP.

AI-For-Road-Safety Challenge hosted by Omdena Hyderabad Chapter Original Repo Link : https://github.com/OmdenaAI/omdena-india-roadsafety Final Present

Prathima Kadari 7 Nov 29, 2022
Fine-tune GPT-3 with a Google Chat conversation history

Google Chat GPT-3 This repo will help you fine-tune GPT-3 with a Google Chat conversation history. The trained model will be able to converse as one o

Nate Baer 7 Dec 10, 2022
Research code for ECCV 2020 paper "UNITER: UNiversal Image-TExt Representation Learning"

UNITER: UNiversal Image-TExt Representation Learning This is the official repository of UNITER (ECCV 2020). This repository currently supports finetun

Yen-Chun Chen 680 Dec 24, 2022
Pattern Matching in Python

Pattern Matching finalmente chega no Python 3.10. E daí? "Pattern matching", ou "correspondência de padrões" como é conhecido no Brasil. Algumas pesso

Fabricio Werneck 6 Feb 16, 2022
Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Visual Automata Copyright 2021 Lewi Lie Uberg Released under the MIT license Visual Automata is a Python 3 library built as a wrapper for Caleb Evans'

Lewi Uberg 55 Nov 17, 2022