A production-ready, scalable Indexer for the Jina neural search framework, based on HNSW and PSQL

Last update: Oct 14, 2022

Overview

🌟 HNSW + PostgreSQL Indexer

HNSWPostgresIndexer is a production-ready, scalable Indexer for the Jina neural search framework.

It combines the reliability of PostgreSQL with the speed and efficiency of the HNSWlib nearest neighbor library.

It thus provides all the CRUD operations expected of a database system, while also offering fast and reliable vector lookup.

Requires a running PostgreSQL database service. For quick testing, you can run a containerized version locally with:

docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2

Syncing between PSQL and HNSW

By default, all data is stored in a PSQL database (as defined in the arguments). To add your data to the HNSW index, or to build the index in the first place, you need to call the /sync endpoint manually. This iterates through the stored data and adds it to the HNSW index. By default, the sync is incremental: it is applied on top of whatever data the HNSW index already holds. To rebuild the index completely, pass the rebuild parameter, like so:

flow.post(on='/sync', parameters={'rebuild': True})
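
The difference between an incremental sync and a rebuild can be sketched with plain Python containers. This is a toy model under stated assumptions, not the Executor's implementation: ToyStore, ToyIndex, and sync are hypothetical names, a dict stands in for the PSQL table, and a logical counter stands in for row timestamps.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Stands in for the PSQL table: doc_id -> (vector, modified_at)."""
    rows: dict = field(default_factory=dict)
    clock: int = 0  # logical clock used instead of real timestamps

    def upsert(self, doc_id, vector):
        self.clock += 1
        self.rows[doc_id] = (vector, self.clock)


@dataclass
class ToyIndex:
    """Stands in for the HNSW index: doc_id -> vector."""
    vectors: dict = field(default_factory=dict)
    last_sync: int = 0  # clock value of the last completed sync


def sync(store, index, rebuild=False):
    """Copy rows from the store into the index; return how many rows were written."""
    if rebuild:
        # Full rebuild: drop everything and re-add every row.
        index.vectors.clear()
        index.last_sync = 0
    written = 0
    for doc_id, (vector, modified_at) in store.rows.items():
        # Incremental: only rows changed since the last sync are touched.
        if modified_at > index.last_sync:
            index.vectors[doc_id] = vector
            written += 1
    index.last_sync = store.clock
    return written


store, index = ToyStore(), ToyIndex()
store.upsert("a", [0.1, 0.2])
store.upsert("b", [0.3, 0.4])
first = sync(store, index)                 # both rows are new
store.upsert("c", [0.5, 0.6])
second = sync(store, index)                # only "c" changed since the last sync
third = sync(store, index, rebuild=True)   # every row is written again
```

In this model the incremental call writes only what changed, while the rebuild rewrites all rows, which is why a rebuild is the more expensive operation on large datasets.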

At start-up time, the data from PSQL is synced into HNSW automatically. You can disable this with:

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'startup_sync': False}
)

Automatic background syncing

⚠ WARNING: Experimental feature

Optionally, you can enable automatic background syncing of the data into HNSW. This creates a background thread, separate from the main operations, that performs the synchronization at regular intervals. Enable it with the sync_interval constructor argument, like so:

Flow().add(
    uses='jinahub://HNSWPostgresIndexer',
    uses_with={'sync_interval': 5}
)

The sync_interval argument accepts an integer: the number of seconds to wait between synchronization attempts. Adjust it to the amount of data you have. For the duration of a background sync, the HNSW index is locked to avoid an invalid state, so searches are queued. When sync_interval is enabled, the index is also locked during a search, so syncing is queued.
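
The locking behaviour described above can be illustrated with the standard threading module. This is a minimal sketch, assuming a single lock shared by sync and search; BackgroundSyncer and toy_sync are hypothetical names and do not reflect the Executor's internals.

```python
import threading


class BackgroundSyncer:
    """Runs sync_fn every `interval` seconds on a daemon thread.

    One lock guards both sync and search, so the two never overlap.
    """

    def __init__(self, sync_fn, interval):
        self._sync_fn = sync_fn
        self._interval = interval
        self._lock = threading.Lock()
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.sync_count = 0

    def _run(self):
        # Event.wait returns False on timeout and True once stop() was called.
        while not self._stop_event.wait(self._interval):
            with self._lock:  # searches wait here while a sync is running
                self._sync_fn()
                self.sync_count += 1

    def search(self, query_fn):
        with self._lock:  # a running search makes the next sync wait
            return query_fn()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()


index = []
first_sync_done = threading.Event()


def toy_sync():
    index.append("doc")
    first_sync_done.set()


syncer = BackgroundSyncer(toy_sync, interval=0.05)
syncer.start()
first_sync_done.wait(timeout=5)  # block until the first sync has run
size_seen_by_search = syncer.search(lambda: len(index))
syncer.stop()
```

Because both paths take the same lock, a long sync delays searches and a long search delays the next sync. This is the trade-off to keep in mind when choosing the interval.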

CRUD operations

You can perform all the usual operations on the respective endpoints:

/index. Add new data to PostgreSQL.
/search. Query the HNSW index with your Documents.
/update. Update Documents in PostgreSQL.
/delete. Delete Documents in PostgreSQL.

Note. /delete only performs a soft deletion by default, so that looking up a Document id returned by a search does not break. For a hard delete, add 'soft_delete': False to parameters. You can also perform a cleanup after a full rebuild of the HNSW index by calling /cleanup.
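
The soft-delete idea can be illustrated with the standard sqlite3 module. This is a sketch under stated assumptions: the docs table, its is_deleted column, and the helper functions are hypothetical and are not the Executor's actual schema, and cleanup is assumed here to purge rows that were previously soft-deleted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs ("
    "doc_id TEXT PRIMARY KEY, payload TEXT, is_deleted INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO docs (doc_id, payload) VALUES (?, ?)",
    [("a", "first"), ("b", "second"), ("c", "third")],
)


def delete(conn, doc_id, soft_delete=True):
    if soft_delete:
        # Keep the row so that an id returned by a search still resolves.
        conn.execute("UPDATE docs SET is_deleted = 1 WHERE doc_id = ?", (doc_id,))
    else:
        conn.execute("DELETE FROM docs WHERE doc_id = ?", (doc_id,))


def cleanup(conn):
    # Physically remove rows that were only flagged as deleted.
    conn.execute("DELETE FROM docs WHERE is_deleted = 1")


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0]


delete(conn, "a")                      # soft delete: the row stays, flagged
after_soft = count(conn)
delete(conn, "b", soft_delete=False)   # hard delete: the row is gone
after_hard = count(conn)
cleanup(conn)                          # purge the soft-deleted row
after_cleanup = count(conn)
```

In this model the row count only drops after a hard delete or a cleanup, which matches the statement that the stored count includes soft-deleted entries.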

Status endpoint

You can also get information about the status of your data via the /status endpoint. It returns a Document whose tags contain the relevant information, under the following keys:

'psql_docs': number of Documents stored in the PSQL database (includes entries that have been "soft-deleted")
'hnsw_docs': the number of Documents indexed in the HNSW index
'last_sync': the time of the last synchronization of PSQL into HNSW
'pea_id': the shard number

In a sharded environment (parallel>1) you will get one Document from each shard. Each shard reports its own 'hnsw_docs', 'last_sync', and 'pea_id', but they all report the same 'psql_docs', because the PSQL database is available to all your shards. To get the total, sum 'hnsw_docs' across these Documents, like so:

result = f.post('/status', None, return_results=True)
result_docs = result[0].docs
total_hnsw_docs = sum(d.tags['hnsw_docs'] for d in result_docs)
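
The same aggregation can be wrapped in a small helper that works on plain dicts shaped like the tags described above. This is only a sketch: summarize_status is a hypothetical name, the sample values are made up for illustration, and 'last_sync' is assumed to be an ISO-8601 string (which sorts chronologically); the real value may use a different format.

```python
def summarize_status(shard_tags):
    """Combine per-shard status tags into one overall summary."""
    if not shard_tags:
        raise ValueError("expected at least one shard status")
    return {
        # Every shard sees the same database, so take the value once.
        "psql_docs": shard_tags[0]["psql_docs"],
        # Each shard indexes its own slice, so add the counts up.
        "hnsw_docs": sum(tags["hnsw_docs"] for tags in shard_tags),
        # The index as a whole is only as fresh as its stalest shard.
        "oldest_sync": min(tags["last_sync"] for tags in shard_tags),
    }


# Illustrative values only; they do not come from a real deployment.
shard_tags = [
    {"psql_docs": 1000, "hnsw_docs": 480, "last_sync": "2022-04-12T10:00:05", "pea_id": 0},
    {"psql_docs": 1000, "hnsw_docs": 520, "last_sync": "2022-04-12T10:00:02", "pea_id": 1},
]
summary = summarize_status(shard_tags)
```

Comparing the summed 'hnsw_docs' with 'psql_docs' gives a rough indication of whether a sync is pending, bearing in mind that 'psql_docs' also counts soft-deleted entries.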

Comments

Changing how /status method returns its values to try and merge with any pre-existing tags from previous executors if any.

A shot at addressing the issue mentioned in https://github.com/jina-ai/executor-hnsw-postgres/issues/23

opened by louisconcentricsky 6

feat: performance improvements

Closes https://github.com/jina-ai/executor-hnsw-postgres/issues/6

Results before this PR:

indexing 1000 takes 0 seconds (0.22s)
rolling update 3 replicas x 2 shards takes 0 seconds (0.82s)
search with 10 takes 0 seconds (0.23s)

indexing 10000 takes 0 seconds (0.75s)
rolling update 3 replicas x 2 shards takes 9 seconds (9.08s)
search with 10 takes 0 seconds (0.22s)

indexing 100000 takes 7 seconds (7.59s)
rolling update 3 replicas x 2 shards takes 7 minutes and 17 seconds (437.44s)
search with 10 takes 0 seconds (0.22s)

RESULTS NOW

indexing 1000 takes 0 seconds (0.44s)                                                                                   
rolling update 3 replicas x 2 shards takes 0 seconds (0.81s)

indexing 10000 takes 1 second (1.01s)                                                                                   
rolling update 3 replicas x 2 shards takes 2 seconds (2.63s)

indexing 100000 takes 8 seconds (8.10s)                                                                                 
rolling update 3 replicas x 2 shards takes 3 minutes and 27 seconds (207.14s)
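
The rolling-update timings in the two listings above translate into the speedups computed below. The snippet only restates numbers already quoted; speedup is a helper name chosen for this example.

```python
# Rolling-update timings in seconds, copied from the listings above.
before = {1_000: 0.82, 10_000: 9.08, 100_000: 437.44}
after = {1_000: 0.81, 10_000: 2.63, 100_000: 207.14}


def speedup(old_seconds, new_seconds):
    """A ratio above 1 means the new code is faster."""
    return old_seconds / new_seconds


speedups = {n: round(speedup(before[n], after[n]), 2) for n in before}
for n, ratio in speedups.items():
    print(f"{n:>7} docs: {ratio:.2f}x faster rolling update")
```

The rolling update is roughly 3.45x faster at 10,000 Documents and 2.11x faster at 100,000, while it is essentially unchanged at 1,000. Indexing itself became slightly slower in the same runs (for example 7.59s to 8.10s at 100,000 Documents).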

MORE BENCHMARKING

indexing 500000 takes 30 seconds (30.07s)    
rolling update 3 replicas x 2 shards takes 26 minutes and 57 seconds (1617.99s)
search with 10 takes 0 seconds (0.21s)

opened by cristianmtr 3

Status endpoint does not allow for compositing data with other executors

If another executor also wants to report status information through the same status endpoint, the value returned by HNSWPostgresIndexer removes it.

Updating the existing tags in place, or placing the status under a dedicated key, would be friendlier.

https://github.com/jina-ai/executor-hnsw-postgres/blob/79754090665e8bb86e85ab5693fa9b8be80977ce/executor/hnswpsql.py#L322
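
The two options raised in this issue can be compared on plain dicts standing in for Document tags. This is a sketch of the idea only: the function names and the 'hnsw_postgres' key are hypothetical, and nothing here is taken from the linked source file.

```python
def report_status_overwrite(existing_tags, status):
    """Behaviour described in the issue: earlier tags are lost."""
    return dict(status)


def report_status_merged(existing_tags, status):
    """Option 1: update the existing tags, keeping unrelated keys."""
    merged = dict(existing_tags)
    merged.update(status)
    return merged


def report_status_namespaced(existing_tags, status, key="hnsw_postgres"):
    """Option 2: place the status under a dedicated key."""
    merged = dict(existing_tags)
    merged[key] = dict(status)
    return merged


upstream = {"encoder_model": "demo-model"}   # set by an earlier executor
status = {"psql_docs": 10, "hnsw_docs": 10}

overwritten = report_status_overwrite(upstream, status)
merged = report_status_merged(upstream, status)
namespaced = report_status_namespaced(upstream, status)
```

Merging keeps the flat layout but can still clash when two executors use the same key name; namespacing avoids clashes at the cost of one extra level of nesting for consumers.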

opened by louisconcentricsky 1

feat: background sync (with threads)

Closes https://github.com/jina-ai/internal-tasks/issues/293

Issues

[x] timestamp timezone difference

[x] psql connection pool gets exhausted

[x] locking resources in threaded access

NOTE: Even if we don't merge this, the refactoring of the PSQL Handler still needs to be merged, as the previous usage of the connection pool had issues.

opened by cristianmtr 1

fail to connect to PostgreSQL with docker-compose

Start a PostgreSQL service with Docker:

docker run -e POSTGRES_PASSWORD=123456 -p 127.0.0.1:5432:5432/tcp postgres:13.2

Build a Flow with one executor: HNSWPostgresIndexer.
Run the Flow locally; it works well.
Export the Flow to a docker-compose YAML file and run it with docker-compose; this produces an error.

Jina version info:


- jina 3.3.19
- docarray 0.12.2
- jina-proto 0.1.8
- jina-vcs-tag (unset)
- protobuf 3.20.0
- proto-backend cpp
- grpcio 1.43.0
- pyyaml 6.0
- python 3.10.2
- platform Linux
- platform-release 4.4.0-186-generic
- platform-version #216-Ubuntu SMP Wed Jul 1 05:34:05 UTC 2020
- architecture x86_64
- processor x86_64
- uid 48710637999860
- session-id 906abcd2-c797-11ec-b1df-2c4d544656f4
- uptime 2022-04-29T16:37:11.758133
- ci-vendor (unset)
* JINA_DEFAULT_HOST (unset)
* JINA_DEFAULT_TIMEOUT_CTRL (unset)
* JINA_DEFAULT_WORKSPACE_BASE /home/chenhao/.jina/executor-workspace
* JINA_DEPLOYMENT_NAME (unset)
* JINA_DISABLE_UVLOOP (unset)
* JINA_FULL_CLI (unset)
* JINA_GATEWAY_IMAGE (unset)
* JINA_GRPC_RECV_BYTES (unset)
* JINA_GRPC_SEND_BYTES (unset)
* JINA_HUBBLE_REGISTRY (unset)
* JINA_HUB_CACHE_DIR (unset)
* JINA_HUB_NO_IMAGE_REBUILD (unset)
* JINA_HUB_ROOT (unset)
* JINA_LOG_CONFIG (unset)
* JINA_LOG_LEVEL (unset)
* JINA_LOG_NO_COLOR (unset)
* JINA_MP_START_METHOD (unset)
* JINA_RANDOM_PORT_MAX (unset)
* JINA_RANDOM_PORT_MIN (unset)
* JINA_VCS_VERSION (unset)
* JINA_CHECK_VERSION True

opened by jerrychen1990 0

test: bug rolling update clear

If you remove the following from tests/integration/test_hnsw_psql.py (L:180):

if benchmark: f.post('/clear')

the test test_benchmark_basic fails when it runs the second case, even though clear is called at the beginning of the flow.

Why?

Yes, /clear only hits one replica, but when we restart the flow there should be completely new replicas anyway.

opened by cristianmtr 0

performance(HNSWPSQL): syncing is slow

Right now, syncing is slow:

[ ] We iterate and perform individual updates (these should be batched somehow, per sync operation type: index, update, delete)
[x] On a rebuild, the operations are always index. We should optimize for this. Done in #5
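
The batching idea from the first item can be sketched as follows. This is one possible approach, not what the project implemented: batch_by_operation is a hypothetical helper, and it assumes each Document id appears at most once per sync, since grouping by type would otherwise reorder operations on the same id.

```python
from collections import defaultdict


def batch_by_operation(changes, batch_size=2):
    """Group (operation, doc_id) pairs by operation and split them into batches."""
    grouped = defaultdict(list)
    for operation, doc_id in changes:
        grouped[operation].append(doc_id)

    batches = []
    for operation in ("index", "update", "delete"):
        ids = grouped.get(operation, [])
        # Slice the ids of one operation type into fixed-size chunks.
        for start in range(0, len(ids), batch_size):
            batches.append((operation, ids[start:start + batch_size]))
    return batches


changes = [
    ("index", "a"),
    ("update", "b"),
    ("index", "c"),
    ("delete", "d"),
    ("index", "e"),
]
batches = batch_by_operation(changes)
```

Each resulting batch could then be applied to the index in a single call instead of one call per Document; a realistic batch size would be far larger than the value used in this example.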

Numbers before any perf refactoring

Performance

indexing 1000 ...       indexing 1000 takes 0 seconds (0.22s)
rolling update 3 replicas x 2 shards ...            [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
rolling update 3 replicas x 2 shards takes 0 seconds (0.82s)
search with 10 ...      search with 10 takes 0 seconds (0.23s)

indexing 10000 ...      indexing 10000 takes 0 seconds (0.75s)
rolling update 3 replicas x 2 shards ...            [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
rolling update 3 replicas x 2 shards takes 9 seconds (9.08s)
search with 10 ...      search with 10 takes 0 seconds (0.22s)

indexing 100000 ...     indexing 100000 takes 7 seconds (7.59s)
rolling update 3 replicas x 2 shards ...            [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
    [I]:Using existing table
rolling update 3 replicas x 2 shards takes 7 minutes and 17 seconds (437.44s)
search with 10 ...      search with 10 takes 0 seconds (0.22s)

priority/important-soon type/maintenance

opened by cristianmtr 0

Releases(v0.9)

v0.9(Apr 12, 2022)

v0.8(Mar 8, 2022)
return match scores

use jina 3x base image

fix total_shards runtime args

v0.7(Feb 11, 2022)

Migration to jina3
v0.6(Jan 3, 2022)
What's Changed

docs: fix typo in delete endpoint and clarify by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/14

Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.5...v0.6
v0.5(Dec 14, 2021)
What's Changed

fix: type of trav paths by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/13

Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.4...v0.5
v0.4(Dec 9, 2021)
What's Changed

fix: allow using Executor in local mode by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/12

Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.3...v0.4
v0.3(Nov 26, 2021)
What's Changed

feat: background sync (with threads) by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/9

docs: add docs on bg sync by @cristianmtr in https://github.com/jina-ai/executor-hnsw-postgres/pull/11

Full Changelog: https://github.com/jina-ai/executor-hnsw-postgres/compare/v0.2...v0.3
v0.2(Nov 22, 2021)
performance improvements

adapting traversal_paths to new API, as per core

v0.1(Nov 18, 2021)
initial release


Owner

Jina AI

A Neural Search Company. We provide the cloud-native neural search solution powered by state-of-the-art AI technology.

GitHub Repository https://hub.jina.ai/executor/dvp0845a
