An Indexer that works out-of-the-box when you have less than 100K stored Documents

Last update: Mar 15, 2022

Overview

U100KIndexer

An Indexer that works out of the box when you have fewer than 100K stored Documents. U100K stands for "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect about 300 ms for a single query, or 20 to 120 QPS for batch queries. Results are full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as the storage backend and .match() to perform nearest-neighbour search. It returns the full Documents as-is, so there is no need to chain it with a separate key-value indexer to retrieve Documents.
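To make the idea of exhaustive nearest-neighbour search concrete, here is a minimal pure-Python sketch. It is not the Executor's implementation, which relies on .match(). The function names, the plain dict used as a store, and the choice of cosine distance are all assumptions made for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; smaller means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_search(
    query: Sequence[float],
    stored: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Compare the query against every stored embedding and keep the top_k closest."""
    scored = ((cosine_distance(query, emb), doc_id) for doc_id, emb in stored.items())
    return [(doc_id, dist) for dist, doc_id in heapq.nsmallest(top_k, scored)]


# Hypothetical toy store: Document id -> embedding
stored = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = exhaustive_search([1.0, 0.1], stored, top_k=2)
print(results)
```

Because every stored embedding is scored, no true neighbour can be missed, which is where the high recall comes from. The cost is that work per query grows with the number of stored Documents.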

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K
Always returns full Documents
No extra dependencies

Cons

Slow query time beyond 100K Documents, since exhaustive search compares each query against every stored Document

Performance

Indexing and query times on 768-dim embeddings are as follows (all times in seconds):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
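The QPS range quoted in the overview can be derived from the table above by dividing the query batch size by the measured time. The short sketch below does that arithmetic; the timings are copied from the table, while the helper name is hypothetical.

```python
# Timings in seconds, copied from the table: stored Documents -> {query size: seconds}
TIMINGS = {
    10_000: {1: 0.019, 8: 0.029, 64: 0.086},
    50_000: {1: 0.147, 8: 0.177, 64: 0.314},
    100_000: {1: 0.297, 8: 0.332, 64: 0.536},
    200_000: {1: 0.656, 8: 0.744, 64: 1.050},
    400_000: {1: 1.289, 8: 1.536, 64: 2.793},
}


def queries_per_second(batch_size: int, seconds: float) -> float:
    """Throughput for one batch: number of queries divided by elapsed time."""
    return batch_size / seconds


for stored_docs, row in TIMINGS.items():
    qps = {size: round(queries_per_second(size, t), 1) for size, t in row.items()}
    print(stored_docs, qps)
```

At 100K stored Documents this gives roughly 24 QPS for a batch of 8 and roughly 119 QPS for a batch of 64, which matches the 20 to 120 QPS range stated in the overview.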

Tips

To change the workspace, pass metas when constructing the Executor:

U100KIndexer(metas={'workspace': './my'})

Or pass .add(..., uses_metas={'workspace': './my'}) when you use it in a Flow.

Owner

Jina AI
