An Indexer that works out-of-the-box when you have less than 100K stored Documents

Last update: Mar 15, 2022

Overview

U100KIndexer

An Indexer that works out of the box when you have fewer than 100K stored Documents. U100K stands for "under 100K". At 100K stored Documents with 768-dim embeddings, you can expect about 300 ms for a single query, or 20~120 QPS for batch queries. Results are full Documents.

U100KIndexer uses jina.DocumentArrayMemmap as the storage backend and .match() to conduct nearest-neighbour search. It returns the full Documents as-is, so there is no need to pair it with a separate key-value indexer to retrieve Documents.
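To make "exhaustive search" concrete, here is a minimal sketch in plain Python. It is not the Jina implementation and does not call .match(). The function names, the cosine metric, and the (id, embedding) pairs standing in for full Documents are all assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity); assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def exhaustive_search(query, stored, limit=2):
    """Score the query against every stored item and keep the `limit` closest.

    `stored` is a list of (doc_id, embedding) pairs, a hypothetical stand-in
    for full Documents.
    """
    scored = [(cosine_distance(query, emb), doc_id) for doc_id, emb in stored]
    scored.sort()  # smallest distance first
    return [(doc_id, dist) for dist, doc_id in scored[:limit]]


stored = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 1.0]),
]
results = exhaustive_search([1.0, 0.1], stored, limit=2)
print(results)
```

Because every stored item is scored for every query, no candidate is ever skipped, which is why recall is the highest possible. It is also why query time grows with the number of stored Documents.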

Pros & cons

Pros

Exhaustive search: highest recall
Fast indexing
Acceptable query performance under 100K
Always returns full Documents
No extra dependencies

Cons

Slow query time beyond 100K: search is exhaustive, so query time grows with the number of stored Documents

Performance

The indexing and query performance on 768-dim embeddings is as follows (all times in seconds):

Stored data	Indexing time	Query size=1	Query size=8	Query size=64
10000	0.256	0.019	0.029	0.086
50000	1.156	0.147	0.177	0.314
100000	2.329	0.297	0.332	0.536
200000	4.704	0.656	0.744	1.050
400000	11.105	1.289	1.536	2.793

The benchmark script can be found in benchmark.py.
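The QPS range quoted in the overview can be derived from the table by dividing the query size by the measured time. The sketch below does that arithmetic for the 100,000-Document row. The helper name is hypothetical, and the timings are copied from the table rather than re-measured.

```python
# Query times in seconds at 100,000 stored Documents, copied from the table above.
query_seconds = {1: 0.297, 8: 0.332, 64: 0.536}


def queries_per_second(batch_size, seconds):
    """Throughput when `batch_size` queries complete in `seconds`."""
    return batch_size / seconds


qps = {size: queries_per_second(size, secs) for size, secs in query_seconds.items()}
for size, value in qps.items():
    print(f"query size={size}: {value:.1f} QPS")
```

Query sizes 8 and 64 work out to roughly 24 and 119 QPS, in line with the 20~120 QPS figure for batch queries. A single query at about 0.3 s corresponds to roughly 3.4 QPS.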

Tips

To change the workspace:

U100KIndexer(metas={'workspace': './my'})

Or pass .add(..., uses_metas={'workspace': './my'}) when you use it in a Flow.

Owner

Jina AI
