ANNchor is a python library which constructs approximate k-nearest neighbour graphs for slow metrics.

Overview

ANNchor

A python library implementing ANNchor:
k-nearest neighbour graph construction for slow metrics.

User Guide

For user guide and documentation, go to /doc/_build/index.html



What is ANNchor?

ANNchor is a python library which constructs approximate k-nearest neighbour graphs for slow metrics. The k-NN graph is an extremely useful data structure that appears in a wide variety of applications, for example: clustering, dimensionality reduction, visualisation and exploratory data analysis (EDA). However, if we want to use a slow metric, these k-NN graphs can take an exceptionally long time to compute. Typical slow metrics include the Wasserstein metric (Earth Mover's distance) applied to images, and Levenshtein (Edit) distance on long strings, where the time taken to compute these distances is significantly longer than a typical Euclidean distance.

ANNchor uses Machine Learning methods to infer true distances between points in a data set from a variety of features derived from anchor points (aka landmarks/waypoints). In practice, this means that ANNchor does not make as many calls to the underlying metric as other state of the art k-NN graph generation techniques. This translates to quicker run times, especially when the metric is slow.

Results from ANNchor can easily be combined with other popular libraries in the Data Science community. In the docs we give examples of how to use ANNchor in an EDA pipeline alongside UMAP and HDBSCAN.

Installation

Clone this repo and install with pip:

pip install -e annchor/

Basic Usage

import numpy as np
import annchor

X =          #your data, list/np.array of items
distance =   #your distance function, distance(X[i],X[j]) = d

ann = annchor.Annchor(X,
                      distance,
                      n_anchors=15,
                      n_neighbors=15,
                      p_work=0.1)
ann.fit()

print(ann.neighbor_graph)

Examples

We demonstrate ANNchor by example, using Levenshtein distance on a data set of long strings. This data set is bundled with the annchor package for convenience.

Firstly, we import some useful modules and load the data:

import os
import time
import numpy as np

from annchor import Annchor, compare_neighbor_graphs
from annchor.datasets import load_strings

strings_data = load_strings()
X = strings_data['X']
y = strings_data['y']
neighbor_graph = strings_data['neighbor_graph']

nx = X.shape[0]

for x in X[::100]:
    print(x[:50]+'...')
cuiojvfnseoksugfcbwzrcoxtjxrvojrguqttjpeauenefmkmv...
uiofnsosungdgrxiiprvojrgujfdttjioqunknefamhlkyihvx...
cxumzfltweskptzwnlgojkdxidrebonxcmxvbgxayoachwfcsy...
cmjpuuozflodwqvkascdyeosakdupdoeovnbgxpajotahpwaqc...
vzdiefjmblnumdjeetvbvhwgyasygrzhuckvpclnmtviobpzvy...
nziejmbmknuxdhjbgeyvwgasygrhcpdxcgnmtviubjvyzjemll...
yhdpczcjxirmebhfdueskkjjtbclvncxjrstxhqvtoyamaiyyb...
yfhwczcxakdtenvbfctugnkkkjbcvxcxjwfrgcstahaxyiooeb...
yoftbrcmmpngdfzrbyltahrfbtyowpdjrnqlnxncutdovbgabo...
tyoqbywjhdwzoufzrqyltahrefbdzyunpdypdynrmchutdvsbl...
dopgwqjiehqqhmprvhqmnlbpuwszjkjjbshqofaqeoejtcegjt...
rahobdixljmjfysmegdwyzyezulajkzloaxqnipgxhhbyoztzn...
dfgxsltkbpxvgqptghjnkaoofbwqqdnqlbbzjsqubtfwovkbsk...
pjwamicvegedmfetridbijgafupsgieffcwnmgmptjwnmwegvn...
ovitcihpokhyldkuvgahnqnmixsakzbmsipqympnxtucivgqyi...
xvepnposhktvmutozuhkbqarqsbxjrhxuumofmtyaaeesbeuhf...

We see a data set consisting of long strings. A closer inspection may indicate some structure, but it is not obvious at this stage.

We use ANNchor to find the 25-nearest neighbour graph. Levenshtein distance is included in Annchor, and can be called by using the string 'levenshtein' (we could also define the levenshtein function beforehand and pass that to Annchor instead). We will specify that we want to do no more than 12% of the brute force work (since the data set is size 1600, brute force would be 1600x1599/2=1279200 calls to the metric, so we will make around ~153500 to the metric). To get accurate timing information, bear in mind that the first run will be slower than future runs due to the numba.jit compile time.

start_time = time.time()
ann = Annchor(X, 'levenshtein', n_neighbors=25, p_work=0.12)

ann.fit()
print('ANNchor Time: %5.3f seconds' % (time.time()-start_time))


# Test accuracy
error = compare_neighbor_graphs(neighbor_graph,
                                ann.neighbor_graph,
                                k)
print('ANNchor Accuracy: %d incorrect NN pairs (%5.3f%%)' % (error,100*error/(k*nx)))
ANNchor Time: 34.299 seconds
ANNchor Accuracy: 0 incorrect NN pairs (0.000%)

Not bad!

We can continue to use ANNchor in a typical EDA pipeline. Let's find the UMAP projection of our data set:

from umap import UMAP
from matplotlib import pyplot as plt

# Extract the distance matrix
D = ann.to_sparse_matrix()

U = UMAP(metric='precomputed',n_neighbors=k-1)
T = U.fit_transform(D)
# T now holds the 2d UMAP projection of our data

# View the 2D projection with matplotlib
fig,ax = plt.subplots(figsize=(7,7))
ax.scatter(*T.T,alpha=0.1)
plt.show()

Finally the structure of the data set is clear to us! There are 8 clusters of two distinct varieties: filaments and clouds.

More examples can be found in the Examples subfolder. Extra python packages will be required to run the examples. These packages can be installed via:

pip install -r annchor/Examples/requirements.txt
Owner
GCHQ
GCHQ
nn-Meter is a novel and efficient system to accurately predict the inference latency of DNN models on diverse edge devices

A DNN inference latency prediction toolkit for accurately modeling and predicting the latency on diverse edge devices.

Microsoft 241 Dec 26, 2022
Uber Open Source 1.6k Dec 31, 2022
Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by

Robustness Gym 115 Dec 12, 2022
Dieses Projekt ermöglicht es den Smartmeter der EVN (Netz Niederösterreich) über die Kundenschnittstelle auszulesen.

SmartMeterEVN Dieses Projekt ermöglicht es den Smartmeter der EVN (Netz Niederösterreich) über die Kundenschnittstelle auszulesen. Smart Meter werden

greenMike 43 Dec 04, 2022
Traingenerator 🧙 A web app to generate template code for machine learning ✨

Traingenerator 🧙 A web app to generate template code for machine learning ✨ 🎉 Traingenerator is now live! 🎉

Johannes Rieke 1.2k Jan 07, 2023
Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Dask, Flink and DataFlow

eXtreme Gradient Boosting Community | Documentation | Resources | Contributors | Release Notes XGBoost is an optimized distributed gradient boosting l

Distributed (Deep) Machine Learning Community 23.6k Jan 03, 2023
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

imbalanced-learn imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-cla

6.2k Jan 01, 2023
PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.

PyNNDescent PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors. It provides a python implementation of Nearest Neighbo

Leland McInnes 699 Jan 09, 2023
This handbook accompanies the course: Machine Learning with Hung-Yi Lee

This handbook accompanies the course: Machine Learning with Hung-Yi Lee

RenChu Wang 472 Dec 31, 2022
This is an auto-ML tool specialized in detecting of outliers

Auto-ML tool specialized in detecting of outliers Description This tool will allows you, with a Dash visualization, to compare 10 models of machine le

1 Nov 03, 2021
YouTube Spam Detection with python

YouTube Spam Detection This code deletes spam comment on youtube videos based on two characteristics (currently) If the author of the comment has a se

MohamadReza Taalebi 5 Sep 27, 2022
Transform ML models into a native code with zero dependencies

m2cgen (Model 2 Code Generator) - is a lightweight library which provides an easy way to transpile trained statistical models into a native code

Bayes' Witnesses 2.3k Jan 03, 2023
Tribuo - A Java machine learning library

Tribuo - A Java prediction library (v4.1) Tribuo is a machine learning library in Java that provides multi-class classification, regression, clusterin

Oracle 1.1k Dec 28, 2022
Software Engineer Salary Prediction

Based on 2021 stack overflow data, this machine learning web application helps one predict the salary based on years of experience, level of education and the country they work in.

Jhanvi Mimani 1 Jan 08, 2022
My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data

kNN-vs-RFR My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data In many areas, rental bikes have been launched to

1 Oct 28, 2021
Nevergrad - A gradient-free optimization platform

Nevergrad - A gradient-free optimization platform nevergrad is a Python 3.6+ library. It can be installed with: pip install nevergrad More installati

Meta Research 3.4k Jan 08, 2023
虚拟货币(BTC、ETH)炒币量化系统项目。在一版本的基础上加入了趋势判断

🎉 第二版本 🎉 (现货趋势网格) 介绍 在第一版本的基础上 趋势判断,不在固定点位开单,选择更优的开仓点位 优势: 🎉 简单易上手 安全(不用将api_secret告诉他人) 如何启动 修改app目录下的authorization文件

幸福村的码农 250 Jan 07, 2023
A simple python program which predicts the success of a movie based on it's type, actor, actress and director

Movie-Success-Prediction A simple python program which predicts the success of a movie based on it's type, actor, actress and director. The program us

Mahalinga Prasad R N 1 Dec 17, 2021
Simple data balancing baselines for worst-group-accuracy benchmarks.

BalancingGroups Code to replicate the experimental results from Simple data balancing baselines achieve competitive worst-group-accuracy. Replicating

Facebook Research 29 Dec 02, 2022
Stock Price Prediction Bank Jago Using Facebook Prophet Machine Learning & Python

Stock Price Prediction Bank Jago Using Facebook Prophet Machine Learning & Python Overview Bank Jago has attracted investors' attention since the end

Najibulloh Asror 3 Feb 10, 2022