Overview

ANNchor

A python library implementing ANNchor:
k-nearest neighbour graph construction for slow metrics.

User Guide

For the user guide and documentation, see /doc/_build/index.html



What is ANNchor?

ANNchor is a python library which constructs approximate k-nearest neighbour graphs for slow metrics. The k-NN graph is an extremely useful data structure that appears in a wide variety of applications, for example: clustering, dimensionality reduction, visualisation and exploratory data analysis (EDA). However, if we want to use a slow metric, these k-NN graphs can take an exceptionally long time to compute. Typical slow metrics include the Wasserstein metric (Earth Mover's distance) applied to images, and Levenshtein (Edit) distance on long strings, where the time taken to compute these distances is significantly longer than a typical Euclidean distance.

ANNchor uses Machine Learning methods to infer true distances between points in a data set from a variety of features derived from anchor points (aka landmarks/waypoints). In practice, this means that ANNchor makes fewer calls to the underlying metric than other state-of-the-art k-NN graph generation techniques, which translates to quicker run times, especially when the metric is slow.
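
To give a flavour of how anchor points help, consider the triangle inequality: distances from every point to a handful of anchors give cheap lower and upper bounds on any pairwise distance. The sketch below is purely illustrative of that idea (it is not ANNchor's internal implementation, and the function anchor_bounds is invented for this example):

import numpy as np

def anchor_bounds(X, metric, n_anchors=5, seed=0):
    # Distances from a few random anchors to every item
    # (n_anchors * len(X) calls to the slow metric in total)
    rng = np.random.default_rng(seed)
    anchors = rng.choice(len(X), size=n_anchors, replace=False)
    D = np.array([[metric(X[a], x) for x in X] for a in anchors])

    # Triangle inequality: for each anchor a,
    #   |d(a,i) - d(a,j)| <= d(i,j) <= d(a,i) + d(a,j)
    lower = np.max(np.abs(D[:, :, None] - D[:, None, :]), axis=0)
    upper = np.min(D[:, :, None] + D[:, None, :], axis=0)
    return lower, upper

Pairs whose lower bound is already larger than a point's current k-th nearest candidate can be discarded without ever evaluating the slow metric on them; ANNchor goes further, using anchor-derived features like these to predict the remaining distances.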

Results from ANNchor can easily be combined with other popular libraries in the Data Science community. In the docs we give examples of how to use ANNchor in an EDA pipeline alongside UMAP and HDBSCAN.

Installation

Clone this repo and install with pip:

pip install -e annchor/

Basic Usage

import numpy as np
import annchor

X = ...        # your data, list/np.array of items
distance = ... # your distance function, distance(X[i],X[j]) = d

ann = annchor.Annchor(X,
                      distance,
                      n_anchors=15,
                      n_neighbors=15,
                      p_work=0.1)
ann.fit()

print(ann.neighbor_graph)
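
The snippet above is a template. As a self-contained (and entirely artificial) illustration, the following fills in the placeholders with toy data and a deliberately naive metric; the data set and distance function here are invented purely for this example, and whether a plain python callable needs to be numba-compatible depends on your Annchor options (see the docs):

import numpy as np
import annchor

# Toy data: 500 random integer sequences (illustrative only)
rng = np.random.default_rng(42)
X = rng.integers(0, 10, size=(500, 100))

def distance(x, y):
    # A deliberately slow pure-python L1 distance, standing in
    # for a genuinely expensive metric such as Levenshtein
    return float(sum(abs(int(a) - int(b)) for a, b in zip(x, y)))

ann = annchor.Annchor(X,
                      distance,
                      n_anchors=15,
                      n_neighbors=15,
                      p_work=0.1)
ann.fit()

print(ann.neighbor_graph)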

Examples

We demonstrate ANNchor by example, using Levenshtein distance on a data set of long strings. This data set is bundled with the annchor package for convenience.

Firstly, we import some useful modules and load the data:

import os
import time
import numpy as np

from annchor import Annchor, compare_neighbor_graphs
from annchor.datasets import load_strings

strings_data = load_strings()
X = strings_data['X']
y = strings_data['y']
neighbor_graph = strings_data['neighbor_graph']

nx = X.shape[0]

for x in X[::100]:
    print(x[:50]+'...')
cuiojvfnseoksugfcbwzrcoxtjxrvojrguqttjpeauenefmkmv...
uiofnsosungdgrxiiprvojrgujfdttjioqunknefamhlkyihvx...
cxumzfltweskptzwnlgojkdxidrebonxcmxvbgxayoachwfcsy...
cmjpuuozflodwqvkascdyeosakdupdoeovnbgxpajotahpwaqc...
vzdiefjmblnumdjeetvbvhwgyasygrzhuckvpclnmtviobpzvy...
nziejmbmknuxdhjbgeyvwgasygrhcpdxcgnmtviubjvyzjemll...
yhdpczcjxirmebhfdueskkjjtbclvncxjrstxhqvtoyamaiyyb...
yfhwczcxakdtenvbfctugnkkkjbcvxcxjwfrgcstahaxyiooeb...
yoftbrcmmpngdfzrbyltahrfbtyowpdjrnqlnxncutdovbgabo...
tyoqbywjhdwzoufzrqyltahrefbdzyunpdypdynrmchutdvsbl...
dopgwqjiehqqhmprvhqmnlbpuwszjkjjbshqofaqeoejtcegjt...
rahobdixljmjfysmegdwyzyezulajkzloaxqnipgxhhbyoztzn...
dfgxsltkbpxvgqptghjnkaoofbwqqdnqlbbzjsqubtfwovkbsk...
pjwamicvegedmfetridbijgafupsgieffcwnmgmptjwnmwegvn...
ovitcihpokhyldkuvgahnqnmixsakzbmsipqympnxtucivgqyi...
xvepnposhktvmutozuhkbqarqsbxjrhxuumofmtyaaeesbeuhf...

We see a data set consisting of long strings. A closer inspection may indicate some structure, but it is not obvious at this stage.

We use ANNchor to find the 25-nearest neighbour graph. Levenshtein distance is included in Annchor, and can be selected by passing the string 'levenshtein' (we could also define a levenshtein function ourselves and pass that to Annchor instead). We specify that we want to do no more than 12% of the brute force work: since the data set has 1600 items, brute force would require 1600*1599/2 = 1,279,200 calls to the metric, so we will make roughly 153,500 calls. When timing, bear in mind that the first run will be slower than subsequent runs due to the numba.jit compile time.
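
The work budget is easy to verify (a two-line check, not part of the original example):

n = 1600                        # number of strings in the data set
brute_force = n * (n - 1) // 2  # = 1,279,200 pairwise distances
budget = 0.12 * brute_force     # p_work = 0.12  ->  ~153,500 metric calls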

k = 25  # number of nearest neighbours

start_time = time.time()
ann = Annchor(X, 'levenshtein', n_neighbors=k, p_work=0.12)

ann.fit()
print('ANNchor Time: %5.3f seconds' % (time.time()-start_time))


# Test accuracy
error = compare_neighbor_graphs(neighbor_graph,
                                ann.neighbor_graph,
                                k)
print('ANNchor Accuracy: %d incorrect NN pairs (%5.3f%%)' % (error,100*error/(k*nx)))
ANNchor Time: 34.299 seconds
ANNchor Accuracy: 0 incorrect NN pairs (0.000%)

Not bad!

We can continue to use ANNchor in a typical EDA pipeline. Let's find the UMAP projection of our data set:

from umap import UMAP
from matplotlib import pyplot as plt

# Extract the distance matrix
D = ann.to_sparse_matrix()

U = UMAP(metric='precomputed',n_neighbors=k-1)
T = U.fit_transform(D)
# T now holds the 2d UMAP projection of our data

# View the 2D projection with matplotlib
fig,ax = plt.subplots(figsize=(7,7))
ax.scatter(*T.T,alpha=0.1)
plt.show()

Finally, the structure of the data set is clear to us! There are 8 clusters of two distinct varieties: filaments and clouds.
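
The docs also pair ANNchor with HDBSCAN. As a rough sketch (clustering the 2D UMAP embedding rather than the precomputed distances, with an arbitrarily chosen min_cluster_size), we could label those clusters like so:

import hdbscan

# Cluster the 2D UMAP embedding (a convenient, if lossy, shortcut)
clusterer = hdbscan.HDBSCAN(min_cluster_size=20)
labels = clusterer.fit_predict(T)

# Colour the projection by cluster label
fig, ax = plt.subplots(figsize=(7, 7))
ax.scatter(*T.T, c=labels, alpha=0.3, cmap='tab10')
plt.show()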

More examples can be found in the Examples subfolder. Running them requires some extra python packages, which can be installed via:

pip install -r annchor/Examples/requirements.txt