A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

Last update: Nov 03, 2022

Overview

Install

Linux wheels available (python >=3.6) on pypi:

pip install graphgrove

Building from source:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make

To build your own wheel:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make
pip install build
python -m build --wheel
# which can be used as:
# pip install --force dist/graphgrove-0.0.1-cp37-cp37m-linux_x86_64.whl

Examples

Toy examples of clustering, DAG-structured clustering, and nearest neighbor search are available.

At a high level, incremental clustering can be done as:

import graphgrove as gg
k = 5
num_rounds = 50
thresholds = np.geomspace(1.0, 0.001, num_rounds).astype(np.float32)
scc = gg.vec_scc.Cosine_SCC(k=k, num_rounds=num_rounds, thresholds=thresholds, index_name='cosine_sgtree', cores=cores, verbosity=0)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    scc.partial_fit(batch)

Incremental nearest neighbor search can be done as:

import graphgrove as gg
k=5
cores=4
tree = gg.graph_builder.Cosine_SGTree(k=k, cores=cores)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    tree.insert(batch) # or tree.insert_and_knn(batch)

Algorithms Implemented

Clustering:

Sub-Cluster Component Algorithm (SCC) and its minibatch variant from the paper: Scalable Hierarchical Agglomerative Clustering. Nicholas, Monath, Kumar Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork Mert Terzihan Bryon Tjanaka Yuan Wang Yuchen Wu. KDD. 2021
DAG Structured clustering (LLama) from DAG-Structured Clustering by Nearest Neighbors. Nicholas Monath, Manzil Zaheer, Kumar Avinava Dubey, Amr Ahmed, Andrew McCallum. AISTATS 2021.

Nearest Neighbor Search:

CoverTree: Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." ICML. 2006.
SGTree: SG-Tree is a new data structure for exact nearest neighbor search inspired from Cover Tree and its improvement, which has been used in the TerraPattern project. At a high level, SG-Tree tries to create a hierarchical tree where each node performs a "coarse" clustering. The centers of these "clusters" become the children and subsequent insertions are recursively performed on these children. When performing the NN query, we prune out solutions based on a subset of the dimensions that are being queried. This is particularly useful when trying to find the nearest neighbor in highly clustered subset of the data, e.g. when the data comes from a recursive mixture of Gaussians or more generally time marginalized coalscent process . The effect of these two optimizations is that our data structure is extremely simple, highly parallelizable and is comparable in performance to existing NN implementations on many data-sets. Manzil Zaheer, Guru Guruganesh, Golan Levin, Alexander Smola. TerraPattern: A Nearest Neighbor Search Service. 2019.

A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

Related tags

Overview

Install

Examples

Algorithms Implemented

Owner

Nicholas Monath

Visualize classified time series data with interactive Sankey plots in Google Earth Engine

Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.

A Python toolbox to churn out organic alkalinity calculations with minimal brain engagement.

Python implementation of the rulefit algorithm

Iris species predictor app is used to classify iris species created using python's scikit-learn, fastapi, numpy and joblib packages.

Solve automatic numerical differentiation problems in one or more variables.

Microsoft Machine Learning for Apache Spark

SageMaker Python SDK is an open source library for training and deploying machine learning models on Amazon SageMaker.

Machine-care - A simple python script to take care of simple maintenance tasks

JMP is a Mixed Precision library for JAX.

Dive into Machine Learning

The Fuzzy Labs guide to the universe of open source MLOps

Machine Learning toolbox for Humans

GAM timeseries modeling with auto-changepoint detection. Inspired by Facebook Prophet and implemented in PyMC3

A machine learning toolkit dedicated to time-series data

Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

Machine learning that just works, for effortless production applications

Unofficial pytorch implementation of the paper "Context Reasoning Attention Network for Image Super-Resolution (ICCV 2021)"

As we all know the BGMI Loot Crate comes with so many resources for the gamers, this ML Crate will be the hub of various ML projects which will be the resources for the ML enthusiasts! Open Source Program: SWOC 2021 and JWOC 2022.

MLflow App Using React, Hooks, RabbitMQ, FastAPI Server, Celery, Microservices