A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

Overview

Install

Linux wheels available (python >=3.6) on pypi:

pip install graphgrove

Building from source:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make

To build your own wheel:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make
pip install build
python -m build --wheel
# which can be used as:
# pip install --force dist/graphgrove-0.0.1-cp37-cp37m-linux_x86_64.whl 

Examples

Toy examples of clustering, DAG-structured clustering, and nearest neighbor search are available.

At a high level, incremental clustering can be done as:

import graphgrove as gg
k = 5
num_rounds = 50
thresholds = np.geomspace(1.0, 0.001, num_rounds).astype(np.float32)
scc = gg.vec_scc.Cosine_SCC(k=k, num_rounds=num_rounds, thresholds=thresholds, index_name='cosine_sgtree', cores=cores, verbosity=0)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    scc.partial_fit(batch)

Incremental nearest neighbor search can be done as:

import graphgrove as gg
k=5
cores=4
tree = gg.graph_builder.Cosine_SGTree(k=k, cores=cores)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    tree.insert(batch) # or tree.insert_and_knn(batch) 

Algorithms Implemented

Clustering:

  • Sub-Cluster Component Algorithm (SCC) and its minibatch variant from the paper: Scalable Hierarchical Agglomerative Clustering. Nicholas, Monath, Kumar Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork Mert Terzihan Bryon Tjanaka Yuan Wang Yuchen Wu. KDD. 2021
  • DAG Structured clustering (LLama) from DAG-Structured Clustering by Nearest Neighbors. Nicholas Monath, Manzil Zaheer, Kumar Avinava Dubey, Amr Ahmed, Andrew McCallum. AISTATS 2021.

Nearest Neighbor Search:

  • CoverTree: Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." ICML. 2006.
  • SGTree: SG-Tree is a new data structure for exact nearest neighbor search inspired from Cover Tree and its improvement, which has been used in the TerraPattern project. At a high level, SG-Tree tries to create a hierarchical tree where each node performs a "coarse" clustering. The centers of these "clusters" become the children and subsequent insertions are recursively performed on these children. When performing the NN query, we prune out solutions based on a subset of the dimensions that are being queried. This is particularly useful when trying to find the nearest neighbor in highly clustered subset of the data, e.g. when the data comes from a recursive mixture of Gaussians or more generally time marginalized coalscent process . The effect of these two optimizations is that our data structure is extremely simple, highly parallelizable and is comparable in performance to existing NN implementations on many data-sets. Manzil Zaheer, Guru Guruganesh, Golan Levin, Alexander Smola. TerraPattern: A Nearest Neighbor Search Service. 2019.
Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

Spark Python Notebooks This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, fro

Jose A Dianes 1.5k Jan 02, 2023
AI and Machine Learning with Kubeflow, Amazon EKS, and SageMaker

Data Science on AWS - O'Reilly Book Get the book on Amazon.com Book Outline Quick Start Workshop (4-hours) In this quick start hands-on workshop, you

Data Science on AWS 2.8k Jan 03, 2023
Laporan Proyek Machine Learning - Azhar Rizki Zulma

Laporan Proyek Machine Learning - Azhar Rizki Zulma Project Overview Domain proyek yang dipilih dalam proyek machine learning ini adalah mengenai hibu

Azhar Rizki Zulma 6 Mar 12, 2022
Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Kaggle Tweet Sentiment Extraction Competition: 1st place solution (Dark of the Moon team)

Artsem Zhyvalkouski 64 Nov 30, 2022
AP1 Transcription Factor Binding Site Prediction

A machine learning project that predicted binding sites of AP1 transcription factor, using ChIP-Seq data and local DNA shape information.

1 Jan 21, 2022
High performance Python GLMs with all the features!

High performance Python GLMs with all the features!

QuantCo 200 Dec 14, 2022
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 03, 2023
Machine-learning-dell - Repositório com as atividades desenvolvidas no curso de Machine Learning

📚 Descrição Neste curso da Dell aprofundamos nossos conhecimentos em Machine Learning. 🖥️ Aulas (Em curso) 1.1 - Python aplicado a Data Science 1.2

Claudia dos Anjos 1 Jan 05, 2022
machine learning model deployment project of Iris classification model in a minimal UI using flask web framework and deployed it in Azure cloud using Azure app service

This is a machine learning model deployment project of Iris classification model in a minimal UI using flask web framework and deployed it in Azure cloud using Azure app service. We initially made th

Krishna Priyatham Potluri 73 Dec 01, 2022
This repo includes some graph-based CTR prediction models and other representative baselines.

Graph-based CTR prediction This is a repository designed for graph-based CTR prediction methods, it includes our graph-based CTR prediction methods: F

Big Data and Multi-modal Computing Group, CRIPAC 47 Dec 30, 2022
NCVX (NonConVeX): A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning.

NCVX (NonConVeX): A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning.

SUN Group @ UMN 28 Aug 03, 2022
A handy tool for common machine learning models' hyper-parameter tuning.

Common machine learning models' hyperparameter tuning This repo is for a collection of hyper-parameter tuning for "common" machine learning models, in

Kevin Hu 2 Jan 27, 2022
Multiple Linear Regression using the LinearRegression class from sklearn.linear_model library

Multiple-Linear-Regression-master - A python program to implement Multiple Linear Regression using the LinearRegression class from sklearn.linear model library

Kushal Shingote 1 Feb 06, 2022
In this Repo a simple Sklearn Model will be trained and pushed to MLFlow

SKlearn_to_MLFLow In this Repo a simple Sklearn Model will be trained and pushed to MLFlow Install This Repo is based on poetry python3 -m venv .venv

1 Dec 13, 2021
李航《统计学习方法》复现

本项目复现李航《统计学习方法》每一章节的算法 特点: 笔记摘要:在每个文件开头都会有一些核心的摘要 pythonic:这里会用尽可能规范的方式来实现,包括编程风格几乎严格按照PEP8 循序渐进:前期的算法会更list的方式来做计算,可读性比较强,后期几乎完全为numpy.array的计算,并且辅助详

58 Oct 22, 2021
Solve automatic numerical differentiation problems in one or more variables.

numdifftools The numdifftools library is a suite of tools written in _Python to solve automatic numerical differentiation problems in one or more vari

Per A. Brodtkorb 181 Dec 16, 2022
Diabetes Prediction with Logistic Regression

Diabetes Prediction with Logistic Regression Exploratory Data Analysis Data Preprocessing Model & Prediction Model Evaluation Model Validation: Holdou

AZİZE SULTAN PALALI 2 Oct 23, 2021
A collection of Machine Learning Models To Web Api which are built on open source technologies/frameworks like Django, Flask.

Author Ibrahim Koné From-Machine-Learning-Models-To-WebAPI A collection of Machine Learning Models To Web Api which are built on open source technolog

Ibrahim Koné 2 May 24, 2022
Esse é o meu primeiro repo tratando de fim a fim, uma pipeline de dados abertos do governo brasileiro relacionado a compras de contrato e cronogramas anuais com spark, em pyspark e SQL!

Olá! Esse é o meu primeiro repo tratando de fim a fim, uma pipeline de dados abertos do governo brasileiro relacionado a compras de contrato e cronogr

Henrique de Paula 10 Apr 04, 2022
The Ultimate FREE Machine Learning Study Plan

The Ultimate FREE Machine Learning Study Plan

Patrick Loeber (Python Engineer) 2.5k Jan 05, 2023