A high performance implementation of HDBSCAN clustering. http://hdbscan.readthedocs.io/en/latest/

Overview
PyPI Version Conda-forge Version Conda-forge downloads License Travis Build Status Test Coverage Docs JOSS article

HDBSCAN

Now a part of scikit-learn-contrib

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select.

HDBSCAN is ideal for exploratory data analysis; it's a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).

Based on the paper:
R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

Documentation, including tutorials, are available on ReadTheDocs at http://hdbscan.readthedocs.io/en/latest/ .

Notebooks comparing HDBSCAN to other clustering algorithms, explaining how HDBSCAN works and comparing performance with other python clustering implementations are available.

How to use HDBSCAN

The hdbscan package inherits from sklearn classes, and thus drops in neatly next to other sklearn clusterers with an identical calling API. Similarly it supports input in a variety of formats: an array (or pandas dataframe, or sparse matrix) of shape (num_samples x num_features); an array (or sparse matrix) giving a distance matrix between samples.

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)

Performance

Significant effort has been put into making the hdbscan implementation as fast as possible. It is orders of magnitude faster than the reference implementation in Java, and is currently faster than highly optimized single linkage implementations in C and C++. version 0.7 performance can be seen in this notebook . In particular performance on low dimensional data is better than sklearn's DBSCAN , and via support for caching with joblib, re-clustering with different parameters can be almost free.

Additional functionality

The hdbscan package comes equipped with visualization tools to help you understand your clustering results. After fitting data the clusterer object has attributes for:

  • The condensed cluster hierarchy
  • The robust single linkage cluster hierarchy
  • The reachability distance minimal spanning tree

All of which come equipped with methods for plotting and converting to Pandas or NetworkX for further analysis. See the notebook on how HDBSCAN works for examples and further details.

The clusterer objects also have an attribute providing cluster membership strengths, resulting in optional soft clustering (and no further compute expense). Finally each cluster also receives a persistence score giving the stability of the cluster over the range of distance scales present in the data. This provides a measure of the relative strength of clusters.

Outlier Detection

The HDBSCAN clusterer objects also support the GLOSH outlier detection algorithm. After fitting the clusterer to data the outlier scores can be accessed via the outlier_scores_ attribute. The result is a vector of score values, one for each data point that was fit. Higher scores represent more outlier like objects. Selecting outliers via upper quantiles is often a good approach.

Based on the paper:
R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection, ACM Trans. on Knowledge Discovery from Data, Vol 10, 1 (July 2015), 1-51.

Robust single linkage

The hdbscan package also provides support for the robust single linkage clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN implementation this is a high performance version of the algorithm outperforming scipy's standard single linkage implementation. The robust single linkage hierarchy is available as an attribute of the robust single linkage clusterer, again with the ability to plot or export the hierarchy, and to extract flat clusterings at a given cut level and gamma value.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data = make_blobs(1000)

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict(data)
hierarchy = clusterer.cluster_hierarchy_
alt_labels = hierarchy.get_clusters(0.100, 5)
hierarchy.plot()
Based on the paper:
K. Chaudhuri and S. Dasgupta. "Rates of convergence for the cluster tree." In Advances in Neural Information Processing Systems, 2010.

Installing

Easiest install, if you have Anaconda (thanks to conda-forge which is awesome!):

conda install -c conda-forge hdbscan

PyPI install, presuming you have sklearn and all its requirements (numpy and scipy) installed:

pip install hdbscan

If pip is having difficulties pulling the dependencies then we'd suggest installing the dependencies manually using anaconda followed by pulling hdbscan from pip:

conda install cython
conda install numpy scipy
conda install scikit-learn
pip install hdbscan

For a manual install get this package:

wget https://github.com/scikit-learn-contrib/hdbscan/archive/master.zip
unzip master.zip
rm master.zip
cd hdbscan-master

Install the requirements

sudo pip install -r requirements.txt

or

conda install scikit-learn cython

Install the package

python setup.py install

Python Version

The hdbscan library supports both Python 2 and Python 3. However we recommend Python 3 as the better option if it is available to you.

Help and Support

For simple issues you can consult the FAQ in the documentation. If your issue is not suitably resolved there, please check the issues on github. Finally, if no solution is available there feel free to open an issue ; the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute please fork the project make your changes and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017

Licensing

The hdbscan package is 3-clause BSD licensed. Enjoy.

Owner
Leland McInnes
Leland McInnes
GitHub Stats Visualizations : Transparent

GitHub Stats Visualizations : Transparent Generate visualizations of GitHub user and repository statistics using GitHub Actions. ⚠️ Disclaimer The pro

YuanYap 7 Apr 05, 2022
A script written in Python that generate output custom color (HEX or RGB input to x1b hexadecimal)

ColorShell ─ 1.5 Planned for v2: setup.sh for setup alias This script converts HEX and RGB code to x1b x1b is code for colorize outputs, works on ou

Riley 4 Oct 31, 2021
Create animated and pretty Pandas Dataframe or Pandas Series

Rich DataFrame Create animated and pretty Pandas Dataframe or Pandas Series, as shown below: Installation pip install rich-dataframe Usage Minimal exa

Khuyen Tran 92 Dec 26, 2022
A python script editor for napari based on PyQode.

napari-script-editor A python script editor for napari based on PyQode. This napari plugin was generated with Cookiecutter using with @napari's cookie

Robert Haase 9 Sep 20, 2022
Example scripts for generating plots of Bohemian matrices

Bohemian Eigenvalue Plotting Examples This repository contains examples of generating plots of Bohemian eigenvalues. The examples in this repository a

Bohemian Matrices 5 Nov 12, 2022
Minimal Ethereum fee data viewer for the terminal, contained in a single python script.

Minimal Ethereum fee data viewer for the terminal, contained in a single python script. Connects to your node and displays some metrics in real-time.

48 Dec 05, 2022
TensorDebugger (TDB) is a visual debugger for deep learning. It extends TensorFlow with breakpoints + real-time visualization of the data flowing through the computational graph

TensorDebugger (TDB) is a visual debugger for deep learning. It extends TensorFlow (Google's Deep Learning framework) with breakpoints + real-time visualization of the data flowing through the comput

Eric Jang 1.4k Dec 15, 2022
YOPO is an interactive dashboard which generates various standard plots.

YOPO is an interactive dashboard which generates various standard plots.you can create various graphs and charts with a click of a button. This tool uses Dash and Flask in backend.

ADARSH C 38 Dec 20, 2022
Python Data. Leaflet.js Maps.

folium Python Data, Leaflet.js Maps folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js

6k Jan 02, 2023
Smoking Simulation is an app to simulate the spreading of smokers and non-smokers, their interactions and population during certain amount of time.

Smoking Simulation is an app to simulate the spreading of smokers and non-smokers, their interactions and population during certain

Bohdan Ruban 5 Nov 08, 2022
a robust room presence solution for home automation with nearly no false negatives

Argos Room Presence This project builds a room presence solution on top of Argos. Using just a cheap raspberry pi zero w (plus an attached pi camera,

Angad Singh 46 Sep 18, 2022
Mattia Ficarelli 2 Mar 29, 2022
PolytopeSampler is a Matlab implementation of constrained Riemannian Hamiltonian Monte Carlo for sampling from high dimensional disributions on polytopes

PolytopeSampler PolytopeSampler is a Matlab implementation of constrained Riemannian Hamiltonian Monte Carlo for sampling from high dimensional disrib

9 Sep 26, 2022
Tweets your monthly GitHub Contributions as Wordle grid

Tweets your monthly GitHub Contributions as Wordle grid

Venu Vardhan Reddy Tekula 5 Feb 16, 2022
Sci palettes for matplotlib/seaborn

sci palettes for matplotlib/seaborn Installation python3 -m pip install sci-palettes Usage import seaborn as sns import matplotlib.pyplot as plt impor

Qingdong Su 2 Jun 07, 2022
100 data puzzles for pandas, ranging from short and simple to super tricky (60% complete)

100 pandas puzzles Puzzles notebook Solutions notebook Inspired by 100 Numpy exerises, here are 100* short puzzles for testing your knowledge of panda

Alex Riley 1.9k Jan 08, 2023
Some useful extensions for Matplotlib.

mplx Some useful extensions for Matplotlib. Contour plots for functions with discontinuities plt.contour mplx.contour(max_jump=1.0) Matplotlib has pro

Nico Schlömer 519 Dec 30, 2022
GDSHelpers is an open-source package for automatized pattern generation for nano-structuring.

GDSHelpers GDSHelpers in an open-source package for automatized pattern generation for nano-structuring. It allows exporting the pattern in the GDSII-

Helge Gehring 76 Dec 16, 2022
A custom qq-plot for two sample data comparision

QQ-Plot 2 Sample Just a gist to include the custom code to draw a qq-plot in python when dealing with a "two sample problem". This means when u try to

1 Dec 20, 2021
Color maps for POV-Ray v3.7 from the Plasma, Inferno, Magma and Viridis color maps in Python's Matplotlib

POV-Ray-color-maps Color maps for POV-Ray v3.7 from the Plasma, Inferno, Magma and Viridis color maps in Python's Matplotlib. The include file Color_M

Tor Olav Kristensen 1 Apr 05, 2022