The "breathing k-means" algorithm with datasets and example notebooks

Last update: Nov 17, 2022

Overview

The Breathing K-Means Algorithm (with examples)

The Breathing K-Means is an approximation algorithm for the k-means problem that (on average) is better (higher solution quality) and faster (lower CPU time usage) than k-means++.

Techreport: https://arxiv.org/abs/2006.15666 (submitted for publication)

Typical results for the "Birch" data set (100000 points drawn from a mixture of 100 circular Gaussians). k=100

Can you spot the mistakes? :-)

Installation from pypi

pip install bkmeans

Local installation to run the examples

Clone the repository

git clone https://github.com/gittar/breathing-k-means

Enter the top directory.

cd breathing-k-means

Create the conda environment 'bkm' (or any other name) via

conda env create -n bkm -f environment.yml

Activate the created environment via

conda activate bkm

To run a jupyter notebook with examples, type, e.g.:

jupyter lab notebooks/2D.ipynb

Content

The top level folder contains the following subfolders

data/ - data sets used in the notebooks
notebooks/ - jupyter notebooks with all examples from the technical report
- 2D.ipynb 2D problems executed with helper functions for brevity
- 2D_detail.ipynb 2D problems executed with raw API
- 10+D.ipynb high-dimensional problems based on the data sets from the original k-means++ publication
src/
- bkmeans.py - reference implementation of breathing k-means
misc/
- aux.py - auxiliary functions
- dataset.py - general class to administer and plot data sets
- runfunctions.py - wrapper functions used in the notebook

API

The included class BKMeans is subclassed from scikit-learn's KMeans class and has, therefore, the same API. It can be used as a plug-in replacement for scikit-learn's KMeans.

There is one new parameters which can be ignored (left at default) for normal usage:

m (breathing depth), default: 5

The parameter m can also be used, however, to generate faster ( 1 < m < 5) or better (m>5) solutions. For details see the technical report.

Example 1: running on simple random data set

Code:

import numpy as np
from bkmeans import BKMeans

# generate random data set
X=np.random.rand(1000,2)

# create BKMeans instance
bkm = BKMeans(n_clusters=100)

# run the algorithm
bkm.fit(X)

# print SSE (inertia in scikit-learn terms)
print(bkm.inertia_)

Output:

1.1775040547902602

Example 2: comparison with k-means++ (multiple runs)

Code:

import numpy as np
from sklearn.cluster import KMeans
from bkmeans import BKMeans

# random 2D data set
X=np.random.rand(1000,2)

# number of centroids
k=100

for i in range(5):
    # kmeans++
    km = KMeans(n_clusters=k)
    km.fit(X)

    # breathing k-means
    bkm = BKMeans(n_clusters=k)
    bkm.fit(X)

    # relative SSE improvement of bkm over km++
    imp = 1 - bkm.inertia_/km.inertia_
    print(f"SSE improvement over k-means++: {imp:.2%}")

Output:

SSE improvement over k-means++: 3.38%
SSE improvement over k-means++: 4.16%
SSE improvement over k-means++: 6.14%
SSE improvement over k-means++: 6.79%
SSE improvement over k-means++: 4.76%

Acknowledgements

Kudos go the scikit-learn team for their excellent sklearn.cluster.KMeans class, also to the developers and maintainers of the other packages used: numpy, scipy, matplotlib, jupyterlab

The "breathing k-means" algorithm with datasets and example notebooks

Related tags

Overview

The Breathing K-Means Algorithm (with examples)

Installation from pypi

Local installation to run the examples

Content

API

Example 1: running on simple random data set

Example 2: comparison with k-means++ (multiple runs)

Acknowledgements

Owner

Bernd Fritzke

Efficient 6-DoF Grasp Generation in Cluttered Scenes

Unofficial Tensorflow 2 implementation of the paper Implicit Neural Representations with Periodic Activation Functions

Streamlit Tutorial (ex: stock price dashboard, cartoon-stylegan, vqgan-clip, stylemixing, styleclip, sefa)

Code to generate datasets used in "How Useful is Self-Supervised Pretraining for Visual Tasks?"

[NeurIPS 2021] Official implementation of paper "Learning to Simulate Self-driven Particles System with Coordinated Policy Optimization".

Learning Pixel-level Semantic Affinity with Image-level Supervision for Weakly Supervised Semantic Segmentation, CVPR 2018

Weighing Counts: Sequential Crowd Counting by Reinforcement Learning

Similarity-based Gray-box Adversarial Attack Against Deep Face Recognition

Code for "SRHEN: Stepwise-Refining Homography Estimation Network via Parsing Geometric Correspondences in Deep Latent Space"

The code for the CVPR 2021 paper Neural Deformation Graphs, a novel approach for globally-consistent deformation tracking and 3D reconstruction of non-rigid objects.

Code for database and frontend of webpage for Neural Fields in Visual Computing and Beyond.

This is code to fit per-pixel environment map with spherical Gaussian lobes, using LBFGS optimization

This is a Keras-based Python implementation of DeepMask- a complex deep neural network for learning object segmentation masks

Rethinking Transformer-based Set Prediction for Object Detection

Pretrained SOTA Deep Learning models, callbacks and more for research and production with PyTorch Lightning and PyTorch

This is a Pytorch implementation of paper: DropEdge: Towards Deep Graph Convolutional Networks on Node Classification

This is the official implementation of 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection, built on SECOND.

QuALITY: Question Answering with Long Input Texts, Yes!

Remote sensing change detection using PaddlePaddle

TAPEX: Table Pre-training via Learning a Neural SQL Executor