Buckshot++ is a new algorithm that finds highly stable clusters efficiently.

Overview

Buckshot++: An Outlier-Resistant and Scalable Clustering Algorithm. (Inspired by the Buckshot Algorithm.)

Here, we introduce a new algorithm, which we name Buckshot++. Buckshot++ improves upon the k-means by dealing with the main shortcoming thereof, namely, the need to predetermine the number of clusters, K. Typically, K is found in the following manner:

  1. settle on some metric,
  2. evaluate that metric at multiple values of K,
  3. use a greedy stopping rule to determine when to stop (typically the bend in an elbow curve).

There must be a better way. We detail the following 3 improvements that the Buckshot++ algorithm makes to k-means.

  1. Not all metrics are create equal. And since K-means doesn't prescribe which metric to use for finding K, we analyzed that some of the commonly implemented metrics are too inconsistent from one iteration to the next. Buckshot++ prescribes the silhouette score for finding K.
  2. In k-means, every single point is clustered -- even the noise and outliers. But what we really care about is the pattern and not the noise. We show here an elegant way to overcome this problem -- even simpler than k-medoids or k-medians.
  3. Finally, the computational complexity of running k-means multiple times on the whole dataset to find the best K can be prohibitive. We show below a surprisingly simple alternative with better asymptotics.

Details of the Buckshot++ algorithm

ALGORITHM: Buckshot++
INPUTS: population of N vectors
B := number of bootstrap samples
F := max number of clusters to try
M := cluster quality metric
OUTPUT: the optimal K for kmeans

Take B bootstrap samples where each sample is of size 1/B.
for each counter k from 2 to F do
  Compute kmeans with k centers.
  Compute the metric M on the clusters.
Compute the centroid of all metrics vectors.
Get argmax of the centroid vector.

Explanation of Buckshot++

The Buckshot++ algorithm was motivated by the Buckshot algorithm, which essentially finds cluster centers by performing hierarchical clustering on a sample and then performing k-means by taking those cluster centers as inputs. Hierarchical has relatively high time complexity, which is why Buckshot performs hierarchical only on a sample. The key difference between hierarchical and kmeans is that the former is more deterministic/stable but less scalable than the latter, as the next table elucidates.

%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 500)
tbl = pd.DataFrame({'k-means': ['O(N * k * d * i)', 'random initial means; local minimum; outlier'],
                    'hierarchical': ['O(N^2 * logN)', 'outlier']}
                   , index=['Computational Complexity', 'Sources of Instability'])
tbl
k-means hierarchical
Computational Complexity O(N * k * d * i) O(N^2 * logN)
Sources of Instability random initial means; local minimum; outlier outlier

Hierarchical's higher time complexity means that, for large inputs, running k-means multiple times is still faster than running hierarchical just once. The Buckshot algorithm runs hierarchical just once on a small sample in order to initialize cluster centers for k-means. Since O(N^2 * logN) grows really fast, the sample must be really small to make it work computationally. But a key critique of Buckshot is failure to find the right structure with a small sample.

Buckshot++'s key innovation lies in the step "Take B bootstrap samples where each sample is of size 1/B." While Buckshot is doing hierarchical on a sample, Buckshot++ is doing multiple kmeans on bootstrap samples. Doing kmeans many times can still finish sooner than doing hierarchical just once, as the time complexities above show. An added bonus is that bootstrapping is a great way to smooth out noise and improve stability. In fact, that is exactly why Bagging (a.k.a. Bootstrap Aggregating) and Random Forests work so well.

Python implementation of Buckshot++

The core algorithm implementation is in the buckshotpp module. We use it below to cluster a news headlines dataset.

from buckshotpp import Clusterings, plot_mult_samples
from numpy.random import choice
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
import nltk; nltk.download('punkt', quiet=True)
import matplotlib.pyplot as plt; plt.rcParams['figure.dpi'] = 120
import warnings; warnings.filterwarnings('ignore')

vecSpaceMod = Clusterings({'file_loc': 'data/news_headlines.csv',
                           'tf_dampen': True,
                           'common_word_pct': 1,
                           'rare_word_pct': 1,
                           'dim_redu': False}
                         )  # Instantiate a Clusterings object using parameters.
news_df = vecSpaceMod.get_file() # Read news_headlines.csv into a df.
metrics_byK = vecSpaceMod.buckshot(news_df)
plot_mult_samples(metrics_byK, 'silhouette')

png

An insight from this chart

Each green curve is generated from a bootstrap sample, and the red curve is their average. Remember the sources of instability for k-means listed in the table above? Outlier is one. The concept of outlier has somewhat different meaning in the context of clustering. In supervised learning, an outlier is a rare observation that's far from other observations distance-wise. In clustering, a far away observation is its own well-separated cluster. Here, our interpretation is that "rare" is the operative word here and that outliers are singleton clusters that exert undue influence on the formation of other clusters. Look at how bagging led to a more stable estimate of the optimal number of clusters in the graph above.

Not all metrics are create equal

The two internal clustering metrics implemented in scikit-learn are: the Silhouette Coefficient and the Calinski-Harabasz criterion. Comparing the Silhouette plotted above with the Calinski plotted below, it's clear that Calinski is far more extreme, perhaps implausibly extreme.

plot_mult_samples(metrics_byK, 'calinski')

png

Internal or External Clustering Metrics?

This data contains a field named "STORY" that indicates which story a headline belongs to. With this field as the ground truth, we compute Mutual Information (the most common external metric) using the code below. Mutual Information's possible range is 0-1. Using the K resulting from Buckshot++, we obtained a Mutual Information of about 0.6, an indicator that the model performance is reasonable.

X = vecSpaceMod.term_weight_matr(news_df.TITLE)
kmeans_fit = KMeans(20).fit(X)  # the argument comes from inflectin point of silhouette plot
mutual_info = adjusted_mutual_info_score(labels_true=news_df.STORY, labels_pred=kmeans_fit.labels_) 
mutual_info
0.6435601965984835

Practically, does Buckshot++ produce well-separated clusters?

Taking a look at the documents and their corresponding "predictedCluster", the results certainly do seem reasonable.

cluster_results = pd.DataFrame({'predictedCluster': kmeans_fit.labels_,
                                'document': news_df.TITLE})
cluster_results.sort_values(by='predictedCluster', inplace=True)

cluster_results
predictedCluster document
25 0 SAC Capital Starts Anew as Point72
50 0 Zebra Technologies to Acquire Enterprise Busin...
23 0 Fine Tuning: Good Wife just gets better
21 0 Boulder's Wealth May Be A Factor For Lowest Ob...
6 0 Power restored to nuclear plant in Waterford, ...
73 0 Electricity out as Millstone shifts to diesel
59 1 Twitter's head of media Chloe Sladden steps do...
28 1 Twitter's revolving door: media head Chloe Sla...
12 1 Twitter Exec Exodus Continues with Media Chief...
67 2 Sony Xperia C3 arrives with 5MP selfie camera,...
30 2 Leaked: Images Of Sony's Xperia C3 'Selfie Phone'
45 2 Sony Xperia Z2 Encased In A Block Of Ice, Cont...
90 2 Sony Xperia Z4 Concept Emerges as Fan Imagines...
78 2 If you hate the word 'selfie' look away now, t...
71 3 Twitter Executive Quits Amid Stalling Growth
47 3 Twitter COO quits, signalling management shake-up
52 3 Twitter Loses a Powerful Executive
31 3 Second Twitter executive quits hours after Row...
20 3 Twitter COO resigns as growth lags
61 3 Twitter COO Rowghani resigns amid lacklustre g...
57 4 'Goodbye Twitter' COO Ali Rowghani, says bye t...
69 4 Twitter chief operating officer resigns as use...
66 4 UPDATE 3-Twitter chief operating officer resig...
86 4 Twitter chief operating officer Ali Rowghani h...
76 4 Ali Rowghani, Twitter's COO, resigns after mon...
49 4 Twitter COO Ali Rowghani Just Announced Via Tw...
13 4 Twitter COO Ali Rowghani Exits
35 4 Second Twitter exec resigns with goodbye tweet...
39 5 Why almost everything you've been told about u...
77 5 Why Fargo Works So Well as a TV Show
0 6 'Mad Men' Preview: Buckle Up For 7 'Dense' Epi...
4 6 'Mad Men' end in sight for Weiner
36 6 Weiner reflects on the beginning of the end of...
42 7 Giant mystery crater in Siberia has scientists...
85 7 Mysterious giant crater in the earth discovere...
60 7 Massive Crater Discovered in Siberia
92 7 Massive mystery crater at 'end of the world'
16 7 Mysterious crater in Siberia spawns wild Inter...
43 8 Inflation rise stalls wage hopes in the UK
82 8 The Least Obese City in the Country
19 8 Real wages could resume fall as "Easter effect...
55 8 UK Inflation Rise To 1.8% Delays Real Wage Ris...
26 8 Virginia's Governor Challenges Abortion Clinic...
51 8 BREAKING NEWS: Transport costs lead to hike in...
8 8 Cable prices climb 4 times faster than inflati...
79 9 Despite Safety Issues, GM's Sales Still Increa...
17 9 Chrysler Group LLC reports June 2014 US sales ...
40 9 GM June Sales Up 9 Percent, Best June Since 2007
87 9 Ford sales fall, GM barely even; Jeep powers C...
18 10 Gov. McAuliffe Makes Health Announcements
48 10 Microsoft wants Windows XP dead and has announ...
74 10 McAuliffe puts focus on women's health
7 11 Sony makes duckfacing official with Xperia C3,...
54 11 Sony to announce 'Selfie' phone on July 8th wi...
27 11 Sony prepares to launch a smartphone that has ...
91 11 Sony Xperia C3 launches as "world's best selfi...
88 11 Sony unveils Xperia C3 smartphone with LED fla...
11 11 Sony Xperia C3 Boasts 5MP "PROselfie" Front-fa...
44 12 UK CPI rises to 1.8% in April, core CPI hits 2%
75 12 Rising CO2 Levels Will Lower Nutritional Value...
1 12 Here's How Climate Change Will Make Food Less ...
81 12 Rising CO2 levels also make our food less nutr...
80 13 Nutrition in Crops Are Cut down Drastically by...
2 13 Rising carbon dioxide levels reduce nutrients ...
68 13 With carbon dioxide levels up, nutrients in cr...
64 14 Inflation back up: Modest rise to 1.8% in Apri...
83 14 US plants prepare for long-term nuclear waste ...
22 14 Nuclear Plant Operators Deal With Radioactive ...
32 14 US plants prepare long-term nuclear waste stor...
84 15 'Mad Men' takes off on its final flight
3 15 'Mad Men' mixology
5 15 'Mad Men': 7 things to know for Season 7
9 15 Mad Men - the (Blaxploitation) Movie
37 15 TV Review: Mad Men Season 7
46 15 'Mad Men': Season 7 Premiere Guide (Video)
70 15 10 Things You Never Knew About 'Mad Men'!
53 15 'Mad Men' Season 7 Spoilers: Everything We Kno...
72 15 Rich Sommer from AMC's 'Mad Men' Season Premiere
63 16 Fargo (FX) Season Finale 2014 �Morton's Fork�
56 16 Before 'Fargo's' season finale, a sequel (or p...
65 16 'Fargo' Season 1 Spoilers: Episode 10 Synopsis...
62 17 Google Glass headsets get new designs in colla...
41 17 Google's first fashionable Glass frames are de...
89 17 Google Glass Still Trying To Look Cool
34 17 Net-a-Porter Embraces Google Glass
15 18 Routine pelvic exams not recommended under new...
14 18 Doctors group nixes routine pelvic exams
38 18 Metro Detroit doctors wary of recommendation a...
10 18 Doctors against having frequent pelvic exams
58 19 Technology stocks falling for 2nd day in a row
24 19 UPDATE 5-JPMorgan profit weaker than expected ...
29 19 JPMorgan profit weaker than expected
33 19 Marks and Spencer's profits fall for third year

Summary of the key advantages of Buckshot++

  • Accurate method of estimating the number of clusters (a clearly best Silhouette emerged every time, while typical elbow heuristic searches can hit or miss).
  • Scalable (faster search for K achieved by using k-means rather than hierarchical; running k-means on subsample rather than everything).
  • Noise resistant when used in conjunction with k-means++ (sampling with replacement lessens the chance of selecting an outlier in the bootstrap sample).
Owner
John Jung
Senior Machine Learning Engineer
John Jung
Auto-detecting the n+1 queries problem in Python

nplusone nplusone is a library for detecting the n+1 queries problem in Python ORMs, including SQLAlchemy, Peewee, and the Django ORM. The Problem Man

Joshua Carp 837 Dec 29, 2022
django CMS Association 1.6k Jan 06, 2023
A fresh approach to autocomplete implementations, specially for Django.

A fresh approach to autocomplete implementations, specially for Django. Status: v3 stable, 2.x.x stable, 1.x.x deprecated. Please DO regularely ping us with your link at #yourlabs IRC channel

YourLabs 1.6k Dec 22, 2022
MAC address Model Field & Form Field for Django apps

django-macaddress MAC Address model and form fields for Django We use netaddr to parse and validate the MAC address. The tests aren't complete yet. Pa

49 Sep 04, 2022
Generate generic activity streams from the actions on your site. Users can follow any actors' activities for personalized streams.

Django Activity Stream What is Django Activity Stream? Django Activity Stream is a way of creating activities generated by the actions on your site. I

Justin Quick 2.1k Dec 29, 2022
Returns unicode slugs

Python Slugify A Python slugify application that handles unicode. Overview Best attempt to create slugs from unicode strings while keeping it DRY. Not

Val Neekman (AvidCoder) 1.3k Dec 23, 2022
Projeto onde podes inserir notícias, ver todas as notícias guardas e filtrar por tag. A base de dados usada é o mongoDB.

djangoProject Projeto onde podes inserir notícias, ver todas as notícias guardas e filtrar por tag. A base de dados usada é o mongoDB. packages utiliz

Sofia Rocha 1 Feb 22, 2022
Django based webapp pulling in crypto news and price data via api

Deploy Django in Production FTA project implementing containerization of Django Web Framework into Docker to be placed into Azure Container Services a

0 Sep 21, 2022
☄️ Google Forms autofill script

lazrr 'Destroy Them With Lazers' - Knife Party, 2011 Google Forms autofill script Installation: pip3 install -r requirements.txt Usage: python3 lazrr.

Serezha Rakhmanov 12 Jun 04, 2022
Django CRUD REST API Generator

Django CRUD REST API Generator This is a simple tool that generates a Django REST API with the given models. Specs: Authentication, DRF generic views,

Mehmet Alp Sümer 57 Nov 24, 2022
A simple REST API to manage postal addresses, written in Python/Django.

A simple REST API to manage postal addresses, written in Python/Django.

Attila Bagossy 2 Feb 14, 2022
Django Pickled Model

Django Pickled Model Django pickled model provides you a model with dynamic data types. a field can store any value in any type. You can store Integer

Amir 3 Sep 14, 2022
REST API con Python, Django y MySQL (GET, POST, PUT, DELETE)

django_api_mysql REST API con Python, Django y MySQL (GET, POST, PUT, DELETE) REST API con Python, Django y MySQL (GET, POST, PUT, DELETE)

Andrew 1 Dec 28, 2021
🔃 A simple implementation of STOMP with Django

Django Stomp A simple implementation of STOMP with Django. In theory it can work with any broker which supports STOMP with none or minor adjustments.

Juntos Somos Mais 32 Nov 08, 2022
Fast / fuzzy PostgreSQL counts for Django

Created by Stephen McDonald Introduction Up until PostgreSQL 9.2, COUNT queries generally required scanning every row in a database table. With millio

stephenmcd 85 Oct 25, 2021
Built from scratch to replicate some of the Django admin functionality and add some more, to serve as an introspective interface for Django and Mongo.

django-mongonaut Info: An introspective interface for Django and MongoDB. Version: 0.2.21 Maintainer: Jazzband (jazzband.co) This Project is Being Mov

Jazzband 238 Dec 26, 2022
Sistema de tratamento e análise de grandes volumes de dados através de técnicas de Data Science

Sistema de tratamento e análise de grandes volumes de dados através de técnicas de data science Todos os scripts, gráficos e relatórios de todas as at

Arthur Quintanilha Neto 1 Sep 05, 2022
A Django app for managing robots.txt files following the robots exclusion protocol

Django Robots This is a basic Django application to manage robots.txt files following the robots exclusion protocol, complementing the Django Sitemap

Jazzband 406 Dec 26, 2022
Django-pwned - A collection of django password validators

Django Pwned A collection of django password validators. Compatibility Python: 3

Quera 22 Jun 27, 2022
Rosetta is a Django application that eases the translation process of your Django projects

Rosetta Rosetta is a Django application that facilitates the translation process of your Django projects. Because it doesn't export any models, Rosett

Marco Bonetti 909 Dec 26, 2022