Buckshot++ is a new algorithm that finds highly stable clusters efficiently.

Overview

Buckshot++: An Outlier-Resistant and Scalable Clustering Algorithm. (Inspired by the Buckshot Algorithm.)

Here, we introduce a new algorithm, which we name Buckshot++. Buckshot++ improves upon k-means by addressing its main shortcoming: the need to predetermine the number of clusters, K. Typically, K is found in the following manner:

  1. settle on some metric,
  2. evaluate that metric at multiple values of K,
  3. use a greedy stopping rule to determine when to stop (typically the bend in an elbow curve).
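
As a rough illustration of that conventional procedure, the sketch below sweeps K on synthetic data, records the k-means inertia, and stops at the first K where adding one more cluster no longer cuts the inertia by at least half. The synthetic blobs, the choice of inertia as the metric, and the 50% threshold are all assumptions made for this example; none of them come from Buckshot++.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed toy data: three well-separated blobs in 2-D.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

# Steps 1 and 2: settle on a metric (inertia here) and evaluate it for several K.
inertia = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in range(1, 8)}

# Step 3: greedy stopping rule. Stop at the first K whose successor fails to
# halve the inertia (the 0.5 threshold is an arbitrary, hand-tuned choice).
elbow_k = max(inertia)  # fall back to the largest K tried
for k in range(1, 7):
    if inertia[k + 1] > 0.5 * inertia[k]:
        elbow_k = k
        break

print("elbow K:", elbow_k)
```

The threshold has to be tuned by hand for each dataset, which is one reason the elbow heuristic can be hit or miss.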

There must be a better way. We detail the following 3 improvements that the Buckshot++ algorithm makes to k-means.

  1. Not all metrics are created equal. K-means does not prescribe which metric to use for finding K, and we found that some of the commonly implemented metrics are too inconsistent from one iteration to the next. Buckshot++ prescribes the silhouette score for finding K.
  2. In k-means, every single point is clustered -- even the noise and outliers. But what we really care about is the pattern and not the noise. We show here an elegant way to overcome this problem -- even simpler than k-medoids or k-medians.
  3. Finally, the computational complexity of running k-means multiple times on the whole dataset to find the best K can be prohibitive. We show below a surprisingly simple alternative with better asymptotics.

Details of the Buckshot++ algorithm

ALGORITHM: Buckshot++
INPUTS: population of N vectors
B := number of bootstrap samples
F := max number of clusters to try
M := cluster quality metric
OUTPUT: the optimal K for kmeans

Take B bootstrap samples where each sample is of size 1/B.
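for each bootstrap sample do
  for each counter k from 2 to F do
    Compute kmeans with k centers on the sample.
    Compute the metric M on the clusters.
  Store the metrics for k = 2..F as that sample's metrics vector.
Compute the centroid of all metrics vectors.
Get argmax of the centroid vector; the k at that position is the output.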
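
The pseudocode above can be sketched with scikit-learn as follows. This is a minimal illustration, not the buckshotpp module itself: the function name `buckshot_pp_sketch`, the synthetic data, and the reading of "size 1/B" as N/B rows per sample are assumptions, and the silhouette score stands in for the metric M.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def buckshot_pp_sketch(X, B=3, F=6, seed=0):
    """Hypothetical helper: average silhouette curves over B bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample_size = n // B  # assumption: "size 1/B" means N/B rows per sample
    ks = list(range(2, F + 1))
    scores = np.empty((B, len(ks)))
    for b in range(B):
        # Bootstrap sample: rows drawn with replacement.
        sample = X[rng.integers(0, n, size=sample_size)]
        for j, k in enumerate(ks):
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=seed).fit_predict(sample)
            scores[b, j] = silhouette_score(sample, labels)  # metric M
    mean_scores = scores.mean(axis=0)  # centroid of the B metrics vectors
    return ks[int(np.argmax(mean_scores))], mean_scores


# Assumed toy data: three well-separated blobs in 2-D.
X_demo, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=0.5, random_state=0)
best_k, mean_scores = buckshot_pp_sketch(X_demo)
print("best K:", best_k)
print("mean silhouette for K = 2..6:", mean_scores.round(2))
```

Each row of `scores` corresponds to one bootstrap sample's curve, and `mean_scores` to their average.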

Explanation of Buckshot++

The Buckshot++ algorithm was motivated by the Buckshot algorithm, which essentially finds cluster centers by performing hierarchical clustering on a sample and then performing k-means by taking those cluster centers as inputs. Hierarchical has relatively high time complexity, which is why Buckshot performs hierarchical only on a sample. The key difference between hierarchical and kmeans is that the former is more deterministic/stable but less scalable than the latter, as the next table elucidates.

%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 500)
tbl = pd.DataFrame({'k-means': ['O(N * k * d * i)', 'random initial means; local minimum; outlier'],
                    'hierarchical': ['O(N^2 * logN)', 'outlier']}
                   , index=['Computational Complexity', 'Sources of Instability'])
tbl
                          k-means                                       hierarchical
Computational Complexity  O(N * k * d * i)                              O(N^2 * logN)
Sources of Instability    random initial means; local minimum; outlier  outlier

Hierarchical's higher time complexity means that, for large inputs, running k-means multiple times is still faster than running hierarchical just once. The Buckshot algorithm runs hierarchical just once on a small sample in order to initialize cluster centers for k-means. Since O(N^2 * logN) grows really fast, the sample must be really small to make it work computationally. But a key critique of Buckshot is failure to find the right structure with a small sample.
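
For comparison, the original Buckshot idea described above can be sketched as follows. This is an illustrative reading, not a reference implementation: the function name `buckshot_sketch`, the sample size, the synthetic data, and the use of scikit-learn's default (Ward) linkage are all assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs


def buckshot_sketch(X, k, sample_size, seed=0):
    """Hypothetical helper: hierarchical clustering on a sample seeds k-means."""
    rng = np.random.default_rng(seed)
    sample = X[rng.choice(X.shape[0], size=sample_size, replace=False)]
    # Step 1: hierarchical clustering on the small sample only.
    sample_labels = AgglomerativeClustering(n_clusters=k).fit_predict(sample)
    # Step 2: the mean of each hierarchical cluster becomes an initial center.
    init_centers = np.vstack([sample[sample_labels == c].mean(axis=0)
                              for c in range(k)])
    # Step 3: k-means on the full data, started from those centers.
    return KMeans(n_clusters=k, init=init_centers, n_init=1).fit(X)


# Assumed toy data: three well-separated blobs in 2-D.
X_demo, y_demo = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                            cluster_std=0.5, random_state=0)
model = buckshot_sketch(X_demo, k=3, sample_size=30)
print(model.cluster_centers_.round(1))
```

Note that K is still an input here; Buckshot only supplies the starting centers.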

Buckshot++'s key innovation lies in the step "Take B bootstrap samples where each sample is of size 1/B." While Buckshot is doing hierarchical on a sample, Buckshot++ is doing multiple kmeans on bootstrap samples. Doing kmeans many times can still finish sooner than doing hierarchical just once, as the time complexities above show. An added bonus is that bootstrapping is a great way to smooth out noise and improve stability. In fact, that is exactly why Bagging (a.k.a. Bootstrap Aggregating) and Random Forests work so well.
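
To make the complexity argument concrete, the sketch below plugs numbers into the two big-O expressions from the table. The values of d, i, B and F are assumptions chosen only for illustration, the logarithm is taken base 2, and big-O expressions ignore constant factors, so the results should be read as rough orders of magnitude at best.

```python
import math


def kmeans_ops(n, k, d, i):
    """Operation count suggested by O(N * k * d * i)."""
    return n * k * d * i


def hierarchical_ops(n):
    """Operation count suggested by O(N^2 * logN), using log base 2."""
    return n ** 2 * math.log2(n)


def buckshot_pp_ops(n, B, F, d, i):
    """B bootstrap samples of size n / B, each clustered for k = 2..F."""
    per_sample = sum(kmeans_ops(n / B, k, d, i) for k in range(2, F + 1))
    return B * per_sample


# Assumed values: 100 dimensions, 50 k-means iterations, 10 samples, K up to 30.
d, i, B, F = 100, 50, 10, 30
for n in (10_000, 1_000_000):
    sweep = buckshot_pp_ops(n, B, F, d, i)
    hier = hierarchical_ops(n)
    print(f"N={n:>9,}  k-means sweep={sweep:.2e}  hierarchical={hier:.2e}")
```

Under these assumed values the k-means sweep only comes out ahead once N is large, because the quadratic term in the hierarchical count eventually dominates. The B samples of size N/B together cost about as much as one sweep over N points.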

Python implementation of Buckshot++

The core algorithm implementation is in the buckshotpp module. We use it below to cluster a news headlines dataset.

from buckshotpp import Clusterings, plot_mult_samples
from numpy.random import choice
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
import nltk; nltk.download('punkt', quiet=True)
import matplotlib.pyplot as plt; plt.rcParams['figure.dpi'] = 120
import warnings; warnings.filterwarnings('ignore')

vecSpaceMod = Clusterings({'file_loc': 'data/news_headlines.csv',
                           'tf_dampen': True,
                           'common_word_pct': 1,
                           'rare_word_pct': 1,
                           'dim_redu': False}
                         )  # Instantiate a Clusterings object using parameters.
news_df = vecSpaceMod.get_file() # Read news_headlines.csv into a df.
metrics_byK = vecSpaceMod.buckshot(news_df)
plot_mult_samples(metrics_byK, 'silhouette')

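[Figure: silhouette score by number of clusters K, one green curve per bootstrap sample and a red curve for their average]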

An insight from this chart

Each green curve is generated from a bootstrap sample, and the red curve is their average. Recall the sources of instability for k-means listed in the table above; outliers are one of them. The concept of an outlier has a somewhat different meaning in the context of clustering. In supervised learning, an outlier is a rare observation that is far from other observations distance-wise. In clustering, a far-away observation is its own well-separated cluster. Our interpretation is that "rare" is the operative word: outliers are singleton clusters that exert undue influence on the formation of other clusters. Notice how bagging led to a more stable estimate of the optimal number of clusters in the graph above.

Not all metrics are created equal

Two of the internal clustering metrics implemented in scikit-learn are the Silhouette Coefficient and the Calinski-Harabasz index. Comparing the Silhouette plotted above with the Calinski plotted below, it's clear that Calinski is far more extreme, perhaps implausibly extreme.

plot_mult_samples(metrics_byK, 'calinski')

[Figure: Calinski-Harabasz score by number of clusters K for the bootstrap samples]
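
One difference between the two metrics is their scale: the silhouette coefficient is bounded in [-1, 1], while the Calinski-Harabasz score is a ratio of between-cluster to within-cluster dispersion with no upper bound. The sketch below computes both on the same synthetic data so the scales can be compared side by side. The data is an assumption, and whether this difference accounts for the shape of the curves on the news headlines is not established here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Assumed toy data: three well-separated blobs in 2-D.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

sil, ch = {}, {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    sil[k] = silhouette_score(X, labels)         # bounded in [-1, 1]
    ch[k] = calinski_harabasz_score(X, labels)   # positive, no upper bound
    print(f"K={k}  silhouette={sil[k]:.3f}  calinski_harabasz={ch[k]:.1f}")
```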

Internal or External Clustering Metrics?

This data contains a field named "STORY" that indicates which story a headline belongs to. With this field as the ground truth, we compute Adjusted Mutual Information (a chance-corrected variant of Mutual Information, a commonly used external metric) using the code below. The score is 1 when the two partitions are identical and close to 0 when their agreement is no better than chance. Using the K resulting from Buckshot++, we obtained a score of about 0.64, an indicator that the model performance is reasonable.

X = vecSpaceMod.term_weight_matr(news_df.TITLE)
kmeans_fit = KMeans(n_clusters=20).fit(X)  # K = 20 comes from the inflection point of the silhouette plot
mutual_info = adjusted_mutual_info_score(labels_true=news_df.STORY, labels_pred=kmeans_fit.labels_)
mutual_info
0.6435601965984835
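
To help read a score like this, the small sketch below evaluates `adjusted_mutual_info_score` on two hand-made labelings (toy labels invented for this example). It shows that the score depends only on how points are grouped, not on the label names, and that lumping everything into one cluster shares no information with the ground truth.

```python
from sklearn.metrics import adjusted_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping as the truth, but with the cluster ids renamed.
renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]
ami_renamed = adjusted_mutual_info_score(truth, renamed)

# Every point placed in a single cluster.
one_cluster = [0] * 9
ami_one_cluster = adjusted_mutual_info_score(truth, one_cluster)

print("renamed labels:", round(ami_renamed, 3))
print("single cluster:", round(ami_one_cluster, 3))
```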

Practically, does Buckshot++ produce well-separated clusters?

Taking a look at the documents and their corresponding "predictedCluster", the results certainly do seem reasonable.

cluster_results = pd.DataFrame({'predictedCluster': kmeans_fit.labels_,
                                'document': news_df.TITLE})
cluster_results.sort_values(by='predictedCluster', inplace=True)

cluster_results
index predictedCluster document
25 0 SAC Capital Starts Anew as Point72
50 0 Zebra Technologies to Acquire Enterprise Busin...
23 0 Fine Tuning: Good Wife just gets better
21 0 Boulder's Wealth May Be A Factor For Lowest Ob...
6 0 Power restored to nuclear plant in Waterford, ...
73 0 Electricity out as Millstone shifts to diesel
59 1 Twitter's head of media Chloe Sladden steps do...
28 1 Twitter's revolving door: media head Chloe Sla...
12 1 Twitter Exec Exodus Continues with Media Chief...
67 2 Sony Xperia C3 arrives with 5MP selfie camera,...
30 2 Leaked: Images Of Sony's Xperia C3 'Selfie Phone'
45 2 Sony Xperia Z2 Encased In A Block Of Ice, Cont...
90 2 Sony Xperia Z4 Concept Emerges as Fan Imagines...
78 2 If you hate the word 'selfie' look away now, t...
71 3 Twitter Executive Quits Amid Stalling Growth
47 3 Twitter COO quits, signalling management shake-up
52 3 Twitter Loses a Powerful Executive
31 3 Second Twitter executive quits hours after Row...
20 3 Twitter COO resigns as growth lags
61 3 Twitter COO Rowghani resigns amid lacklustre g...
57 4 'Goodbye Twitter' COO Ali Rowghani, says bye t...
69 4 Twitter chief operating officer resigns as use...
66 4 UPDATE 3-Twitter chief operating officer resig...
86 4 Twitter chief operating officer Ali Rowghani h...
76 4 Ali Rowghani, Twitter's COO, resigns after mon...
49 4 Twitter COO Ali Rowghani Just Announced Via Tw...
13 4 Twitter COO Ali Rowghani Exits
35 4 Second Twitter exec resigns with goodbye tweet...
39 5 Why almost everything you've been told about u...
77 5 Why Fargo Works So Well as a TV Show
0 6 'Mad Men' Preview: Buckle Up For 7 'Dense' Epi...
4 6 'Mad Men' end in sight for Weiner
36 6 Weiner reflects on the beginning of the end of...
42 7 Giant mystery crater in Siberia has scientists...
85 7 Mysterious giant crater in the earth discovere...
60 7 Massive Crater Discovered in Siberia
92 7 Massive mystery crater at 'end of the world'
16 7 Mysterious crater in Siberia spawns wild Inter...
43 8 Inflation rise stalls wage hopes in the UK
82 8 The Least Obese City in the Country
19 8 Real wages could resume fall as "Easter effect...
55 8 UK Inflation Rise To 1.8% Delays Real Wage Ris...
26 8 Virginia's Governor Challenges Abortion Clinic...
51 8 BREAKING NEWS: Transport costs lead to hike in...
8 8 Cable prices climb 4 times faster than inflati...
79 9 Despite Safety Issues, GM's Sales Still Increa...
17 9 Chrysler Group LLC reports June 2014 US sales ...
40 9 GM June Sales Up 9 Percent, Best June Since 2007
87 9 Ford sales fall, GM barely even; Jeep powers C...
18 10 Gov. McAuliffe Makes Health Announcements
48 10 Microsoft wants Windows XP dead and has announ...
74 10 McAuliffe puts focus on women's health
7 11 Sony makes duckfacing official with Xperia C3,...
54 11 Sony to announce 'Selfie' phone on July 8th wi...
27 11 Sony prepares to launch a smartphone that has ...
91 11 Sony Xperia C3 launches as "world's best selfi...
88 11 Sony unveils Xperia C3 smartphone with LED fla...
11 11 Sony Xperia C3 Boasts 5MP "PROselfie" Front-fa...
44 12 UK CPI rises to 1.8% in April, core CPI hits 2%
75 12 Rising CO2 Levels Will Lower Nutritional Value...
1 12 Here's How Climate Change Will Make Food Less ...
81 12 Rising CO2 levels also make our food less nutr...
80 13 Nutrition in Crops Are Cut down Drastically by...
2 13 Rising carbon dioxide levels reduce nutrients ...
68 13 With carbon dioxide levels up, nutrients in cr...
64 14 Inflation back up: Modest rise to 1.8% in Apri...
83 14 US plants prepare for long-term nuclear waste ...
22 14 Nuclear Plant Operators Deal With Radioactive ...
32 14 US plants prepare long-term nuclear waste stor...
84 15 'Mad Men' takes off on its final flight
3 15 'Mad Men' mixology
5 15 'Mad Men': 7 things to know for Season 7
9 15 Mad Men - the (Blaxploitation) Movie
37 15 TV Review: Mad Men Season 7
46 15 'Mad Men': Season 7 Premiere Guide (Video)
70 15 10 Things You Never Knew About 'Mad Men'!
53 15 'Mad Men' Season 7 Spoilers: Everything We Kno...
72 15 Rich Sommer from AMC's 'Mad Men' Season Premiere
63 16 Fargo (FX) Season Finale 2014 "Morton's Fork"
56 16 Before 'Fargo's' season finale, a sequel (or p...
65 16 'Fargo' Season 1 Spoilers: Episode 10 Synopsis...
62 17 Google Glass headsets get new designs in colla...
41 17 Google's first fashionable Glass frames are de...
89 17 Google Glass Still Trying To Look Cool
34 17 Net-a-Porter Embraces Google Glass
15 18 Routine pelvic exams not recommended under new...
14 18 Doctors group nixes routine pelvic exams
38 18 Metro Detroit doctors wary of recommendation a...
10 18 Doctors against having frequent pelvic exams
58 19 Technology stocks falling for 2nd day in a row
24 19 UPDATE 5-JPMorgan profit weaker than expected ...
29 19 JPMorgan profit weaker than expected
33 19 Marks and Spencer's profits fall for third year

Summary of the key advantages of Buckshot++

  • Accurate method of estimating the number of clusters (a clearly best Silhouette score emerged every time, while typical elbow-heuristic searches can be hit or miss).
  • Scalable (faster search for K achieved by using k-means rather than hierarchical; running k-means on subsample rather than everything).
  • Noise resistant when used in conjunction with k-means++ (sampling with replacement lessens the chance of selecting an outlier in the bootstrap sample).
Owner
John Jung
Senior Machine Learning Engineer