Clustering is a popular approach to detect patterns in unlabeled data

Last update: Nov 11, 2022

Visual Clustering

Clustering is a popular approach to detect patterns in unlabeled data. Existing clustering methods typically treat samples in a dataset as points in a metric space and compute distances to group similar points together. Visual Clustering is a different way of clustering points in 2-dimensional space, inspired by how humans "visually" cluster data. The algorithm is based on trained neural networks that perform instance segmentation on plotted data.
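To make the "cluster the plot, not the distances" idea concrete, here is a deliberately simplified sketch. It is not the method used by this package: the trained segmentation networks are swapped for a plain connected-component flood fill on a rasterized grid, and the names grid_cluster and bins are hypothetical, chosen only for illustration.

```python
from collections import deque

import numpy as np


def grid_cluster(points, bins=64):
    """Rasterize 2-D points and label connected occupied pixels.

    Simplified stand-in for image segmentation: returns one integer
    label per input point, starting at 0.
    """
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero

    # Map every point to a pixel of a bins x bins image.
    cells = np.minimum(((points - lo) / span * bins).astype(int), bins - 1)
    occupied = np.zeros((bins, bins), dtype=bool)
    occupied[cells[:, 0], cells[:, 1]] = True

    # Flood fill: 8-connected occupied pixels receive the same label.
    labels = np.full((bins, bins), -1, dtype=int)
    current = 0
    for start in zip(*np.nonzero(occupied)):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < bins and 0 <= nc < bins
                            and occupied[nr, nc] and labels[nr, nc] == -1):
                        labels[nr, nc] = current
                        queue.append((nr, nc))
        current += 1

    # Each point inherits the label of the pixel it was drawn into.
    return labels[cells[:, 0], cells[:, 1]]


# Two well-separated synthetic blobs.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(-10.0, -10.0), scale=0.5, size=(500, 2))
blob_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(500, 2))
point_labels = grid_cluster(np.vstack([blob_a, blob_b]), bins=16)
```

The sketch only shows the overall shape of the pipeline (points to image, image to segments, segments back to points). A flood fill merges any clusters that touch on the grid, which is the kind of case a learned segmentation model is meant to handle better.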

For more details, see the accompanying paper, "Clustering Plotted Data by Image Segmentation" (arXiv preprint). If you use this work, please cite it as follows.

@article{naous2021clustering,
  title={Clustering Plotted Data by Image Segmentation},
  author={Naous, Tarek and Sarkar, Srinjay and Abid, Abubakar and Zou, James},
  journal={arXiv preprint arXiv:2110.05187},
  year={2021}
}

Installation

pip install visual-clustering

Usage

The algorithm can be used in the same way as the classical clustering algorithms in scikit-learn. First, import the VisualClustering class and create an instance of it:

from visual_clustering import VisualClustering

model = VisualClustering(median_filter_size=1, max_filter_size=1)

The parameters median_filter_size and max_filter_size are both set to 1 by default.
You can experiment with different values to see what works best for your dataset.

Let's create a simple synthetic dataset of blobs.

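import matplotlib.pyplot as plt
from sklearn import datasets

# make_blobs returns (points, true_labels); only the points in data[0] are used below
data = datasets.make_blobs(n_samples=50000, centers=6, random_state=23, center_box=(-30, 30))
plt.scatter(data[0][:, 0], data[0][:, 1], s=1, c='black')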

To cluster the dataset, use the fit function of the model:

predictions = model.fit(data[0])
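Before plotting, it can help to check how many clusters were found and how large each one is. The sketch below assumes that predictions is a one-dimensional NumPy array with one numeric label per point, which is what the plotting code in the next section implies. The array used here is a hand-written stand-in, not real output from the model.

```python
import numpy as np

# Stand-in for the array returned by model.fit(...); values are illustrative only.
predictions = np.array([0, 0, 1, 2, 1, 0, 2, 2, 2])

# Distinct labels and the number of points assigned to each.
labels, counts = np.unique(predictions, return_counts=True)
cluster_sizes = dict(zip(labels.tolist(), counts.tolist()))

for label, size in cluster_sizes.items():
    print(f"cluster {label}: {size} points")
```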

Visualizing the results

You can visualize the results using matplotlib as you would normally do with classical clustering algorithms:

import matplotlib.pyplot as plt
from itertools import cycle, islice
import numpy as np

colors = np.array(list(islice(cycle(["#000000", '#377eb8', '#ff7f00', '#4daf4a', '#f781bf', '#a65628', '#984ea3']), int(max(predictions) + 1))))
#Black color for outliers (if any)
colors = np.append(colors, ["#000000"])
plt.scatter(data[0][:, 0], data[0][:, 1], s=10, color=colors[predictions.astype('int8')])
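The color list above is built by cycling through a fixed palette until there is one entry per cluster, then appending black at the end. The small sketch below isolates that step with a shorter palette. The helper name colors_for is hypothetical, and the idea that outliers carry the label -1 is an assumption: the source only says that black is reserved for outliers, not which label they receive.

```python
from itertools import cycle, islice


def colors_for(n_clusters, palette):
    """Return n_clusters colors, repeating the palette when it runs out."""
    return list(islice(cycle(palette), n_clusters))


palette = ["#377eb8", "#ff7f00", "#4daf4a"]

# Five clusters but only three palette entries: the palette wraps around.
colors = colors_for(5, palette)

# Appending black puts it at index -1, so a label of -1 (assumed here to
# mark outliers) would select it through negative indexing.
colors.append("#000000")
outlier_color = colors[-1]
```

One consequence of wrapping is that two clusters can share a color when there are more clusters than palette entries, so a longer palette is worth considering for datasets with many clusters.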

The complete Visual Clustering example is also available as a Colab notebook:
https://colab.research.google.com/drive/1DcZXhKnUpz1GDoGaJmpS6VVNXVuaRmE5?usp=sharing

Dependencies

Make sure that you have the following libraries installed (cv2 and skimage are the import names of the opencv-python and scikit-image packages):

transformers 4.15.0
scipy 1.4.1
tensorflow 2.7.0
keras 2.7.0
numpy 1.19.5
cv2 4.1.2
skimage 0.18.3
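One way to compare an environment against these versions is to query the installed package metadata with the standard library. This is a minimal sketch, not part of the package: the PINNED mapping and the installed_versions helper are hypothetical names, and cv2 and skimage are looked up under their PyPI distribution names, opencv-python and scikit-image.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the dependency list; keys are PyPI distribution names.
PINNED = {
    "transformers": "4.15.0",
    "scipy": "1.4.1",
    "tensorflow": "2.7.0",
    "keras": "2.7.0",
    "numpy": "1.19.5",
    "opencv-python": "4.1.2",   # provides the cv2 module
    "scikit-image": "0.18.3",   # provides the skimage module
}


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


report = installed_versions(PINNED)
for name, wanted in PINNED.items():
    print(f"{name}: listed {wanted}, installed {report[name]}")
```

The script only reports what it finds. It does not install or change anything, and a different installed version is not necessarily a problem.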

Contact

Owner

Tarek Naous
