Parallel t-SNE implementation with Python and Torch wrappers.

Overview

Multicore t-SNE

This is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten with Python and Torch CFFI-based wrappers. This code also runs faster than sklearn.manifold.TSNE on 1 core.

What to expect

Barnes-Hut t-SNE is done in two steps.

  • First step: an efficient data structure for nearest-neighbour search is built and used to compute probabilities. This can be done in parallel for each point in the dataset, which is why we can expect a good speed-up from using more cores.

  • Second step: the embedding is optimized using gradient descent. This part is essentially sequential, so we can only parallelize within an iteration. Some parts can be parallelized effectively, but not all of them are parallelized for now. That is why the second-step speed-up is not as significant as the first-step speed-up, but there is still room for improvement.
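
To illustrate why the first step parallelizes per point, here is a minimal Python sketch. It is not the library's implementation: brute-force distances stand in for the nearest-neighbour data structure, a thread pool stands in for the native multicore code, and the names `row_probabilities`, `neighbour_probabilities` and `step_one`, as well as the choice of 3 * perplexity neighbours, are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def row_probabilities(sq_dists, perplexity, tol=1e-5, max_iter=200):
    """Binary-search the Gaussian precision so the row entropy equals log(perplexity)."""
    target = np.log(perplexity)
    beta, lo, hi = 1.0, 0.0, np.inf
    d = sq_dists - sq_dists.min()  # shifting does not change the normalized row
    for _ in range(max_iter):
        p = np.exp(-beta * d)
        s = p.sum()
        entropy = np.log(s) + beta * np.dot(d, p) / s
        if abs(entropy - target) < tol:
            break
        if entropy > target:  # row too flat: increase the precision
            lo = beta
            beta = beta * 2 if hi == np.inf else (beta + hi) / 2
        else:  # row too peaked: decrease the precision
            hi = beta
            beta = (beta + lo) / 2
    return p / s


def neighbour_probabilities(X, i, k, perplexity):
    """Work for one point; it reads X but shares no state with other points."""
    sq = np.sum((X - X[i]) ** 2, axis=1)
    sq[i] = np.inf  # exclude the point itself
    idx = np.argpartition(sq, k)[:k]  # indices of the k nearest neighbours
    return idx, row_probabilities(sq[idx], perplexity)


def step_one(X, perplexity=30.0, n_jobs=4):
    """Compute neighbour indices and probabilities for every point, one task per point."""
    k = min(len(X) - 1, int(3 * perplexity))  # assumed neighbourhood size
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(
            lambda i: neighbour_probabilities(X, i, k, perplexity),
            range(len(X)),
        ))
```

Each call to `neighbour_probabilities` depends only on the input data and the point index, so the tasks can be distributed across cores without coordination. The gradient-descent step has no such independence across iterations.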

So when can you benefit from parallelization? To a good approximation, the computation time of the second step is independent of D (the number of input dimensions) and depends mostly on N (the number of points). The time of the first step depends strongly on D, so for small D time(Step 1) << time(Step 2), and for large D time(Step 1) >> time(Step 2). Since only step 1 parallelizes well, the benefit is greatest when D is large enough (MNIST's D = 784 is large; D = 10 is not, even for N = 1000000). I originally wrote this multicore modification for the Springleaf competition, where my data table was about 300000 x 3000 and only a few days were left until the end of the competition, so any speed-up was handy.

Benchmark

1 core

Interestingly, this code also beats other implementations on a single core. We compare it to sklearn (Barnes-Hut, of course), L. Van der Maaten's bhtsne, and the py_bh_tsne repo (a Cython wrapper for bhtsne with QuadTree). perplexity = 30 and theta = 0.5 for every run. In fact, the py_bh_tsne repo runs at the same speed as this code when more compiler optimization flags are used.

This is a benchmark for 70000x784 MNIST data:

Method                    Step 1 (sec)   Step 2 (sec)
MulticoreTSNE(n_jobs=1)   912            350
bhtsne                    4257           1233
py_bh_tsne                1232           367
sklearn(0.18)             ~5400          ~20920

I did my best to find out what is wrong with the sklearn numbers, but this is the best benchmark I could produce (you can find the test script in the python/tests folder).

Multicore

This table shows the speed-up relative to 1 core when using n cores.

n_jobs   Step 1   Step 2
1        1x       1x
2        1.54x    1.05x
4        2.6x     1.2x
8        5.6x     1.65x
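
The two tables can be combined into a rough end-to-end estimate. The sketch below assumes that the per-step speed-ups apply to the single-core MNIST timings reported earlier, which the benchmark does not state explicitly. The function names are made up for the example.

```python
# Single-core timings in seconds, taken from the MNIST benchmark table.
STEP1_SEC = 912.0
STEP2_SEC = 350.0

# Per-step speed-ups from the multicore table, keyed by n_jobs.
SPEEDUP = {1: (1.0, 1.0), 2: (1.54, 1.05), 4: (2.6, 1.2), 8: (5.6, 1.65)}


def estimated_total(n_jobs):
    """Estimated wall-clock time when each step is sped up independently."""
    s1, s2 = SPEEDUP[n_jobs]
    return STEP1_SEC / s1 + STEP2_SEC / s2


def overall_speedup(n_jobs):
    """Estimated end-to-end speed-up relative to one core."""
    return estimated_total(1) / estimated_total(n_jobs)


for n in sorted(SPEEDUP):
    print(f"n_jobs={n}: ~{estimated_total(n):.0f} sec, {overall_speedup(n):.2f}x overall")
```

Under this assumption the overall speed-up at 8 cores is about 3.4x, well below the 5.6x of step 1 alone, because step 2 accounts for a growing share of the total time as step 1 shrinks.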

How to use

Python and Torch wrappers are available.

Python

Install

Directly from PyPI

pip install MulticoreTSNE

From source

Make sure cmake is installed on your system, and you will also need a sensible C++ compiler, such as gcc or llvm-clang. On macOS, you can get both via homebrew.

To install the package, please do:

git clone https://github.com/DmitryUlyanov/Multicore-TSNE.git
cd Multicore-TSNE/
pip install .

Tested with Python 2.7 and 3.6 (conda) on Ubuntu 14.04.

Run

You can use it as a near drop-in replacement for sklearn.manifold.TSNE.

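from MulticoreTSNE import MulticoreTSNE as TSNE

# X: array of shape (n_samples, n_features)
tsne = TSNE(n_jobs=4)        # n_jobs: number of cores to use
Y = tsne.fit_transform(X)    # Y: 2-D embedding, one row per sample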

Please refer to the sklearn.manifold.TSNE documentation for an explanation of the parameters.

This implementation only supports n_components=2, which is the most common case (use Barnes-Hut t-SNE or sklearn otherwise). Also note that some parameters exist only for compatibility with sklearn and are otherwise ignored. See the MulticoreTSNE class docstring for more info.
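
The Torch section and the future-work list say that only double precision is supported for now. Assuming the same holds for the Python wrapper (this README does not confirm it), a small helper can normalize the input before fitting. `as_double_matrix` is a hypothetical helper for this sketch, not part of MulticoreTSNE.

```python
import numpy as np


def as_double_matrix(X):
    """Return X as a C-contiguous 2-D float64 array (hypothetical helper)."""
    X = np.ascontiguousarray(X, dtype=np.float64)
    if X.ndim != 2:
        raise ValueError("expected an array of shape (n_samples, n_features)")
    return X


# Usage, assuming MulticoreTSNE is installed:
# from MulticoreTSNE import MulticoreTSNE as TSNE
# Y = TSNE(n_jobs=4).fit_transform(as_double_matrix(X))
```

If the input is already a contiguous float64 array, `np.ascontiguousarray` returns it without copying, so the helper is cheap to call unconditionally.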

MNIST example

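from sklearn.datasets import load_digits
from MulticoreTSNE import MulticoreTSNE as TSNE
from matplotlib import pyplot as plt

# Note: load_digits is scikit-learn's small 8x8 digits dataset, not the full MNIST.
digits = load_digits()
embeddings = TSNE(n_jobs=4).fit_transform(digits.data)
vis_x = embeddings[:, 0]
vis_y = embeddings[:, 1]
# plt.get_cmap replaces plt.cm.get_cmap, which newer matplotlib releases removed.
plt.scatter(vis_x, vis_y, c=digits.target, cmap=plt.get_cmap("jet", 10), marker='.')
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)  # center each of the 10 colour bands on its digit label
plt.show()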

Test

You can test it on the MNIST dataset with the following command:

python MulticoreTSNE/examples/test.py <n_jobs>

Note on Jupyter use

To make the computation log visible in Jupyter, please install wurlitzer (pip install wurlitzer) and execute this line in any cell beforehand:

%load_ext wurlitzer

Memory leaks are possible if you interrupt the process. It should be fine if you let it run to completion.

Torch

To install, execute the following command from the repository folder:

luarocks make torch/tsne-1.0-0.rockspec

or

luarocks install https://raw.githubusercontent.com/DmitryUlyanov/Multicore-TSNE/master/torch/tsne-1.0-0.rockspec

You can run t-SNE like this:

tsne = require 'tsne'

Y = tsne(X, n_components, perplexity, n_iter, angle, n_jobs)

Only the torch.DoubleTensor type is supported for now.

License

Inherited from the original repo's license.

Future work

  • Allow other types than double
  • Improve step 2 performance (possible)

Citation

Please cite this repository if it was useful for your research:

@misc{Ulyanov2016,
  author = {Ulyanov, Dmitry},
  title = {Multicore-TSNE},
  year = {2016},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/DmitryUlyanov/Multicore-TSNE}},
}

Of course, do not forget to cite L. Van der Maaten's paper.
