TriMap: Large-scale Dimensionality Reduction Using Triplets

Related tags

Deep Learningtrimap
Overview

TriMap

TriMap is a dimensionality reduction method that uses triplet constraints to form a low-dimensional embedding of a set of points. The triplet constraints are of the form "point i is closer to point j than point k". The triplets are sampled from the high-dimensional representation of the points and a weighting scheme is used to reflect the importance of each triplet.

TriMap provides a significantly better global view of the data than the other dimensionality reduction methods such t-SNE, LargeVis, and UMAP. The global structure includes relative distances of the clusters, multiple scales in the data, and the existence of possible outliers. We define a global score to quantify the quality of an embedding in reflecting the global structure of the data.

CIFAR-10 dataset (test set) passed through a CNN (n = 10,000, d = 1024): Notice the semantic structure unveiled by TriMap.

Visualizations of the CIFAR-10 dataset

The following implementation is in Python. Further details and more experimental results are available in the paper.

How to use TriMap

TriMap has a transformer API similar to other sklearn libraries. To use TriMap with the default parameters, simply do:

import trimap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = trimap.TRIMAP().fit_transform(digits.data)

To find the embedding using a precomputed pairwise distance matrix D, pass D as input and set use_dist_matrix to True:

embedding = trimap.TRIMAP(use_dist_matrix=True).fit_transform(D)

You can also pass the precomputed k-nearest neighbors and their corresponding distances as a tuple (knn_nbrs, knn_distances). Note that the rows must be in order, starting from point 0 to n-1. This feature also requires X to compute the embedding

embedding = trimap.TRIMAP(knn_tuple=(knn_nbrs, knn_distances)).fit_transform(X)

To calculate the global score, do:

gs = trimap.TRIMAP(verbose=False).global_score(digits.data, embedding)
print("global score %2.2f" % gs)

Parameters

The list of parameters is given blow:

  • n_dims: Number of dimensions of the embedding (default = 2)
  • n_inliers: Number of nearest neighbors for forming the nearest neighbor triplets (default = 10).
  • n_outliers: Number of outliers for forming the nearest neighbor triplets (default = 5).
  • n_random: Number of random triplets per point (default = 5).
  • distance: Distance measure ('euclidean' (default), 'manhattan', 'angular', 'hamming')
  • weight_adj: The value of gamma for the log-transformation (default = 500.0).
  • lr: Learning rate (default = 1000.0).
  • n_iters: Number of iterations (default = 400).

The other parameters include:

  • knn_tuple: Use the precomputed nearest-neighbors information in form of a tuple (knn_nbrs, knn_distances) (default = None)
  • use_dist_matrix: Use the precomputed pairwise distance matrix (default = False)
  • apply_pca: Reduce the number of dimensions of the data to 100 if necessary before applying the nearest-neighbor search (default = True).
  • opt_method: Optimization method {'sd' (steepest descent), 'momentum' (GD with momentum), 'dbd' (delta-bar-delta, default)}.
  • verbose: Print the progress report (default = True).
  • return_seq: Store the intermediate results and return the results in a tensor (default = False).

An example of adjusting these parameters:

import trimap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = trimap.TRIMAP(n_inliers=20,
                          n_outliers=10,
                          n_random=10,
                          weight_adj=1000.0).fit_transform(digits.data)

The nearest-neighbor calculation is performed using ANNOY.

Examples

The following are some of the results on real-world datasets. The values of nearest-neighbor accuracy and global score are shown as a pair (NN, GS) on top of each figure. For more results, please refer to our paper.

USPS Handwritten Digits (n = 11,000, d = 256)

Visualizations of the USPS dataset

20 News Groups (n = 18,846, d = 100)

Visualizations of the 20 News Groups dataset

Tabula Muris (n = 53,760, d = 23,433)

Visualizations of the Tabula Muris Mouse Tissues dataset

MNIST Handwritten Digits (n = 70,000, d = 784)

Visualizations of the MNIST dataset

Fashion MNIST (n = 70,000, d = 784)

Visualizations of the  Fashion MNIST dataset

TV News (n = 129,685, d = 100)

Visualizations of the  TV News dataset

Runtime of t-SNE, LargeVis, UMAP, and TriMap in the hh:mm:ss format on a single machine with 2.6 GHz Intel Core i5 CPU and 16 GB of memory is given in the following table. We limit the runtime of each method to 12 hours. Also, UMAP runs out of memory on datasets larger than ~4M points.

Runtime of TriMap compared to other methods

Installing

Requirements:

  • numpy
  • scikit-learn
  • numba
  • annoy

Installing annoy

If you are having trouble with installing annoy on macOS using the command:

pip3 install annoy

you can alternatively try:

pip3 install git+https://github.com/sutao/[email protected]

Install Options

If you have all the requirements installed, you can use pip:

sudo pip install trimap

Please regularly check for updates and make sure you are using the most recent version. If you have TriMap installed and would like to upgrade to the newer version, you can use the command:

sudo pip install --upgrade --force-reinstall trimap

An alternative is to install the dependencies manually using anaconda and using pip to install TriMap:

conda install numpy
conda install scikit-learn
conda install numba
conda install annoy
pip install trimap

For a manual install get this package:

wget https://github.com/eamid/trimap/archive/master.zip
unzip master.zip
rm master.zip
cd trimap-master

Install the requirements

sudo pip install -r requirements.txt

or

conda install scikit-learn numba annoy

Install the package

python setup.py install

Support and Contribution

This implementation is still a work in progress. Any comments/suggestions/bug-reports are highly appreciated. Please feel free contact me at: [email protected]. If you would like to contribute to the code, please fork the project and send me a pull request.

Citation

If you use TriMap in your publications, please cite our current reference on arXiv:

@article{2019TRIMAP,
     author = {{Amid}, Ehsan and {Warmuth}, Manfred K.},
     title = "{TriMap: Large-scale Dimensionality Reduction Using Triplets}",
     journal = {arXiv preprint arXiv:1910.00204},
     archivePrefix = "arXiv",
     eprint = {1910.00204},
     year = 2019,
}

License

Please see the LICENSE file.

Owner
Ehsan Amid
Research Scientist at Google Mountain View
Ehsan Amid
PointCloud Annotation Tools, support to label object bound box, ground, lane and kerb

PointCloud Annotation Tools, support to label object bound box, ground, lane and kerb

halo 368 Dec 06, 2022
Automatically measure the facial Width-To-Height ratio and get facial analysis results provided by Microsoft Azure

fwhr-calc-website This project is to automatically measure the facial Width-To-Height ratio and get facial analysis results provided by Microsoft Azur

SoohyunPark 1 Feb 07, 2022
Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation

DynaBOA Code repositoty for the paper: Out-of-Domain Human Mesh Reconstruction via Dynamic Bilevel Online Adaptation Shanyan Guan, Jingwei Xu, Michell

197 Jan 07, 2023
Have you ever wondered how cool it would be to have your own A.I

Have you ever wondered how cool it would be to have your own A.I. assistant Imagine how easier it would be to send emails without typing a single word, doing Wikipedia searches without opening web br

Harsh Gupta 1 Nov 09, 2021
Robust, modular and efficient implementation of advanced Hamiltonian Monte Carlo algorithms

AdvancedHMC.jl AdvancedHMC.jl provides a robust, modular and efficient implementation of advanced HMC algorithms. An illustrative example for Advanced

The Turing Language 167 Jan 01, 2023
A universal framework for learning timestamp-level representations of time series

TS2Vec This repository contains the official implementation for the paper Learning Timestamp-Level Representations for Time Series with Hierarchical C

Zhihan Yue 284 Dec 30, 2022
PyTorch implementation of "Learning to Discover Cross-Domain Relations with Generative Adversarial Networks"

DiscoGAN in PyTorch PyTorch implementation of Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. * All samples in READM

Taehoon Kim 1k Jan 04, 2023
FFCV: Fast Forward Computer Vision (and other ML workloads!)

Fast Forward Computer Vision: train models at a fraction of the cost with accele

FFCV 2.3k Jan 03, 2023
SwinIR: Image Restoration Using Swin Transformer

SwinIR: Image Restoration Using Swin Transformer This repository is the official PyTorch implementation of SwinIR: Image Restoration Using Shifted Win

Jingyun Liang 2.4k Jan 08, 2023
As-ViT: Auto-scaling Vision Transformers without Training

As-ViT: Auto-scaling Vision Transformers without Training [PDF] Wuyang Chen, Wei Huang, Xianzhi Du, Xiaodan Song, Zhangyang Wang, Denny Zhou In ICLR 2

VITA 68 Sep 05, 2022
QueryDet: Cascaded Sparse Query for Accelerating High-Resolution SmallObject Detection

QueryDet-PyTorch This repository is the official implementation of our paper: QueryDet: Cascaded Sparse Query for Accelerating High-Resolution Small O

Chenhongyi Yang 276 Dec 31, 2022
This is an official source code for implementation on Extensive Deep Temporal Point Process

Extensive Deep Temporal Point Process This is an official source code for implementation on Extensive Deep Temporal Point Process, which is composed o

Haitao Lin 8 Aug 15, 2022
RoFormer_pytorch

PyTorch RoFormer 原版Tensorflow权重(https://github.com/ZhuiyiTechnology/roformer) chinese_roformer_L-12_H-768_A-12.zip (提取码:xy9x) 已经转化为PyTorch权重 chinese_r

yujun 283 Dec 12, 2022
Baselines for TrajNet++

TrajNet++ : The Trajectory Forecasting Framework PyTorch implementation of Human Trajectory Forecasting in Crowds: A Deep Learning Perspective TrajNet

VITA lab at EPFL 183 Jan 05, 2023
'Solving the sampling problem of the Sycamore quantum supremacy circuits

solve_sycamore This repo contains data, contraction code, and contraction order for the paper ''Solving the sampling problem of the Sycamore quantum s

Feng Pan 29 Nov 28, 2022
Autoregressive Models in PyTorch.

Autoregressive This repository contains all the necessary PyTorch code, tailored to my presentation, to train and generate data from WaveNet-like auto

Christoph Heindl 41 Oct 09, 2022
Package for extracting emotions from social media text. Tailored for financial data.

EmTract: Extracting Emotions from Social Media Text Tailored for Financial Contexts EmTract is a tool that extracts emotions from social media text. I

13 Nov 17, 2022
Unpaired Caricature Generation with Multiple Exaggerations

CariMe-pytorch The official pytorch implementation of the paper "CariMe: Unpaired Caricature Generation with Multiple Exaggerations" CariMe: Unpaired

Gu Zheng 37 Dec 30, 2022
OpenFed: A Comprehensive and Versatile Open-Source Federated Learning Framework

OpenFed: A Comprehensive and Versatile Open-Source Federated Learning Framework Introduction OpenFed is a foundational library for federated learning

25 Dec 12, 2022
NVIDIA Merlin is an open source library providing end-to-end GPU-accelerated recommender systems, from feature engineering and preprocessing to training deep learning models and running inference in production.

NVIDIA Merlin NVIDIA Merlin is an open source library designed to accelerate recommender systems on NVIDIA’s GPUs. It enables data scientists, machine

419 Jan 03, 2023