TriMap: Large-scale Dimensionality Reduction Using Triplets

Last update: Dec 24, 2022

Related tags

Overview

TriMap

TriMap is a dimensionality reduction method that uses triplet constraints to form a low-dimensional embedding of a set of points. The triplet constraints are of the form "point i is closer to point j than point k". The triplets are sampled from the high-dimensional representation of the points and a weighting scheme is used to reflect the importance of each triplet.

TriMap provides a significantly better global view of the data than the other dimensionality reduction methods such t-SNE, LargeVis, and UMAP. The global structure includes relative distances of the clusters, multiple scales in the data, and the existence of possible outliers. We define a global score to quantify the quality of an embedding in reflecting the global structure of the data.

CIFAR-10 dataset (test set) passed through a CNN (n = 10,000, d = 1024): Notice the semantic structure unveiled by TriMap.

The following implementation is in Python. Further details and more experimental results are available in the paper.

How to use TriMap

TriMap has a transformer API similar to other sklearn libraries. To use TriMap with the default parameters, simply do:

import trimap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = trimap.TRIMAP().fit_transform(digits.data)

To find the embedding using a precomputed pairwise distance matrix D, pass D as input and set use_dist_matrix to True:

embedding = trimap.TRIMAP(use_dist_matrix=True).fit_transform(D)

You can also pass the precomputed k-nearest neighbors and their corresponding distances as a tuple (knn_nbrs, knn_distances). Note that the rows must be in order, starting from point 0 to n-1. This feature also requires X to compute the embedding

embedding = trimap.TRIMAP(knn_tuple=(knn_nbrs, knn_distances)).fit_transform(X)

To calculate the global score, do:

gs = trimap.TRIMAP(verbose=False).global_score(digits.data, embedding)
print("global score %2.2f" % gs)

Parameters

The list of parameters is given blow:

n_dims: Number of dimensions of the embedding (default = 2)

n_inliers: Number of nearest neighbors for forming the nearest neighbor triplets (default = 10).

n_outliers: Number of outliers for forming the nearest neighbor triplets (default = 5).

n_random: Number of random triplets per point (default = 5).

distance: Distance measure ('euclidean' (default), 'manhattan', 'angular', 'hamming')

weight_adj: The value of gamma for the log-transformation (default = 500.0).

lr: Learning rate (default = 1000.0).

n_iters: Number of iterations (default = 400).

The other parameters include:

knn_tuple: Use the precomputed nearest-neighbors information in form of a tuple (knn_nbrs, knn_distances) (default = None)

use_dist_matrix: Use the precomputed pairwise distance matrix (default = False)

apply_pca: Reduce the number of dimensions of the data to 100 if necessary before applying the nearest-neighbor search (default = True).

opt_method: Optimization method {'sd' (steepest descent), 'momentum' (GD with momentum), 'dbd' (delta-bar-delta, default)}.

verbose: Print the progress report (default = True).

return_seq: Store the intermediate results and return the results in a tensor (default = False).

An example of adjusting these parameters:

import trimap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = trimap.TRIMAP(n_inliers=20,
                          n_outliers=10,
                          n_random=10,
                          weight_adj=1000.0).fit_transform(digits.data)

The nearest-neighbor calculation is performed using ANNOY.

Examples

The following are some of the results on real-world datasets. The values of nearest-neighbor accuracy and global score are shown as a pair (NN, GS) on top of each figure. For more results, please refer to our paper.

USPS Handwritten Digits (n = 11,000, d = 256)

20 News Groups (n = 18,846, d = 100)

Tabula Muris (n = 53,760, d = 23,433)

MNIST Handwritten Digits (n = 70,000, d = 784)

Fashion MNIST (n = 70,000, d = 784)

TV News (n = 129,685, d = 100)

Runtime of t-SNE, LargeVis, UMAP, and TriMap in the hh:mm:ss format on a single machine with 2.6 GHz Intel Core i5 CPU and 16 GB of memory is given in the following table. We limit the runtime of each method to 12 hours. Also, UMAP runs out of memory on datasets larger than ~4M points.

Installing

Requirements:

numpy
scikit-learn
numba
annoy

Installing annoy

If you are having trouble with installing annoy on macOS using the command:

pip3 install annoy

you can alternatively try:

pip3 install git+https://github.com/sutao/[email protected]

Install Options

If you have all the requirements installed, you can use pip:

sudo pip install trimap

Please regularly check for updates and make sure you are using the most recent version. If you have TriMap installed and would like to upgrade to the newer version, you can use the command:

sudo pip install --upgrade --force-reinstall trimap

An alternative is to install the dependencies manually using anaconda and using pip to install TriMap:

conda install numpy
conda install scikit-learn
conda install numba
conda install annoy
pip install trimap

For a manual install get this package:

wget https://github.com/eamid/trimap/archive/master.zip
unzip master.zip
rm master.zip
cd trimap-master

Install the requirements

sudo pip install -r requirements.txt

conda install scikit-learn numba annoy

Install the package

python setup.py install

Support and Contribution

This implementation is still a work in progress. Any comments/suggestions/bug-reports are highly appreciated. Please feel free contact me at: [email protected]. If you would like to contribute to the code, please fork the project and send me a pull request.

Citation

If you use TriMap in your publications, please cite our current reference on arXiv:

@article{2019TRIMAP,
     author = {{Amid}, Ehsan and {Warmuth}, Manfred K.},
     title = "{TriMap: Large-scale Dimensionality Reduction Using Triplets}",
     journal = {arXiv preprint arXiv:1910.00204},
     archivePrefix = "arXiv",
     eprint = {1910.00204},
     year = 2019,
}

License

Please see the LICENSE file.

TriMap: Large-scale Dimensionality Reduction Using Triplets

Related tags

Overview

TriMap

How to use TriMap

Parameters

Examples

Installing

Support and Contribution

Citation

License

Owner

Ehsan Amid

Companion repo of the UCC 2021 paper "Predictive Auto-scaling with OpenStack Monasca"

基于Paddle框架的arcface复现

[CVPR 2022] Deep Equilibrium Optical Flow Estimation

Codes of paper "Unseen Object Amodal Instance Segmentation via Hierarchical Occlusion Modeling"

Official PyTorch implementation of Segmenter: Transformer for Semantic Segmentation

code for our paper "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer"

Encode and decode text application

Udacity's CS101: Intro to Computer Science - Building a Search Engine

Project of 'TBEFN: A Two-branch Exposure-fusion Network for Low-light Image Enhancement '

Code for paper: "Spinning Language Models for Propaganda-As-A-Service"

Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

CUda Matrix Multiply library.

Pytorch Implementation of LNSNet for Superpixel Segmentation

This is a model to classify Vietnamese sign language using Motion history image (MHI) algorithm and CNN.

[Official] Exploring Temporal Coherence for More General Video Face Forgery Detection(ICCV 2021)

PaddleRobotics is an open-source algorithm library for robots based on Paddle, including open-source parts such as human-robot interaction, complex motion control, environment perception, SLAM positioning, and navigation.

Adversarial Graph Augmentation to Improve Graph Contrastive Learning

Learn other languages using artificial intelligence with python.

Code for 2021 NeurIPS --- Towards Multi-Grained Explainability for Graph Neural Networks

Minimal But Practical Image Classifier Pipline Using Pytorch, Finetune on ResNet18, Got 99% Accuracy on Own Small Datasets.

TriMap: Large-scale Dimensionality Reduction Using Triplets

Related tags

Overview

TriMap

How to use TriMap

Parameters

Examples

Installing

Support and Contribution

Citation

License

Owner

Ehsan Amid

Companion repo of the UCC 2021 paper "Predictive Auto-scaling with OpenStack Monasca"

基于Paddle框架的arcface复现

[CVPR 2022] Deep Equilibrium Optical Flow Estimation

Codes of paper "Unseen Object Amodal Instance Segmentation via Hierarchical Occlusion Modeling"

Official PyTorch implementation of Segmenter: Transformer for Semantic Segmentation

code for our paper "Source Data-absent Unsupervised Domain Adaptation through Hypothesis Transfer and Labeling Transfer"

Encode and decode text application

Udacity's CS101: Intro to Computer Science - Building a Search Engine

Project of 'TBEFN: A Two-branch Exposure-fusion Network for Low-light Image Enhancement '

Code for paper: "Spinning Language Models for Propaganda-As-A-Service"

Bottom-up attention model for image captioning and VQA, based on Faster R-CNN and Visual Genome

CUda Matrix Multiply library.

Pytorch Implementation of LNSNet for Superpixel Segmentation

This is a model to classify Vietnamese sign language using Motion history image (MHI) algorithm and CNN.

[Official] Exploring Temporal Coherence for More General Video Face Forgery Detection(ICCV 2021)

PaddleRobotics is an open-source algorithm library for robots based on Paddle, including open-source parts such as human-robot interaction, complex motion control, environment perception, SLAM positioning, and navigation.

Adversarial Graph Augmentation to Improve Graph Contrastive Learning

Learn other languages ​​using artificial intelligence with python.

Code for 2021 NeurIPS --- Towards Multi-Grained Explainability for Graph Neural Networks

Minimal But Practical Image Classifier Pipline Using Pytorch, Finetune on ResNet18, Got 99% Accuracy on Own Small Datasets.

Learn other languages using artificial intelligence with python.