Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, Amr Ahmed. KDD 2019.

Overview

gHHC

Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, Amr Ahmed. KDD 2019.

Setup

In each shell session, run:

source bin/setup.sh

to set environment variables.

Install jq (if not already installed): https://stedolan.github.io/jq/

Install maven (if not already installed):

sh bin/install_mvn.sh

Install python dependencies:

conda create -n env_ghhc pip python=3.6
source activate env_ghhc
# Either (linux)
wget https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl
pip install tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl
# or (mac)
wget https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.12.0-py3-none-any.whl
pip install tensorflow-1.12.0-py3-none-any.whl
conda install scikit-learn
conda install tensorflow-base=1.13.1

See env.yml for a complete list of dependencies if you run into issues with the above.

Build scala code:

mvn clean package

Note you may need to set JAVA_HOME and JAVA_HOME_8 on your system.

ALOI and Glass are downloadable from: https://github.com/iesl/xcluster

Covtype is available here: https://archive.ics.uci.edu/ml/datasets/covertype

Contact me regarding the ImageNet data.

Clustering Experiments

Step 1. Building triples for inference

Sample triples of datapoints that will be used for inference:

On a compute machine:

sh bin/sample_triples.sh config/glass/build_samples.json

Using slurm cluster manager:

sh bin/launch_samples.sh config/glass/build_samples.json <partition-name-here>

Note the above example is for the glass dataset, but the same procedure and scripts are available for all datasets.

Step 2. Run Inference

Update the representations of the internal nodes of the tree structure.

On a compute machine:

sh bin/run_inf.sh config/glass/glass.json

Using slurm cluster manager:

sh bin/launch_inf.sh config/glass/glass.json <partition-name-here>

This will create a directory in exp_out/dataset_name/ghhc/timestamp containing the internal node parameters and configs to run the next step. For example, this would create the following:

exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn

Step 3. Final clustering

Produce assignment of datapoints in the hierarchical clustering and produce internal structure.

For datasets other than ImageNet:

On a compute machine:

# Generally:
sh bin/run_predict_only.sh exp_out/data/ghhc/timestap/config.json data/datasetname/data_to_run_on.tsv

# For example:
sh bin/run_predict_only.sh exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn/config.json data/glass/glass.tsv

Using slurm cluster manager:

sh bin/launch_predict_only.sh exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn/config.json data/glass/glass.tsv <partition-name>

This will create a file: exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn/results/tree.tsv which can be evaluated using

sh bin/score_tree.sh exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn/results/tree.tsv

When evaluating the tree for covtype, use the expected dendrogram purity point id file from the data directory:

sh bin/score_tree.sh /path/to/tree.tsv ghhc covtype $num_threads data/covtype.evalpts5k

For ImageNet:

 sh bin/launch_predict_only_imagenet.sh exp_out/ilsvrc/ghhc/2019-11-29-08-04-23-alg_name=ghhc-init_method=randhac-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=100-struct_prior=pcn/config.json data/ilsvrc/ilsvrc12.tsv.1 cpu 32000

This assumes that the ImageNet data file has been split into 13 files:

data/ilsvrc/ilsvrc12.tsv.1.split_aa
data/ilsvrc/ilsvrc12.tsv.1.split_ab
...
data/ilsvrc/ilsvrc12.tsv.1.split_am

Then when all jobs finish, concatenate results:

sh bin/cat_imagenet_tree.sh exp_out/ilsvrc/ghhc/2019-11-29-08-04-23-alg_name=ghhc-init_method=randhac-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=100-struct_prior=pcn/results/

This will create a file containing the entire tree:

exp_out/ilsvrc/ghhc/2019-11-29-08-04-23-alg_name=ghhc-init_method=randhac-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=100-struct_prior=pcn/results/tree.tsv

which can be evaluated using:

sh bin/score_tree.sh exp_out/ilsvrc/ghhc/2019-11-29-08-04-23-alg_name=ghhc-init_method=randhac-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=100-struct_prior=pcn/results/tree.tsv ghhc ilsvrc12 $num_threads data/imagenet_eval_pts.ids

Citation

@inproceedings{Monath:2019:GHC:3292500.3330997,
     author = {Monath, Nicholas and Zaheer, Manzil and Silva, Daniel and McCallum, Andrew and Ahmed, Amr},
     title = {Gradient-based Hierarchical Clustering Using Continuous Representations of Trees in Hyperbolic Space},
     booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
     series = {KDD '19},
     year = {2019},
     isbn = {978-1-4503-6201-6},
     location = {Anchorage, AK, USA},
     pages = {714--722},
     numpages = {9},
     url = {http://doi.acm.org/10.1145/3292500.3330997},
     doi = {10.1145/3292500.3330997},
     acmid = {3330997},
     publisher = {ACM},
     address = {New York, NY, USA},
     keywords = {clustering, gradient-based clustering, hierarchical clustering},
}

License

Apache License, Version 2.0

Questions / Comments / Bugs / Issues

Please contact Nicholas Monath ([email protected]).

Also, please contact me for access to the data.

Owner
Nicholas Monath
Nicholas Monath
LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models

LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models. Developers can reproduce these SOTA methods and

TuZheng 405 Jan 04, 2023
A python implementation of Deep-Image-Analogy based on pytorch.

Deep-Image-Analogy This project is a python implementation of Deep Image Analogy.https://arxiv.org/abs/1705.01088. Some results Requirements python 3

Peng Lu 171 Dec 14, 2022
Official Pytorch implementation for AAAI2021 paper (RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning)

RSPNet Official Pytorch implementation for AAAI2021 paper "RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning" [Suppleme

35 Jun 24, 2022
A curated list of automated deep learning (including neural architecture search and hyper-parameter optimization) resources.

Awesome AutoDL A curated list of automated deep learning related resources. Inspired by awesome-deep-vision, awesome-adversarial-machine-learning, awe

D-X-Y 2k Dec 30, 2022
Official public repository of paper "Intention Adaptive Graph Neural Network for Category-Aware Session-Based Recommendation"

Intention Adaptive Graph Neural Network (IAGNN) This is the official repository of paper Intention Adaptive Graph Neural Network for Category-Aware Se

9 Nov 22, 2022
Collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning.

Collection of tasks for fast prototyping, baselining, finetuning and solving problems with deep learning Installation

Pytorch Lightning 1.6k Jan 08, 2023
Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis

Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis Website | ICCV paper | arXiv | Twitter This repository contains the official i

Ajay Jain 73 Dec 27, 2022
An efficient toolkit for Face Stylization based on the paper "AgileGAN: Stylizing Portraits by Inversion-Consistent Transfer Learning"

MMGEN-FaceStylor English | įŽ€äŊ“中文 Introduction This repo is an efficient toolkit for Face Stylization based on the paper "AgileGAN: Stylizing Portraits

OpenMMLab 182 Dec 27, 2022
Examples of how to create colorful, annotated equations in Latex using Tikz.

The file "eqn_annotate.tex" is the main latex file. This repository provides four examples of annotated equations: [example_prob.tex] A simple one ins

SyNeRCyS Research Lab 3.2k Jan 05, 2023
An open-source Kazakh named entity recognition dataset (KazNERD), annotation guidelines, and baseline NER models.

Kazakh Named Entity Recognition This repository contains an open-source Kazakh named entity recognition dataset (KazNERD), named entity annotation gui

ISSAI 9 Dec 23, 2022
Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP

Wav2CLIP 🚧 WIP 🚧 Official implementation of the paper WAV2CLIP: LEARNING ROBUST AUDIO REPRESENTATIONS FROM CLIP 📄 🔗 Ho-Hsiang Wu, Prem Seetharaman

Descript 240 Dec 13, 2022
Neural network for stock price prediction

neural_network_for_stock_price_prediction Neural networks for stock price predic

2 Feb 04, 2022
Image Super-Resolution Using Very Deep Residual Channel Attention Networks

Image Super-Resolution Using Very Deep Residual Channel Attention Networks

kongdebug 14 Oct 14, 2022
A PyTorch implementation of "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?"

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? Source: Improving Vision Transformer Efficiency and Accuracy by Learning to Tokenize

Caiyong Wang 14 Sep 20, 2022
Allows including an action inside another action (by preprocessing the Yaml file). This is how composite actions should have worked.

actions-includes Allows including an action inside another action (by preprocessing the Yaml file). Instead of using uses or run in your action step,

Tim Ansell 70 Nov 04, 2022
Pseudo-rng-app - whos needs science to make a random number when you have pseudoscience?

Pseudo-random numbers with pseudoscience rng is so complicated! Why cant we have a horoscopic, vibe-y way of calculating a random number? Why cant rng

Andrew Blance 1 Dec 27, 2021
DvD-TD3: Diversity via Determinants for TD3 version

DvD-TD3: Diversity via Determinants for TD3 version The implementation of paper Effective Diversity in Population Based Reinforcement Learning. Instal

3 Feb 11, 2022
deep learning model with only python and numpy with test accuracy 99 % on mnist dataset and different optimization choices

deep_nn_model_with_only_python_100%_test_accuracy deep learning model with only python and numpy with test accuracy 99 % on mnist dataset and differen

0 Aug 28, 2022
Character-Input - Create a program that asks the user to enter their name and their age

Character-Input Create a program that asks the user to enter their name and thei

PyLaboratory 0 Feb 06, 2022
This repository contains the implementation of the HealthGen model, a generative model to synthesize realistic EHR time series data with missingness

HealthGen: Conditional EHR Time Series Generation This repository contains the implementation of the HealthGen model, a generative model to synthesize

0 Jan 20, 2022