Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, Amr Ahmed. KDD 2019.

Overview

gHHC

Code for: Gradient-based Hierarchical Clustering using Continuous Representations of Trees in Hyperbolic Space. Nicholas Monath, Manzil Zaheer, Daniel Silva, Andrew McCallum, Amr Ahmed. KDD 2019.

Setup

In each shell session, run:

source bin/setup.sh

to set environment variables.

Install jq (if not already installed): https://stedolan.github.io/jq/

Install maven (if not already installed):

sh bin/install_mvn.sh

Install python dependencies:

conda create -n env_ghhc pip python=3.6
source activate env_ghhc
# Either (linux)
wget https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl
pip install tensorflow-1.12.0-cp36-cp36m-linux_x86_64.whl
# or (mac)
wget https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.12.0-py3-none-any.whl
pip install tensorflow-1.12.0-py3-none-any.whl
conda install scikit-learn
conda install tensorflow-base=1.13.1

See env.yml for a complete list of dependencies if you run into issues with the above.

Build scala code:

mvn clean package

Note you may need to set JAVA_HOME and JAVA_HOME_8 on your system.

ALOI and Glass are downloadable from: https://github.com/iesl/xcluster

Covtype is available here: https://archive.ics.uci.edu/ml/datasets/covertype

Contact me regarding the ImageNet data.

Clustering Experiments

Step 1. Building triples for inference

Sample triples of datapoints that will be used for inference:

On a compute machine:

sh bin/sample_triples.sh config/glass/build_samples.json

Using slurm cluster manager:

sh bin/launch_samples.sh config/glass/build_samples.json <partition-name-here>

Note the above example is for the glass dataset, but the same procedure and scripts are available for all datasets.

Step 2. Run Inference

Update the representations of the internal nodes of the tree structure.

On a compute machine:

sh bin/run_inf.sh config/glass/glass.json

Using slurm cluster manager:

sh bin/launch_inf.sh config/glass/glass.json <partition-name-here>

This will create a directory in exp_out/dataset_name/ghhc/timestamp containing the internal node parameters and configs to run the next step. For example, this would create the following:

exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn

Step 3. Final clustering

Produce assignment of datapoints in the hierarchical clustering and produce internal structure.

For datasets other than ImageNet:

On a compute machine:

# Generally:
sh bin/run_predict_only.sh exp_out/data/ghhc/timestap/config.json data/datasetname/data_to_run_on.tsv

# For example:
sh bin/run_predict_only.sh exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn/config.json data/glass/glass.tsv

Using slurm cluster manager:

sh bin/launch_predict_only.sh exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn/config.json data/glass/glass.tsv <partition-name>

This will create a file: exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn/results/tree.tsv which can be evaluated using

sh bin/score_tree.sh exp_out/glass/ghhc/2019-11-29-20-13-29-alg_name=ghhc-init_method=randompts-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=500-struct_prior=pcn/results/tree.tsv

When evaluating the tree for covtype, use the expected dendrogram purity point id file from the data directory:

sh bin/score_tree.sh /path/to/tree.tsv ghhc covtype $num_threads data/covtype.evalpts5k

For ImageNet:

 sh bin/launch_predict_only_imagenet.sh exp_out/ilsvrc/ghhc/2019-11-29-08-04-23-alg_name=ghhc-init_method=randhac-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=100-struct_prior=pcn/config.json data/ilsvrc/ilsvrc12.tsv.1 cpu 32000

This assumes that the ImageNet data file has been split into 13 files:

data/ilsvrc/ilsvrc12.tsv.1.split_aa
data/ilsvrc/ilsvrc12.tsv.1.split_ab
...
data/ilsvrc/ilsvrc12.tsv.1.split_am

Then when all jobs finish, concatenate results:

sh bin/cat_imagenet_tree.sh exp_out/ilsvrc/ghhc/2019-11-29-08-04-23-alg_name=ghhc-init_method=randhac-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=100-struct_prior=pcn/results/

This will create a file containing the entire tree:

exp_out/ilsvrc/ghhc/2019-11-29-08-04-23-alg_name=ghhc-init_method=randhac-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=100-struct_prior=pcn/results/tree.tsv

which can be evaluated using:

sh bin/score_tree.sh exp_out/ilsvrc/ghhc/2019-11-29-08-04-23-alg_name=ghhc-init_method=randhac-tree_learning_rate=0.01-loss=sigmoid-lca_type=conditional-num_samples=50000-batch_size=100-struct_prior=pcn/results/tree.tsv ghhc ilsvrc12 $num_threads data/imagenet_eval_pts.ids

Citation

@inproceedings{Monath:2019:GHC:3292500.3330997,
     author = {Monath, Nicholas and Zaheer, Manzil and Silva, Daniel and McCallum, Andrew and Ahmed, Amr},
     title = {Gradient-based Hierarchical Clustering Using Continuous Representations of Trees in Hyperbolic Space},
     booktitle = {Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
     series = {KDD '19},
     year = {2019},
     isbn = {978-1-4503-6201-6},
     location = {Anchorage, AK, USA},
     pages = {714--722},
     numpages = {9},
     url = {http://doi.acm.org/10.1145/3292500.3330997},
     doi = {10.1145/3292500.3330997},
     acmid = {3330997},
     publisher = {ACM},
     address = {New York, NY, USA},
     keywords = {clustering, gradient-based clustering, hierarchical clustering},
}

License

Apache License, Version 2.0

Questions / Comments / Bugs / Issues

Please contact Nicholas Monath ([email protected]).

Also, please contact me for access to the data.

Owner
Nicholas Monath
Nicholas Monath
Code implementation of Data Efficient Stagewise Knowledge Distillation paper.

Data Efficient Stagewise Knowledge Distillation Table of Contents Data Efficient Stagewise Knowledge Distillation Table of Contents Requirements Image

IvLabs 112 Dec 02, 2022
State of the Art Neural Networks for Deep Learning

pyradox This python library helps you with implementing various state of the art neural networks in a totally customizable fashion using Tensorflow 2

Ritvik Rastogi 60 May 29, 2022
A Pytorch implementation of CVPR 2021 paper "RSG: A Simple but Effective Module for Learning Imbalanced Datasets"

RSG: A Simple but Effective Module for Learning Imbalanced Datasets (CVPR 2021) A Pytorch implementation of our CVPR 2021 paper "RSG: A Simple but Eff

120 Dec 12, 2022
TensorFlow 101: Introduction to Deep Learning for Python Within TensorFlow

TensorFlow 101: Introduction to Deep Learning I have worked all my life in Machine Learning, and I've never seen one algorithm knock over its benchmar

Sefik Ilkin Serengil 896 Jan 04, 2023
Beancount-mercury - Beancount importer for Mercury Startup Checking

beancount-mercury beancount-mercury provides an Importer for converting CSV expo

Michael Lynch 4 Oct 31, 2022
Implementation of "Selection via Proxy: Efficient Data Selection for Deep Learning" from ICLR 2020.

Selection via Proxy: Efficient Data Selection for Deep Learning This repository contains a refactored implementation of "Selection via Proxy: Efficien

Stanford Future Data Systems 70 Nov 16, 2022
CM-NAS: Cross-Modality Neural Architecture Search for Visible-Infrared Person Re-Identification (ICCV2021)

CM-NAS Official Pytorch code of paper CM-NAS: Cross-Modality Neural Architecture Search for Visible-Infrared Person Re-Identification in ICCV2021. Vis

JDAI-CV 40 Nov 25, 2022
Code for the ICCV2021 paper "Personalized Image Semantic Segmentation"

PSS: Personalized Image Semantic Segmentation Paper PSS: Personalized Image Semantic Segmentation Yu Zhang, Chang-Bin Zhang, Peng-Tao Jiang, Ming-Ming

张宇 15 Jul 09, 2022
Implementation for the IJCAI2021 work "Beyond the Spectrum: Detecting Deepfakes via Re-synthesis"

Beyond the Spectrum Implementation for the IJCAI2021 work "Beyond the Spectrum: Detecting Deepfakes via Re-synthesis" by Yang He, Ning Yu, Margret Keu

Yang He 27 Jan 07, 2023
Keras-retinanet - Keras implementation of RetinaNet object detection.

Keras RetinaNet Keras implementation of RetinaNet object detection as described in Focal Loss for Dense Object Detection by Tsung-Yi Lin, Priya Goyal,

Fizyr 4.3k Jan 01, 2023
LVI-SAM: Tightly-coupled Lidar-Visual-Inertial Odometry via Smoothing and Mapping

LVI-SAM This repository contains code for a lidar-visual-inertial odometry and mapping system, which combines the advantages of LIO-SAM and Vins-Mono

Tixiao Shan 1.1k Dec 27, 2022
A model to classify a piece of news as REAL or FAKE

Fake_news_classification A model to classify a piece of news as REAL or FAKE. This python project of detecting fake news deals with fake and real news

Gokul Stark 1 Jan 29, 2022
Code for paper Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting

Decoupled Spatial-Temporal Graph Neural Networks Code for our paper: Decoupled Dynamic Spatial-Temporal Graph Neural Network for Traffic Forecasting.

S22 43 Jan 04, 2023
Contrastively Disentangled Sequential Variational Audoencoder

Contrastively Disentangled Sequential Variational Audoencoder (C-DSVAE) Overview This is the implementation for our C-DSVAE, a novel self-supervised d

Junwen Bai 35 Dec 24, 2022
A simple algorithm for extracting tree height in sparse scene from point cloud data.

TREE HEIGHT EXTRACTION IN SPARSE SCENES BASED ON UAV REMOTE SENSING This is the offical python implementation of the paper "Tree Height Extraction in

6 Oct 28, 2022
TinyML Cookbook, published by Packt

TinyML Cookbook This is the code repository for TinyML Cookbook, published by Packt. Author: Gian Marco Iodice Publisher: Packt About the book This bo

Packt 93 Dec 29, 2022
Weighing Counts: Sequential Crowd Counting by Reinforcement Learning

LibraNet This repository includes the official implementation of LibraNet for crowd counting, presented in our paper: Weighing Counts: Sequential Crow

Hao Lu 18 Nov 05, 2022
PyTorch implementation of U-TAE and PaPs for satellite image time series panoptic segmentation.

Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks (ICCV 2021) This repository is the official implem

71 Jan 04, 2023
Official code for "InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization" (ICLR 2020, spotlight)

InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization Authors: Fan-yun Sun, Jordan Hoffm

Fan-Yun Sun 232 Dec 28, 2022
TensorFlow-based implementation of "Pyramid Scene Parsing Network".

PSPNet_tensorflow Important Code is fine for inference. However, the training code is just for reference and might be only used for fine-tuning. If yo

HsuanKung Yang 323 Dec 20, 2022