Implementation of SiameseXML (ICML 2021)

Overview

SiameseXML

Code for SiameseXML: Siamese networks meet extreme classifiers with 100M labels


Best Practices for features creation


  • Adding sub-words on top of unigrams to the vocabulary can help in training more accurate embeddings and classifiers.

Setting up


Expected directory structure

+-- <work_dir>
|  +-- programs
|  |  +-- siamesexml
|  |    +-- siamesexml
|  +-- data
|    +-- <dataset>
|  +-- models
|  +-- results

Download data for SiameseXML

* Download the (zipped file) BoW features from XML repository.  
* Extract the zipped file into data directory. 
* The following files should be available in <work_dir>/data/<dataset> for new datasets (ignore the next step)
    - trn_X_Xf.txt
    - trn_X_Y.txt
    - tst_X_Xf.txt
    - lbl_X_Xf.txt
    - tst_X_Y.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
* The following files should be available in <work_dir>/data/<dataset> if the dataset is in old format (please refer to next step to convert the data to new format)
    - train.txt
    - test.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy 

Convert to new data format

# A perl script is provided (in siamesexml/tools) to convert the data into new format
# Either set the $data_dir variable to the data directory of a particular dataset or replace it with the path
perl convert_format.pl $data_dir/train.txt $data_dir/trn_X_Xf.txt $data_dir/trn_X_Y.txt
perl convert_format.pl $data_dir/test.txt $data_dir/tst_X_Xf.txt $data_dir/tst_X_Y.txt

Example use cases


A single learner

The given code can be utilized as follows. A json file is used to specify architecture and other arguments. Please refer to the full documentation below for more details.

./run_main.sh 0 SiameseXML LF-AmazonTitles-131K 0 108

Full Documentation

./run_main.sh <gpu_id> <type> <dataset> <version> <seed>

* gpu_id: Run the program on this GPU.

* type
  SiameseXML uses DeepXML[2] framework for training. The classifier is trained in M-IV.
  - SiameseXML: The intermediate representation is not fine-tuned while training the classifier (more scalable; suitable for large datasets).
  - SiameseXML++: The intermediate representation is fine-tuned while training the classifier (leads to better accuracy on some datasets).

* dataset
  - Name of the dataset.
  - SiameseXML expects the following files in <work_dir>/data/<dataset>
    - trn_X_Xf.txt
    - trn_X_Y.txt
    - tst_X_Xf.txt
    - lbl_X_Xf.txt
    - tst_X_Y.txt
    - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
  - You can set the 'embedding_dims' in config file to switch between 300d and 512d embeddings.

* version
  - different runs could be managed by version and seed.
  - models and results are stored with this argument.

* seed
  - seed value as used by numpy and PyTorch.

Notes

* Other file formats such as npy, npz, pickle are also supported.
* Initializing with token embeddings (computed from FastText) leads to noticible accuracy gains. Please ensure that the token embedding file is available in data directory, if 'init=token_embeddings', otherwise it'll throw an error.
* Config files are made available in siamesexml/configs/<framework>/<method> for datasets in XC repository. You can use them when trying out the given code on new datasets.
* We conducted our experiments on a 24-core Intel Xeon 2.6 GHz machine with 440GB RAM with a single Nvidia P40 GPU. 128GB memory should suffice for most datasets.
* The code make use of CPU (mainly for hnswlib) as well as GPU. 

Cite as

@InProceedings{Dahiya21b,
    author = "Dahiya, K. and Agarwal, A. and Saini, D. and Gururaj, K. and Jiao, J. and Singh, A. and Agarwal, S. and Kar, P. and Varma, M",
    title = "SiameseXML: Siamese Networks meet Extreme Classifiers with 100M Labels",
    booktitle = "Proceedings of the International Conference on Machine Learning",
    month = "July",
    year = "2021"
}

YOU MAY ALSO LIKE

References


[1] K. Dahiya, A. Agarwal, D. Saini, K. Gururaj, J. Jiao, A. Singh, S. Agarwal, P. Kar and M. Varma. SiameseXML: Siamese networks meet extreme classifiers with 100M labels. In ICML, July 2021

[2] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma. Deepxml: A deep extreme multi-label learning framework applied to short text documents. In WSDM, 2021.

[3] pyxclib: https://github.com/kunaldahiya/pyxclib

Owner
Extreme Classification
Extreme Classification
Official Implementation of Domain-Aware Universal Style Transfer

Domain Aware Universal Style Transfer Official Pytorch Implementation of 'Domain Aware Universal Style Transfer' (ICCV 2021) Domain Aware Universal St

KibeomHong 80 Dec 30, 2022
Code for the prototype tool in our paper "CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning".

CoProtector Code for the prototype tool in our paper "CoProtector: Protect Open-Source Code against Unauthorized Training Usage with Data Poisoning".

Zhensu Sun 1 Oct 26, 2021
3DV 2021: Synergy between 3DMM and 3D Landmarks for Accurate 3D Facial Geometry

SynergyNet 3DV 2021: Synergy between 3DMM and 3D Landmarks for Accurate 3D Facial Geometry Cho-Ying Wu, Qiangeng Xu, Ulrich Neumann, CGIT Lab at Unive

Cho-Ying Wu 239 Jan 06, 2023
Classic Papers for Beginners and Impact Scope for Authors.

There have been billions of academic papers around the world. However, maybe only 0.0...01% among them are valuable or are worth reading. Since our limited life has never been forever, TopPaper provi

Qiulin Zhang 228 Dec 18, 2022
Implementation of "With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition, BMVC, 2021" in PyTorch

Multimodal Temporal Context Network (MTCN) This repository implements the model proposed in the paper: Evangelos Kazakos, Jaesung Huh, Arsha Nagrani,

Evangelos Kazakos 13 Nov 24, 2022
TensorFlow implementation of Deep Reinforcement Learning papers

Deep Reinforcement Learning in TensorFlow TensorFlow implementation of Deep Reinforcement Learning papers. This implementation contains: [1] Playing A

Taehoon Kim 1.6k Jan 03, 2023
DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers Authors: Jaemin Cho, Abhay Zala, and Mohit Bansal (

Jaemin Cho 98 Dec 15, 2022
TensorFlow implementation of the algorithm in the paper "Decoupled Low-light Image Enhancement"

Decoupled Low-light Image Enhancement Shijie Hao1,2*, Xu Han1,2, Yanrong Guo1,2 & Meng Wang1,2 1Key Laboratory of Knowledge Engineering with Big Data

17 Apr 25, 2022
Implementation of the HMAX model of vision in PyTorch

PyTorch implementation of HMAX PyTorch implementation of the HMAX model that closely follows that of the MATLAB implementation of The Laboratory for C

Marijn van Vliet 52 Oct 13, 2022
Label Studio is a multi-type data labeling and annotation tool with standardized output format

Website • Docs • Twitter • Join Slack Community What is Label Studio? Label Studio is an open source data labeling tool. It lets you label data types

Heartex 11.7k Jan 09, 2023
PyTorch Autoencoders - Implementing a Variational Autoencoder (VAE) Series in Pytorch.

PyTorch Autoencoders Implementing a Variational Autoencoder (VAE) Series in Pytorch. Inspired by this repository Model List check model paper conferen

Subin An 8 Nov 21, 2022
(Personalized) Page-Rank computation using PyTorch

torch-ppr This package allows calculating page-rank and personalized page-rank via power iteration with PyTorch, which also supports calculation on GP

Max Berrendorf 69 Dec 03, 2022
Material for my PyConDE & PyData Berlin 2022 Talk "5 Steps to Speed Up Your Data-Analysis on a Single Core"

5 Steps to Speed Up Your Data-Analysis on a Single Core Material for my talk at the PyConDE & PyData Berlin 2022 Description Your data analysis pipeli

Jonathan Striebel 9 Dec 12, 2022
Deep Markov Factor Analysis (NeurIPS2021)

Deep Markov Factor Analysis (DMFA) Codes and experiments for deep Markov factor analysis (DMFA) model accepted for publication at NeurIPS2021: A. Farn

Sarah Ostadabbas 2 Dec 16, 2022
The implemetation of Dynamic Nerual Garments proposed in Siggraph Asia 2021

DynamicNeuralGarments Introduction This repository contains the implemetation of Dynamic Nerual Garments proposed in Siggraph Asia 2021. ./GarmentMoti

42 Dec 27, 2022
Generative Flow Networks for Discrete Probabilistic Modeling

Energy-based GFlowNets Code for Generative Flow Networks for Discrete Probabilistic Modeling by Dinghuai Zhang, Nikolay Malkin, Zhen Liu, Alexandra Vo

Narsil-Dinghuai Zhang 51 Dec 20, 2022
A selection of State Of The Art research papers (and code) on human locomotion (pose + trajectory) prediction (forecasting)

A selection of State Of The Art research papers (and code) on human trajectory prediction (forecasting). Papers marked with [W] are workshop papers.

Karttikeya Manglam 40 Nov 18, 2022
Simple converter for deploying Stable-Baselines3 model to TFLite and/or Coral

Running SB3 developed agents on TFLite or Coral Introduction I've been using Stable-Baselines3 to train agents against some custom Gyms, some of which

Gary Briggs 16 Oct 11, 2022
Implementation of UNet on the Joey ML framework

Independent Research Project - Code Joey can be cloned from here https://github.com/devitocodes/joey/. Devito and other dependencies such as PyTorch a

Navjot Kukreja 1 Oct 21, 2021
[ICCV 2021 Oral] SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer

This repository contains the source code for the paper SnowflakeNet: Point Cloud Completion by Snowflake Point Deconvolution with Skip-Transformer (ICCV 2021 Oral). The project page is here.

AllenXiang 65 Dec 26, 2022