To SMOTE, or not to SMOTE?

Overview

To SMOTE, or not to SMOTE?

This package includes the code required to repeat the experiments in the paper and to analyze the results.

To SMOTE, or not to SMOTE?

Yotam Elor and Hadar Averbuch-Elor

Installation

# Create a new conda environment and activate it
conda create --name to-SMOTE-or-not -y python=3.7
conda activate to-SMOTE-or-not
# Install dependencies
pip install -r requirements.txt

Running experiments

The data is not included with this package. See an example of running a single experiment with a dataset from imblanaced-learn

# Load the data
import pandas as pd
import numpy as np
from imblearn.datasets import fetch_datasets
data = fetch_datasets()["mammography"]
x = pd.DataFrame(data["data"])
y = np.array(data["target"]).reshape((-1, 1))

# Run the experiment
from experiment import experiment
from classifiers import CLASSIFIER_HPS
from oversamplers import OVERSAMPLER_HPS
results = experiment(
    x=x,
    y=y,
    oversampler={
        "type": "smote",
        "ratio": 0.4,
        "params": OVERSAMPLER_HPS["smote"][0],
    },
    classifier={
        "type": "cat",  # Catboost
        "params": CLASSIFIER_HPS["cat"][0]
    },
    seed=0,
    normalize=False,
    clean_early_stopping=False,
    consistent=True,
    repeats=1
)

# Print the results nicely
import json
print(json.dumps(results, indent=4))

To run all the experiments in our study, wrap the above in loops, for example

for dataset in datasets:
    x, y = load_dataset(dataset)  # this functionality is not provided
    for seed in range(7):
        for classifier, classifier_hp_configs in CLASSIFIER_HPS.items():
            for classifier_hp in classifier_hp_configs:
                for oversampler, oversampler_hp_configs in OVERSAMPLER_HPS.items():
                    for oversampler_hp in oversampler_hp_configs:
                        for ratio in [0.1, 0.2, 0.3, 0.4, 0.5]:
                            results = experiment(
                                x=x,
                                y=y,
                                oversampler={
                                    "type": oversampler,
                                    "ratio": ratio,
                                    "params": oversampler_hp,
                                },
                                classifier={
                                    "type": classifier,
                                    "params": classifier_hp
                                },
                                seed=seed,
                                normalize=...,
                                clean_early_stopping=...,
                                consistent=...,
                                repeats=...
                            )

Analyze

Read the results from the compressed csv file. As the results file is large, it is tracked using git-lfs. You might need to download it manually or install git-lfs.

import os
import pandas as pd
data_path = os.path.join(os.path.dirname(__file__), "../data/results.gz")
df = pd.read_csv(data_path)

Drop nans and filter experiments with consistent classifiers, no normalization and a single validation fold

df = df.dropna()
df = df[
    (df["consistent"] == True)
    & (df["normalize"] == False)
    & (df["clean_early_stopping"] == False)
    & (df["repeats"] == 1)
]

Select the best HP configurations according to AUC validation scores. opt_metric is the key used to select the best configuration. For example, for a-priori HPs use opt_metric="test.roc_auc" and for validation-HPs use opt_metric="validation.roc_auc". Additionaly calculate average score and rank

from analyze import filter_optimal_hps
df = filter_optimal_hps(
    df, opt_metric="validation.roc_auc", output_metrics=["test.roc_auc"]
)
print(df)

Plot the results

from analyze import avg_plots
avg_plots(df, "test.roc_auc")

Citation

@misc{elor2022smote,
    title={To SMOTE, or not to SMOTE?}, 
    author={Yotam Elor and Hadar Averbuch-Elor},
    year={2022},
    eprint={2201.08528},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Owner
Amazon Web Services
Amazon Web Services
Milano is a tool for automating hyper-parameters search for your models on a backend of your choice.

Milano (This is a research project, not an official NVIDIA product.) Documentation https://nvidia.github.io/Milano Milano (Machine learning autotuner

NVIDIA Corporation 147 Dec 17, 2022
The implementation of the CVPR2021 paper "Structure-Aware Face Clustering on a Large-Scale Graph with 10^7 Nodes"

STAR-FC This code is the implementation for the CVPR 2021 paper "Structure-Aware Face Clustering on a Large-Scale Graph with 10^7 Nodes" 🌟 🌟 . πŸŽ“ Re

Shuai Shen 87 Dec 28, 2022
Capture all information throughout your model's development in a reproducible way and tie results directly to the model code!

Rubicon Purpose Rubicon is a data science tool that captures and stores model training and execution information, like parameters and outcomes, in a r

Capital One 97 Jan 03, 2023
Use of Attention Gates in a Convolutional Neural Network / Medical Image Classification and Segmentation

Attention Gated Networks (Image Classification & Segmentation) Pytorch implementation of attention gates used in U-Net and VGG-16 models. The framewor

Ozan Oktay 1.6k Dec 30, 2022
TilinGNN: Learning to Tile with Self-Supervised Graph Neural Network (SIGGRAPH 2020)

TilinGNN: Learning to Tile with Self-Supervised Graph Neural Network (SIGGRAPH 2020) About The goal of our research problem is illustrated below: give

59 Dec 09, 2022
Painting app using Python machine learning and vision technology.

AI Painting App We are making an app that will track our hand and helps us to draw from that. We will be using the advance knowledge of Machine Learni

Badsha Laskar 3 Oct 03, 2022
Repo for the ACMMM20 submission: "Personalized breath based biometric authentication with wearable multimodality".

personalized-breath Repo for the ACMMM20 submission: "Personalized breath based biometric authentication with wearable multimodality". Guideline To ex

Manh-Ha Bui 2 Nov 15, 2021
PyTorch implementation for our AAAI 2022 Paper "Graph-wise Common Latent Factor Extraction for Unsupervised Graph Representation Learning"

deepGCFX PyTorch implementation for our AAAI 2022 Paper "Graph-wise Common Latent Factor Extraction for Unsupervised Graph Representation Learning" Pr

Thilini Cooray 4 Aug 11, 2022
A PyTorch implementation of "Capsule Graph Neural Network" (ICLR 2019).

CapsGNN β €β € A PyTorch implementation of Capsule Graph Neural Network (ICLR 2019). Abstract The high-quality node embeddings learned from the Graph Neur

Benedek Rozemberczki 1.2k Jan 02, 2023
Perturbed Self-Distillation: Weakly Supervised Large-Scale Point Cloud Semantic Segmentation (ICCV2021)

Perturbed Self-Distillation: Weakly Supervised Large-Scale Point Cloud Semantic Segmentation (ICCV2021) This is the implementation of PSD (ICCV 2021),

12 Dec 12, 2022
Self-supervised learning on Graph Representation Learning (node-level task)

graph_SSL Self-supervised learning on Graph Representation Learning (node-level task) How to run the code To run GRACE, sh run_GRACE.sh To run GCA, sh

Namkyeong Lee 3 Dec 31, 2021
Weakly Supervised Learning of Instance Segmentation with Inter-pixel Relations, CVPR 2019 (Oral)

Weakly Supervised Learning of Instance Segmentation with Inter-pixel Relations The code of: Weakly Supervised Learning of Instance Segmentation with I

Jiwoon Ahn 472 Dec 29, 2022
Poisson Surface Reconstruction for LiDAR Odometry and Mapping

Poisson Surface Reconstruction for LiDAR Odometry and Mapping Surfels TSDF Our Approach Table: Qualitative comparison between the different mapping te

Photogrammetry & Robotics Bonn 305 Dec 21, 2022
AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition [ArXiv] [Project Page] This repository is the official implementation of AdaMML:

International Business Machines 43 Dec 26, 2022
Multi-agent reinforcement learning algorithm and environment

Multi-agent reinforcement learning algorithm and environment [en/cn] Pytorch implements multi-agent reinforcement learning algorithms including IQL, Q

万鲲鹏 7 Sep 20, 2022
Dewarping Document Image By Displacement Flow Estimation with Fully Convolutional Network.

Dewarping Document Image By Displacement Flow Estimation with Fully Convolutional Network

111 Dec 27, 2022
Nicely is a real-time Feedback and Intervention Program Depression is a prevalent issue across all age groups, socioeconomic classes, and cultural identities.

Nicely is a real-time Feedback and Intervention Program Depression is a prevalent issue across all age groups, socioeconomic classes, and cultural identities.

1 Jan 16, 2022
Code release for Universal Domain Adaptation(CVPR 2019)

Universal Domain Adaptation Code release for Universal Domain Adaptation(CVPR 2019) Requirements python 3.6+ PyTorch 1.0 pip install -r requirements.t

THUML @ Tsinghua University 229 Dec 23, 2022
[ICML 2021] "Graph Contrastive Learning Automated" by Yuning You, Tianlong Chen, Yang Shen, Zhangyang Wang

Graph Contrastive Learning Automated PyTorch implementation for Graph Contrastive Learning Automated [talk] [poster] [appendix] Yuning You, Tianlong C

Shen Lab at Texas A&M University 80 Nov 23, 2022
Official implementation of the ICML2021 paper "Elastic Graph Neural Networks"

ElasticGNN This repository includes the official implementation of ElasticGNN in the paper "Elastic Graph Neural Networks" [ICML 2021]. Xiaorui Liu, W

liuxiaorui 34 Dec 04, 2022