PyImpetus is a Markov Blanket based feature subset selection algorithm that considers features both individually and together as a group, in order to provide not just the best set of features but also the best combination of features.

Overview


PyImpetus

PyImpetus is a Markov Blanket based feature selection algorithm that selects a subset of features by considering their performance both individually and as a group. This allows the algorithm to select not just the best individual features, but the set of features that work best together. For example, the best-performing feature might not combine well with others, while the remaining features, taken together, could outperform it. PyImpetus takes this into account and produces the best combination it can find. The result is a minimal feature subset, so you do not have to decide how many features to keep: PyImpetus selects the set for you.
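The difference between judging features individually and as a group can be seen on a toy dataset. The sketch below is only an illustration and does not use PyImpetus: the data, the column names and the `subset_accuracy` helper are all made up for this example. The target is the XOR of columns `a` and `b`, while column `c` agrees with the target in 6 of 8 rows.

```python
from collections import Counter, defaultdict

# Toy data (hypothetical): columns are a, b, c and the target is a XOR b.
rows = [
    (0, 0, 0), (0, 0, 1),
    (0, 1, 1), (0, 1, 0),
    (1, 0, 1), (1, 0, 1),
    (1, 1, 0), (1, 1, 0),
]
labels = [a ^ b for a, b, _ in rows]


def subset_accuracy(rows, labels, cols):
    """Accuracy of a majority-vote lookup table built on the chosen columns."""
    votes = defaultdict(Counter)
    for row, y in zip(rows, labels):
        votes[tuple(row[c] for c in cols)][y] += 1
    # Each distinct key predicts its most common label.
    correct = sum(max(counter.values()) for counter in votes.values())
    return correct / len(rows)


acc_a = subset_accuracy(rows, labels, [0])
acc_b = subset_accuracy(rows, labels, [1])
acc_c = subset_accuracy(rows, labels, [2])
acc_ab = subset_accuracy(rows, labels, [0, 1])
print(acc_a, acc_b, acc_c, acc_ab)
```

Here `c` is the best single feature, yet `a` and `b`, which look useless on their own, classify every row correctly when used together. A selector that ranks features one at a time would miss this.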

PyImpetus has been completely revamped and now supports binary classification, multi-class classification and regression tasks. It has been tested on 14 datasets and outperformed state-of-the-art Markov Blanket learning algorithms on all of them, as well as traditional feature selection algorithms such as Forward Feature Selection, Backward Feature Elimination and Recursive Feature Elimination.

How to install?

pip install PyImpetus

Functions and parameters

# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBC is for classification
model = PPIMBC(model, p_val_thresh, num_simul, simul_size, simul_type, sig_test_type, cv, verbose, random_state, n_jobs)
  • model - estimator object, default=DecisionTreeClassifier() The model which is used to perform classification in order to find feature importance via significance-test.
  • p_val_thresh - float, default=0.05 The p-value (in this case, feature importance) below which a feature will be considered as a candidate for the final Markov Blanket (MB).
  • num_simul - int, default=30 (This parameter has a huge impact on speed) Number of train-test splits to perform to check the usefulness of each feature. For large datasets, reduce this value considerably, but do not go below 5.
  • simul_size - float, default=0.2 The size of the test set in each train-test split
  • simul_type - boolean, default=0 Whether to stratify the train-test splits
    • 0 means train-test splits are not stratified.
    • 1 means the train-test splits will be stratified.
  • sig_test_type - string, default="non-parametric" This determines the type of significance test to use.
    • "parametric" means a parametric significance test will be used (Note: This test selects very few features)
    • "non-parametric" means a non-parametric significance test will be used
  • cv - cv object/int, default=0 Determines the number of splits for cross-validation. Sklearn CV object can also be passed. A value of 0 means CV is disabled.
  • verbose - int, default=2 Controls the verbosity: the higher the value, the more messages are printed.
  • random_state - int or RandomState instance, default=None Pass an int for reproducible output across multiple function calls.
  • n_jobs - int, default=-1 The number of CPUs to use to do the computation.
    • None means 1 unless in a :obj:joblib.parallel_backend context.
    • -1 means using all processors.
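To illustrate what the simul_type switch changes, the following standalone sketch builds one train-test split with and without stratification. It uses only the standard library and is not PyImpetus code; `train_test_indices` is a hypothetical helper written for this example.

```python
import random
from collections import defaultdict


def train_test_indices(labels, simul_size=0.2, stratify=False, seed=27):
    """Return (train, test) index lists for one random split."""
    rng = random.Random(seed)
    if not stratify:
        # Plain split: shuffle all rows and cut off the test fraction.
        order = list(range(len(labels)))
        rng.shuffle(order)
        n_test = round(len(labels) * simul_size)
        return sorted(order[n_test:]), sorted(order[:n_test])
    # Stratified split: cut off the same fraction inside every class.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * simul_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Imbalanced toy labels: 90 rows of class 0 and 10 rows of class 1.
labels = [0] * 90 + [1] * 10
plain_train, plain_test = train_test_indices(labels, stratify=False)
strat_train, strat_test = train_test_indices(labels, stratify=True)
minority_in_strat_test = sum(labels[i] for i in strat_test)
print(len(plain_test), len(strat_test), minority_in_strat_test)
```

With stratification, the test set keeps the 90/10 class ratio of the full data. Without it, the number of minority rows in the test set varies from split to split, which matters most for small or imbalanced datasets.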
# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBR is for regression
model = PPIMBR(model, p_val_thresh, num_simul, simul_size, sig_test_type, cv, verbose, random_state, n_jobs)
  • model - estimator object, default=DecisionTreeRegressor() The model which is used to perform regression in order to find feature importance via significance-test.
  • p_val_thresh - float, default=0.05 The p-value (in this case, feature importance) below which a feature will be considered as a candidate for the final Markov Blanket (MB).
  • num_simul - int, default=30 (This parameter has a huge impact on speed) Number of train-test splits to perform to check the usefulness of each feature. For large datasets, reduce this value considerably, but do not go below 5.
  • simul_size - float, default=0.2 The size of the test set in each train-test split
  • sig_test_type - string, default="non-parametric" This determines the type of significance test to use.
    • "parametric" means a parametric significance test will be used (Note: This test selects very few features)
    • "non-parametric" means a non-parametric significance test will be used
  • cv - cv object/int, default=0 Determines the number of splits for cross-validation. Sklearn CV object can also be passed. A value of 0 means CV is disabled.
  • verbose - int, default=2 Controls the verbosity: the higher the value, the more messages are printed.
  • random_state - int or RandomState instance, default=None Pass an int for reproducible output across multiple function calls.
  • n_jobs - int, default=-1 The number of CPUs to use to do the computation.
    • None means 1 unless in a :obj:joblib.parallel_backend context.
    • -1 means using all processors.
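The parameters num_simul, simul_size and p_val_thresh are shared by PPIMBC and PPIMBR. One way to picture how they might interact is sketched below. This is a conceptual illustration under stated assumptions, not PyImpetus's implementation: it assumes a feature is judged by comparing test scores with the feature intact and with its values shuffled, and it swaps in a nearest-centroid classifier and a one-sided sign test as simple stand-ins for the configurable model and significance test. All function names are hypothetical.

```python
import random
from math import comb


def nearest_centroid_accuracy(train, test):
    """Fit per-class centroids on train and return the accuracy on test."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}

    def predict(features):
        return min(centroids, key=lambda lab: sum(
            (f - c) ** 2 for f, c in zip(features, centroids[lab])))

    return sum(predict(f) == y for f, y in test) / len(test)


def sign_test_p_value(wins, losses):
    """One-sided sign test: chance of at least `wins` successes, ties dropped."""
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


def feature_p_value(rows, labels, feature, num_simul=30, simul_size=0.2, seed=27):
    """Compare scores with the feature intact vs. shuffled over num_simul splits."""
    rng = random.Random(seed)
    data = list(zip(rows, labels))
    n_test = max(1, int(len(data) * simul_size))
    wins = losses = 0
    for _ in range(num_simul):
        shuffled = rng.sample(data, len(data))
        test, train = shuffled[:n_test], shuffled[n_test:]
        column = [f[feature] for f, _ in test]
        rng.shuffle(column)  # break the link between this feature and the target
        permuted = [(f[:feature] + [v] + f[feature + 1:], y)
                    for (f, y), v in zip(test, column)]
        intact_score = nearest_centroid_accuracy(train, test)
        permuted_score = nearest_centroid_accuracy(train, permuted)
        wins += intact_score > permuted_score
        losses += intact_score < permuted_score
    return sign_test_p_value(wins, losses)


# Toy data: feature 0 tracks the label closely, feature 1 is pure noise.
rng = random.Random(0)
labels = [i % 2 for i in range(100)]
rows = [[y + rng.uniform(-0.1, 0.1), rng.uniform(0.0, 1.0)] for y in labels]

p_val_thresh = 0.05
p_values = [feature_p_value(rows, labels, j) for j in range(2)]
selected = [j for j, p in enumerate(p_values) if p < p_val_thresh]
print(p_values, selected)
```

In this sketch, more simulations give the test more evidence at the cost of more model fits, which is why num_simul dominates the running time. Only features whose p-value falls below p_val_thresh remain candidates.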
# To fit PyImpetus on provided dataset and find recommended features
fit(data, target)
  • data - A pandas dataframe upon which feature selection is to be applied
  • target - A numpy array, denoting the target variable
# To prune the given data down to the columns that form the MB (these are the recommended features)
transform(data)
  • data - A pandas dataframe which needs to be pruned
# To fit PyImpetus on provided dataset and return pruned data
fit_transform(data, target)
  • data - A pandas dataframe upon which feature selection is to be applied
  • target - A numpy array, denoting the target variable
# To plot XGBoost style feature importance
feature_importance()

How to import?

from PyImpetus import PPIMBC, PPIMBR

Usage

# Import the algorithm. PPIMBC is for classification and PPIMBR is for regression
from PyImpetus import PPIMBC, PPIMBR
# SVC is the estimator used for the significance tests in this example
from sklearn.svm import SVC
# Initialize the PyImpetus object
model = PPIMBC(model=SVC(random_state=27, class_weight="balanced"), p_val_thresh=0.05, num_simul=30, simul_size=0.2, simul_type=0, sig_test_type="non-parametric", cv=5, random_state=27, n_jobs=-1, verbose=2)
# The fit_transform function is a wrapper that calls fit and then transform.
# The fit function finds the MB for the given data, while the transform function returns the pruned form of the dataset
df_train = model.fit_transform(df_train.drop("Response", axis=1), df_train["Response"].values)
df_test = model.transform(df_test)
# Check out the MB
print(model.MB)
# Check out the feature importance scores for the selected feature subset
print(model.feat_imp_scores)
# Get a plot of the feature importance scores
model.feature_importance()

For better accuracy

Note: Experiment with the values of num_simul, simul_size, simul_type and p_val_thresh, because sometimes a specific combination of these values gives the best results.

  • Increase the cv value only when you have an extremely small dataset. In all experiments, cv did not help in getting better accuracy
  • Increase the num_simul value
  • Try one of these values for simul_size = {0.1, 0.2, 0.3, 0.4}
  • Use non-linear models for feature selection. Apply hyper-parameter tuning on models
  • Increase the value of p_val_thresh in order to increase the number of features included in the Markov Blanket

For better speeds

  • Decrease the cv value. For large datasets cv might not be required. Therefore, set cv=0 to disable the aggregation step. This will result in less robust feature subset selection but at much faster speeds
  • Decrease the num_simul value but don't decrease it below 5
  • Set n_jobs to -1
  • Use linear models

For selection of fewer features

  • Try reducing the p_val_thresh value
  • Try out sig_test_type = "parametric"
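The effect of p_val_thresh on the size of the selected subset can be shown with a few made-up numbers. The p-values, feature names and the `candidates` helper below are hypothetical and only illustrate the thresholding rule described in the parameter list (a feature is a candidate when its p-value is below the threshold).

```python
# Hypothetical p-values for five features (illustrative numbers, not PyImpetus output).
p_values = {"age": 0.001, "income": 0.03, "height": 0.04, "zip_code": 0.20, "noise": 0.75}


def candidates(p_values, p_val_thresh):
    """Return the features whose p-value falls below the threshold."""
    return [name for name, p in p_values.items() if p < p_val_thresh]


strict = candidates(p_values, 0.01)    # lower threshold, fewer features
default = candidates(p_values, 0.05)   # the documented default
loose = candidates(p_values, 0.25)     # higher threshold, more features
print(strict, default, loose)
```

Lowering the threshold from 0.05 to 0.01 keeps one feature instead of three, while raising it to 0.25 admits a fourth.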

Performance in terms of Accuracy (classification) and MSE (regression)

Dataset | # of samples | # of features | Task type | Score using all features | Score using featurewiz | Score using PyImpetus | # of features selected | % of features selected | Tutorial
Ionosphere | 351 | 34 | Classification | 88.01% | - | 92.86% | 14 | 42.42% | tutorial here
Arcene | 100 | 10000 | Classification | 82% | - | 84.72% | 304 | 3.04% | -
AlonDS2000 | 62 | 2000 | Classification | 80.55% | 86.98% | 88.49% | 75 | 3.75% | -
slice_localization_data | 53500 | 384 | Regression | 6.54 | - | 5.69 | 259 | 67.45% | tutorial here

Note: Here, for the first, second and third tasks, a higher accuracy score is better while for the fourth task, a lower MSE (Mean Squared Error) is better.
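The "% of features selected" column can be recomputed from the feature counts in the table. The short check below uses only the numbers reported above; `percent_selected` is a helper written for this example.

```python
# (dataset, total features, features selected, reported percentage) from the table
reported = [
    ("Ionosphere", 34, 14, 42.42),
    ("Arcene", 10000, 304, 3.04),
    ("AlonDS2000", 2000, 75, 3.75),
    ("slice_localization_data", 384, 259, 67.45),
]


def percent_selected(n_selected, n_total):
    """Share of features kept, as a percentage rounded to two decimals."""
    return round(100 * n_selected / n_total, 2)


recomputed = {name: percent_selected(sel, total) for name, total, sel, _ in reported}
for name, total, sel, pct in reported:
    print(f"{name}: reported {pct}%, recomputed {recomputed[name]}%")
```

Three of the four rows match the reported values. For Ionosphere, 14 of 34 features is 41.18%, whereas the reported 42.42% corresponds to 14 of 33. This may mean that one column was dropped before selection, but that is only a guess.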

Performance in terms of Time (in seconds)

Dataset | # of samples | # of features | Time (with PyImpetus)
Ionosphere | 351 | 34 | 35.37
Arcene | 100 | 10000 | 1570
AlonDS2000 | 62 | 2000 | 125.511
slice_localization_data | 53500 | 384 | 1296.13

Future Ideas

  • Let me know

Feature Request

Drop me an email at [email protected] if you want any particular feature

Please cite this work as

Reference to the upcoming paper will be added here

Owner
Atif Hassan
PhD student at the Center of Excellence for AI, IIT Kharagpur.