scikit-learn cross validators for iterative stratification of multilabel data

Last update: Jan 05, 2023

Overview

iterative-stratification

iterative-stratification is a project that provides scikit-learn compatible cross validators with stratification for multilabel data.

Presently scikit-learn provides several cross validators with stratification. However, these cross validators do not offer the ability to stratify multilabel data. This iterative-stratification project offers implementations of MultilabelStratifiedKFold, RepeatedMultilabelStratifiedKFold, and MultilabelStratifiedShuffleSplit with a base algorithm for stratifying multilabel data described in the following paper:

Sechidis K., Tsoumakas G., Vlahavas I. (2011) On the Stratification of Multi-Label Data. In: Gunopulos D., Hofmann T., Malerba D., Vazirgiannis M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg.

Requirements

iterative-stratification has been tested under Python 3.4 through 3.8 with the following dependencies:

scipy(>=0.13.3)
numpy(>=1.8.2)
scikit-learn(>=0.19.0)

Installation

iterative-stratification is currently available on the PyPI repository and can be installed via pip:

pip install iterative-stratification

The package is also installable from the Anaconda Cloud platform:

conda install -c trent-b iterative-stratification

Toy Examples

The multilabel cross validators that this package provides may be used with the scikit-learn API in the same manner as any other cross validators. For example, these cross validators may be passed to cross_val_score or cross_val_predict. Below are some toy examples of the direct use of the multilabel cross validators.

MultilabelStratifiedKFold

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for train_index, test_index in mskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
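
As a quick sanity check on the output above, the share of positive examples per label can be compared between the full data and each side of the first split. This is a minimal sketch in plain Python; the helper name label_rates is introduced here for illustration and is not part of the package.

```python
# Toy labels and the first split reported in the output above.
y = [[0, 0], [0, 0], [0, 1], [0, 1], [1, 1], [1, 1], [1, 0], [1, 0]]
train_index = [0, 3, 4, 6]
test_index = [1, 2, 5, 7]


def label_rates(rows):
    """Fraction of rows in which each label is set (hypothetical helper)."""
    n_labels = len(rows[0])
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_labels)]


overall = label_rates(y)
train = label_rates([y[i] for i in train_index])
test = label_rates([y[i] for i in test_index])

print("overall:", overall)
print("train:  ", train)
print("test:   ", test)
```

For this toy data each label is positive in half of the rows, and both sides of the split keep that same share.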

RepeatedMultilabelStratifiedKFold

from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

rmskf = RepeatedMultilabelStratifiedKFold(n_splits=2, n_repeats=2, random_state=0)

for train_index, test_index in rmskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [0 1 4 5] TEST: [2 3 6 7]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]

MultilabelStratifiedShuffleSplit

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

for train_index, test_index in msss.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]
TRAIN: [1 2 5 6] TEST: [0 3 4 7]

Comments

Adjusting test_size doesn't actually change test_size

Hello! I'm trying to use this code for a project, however, I don't want my test size to be 0.5. When I try and adjust it, I don't get a change:

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=42)

for train_index, test_index in msss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    print(len(train_index))
    print(len(test_index))
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

outputs:

('TRAIN:', array([1, 2, 4, 7]), 'TEST:', array([0, 3, 5, 6]))
4
4
('TRAIN:', array([2, 3, 6, 7]), 'TEST:', array([0, 1, 4, 5]))
4
4
('TRAIN:', array([0, 2, 4, 6]), 'TEST:', array([1, 3, 5, 7]))
4
4
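
For reference, the sizes one would expect from test_size=0.25 on 8 samples can be worked out by hand. The sketch below assumes the rounding convention used by scikit-learn's shuffle-split validators (test count rounded up, the remainder used for training); that convention is an assumption here, not something the report confirms for this package.

```python
from math import ceil

n_samples = 8
test_size = 0.25

# Assumed convention: round the test count up, give the rest to training.
expected_test = ceil(test_size * n_samples)
expected_train = n_samples - expected_test

# (train, test) sizes printed in the report above.
reported = [(4, 4), (4, 4), (4, 4)]
mismatches = [pair for pair in reported if pair != (expected_train, expected_test)]

print("expected (train, test):", (expected_train, expected_test))
print("reported splits that differ:", len(mismatches))
```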

Kudos on putting this out there!

opened by tyler-lanigan-hs 9

[MOD] Bug Fix for sklearn 1.0~

scikit-learn has been updated to 1.0.0. As a result, some functions no longer work properly and raise errors like the one below:

TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given.

To fix this problem, I added * to the __init__ parameters, following PEP 3102 (https://www.python.org/dev/peps/pep-3102/).
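
The kind of failure described here can be reproduced without scikit-learn. The sketch below uses made-up classes (Base, BrokenChild, FixedChild are hypothetical names, not the library's code) to show how a bare * in a parent __init__ makes positional forwarding fail, and how forwarding by keyword avoids it.

```python
class Base:
    # Parameters after the bare * are keyword-only (PEP 3102).
    def __init__(self, n_splits=5, *, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state


class BrokenChild(Base):
    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        # Forwards everything positionally, which the parent no longer accepts.
        super().__init__(n_splits, shuffle, random_state)


class FixedChild(Base):
    def __init__(self, n_splits=3, *, shuffle=False, random_state=None):
        # Forwards the keyword-only parameters by name.
        super().__init__(n_splits, shuffle=shuffle, random_state=random_state)


try:
    BrokenChild(3, True, 0)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

fixed = FixedChild(3, shuffle=True, random_state=0)

print("BrokenChild:", error_message)
print("FixedChild:", fixed.n_splits, fixed.shuffle, fixed.random_state)
```
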
opened by CryptoSalamander 4

Incompatibility with scikit-learn 1.0 in latest release

As of scikit-learn 1.0 the deprecation warning fixed in 0a108bc2062fd32f98c9a6305508ea213292ba08 has become a hard error. Could a new release be pushed to PyPI in order to remain compatible with the latest scikit-learn?

For other users experiencing this issue, it will look something like this:

, in __init__
    super(MultilabelStratifiedShuffleSplit, self).__init__(
TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given

The workaround is to use the latest master of this package.

opened by lunik1 4

Error using MultilabelStratifiedKFold

Hi Trent! First, thanks for this repository, it has helped me a lot.

I have a question. I use MultilabelStratifiedKFold for a machine learning model, but since last week it has been giving me an error. I haven't changed anything, so I don't know what could be happening.

The error occurs on this line of code:

mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)

And the error that it throws is this:

Input In [13], in <cell line: 6>()
      3 oof_preds["fold_idx"] = -1
      4 oof_preds["oof_pred"] = -1
----> 6 mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)
      7 mskf_split = mskf.split(dataset, dataset[["rvm_tipo_enc","rvm_marca_enc","rvm_antiguedad","converted"]])
      9 for fold,(train_idx,valid_idx) in enumerate(mskf_split):

File ~\Anaconda3\envs\JARVIS\lib\site-packages\iterstrat\ml_stratifiers.py:157, in MultilabelStratifiedKFold.__init__(self, n_splits, shuffle, random_state)
    156 def __init__(self, n_splits=3, shuffle=False, random_state=None):
--> 157     super(MultilabelStratifiedKFold, self).__init__(n_splits, shuffle, random_state)

TypeError: __init__() takes 2 positional arguments but 4 were given

What could be happening here? Thanks a lot!

opened by robertogarces 3

Ability to set custom fold proportions for MultilabelStratifiedKFold (pass "r" to IterativeStratification)

For us it's useful to be able to set custom fold proportions when using MultilabelStratifiedKFold (essentially passing custom r to IterativeStratification). It's easy enough to extend outside of the lib (only _make_test_folds needs to be copied), but I wonder if such a feature could be useful in the library itself, what do you think?

And thanks for a great library!

opened by lopuhin 3

Balanced sample with low number of one of the classes

I'm working with an extremely large multilabel problem and there are some rare classes. I was trying to use your package to balance my train/test split and noticed that it does not guarantee at least one sample of each class in each set. The following example shows the problem:

>>> import numpy as np
>>> from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
>>> 
>>> X = np.arange(10)
>>> X
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> 
>>> y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],[0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])
>>> y
array([[1, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0],
       [0, 1, 0],
       [0, 1, 0],
       [0, 1, 0],
       [1, 1, 0],
       [0, 1, 1],
       [1, 0, 1]])
>>> 
>>> temp = MultilabelStratifiedShuffleSplit(n_splits = 1,test_size =.2,random_state = 0)
>>> train, test  = list(temp.split(X, y))[0]
>>> 
>>> train
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> 
>>> 
>>> test
array([0])

The train set contains both samples 8 and 9, which are the only ones that have the class with index 2. How can I make sure that all splits have at least one sample per class?
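
The situation described can be checked directly from the indices shown above. The following sketch counts, in plain Python, how many positive samples each label has on each side of the reported split; label_counts is a hypothetical helper written for this illustration, and the check only reports missing labels, it does not change how the split is made.

```python
# Labels and split indices copied from the session above.
y = [[1, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0],
     [0, 1, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]]
train = [1, 2, 3, 4, 5, 6, 7, 8, 9]
test = [0]


def label_counts(indices):
    """Number of positive samples per label among the given rows."""
    return [sum(y[i][j] for i in indices) for j in range(len(y[0]))]


train_counts = label_counts(train)
test_counts = label_counts(test)

# Labels with no positive sample on one side of the split.
missing_in_train = [j for j, c in enumerate(train_counts) if c == 0]
missing_in_test = [j for j, c in enumerate(test_counts) if c == 0]

print("train counts:", train_counts, "missing:", missing_in_train)
print("test counts: ", test_counts, "missing:", missing_in_test)
```

A check like this could be run after each split to flag the ones where a rare label ends up on only one side.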

opened by miguelwon 3

Getting started help

Hello and thank you for this project.

I am new to machine learning and am having a little trouble getting started with this.

If I understand correctly, this method is used when I have an unevenly distributed multilabel dataset, in order to get an evenly distributed one.

To test this, I used one of the toy examples and changed it a little so that I have an uneven distribution over 3 classes.

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np
from matplotlib import pyplot as plt


AMOUNT_OF_CLASSES = 3
X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[1,0,1], [1,1,0], [1,0,1], [0,0,1], [1,1,0], [0,0,1], [1,0,0], [1,0,0]])

If I take a look at the distribution at the beginning it will look like the following:

dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
for i in range(0,AMOUNT_OF_CLASSES):
    dis[i] = y[:,i].sum()

# Show original distribution
plt.figure(0)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)

If I now do the stratification like this:

# now go for stratification
msss = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=0.5, random_state=0)

cnt = 1
# distribution over all iterations
all_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
for train_index, test_index in msss.split(X, y):
    iter_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    for i in range(0,AMOUNT_OF_CLASSES):
        iter_dis[i] = y_train[:,i].sum()

    all_dis += iter_dis
    # Show new distribution (for the latest one at first)
    plt.figure(cnt)
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],iter_dis)

    cnt += 1

and look at the distribution at the end:


plt.figure(cnt+1)
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],all_dis)    
plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)
plt.title("Distribution after Stratification")
plt.legend(['Distribution after stratification','original distribution'])

I will get the following:

So it still looks like I do not have an even distribution among the classes.

Is this not what this is used for? How could I achieve that every class is evenly distributed over the data? Thank you very much.
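
Stratification in general aims to keep each label's share of the data roughly the same in every split, rather than to make the labels equally frequent. A minimal sketch of that expectation for the y used in this question follows; the proportional target rule is a simplification for illustration and is not taken from the library's code.

```python
# Labels from the question above (3 classes, 8 samples).
y = [[1, 0, 1], [1, 1, 0], [1, 0, 1], [0, 0, 1],
     [1, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]]
n_labels = len(y[0])

# Positive samples per label in the full data, and their share of all rows.
counts = [sum(row[j] for row in y) for j in range(n_labels)]
shares = [c / len(y) for c in counts]

# With a 50% split, a proportional split would aim for about half of each count.
train_fraction = 0.5
targets = [c * train_fraction for c in counts]

print("counts per label: ", counts)
print("share per label:  ", shares)
print("targets per split:", targets)
```

Under this reading, an uneven bar chart after splitting is expected: the split mirrors the original imbalance instead of removing it.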

opened by kevinkit 3

Possibility to do stratification with multi-output multi-class (multi-target) data

Hi, I have a multi-output multi-class (multi-target) dataset and would like to do data stratification before applying a learning algorithm. Using iterative_train_test_split from the skmultilearn library:

from skmultilearn.model_selection import iterative_train_test_split
x_train, y_train, x_test, y_test = iterative_train_test_split(x, y, test_size = 0.1)

Thank you.

opened by bundit786 2

Do we need X for `split`

Forgive me if this is a dumb question, but if I understand this library correctly, the main aim is to look at correlations between the y's and somehow accommodate that when stratifying. Is there any reason why we need the X variable when splitting? E.g. to get the indices we always have to do: train_index, test_index = next(iter(msss.split(X, y))).

Thanks in advance.

opened by sachinruk 2

Is it possible to calculate the total number of all possible splits?

Love this repo, it spares me a lot of effort.

Here is my question (or concern).

When we don't enforce any constraint when generating KFold, the number of all possible splits is the largest and simple to calculate.

When we only have one label and enforce the splits to be stratified, i.e. StratifiedKFold, this number drops, but normally will still be large enough to generate a diverse set of splits. Again, this number can be calculated with some simple combinatorics.
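
The two counts mentioned so far can be written down with factorials. This is a sketch under simplifying assumptions that are not in the original question: folds are equal-sized and unordered, and every class count divides evenly by the number of folds. Both function names are hypothetical.

```python
from math import factorial


def count_kfold_splits(n_samples, n_splits):
    """Ways to cut n_samples into n_splits equal-sized, unordered folds."""
    if n_samples % n_splits:
        raise ValueError("n_samples must be divisible by n_splits")
    fold = n_samples // n_splits
    ordered = factorial(n_samples) // factorial(fold) ** n_splits
    return ordered // factorial(n_splits)  # ignore the order of the folds


def count_stratified_splits(class_counts, n_splits):
    """Same count when every fold must hold an equal share of each class."""
    ordered = 1
    for count in class_counts:
        if count % n_splits:
            raise ValueError("each class count must be divisible by n_splits")
        per_fold = count // n_splits
        ordered *= factorial(count) // factorial(per_fold) ** n_splits
    return ordered // factorial(n_splits)


plain = count_kfold_splits(8, 2)
stratified = count_stratified_splits([4, 4], 2)
print("plain 2-fold on 8 samples:", plain)
print("stratified, two classes of 4:", stratified)
```

For 8 samples and 2 folds this gives 35 unconstrained splits and 18 stratified ones, which shows the drop described above on a small case.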

However, when stratification on multiple labels is enforced (the goal of this repo), things become more complicated, and I am worried that if there are too many labels, say hundreds of them, there won't be many possible splits that can satisfy the stratification constraint😟.

So my question is,

Does my concern make sense?

Can we calculate the total number of possibilities?

Looking forward to reply.
opened by whatever60 2

Different percentage of samples for each label after using MultilabelStratifiedKFold

Hi trent-b:

Thanks for this nice repository, hope you can reply to the questions below:

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

def multi2single_labels(y):
    # Count how often each label combination occurs.
    d = {}
    for yy in y:
        d[str(yy)] = d.get(str(yy), 0) + 1
    return d

yy = np.array([[0,0,0,0]]*318+[[1,0,0,0]]*264+[[0,0,1,0]]*58+[[0,1,0,1]]*51+\
              [[1,0,0,1]]*81+[[0,1,0,0]]*151+[[0,1,1,0]]*33+[[0,0,1,1]]*27+\
              [[0,0,0,1]]*54+[[0,1,1,1]]*21+[[1,1,0,0]]*11+[[1,1,0,1]]*7+[[1,0,1,0]]*2)
xx = np.zeros((yy.shape[0],))
kfold = MultilabelStratifiedKFold(n_splits=2, random_state=42, shuffle=True)
for idx_fold, (idx_train, idx_valid) in enumerate(kfold.split(xx, yy)):
    print(f'Now in {idx_fold}th fold')
    y_valid = yy[idx_valid]
    d_y = multi2single_labels(y_valid)
    print(f'labels of y: {d_y}')

Using the code above (the simplest 2-fold case) gives this result:

Now in 0th fold
labels of y: {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25, '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15, '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
Now in 1th fold
labels of y: {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26, '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12, '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3, '[1 0 1 0]': 2}

Q1: Why is '[1 0 1 0]' not split one per fold, but placed entirely in the 1th fold?
Q2: Why do the counts of some label combinations differ so much between folds (e.g. '[0 0 0 0]', '[1 0 0 0]')?
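
One way to read the reported output is per label rather than per label combination. The sketch below recomputes, from the two dictionaries printed in this issue, how many positive samples each of the four labels has in each fold. It is only a re-tabulation of the reported numbers (per_label_totals is a hypothetical helper) and does not by itself confirm how the library assigns samples.

```python
# Combination counts copied from the reported output.
fold_0 = {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25,
          '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15,
          '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
fold_1 = {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26,
          '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12,
          '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3,
          '[1 0 1 0]': 2}


def per_label_totals(combo_counts):
    """Sum the combination counts into one positive count per label."""
    totals = [0, 0, 0, 0]
    for key, count in combo_counts.items():
        bits = [int(b) for b in key.strip('[]').split()]
        for j, bit in enumerate(bits):
            totals[j] += bit * count
    return totals


totals_0 = per_label_totals(fold_0)
totals_1 = per_label_totals(fold_1)
print("fold 0:", totals_0, "samples:", sum(fold_0.values()))
print("fold 1:", totals_1, "samples:", sum(fold_1.values()))
```

Tabulated this way, the two folds differ by at most one sample per label, even though individual combinations such as '[1 0 1 0]' are not evenly divided.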

Thanks!
opened by Lance0218 2

Unable to create a small sample of 1000 train and 100 using MultilabelStratifiedShuffleSplit

Hi trent-b:

Thanks for this repository, hope you can help with my issue. I have a large JSON data set from which I want to create a smaller sample set using MultilabelStratifiedShuffleSplit.

import warnings

import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

def mlb_train_test_split(labels, test_size, train_size, random_state=0):
    with warnings.catch_warnings():
        # Silence FutureWarning messages while building the split.
        warnings.simplefilter("ignore", category=FutureWarning)
        msss = MultilabelStratifiedShuffleSplit(
            test_size=test_size, train_size=train_size, random_state=random_state
        )
        train_idx, test_idx = next(msss.split(np.ones_like(labels), labels))
    return train_idx, test_idx

I then call the function as:

train_idx, test_idx = mlb_train_test_split(labels, test_size=1000, train_size=200, random_state=0)

When I look at the numbers, I'm seeing way more than 200 rows. Is there a limitation? The labels length is approximately 500,000 in the dataset.
opened by meltedhead 1

Releases(0.1.7)

0.1.7(Oct 3, 2021)

Updated to handle extra parameter warnings and errors introduced as scikit-learn moved to version 1.0.

0.1.6(Aug 12, 2018)

0.1.5(Aug 12, 2018)

Made minor changes to README.md, setup.py, and .travis.yml.
