Python package for concise, transparent, and accurate predictive modeling

Overview


Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use.

📚 docs • 📖 demo notebooks

Modern machine-learning models are increasingly complex, often making them difficult to interpret. This package provides a simple interface for fitting and using state-of-the-art interpretable models, all compatible with scikit-learn. These models can often replace black-box models (e.g. random forests) with simpler models (e.g. rule lists) while improving interpretability and computational efficiency, all without sacrificing predictive accuracy! Simply import a classifier or regressor and use the fit and predict methods, same as standard scikit-learn models.

from imodels import BoostedRulesClassifier, FIGSClassifier, SkopeRulesClassifier
from imodels import RuleFitRegressor, HSTreeRegressorCV, SLIMRegressor

model = BoostedRulesClassifier()  # initialize a model
model.fit(X_train, y_train)   # fit model
preds = model.predict(X_test) # predictions: shape is (n_test, 1)
preds_proba = model.predict_proba(X_test) # predicted probabilities: shape is (n_test, n_classes)
print(model) # print the rule-based model

-----------------------------
# the model consists of the following 3 rules
# if X1 > 5: then 80.5% risk
# else if X2 > 5: then 40% risk
# else: 10% risk

Installation

Install with pip install imodels (see here for help).
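That is, from a shell:

    pip install imodels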

Supported models

Model | Reference | Description
--- | --- | ---
Rulefit rule set | 🗂️, 🔗, 📄 | Fits a sparse linear model on rules extracted from decision trees
Skope rule set | 🗂️, 🔗 | Extracts rules from gradient-boosted trees, deduplicates them, then linearly combines them based on their OOB precision
Boosted rule set | 🗂️, 🔗, 📄 | Sequentially fits a set of rules with AdaBoost
Slipper rule set | 🗂️, 📄 | Sequentially learns a set of rules with SLIPPER
Bayesian rule set | 🗂️, 🔗, 📄 | Finds a concise rule set with Bayesian sampling (slow)
Optimal rule list | 🗂️, 🔗, 📄 | Fits a rule list using global optimization for sparsity (CORELS)
Bayesian rule list | 🗂️, 🔗, 📄 | Fits a compact rule-list distribution with Bayesian sampling (slow)
Greedy rule list | 🗂️, 🔗 | Uses CART to fit a list (only a single path), rather than a tree
OneR rule list | 🗂️, 📄 | Fits a rule list restricted to only one feature
Optimal rule tree | 🗂️, 🔗, 📄 | Fits a succinct tree using global optimization for sparsity (GOSDT)
Greedy rule tree | 🗂️, 🔗, 📄 | Greedily fits a tree using CART
C4.5 rule tree | 🗂️, 🔗, 📄 | Greedily fits a tree using C4.5
Iterative random forest | 🗂️, 🔗, 📄 | Repeatedly fits a random forest, giving features with high importance a higher chance of being selected
Sparse integer linear model | 🗂️, 📄 | Sparse linear model with integer coefficients
Greedy tree sums | 🗂️, 📄 | Sum of small trees with very few total rules (FIGS)
Hierarchical shrinkage wrapper | 🗂️, 📄 | Improves any tree-based model with ultra-fast, post-hoc regularization
Distillation wrapper | 🗂️ | Trains a black-box model, then distills it into an interpretable model
More models | ⌛ | (Coming soon!) Lightweight Rule Induction, MLRules, ...

Docs 🗂️, Reference code implementation 🔗, Research paper 📄

What's the difference between the models?

Each of the above models ultimately takes one of the following forms, which aim to be simultaneously simple to understand and highly predictive:

Rule set • Rule list • Rule tree • Algebraic models

Different models and algorithms vary not only in their final form but also in the choices made during modeling, such as how they generate, select, and postprocess rules:

Rule candidate generation • Rule selection • Rule postprocessing
Ex. RuleFit vs. SkopeRules: RuleFit and SkopeRules differ only in how they prune rules. RuleFit uses a linear model, whereas SkopeRules heuristically deduplicates rules that share overlap.
Ex. Bayesian rule lists vs. greedy rule lists: these differ in how they select rules. Bayesian rule lists perform a global optimization over possible rule lists, while greedy rule lists pick splits sequentially to maximize a given criterion.
Ex. FPSkope vs. SkopeRules: FPSkope and SkopeRules differ only in how they generate candidate rules. FPSkope uses FP-Growth, whereas SkopeRules extracts rules from decision trees.
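To see these differences in practice, here is a minimal sketch (synthetic data; only estimators already listed in this README) that fits a rule-set model and a rule-list model through the same sklearn-style API:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from imodels import RuleFitClassifier, GreedyRuleListClassifier

    # toy data: the label depends only on the first feature
    X = np.random.rand(500, 4)
    y = (X[:, 0] > 0.5).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for Model in [RuleFitClassifier, GreedyRuleListClassifier]:
        model = Model()
        model.fit(X_train, y_train)
        print(Model.__name__, (model.predict(X_test) == y_test).mean())
        print(model)  # printing a fitted model shows its rules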

Demo notebooks

Demos are contained in the notebooks folder.

Quickstart demo: shows how to fit, predict, and visualize with different interpretable models
Quickstart colab demo: shows how to fit, predict, and visualize with different interpretable models
Clinical decision rule notebook: shows an example of using imodels to derive a clinical decision rule
Posthoc analysis: we also include demos of posthoc analysis, which occurs after fitting models. posthoc.ipynb shows different simple analyses to interpret a trained model, and uncertainty.ipynb contains basic code to get uncertainty estimates for a model.

Support for different tasks

Different models support different machine-learning tasks. Current support for each model is given below; each of these models can be imported directly from imodels (e.g. from imodels import RuleFitClassifier):

Model | Binary classification | Regression | Notes
--- | --- | --- | ---
Rulefit rule set | RuleFitClassifier | RuleFitRegressor |
Skope rule set | SkopeRulesClassifier | |
Boosted rule set | BoostedRulesClassifier | |
SLIPPER rule set | SlipperClassifier | |
Bayesian rule set | BayesianRuleSetClassifier | | Fails for large problems
Optimal rule list (CORELS) | OptimalRuleListClassifier | | Requires corels; fails for large problems
Bayesian rule list | BayesianRuleListClassifier | |
Greedy rule list | GreedyRuleListClassifier | |
OneR rule list | OneRClassifier | |
Optimal rule tree (GOSDT) | OptimalTreeClassifier | | Requires gosdt; fails for large problems
Greedy rule tree (CART) | GreedyTreeClassifier | GreedyTreeRegressor |
C4.5 rule tree | C45TreeClassifier | |
Iterative random forest | IRFClassifier | | Requires irf
Sparse integer linear model | SLIMClassifier | SLIMRegressor | Requires extra dependencies for speed
Greedy tree sums (FIGS) | FIGSClassifier | FIGSRegressor |
Hierarchical shrinkage | HSTreeClassifierCV | HSTreeRegressorCV | Wraps any sklearn tree-based model
Distillation | | DistilledRegressor | Wraps any sklearn-compatible models
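For instance, the hierarchical shrinkage wrapper in the table above can wrap an existing sklearn ensemble. A minimal sketch (the estimator_ keyword follows the package docs, but treat the exact signature as an assumption):

    from sklearn.ensemble import RandomForestClassifier
    from imodels import HSTreeClassifierCV

    # wrap a random forest with hierarchical shrinkage; the regularization
    # strength is selected by cross-validation (hence the CV suffix)
    model = HSTreeClassifierCV(estimator_=RandomForestClassifier(n_estimators=50))
    model.fit(X_train, y_train)
    preds = model.predict(X_test)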

Extras

• Data-wrangling functions for working with popular tabular datasets (e.g. compas). These functions, in conjunction with imodels-data and imodels-experiments, make it simple to download data and run experiments on new models (see the sketch after this list).
• Explain classification errors with a simple posthoc function: fit an interpretable model to explain a previous model's errors (e.g. in this notebook 📓).
• Fast and effective discretizers for data preprocessing:

  Discretizer | Reference | Description
  --- | --- | ---
  MDLP | 🗂️, 🔗, 📄 | Discretizes using an entropy-minimization heuristic
  Simple | 🗂️, 🔗 | Simple KBins discretization
  Random Forest | 🗂️ | Discretizes into bins based on random-forest split popularity

• Rule-based utils for customizing models: the [util folder](https://csinva.io/imodels/util/index.html) contains many useful and customizable functions for rule-based learning, including functions/classes for rule deduplication, rule screening, and converting between trees, rule sets, and neural networks.
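As a sketch of the data-loading workflow mentioned in the first bullet above (the get_clean_dataset helper, its return signature, and the dataset name are recalled from the package docs and should all be treated as assumptions):

    import imodels

    # download a cleaned tabular dataset by name (assumed helper and dataset name)
    X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')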

Our favorite models

After developing and playing with imodels, we created a few new models to overcome limitations of existing interpretable models.

FIGS: Fast interpretable greedy-tree sums

📄 Paper, 🔗 Post, 📌 Citation

Fast Interpretable Greedy-Tree Sums (FIGS) is an algorithm for fitting concise rule-based models. Specifically, FIGS generalizes CART to simultaneously grow a flexible number of trees in a summation. The total number of splits across all the trees can be restricted by a pre-specified threshold, keeping the model interpretable. Experiments across a wide array of real-world datasets show that FIGS achieves state-of-the-art prediction performance when restricted to just a few splits (e.g. less than 20).

Example FIGS model. FIGS learns a flexible number of trees in a sum; to make its prediction, it sums the result from each tree.
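A minimal usage sketch (the max_rules cap is an assumption, consistent with the pre-specified threshold on total splits described above):

    from imodels import FIGSClassifier

    model = FIGSClassifier(max_rules=10)  # assumed kwarg: cap on total splits across all trees
    model.fit(X_train, y_train)
    print(model)                   # prints each tree in the sum
    preds = model.predict(X_test)  # sums the contribution of each tree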

Hierarchical shrinkage: post-hoc regularization for tree-based methods

📄 Paper, 🔗 Post, 📌 Citation

Hierarchical shrinkage is an extremely fast post-hoc regularization method that works on any decision tree (or tree-based ensemble, such as a random forest). It does not modify the tree structure; instead, it regularizes the tree by shrinking the prediction at each node towards the sample means of its ancestors (using a single regularization parameter). Experiments over a wide variety of datasets show that hierarchical shrinkage substantially increases the predictive performance of individual decision trees and decision-tree ensembles.
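Concretely (our notation, following the paper's description; a sketch of the method rather than the exact implementation): for a query point $x$ whose root-to-leaf path is $t_0, t_1, \ldots, t_L$, with $N(t)$ the number of training samples in node $t$ and $\hat{\mathbb{E}}[y \mid t]$ the node's sample mean, hierarchical shrinkage with regularization parameter $\lambda$ replaces the leaf prediction by

$$\hat{f}(x) = \hat{\mathbb{E}}[y \mid t_0] + \sum_{l=1}^{L} \frac{\hat{\mathbb{E}}[y \mid t_l] - \hat{\mathbb{E}}[y \mid t_{l-1}]}{1 + \lambda / N(t_{l-1})}$$

so each node-to-node refinement is shrunk more aggressively when its parent node contains few samples.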

References

Readings
  • Good quick overview of interpretable ML: murdoch et al. 2019, pdf
  • Interpretable ML book: molnar 2019, pdf
  • Case for interpretable models rather than post-hoc explanation: rudin 2019, pdf
  • Review on evaluating interpretability: doshi-velez & kim 2017, pdf
Reference implementations (also linked above) The code here heavily derives from the wonderful work of previous projects. We seek to extract, unify, and maintain key parts of these projects.
Related packages
  • gplearn: symbolic regression/classification
  • pysr: fast symbolic regression
  • pygam: generalized additive models
  • interpretml: boosting-based gam
  • h2o ai: gams + glms (and more)
  • optbinning: data discretization / scoring models
Updates
  • For updates, star the repo, see this related repo, or follow @csinva_
  • Please make sure to give authors of original methods / base implementations appropriate credit!
  • Contributing: pull requests very welcome!

If it's useful for you, please star/cite the package, and make sure to give authors of original methods / base implementations credit:

@software{imodels2021,
    title        = {{imodels: a python package for fitting interpretable models}},
    journal      = {Journal of Open Source Software},
    publisher    = {The Open Journal},
    year         = {2021},
    author       = {Singh, Chandan and Nasseri, Keyan and Tan, Yan Shuo and Tang, Tiffany and Yu, Bin},
    volume       = {6},
    number       = {61},
    pages        = {3192},
    doi          = {10.21105/joss.03192},
    url          = {https://doi.org/10.21105/joss.03192},
}
Comments
  • Added Gini Importances

    Added Gini Importances

    Hi @csinva how do the new Gini importances look?

    I based the calculation off sklearn's code from here and here, though it needed to be made recursive as we do not have arrays of all the nodes and their properties.

    There is a demo of the new code in the FIGS_viz_demo.ipynb notebook. I am a bit concerned with the None impurity in the root node of the second tree:

    node_id: 0, left.node_id: 1, right.node_id: 2, impurity: None
    

    I filled it with 0 for the calculation for now:

                    importance_data_tree[node.feature] += (
                        np.sum(node.value_sklearn) * (node.impurity if node.impurity is not None else 0.) -
                        np.sum(node.left.value_sklearn) * node.left.impurity -
                        np.sum(node.right.value_sklearn) * node.right.impurity
                    )
    

    Is None expected if the tree has just one split?

    Also, after taking the mean and normalizing, most of the importances are negative. I think this is fine, as we just care about the relative order of the features, but wanted to get your opinion as well.

    BTW I noticed that we have an unused variable in plot():

    criterion = "squared_error" if isinstance(self, RegressorMixin) else "gini"
    

    Is this needed for anything, or should we delete it?

    opened by mepland 15
  • Fixed FIGS plotting

    Fixed FIGS plotting

    Fixed Issue 132, FIGS plots not appearing correctly.

    The primary bug was in the assignment of node ids here.

                right = next(node_counter)
                left = next(node_counter)
    

    They were being improperly set during the recursion of _update_node(nd). I've fixed this by assigning a new node_num variable after the trees are created during fit() here and using that instead:

            # add node_num to final tree
            for tree_ in self.trees_:
                node_counter = iter(range(0, int(1e06)))
                def _add_node_num(node: Node):
                    if node is None:
                        return
                    node.setattrs(node_num=next(node_counter))
                    _add_node_num(node.left)
                    _add_node_num(node.right)
    
                _add_node_num(tree_)
    

    I also took the opportunity to return a real sklearn DecisionTreeClassifier or DecisionTreeRegressor object, filling in the parameters, including tree_, with the __setstate__() method, building on this SO question. To do this, I needed the impurity at each node and the "value" as expected by sklearn, i.e. value = np.array([neg_count, pos_count], dtype=float). If we further rewrite the FIGS class to save this 2D "value" alongside the current value, perhaps as value_sklearn, I wouldn't need X_train, y_train for the extract_sklearn_tree_from_figs function and the subsequent plotting functions.

    @csinva does my implementation of the impurity variable look correct? I see the impurities are recomputed after I grab my impurity values, so I expect not. Perhaps you could fix this, or let me know the best way to get the final impurity at each node? I'll also wait for the go ahead on adding the value_sklearn variable, and refactoring away the dependence on X_train, y_train in the plotting functions.

    opened by mepland 8
  • 'BoostedRulesClassifier' object has no attribute 'complexity_'

    'BoostedRulesClassifier' object has no attribute 'complexity_'

    After imodels was updated to 1.3.8, we got the error message 'BoostedRulesClassifier' object has no attribute 'complexity_'. Was this attribute removed or renamed? It is generally better to keep public APIs/attributes unchanged during minor releases; any plan to add it back?

    opened by yinweisu 6
  • FIGS Fixes

    FIGS Fixes

    • Added SKompiler integration, which required the new n_features_in_ member variable.
      • Note the demo FIGS model currently requires https://github.com/mepland/SKompiler/tree/fixes to run, which fixes a bug in SKompiler. TLDR: SKompiler was not letting trees run if they use fewer than all the available features, like the demo FIGS tree 0.
    • Fixed bug in n_features
    -    n_features = np.unique(features[np.where( 0 < features )]).size
    +    n_features = np.unique(features[np.where( 0 <= features )]).size
    
    • Improved markdown comments in FIGS_viz_demo.ipynb
    opened by mepland 4
  • HSTree Multiclass Classification Support

    HSTree Multiclass Classification Support

    Does HSTree support multiclass classification problems with RandomForest / ExtraTrees as the estimator?

    From my initial tests it appears buggy. Calling predict_proba with the final model results in lots of NaN predictions, along with warnings during training such as:

    /Users/neerick/workspace/virtual/autogluon/lib/python3.8/site-packages/imodels/tree/hierarchical_shrinkage.py:87: RuntimeWarning: invalid value encountered in double_scalars
      val = tree.value[i][0, 1] / (tree.value[i][0, 0] + tree.value[i][0, 1])  # binary classification
    

    If helpful I can try to create a reproducible example.

    Here is an example result comparing with sklearn default RF (_og_) with accuracy metric. Because HSTree returns many NaN predictions, the scores are very low.

    One observation is that the scores get worse the more trees there are in HSTree forests. I'd guess the likelihood of returning a NaN result increases with the number of trees.

                           model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
    0       RandomForest_og_n300    0.711651   0.723618        0.985573       0.050956  0.519926                 0.985573                0.050956           0.519926            1       True          1
    1       RandomForest_og_n100    0.710154   0.748744        0.453769       0.019050  0.170951                 0.453769                0.019050           0.170951            1       True          2
    2        WeightedEnsemble_L2    0.710154   0.748744        0.464755       0.019376  0.295161                 0.010986                0.000326           0.124210            2       True         36
    3        RandomForest_og_n40    0.700636   0.698492        0.193009       0.010738  0.088012                 0.193009                0.010738           0.088012            1       True          3
    4        RandomForest_og_n20    0.692039   0.698492        0.103616       0.007549  0.057396                 0.103616                0.007549           0.057396            1       True          4
    5        RandomForest_og_n10    0.674165   0.688442        0.075296       0.006166  0.041720                 0.075296                0.006166           0.041720            1       True          5
    6     RandomForest_hs=10_n10    0.521949   0.537688        0.070260       0.005246  0.082384                 0.070260                0.005246           0.082384            1       True         15
    7     RandomForest_hs=50_n10    0.520839   0.517588        0.075151       0.004875  0.071219                 0.075151                0.004875           0.071219            1       True         20
    8    RandomForest_hs=0.1_n10    0.520796   0.537688        0.074070       0.005233  0.093299                 0.074070                0.005233           0.093299            1       True         35
    9      RandomForest_hs=1_n10    0.520692   0.542714        0.077687       0.005690  0.075061                 0.077687                0.005690           0.075061            1       True         10
    10   RandomForest_hs=100_n10    0.519246   0.517588        0.075059       0.006019  0.082536                 0.075059                0.006019           0.082536            1       True         25
    11   RandomForest_hs=500_n10    0.488877   0.517588        0.072145       0.005125  0.072223                 0.072145                0.005125           0.072223            1       True         30
    12     RandomForest_hs=1_n20    0.485125   0.472362        0.113002       0.006484  0.123639                 0.113002                0.006484           0.123639            1       True          9
    13   RandomForest_hs=0.1_n20    0.485005   0.472362        0.111342       0.005953  0.146246                 0.111342                0.005953           0.146246            1       True         34
    14    RandomForest_hs=10_n20    0.484833   0.482412        0.104076       0.006577  0.131909                 0.104076                0.006577           0.131909            1       True         14
    15    RandomForest_hs=50_n20    0.482896   0.482412        0.115057       0.006263  0.130512                 0.115057                0.006263           0.130512            1       True         19
    16   RandomForest_hs=100_n20    0.480840   0.482412        0.108625       0.006045  0.135224                 0.108625                0.006045           0.135224            1       True         24
    17   RandomForest_hs=500_n20    0.458035   0.467337        0.108658       0.006302  0.123907                 0.108658                0.006302           0.123907            1       True         29
    18     RandomForest_hs=1_n40    0.451434   0.467337        0.185129       0.010619  0.210639                 0.185129                0.010619           0.210639            1       True          8
    19   RandomForest_hs=0.1_n40    0.451382   0.467337        0.170597       0.009024  0.244322                 0.170597                0.009024           0.244322            1       True         33
    20    RandomForest_hs=10_n40    0.451322   0.467337        0.173382       0.009955  0.210795                 0.173382                0.009955           0.210795            1       True         13
    21    RandomForest_hs=50_n40    0.450350   0.467337        0.170041       0.008673  0.236081                 0.170041                0.008673           0.236081            1       True         18
    22   RandomForest_hs=100_n40    0.449119   0.467337        0.169396       0.010918  0.226784                 0.169396                0.010918           0.226784            1       True         23
    23   RandomForest_hs=500_n40    0.435832   0.472362        0.162881       0.009256  0.202447                 0.162881                0.009256           0.202447            1       True         28
    24    RandomForest_hs=1_n100    0.420419   0.452261        0.442328       0.017688  0.480776                 0.442328                0.017688           0.480776            1       True          7
    25  RandomForest_hs=0.1_n100    0.420411   0.452261        0.354523       0.018247  0.548557                 0.354523                0.018247           0.548557            1       True         32
    26   RandomForest_hs=10_n100    0.419981   0.452261        0.355097       0.017487  0.469547                 0.355097                0.017487           0.469547            1       True         12
    27   RandomForest_hs=50_n100    0.419034   0.447236        0.344341       0.021125  0.465810                 0.344341                0.021125           0.465810            1       True         17
    28  RandomForest_hs=100_n100    0.418672   0.447236        0.372041       0.018402  0.477048                 0.372041                0.018402           0.477048            1       True         22
    29  RandomForest_hs=500_n100    0.415256   0.457286        0.338696       0.017128  0.492786                 0.338696                0.017128           0.492786            1       True         27
    30  RandomForest_hs=0.1_n300    0.381049   0.391960        0.967061       0.045552  1.533075                 0.967061                0.045552           1.533075            1       True         31
    31   RandomForest_hs=10_n300    0.381049   0.391960        1.109062       0.054005  1.442369                 1.109062                0.054005           1.442369            1       True         11
    32    RandomForest_hs=1_n300    0.381040   0.391960        1.677277       0.055421  2.346773                 1.677277                0.055421           2.346773            1       True          6
    33   RandomForest_hs=50_n300    0.380945   0.391960        0.889030       0.053650  1.320377                 0.889030                0.053650           1.320377            1       True         16
    34  RandomForest_hs=100_n300    0.380885   0.391960        1.031198       0.045266  1.254918                 1.031198                0.045266           1.254918            1       True         21
    35  RandomForest_hs=500_n300    0.380816   0.391960        0.948715       0.050209  1.266396                 0.948715                0.050209           1.266396            1       True         26
    
    
    enhancement 
    opened by Innixma 4
  • Two exactly identical rules from RulefitClassifier

    Two exactly identical rules from RulefitClassifier

    Hello~

    When I use the RulefitClassifier, it returns two exactly identical rules but with different coefficients; shouldn't the internal structure aggregate duplicate rules? I have tried using RuleFit directly, and it doesn't seem to have a similar problem.

    The following image is part of my result.

    bug 
    opened by Yannahhh 4
  • BoostedRulesClassifier sometimes throws an exception

    BoostedRulesClassifier sometimes throws an exception

    Hi,

    When I use the BoostedRulesClassifier, it sometimes throws an exception as follows:

    This BoostedRulesClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

    I find that the exception results from the implementation of the class RuleSet:

        def _eval_weighted_rule_sum(self, X) -> np.ndarray:
            check_is_fitted(self, ['rules_without_feature_names_', 'n_features_', 'feature_placeholders'])

            X = check_array(X)

            if X.shape[1] != self.n_features_:
                raise ValueError("X.shape[1] = %d should be equal to %d, the number of features at training time."
                                 " Please reshape your data."
                                 % (X.shape[1], self.n_features_))

            df = pd.DataFrame(X, columns=self.feature_placeholders)
            selected_rules = self.rules_without_feature_names_

            scores = np.zeros(X.shape[0])
            for r in selected_rules:
                features_r_uses = list(map(lambda x: x[0], r.agg_dict.keys()))
                scores[df[features_r_uses].query(str(r)).index.values] += r.args[0]

            return scores
    

    Specifically, when check_is_fitted(self, ['rules_without_feature_names_', 'n_features_', 'feature_placeholders']) runs, it finds that self.rules_without_feature_names_ does not exist, so the above exception is thrown.

    Reviewing my code and data set further, I find that my training set is easy to fit, so the training error of the estimator is close to zero; this can trigger a bug in the fit function of the class BoostedRulesClassifier:

        for _ in range(self.n_estimators):
            # Fit a classifier with the specific weights
            clf = self.estimator()
            clf.fit(X, y, sample_weight=w)  # uses w as the sampling weight!
            preds = clf.predict(X)
            self.estimator_mean_prediction_.append(np.mean(preds))  # just for printing

            # Indicator function
            miss = preds != y

            # Equivalent with 1/-1 to update weights
            miss2 = np.ones(miss.size)
            miss2[~miss] = -1

            # Error
            err_m = np.dot(w, miss) / sum(w)

            if err_m < 1e-3:
                return self

            # Alpha
            alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))

            # New weights
            w = np.multiply(w, np.exp([float(x) * alpha_m for x in miss2]))

            self.estimators_.append(deepcopy(clf))
            self.estimator_weights_.append(alpha_m)
            self.estimator_errors_.append(err_m)

        rules = []

    Because err_m is zero, the method returns self without executing the subsequent statements; in such a case, self.rules_without_feature_names_ does not exist.

    My current solution to this bug is to modify the following code fragment in the fit function of the class BoostedRulesClassifier:

        # Error
        err_m = np.dot(w, miss) / sum(w)

        # modification ###########################
        if err_m < 1e-3:
            # return self
            w = np.ones(miss.size) / len(y)
            self.estimators_.append(deepcopy(clf))
            self.estimator_weights_.append(float("inf"))
            self.estimator_errors_.append(err_m)
            break
        ##########################################
        # Alpha
        alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))

    I'm not sure whether it may introduce new defects, but it does resolve the exception.

    opened by Wan-xiaohui 3
  • GreedyRuleListClassifier has wildly varying performance and sometimes crashes

    GreedyRuleListClassifier has wildly varying performance and sometimes crashes

    When running a number of experiments with different splits of a given dataset, I see that GreedyRuleListClassifier's accuracy varies wildly, and sometimes the code (see the for loop below) crashes.

    So, for example running 10 experiments like this, with different random splits of the same set:

    import pandas
    import sklearn
    import sklearn.datasets
    from sklearn.model_selection import train_test_split
    
    from imodels import GreedyRuleListClassifier
    
    X, Y = sklearn.datasets.load_breast_cancer(as_frame=True, return_X_y=True)
    
    model = GreedyRuleListClassifier(max_depth=10)
    
    for i in range(10):
      try:
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)
        model.fit(X_train, y_train, feature_names=X_train.columns)
        y_pred = model.predict(X_test)
        from sklearn.metrics import accuracy_score
        score = accuracy_score(y_test.values,y_pred)
        print('Accuracy:\n', score)
      except KeyError as e:
        print("Failed with KeyError")
    

    Will give as output something along the lines of

    Accuracy: 0.6081871345029239
    Failed with KeyError
    Accuracy: 0.4619883040935672
    Accuracy: 0.45614035087719296
    Accuracy: 0.2222222222222222
    Failed with KeyError
    Failed with KeyError
    Failed with KeyError
    Accuracy: 0.18128654970760233
    Failed with KeyError
    

    Is this intended behavior? While my test dataset is smallish, the variation in accuracy is still surprising to me, and so is the KeyError. I'm using scikit-learn==1.0.2 and imodels==1.3.6 and can edit the issue here to add more details.

    Incidentally, the same behaviour was observed in https://datascience.stackexchange.com/a/116283/50519, noticed by @jonnor.

    Thanks!

    opened by davidefiocco 3
  • Issue with feature_names in GreedyRuleListClassifier

    Issue with feature_names in GreedyRuleListClassifier

    When I set feature_names=X.columns, only the first feature appears in the rule list; the others appear as 'feat i'. I am unable to fix this and request your kind support.

    Here is the output snippet:

        Selected features: Index(['Processor(P99)_Q', 'Opto(F99)_Q', 'Logic(L99)_Am', 'Qualcom', 'Toshiba', 'ABB', 'Whirlpool', 'Honeywell'], dtype='object')
        mean 0.6 (30 pts)
        if Whirlpool >= 153 then 1.0 (16 pts)
        mean 0.143 (14 pts)
        if feat 1 >= 16882885 then 1.0 (2 pts)
        mean 0.0 (12 pts)

    opened by pauldebdeep9 3
  • Complexity comparisons

    Complexity comparisons

    • Compare all models on several UCI datasets
    • Generate complexity-accuracy plots for each model
    • Cache comparison results for convenience
    • Set self.complexity when fitting models
    opened by keyan3 3
  • Test fixes

    Test fixes

    • Fixed an issue where the GitHub build would pass even if the tests actually failed
    • Added missing random seeding in Skope

    I skipped testing predict_proba for Skope altogether. The thought behind this is that even if you write a predict_proba that uses eval_weighted_rule_sum, it still won't match the predictions, since Skope predicts based only on whether the score is positive or not. I'm not sure if our Skope needs to have this method at all (the original Skope implementation doesn't).

    opened by keyan3 3
  • FIGS Demo Notebook Update

    FIGS Demo Notebook Update

    @csinva let's wait on merging this for a few weeks, until both imodels and dtreeviz release new minor versions. I have a few changes I want to make then:

    • Remove path to ~/imodels
    • Use 'leaftype': 'barh'
    • Update color scheme
    • Possibly add numeric leaf predictions and split visualizations
    opened by mepland 0
  • Full sample_weight support for FIGS

    Full sample_weight support for FIGS

    Some parts of FIGS do not support sample_weight, including the extract_sklearn_tree_from_figs() function and feature_importances_.

    Originally posted by @mepland in https://github.com/csinva/imodels/issues/89#issuecomment-1367595878

    opened by mepland 0
  • Implement Dynamic CDI

    Implement Dynamic CDI

    Implementing a Dynamic CDIs class based on FIGS.

    TODOs:

    • [ ] Implement a sklearn compatible class named D-FIGS in a new file imodels/tree/dynamic_figs.py
    • [ ] Write a test using the PECARN IAI dataset

    More details:

    • The D-FIGS class should inherit from the FIGS class and take an additional dictionary at initialization, corresponding to the feature phases. When applying the fit or predict methods, the class should verify that the matrix $X$ is compatible with the feature tiers. For example, phase-2 features can be available (not NA) only if all phase-1 features are available (we may refine this logic later).
    • D-FIGS should infer the phase from the matrix.
    • The tests should be written in a new file named imodels/tests/dynamic_figs_test.py, using pytest (see the package documentation, or use the FIGS test as a reference).
    • Before you start writing code, please write down a short description detailing how you are going to implement the dynamic fitting algorithm. Specifically: How does the model infer the current phase of the patient? How do you store the different models for different phases and ensure these are compatible with one another?

    @aagarwal1996

    opened by OmerRonen 1
  • Add support for `dtreeviz` visualizations

    Add support for `dtreeviz` visualizations

    Add any required translation code to allow imodels trees to be plotted with dtreeviz. This basically boils down to successfully generating a ShadowDecTree object from an imodels tree.

    We can reuse the existing ShadowSKDTree constructor by converting imodels trees into sklearn objects, then calling:

    sk_dtree = ShadowSKDTree(tree_classifier, X, y, features, target, [0, 1])
    

    Alternatively, we can make an imodels specific implementation of ShadowDecTree, similar to the sklearn implementation here, but that may be more work than necessary.

    opened by mepland 0
  • RuleFitClassifier(tree_generator = GradientBoostingClassifier()) not working as per documentation

    RuleFitClassifier(tree_generator = GradientBoostingClassifier()) not working as per documentation

    Hi,

    When using RuleFitClassifier(tree_generator = GradientBoostingClassifier()) with a GradientBoostingClassifier() object fitted and optimized separately via the scikit-learn API, the following error is returned when fitting RuleFitClassifier(tree_generator = GradientBoostingClassifier()):

    ValueError: n_estimators=1 must be larger or equal to estimators_.shape[0]=100 when warm_start==True

    When inspecting what's inside RuleFitClassifier(tree_generator = GradientBoostingClassifier()) after fitting the model, the GradientBoostingClassifier() has been completely modified to parameters different from those optimized before fitting RuleFitClassifier(), i.e., GradientBoostingClassifier(max_leaf_nodes=4, n_estimators=1, random_state=0, warm_start=True). Not sure why these parameters (from the GradientBoostingClassifier()) are changed inside the RuleFitClassifier() object.

    If RuleFitClassifier(tree_generator = None), everything works well.

    As per documentation:

    tree_generator : Optional: this object will be used as provided to generate the rules. This will override almost all the other properties above. Must be GradientBoostingRegressor(), GradientBoostingClassifier(), or RandomForestRegressor()

    • Which properties of RuleFitClassifier() are overridden if tree_generator=GradientBoostingClassifier()?
    • Why does this behavior occur?

    Here is the closest solution I found, in Issue #34; however, the behavior is still not clear.

    Any help will be highly appreciated.

    Many thanks!

    opened by Manuelhrokr 0
Releases (v1.3.11)
Owner: Chandan Singh
Working on interpretable machine learning across domains 🧠⚕️🦠. Let's do good with models.