Python package for concise, transparent, and accurate predictive modeling

Overview


Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use.

📚 docs • 📖 demo notebooks

Modern machine-learning models are increasingly complex, often making them difficult to interpret. This package provides a simple interface for fitting and using state-of-the-art interpretable models, all compatible with scikit-learn. These models can often replace black-box models (e.g. random forests) with simpler models (e.g. rule lists) while improving interpretability and computational efficiency, all without sacrificing predictive accuracy! Simply import a classifier or regressor and use the fit and predict methods, same as standard scikit-learn models.

from imodels import BoostedRulesClassifier, FIGSClassifier, SkopeRulesClassifier
from imodels import RuleFitRegressor, HSTreeRegressorCV, SLIMRegressor

model = BoostedRulesClassifier()  # initialize a model
model.fit(X_train, y_train)   # fit model
preds = model.predict(X_test) # predictions: shape is (n_test, 1)
preds_proba = model.predict_proba(X_test) # predicted probabilities: shape is (n_test, n_classes)
print(model) # print the rule-based model

-----------------------------
# the model consists of the following 3 rules
# if X1 > 5: then 80.5% risk
# else if X2 > 5: then 40% risk
# else: 10% risk

Installation

Install with pip install imodels (see here for help).
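That is, from a shell:

    pip install imodels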

Supported models

Model | Reference | Description
--- | --- | ---
Rulefit rule set | 🗂️, 🔗, 📄 | Fits a sparse linear model on rules extracted from decision trees
Skope rule set | 🗂️, 🔗 | Extracts rules from gradient-boosted trees, deduplicates them, then linearly combines them based on their OOB precision
Boosted rule set | 🗂️, 🔗, 📄 | Sequentially fits a set of rules with AdaBoost
Slipper rule set | 🗂️, 📄 | Sequentially learns a set of rules with SLIPPER
Bayesian rule set | 🗂️, 🔗, 📄 | Finds a concise rule set with Bayesian sampling (slow)
Optimal rule list | 🗂️, 🔗, 📄 | Fits a rule list using global optimization for sparsity (CORELS)
Bayesian rule list | 🗂️, 🔗, 📄 | Fits a compact rule-list distribution with Bayesian sampling (slow)
Greedy rule list | 🗂️, 🔗 | Uses CART to fit a list (only a single path), rather than a tree
OneR rule list | 🗂️, 📄 | Fits a rule list restricted to only one feature
Optimal rule tree | 🗂️, 🔗, 📄 | Fits a succinct tree using global optimization for sparsity (GOSDT)
Greedy rule tree | 🗂️, 🔗, 📄 | Greedily fits a tree using CART
C4.5 rule tree | 🗂️, 🔗, 📄 | Greedily fits a tree using C4.5
Iterative random forest | 🗂️, 🔗, 📄 | Repeatedly fits a random forest, giving features with high importance a higher chance of being selected
Sparse integer linear model | 🗂️, 📄 | Sparse linear model with integer coefficients
Greedy tree sums | 🗂️, 📄 | Sum of small trees with very few total rules (FIGS)
Hierarchical shrinkage wrapper | 🗂️, 📄 | Improves any tree-based model with ultra-fast, post-hoc regularization
Distillation wrapper | 🗂️ | Trains a black-box model, then distills it into an interpretable model
More models | ⌛ | (Coming soon!) Lightweight Rule Induction, MLRules, ...

Docs 🗂️, Reference code implementation 🔗, Research paper 📄

What's the difference between the models?

Each of the above models ultimately takes one of the following forms, which aim to be simultaneously simple to understand and highly predictive:

Rule set • Rule list • Rule tree • Algebraic models

Different models and algorithms vary not only in their final form but also in the choices made during modeling, such as how they generate, select, and postprocess rules:

Rule candidate generation • Rule selection • Rule postprocessing
Ex. RuleFit vs. SkopeRules: RuleFit and SkopeRules differ only in how they prune rules. RuleFit uses a linear model, whereas SkopeRules heuristically deduplicates rules that share overlap.
Ex. Bayesian rule lists vs. greedy rule lists: these differ in how they select rules. Bayesian rule lists perform a global optimization over possible rule lists, while greedy rule lists pick splits sequentially to maximize a given criterion.
Ex. FPSkope vs. SkopeRules: FPSkope and SkopeRules differ only in how they generate candidate rules. FPSkope uses FP-Growth, whereas SkopeRules extracts rules from decision trees.
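To see these differences in practice, here is a minimal sketch (synthetic data; only estimators already listed in this README) that fits a rule-set model and a rule-list model through the same sklearn-style API:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from imodels import RuleFitClassifier, GreedyRuleListClassifier

    # toy data: the label depends only on the first feature
    X = np.random.rand(500, 4)
    y = (X[:, 0] > 0.5).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for Model in [RuleFitClassifier, GreedyRuleListClassifier]:
        model = Model()
        model.fit(X_train, y_train)
        print(Model.__name__, (model.predict(X_test) == y_test).mean())
        print(model)  # printing a fitted model shows its rules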

Demo notebooks

Demos are contained in the notebooks folder.

Quickstart demo: shows how to fit, predict, and visualize with different interpretable models
Quickstart colab demo: shows how to fit, predict, and visualize with different interpretable models
Clinical decision rule notebook: shows an example of using imodels to derive a clinical decision rule
Posthoc analysis: we also include demos of posthoc analysis, which occurs after fitting models. posthoc.ipynb shows different simple analyses to interpret a trained model, and uncertainty.ipynb contains basic code to get uncertainty estimates for a model.

Support for different tasks

Different models support different machine-learning tasks. Current support for each model is given below; each of these models can be imported directly from imodels (e.g. from imodels import RuleFitClassifier):

Model | Binary classification | Regression | Notes
--- | --- | --- | ---
Rulefit rule set | RuleFitClassifier | RuleFitRegressor |
Skope rule set | SkopeRulesClassifier | |
Boosted rule set | BoostedRulesClassifier | |
SLIPPER rule set | SlipperClassifier | |
Bayesian rule set | BayesianRuleSetClassifier | | Fails for large problems
Optimal rule list (CORELS) | OptimalRuleListClassifier | | Requires corels; fails for large problems
Bayesian rule list | BayesianRuleListClassifier | |
Greedy rule list | GreedyRuleListClassifier | |
OneR rule list | OneRClassifier | |
Optimal rule tree (GOSDT) | OptimalTreeClassifier | | Requires gosdt; fails for large problems
Greedy rule tree (CART) | GreedyTreeClassifier | GreedyTreeRegressor |
C4.5 rule tree | C45TreeClassifier | |
Iterative random forest | IRFClassifier | | Requires irf
Sparse integer linear model | SLIMClassifier | SLIMRegressor | Requires extra dependencies for speed
Greedy tree sums (FIGS) | FIGSClassifier | FIGSRegressor |
Hierarchical shrinkage | HSTreeClassifierCV | HSTreeRegressorCV | Wraps any sklearn tree-based model
Distillation | | DistilledRegressor | Wraps any sklearn-compatible models
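For instance, the hierarchical shrinkage wrapper in the table above can wrap an existing sklearn ensemble. A minimal sketch (the estimator_ keyword follows the package docs, but treat the exact signature as an assumption):

    from sklearn.ensemble import RandomForestClassifier
    from imodels import HSTreeClassifierCV

    # wrap a random forest with hierarchical shrinkage; the regularization
    # strength is selected by cross-validation (hence the CV suffix)
    model = HSTreeClassifierCV(estimator_=RandomForestClassifier(n_estimators=50))
    model.fit(X_train, y_train)
    preds = model.predict(X_test)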

Extras

• Data-wrangling functions for working with popular tabular datasets (e.g. compas). These functions, in conjunction with imodels-data and imodels-experiments, make it simple to download data and run experiments on new models (see the sketch after this list).
• Explain classification errors with a simple posthoc function: fit an interpretable model to explain a previous model's errors (e.g. in this notebook 📓).
• Fast and effective discretizers for data preprocessing:

  Discretizer | Reference | Description
  --- | --- | ---
  MDLP | 🗂️, 🔗, 📄 | Discretizes using an entropy-minimization heuristic
  Simple | 🗂️, 🔗 | Simple KBins discretization
  Random Forest | 🗂️ | Discretizes into bins based on random-forest split popularity

• Rule-based utils for customizing models: the [util folder](https://csinva.io/imodels/util/index.html) contains many useful and customizable functions for rule-based learning, including functions/classes for rule deduplication, rule screening, and converting between trees, rule sets, and neural networks.
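As a sketch of the data-loading workflow mentioned in the first bullet above (the get_clean_dataset helper, its return signature, and the dataset name are recalled from the package docs and should all be treated as assumptions):

    import imodels

    # download a cleaned tabular dataset by name (assumed helper and dataset name)
    X, y, feature_names = imodels.get_clean_dataset('compas_two_year_clean', data_source='imodels')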

Our favorite models

After developing and playing with imodels, we created a few new models to overcome limitations of existing interpretable models.

FIGS: Fast interpretable greedy-tree sums

📄 Paper, 🔗 Post, 📌 Citation

Fast Interpretable Greedy-Tree Sums (FIGS) is an algorithm for fitting concise rule-based models. Specifically, FIGS generalizes CART to simultaneously grow a flexible number of trees in a summation. The total number of splits across all the trees can be restricted by a pre-specified threshold, keeping the model interpretable. Experiments across a wide array of real-world datasets show that FIGS achieves state-of-the-art prediction performance when restricted to just a few splits (e.g. less than 20).

Example FIGS model. FIGS learns a flexible number of trees in a sum; to make its prediction, it sums the result from each tree.
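A minimal usage sketch (the max_rules cap is an assumption, consistent with the pre-specified threshold on total splits described above):

    from imodels import FIGSClassifier

    model = FIGSClassifier(max_rules=10)  # assumed kwarg: cap on total splits across all trees
    model.fit(X_train, y_train)
    print(model)                   # prints each tree in the sum
    preds = model.predict(X_test)  # sums the contribution of each tree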

Hierarchical shrinkage: post-hoc regularization for tree-based methods

📄 Paper, 🔗 Post, 📌 Citation

Hierarchical shrinkage is an extremely fast post-hoc regularization method that works on any decision tree (or tree-based ensemble, such as a random forest). It does not modify the tree structure; instead, it regularizes the tree by shrinking the prediction at each node towards the sample means of its ancestors (using a single regularization parameter). Experiments over a wide variety of datasets show that hierarchical shrinkage substantially increases the predictive performance of individual decision trees and decision-tree ensembles.
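Concretely (our notation, following the paper's description; a sketch of the method rather than the exact implementation): for a query point $x$ whose root-to-leaf path is $t_0, t_1, \ldots, t_L$, with $N(t)$ the number of training samples in node $t$ and $\hat{\mathbb{E}}[y \mid t]$ the node's sample mean, hierarchical shrinkage with regularization parameter $\lambda$ replaces the leaf prediction by

$$\hat{f}(x) = \hat{\mathbb{E}}[y \mid t_0] + \sum_{l=1}^{L} \frac{\hat{\mathbb{E}}[y \mid t_l] - \hat{\mathbb{E}}[y \mid t_{l-1}]}{1 + \lambda / N(t_{l-1})}$$

so each node-to-node refinement is shrunk more aggressively when its parent node contains few samples.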

References

Readings
  • Good quick overview of interpretable ML: murdoch et al. 2019, pdf
  • Interpretable ML book: molnar 2019, pdf
  • Case for interpretable models rather than post-hoc explanation: rudin 2019, pdf
  • Review on evaluating interpretability: doshi-velez & kim 2017, pdf
Reference implementations (also linked above) The code here heavily derives from the wonderful work of previous projects. We seek to extract, unify, and maintain key parts of these projects.
Related packages
  • gplearn: symbolic regression/classification
  • pysr: fast symbolic regression
  • pygam: generalized additive models
  • interpretml: boosting-based gam
  • h2o ai: gams + glms (and more)
  • optbinning: data discretization / scoring models
Updates
  • For updates, star the repo, see this related repo, or follow @csinva_
  • Please make sure to give authors of original methods / base implementations appropriate credit!
  • Contributing: pull requests very welcome!

If it's useful for you, please star/cite the package, and make sure to give authors of original methods / base implementations credit:

@software{imodels2021,
    title        = {{imodels: a python package for fitting interpretable models}},
    journal      = {Journal of Open Source Software},
    publisher    = {The Open Journal},
    year         = {2021},
    author       = {Singh, Chandan and Nasseri, Keyan and Tan, Yan Shuo and Tang, Tiffany and Yu, Bin},
    volume       = {6},
    number       = {61},
    pages        = {3192},
    doi          = {10.21105/joss.03192},
    url          = {https://doi.org/10.21105/joss.03192},
}
Comments
  • Added Gini Importances

    Added Gini Importances

    Hi @csinva how do the new Gini importances look?

    I based the calculation off sklearn's code from here and here, though it needed to be made recursive as we do not have arrays of all the nodes and their properties.

    There is a demo of the new code in the FIGS_viz_demo.ipynb notebook. I am a bit concerned with the None impurity in the root node of the second tree:

    node_id: 0, left.node_id: 1, right.node_id: 2, impurity: None
    

    I filled it with 0 for the calculation for now:

                    importance_data_tree[node.feature] += (
                        np.sum(node.value_sklearn) * (node.impurity if node.impurity is not None else 0.) -
                        np.sum(node.left.value_sklearn) * node.left.impurity -
                        np.sum(node.right.value_sklearn) * node.right.impurity
                    )
    

    Is None expected if the tree has just one split?

    Also, after taking the mean and normalizing, most of the importances are negative. I think this is fine, as we just care about the relative order of the features, but wanted to get your opinion as well.

    BTW I noticed that we have an unused variable in plot():

    criterion = "squared_error" if isinstance(self, RegressorMixin) else "gini"
    

    Is this needed for anything, or should we delete it?

    opened by mepland 15
  • Fixed FIGS plotting

    Fixed FIGS plotting

    Fixed Issue 132, FIGS plots not appearing correctly.

    The primary bug was in the assignment of node ids here.

                right = next(node_counter)
                left = next(node_counter)
    

    They were being improperly set during the recursion of _update_node(nd). I've fixed this by assigning a new node_num variable after the trees are created during fit() here and using that instead:

            # add node_num to final tree
            for tree_ in self.trees_:
                node_counter = iter(range(0, int(1e06)))
                def _add_node_num(node: Node):
                    if node is None:
                        return
                    node.setattrs(node_num=next(node_counter))
                    _add_node_num(node.left)
                    _add_node_num(node.right)
    
                _add_node_num(tree_)
    

    I also took the opportunity to return a real sklearn DecisionTreeClassifier or DecisionTreeRegressor object, filling in the parameters, including tree_, with the __setstate__() method, building on this SO question. To do this, I needed the impurity at each node and the "value" as expected by sklearn, i.e. value = np.array([neg_count, pos_count], dtype=float). If we further rewrite the FIGS class to save this 2D "value" alongside the current value, perhaps as value_sklearn, I wouldn't need X_train, y_train for the extract_sklearn_tree_from_figs function and the subsequent plotting functions.

    @csinva does my implementation of the impurity variable look correct? I see the impurities are recomputed after I grab my impurity values, so I expect not. Perhaps you could fix this, or let me know the best way to get the final impurity at each node? I'll also wait for the go ahead on adding the value_sklearn variable, and refactoring away the dependence on X_train, y_train in the plotting functions.

    opened by mepland 8
  • 'BoostedRulesClassifier' object has no attribute 'complexity_'

    'BoostedRulesClassifier' object has no attribute 'complexity_'

    After imodels was updated to 1.3.8, we got the error message 'BoostedRulesClassifier' object has no attribute 'complexity_'. Was this attribute removed or renamed? It is generally better to keep public APIs/attributes unchanged during minor releases; any plan to add it back?

    opened by yinweisu 6
  • FIGS Fixes

    FIGS Fixes

    • Added SKompiler integration, which required the new n_features_in_ member variable.
      • Note the demo FIGS model currently requires https://github.com/mepland/SKompiler/tree/fixes to run, which fixes a bug in SKompiler. TLDR: SKompiler was not letting trees run if they use fewer than all the available features, like the demo FIGS tree 0.
    • Fixed bug in n_features
    -    n_features = np.unique(features[np.where( 0 < features )]).size
    +    n_features = np.unique(features[np.where( 0 <= features )]).size
    
    • Improved markdown comments in FIGS_viz_demo.ipynb
    opened by mepland 4
  • HSTree Multiclass Classification Support

    HSTree Multiclass Classification Support

    Does HSTree support multiclass classification problems with RandomForest / ExtraTrees as the estimator?

    From my initial tests it appears buggy. Calling predict_proba with the final model results in lots of NaN predictions, along with warnings during training such as:

    /Users/neerick/workspace/virtual/autogluon/lib/python3.8/site-packages/imodels/tree/hierarchical_shrinkage.py:87: RuntimeWarning: invalid value encountered in double_scalars
      val = tree.value[i][0, 1] / (tree.value[i][0, 0] + tree.value[i][0, 1])  # binary classification
    

    If helpful I can try to create a reproducible example.

    Here is an example result comparing with sklearn default RF (_og_) with accuracy metric. Because HSTree returns many NaN predictions, the scores are very low.

    One observation is that the scores get worse the more trees there are in HSTree forests. I'd guess the likelihood of returning a NaN result increases with the number of trees.

                           model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
    0       RandomForest_og_n300    0.711651   0.723618        0.985573       0.050956  0.519926                 0.985573                0.050956           0.519926            1       True          1
    1       RandomForest_og_n100    0.710154   0.748744        0.453769       0.019050  0.170951                 0.453769                0.019050           0.170951            1       True          2
    2        WeightedEnsemble_L2    0.710154   0.748744        0.464755       0.019376  0.295161                 0.010986                0.000326           0.124210            2       True         36
    3        RandomForest_og_n40    0.700636   0.698492        0.193009       0.010738  0.088012                 0.193009                0.010738           0.088012            1       True          3
    4        RandomForest_og_n20    0.692039   0.698492        0.103616       0.007549  0.057396                 0.103616                0.007549           0.057396            1       True          4
    5        RandomForest_og_n10    0.674165   0.688442        0.075296       0.006166  0.041720                 0.075296                0.006166           0.041720            1       True          5
    6     RandomForest_hs=10_n10    0.521949   0.537688        0.070260       0.005246  0.082384                 0.070260                0.005246           0.082384            1       True         15
    7     RandomForest_hs=50_n10    0.520839   0.517588        0.075151       0.004875  0.071219                 0.075151                0.004875           0.071219            1       True         20
    8    RandomForest_hs=0.1_n10    0.520796   0.537688        0.074070       0.005233  0.093299                 0.074070                0.005233           0.093299            1       True         35
    9      RandomForest_hs=1_n10    0.520692   0.542714        0.077687       0.005690  0.075061                 0.077687                0.005690           0.075061            1       True         10
    10   RandomForest_hs=100_n10    0.519246   0.517588        0.075059       0.006019  0.082536                 0.075059                0.006019           0.082536            1       True         25
    11   RandomForest_hs=500_n10    0.488877   0.517588        0.072145       0.005125  0.072223                 0.072145                0.005125           0.072223            1       True         30
    12     RandomForest_hs=1_n20    0.485125   0.472362        0.113002       0.006484  0.123639                 0.113002                0.006484           0.123639            1       True          9
    13   RandomForest_hs=0.1_n20    0.485005   0.472362        0.111342       0.005953  0.146246                 0.111342                0.005953           0.146246            1       True         34
    14    RandomForest_hs=10_n20    0.484833   0.482412        0.104076       0.006577  0.131909                 0.104076                0.006577           0.131909            1       True         14
    15    RandomForest_hs=50_n20    0.482896   0.482412        0.115057       0.006263  0.130512                 0.115057                0.006263           0.130512            1       True         19
    16   RandomForest_hs=100_n20    0.480840   0.482412        0.108625       0.006045  0.135224                 0.108625                0.006045           0.135224            1       True         24
    17   RandomForest_hs=500_n20    0.458035   0.467337        0.108658       0.006302  0.123907                 0.108658                0.006302           0.123907            1       True         29
    18     RandomForest_hs=1_n40    0.451434   0.467337        0.185129       0.010619  0.210639                 0.185129                0.010619           0.210639            1       True          8
    19   RandomForest_hs=0.1_n40    0.451382   0.467337        0.170597       0.009024  0.244322                 0.170597                0.009024           0.244322            1       True         33
    20    RandomForest_hs=10_n40    0.451322   0.467337        0.173382       0.009955  0.210795                 0.173382                0.009955           0.210795            1       True         13
    21    RandomForest_hs=50_n40    0.450350   0.467337        0.170041       0.008673  0.236081                 0.170041                0.008673           0.236081            1       True         18
    22   RandomForest_hs=100_n40    0.449119   0.467337        0.169396       0.010918  0.226784                 0.169396                0.010918           0.226784            1       True         23
    23   RandomForest_hs=500_n40    0.435832   0.472362        0.162881       0.009256  0.202447                 0.162881                0.009256           0.202447            1       True         28
    24    RandomForest_hs=1_n100    0.420419   0.452261        0.442328       0.017688  0.480776                 0.442328                0.017688           0.480776            1       True          7
    25  RandomForest_hs=0.1_n100    0.420411   0.452261        0.354523       0.018247  0.548557                 0.354523                0.018247           0.548557            1       True         32
    26   RandomForest_hs=10_n100    0.419981   0.452261        0.355097       0.017487  0.469547                 0.355097                0.017487           0.469547            1       True         12
    27   RandomForest_hs=50_n100    0.419034   0.447236        0.344341       0.021125  0.465810                 0.344341                0.021125           0.465810            1       True         17
    28  RandomForest_hs=100_n100    0.418672   0.447236        0.372041       0.018402  0.477048                 0.372041                0.018402           0.477048            1       True         22
    29  RandomForest_hs=500_n100    0.415256   0.457286        0.338696       0.017128  0.492786                 0.338696                0.017128           0.492786            1       True         27
    30  RandomForest_hs=0.1_n300    0.381049   0.391960        0.967061       0.045552  1.533075                 0.967061                0.045552           1.533075            1       True         31
    31   RandomForest_hs=10_n300    0.381049   0.391960        1.109062       0.054005  1.442369                 1.109062                0.054005           1.442369            1       True         11
    32    RandomForest_hs=1_n300    0.381040   0.391960        1.677277       0.055421  2.346773                 1.677277                0.055421           2.346773            1       True          6
    33   RandomForest_hs=50_n300    0.380945   0.391960        0.889030       0.053650  1.320377                 0.889030                0.053650           1.320377            1       True         16
    34  RandomForest_hs=100_n300    0.380885   0.391960        1.031198       0.045266  1.254918                 1.031198                0.045266           1.254918            1       True         21
    35  RandomForest_hs=500_n300    0.380816   0.391960        0.948715       0.050209  1.266396                 0.948715                0.050209           1.266396            1       True         26
    
    
    enhancement 
    opened by Innixma 4
  • Two exactly identical rules from RulefitClassifier

    Two exactly identical rules from RulefitClassifier

    Hello~

    When I use the RulefitClassifier, it returns two exactly identical rules but with different coefficients; shouldn't the internal structure aggregate duplicate rules? I have tried using RuleFit directly, and it doesn't seem to have a similar problem.

    The following image is part of my result.

    bug 
    opened by Yannahhh 4
  • BoostedRulesClassifier sometimes throws an exception

    BoostedRulesClassifier sometimes throws an exception

    Hi,

    When I use the BoostedRulesClassifier, it sometimes throws an exception as follows:

    This BoostedRulesClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

    I find that the exception results from the implementation of the class RuleSet:

        def _eval_weighted_rule_sum(self, X) -> np.ndarray:
            check_is_fitted(self, ['rules_without_feature_names_', 'n_features_', 'feature_placeholders'])

            X = check_array(X)

            if X.shape[1] != self.n_features_:
                raise ValueError("X.shape[1] = %d should be equal to %d, the number of features at training time."
                                 " Please reshape your data."
                                 % (X.shape[1], self.n_features_))

            df = pd.DataFrame(X, columns=self.feature_placeholders)
            selected_rules = self.rules_without_feature_names_

            scores = np.zeros(X.shape[0])
            for r in selected_rules:
                features_r_uses = list(map(lambda x: x[0], r.agg_dict.keys()))
                scores[df[features_r_uses].query(str(r)).index.values] += r.args[0]

            return scores
    

    Specifically, when check_is_fitted(self, ['rules_without_feature_names_', 'n_features_', 'feature_placeholders']) runs, it finds that self.rules_without_feature_names_ does not exist, so the above exception is thrown.

    Reviewing my code and data set further, I find that my training set is easy to fit, so the training error of the estimator is close to zero; this can trigger a bug in the fit function of the class BoostedRulesClassifier:

        for _ in range(self.n_estimators):
            # Fit a classifier with the specific weights
            clf = self.estimator()
            clf.fit(X, y, sample_weight=w)  # uses w as the sampling weight!
            preds = clf.predict(X)
            self.estimator_mean_prediction_.append(np.mean(preds))  # just for printing

            # Indicator function
            miss = preds != y

            # Equivalent with 1/-1 to update weights
            miss2 = np.ones(miss.size)
            miss2[~miss] = -1

            # Error
            err_m = np.dot(w, miss) / sum(w)

            if err_m < 1e-3:
                return self

            # Alpha
            alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))

            # New weights
            w = np.multiply(w, np.exp([float(x) * alpha_m for x in miss2]))

            self.estimators_.append(deepcopy(clf))
            self.estimator_weights_.append(alpha_m)
            self.estimator_errors_.append(err_m)

        rules = []

    Because err_m is zero, the method returns self without executing the subsequent statements; in such a case, self.rules_without_feature_names_ does not exist.

    My current solution to this bug is to modify the following code fragment in the fit function of the class BoostedRulesClassifier:

        # Error
        err_m = np.dot(w, miss) / sum(w)

        # modification ###########################
        if err_m < 1e-3:
            # return self
            w = np.ones(miss.size) / len(y)
            self.estimators_.append(deepcopy(clf))
            self.estimator_weights_.append(float("inf"))
            self.estimator_errors_.append(err_m)
            break
        ##########################################
        # Alpha
        alpha_m = 0.5 * np.log((1 - err_m) / float(err_m))

    I'm not sure whether it may introduce new defects, but it does resolve the exception.

    opened by Wan-xiaohui 3
  • GreedyRuleListClassifier has wildly varying performance and sometimes crashes

    GreedyRuleListClassifier has wildly varying performance and sometimes crashes

    When running a number of experiments with different splits of a given dataset, I see that GreedyRuleListClassifier's accuracy varies wildly, and sometimes the code (see the for loop below) crashes.

    So, for example running 10 experiments like this, with different random splits of the same set:

    import pandas
    import sklearn
    import sklearn.datasets
    from sklearn.model_selection import train_test_split
    
    from imodels import GreedyRuleListClassifier
    
    X, Y = sklearn.datasets.load_breast_cancer(as_frame=True, return_X_y=True)
    
    model = GreedyRuleListClassifier(max_depth=10)
    
    for i in range(10):
      try:
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3)
        model.fit(X_train, y_train, feature_names=X_train.columns)
        y_pred = model.predict(X_test)
        from sklearn.metrics import accuracy_score
        score = accuracy_score(y_test.values,y_pred)
        print('Accuracy:\n', score)
      except KeyError as e:
        print("Failed with KeyError")
    

    Will give as output something along the lines of

    Accuracy: 0.6081871345029239
    Failed with KeyError
    Accuracy: 0.4619883040935672
    Accuracy: 0.45614035087719296
    Accuracy: 0.2222222222222222
    Failed with KeyError
    Failed with KeyError
    Failed with KeyError
    Accuracy: 0.18128654970760233
    Failed with KeyError
    

    Is this intended behavior? While my test dataset is smallish, the variation in accuracy is still surprising to me, and so is the KeyError. I'm using scikit-learn==1.0.2 and imodels==1.3.6 and can edit the issue here to add more details.

    Incidentally, the same behaviour was observed in https://datascience.stackexchange.com/a/116283/50519, noticed by @jonnor.

    Thanks!

    opened by davidefiocco 3
  • Issue with feature_names in GreedyRuleListClassifier

    Issue with feature_names in GreedyRuleListClassifier

    When I set feature_names=X.columns, only the first feature appears in the rule list; the others appear as 'feat i'. I am unable to fix this and request your kind support.

    Here is the output snippet:

        Selected features: Index(['Processor(P99)_Q', 'Opto(F99)_Q', 'Logic(L99)_Am', 'Qualcom', 'Toshiba', 'ABB', 'Whirlpool', 'Honeywell'], dtype='object')
        mean 0.6 (30 pts)
        if Whirlpool >= 153 then 1.0 (16 pts)
        mean 0.143 (14 pts)
        if feat 1 >= 16882885 then 1.0 (2 pts)
        mean 0.0 (12 pts)

    opened by pauldebdeep9 3
  • Complexity comparisons

    Complexity comparisons

    • Compare all models on several UCI datasets
    • Generate complexity-accuracy plots for each model
    • Cache comparison results for convenience
    • Set self.complexity when fitting models
    opened by keyan3 3
  • Test fixes

    Test fixes

    • Fixed an issue where the GitHub build would pass even if the tests actually failed
    • Added missing random seeding in Skope

    I skipped testing predict_proba for Skope altogether. The thought behind this is that even if you write a predict_proba that uses eval_weighted_rule_sum, it still won't match the predictions, since Skope predicts based only on whether the score is positive or not. I'm not sure if our Skope needs to have this method at all (the original Skope implementation doesn't).

    opened by keyan3 3
  • FIGS Demo Notebook Update

    FIGS Demo Notebook Update

    @csinva let's wait on merging this for a few weeks, until both imodels and dtreeviz release new minor versions. I have a few changes I want to make then:

    • Remove path to ~/imodels
    • Use 'leaftype': 'barh'
    • Update color scheme
    • Possibly add numeric leaf predictions and split visualizations
    opened by mepland 0
  • Full sample_weight support for FIGS

    Full sample_weight support for FIGS

    Some parts of FIGS do not support sample_weight, including the extract_sklearn_tree_from_figs() function and feature_importances_.

    Originally posted by @mepland in https://github.com/csinva/imodels/issues/89#issuecomment-1367595878

    opened by mepland 0
  • Implement Dynamic CDI

    Implement Dynamic CDI

    Implementing a Dynamic CDIs class based on FIGS.

    TODOs:

    • [ ] Implement a sklearn compatible class named D-FIGS in a new file imodels/tree/dynamic_figs.py
    • [ ] Write a test using the PECARN IAI dataset

    More details:

    • The D-FIGS class should inherit from the FIGS class and take an additional dictionary at initialization, corresponding to the feature phases. When applying the fit or predict methods, the class should verify that the matrix $X$ is compatible with the feature tiers. For example, phase-2 features can be available (not NA) only if all phase-1 features are available (we may refine this logic later).
    • D-FIGS should infer the phase from the matrix.
    • The tests should be written in a new file named imodels/tests/dynamic_figs_test.py, using pytest (see the package documentation, or use the FIGS test as a reference).
    • Before you start writing code, please write down a short description detailing how you are going to implement the dynamic fitting algorithm. Specifically: How does the model infer the current phase of the patient? How do you store the different models for different phases and ensure these are compatible with one another?

    @aagarwal1996

    opened by OmerRonen 1
  • Add support for `dtreeviz` visualizations

    Add support for `dtreeviz` visualizations

    Add any required translation code to allow imodels trees to be plotted with dtreeviz. This basically boils down to successfully generating a ShadowDecTree object from an imodels tree.

    We can reuse the existing ShadowSKDTree constructor by converting imodels trees into sklearn objects, then calling:

    sk_dtree = ShadowSKDTree(tree_classifier, X, y, features, target, [0, 1])
    

    Alternatively, we can make an imodels specific implementation of ShadowDecTree, similar to the sklearn implementation here, but that may be more work than necessary.

    opened by mepland 0
  • RuleFitClassifier(tree_generator = GradientBoostingClassifier()) not working as per documentation

    RuleFitClassifier(tree_generator = GradientBoostingClassifier()) not working as per documentation

    Hi,

    When using RuleFitClassifier(tree_generator = GradientBoostingClassifier()) with a GradientBoostingClassifier() object fitted and optimized separately via the scikit-learn API, the following error is returned when fitting RuleFitClassifier(tree_generator = GradientBoostingClassifier()):

    ValueError: n_estimators=1 must be larger or equal to estimators_.shape[0]=100 when warm_start==True

    When inspecting what's inside RuleFitClassifier(tree_generator = GradientBoostingClassifier()) after fitting the model, the GradientBoostingClassifier() has been completely modified to parameters different from those optimized before fitting RuleFitClassifier(), i.e., GradientBoostingClassifier(max_leaf_nodes=4, n_estimators=1, random_state=0, warm_start=True). Not sure why these parameters (from the GradientBoostingClassifier()) are changed inside the RuleFitClassifier() object.

    If RuleFitClassifier(tree_generator = None), everything works well.

    As per documentation:

    tree_generator : Optional: this object will be used as provided to generate the rules. This will override almost all the other properties above. Must be GradientBoostingRegressor(), GradientBoostingClassifier(), or RandomForestRegressor()

    • Which properties of RuleFitClassifier() are overridden if tree_generator=GradientBoostingClassifier()?
    • Why does this behavior occur?

    Here is the closest solution I found, in Issue #34; however, the behavior is still not clear.

    Any help will be highly appreciated.

    Many thanks!

    opened by Manuelhrokr 0
Releases (v1.3.11)
Owner: Chandan Singh
Working on interpretable machine learning across domains 🧠⚕️🦠. Let's do good with models.