hgboost - Hyperoptimized Gradient Boosting

Overview


Star it if you like it!

hgboost is short for Hyperoptimized Gradient Boosting and is a Python package for hyperparameter optimization of xgboost, catboost and lightboost using cross-validation, with evaluation of the results on an independent validation set. hgboost can be applied to classification and regression tasks.

hgboost is fun because:

* 1. Hyperoptimizes the parameter space using a Bayesian approach.
* 2. Determines the best scoring model(s) using k-fold cross-validation.
* 3. Evaluates the best model on an independent evaluation set.
* 4. Fits the model on the entire input data using the optimal parameters.
* 5. Works for classification and regression.
* 6. Creates a super-hyperoptimized model by ensembling all individually optimized models.
* 7. Returns the model, the search space, and the test/evaluation results.
* 8. Makes insightful plots.

Documentation

Regression example: Open in Colab

Classification example: Open in Colab

Schematic overview of hgboost

Installation Environment

  • Install hgboost from PyPI (recommended). hgboost is compatible with Python 3.6+ and runs on Linux, macOS, and Windows.
  • A new environment is recommended and can be created as follows:
conda create -n env_hgboost python=3.6
conda activate env_hgboost

Install the latest version of hgboost from PyPI

pip install hgboost

Force an upgrade to the latest version

pip install -U hgboost

Install from GitHub source

pip install git+https://github.com/erdogant/hgboost#egg=master

Import hgboost package

import hgboost

Classification example for xgboost, catboost and lightboost:

# Load library
from hgboost import hgboost

# Initialization
hgb = hgboost(max_eval=10, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=42)
# Import data
df = hgb.import_example()
y = df['Survived'].values
y = y.astype(str)
y[y=='1']='survived'
y[y=='0']='dead'

# Preprocessing by encoding variables
del df['Survived']
X = hgb.preprocessing(df)
# Fit catboost by hyperoptimization and cross-validation
results = hgb.catboost(X, y, pos_label='survived')

# Fit lightboost by hyperoptimization and cross-validation
results = hgb.lightboost(X, y, pos_label='survived')

# Fit xgboost by hyperoptimization and cross-validation
results = hgb.xgboost(X, y, pos_label='survived')

# [hgboost] >Start hgboost classification..
# [hgboost] >Collecting xgb_clf parameters.
# [hgboost] >Number of variables in search space is [11], loss function: [auc].
# [hgboost] >method: xgb_clf
# [hgboost] >eval_metric: auc
# [hgboost] >greater_is_better: True
# [hgboost] >pos_label: True
# [hgboost] >Total dataset: (891, 204) 
# [hgboost] >Hyperparameter optimization..
#  100% |----| 500/500 [04:39<05:21,  1.33s/trial, best loss: -0.8800619834710744]
# [hgboost] >Best performing [xgb_clf] model: auc=0.881198
# [hgboost] >5-fold cross validation for the top 10 scoring models, Total nr. tests: 50
# 100%|██████████| 10/10 [00:42<00:00,  4.27s/it]
# [hgboost] >Evaluate best [xgb_clf] model on independent validation dataset (179 samples, 20.00%).
# [hgboost] >[auc] on independent validation dataset: -0.832
# [hgboost] >Retrain [xgb_clf] on the entire dataset with the optimal parameters settings.

# Plot searched parameter space
hgb.plot_params()

# Plot summary results
hgb.plot()

# Plot the best tree
hgb.treeplot()

# Plot the validation results
hgb.plot_validation()

# Plot the cross-validation results
hgb.plot_cv()

# Use the learned model to make new predictions.
y_pred, y_proba = hgb.predict(X)
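
The returned results dictionary can also be inspected directly; a minimal sketch, assuming the 'params' and 'model' keys shown in the output quoted in the Comments section below:

# Inspect the outcome of the hyperparameter search (key names as they
# appear in the results output quoted below; treat them as indicative).
print(results['params'])   # best hyperparameter set found
print(results['model'])    # final model refit on the entire dataset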

Create ensemble model for Classification

from hgboost import hgboost

hgb = hgboost(max_eval=100, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=None, verbose=3)

# Import data
df = hgb.import_example()
y = df['Survived'].values
del df['Survived']
X = hgb.preprocessing(df, verbose=0)

results = hgb.ensemble(X, y, pos_label=1)

# Use the predictor
y_pred, y_proba = hgb.predict(X)

Create ensemble model for Regression

import numpy as np
from hgboost import hgboost

hgb = hgboost(max_eval=100, threshold=0.5, cv=5, test_size=0.2, val_size=0.2, top_cv_evals=10, random_state=None, verbose=3)

# Import data
df = hgb.import_example()
y = df['Age'].values
del df['Age']
I = ~np.isnan(y)
X = hgb.preprocessing(df, verbose=0)
X = X.loc[I,:]
y = y[I]

results = hgb.ensemble(X, y, methods=['xgb_reg','ctb_reg','lgb_reg'])

# Use the predictor
y_pred, y_proba = hgb.predict(X)
# Plot the ensemble regression validation results
hgb.plot_validation()

References

* http://hyperopt.github.io/hyperopt/
* https://github.com/dmlc/xgboost
* https://github.com/microsoft/LightGBM
* https://github.com/catboost/catboost

Maintainers

Contribute

  • Contributions are welcome.

License: See LICENSE for details.

Coffee

  • If you wish to buy me a coffee for this work, it is very much appreciated :)
Comments
  • import error during import hgboost

    When I finished the installation of hgboost and tried to import it, something went wrong. Could you please help me out? Details are as follows:

    ImportError                               Traceback (most recent call last)
    <ipython-input> in <module>
    ----> 1 from hgboost import hgboost

    C:\ProgramData\Anaconda3\lib\site-packages\hgboost\__init__.py in <module>
    ----> 1 from hgboost.hgboost import hgboost
          2
          3 from hgboost.hgboost import (
          4     import_example,
          5 )

    C:\ProgramData\Anaconda3\lib\site-packages\hgboost\hgboost.py in <module>
          9 import classeval as cle
         10 from df2onehot import df2onehot
    ---> 11 import treeplot as tree
         12 import colourmap
         13

    C:\ProgramData\Anaconda3\lib\site-packages\treeplot\__init__.py in <module>
    ----> 1 from treeplot.treeplot import (
          2     plot,
          3     randomforest,
          4     xgboost,
          5     lgbm,

    C:\ProgramData\Anaconda3\lib\site-packages\treeplot\treeplot.py in <module>
         14 import numpy as np
         15 from sklearn.tree import export_graphviz
    ---> 16 from sklearn.tree.export import export_text
         17 from subprocess import call
         18 import matplotlib.image as mpimg

    ImportError: cannot import name 'export_text' from 'sklearn.tree.export'

    thanks a lot!

    opened by recherHE 3
  • Test:Validation:Train split

    Shouldn't the new train-test split be test_size=self.test_size/(1-self.val_size) in def _HPOpt(self)? The shape of X was already updated in _set_validation_set(self, X, y).

    I'm assuming that the test, train, and validation set ratios are defined on the original data.
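
    A minimal sketch of the arithmetic behind this report, assuming the validation split is carved off before the train/test split:

    # Hypothetical illustration of the reported ratio issue (not hgboost code).
    val_size, test_size = 0.2, 0.2
    n = 1000                                  # original sample count

    n_after_val = n * (1 - val_size)          # 800 rows left after the validation split
    naive_test = n_after_val * test_size      # 160 rows: only 16% of the original data
    corrected = test_size / (1 - val_size)    # 0.25
    fixed_test = n_after_val * corrected      # 200 rows: 20% of the original, as intended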

    opened by SSLPP 3
  • Treeplot failure - missing graphviz dependency

    I'm running through the example classification notebook now, and the treeplot fails to render, with a warning about the missing graphviz dependency (screenshot omitted).

    It seems that graphviz, being a compiled C library, is not bundled with pip (it is included when installing treeplot/graphviz via conda, though).

    Since there is no way to add this to the pip requirements, maybe add a sentence to the Installation instructions warning that graphviz must already be available or be installed separately; a quick check is sketched below.

    (Note: the suggested apt command for Linux is not strictly necessary, because pydot does get installed with treeplot via pip.)
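
    A minimal way to check up front whether the graphviz binaries are available (a sketch, not hgboost code):

    import shutil

    # 'dot' is the compiled graphviz executable that pip cannot ship.
    if shutil.which('dot') is None:
        print("graphviz not found; install it system-wide, e.g. via conda or apt")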

    opened by ninjit 2
  • Getting the native model for compatibility with shap.TreeExplainer

    Hello, first of all, really nice project. I just found out about it today and started playing with it a little. Is there any way to get the trained model as an XGBoost, LightGBM, or CatBoost class in order to fit a shap.TreeExplainer instance to it?

    Thanks in advance! -Nicolás
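
    A minimal sketch of one way to do this, assuming the results dictionary returned by the fit methods exposes the underlying booster under the 'model' key (as the output quoted in the parameter-mismatch issue below suggests):

    import shap

    # X, y and results as in the classification example above; if
    # results['model'] holds the native XGBClassifier, it can be handed
    # straight to shap's tree explainer.
    model = results['model']
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)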

    opened by nicolasaldecoa 2
  • Xgboost parameter

    After calling hgb.plot_params(), the learning rate is shown as 796, which does not seem reasonable. Can I see the model parameters that were actually chosen by the hyperoptimization? (screenshot omitted)

    opened by LAH19999 2
  • HP Tuning: best_model uses different parameters from those that were reported as best ones

    I used hgboost to optimize the hyperparameters of my XGBoost model as described in the API References, with the following parameters:

    hgb = hgboost()
    results = hgb.xgboost(X_train, y_train, pos_label=1, method='xgb_clf', eval_metric='logloss')
    

    As noted in the documentation, results is a dictionary that, among other things, contains the best performing parameters (best_params) and the best performing model (model). However, the parameters that the best performing model uses differ from what the function returns as best_params:

    best_params

    'params': {'colsample_bytree': 0.47000000000000003,
      'gamma': 1,
      'learning_rate': 534,
      'max_depth': 49,
      'min_child_weight': 3.0,
      'n_estimators': 36,
      'subsample': 0.96}
    

    model

    'model': XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                   colsample_bynode=1, colsample_bytree=0.47000000000000003,
                   enable_categorical=False, gamma=1, gpu_id=-1,
                   importance_type=None, interaction_constraints='',
                   learning_rate=0.058619090164329916, max_delta_step=0,
                   max_depth=54, min_child_weight=3.0, missing=nan,
                   monotone_constraints='()', n_estimators=200, n_jobs=-1,
                   num_parallel_tree=1, predictor='auto', random_state=0,
                   reg_alpha=0, reg_lambda=1, scale_pos_weight=0.5769800646551724,
                   subsample=0.96, tree_method='exact', validate_parameters=1,
                   verbosity=0),
    

    As you can see, for example, max_depth=49 appears in best_params, but the model uses max_depth=54, and so on.

    Is this a bug or the intended behavior? In case of the latter, I'd really appreciate an explanation!
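
    A quick way to surface the mismatch, assuming the 'params' and 'model' keys quoted above:

    # Compare the reported best parameters against what the final model carries.
    reported = results['params']
    actual = results['model'].get_params()

    for key, value in reported.items():
        if key in actual and actual[key] != value:
            print(f"{key}: reported={value} model={actual[key]}")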

    My setup:

    • OS: WSL (Ubuntu)
    • Python: 3.9.7
    • hgboost: 1.0.0
    opened by Mikki99 1
  • Running regression example error

    opened by recherHE 1
  • Error in RMSE calculation

    if self.eval_metric=='rmse':
        loss = mean_squared_error(y_test, y_pred)
    

    mean_squared_error in sklearn returns the MSE; use mean_squared_error(y_true, y_pred, squared=False) to get the RMSE, as sketched below.
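
    A minimal illustration, assuming a scikit-learn version that still supports the squared keyword (newer releases provide root_mean_squared_error instead):

    from sklearn.metrics import mean_squared_error

    y_true = [3.0, 5.0, 2.5]
    y_pred = [2.5, 5.0, 4.0]

    mse = mean_squared_error(y_true, y_pred)                  # 0.833...
    rmse = mean_squared_error(y_true, y_pred, squared=False)  # 0.912..., the square root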

    opened by SSLPP 1
  • numpy.AxisError: axis 1 is out of bounds for array of dimension 1

    When eval_metric is auc, an error is raised. The offending line is hgboost.py:906; a related discussion: https://stackoverflow.com/questions/61288972/axiserror-axis-1-is-out-of-bounds-for-array-of-dimension-1-when-calculating-auc
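
    The error class itself is easy to reproduce outside hgboost; a minimal sketch (not the actual hgboost code path):

    import numpy as np

    y_proba = np.array([0.2, 0.8, 0.6])   # 1-D array, as for binary probabilities
    np.argmax(y_proba, axis=1)            # numpy.AxisError: axis 1 is out of bounds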

    opened by quancore 0
  • ValueError: Target is multiclass but average='binary'. Please choose another average setting, one of [None, 'micro', 'macro', 'weighted'].

    There is an error when the F1 score is used for multi-class classification. The error is on line hgboost.py:904 while calculating the F1 score: the average parameter defaults to 'binary', which is not suitable for multi-class targets.
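
    A minimal reproduction of the underlying scikit-learn behavior:

    from sklearn.metrics import f1_score

    y_true = [0, 1, 2, 2, 1, 0]
    y_pred = [0, 2, 2, 2, 1, 0]

    # f1_score(y_true, y_pred) raises the ValueError above, because the
    # default average='binary' only applies to binary targets.
    print(f1_score(y_true, y_pred, average='weighted'))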

    opened by quancore 0