XGBoost + Optuna

Overview

AutoXGB

XGBoost + Optuna: no brainer

  • auto train xgboost directly from CSV files
  • auto tune xgboost using optuna
  • auto serve best xgboot model using fastapi

NOTE: PRs are currently not accepted. If there are issues/problems, please create an issue.

Installation

Install using pip

pip install autoxgb

Usage

Training a model using AutoXGB is a piece of cake. All you need is some tabular data.

Parameters

###############################################################################
### required parameters
###############################################################################

# path to training data
train_filename = "data_samples/binary_classification.csv"

# path to output folder to store artifacts
output = "output"

###############################################################################
### optional parameters
###############################################################################

# path to test data. if specified, the model will be evaluated on the test data
# and test_predictions.csv will be saved to the output folder
# if not specified, only OOF predictions will be saved
# test_filename = "test.csv"
test_filename = None

# task: classification or regression
# if not specified, the task will be inferred automatically
# task = "classification"
# task = "regression"
task = None

# an id column
# if not specified, the id column will be generated automatically with the name `id`
# idx = "id"
idx = None

# target columns are list of strings
# if not specified, the target column be assumed to be named `target`
# and the problem will be treated as one of: binary classification, multiclass classification,
# or single column regression
# targets = ["target"]
# targets = ["target1", "target2"]
targets = ["income"]

# features columns are list of strings
# if not specified, all columns except `id`, `targets` & `kfold` columns will be used
# features = ["col1", "col2"]
features = None

# categorical_features are list of strings
# if not specified, categorical columns will be inferred automatically
# categorical_features = ["col1", "col2"]
categorical_features = None

# use_gpu is boolean
# if not specified, GPU is not used
# use_gpu = True
# use_gpu = False
use_gpu = True

# number of folds to use for cross-validation
# default is 5
num_folds = 5

# random seed for reproducibility
# default is 42
seed = 42

# number of optuna trials to run
# default is 1000
# num_trials = 1000
num_trials = 100

# time_limit for optuna trials in seconds
# if not specified, timeout is not set and all trials are run
# time_limit = None
time_limit = 360

# if fast is set to True, the hyperparameter tuning will use only one fold
# however, the model will be trained on all folds in the end
# to generate OOF predictions and test predictions
# default is False
# fast = False
fast = False

Python API

To train a new model, you can run:

from autoxgb import AutoXGB


# required parameters:
train_filename = "data_samples/binary_classification.csv"
output = "output"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["income"]
features = None
categorical_features = None
use_gpu = True
num_folds = 5
seed = 42
num_trials = 100
time_limit = 360
fast = False

# Now its time to train the model!
axgb = AutoXGB(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
axgb.train()

CLI

Train the model using the autoxgb train command. The parameters are same as above.

autoxgb train \
 --train_filename datasets/30train.csv \
 --output outputs/30days \
 --test_filename datasets/30test.csv \
 --use_gpu

You can also serve the trained model using the autoxgb serve command.

autoxgb serve --model_path outputs/mll --host 0.0.0.0 --debug

To know more about a command, run:

`autoxgb <command> --help` 
autoxgb train --help


usage: autoxgb <command> [<args>] train [-h] --train_filename TRAIN_FILENAME [--test_filename TEST_FILENAME] --output
                                        OUTPUT [--task {classification,regression}] [--idx IDX] [--targets TARGETS]
                                        [--num_folds NUM_FOLDS] [--features FEATURES] [--use_gpu] [--fast]
                                        [--seed SEED] [--time_limit TIME_LIMIT]

optional arguments:
  -h, --help            show this help message and exit
  --train_filename TRAIN_FILENAME
                        Path to training file
  --test_filename TEST_FILENAME
                        Path to test file
  --output OUTPUT       Path to output directory
  --task {classification,regression}
                        User defined task type
  --idx IDX             ID column
  --targets TARGETS     Target column(s). If there are multiple targets, separate by ';'
  --num_folds NUM_FOLDS
                        Number of folds to use
  --features FEATURES   Features to use, separated by ';'
  --use_gpu             Whether to use GPU for training
  --fast                Whether to use fast mode for tuning params. Only one fold will be used if fast mode is set
  --seed SEED           Random seed
  --time_limit TIME_LIMIT
                        Time limit for optimization
Owner
abhishek thakur
Kaggle: www.kaggle.com/abhishek
abhishek thakur
MICOM is a Python package for metabolic modeling of microbial communities

Welcome MICOM is a Python package for metabolic modeling of microbial communities currently developed in the Gibbons Lab at the Institute for Systems

57 Dec 21, 2022
Solve automatic numerical differentiation problems in one or more variables.

numdifftools The numdifftools library is a suite of tools written in _Python to solve automatic numerical differentiation problems in one or more vari

Per A. Brodtkorb 181 Dec 16, 2022
A Python implementation of GRAIL, a generic framework to learn compact time series representations.

GRAIL A Python implementation of GRAIL, a generic framework to learn compact time series representations. Requirements Python 3.6+ numpy scipy tslearn

3 Nov 24, 2021
In this Repo a simple Sklearn Model will be trained and pushed to MLFlow

SKlearn_to_MLFLow In this Repo a simple Sklearn Model will be trained and pushed to MLFlow Install This Repo is based on poetry python3 -m venv .venv

1 Dec 13, 2021
Used Logistic Regression, Random Forest, and XGBoost to predict the outcome of Search & Destroy games from the Call of Duty World League for the 2018 and 2019 seasons.

Call of Duty World League: Search & Destroy Outcome Predictions Growing up as an avid Call of Duty player, I was always curious about what factors led

Brett Vogelsang 2 Jan 18, 2022
Programming assignments and quizzes from all courses within the Machine Learning Engineering for Production (MLOps) specialization offered by deeplearning.ai

Machine Learning Engineering for Production (MLOps) Specialization on Coursera (offered by deeplearning.ai) Programming assignments from all courses i

Aman Chadha 173 Jan 05, 2023
Decision tree is the most powerful and popular tool for classification and prediction

Diabetes Prediction Using Decision Tree Introduction Decision tree is the most powerful and popular tool for classification and prediction. A Decision

Arjun U 1 Jan 23, 2022
虚拟货币(BTC、ETH)炒币量化系统项目。在一版本的基础上加入了趋势判断

🎉 第二版本 🎉 (现货趋势网格) 介绍 在第一版本的基础上 趋势判断,不在固定点位开单,选择更优的开仓点位 优势: 🎉 简单易上手 安全(不用将api_secret告诉他人) 如何启动 修改app目录下的authorization文件

幸福村的码农 250 Jan 07, 2023
Machine Learning Model to predict the payment date of an invoice when it gets created in the system.

Payment-Date-Prediction Machine Learning Model to predict the payment date of an invoice when it gets created in the system.

15 Sep 09, 2022
Optimal Randomized Canonical Correlation Analysis

ORCCA Optimal Randomized Canonical Correlation Analysis This project is for the python version of ORCCA algorithm. It depends on Numpy for matrix calc

Yinsong Wang 1 Nov 21, 2021
End to End toy example of MLOps

churn_model MLOps Toy Example End to End You might find below links useful Connect VSCode to Git MLFlow Port Heroku App Project Organization ├── LICEN

Ashish Tele 6 Feb 06, 2022
scikit-multimodallearn is a Python package implementing algorithms multimodal data.

scikit-multimodallearn is a Python package implementing algorithms multimodal data. It is compatible with scikit-learn, a popul

12 Jun 29, 2022
Pydantic based mock data generation

This library offers powerful mock data generation capabilities for pydantic based models. It can also be used with other libraries that use pydantic as a foundation, for example SQLModel, Beanie and

Na'aman Hirschfeld 396 Dec 28, 2022
Bodywork deploys machine learning projects developed in Python, to Kubernetes.

Bodywork deploys machine learning projects developed in Python, to Kubernetes. It helps you to: serve models as microservices execute batch jobs run r

Bodywork Machine Learning 409 Jan 01, 2023
Regularization and Feature Selection in Least Squares Temporal Difference Learning

Regularization and Feature Selection in Least Squares Temporal Difference Learning Description This is Python implementations of Least Angle Regressio

Mina Parham 0 Jan 18, 2022
Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis.

Kaggler is a Python package for lightweight online machine learning algorithms and utility functions for ETL and data analysis. It is distributed under the MIT License.

Jeong-Yoon Lee 720 Dec 25, 2022
A Lucid Framework for Transparent and Interpretable Machine Learning Models.

Currently a Beta-Version lucidmode is an open-source, low-code and lightweight Python framework for transparent and interpretable machine learning mod

lucidmode 15 Aug 12, 2022
Machine Learning approach for quantifying detector distortion fields

DistortionML Machine Learning approach for quantifying detector distortion fields. This project is a feasibility study for training a surrogate model

Joel Bernier 1 Nov 05, 2021
A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

CatBoost 6.9k Jan 05, 2023
Reproducibility and Replicability of Web Measurement Studies

Reproducibility and Replicability of Web Measurement Studies This repository holds additional material to the paper "Reproducibility and Replicability

6 Dec 31, 2022