AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications.

Overview

AutoTabular

Paper Conference Conference Conference

AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models tabular data.

autotabular

[Toc]

What's good in it?

  • It is using the RAPIDS as back-end support, gives you the ability to execute end-to-end data science and analytics pipelines entirely on GPUs.
  • It Supports many anomaly detection models: ,
  • It using meta learning to accelerate model selection and parameter tuning.
  • It is using many Deep Learning models for tabular data: Wide&Deep, DCN(Deep & Cross Network), FM, DeepFM, PNN ...
  • It is using many machine learning algorithms: Baseline, Linear, Random Forest, Extra Trees, LightGBM, Xgboost, CatBoost, and Nearest Neighbors.
  • It can compute Ensemble based on greedy algorithm from Caruana paper.
  • It can stack models to build level 2 ensemble (available in Compete mode or after setting stack_models parameter).
  • It can do features preprocessing, like: missing values imputation and converting categoricals. What is more, it can also handle target values preprocessing.
  • It can do advanced features engineering, like: Golden Features, Features Selection, Text and Time Transformations.
  • It can tune hyper-parameters with not-so-random-search algorithm (random-search over defined set of values) and hill climbing to fine-tune final models.

Installation

The sources for AutoTabular can be downloaded from the Github repo.

You can either clone the public repository:

# clone project
git clone https://apulis-gitlab.apulis.cn/apulis/AutoTabular/autotabular.git
# First, install dependencies
pip install -r requirements.txt

Once you have a copy of the source, you can install it with:

python setup.py install

Example

Next, navigate to any file and run it.

# module folder
cd example

# run module (example: mnist as your main contribution)
python binary_classifier_Titanic.py

Auto Feature generate & Selection

TODO

Deep Feature Synthesis

import featuretools as ft
import pandas as pd
from sklearn.datasets import load_iris

# Load data and put into dataframe
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target
df['species'] = df['species'].map({
    0: 'setosa',
    1: 'versicolor',
    2: 'virginica'
})
# Make an entityset and add the entity
es = ft.EntitySet()
es.add_dataframe(
    dataframe_name='data', dataframe=df, make_index=True, index='index')
# Run deep feature synthesis with transformation primitives
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    max_depth=3,
    target_dataframe_name='data',
    agg_primitives=['mode', 'mean', 'max', 'count'],
    trans_primitives=[
        'add_numeric', 'multiply_numeric', 'cum_min', 'cum_mean', 'cum_max'
    ],
    groupby_trans_primitives=['cum_sum'])

print(feature_defs)
print(feature_matrix.head())
print(feature_matrix.ww)

GBDT Feature Generate

from autofe.feature_engineering.gbdt_feature import CatboostFeatureTransformer, GBDTFeatureTransformer, LightGBMFeatureTransformer, XGBoostFeatureTransformer

titanic = pd.read_csv('autotabular/datasets/data/Titanic.csv')
# 'Embarked' is stored as letters, so fit a label encoder to the train set to use in the loop
embarked_encoder = LabelEncoder()
embarked_encoder.fit(titanic['Embarked'].fillna('Null'))
# Record anyone travelling alone
titanic['Alone'] = (titanic['SibSp'] == 0) & (titanic['Parch'] == 0)
# Transform 'Embarked'
titanic['Embarked'].fillna('Null', inplace=True)
titanic['Embarked'] = embarked_encoder.transform(titanic['Embarked'])
# Transform 'Sex'
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 1
titanic['Sex'] = titanic['Sex'].astype('int8')
# Drop features that seem unusable. Save passenger ids if test
titanic.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

trainMeans = titanic.groupby(['Pclass', 'Sex'])['Age'].mean()

def f(x):
    if not np.isnan(x['Age']):  # not NaN
        return x['Age']
    return trainMeans[x['Pclass'], x['Sex']]

titanic['Age'] = titanic.apply(f, axis=1)
rows = titanic.shape[0]
n_train = int(rows * 0.77)
train_data = titanic[:n_train, :]
test_data = titanic[n_train:, :]

X_train = titanic.drop(['Survived'], axis=1)
y_train = titanic['Survived']

clf = XGBoostFeatureTransformer(task='classification')
clf.fit(X_train, y_train)
result = clf.concate_transform(X_train)
print(result)

clf = LightGBMFeatureTransformer(task='classification')
clf.fit(X_train, y_train)
result = clf.concate_transform(X_train)
print(result)

clf = GBDTFeatureTransformer(task='classification')
clf.fit(X_train, y_train)
result = clf.concate_transform(X_train)
print(result)

clf = CatboostFeatureTransformer(task='classification')
clf.fit(X_train, y_train)
result = clf.concate_transform(X_train)
print(result)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

lr = LogisticRegression()
x_train_gb, x_test_gb, y_train_gb, y_test_gb = train_test_split(
    result, y_train)
x_train, x_test, y_train, y_test = train_test_split(X_train, y_train)

lr.fit(x_train, y_train)
score = roc_auc_score(y_test, lr.predict(x_test))
print('LR with GBDT apply data, train data shape : {0}  auc: {1}'.format(
    x_train.shape, score))

lr = LogisticRegression()
lr.fit(x_train_gb, y_train_gb)
score = roc_auc_score(y_test_gb, lr.predict(x_test_gb))
print('LR with GBDT apply data, train data shape : {0}  auc: {1}'.format(
    x_train_gb.shape, score))

Golden Feature Generate

from autofe import GoldenFeatureTransform

titanic = pd.read_csv('autotabular/datasets/data/Titanic.csv')
embarked_encoder = LabelEncoder()
embarked_encoder.fit(titanic['Embarked'].fillna('Null'))
# Record anyone travelling alone
titanic['Alone'] = (titanic['SibSp'] == 0) & (titanic['Parch'] == 0)
# Transform 'Embarked'
titanic['Embarked'].fillna('Null', inplace=True)
titanic['Embarked'] = embarked_encoder.transform(titanic['Embarked'])
# Transform 'Sex'
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 1
titanic['Sex'] = titanic['Sex'].astype('int8')
# Drop features that seem unusable. Save passenger ids if test
titanic.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

trainMeans = titanic.groupby(['Pclass', 'Sex'])['Age'].mean()

def f(x):
    if not np.isnan(x['Age']):  # not NaN
        return x['Age']
    return trainMeans[x['Pclass'], x['Sex']]

titanic['Age'] = titanic.apply(f, axis=1)

X_train = titanic.drop(['Survived'], axis=1)
y_train = titanic['Survived']
print(X_train)
gbdt_model = GoldenFeatureTransform(
    results_path='./', ml_task='BINARY_CLASSIFICATION')
gbdt_model.fit(X_train, y_train)
results = gbdt_model.transform(X_train)
print(results)

Neural Network Embeddings

# data url
"""https://www.kaggle.com/c/house-prices-advanced-regression-techniques."""
data_dir = '/media/robin/DATA/datatsets/structure_data/house_price/train.csv'
data = pd.read_csv(
    data_dir,
    usecols=[
        'SalePrice', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea',
        'Street', 'YearBuilt', 'LotShape', '1stFlrSF', '2ndFlrSF'
    ]).dropna()

categorical_features = [
    'MSSubClass', 'MSZoning', 'Street', 'LotShape', 'YearBuilt'
]
output_feature = 'SalePrice'
label_encoders = {}
for cat_col in categorical_features:
    label_encoders[cat_col] = LabelEncoder()
    data[cat_col] = label_encoders[cat_col].fit_transform(data[cat_col])

dataset = TabularDataset(
    data=data, cat_cols=categorical_features, output_col=output_feature)

batchsize = 64
dataloader = DataLoader(dataset, batchsize, shuffle=True, num_workers=1)

cat_dims = [int(data[col].nunique()) for col in categorical_features]
emb_dims = [(x, min(50, (x + 1) // 2)) for x in cat_dims]
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FeedForwardNN(
    emb_dims,
    no_of_cont=4,
    lin_layer_sizes=[50, 100],
    output_size=1,
    emb_dropout=0.04,
    lin_layer_dropouts=[0.001, 0.01]).to(device)
print(model)
num_epochs = 100
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
for epoch in range(num_epochs):
    for y, cont_x, cat_x in dataloader:
        cat_x = cat_x.to(device)
        cont_x = cont_x.to(device)
        y = y.to(device)
        # Forward Pass
        preds = model(cont_x, cat_x)
        loss = criterion(preds, y)
        # Backward Pass and Optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print('loss:', loss)

License

This library is licensed under the Apache 2.0 License.

Contributing to AutoTabular

We are actively accepting code contributions to the AutoTabular project. If you are interested in contributing to AutoTabular, please contact me.

Owner
wenqi
Learning is all you need!
wenqi
Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas.

Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas. Its objective is to ex

Taylor G Smith 54 Aug 20, 2022
Decision tree is the most powerful and popular tool for classification and prediction

Diabetes Prediction Using Decision Tree Introduction Decision tree is the most powerful and popular tool for classification and prediction. A Decision

Arjun U 1 Jan 23, 2022
Simple and flexible ML workflow engine.

This is a simple and flexible ML workflow engine. It helps to orchestrate events across a set of microservices and create executable flow to handle requests. Engine is designed to be configurable wit

Katana ML 295 Jan 06, 2023
This is the material used in my free Persian course: Machine Learning with Python

This is the material used in my free Persian course: Machine Learning with Python

Yara Mohamadi 4 Aug 07, 2022
Data Efficient Decision Making

Data Efficient Decision Making

Microsoft 197 Jan 06, 2023
XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

XGBoost-Ray is a distributed backend for XGBoost, built on top of distributed computing framework Ray.

92 Dec 14, 2022
A library of sklearn compatible categorical variable encoders

Categorical Encoding Methods A set of scikit-learn-style transformers for encoding categorical variables into numeric by means of different techniques

2.1k Jan 07, 2023
A toolbox to iNNvestigate neural networks' predictions!

iNNvestigate neural networks! Table of contents Introduction Installation Usage and Examples More documentation Contributing Releases Introduction In

Maximilian Alber 1.1k Jan 05, 2023
MasTrade is a trading bot in baselines3,pytorch,gym

mastrade MasTrade is a trading bot in baselines3,pytorch,gym idea we have for example 1 btc and we buy a crypto with it with market option to trade in

Masoud Azizi 18 May 24, 2022
A toolkit for geo ML data processing and model evaluation (fork of solaris)

An open source ML toolkit for overhead imagery. This is a beta version of lunular which may continue to develop. Please report any bugs through issues

Ryan Avery 4 Nov 04, 2021
Dive into Machine Learning

Dive into Machine Learning Hi there! You might find this guide helpful if: You know Python or you're learning it 🐍 You're new to Machine Learning You

Michael Floering 11.1k Jan 03, 2023
A repository to index and organize the latest machine learning courses found on YouTube.

📺 ML YouTube Courses At DAIR.AI we ❤️ open education. We are excited to share some of the best and most recent machine learning courses available on

DAIR.AI 9.6k Jan 01, 2023
PySurvival is an open source python package for Survival Analysis modeling

PySurvival What is Pysurvival ? PySurvival is an open source python package for Survival Analysis modeling - the modeling concept used to analyze or p

Square 265 Dec 27, 2022
Titanic Traveller Survivability Prediction

The aim of the mini project is predict whether or not a passenger survived based on attributes such as their age, sex, passenger class, where they embarked and more.

John Phillip 0 Jan 20, 2022
A naive Bayes model for cancer classification using a set of documents

Naivebayes text classifcation model for cancer and noncancer documents Author: Alex King Purpose Requirements/files included How to use 1. Purpose The

Alex W King 1 Nov 24, 2021
Laporan Proyek Machine Learning - Azhar Rizki Zulma

Laporan Proyek Machine Learning - Azhar Rizki Zulma Project Overview Domain proyek yang dipilih dalam proyek machine learning ini adalah mengenai hibu

Azhar Rizki Zulma 6 Mar 12, 2022
An easier way to build neural search on the cloud

Jina is geared towards building search systems for any kind of data, including text, images, audio, video and many more. With the modular design & multi-layer abstraction, you can leverage the effici

Jina AI 17k Jan 01, 2023
Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.

Time series analysis today is an important cornerstone of quantitative science in many disciplines, including natural and life sciences as well as eco

Christoph Mark 129 Dec 24, 2022
Predict the income for each percentile of the population (Python) - FRENCH

05.income-prediction Predict the income for each percentile of the population (Python) - FRENCH Effectuez une prédiction de revenus Prérequis Pour ce

1 Feb 13, 2022
SIMD-accelerated bitwise hamming distance Python module for hexidecimal strings

hexhamming What does it do? This module performs a fast bitwise hamming distance of two hexadecimal strings. This looks like: DEADBEEF = 1101111010101

Michael Recachinas 12 Oct 14, 2022