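AutoX is an efficient automated machine learning (AutoML) tool aimed mainly at data mining competitions on tabular data. Its key features are strong performance, ease of use, generality, automation, and flexibility.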

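What is AutoX?

AutoX is an efficient automated machine learning tool aimed mainly at data mining competitions on tabular data. Its features include:

  • Strong performance: on several Kaggle datasets, AutoX performs significantly better than other solutions (see the performance comparison).
  • Easy to use: AutoX's interface is similar to sklearn's, so it is quick to pick up.
  • General: works for both classification and regression problems.
  • Automated: data cleaning, feature engineering, model tuning, and the other steps run fully automatically, with no manual intervention.
  • Flexible: the components are decoupled and can be used on their own. Where the automated result is unsatisfactory, AutoX provides flexible interfaces for bringing in expert knowledge.
  • Competition score-boosting tips: tricks that improved scores in past competitions are collected and published.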

Installation

1. git clone https://github.com/4paradigm/autox.git
2. cd autox
3. python setup.py install

Architecture

├── autox
│   ├── ensemble
│   ├── feature_engineer
│   ├── feature_selection
│   ├── file_io
│   ├── join_tables
│   ├── metrics
│   ├── models
│   ├── process_data
│   ├── util.py
│   ├── CONST.py
│   └── autox.py
├── run_oneclick.py
├── demo
├── test
├── setup.py
└── README.md

Quick start

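  • Fully automatic: for users who want a decent result quickly. Only minimal information about the data needs to be configured to build the complete machine learning pipeline.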
from autox import AutoX
path = data_dir  # data_dir: directory that contains train.csv and test.csv
autox = AutoX(target = 'loss', train_name = 'train.csv', test_name = 'test.csv', 
               id = ['id'], path = path)
sub = autox.get_submit()
sub.to_csv("submission.csv", index = False)
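  • Semi-automatic: run_demo.ipynb
For users who want better predictions. AutoX offers a rich, easy-to-use set of interfaces, so users can configure the pipeline to fit their data and obtain better predictions.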

Performance comparison:

| index | data_type | data_name | AutoX | AutoGluon | H2o |
|---|---|---|---|---|---|
| 1 | regression | zhidemai | 1.1231 | 1.9466 | 1.1927 |
| 2 | regression | Tabular Playground Series - Aug 2021 | 7.87731 | 10.3944 | 7.8895 |
| 3 | binary classification | Titanic | x | 0.78229 | 0.79186 |

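Data types

  • cat: Categorical, an unordered categorical variable
  • ord: Ordinal, an ordered categorical variable
  • num: Numeric, a continuous variable
  • datetime: a time variable of Datetime type
  • timestamp: a time variable of Timestamp type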
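
As a rough illustration of how such type labels might be assigned, the sketch below guesses a label for a column from its raw string values. The function name `infer_column_type`, the accepted datetime format, and the 10-digit rule for timestamps are assumptions made for this example, not AutoX's actual detection rules. `ord` is left out because an ordering cannot be inferred from the values alone.

```python
from datetime import datetime


def infer_column_type(values, datetime_format="%Y-%m-%d %H:%M:%S"):
    """Guess 'timestamp', 'num', 'datetime' or 'cat' for a column of strings.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "cat"

    # Assumption: 10-digit integers look like Unix timestamps in seconds.
    if all(v.isdigit() and len(v) == 10 for v in non_empty):
        return "timestamp"

    def is_number(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    if all(is_number(v) for v in non_empty):
        return "num"

    def is_datetime(v):
        try:
            datetime.strptime(v, datetime_format)
            return True
        except ValueError:
            return False

    if all(is_datetime(v) for v in non_empty):
        return "datetime"

    # Anything else is treated as an unordered category.
    return "cat"


columns = {
    "price": ["1.5", "2.0", "3.25"],
    "city": ["Beijing", "Shanghai", "Beijing"],
    "created_at": ["2021-08-01 10:00:00", "2021-08-02 11:30:00"],
    "event_ts": ["1627812000", "1627898400"],
}
feature_type = {name: infer_column_type(vals) for name, vals in columns.items()}
```

The resulting `feature_type` dictionary maps each column name to one of the labels listed above.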

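Pipeline logic

  • 1. Initialize the AutoX class
1.1 Read the data
1.2 Concatenate train and test
1.3 Identify the type of each column in the data tables
1.4 Preprocess the data
  • 2. Feature engineering
Feature engineering covers single-table features and multi-table features.
Every feature engineering class provides the following functions:
    1. automatically select the features the current operation will be applied to;
    2. inspect the selected features;
    3. modify the features the current operation will be applied to;
    4. compute the feature values and return features with the same number of rows, in the same order, as the main table.
  • 3. Feature merging
Merge the constructed features. The number of rows stays the same and the number of columns grows, yielding one large wide table.
  • 4. Train/test split
Split the wide table into a training set and a test set.
  • 5. Feature filtering
Filter the constructed features based on how each feature column is distributed in train and test, to avoid overfitting.
  • 6. Model training
Train the model on the filtered wide-table features.
The model classes provide the following functions:
    1. view the model's default parameters;
    2. train the model;
    3. tune the model's hyperparameters;
    4. view the model's feature importances;
    5. predict with the model.
  • 7. Model prediction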
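
The four functions that every feature engineering class provides (step 2) can be pictured with the minimal sketch below, which uses a count feature as the operation. The class name `CountFeature`, the method names `fit`, `get_ops`, `set_ops`, `transform`, and the `COUNT_` column prefix are hypothetical choices for this illustration and are not taken from the AutoX source.

```python
from collections import Counter


class CountFeature:
    """Hypothetical feature-engineering class with the four capabilities."""

    def __init__(self):
        self.ops = []

    def fit(self, rows, feature_type):
        """1. Automatically select the columns to operate on ('cat' columns)."""
        self.ops = [col for col in rows[0] if feature_type.get(col) == "cat"]
        return self

    def get_ops(self):
        """2. Inspect the selected columns."""
        return list(self.ops)

    def set_ops(self, ops):
        """3. Override the selected columns."""
        self.ops = list(ops)

    def transform(self, rows):
        """4. Compute one feature row per input row, in the same order."""
        counters = {col: Counter(row[col] for row in rows) for col in self.ops}
        return [
            {f"COUNT_{col}": counters[col][row[col]] for col in self.ops}
            for row in rows
        ]


rows = [
    {"id": 1, "city": "A", "shop": "s1", "price": 3.0},
    {"id": 2, "city": "A", "shop": "s2", "price": 4.5},
    {"id": 3, "city": "B", "shop": "s1", "price": 1.0},
]
feature_type = {"id": "cat", "city": "cat", "shop": "cat", "price": "num"}

fe = CountFeature().fit(rows, feature_type)
selected = fe.get_ops()          # columns chosen automatically
fe.set_ops(["city", "shop"])     # drop 'id': its count is always 1
features = fe.transform(rows)
```

Because `transform` returns exactly one feature row per input row in the original order, its output can be merged column-wise into the wide table of step 3.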

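The AutoX class

The AutoX class automatically manages the datasets and the dataset information for the user.
Initializing the AutoX class performs the following operations:
1. read the data;
2. concatenate train and test;
3. identify the type of each column in the data tables;
4. preprocess the data.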

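Attributes

info_: the info_ attribute stores information about the dataset.

  • info_['id']: List, the key(s) that uniquely identify a row of the data table
  • info_['target']: String, the label column of the data table
  • info_['shape_of_train']: Int, number of samples in the train dataset
  • info_['shape_of_test']: Int, number of samples in the test dataset
  • info_['feature_type']: Dict of Dict, the data type of each feature column in the data tables
  • info_['train_name']: String, name of the main training table
  • info_['test_name']: String, name of the main test table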
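
For the quick-start call shown earlier, a populated `info_` might look roughly like the dictionary below. Every value here (sample counts, column names, and the nesting of `feature_type` as table name to column-to-type mapping) is a made-up assumption for illustration only.

```python
# Hypothetical contents of autox.info_ (all values are illustrative).
info_ = {
    "id": ["id"],
    "target": "loss",
    "shape_of_train": 1000,
    "shape_of_test": 500,
    "train_name": "train.csv",
    "test_name": "test.csv",
    # Assumed nesting: table name -> {column name -> type label}
    "feature_type": {
        "train.csv": {"id": "cat", "f0": "num", "f1": "num", "loss": "num"},
        "test.csv": {"id": "cat", "f0": "num", "f1": "num"},
    },
}

# Example lookup: numeric feature columns of the main training table.
numeric_cols = [
    col
    for col, col_type in info_["feature_type"][info_["train_name"]].items()
    if col_type == "num" and col != info_["target"]
]
```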

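dfs_: the dfs_ attribute stores all DataFrames, both the original tables and the constructed ones.

  • dfs_['train_test']: the train and test data after concatenation
  • dfs_['FE_feature_name']: data constructed by a feature engineering step, for example FE_count or FE_groupby
  • dfs_['FE_all']: the dataset obtained by merging the original features with all engineered features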

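Methods

  • concat_train_test: concatenates the training set and the test set; usually run right after reading the data
  • split_train_test: separates the training set and the test set; usually run after feature engineering is finished
  • get_submit: returns the predictions (internally runs the complete machine learning pipeline, including data preprocessing, feature engineering, model training, model tuning, model ensembling, and model prediction)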
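
The idea behind concatenating before feature engineering and splitting afterwards can be sketched in plain Python as follows. The function names mirror the methods above, but the signatures, the list-of-dicts table representation, and the toy `x_sq` feature are assumptions for this illustration; AutoX itself works on DataFrames.

```python
def concat_train_test(train_rows, test_rows, target):
    """Stack train and test rows; test rows get a missing target.

    Returns the combined rows and the train row count needed to split later.
    """
    combined = [dict(row) for row in train_rows]
    combined += [{**row, target: None} for row in test_rows]
    return combined, len(train_rows)


def split_train_test(combined, shape_of_train, target):
    """Undo the concatenation using the stored number of train rows."""
    train_part = combined[:shape_of_train]
    test_part = [
        {k: v for k, v in row.items() if k != target}
        for row in combined[shape_of_train:]
    ]
    return train_part, test_part


train_rows = [{"id": 1, "x": 10, "loss": 0.5}, {"id": 2, "x": 20, "loss": 1.5}]
test_rows = [{"id": 3, "x": 30}]

combined, shape_of_train = concat_train_test(train_rows, test_rows, target="loss")

# Feature engineering runs once on the combined table, so train and test
# receive identical feature definitions (here: a toy "x squared" feature).
for row in combined:
    row["x_sq"] = row["x"] ** 2

train_part, test_part = split_train_test(combined, shape_of_train, target="loss")
```

Storing the number of training rows is what makes the later split possible, which is one plausible reason for keeping `shape_of_train` in `info_`.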

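Details of the operations in the AutoX pipeline:

Reading data

All files under the given path are read. By default, the main training table and the main test table are concatenated
before data preprocessing, feature engineering, and the other downstream steps, and the training and test sets are split apart again before model prediction starts.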

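Data preprocessing

- Parse year, month, day, hour, day of week, and similar information from time columns
- Before each training run, drop invalid features (nunique equal to 1) from the data fed to the model
- Remove abnormal samples: samples whose label is nan are dropped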
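
The three preprocessing steps above can be sketched with the standard library as follows. The helper names, the datetime format, and the list-of-dicts table layout are assumptions for this example rather than AutoX's implementation.

```python
from datetime import datetime
import math


def expand_datetime(rows, col, fmt="%Y-%m-%d %H:%M:%S"):
    """Add year/month/day/hour/weekday columns parsed from a time column."""
    for row in rows:
        dt = datetime.strptime(row[col], fmt)
        row[f"{col}_year"] = dt.year
        row[f"{col}_month"] = dt.month
        row[f"{col}_day"] = dt.day
        row[f"{col}_hour"] = dt.hour
        row[f"{col}_weekday"] = dt.weekday()  # Monday == 0
    return rows


def drop_nan_labels(rows, target):
    """Remove samples whose label is missing or NaN."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [row for row in rows if not is_missing(row[target])]


def constant_columns(rows):
    """Return the columns that hold a single unique value (nunique == 1)."""
    return [c for c in rows[0] if len({row[c] for row in rows}) == 1]


rows = [
    {"ts": "2021-08-02 09:15:00", "flag": 1, "loss": 0.3},
    {"ts": "2021-08-03 18:40:00", "flag": 1, "loss": float("nan")},
    {"ts": "2021-08-07 23:05:00", "flag": 1, "loss": 1.2},
]
rows = drop_nan_labels(expand_datetime(rows, "ts"), target="loss")
useless = constant_columns(rows)  # candidates to drop before training
```

In this toy table, `flag` never changes, and the parsed year and month are also constant, so all three would be dropped before training.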

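Feature engineering

  • 1-1 table-join features
  • 1-M table-join features
- time diff features
- aggregate statistics features
  • count features
For each feature column being processed, the number of samples in the whole dataset that share the current sample's value is used as the feature.
  • target encoding features

  • statistical features

Statistical features are extracted with two nested for loops.
The outer loop iterates over all selected grouping features (group_col),
the inner loop iterates over all selected aggregation features (agg_col).
Inside the inner loop,
for a categorical feature the computed statistic is nunique;
for a numeric feature the computed statistics are [median, std, sum, max, min, mean].
  • shift features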
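
The two nested loops used for statistical features can be sketched as below. The function name `groupby_stats`, the `group__agg__stat` naming scheme, and the use of the population standard deviation (so that single-row groups do not fail) are assumptions for this illustration, not AutoX's actual code.

```python
from collections import defaultdict
from statistics import mean, median, pstdev


def groupby_stats(rows, group_cols, agg_cols, feature_type):
    """Outer loop over grouping columns, inner loop over aggregated columns."""
    features = {}
    for group_col in group_cols:
        for agg_col in agg_cols:
            if agg_col == group_col:
                continue
            # Collect the agg_col values of every group.
            groups = defaultdict(list)
            for row in rows:
                groups[row[group_col]].append(row[agg_col])
            if feature_type[agg_col] == "cat":
                stats = {"nunique": lambda v: len(set(v))}
            else:
                stats = {"median": median, "std": pstdev, "sum": sum,
                         "max": max, "min": min, "mean": mean}
            for stat_name, func in stats.items():
                name = f"{group_col}__{agg_col}__{stat_name}"
                per_group = {key: func(vals) for key, vals in groups.items()}
                # One value per row, in the original row order.
                features[name] = [per_group[row[group_col]] for row in rows]
    return features


rows = [
    {"user": "a", "item": "x", "price": 10.0},
    {"user": "a", "item": "y", "price": 30.0},
    {"user": "b", "item": "x", "price": 5.0},
]
feature_type = {"user": "cat", "item": "cat", "price": "num"}
stats = groupby_stats(rows, ["user"], ["item", "price"], feature_type)
```

Each resulting list has the same length and order as `rows`, so it can be attached to the main table as a new column.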

Model training

AutoX currently supports the following models; by default, the LightGBM model is trained:
1. LightGBM;
2. the AutoX deep neural network.

Model ensembling

AutoX supports the following two ensembling methods; by default, no ensembling is performed.
1. Stacking;
2. Bagging.

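Competition score-boosting tips:

kaggle criteo: bucket feature columns whose nunique is very large. For example, for a feature with nunique above 10000, hash the values, truncate the hash to 4 digits, and then apply label_encode.
zhidemai: article_id implicitly carries time information, so add a rank feature on article_id, for example groupby(['date'])['article_id'].rank().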
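
Both tips can be sketched with the standard library as follows. Reading "keep 4 digits" as the hash modulo 10**4, the choice of md5, and all function and column names other than `date` and `article_id` are assumptions for this example. The rank helper assumes there are no ties within a group, whereas the pandas `rank()` in the tip averages tied ranks by default.

```python
import hashlib
from collections import defaultdict


def hash_bucket(value, digits=4):
    """Map a value to a bucket id by hashing and keeping `digits` digits.

    Assumption: "keep 4 digits" is read as hash modulo 10**4. md5 is used
    only because it is stable across runs (built-in hash() is salted).
    """
    digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % (10 ** digits)


def label_encode(values):
    """Replace each distinct value with a consecutive integer code."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


def rank_within_group(rows, group_col, value_col):
    """Rank of value_col inside each group (1 = smallest, no ties assumed)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    ordered = {key: sorted(vals) for key, vals in groups.items()}
    return [ordered[row[group_col]].index(row[value_col]) + 1 for row in rows]


# Tip 1: high-cardinality column -> hash bucket -> label encoding.
device_ids = ["dev_001", "dev_002", "dev_001", "dev_003"]
buckets = [hash_bucket(v) for v in device_ids]
encoded = label_encode(buckets)

# Tip 2: rank of article_id within each date.
articles = [
    {"date": "2021-01-01", "article_id": 105},
    {"date": "2021-01-01", "article_id": 101},
    {"date": "2021-01-02", "article_id": 230},
]
ranks = rank_within_group(articles, "date", "article_id")
```

Bucketing caps the number of distinct values at 10**4, at the cost of occasional collisions between unrelated values.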

Troubleshooting

Error message    Solution
Comments
  • AutoX_Recommend, dataset processing: kdd cup 2020

    Raw data: https://tianchi.aliyun.com/competition/entrance/231785/introduction
    Reference notebooks for the data processing: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend 
    opened by poteman 1
  • AutoX_Recommend, dataset processing: Amazon product data

    Raw data: http://jmcauley.ucsd.edu/data/amazon/
    Reference notebooks for the data processing: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend 
    opened by poteman 1
  • AutoX_Recommend, dataset processing: Amazon electronic product recommendation

    Raw data: https://www.kaggle.com/datasets/prokaggler/amazon-electronic-product-recommendation
    Reference notebooks for the data processing: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend 
    opened by poteman 1
  • ModuleNotFoundError: No module named 'autox.autox_server'

    git clone https://github.com/4paradigm/AutoX.git
    pip install pytorch_tabnet
    pip install ./AutoX
    python
    from autox import AutoX

    ModuleNotFoundError: No module named 'autox.autox_server'

    opened by utopianet 1
  • lightgbm.train bug(lightgbm==3.3.2.99)

    On Mac with lightgbm==3.3.2.99, lightgbm.train no longer accepts the verbose_eval and early_stopping_rounds arguments and uses the callbacks argument instead, so calling the lgb model raises an error:

    File ~/miniforge3/envs/lx/lib/python3.9/site-packages/autox/autox_competition/models/regressor_ts.py:231, in LgbRegressionTs.fit(self, train, test, used_features, target, time_col, ts_unit, Early_Stopping_Rounds, N_round, Verbose, log1p, custom_metric, weight_for_mae)
        226     model = lgb.train(self.params_, trn_data, num_boost_round=self.N_round, valid_sets=[trn_data, val_data],
        227                       verbose_eval=self.Verbose,
        228                       early_stopping_rounds=self.Early_Stopping_Rounds,
        229                       feval=weighted_mae_lgb(weight=weight_for_mae))
        230 else:
    --> 231     model = lgb.train(self.params_, trn_data, num_boost_round=self.N_round, valid_sets=[trn_data, val_data],
    ...
        233                     early_stopping_rounds=self.Early_Stopping_Rounds)
        234 val = model.predict(train.iloc[valid_idx][used_features])
        235 if log1p:
    
    TypeError: train() got an unexpected keyword argument 'verbose_eval'
    
    opened by LXlearning 0
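
    One possible workaround for the error above, sketched under the assumption that the installed lightgbm provides the `lgb.early_stopping` and `lgb.log_evaluation` callbacks (available in recent releases), is to pass them through `callbacks` instead of the removed keyword arguments. The wrapper name `train_lgb_with_callbacks` and its parameters are hypothetical and are not part of AutoX.

```python
def train_lgb_with_callbacks(params, trn_data, val_data, n_round,
                             early_stopping_rounds, verbose, feval=None):
    """Call lgb.train using callbacks instead of the removed keyword arguments.

    Hypothetical wrapper. lightgbm is imported lazily so that the function
    can be defined even where the package is not installed.
    """
    import lightgbm as lgb

    callbacks = [
        # Replaces early_stopping_rounds=...
        lgb.early_stopping(stopping_rounds=early_stopping_rounds),
        # Replaces verbose_eval=... (expects an integer logging period)
        lgb.log_evaluation(period=verbose),
    ]
    return lgb.train(
        params,
        trn_data,
        num_boost_round=n_round,
        valid_sets=[trn_data, val_data],
        feval=feval,
        callbacks=callbacks,
    )
```

    If `Verbose` is stored as a boolean rather than an integer period, it would need to be converted before being passed to `lgb.log_evaluation`.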
  • AutoX_NLP/ nlp_feature.py, glove environment adaptation

    opened by DHengW 0
  • AutoX_NLP/ nlp_feature.py, OOV handling optimization

    opened by DHengW 0
  • AutoX_NLP/ nlp_feature.py, fasttext processing efficiency optimization

    opened by DHengW 0
Releases(v5.2.0)
Owner
4Paradigm
4Paradigm Open Source Community