Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas.

Overview

codecov Supported versions Supported versions Supported versions CircleCI Build Status

skoot

Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas. Its objective is to expedite data munging and pre-processing tasks that can tend to take up so much of data science practitioners' time. See the documentation for more info.

Note that skoot is the preferred alternative to the now deprecated skutil library

Two minutes to model-readiness

Real world data is nasty. Most data scientists spend the majority of their time tackling data cleansing tasks. With skoot, we can automate away so much of the bespoke hacking solutions that consume data scientists' time.

In this example, we'll examine a common dataset (the adult dataset from the UCI machine learning repo) that requires significant pre-processing.

from skoot.datasets import load_adult_df
from skoot.feature_selection import FeatureFilter
from skoot.decomposition import SelectivePCA
from skoot.preprocessing import DummyEncoder
from skoot.utils.dataframe import get_numeric_columns
from skoot.utils.dataframe import get_categorical_columns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# load the dataset with the skoot-native loader & split it
adult = load_adult_df(tgt_name="target")
y = adult.pop("target")
X_train, X_test, y_train, y_test = train_test_split(
    adult, y, random_state=42, test_size=0.2)
    
# get numeric and categorical feature names
num_cols = get_numeric_columns(X_train).columns
obj_cols = get_categorical_columns(X_train).columns

# remove the education-num from the num_cols since we're going to remove it
num_cols = num_cols[~(num_cols == "education-num")]
    
# build a pipeline
pipe = Pipeline([
    # drop out the ordinal level that's otherwise equal to "education"
    ("dropper", FeatureFilter(cols=["education-num"])),
    
    # decompose the numeric features with PCA
    ("pca", SelectivePCA(cols=num_cols)),
    
    # dummy encode the categorical features
    ("dummy", DummyEncoder(cols=obj_cols, handle_unknown="ignore")),
    
    # and a simple classifier class
    ("clf", RandomForestClassifier(n_estimators=100, random_state=42))
])

pipe.fit(X_train, y_train)

# produce predictions
preds = pipe.predict(X_test)
print("Test accuracy: %.3f" % accuracy_score(y_test, preds))

For more tutorials, check out the documentation.

Comments
  • Windows: pip install not working

    Windows: pip install not working

    Hi, I can't install skoot neither via pip, nor anaconda.

    > pip install skoot
    Collecting skoot
      Could not find a version that satisfies the requirement skoot (from versions: )
    No matching distribution found for skoot
    

    Any ideas why that might be? Thank you!

    opened by r0f1 2
  • Bump django from 1.11 to 1.11.29 in /build_tools/doc

    Bump django from 1.11 to 1.11.29 in /build_tools/doc

    Bumps django from 1.11 to 1.11.29.

    Commits
    • f1e3017 [1.11.x] Bumped version for 1.11.29 release.
    • 02d97f3 [1.11.x] Fixed CVE-2020-9402 -- Properly escaped tolerance parameter in GIS f...
    • e643833 [1.11.x] Pinned PyYAML < 5.3 in test requirements.
    • d0e3eb8 [1.11.x] Added CVE-2020-7471 to security archive.
    • 9a62ed5 [1.11.x] Post-release version bump.
    • e09f09b [1.11.x] Bumped version for 1.11.28 release.
    • 001b063 [1.11.x] Fixed CVE-2020-7471 -- Properly escaped StringAgg(delimiter) parameter.
    • 7fd1ca3 [1.11.x] Fixed timezones tests for PyYAML 5.3+.
    • 121115d [1.11.x] Added CVE-2019-19844 to the security archive.
    • 2c4fb9a [1.11.x] Post-release version bump.
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Bump django from 1.11 to 1.11.28 in /build_tools/doc

    Bump django from 1.11 to 1.11.28 in /build_tools/doc

    Bumps django from 1.11 to 1.11.28.

    Commits
    • e09f09b [1.11.x] Bumped version for 1.11.28 release.
    • 001b063 [1.11.x] Fixed CVE-2020-7471 -- Properly escaped StringAgg(delimiter) parameter.
    • 7fd1ca3 [1.11.x] Fixed timezones tests for PyYAML 5.3+.
    • 121115d [1.11.x] Added CVE-2019-19844 to the security archive.
    • 2c4fb9a [1.11.x] Post-release version bump.
    • 358973a [1.11.x] Bumped version for 1.11.27 release.
    • f4cff43 [1.11.x] Fixed CVE-2019-19844 -- Used verified user email for password reset ...
    • a235574 [1.11.x] Refs #31073 -- Added release notes for 02eff7ef60466da108b1a33f1e4dc...
    • e8fdf00 [1.11.x] Fixed #31073 -- Prevented CheckboxInput.get_context() from mutating ...
    • 4f15016 [1.11.x] Post-release version bump.
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Bump django from 1.11 to 1.11.23 in /build_tools/doc

    Bump django from 1.11 to 1.11.23 in /build_tools/doc

    Bumps django from 1.11 to 1.11.23.

    Commits
    • 9748977 [1.11.x] Bumped version for 1.11.23 release.
    • 869b34e [1.11.x] Fixed CVE-2019-14235 -- Fixed potential memory exhaustion in django....
    • ed682a2 [1.11.x] Fixed CVE-2019-14234 -- Protected JSONField/HStoreField key and inde...
    • 52479ac [1.11.x] Fixed CVE-2019-14233 -- Prevented excessive HTMLParser recursion in ...
    • 42a66e9 [1.11.X] Fixed CVE-2019-14232 -- Adjusted regex to avoid backtracking issues ...
    • 693046e [1.11.x] Added stub release notes for security releases.
    • 6d054b5 [1.11.x] Added CVE-2019-12781 to the security release archive.
    • 7c849b9 [1.11.x] Post-release version bump.
    • 480380c [1.11.x] Bumped version for 1.11.22 release.
    • 32124fc [1.11.x] Fixed CVE-2019-12781 -- Made HttpRequest always trust SECURE_PROXY_S...
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot ignore this [patch|minor|major] version will close this PR and stop Dependabot creating any more for this minor/major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Wrapped classes still reference sklearn user-guide

    Wrapped classes still reference sklearn user-guide

    The "See Also" section of wrapped sklearn estimators still references sklearn user_guide refs. We need to monkey patch "Selective" (or whatever prefix we are using) in front of them so they link in our documentation.

    bug 
    opened by tgsmith61591 1
  • Bump django from 1.11 to 2.2.24 in /build_tools/doc

    Bump django from 1.11 to 2.2.24 in /build_tools/doc

    Bumps django from 1.11 to 2.2.24.

    Commits
    • 2da029d [2.2.x] Bumped version for 2.2.24 release.
    • f27c38a [2.2.x] Fixed CVE-2021-33571 -- Prevented leading zeros in IPv4 addresses.
    • 053cc95 [2.2.x] Fixed CVE-2021-33203 -- Fixed potential path-traversal via admindocs'...
    • 6229d87 [2.2.x] Confirmed release date for Django 2.2.24.
    • f163ad5 [2.2.x] Added stub release notes and date for Django 2.2.24.
    • bed1755 [2.2.x] Changed IRC references to Libera.Chat.
    • 63f0d7a [2.2.x] Refs #32718 -- Fixed file_storage.test_generate_filename and model_fi...
    • 5fe4970 [2.2.x] Post-release version bump.
    • 61f814f [2.2.x] Bumped version for 2.2.23 release.
    • b8ecb06 [2.2.x] Fixed #32718 -- Relaxed file name validation in FileField.
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • scipy._lib_version not found when building package

    scipy._lib_version not found when building package

    problem: error saying scipy._lib_version is missing when building skoot

    cause: scipy._lib_version was removed in scipy 1.5.0 --> https://github.com/scipy/scipy/pull/11290 (downgrading to scipy 1.4.0 helps)

    Thanks!

    opened by AgroSimi 0
  • pip install Skoot on Mac keeps failing with ERROR: Could not find a version that satisfies the requirement skoot (from versions: none).

    pip install Skoot on Mac keeps failing with ERROR: Could not find a version that satisfies the requirement skoot (from versions: none).

    Description

    pip install Skoot on Mac keeps failing with ERROR: Could not find a version that satisfies the requirement skoot (from versions: none) ERROR: No matching distribution found for skoot

    Steps/Code to Reproduce

    pip install skoot using python version : Python 2.7.17 using pip version : pip 19.3.1

    Expected Results

    No errors thrown, successful installation of Skoot

    Actual Results

    ERROR: Could not find a version that satisfies the requirement skoot (from versions: none) ERROR: No matching distribution found for skoot

    Versions

    platform - Darwin-19.2.0-x86_64-i386-64bit sys - ('Python', '2.7.17 (default, Oct 24 2019, 12:57:47) \n[GCC 4.2.1 Compatible Apple LLVM 11.0.0 (clang-1100.0.33.8)]') Skoot -( not able to install ) numpy -("NumPy", numpy.version) scipy - ('SciPy', '1.2.3') sklearn - scikit-learn->sklearn (1.16.6)

    opened by lakshmikrish-97 8
  • [MRG] Mac builds

    [MRG] Mac builds

    This PR adds builds for mac. Currently, it does not deploy to PyPI. We still need the deploy-vars group on ADO. Since we decided to just do mac + Linux for now, this branched off of add-azure... We can use that branch to play around with Windows, or create a new one

    opened by aaronreidsmith 1
  • Package Roadmap

    Package Roadmap

    Is skoot still an active project? Or is there a successor to this concept? Looking to build something similar for my specific workflow, but maybe it would be mutually beneficial to contribute to this project.

    opened by MattConflitti 2
  • String fields with typos

    String fields with typos

    Description

    TODO: Create a transformer that can map values in text fields to known "good" values given Levenstein distance or some other method.

    enhancement 
    opened by tgsmith61591 0
Releases(0.20.0)
Owner
Taylor G Smith
Data scientist, ML engineer and all-around hacker. Java was once my first love, but I've long since converted to the cult of Python.
Taylor G Smith
A toolkit for making real world machine learning and data analysis applications in C++

dlib C++ library Dlib is a modern C++ toolkit containing machine learning algorithms and tools for creating complex software in C++ to solve real worl

Davis E. King 11.6k Jan 02, 2023
A Lucid Framework for Transparent and Interpretable Machine Learning Models.

Currently a Beta-Version lucidmode is an open-source, low-code and lightweight Python framework for transparent and interpretable machine learning mod

lucidmode 15 Aug 12, 2022
This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev

MLProject_01 This project impelemented for midterm of the Machine Learning #Zoomcamp #Alexey Grigorev Context Dataset English question data set file F

Hadi Nakhi 1 Dec 18, 2021
Napari sklearn decomposition

napari-sklearn-decomposition A simple plugin to use with napari This napari plug

1 Sep 01, 2022
A simple python program which predicts the success of a movie based on it's type, actor, actress and director

Movie-Success-Prediction A simple python program which predicts the success of a movie based on it's type, actor, actress and director. The program us

Mahalinga Prasad R N 1 Dec 17, 2021
ETNA is an easy-to-use time series forecasting framework.

ETNA is an easy-to-use time series forecasting framework. It includes built in toolkits for time series preprocessing, feature generation, a variety of predictive models with unified interface - from

Tinkoff.AI 674 Jan 07, 2023
决策树分类与回归模型的实现和可视化

DecisionTree 决策树分类与回归模型,以及可视化 DecisionTree ID3 C4.5 CART 分类 回归 决策树绘制 分类树 回归树 调参 剪枝 ID3 ID3决策树是最朴素的决策树分类器: 无剪枝 只支持离散属性 采用信息增益准则 在data.py中,我们记录了一个小的西瓜数据

Welt Xing 10 Oct 22, 2022
Lightning ⚡️ fast forecasting with statistical and econometric models.

Nixtla Statistical ⚡️ Forecast Lightning fast forecasting with statistical and econometric models StatsForecast offers a collection of widely used uni

Nixtla 2.1k Dec 29, 2022
A visual dataflow programming language for sklearn

Persimmon What is it? Persimmon is a visual dataflow language for creating sklearn pipelines. It represents functions as blocks, inputs and outputs ar

Álvaro Bermejo 194 Jan 04, 2023
database for artificial intelligence/machine learning data

AIDB v0.0.1 database for artificial intelligence/machine learning data Overview aidb is a database designed for large dataset for machine learning pro

Aarush Gupta 1 Oct 24, 2021
A project based example of Data pipelines, ML workflow management, API endpoints and Monitoring.

MLOps template with examples for Data pipelines, ML workflow management, API development and Monitoring.

Utsav 33 Dec 03, 2022
fMRIprep Pipeline To Machine Learning

fMRIprep Pipeline To Machine Learning(Demo) 所有配置均在config.py文件下定义 前置环境(lilab) 各个节点均安装docker,并有fmripre的镜像 可以使用conda中的base环境(相应的第三份包之后更新) 1. fmriprep scr

Alien 3 Mar 08, 2022
Dieses Projekt ermöglicht es den Smartmeter der EVN (Netz Niederösterreich) über die Kundenschnittstelle auszulesen.

SmartMeterEVN Dieses Projekt ermöglicht es den Smartmeter der EVN (Netz Niederösterreich) über die Kundenschnittstelle auszulesen. Smart Meter werden

greenMike 43 Dec 04, 2022
Practical Time-Series Analysis, published by Packt

Practical Time-Series Analysis This is the code repository for Practical Time-Series Analysis, published by Packt. It contains all the supporting proj

Packt 325 Dec 23, 2022
AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications.

AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy m

Robin 55 Dec 27, 2022
Reggy - Regressions with arbitrarily complex regularization terms

reggy Regressions with arbitrarily complex regularization terms. Currently suppo

Kim 1 Jan 20, 2022
My capstone project for Udacity's Machine Learning Nanodegree

MLND-Capstone My capstone project for Udacity's Machine Learning Nanodegree Lane Detection with Deep Learning In this project, I use a deep learning-b

Michael Virgo 407 Dec 12, 2022
Tool for producing high quality forecasts for time series data that has multiple seasonality with linear or non-linear growth.

Prophet: Automatic Forecasting Procedure Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends ar

Facebook 15.4k Jan 07, 2023
Relevance Vector Machine implementation using the scikit-learn API.

scikit-rvm scikit-rvm is a Python module implementing the Relevance Vector Machine (RVM) machine learning technique using the scikit-learn API. Quicks

James Ritchie 204 Nov 18, 2022
ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

ml4h is a toolkit for machine learning on clinical data of all kinds including genetics, labs, imaging, clinical notes, and more

Broad Institute 65 Dec 20, 2022