pure-predict: Machine learning prediction in pure Python

Overview
pure-predict

pure-predict: Machine learning prediction in pure Python

License Build Status PyPI Package Downloads Python Versions

pure-predict speeds up and slims down machine learning prediction applications. It is a foundational tool for serverless inference or small batch prediction with popular machine learning frameworks like scikit-learn and fasttext. It implements the predict methods of these frameworks in pure Python.

Primary Use Cases

The primary use case for pure-predict is the following scenario:

  1. A model is trained in an environment without strong container footprint constraints. Perhaps a long running "offline" job on one or many machines where installing a number of python packages from PyPI is not at all problematic.
  2. At prediction time the model needs to be served behind an API. Typical access patterns are to request a prediction for one "record" (one "row" in a numpy array or one string of text to classify) per request or a mini-batch of records per request.
  3. Preferred infrastructure for the prediction service is either serverless (AWS Lambda) or a container service where the memory footprint of the container is constrained.
  4. The fitted model object's artifacts needed for prediction (coefficients, weights, vocabulary, decision tree artifacts, etc.) are relatively small (10s to 100s of MBs).
diagram

In this scenario, a container service with a large dependency footprint can be overkill for a microservice, particularly if the access patterns favor the pricing model of a serverless application. Additionally, for smaller models and single record predictions per request, the numpy and scipy functionality in the prediction methods of popular machine learning frameworks work against the application in terms of latency, underperforming pure python in some cases.

Check out the blog post for more information on the motivation and use cases of pure-predict.

Package Details

It is a Python package for machine learning prediction distributed under the Apache 2.0 software license. It contains multiple subpackages which mirror their open source counterpart (scikit-learn, fasttext, etc.). Each subpackage has utilities to convert a fitted machine learning model into a custom object containing prediction methods that mirror their native counterparts, but converted to pure python. Additionally, all relevant model artifacts needed for prediction are converted to pure python.

A pure-predict model object can then be pickled and later unpickled without any 3rd party dependencies other than pure-predict.

This eliminates the need to have large dependency packages installed in order to make predictions with fitted machine learning models using popular open source packages for training models. These dependencies (numpy, scipy, scikit-learn, fasttext, etc.) are large in size and not always necessary to make fast and accurate predictions. Additionally, they rely on C extensions that may not be ideal for serverless applications with a python runtime.

Quick Start Example

In a python enviornment with scikit-learn and its dependencies installed:

import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from pure_sklearn.map import convert_estimator

# fit sklearn estimator
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier()
clf.fit(X, y)

# convert to pure python estimator
clf_pure_predict = convert_estimator(clf)
with open("model.pkl", "wb") as f:
    pickle.dump(clf_pure_predict, f)

# make prediction with sklearn estimator
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]

In a python enviornment with only pure-predict installed:

import pickle

# load pickled model
with open("model.pkl", "rb") as f:
    clf = pickle.load(f)

# make prediction with pure-predict object
y_pred = clf.predict([[0.25, 2.0, 8.3, 1.0]])
print(y_pred)
[2]

Subpackages

pure_sklearn

Prediction in pure python for a subset of scikit-learn estimators and transformers.

  • estimators
    • linear models - supports the majority of linear models for classification
    • trees - decision trees, random forests, gradient boosting and xgboost
    • naive bayes - a number of popular naive bayes classifiers
    • svm - linear SVC
  • transformers
    • preprocessing - normalization and onehot/ordinal encoders
    • impute - simple imputation
    • feature extraction - text (tfidf, count vectorizer, hashing vectorizer) and dictionary vectorization
    • pipeline - pipelines and feature unions

Sparse data - supports a custom pure python sparse data object - sparse data is handled as would be expected by the relevent transformers and estimators

pure_fasttext

Prediction in pure python for fasttext.

  • supervised - predicts labels for supervised models; no support for quantized models (blocked by this issue)
  • unsupervised - lookup of word or sentence embeddings given input text

Installation

Dependencies

pure-predict requires:

Dependency Notes

  • pure_sklearn has been tested with scikit-learn versions >= 0.20 -- certain functionality may work with lower versions but are not guaranteed. Some functionality is explicitly not supported for certain scikit-learn versions and exceptions will be raised as appropriate.
  • xgboost requires version >= 0.82 for support with pure_sklearn.
  • pure-predict is not supported with Python 2.
  • fasttext versions <= 0.9.1 have been tested.

User Installation

The easiest way to install pure-predict is with pip:

pip install --upgrade pure-predict

You can also download the source code:

git clone https://github.com/Ibotta/pure-predict.git

Testing

With pytest installed, you can run tests locally:

pytest pure-predict

Examples

The package contains examples on how to use pure-predict in practice.

Calls for Contributors

Contributing to pure-predict is welcomed by any contributors. Specific calls for contribution are as follows:

  1. Examples, tests and documentation -- particularly more detailed examples with performance testing of various estimators under various constraints.
  2. Adding more pure_sklearn estimators. The scikit-learn package is extensive and only partially covered by pure_sklearn. Regression tasks in particular missing from pure_sklearn. Clustering, dimensionality reduction, nearest neighbors, feature selection, non-linear SVM, and more are also omitted and would be good candidates for extending pure_sklearn.
  3. General efficiency. There is likely low hanging fruit for improving the efficiency of the numpy and scipy functionality that has been ported to pure-predict.
  4. Threading could be considered to improve performance -- particularly for making predictions with multiple records.
  5. A public AWS lambda layer containing pure-predict.

Background

The project was started at Ibotta Inc. on the machine learning team and open sourced in 2020. It is currently maintained by the machine learning team at Ibotta.

Acknowledgements

Thanks to David Mitchell and Andrew Tilley for internal review before open source. Thanks to James Foley for logo artwork.

IbottaML
Owner
Ibotta
Ibotta
Predicting India’s COVID-19 Third Wave with LSTM

Predicting India’s COVID-19 Third Wave with LSTM Complete project of predicting new COVID-19 cases in the next 90 days with LSTM India is seeing a ste

Samrat Dutta 4 Jan 27, 2022
Customers Segmentation with RFM Scores and K-means

Customer Segmentation with RFM Scores and K-means RFM Segmentation table: K-Means Clustering: Business Problem Rule-based customer segmentation machin

5 Aug 10, 2022
Python package for concise, transparent, and accurate predictive modeling

Python package for concise, transparent, and accurate predictive modeling. All sklearn-compatible and easy to use. 📚 docs • 📖 demo notebooks Modern

Chandan Singh 983 Jan 01, 2023
a distributed deep learning platform

Apache SINGA Distributed deep learning system http://singa.apache.org Quick Start Installation Examples Issues JIRA tickets Code Analysis: Mailing Lis

The Apache Software Foundation 2.7k Jan 05, 2023
jaxfg - Factor graph-based nonlinear optimization library for JAX.

Factor graphs + nonlinear optimization in JAX

Brent Yi 134 Dec 21, 2022
Azure MLOps (v2) solution accelerators.

Azure MLOps (v2) solution accelerator Welcome to the MLOps (v2) solution accelerator repository! This project is intended to serve as the starting poi

Microsoft Azure 233 Jan 01, 2023
Predict the output which should give a fair idea about the chances of admission for a student for a particular university

Predict the output which should give a fair idea about the chances of admission for a student for a particular university.

ArvindSandhu 1 Jan 11, 2022
Code for the TCAV ML interpretability project

Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) Been Kim, Martin Wattenberg, Justin Gilmer, C

552 Dec 27, 2022
As we all know the BGMI Loot Crate comes with so many resources for the gamers, this ML Crate will be the hub of various ML projects which will be the resources for the ML enthusiasts! Open Source Program: SWOC 2021 and JWOC 2022.

Machine Learning Loot Crate 💻 🧰 🔴 Welcome contributors! As we all know the BGMI Loot Crate comes with so many resources for the gamers, this ML Cra

Abhishek Sharma 89 Dec 28, 2022
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

TensorFlowOnSpark TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the T

Yahoo 3.8k Jan 04, 2023
Banpei is a Python package of the anomaly detection.

Banpei Banpei is a Python package of the anomaly detection. Anomaly detection is a technique used to identify unusual patterns that do not conform to

Hirofumi Tsuruta 282 Jan 03, 2023
MLflow App Using React, Hooks, RabbitMQ, FastAPI Server, Celery, Microservices

Katana ML Skipper This is a simple and flexible ML workflow engine. It helps to orchestrate events across a set of microservices and create executable

Tom Xu 8 Nov 17, 2022
Scikit-Garden or skgarden is a garden for Scikit-Learn compatible decision trees and forests.

Scikit-Garden or skgarden (pronounced as skarden) is a garden for Scikit-Learn compatible decision trees and forests.

260 Dec 21, 2022
A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

pmdarima Pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a statistical library designed to fill the void in Python's time se

alkaline-ml 1.3k Jan 06, 2023
NCVX (NonConVeX): A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning.

NCVX (NonConVeX): A User-Friendly and Scalable Package for Nonconvex Optimization in Machine Learning.

SUN Group @ UMN 28 Aug 03, 2022
Simple structured learning framework for python

PyStruct PyStruct aims at being an easy-to-use structured learning and prediction library. Currently it implements only max-margin methods and a perce

pystruct 666 Jan 03, 2023
DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning.

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported ha

Microsoft 1.1k Jan 04, 2023
Used Logistic Regression, Random Forest, and XGBoost to predict the outcome of Search & Destroy games from the Call of Duty World League for the 2018 and 2019 seasons.

Call of Duty World League: Search & Destroy Outcome Predictions Growing up as an avid Call of Duty player, I was always curious about what factors led

Brett Vogelsang 2 Jan 18, 2022
A simple application that calculates the probability distribution of a normal distribution

probability-density-function General info An application that calculates the probability density and cumulative distribution of a normal distribution

1 Oct 25, 2022
ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions

A library for debugging/inspecting machine learning classifiers and explaining their predictions

154 Dec 17, 2022