icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

Overview

icepickle

It's a cooler way to store simple linear models.

The goal of icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models. Not only is this much safer, but it also allows for an interesting finetuning pattern that does not require a GPU.

Installation

You can install everything with pip:

python -m pip install icepickle

Usage

Let's say that you've gotten a linear model from scikit-learn trained on a dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

clf = LogisticRegression()
clf.fit(X, y)

Then you could use a pickle to save the model.

from joblib import dump, load

# You can save the classifier.
dump(clf, 'classifier.joblib')

# You can load it too.
clf_reloaded = load('classifier.joblib')

But this is unsafe. The scikit-learn documentations even warns about the security concerns and compatibility issues. The goal of this package is to offer a safe alternative to pickling for simple linear models. The coefficients will be saved in a .h5 file and can be loaded into a new regression model later.

from icepickle.linear_model import save_coefficients, load_coefficients

# You can save the classifier.
save_coefficients(clf, 'classifier.h5')

# You can create a new model, with new hyperparams.
clf_reloaded = LogisticRegression()

# Load the previously trained weights in.
load_coefficients(clf_reloaded, 'classifier.h5')

This is a lot safer and there's plenty of use-cases that could be handled this way.

There's a cool finetuning-trick we can do now too!

Finetuning

Assuming that you use a stateless featurizer in your pipeline, such as HashingVectorizer or language models from whatlies, you choose to pre-train your scikit-learn model beforehand and fine-tune it later using models that offer the .partial_fit()-api. If you're unfamiliar with this api, you might appreciate this course on calmcode.

This library also comes with utilities that makes it easier to finetune systems via the .partial_fit() API. In particular we offer partial pipeline components via the icepickle.pipeline submodule.

import pandas as pd
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.feature_extraction.text import HashingVectorizer

from icepickle.linear_model import save_coefficients, load_coefficients
from icepickle.pipeline import make_partial_pipeline

url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']

# Train a pre-trained model.
pretrained = LogisticRegression()
pipe = make_partial_pipeline(HashingVectorizer(), pretrained)
pipe.fit(X, y)

# Save the coefficients, safely.
save_coefficients(pretrained, 'pretrained.h5')

# Create a new model using pre-trained weights.
finetuned = SGDClassifier()
load_coefficients(finetuned, 'pretrained.h5')
new_pipe = make_partial_pipeline(HashingVectorizer(), finetuned)

# This new model can be used for fine-tuning.
for i in range(10):
    # Inside this for-loop you could consider doing data-augmentation.
    new_pipe.partial_fit(X, y)
Supported Pipeline Parts

The following pipeline components are added.

from icepickle.pipeline import (
    PartialPipeline,
    PartialFeatureUnion,
    make_partial_pipeline,
    make_partial_union,
)

These tools allow you to declare pipelines that support .partial_fit. Note that components used in these pipelines all need to have .partial_fit() implemented.

Supported Scikit-Learn Models

We unit test against the following models in our save_coefficients and load_coefficients functions.

from sklearn.linear_model import (
    SGDClassifier,
    SGDRegressor,
    LinearRegression,
    LogisticRegression,
    PassiveAggressiveClassifier,
    PassiveAggressiveRegressor,
)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].
vincent d warmerdam
Implementation of K-Nearest Neighbors Algorithm Using PySpark

KNN With Spark Implementation of KNN using PySpark. The KNN was used on two separate datasets (https://archive.ics.uci.edu/ml/datasets/iris and https:

Zachary Petroff 4 Dec 30, 2022
Machine-learning-dell - Repositório com as atividades desenvolvidas no curso de Machine Learning

📚 Descrição Neste curso da Dell aprofundamos nossos conhecimentos em Machine Learning. 🖥️ Aulas (Em curso) 1.1 - Python aplicado a Data Science 1.2

Claudia dos Anjos 1 Jan 05, 2022
决策树分类与回归模型的实现和可视化

DecisionTree 决策树分类与回归模型,以及可视化 DecisionTree ID3 C4.5 CART 分类 回归 决策树绘制 分类树 回归树 调参 剪枝 ID3 ID3决策树是最朴素的决策树分类器: 无剪枝 只支持离散属性 采用信息增益准则 在data.py中,我们记录了一个小的西瓜数据

Welt Xing 10 Oct 22, 2022
It is a forest of random projection trees

rpforest rpforest is a Python library for approximate nearest neighbours search: finding points in a high-dimensional space that are close to a given

Lyst 211 Dec 29, 2022
Simple linear model implementations from scratch.

Hand Crafted Models Simple linear model implementations from scratch. Table of contents Overview Project Structure Getting started Citing this project

Jonathan Sadighian 2 Sep 13, 2021
A Lightweight Hyperparameter Optimization Tool 🚀

The mle-hyperopt package provides a simple and intuitive API for hyperparameter optimization of your Machine Learning Experiment (MLE) pipeline.

Robert Lange 137 Dec 02, 2022
TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.

TensorFlowOnSpark TensorFlowOnSpark brings scalable deep learning to Apache Hadoop and Apache Spark clusters. By combining salient features from the T

Yahoo 3.8k Jan 04, 2023
Implementations of Machine Learning models, Regularizers, Optimizers and different Cost functions.

Linear Models Implementations of LinearRegression, LassoRegression and RidgeRegression with appropriate Regularizers and Optimizers. Linear Regression

Keivan Ipchi Hagh 1 Nov 22, 2021
MBTR is a python package for multivariate boosted tree regressors trained in parameter space.

MBTR is a python package for multivariate boosted tree regressors trained in parameter space.

SUPSI-DACD-ISAAC 61 Dec 19, 2022
机器学习检测webshell

ai-webshell-detect 机器学习检测webshell,利用textcnn+简单二分类网络,基于keras,花了七天 检测原理: 从文件熵 文件长度 文件语句提取出特征,然后文件熵与长度送入二分类网络,文件语句送入textcnn 项目原理,介绍,怎么做出来的

Huoji's 56 Dec 14, 2022
Sequence learning toolkit for Python

seqlearn seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and offer as similar as possible an API. Comp

Lars 653 Dec 27, 2022
MiniTorch - a diy teaching library for machine learning engineers

This repo is the full student code for minitorch. It is designed as a single repo that can be completed part by part following the guide book. It uses

1.1k Jan 07, 2023
A GitHub action that suggests type annotations for Python using machine learning.

Typilus: Suggest Python Type Annotations A GitHub action that suggests type annotations for Python using machine learning. This action makes suggestio

40 Sep 18, 2022
Open MLOps - A Production-focused Open-Source Machine Learning Framework

Open MLOps - A Production-focused Open-Source Machine Learning Framework Open MLOps is a set of open-source tools carefully chosen to ease user experi

Data Revenue 590 Dec 28, 2022
Pragmatic AI Labs 421 Dec 31, 2022
A Python Module That Uses ANN To Predict A Stocks Price And Also Provides Accurate Technical Analysis With Many High Potential Implementations!

Stox A Module to predict the "close price" for the next day and give "technical analysis". It uses a Neural Network and the LSTM algorithm to predict

Stox 31 Dec 16, 2022
pymc-learn: Practical Probabilistic Machine Learning in Python

pymc-learn: Practical Probabilistic Machine Learning in Python Contents: Github repo What is pymc-learn? Quick Install Quick Start Index What is pymc-

pymc-learn 196 Dec 07, 2022
💀mummify: a version control tool for machine learning

mummify is a version control tool for machine learning. It's simple, fast, and designed for model prototyping.

Max Humber 43 Jul 09, 2022
Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis.

sklearn-evaluation Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking, and Jupyter notebook analysis. Suppo

Eduardo Blancas 354 Dec 31, 2022
An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Ray provides a simple, universal API for building distributed applications. Ray is packaged with the following libraries for accelerating machine lear

23.3k Dec 31, 2022