icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

Last update: Dec 09, 2022

Related tags

Overview

icepickle

It's a cooler way to store simple linear models.

The goal of icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models. Not only is this much safer, but it also allows for an interesting finetuning pattern that does not require a GPU.

Installation

You can install everything with pip:

python -m pip install icepickle

Usage

Let's say that you've gotten a linear model from scikit-learn trained on a dataset.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

clf = LogisticRegression()
clf.fit(X, y)

Then you could use a pickle to save the model.

from joblib import dump, load

# You can save the classifier.
dump(clf, 'classifier.joblib')

# You can load it too.
clf_reloaded = load('classifier.joblib')

But this is unsafe. The scikit-learn documentations even warns about the security concerns and compatibility issues. The goal of this package is to offer a safe alternative to pickling for simple linear models. The coefficients will be saved in a .h5 file and can be loaded into a new regression model later.

from icepickle.linear_model import save_coefficients, load_coefficients

# You can save the classifier.
save_coefficients(clf, 'classifier.h5')

# You can create a new model, with new hyperparams.
clf_reloaded = LogisticRegression()

# Load the previously trained weights in.
load_coefficients(clf_reloaded, 'classifier.h5')

This is a lot safer and there's plenty of use-cases that could be handled this way.

There's a cool finetuning-trick we can do now too!

Finetuning

Assuming that you use a stateless featurizer in your pipeline, such as HashingVectorizer or language models from whatlies, you choose to pre-train your scikit-learn model beforehand and fine-tune it later using models that offer the .partial_fit()-api. If you're unfamiliar with this api, you might appreciate this course on calmcode.

This library also comes with utilities that makes it easier to finetune systems via the .partial_fit() API. In particular we offer partial pipeline components via the icepickle.pipeline submodule.

import pandas as pd
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.feature_extraction.text import HashingVectorizer

from icepickle.linear_model import save_coefficients, load_coefficients
from icepickle.pipeline import make_partial_pipeline

url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']

# Train a pre-trained model.
pretrained = LogisticRegression()
pipe = make_partial_pipeline(HashingVectorizer(), pretrained)
pipe.fit(X, y)

# Save the coefficients, safely.
save_coefficients(pretrained, 'pretrained.h5')

# Create a new model using pre-trained weights.
finetuned = SGDClassifier()
load_coefficients(finetuned, 'pretrained.h5')
new_pipe = make_partial_pipeline(HashingVectorizer(), finetuned)

# This new model can be used for fine-tuning.
for i in range(10):
    # Inside this for-loop you could consider doing data-augmentation.
    new_pipe.partial_fit(X, y)

Supported Pipeline Parts

The following pipeline components are added.

from icepickle.pipeline import (
    PartialPipeline,
    PartialFeatureUnion,
    make_partial_pipeline,
    make_partial_union,
)

These tools allow you to declare pipelines that support .partial_fit. Note that components used in these pipelines all need to have .partial_fit() implemented.

Supported Scikit-Learn Models

We unit test against the following models in our save_coefficients and load_coefficients functions.

from sklearn.linear_model import (
    SGDClassifier,
    SGDRegressor,
    LinearRegression,
    LogisticRegression,
    PassiveAggressiveClassifier,
    PassiveAggressiveRegressor,
)

icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

Related tags

Overview

icepickle

Installation

Usage

Finetuning

Owner

vincent d warmerdam

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

MLOps pipeline project using Amazon SageMaker Pipelines

Code base of KU AIRS: SPARK Autonomous Vehicle Team

Apple-voice-recognition - Machine Learning

BentoML is a flexible, high-performance framework for serving, managing, and deploying machine learning models.

Empyrial is a Python-based open-source quantitative investment library dedicated to financial institutions and retail investors

nn-Meter is a novel and efficient system to accurately predict the inference latency of DNN models on diverse edge devices

Timeseries analysis for neuroscience data

Dieses Projekt ermöglicht es den Smartmeter der EVN (Netz Niederösterreich) über die Kundenschnittstelle auszulesen.

Code Repository for Machine Learning with PyTorch and Scikit-Learn

neurodsp is a collection of approaches for applying digital signal processing to neural time series

Python package for machine learning for healthcare using a OMOP common data model

Hypernets: A General Automated Machine Learning framework to simplify the development of End-to-end AutoML toolkits in specific domains.

Python-based implementations of algorithms for learning on imbalanced data.

Stats, linear algebra and einops for xarray

XAI - An eXplainability toolbox for machine learning

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

PROTEIN EXPRESSION ANALYSIS FOR DOWN SYNDROME

OptaPy is an AI constraint solver for Python to optimize planning and scheduling problems.

Greykite: A flexible, intuitive and fast forecasting library

icepickle is to allow a safe way to serialize and deserialize linear scikit-learn models

Related tags

Overview

icepickle

Installation

Usage

Finetuning

Owner

vincent d warmerdam

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

MLOps pipeline project using Amazon SageMaker Pipelines

Code base of KU AIRS: SPARK Autonomous Vehicle Team

Apple-voice-recognition - Machine Learning

BentoML is a flexible, high-performance framework for serving, managing, and deploying machine learning models.

Empyrial is a Python-based open-source quantitative investment library dedicated to financial institutions and retail investors

nn-Meter is a novel and efficient system to accurately predict the inference latency of DNN models on diverse edge devices

Timeseries analysis for neuroscience data

Dieses Projekt ermöglicht es den Smartmeter der EVN (Netz Niederösterreich) über die Kundenschnittstelle auszulesen.

Code Repository for Machine Learning with PyTorch and Scikit-Learn

neurodsp is a collection of approaches for applying digital signal processing to neural time series

Python package for machine learning for healthcare using a OMOP common data model

Hypernets: A General Automated Machine Learning framework to simplify the development of End-to-end AutoML toolkits in specific domains.

Python-based implementations of algorithms for learning on imbalanced data.

Stats, linear algebra and einops for xarray

XAI - An eXplainability toolbox for machine learning

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

PROTEIN EXPRESSION ANALYSIS FOR DOWN SYNDROME

OptaPy is an AI constraint solver for Python to optimize planning and scheduling problems.

﻿Greykite: A flexible, intuitive and fast forecasting library

Greykite: A flexible, intuitive and fast forecasting library