Conduits - A Declarative Pipelining Tool For Pandas

Last update: Nov 21, 2021

Traditional tools for declaring pipelines in Python suck. They are mostly imperative, and can sometimes require that you adhere to strong contracts in order to use them (looking at you, Scikit Learn pipelines). The production pipeline is also usually written completely differently from the way it was developed during the ideation phase, requiring a significant rewrite to get it to work in the new paradigm.

Modelled off the declarative pipeline of Flask, Conduits aims to give you a nicer, simpler, and more flexible way of declaring your data processing pipelines.

Installation

pip install conduits

Quickstart

import pandas as pd
from conduits import Pipeline

##########################
## Pipeline Declaration ##
##########################

pipeline = Pipeline()


@pipeline.step(dependencies=["first_step"])
def second_step(data):
    return data + 1


@pipeline.step()
def first_step(data):
    return data ** 2


###############
## Execution ##
###############

df = pd.DataFrame({"X": [1, 2, 3], "Y": [10, 20, 30]})

output = pipeline.fit_transform(df)
assert output.X.sum() != 29  # Addition before square => False!
assert output.X.sum() == 17  # Square before addition => True!
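
The two assertions follow from plain arithmetic on column X. The sketch below reproduces both orderings with pandas alone, without Conduits, so the numbers can be checked by hand. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 2, 3], "Y": [10, 20, 30]})

# Order honoured by the dependency declaration: first_step, then second_step.
square_then_add = (df ** 2) + 1   # X becomes [2, 5, 10]

# Order in which the functions appear in the file: second_step, then first_step.
add_then_square = (df + 1) ** 2   # X becomes [4, 9, 16]

print(square_then_add.X.sum())  # 17
print(add_then_square.X.sum())  # 29
```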

Usage Guide

Declarations

Your pipeline is defined using a standard decorator syntax. You can wrap your pipeline steps using the decorator:

@pipeline.step()
def transformer(df):
    return df + 1

The decorated function should accept a pandas dataframe or pandas series and return a pandas dataframe or pandas series. Arbitrary inputs and outputs are currently unsupported.

If your transformer is stateful, you can optionally supply the function with fit and transform boolean arguments. They will be set as True when the appropriate method is called.

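import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler


@pipeline.step()
def stateful(data: pd.DataFrame, fit: bool, transform: bool):
    if fit:
        # Learn the scaling parameters and persist them outside the pipeline.
        scaler = StandardScaler()
        scaler.fit(data)
        joblib.dump(scaler, "scaler.joblib")
        return data

    if transform:
        # joblib.load takes only the path of the file to read.
        scaler = joblib.load("scaler.joblib")
        # scaler.transform returns a NumPy array, so wrap it back into a dataframe.
        return pd.DataFrame(scaler.transform(data), index=data.index, columns=data.columns)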
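
The README does not describe how Conduits decides whether to pass fit and transform to a step. One plausible mechanism, sketched here with a hypothetical `call_step` helper that is not part of the Conduits API, is to inspect the step's signature and forward only the flags it declares. Plain lists stand in for dataframes to keep the sketch dependency-free.

```python
import inspect


def call_step(func, data, *, fit, transform):
    """Call `func`, forwarding fit/transform only if its signature names them."""
    params = inspect.signature(func).parameters
    flags = {"fit": fit, "transform": transform}
    wanted = {name: value for name, value in flags.items() if name in params}
    return func(data, **wanted)


def add_one(data):
    # Stateless step: declares no flags, so it receives none.
    return [x + 1 for x in data]


state = {}  # stands in for whatever external storage a stateful step uses


def centre(data, fit, transform):
    # Stateful step: learns the mean when fitting, subtracts it when transforming.
    if fit:
        state["mean"] = sum(data) / len(data)
    if transform:
        return [x - state["mean"] for x in data]
    return data


fitted = call_step(centre, [1, 2, 3], fit=True, transform=False)
centred = call_step(centre, [4, 5, 6], fit=False, transform=True)
bumped = call_step(add_one, [1, 2, 3], fit=True, transform=False)
```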

You should not serialise the pipeline object itself. The pipeline is simply a declaration and shouldn't maintain any state. You should manage your pipeline DAG definition versions using a tool like Git. You will receive an error if you try to serialise the pipeline.

If there are any dependencies between your pipeline steps, you may specify these in your decorator and they will be run prior to this step being run in the pipeline. If a step has no dependencies specified it will be assumed that it can be run at any point.

@pipeline.step(dependencies=["add_feature_X", "add_feature_Y"])
def combine_X_with_Y(df):
    return df.X + df.Y
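
Running every dependency before the step that needs it amounts to a topological ordering of the declared steps. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+). It is an illustration under that assumption, not Conduits' own implementation, and the `declared` mapping and `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical mapping of step name -> names of the steps it depends on,
# mirroring the dependencies=[...] argument of the decorator.
declared = {
    "combine_X_with_Y": ["add_feature_X", "add_feature_Y"],
    "add_feature_X": [],
    "add_feature_Y": [],
}


def execution_order(dependencies):
    """Return one valid order in which every step follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(declared)
print(order)  # combine_X_with_Y always comes last; the other two may swap
```

Steps with no dependencies can land anywhere the constraints allow, which matches the statement that such a step can be run at any point. A cyclic declaration makes `static_order` raise `graphlib.CycleError`.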

API

Conduits attempts to mimic the Scikit Learn API as closely as possible. Your defined pipelines have the standard methods:

pipeline.fit(df)
out = pipeline.transform(df)
out = pipeline.fit_transform(df)

Note that for the current release you can only supply pandas dataframes or series objects. It will not accept numpy arrays.
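
If your data starts out as a numpy array, one workaround is to wrap it in a dataframe before handing it to the pipeline. The sketch below shows only that conversion, and the column names are assumptions chosen to match the earlier examples.

```python
import numpy as np
import pandas as pd

array = np.array([[1, 10], [2, 20], [3, 30]])

# Name the columns explicitly so steps can refer to df.X and df.Y.
df = pd.DataFrame(array, columns=["X", "Y"])

# df can now be passed to pipeline.fit(df), pipeline.transform(df)
# or pipeline.fit_transform(df) in place of the raw array.
```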

Tests

To run the test suite, first install the dependencies listed in dev.requirements.txt. It covers all the core dependencies used in testing and packaging. Once they are installed, you can run the tests via the make target:

make tests

The tests rely on pytest-regressions to test some functionality. If you make a change you can refresh the regression targets with:

make regressions

Owner

Kale Miller
