Pandas and Dask test helper methods with beautiful error messages.

Last update: Nov 28, 2022

beavis

test helpers

These test helper methods are meant to be used in test suites. They provide descriptive error messages to allow for a seamless development workflow.

The test helpers are inspired by chispa and spark-fast-tests, popular test helper libraries for the Spark ecosystem.

Pandas ships its own testing methods, but their error messages are harder to parse. The following sections compare the built-in Pandas output with the Beavis output, so you can judge for yourself.

Column comparisons

The built-in assert_series_equal method prints both series as flat lists, which makes it hard to see which rows match and which differ. That slows down fixing a failing test and breaks your flow.

Here's the built-in error message when comparing series that are not equal.

import pandas as pd

df = pd.DataFrame({"col1": [1042, 2, 9, 6], "col2": [5, 2, 7, 6]})
pd.testing.assert_series_equal(df["col1"], df["col2"])

>   ???
E   AssertionError: Series are different
E
E   Series values are different (50.0 %)
E   [index]: [0, 1, 2, 3]
E   [left]:  [1042, 2, 9, 6]
E   [right]: [5, 2, 7, 6]

Here's the equivalent beavis call. Its error message aligns the rows and highlights the mismatches in red.

import beavis

beavis.assert_pd_column_equality(df, "col1", "col2")
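
To make "aligns rows and highlights the mismatches" concrete, here is a minimal pure-Python sketch of the idea. It is not how beavis is implemented; format_column_diff and assert_columns_equal are hypothetical helper names invented for this illustration, and the layout of the report is an assumption.

```python
RED = "\033[31m"   # ANSI escape code for red text
RESET = "\033[0m"  # ANSI escape code that restores the default color


def format_column_diff(left, right, use_color=False):
    """Return one report line per row, flagging rows whose values differ."""
    left, right = list(left), list(right)
    if len(left) != len(right):
        raise ValueError("columns must have the same length")
    lines = []
    for i, (a, b) in enumerate(zip(left, right)):
        same = a == b
        line = f"{i:>3} | {a!s:>6} | {b!s:>6} | {'ok' if same else 'MISMATCH'}"
        if use_color and not same:
            # Wrap mismatched rows in red so they stand out in a terminal.
            line = f"{RED}{line}{RESET}"
        lines.append(line)
    return lines


def assert_columns_equal(left, right):
    """Raise AssertionError with a row-aligned report if any row differs."""
    left, right = list(left), list(right)
    if left != right:
        report = "\n".join(format_column_diff(left, right))
        raise AssertionError("Columns are different\n" + report)


# Same values as col1 and col2 in the example above.
report = format_column_diff([1042, 2, 9, 6], [5, 2, 7, 6])
print("\n".join(report))
```

Printing one row per line puts each pair of values side by side, which is the property that makes the mismatched rows easy to spot.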

You can also compare columns in a Dask DataFrame.

import dask.dataframe as dd

ddf = dd.from_pandas(df, npartitions=2)
beavis.assert_dd_column_equality(ddf, "col1", "col2")

The assert_dd_column_equality error message is similarly descriptive.

DataFrame comparisons

The built-in pandas.testing.assert_frame_equal method doesn't output an error message that's easy to understand. Here's an example.

df1 = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df2 = pd.DataFrame({'col1': [5, 2], 'col2': [3, 4]})
pd.testing.assert_frame_equal(df1, df2)

E   AssertionError: DataFrame.iloc[:, 0] (column name="col1") are different
E
E   DataFrame.iloc[:, 0] (column name="col1") values are different (50.0 %)
E   [index]: [0, 1]
E   [left]:  [1, 2]
E   [right]: [5, 2]

beavis provides a nicer error message.

beavis.assert_pd_equality(df1, df2)
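
If you want to see which cells differ using only Pandas, DataFrame.compare is one option. The sketch below is independent of beavis and only illustrates the kind of information a readable error message needs; the variable names mismatched_rows and mismatched_columns are chosen for this example.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2], "col2": [3, 4]})
df2 = pd.DataFrame({"col1": [5, 2], "col2": [3, 4]})

# DataFrame.compare keeps only the rows and columns that contain a difference.
# Each differing column gets a "self" (df1) and an "other" (df2) sub-column.
diff = df1.compare(df2)
print(diff)

# Index labels of the rows that differ, and the names of the columns involved.
mismatched_rows = diff.index.tolist()
mismatched_columns = diff.columns.get_level_values(0).unique().tolist()
```

Note that compare requires both DataFrames to have identical row and column labels, so it is a diagnostic aid and not a replacement for an equality assertion.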

DataFrame comparison options:

check_index (default True)
check_dtype (default True)
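
To illustrate what relaxing these two checks means, here is a sketch that uses only built-in Pandas calls. It assumes that check_dtype and check_index behave roughly like "ignore dtypes" and "ignore index labels"; how beavis implements them internally is not shown here.

```python
import pandas as pd

df_int = pd.DataFrame({"col1": [1, 2]})
df_float = pd.DataFrame({"col1": [1.0, 2.0]})

# Same values but different dtypes (int64 vs float64): the strict check fails.
try:
    pd.testing.assert_frame_equal(df_int, df_float)
    strict_dtype_passed = True
except AssertionError:
    strict_dtype_passed = False

# With check_dtype=False only the values are compared, so this passes.
pd.testing.assert_frame_equal(df_int, df_float, check_dtype=False)

# Same values but different index labels: the strict check fails.
df_shifted = pd.DataFrame({"col1": [1, 2]}, index=[10, 11])
try:
    pd.testing.assert_frame_equal(df_int, df_shifted)
    strict_index_passed = True
except AssertionError:
    strict_index_passed = False

# Dropping the index on both sides is one way to ignore index labels.
pd.testing.assert_frame_equal(
    df_int.reset_index(drop=True), df_shifted.reset_index(drop=True)
)
```

Relaxing a check is useful when a transformation legitimately changes dtypes or reorders the index, but keeping the defaults catches more accidental changes.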

Let's convert the Pandas DataFrames to Dask DataFrames and use the assert_dd_equality function to check they're equal.

ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)
beavis.assert_dd_equality(ddf1, ddf2)

These DataFrames aren't equal, so we'll get a good error message that's easy to debug.

Development

Install Poetry and run poetry install to create a virtual environment with all the Beavis dependencies on your machine.

Other useful commands:

poetry run pytest tests runs the test suite
poetry run black . formats the code
poetry build packages the library in a wheel file
poetry publish releases the library to PyPI (requires valid credentials)

Owner

Matthew Powers