Monitor the stability of a pandas or spark dataframe ⚙︎

Overview

Population Shift Monitoring

Build status Package docs status Latest GitHub release GitHub Release Date PyPi downloads

POPMON logo

popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

popmon creates histograms of features binned in time-slices, and compares the stability of the profiles and distributions of those histograms using statistical tests, both over time and with respect to a reference. It works with numerical, ordinal, categorical features, and the histograms can be higher-dimensional, e.g. it can also track correlations between any two features. popmon can automatically flag and alert on changes observed over time, such as trends, shifts, peaks, outliers, anomalies, changing correlations, etc, using monitoring business rules.

Traffic Light Overview

Announcements

Spark 3.0

With Spark 3.0, based on Scala 2.12, make sure to pick up the correct histogrammar jar files:

spark = SparkSession.builder.config(
    "spark.jars.packages",
    "io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20",
).getOrCreate()

For Spark 2.X compiled against scala 2.11, in the string above simply replace 2.12 with 2.11.

Examples

Documentation

The entire popmon documentation including tutorials can be found at read-the-docs.

Notebooks

Tutorial Colab link
Basic tutorial Open in Colab
Detailed example (featuring configuration, Apache Spark and more) Open in Colab
Incremental datasets (online analysis) Open in Colab
Report interpretation (step-by-step guide) Open in Colab

Check it out

The popmon library requires Python 3.6+ and is pip friendly. To get started, simply do:

$ pip install popmon

or check out the code from our GitHub repository:

$ git clone https://github.com/ing-bank/popmon.git
$ pip install -e popmon

where in this example the code is installed in edit mode (option -e).

You can now use the package in Python with:

import popmon

Congratulations, you are now ready to use the popmon library!

Quick run

As a quick example, you can do:

import pandas as pd
import popmon
from popmon import resources

# open synthetic data
df = pd.read_csv(resources.data("test.csv.gz"), parse_dates=["date"])
df.head()

# generate stability report using automatic binning of all encountered features
# (importing popmon automatically adds this functionality to a dataframe)
report = df.pm_stability_report(time_axis="date", features=["date:age", "date:gender"])

# to show the output of the report in a Jupyter notebook you can simply run:
report

# or save the report to file
report.to_file("monitoring_report.html")

To specify your own binning specifications and features you want to report on, you do:

# time-axis specifications alone; all other features are auto-binned.
report = df.pm_stability_report(
    time_axis="date", time_width="1w", time_offset="2020-1-6"
)

# histogram selections. Here 'date' is the first axis of each histogram.
features = [
    "date:isActive",
    "date:age",
    "date:eyeColor",
    "date:gender",
    "date:latitude",
    "date:longitude",
    "date:isActive:age",
]

# Specify your own binning specifications for individual features or combinations thereof.
# This bin specification uses open-ended ("sparse") histograms; unspecified features get
# auto-binned. The time-axis binning, when specified here, needs to be in nanoseconds.
bin_specs = {
    "longitude": {"bin_width": 5.0, "bin_offset": 0.0},
    "latitude": {"bin_width": 5.0, "bin_offset": 0.0},
    "age": {"bin_width": 10.0, "bin_offset": 0.0},
    "date": {
        "bin_width": pd.Timedelta("4w").value,
        "bin_offset": pd.Timestamp("2015-1-1").value,
    },
}

# generate stability report
report = df.pm_stability_report(features=features, bin_specs=bin_specs, time_axis=True)

These examples also work with spark dataframes. You can see the output of such example notebook code here. For all available examples, please see the tutorials at read-the-docs.

Pipelines for monitoring dataset shift

Advanced users can leverage popmon's modular data pipeline to customize their workflow. Visualization of the pipeline can be useful when debugging, or for didactic purposes. There is a script included with the package that you can use. The plotting is configurable, and depending on the options you will obtain a result that can be used for understanding the data flow, the high-level components and the (re)use of datasets.

Pipeline Visualization

Example pipeline visualization (click to enlarge)

Resources

Presentations

Title Host Date Speaker
Popmon - population monitoring made easy Big Data Technology Warsaw Summit 2021 February 25, 2021 Simon Brugman
Popmon - population monitoring made easy Data Lunch @ Eneco October 29, 2020 Max Baak, Simon Brugman
Popmon - population monitoring made easy Data Science Summit 2020 October 16, 2020 Max Baak
Population Shift Monitoring Made Easy: the popmon package Online Data Science Meetup @ ING WBAA July 8 2020 Tomas Sostak
Popmon: Population Shift Monitoring Made Easy PyData Fest Amsterdam 2020 June 16, 2020 Tomas Sostak
Popmon: Population Shift Monitoring Made Easy Amundsen Community Meetup June 4, 2020 Max Baak

Articles

Title Date Author
Population Shift Analysis: Monitoring Data Quality with Popmon May 21, 2021 Vito Gentile
Popmon Open Source Package — Population Shift Monitoring Made Easy May 20, 2020 Nicole Mpozika

Project contributors

This package was authored by ING Wholesale Banking Advanced Analytics. Special thanks to the following people who have contributed to the development of this package: Ahmet Erdem, Fabian Jansen, Nanne Aben, Mathieu Grimal.

Contact and support

Please note that ING WBAA provides support only on a best-effort basis.

License

Copyright ING WBAA. popmon is completely free, open-source and licensed under the MIT license.

Comments
  • feat: hist_juxtaposition

    feat: hist_juxtaposition

    For now, the last_n is by default set to 2. Therefore, only two dates would appear in the dropdown. For the airline dataset if the last_n is set to max, popmon runs into the issue (for DEPARTURE feature) raised by Tomek https://github.com/ing-bank/popmon/issues/244.

    Screenshot 2022-08-15 at 19 39 24

    closes ing-bank/popmon#230

    enhancement 
    opened by pradyot-09 7
  • Error when stitching histograms

    Error when stitching histograms

    Discussed in https://github.com/ing-bank/popmon/discussions/142

    Originally posted by jeaninejuliettes September 29, 2021 Hello,

    I'm receiving an error when using stitch_histogram and I'm not sure what I'm doing wrong, hope anyone can help me. The error I get is: ValueError: Request to insert delta hists but time_bin_idx not set. Please do.

    The steps I take:

    I start with creating a histogrammar object of the original dataframe

    hists = df.pm_make_histograms() bin_specs = popmon.get_bin_specs(hists)

    later on I receive a new batch of data, which I add to my existing histograms

    new_hists = [new_df.pm_make_histograms(bin_specs=bin_specs)] hists_2= popmon.stitch_histograms(hists_basis=hists, hists_delta=new_hists, time_axis="batch")

    so far so good, but when I try to repeat these steps with yet another new batch of data, I receive the error

    new_hists_2 = [new_df_2.pm_make_histograms(bin_specs=bin_specs)] hists_3 = popmon.stitch_histograms(hists_basis=hists_2, hists_delta=new_hists_2, time_axis="batch")

    Is it not possible to stitch another histogram again? If not, I've found a bit of a cumbersome way to decide on what a good value for my time_bin_idx is. It works so far, but I'm expecting it too fail with other data (or not to work as expected). The way I define the time_bin_idx value is: int(np.ceil(max(hists_2[next(iter(hists_2))].bin_centers()) + 1))

    Hopefully you can point me in the right direction. Thanks!

    opened by mbaak 6
  • Error: cannot import name 'Report' from 'popmon.config'

    Error: cannot import name 'Report' from 'popmon.config'

    Code:

    import popmon from popmon import resources from popmon.config import Report

    Got error: ImportError Traceback (most recent call last) /tmp/ipykernel_707/1841834346.py in 3 import popmon 4 from popmon import resources ----> 5 from popmon.config import Report, Setting

    ImportError: cannot import name 'Report' from 'popmon.config' (/home/user/.local/lib/python3.7/site-packages/popmon/config.py)

    opened by lcheng61 4
  • Error with pydantic when using some custom settings in the report generation

    Error with pydantic when using some custom settings in the report generation

    With version 1.0.0, when using custom settings in df.pm_stability_report() like show_stats, I get an error stating such option is not allowed:

    ValidationError: 2 validation errors for Settings

    I couldn't reproduce it when using popmon==0.9.0.

    opened by gus-morales 3
  • DataProfiler - A Scalable Data Profiling Library

    DataProfiler - A Scalable Data Profiling Library

    Howdy!

    I'm reaching out as a maintainer of the DataProfiler library.

    I think it might be useful to your project so I'm reaching out! Would love to collaborate and see how we can help popmon.

    We effectively wrote a library to improve upon the objectives of pandas-profiling with some neat added functionality:

    • Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL data = Data("your_filepath_or_url.csv")
    • Profile data: calculating statistics and doing entity detection (for PII) profile = Profiler(data)
    • Merge profiles: profile3 = profile1 + profile2; enabling distributed profile generation
    • Compare profiles: profile_diff = profile1.diff(profile2)
    • Generate reports: readable_report = profile.report(report_options={"output_format": "compact"})
    import json
    from dataprofiler import Data, Profiler
    
    data = Data("your_file.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL
    
    print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame
    
    profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc
    
    readable_report = profile.report(report_options={"output_format": "compact"})
    
    print(json.dumps(readable_report, indent=4))
    
    opened by lettergram 3
  • Library doesn't run in Spark 3.0+: Replace the dependency of histogrammar with native Spark functionality

    Library doesn't run in Spark 3.0+: Replace the dependency of histogrammar with native Spark functionality

    Currently, the dependency with the library, which hasn't been further developed since 2016, creates a dependency with Scala 2.11 which limits the execution in Spark 3.0 (which was only built on Scala 2.12). I think I could replace the functionality with Bucketizer functionality in native spark.

    opened by kedemdor 3
  • Imports Optimized

    Imports Optimized

    isort helps you to sort your import list. It simply optimized the script and increases the readability.

    There is no big change in the concept. Algorithms are still working as well.

    I'm contributing for Hacktoberfest. I will appreciate it if you add the "Hacktoberfest" label to this PR. :) Thanks.

    opened by lnxpy 3
  • missing tutorial datasets

    missing tutorial datasets

    Hi, awesome tool!

    Advanced tutorial datasets are not in test_data dir, but still in notebooks dir, as far as I can see. Hence the advanced tutorial notebooks don't run out of the box, at least for me. I don't have permissions to push to a develop branch.

    Changes to be committed: (use "git reset HEAD ..." to unstage)

    renamed:    popmon/notebooks/flight_delays.csv.gz -> popmon/test_data/flight_delays.csv.gz
    renamed:    popmon/notebooks/flight_delays_reference.csv.gz -> popmon/test_data/flight_delays_reference.csv.gz
    

    Cheers - Alex

    opened by AlexKoutsman 3
  • build(deps): update docutils requirement from <0.17 to <0.20

    build(deps): update docutils requirement from <0.17 to <0.20

    Updates the requirements on docutils to permit the latest version.

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies 
    opened by dependabot[bot] 2
  • Feat/plotly express

    Feat/plotly express

    • The histograms, heatmaps and comparisons have been replaced with interactive Plotly graphs. Plotly.js is used to build the graphs on the go from json.
    • Initial tests show that plotly reports are smaller in size compared to matplotlib and the takes way less time for report generation compared to matplotlib.
    • use parameter 'online_report' to use plotly.js from cdn server and use report online. Else, plotly.js is embedded in the report and can be used offline too.

    newplot

    newplot (3)

    newplot (2)

    closes ing-bank/popmon#164

    enhancement reporting 
    opened by pradyot-09 2
  • Feature/heatmap time series plotting

    Feature/heatmap time series plotting

    Added heatmap feature to visualize features over time for EDA.

    image

    Screenshot 2022-05-09 at 15 03 53

    User can set Heatmap color map by giving the 'cmap' argument pm_stability report(). User can set top_n argument to deal with high cardinality. User can disable specific heatmap by giving heatmap name in the disable_heatmap[] argument in the pm_stability_report().

    closes ing-bank/popmon#185 closes ing-bank/popmon#199

    opened by pradyot-09 2
  • Rolling reference comparisons

    Rolling reference comparisons

    A wide variety of references is provided by popmon out-of-the-box. A reference may be static (a fixed training set, or the current dataset itself for exploratory data analysis) or dynamic (sliding or growing as more data becomes available). The reference is compared against batches, and they can be sequential (batched) or sliding (rolling).

    Popmon should enable all combinations, and currently lacks external reference + rolling comparison.

    | | Reference | Compare to | Implemented | |---|---|---|---| | Self-reference | Static | Self (batched) | ✓ | | External reference | Static | Batched | ✓ | | Rolling reference | Rolling | Rolling/sliding | ✓ | | Expanding reference | Expanding | Rolling/sliding | ✓ | | External reference | Static | Rolling/sliding | ✗ |

    Thanks to @LorenaPoenaru!

    enhancement 
    opened by sbrugman 0
  • code coverage of 100%

    code coverage of 100%

    The risk of breaking functionality on introducing new features could be reduced by ensuring that each line of code is covered by the tests and that this is enforced at test time. Other repos, such as this also use this.

    For that, we can include pytest-cov to the test dependencies and increase the test coverage until it passes (see this annswer).

    good first issue help wanted CI internal improvement 
    opened by sbrugman 0
  • Traffic light boundaries for count variables

    Traffic light boundaries for count variables

    The traffic light bounds provided by the pull/Z-score calculation are symmetrical. For count variables this can lead to bounds outside the constraints (below zero).

    enhancement statistics 
    opened by sbrugman 0
  • Reject unsupported column types

    Reject unsupported column types

    Running popmon on a DataFrame with columns containing mutable sequences, tuples or sets generates cryptic errors. popmon should return an error message.

    enhancement good first issue API 
    opened by sbrugman 0
Releases(v1.4.0)
Owner
ING Bank
ING Open-source projects
ING Bank
Analysiscsv.py for extracting analysis and exporting as CSV

wcc_analysis Lichess page documentation: https://lichess.org/page/world-championships Each WCC has a study, studies are fetched using: https://lichess

32 Apr 25, 2022
A Python adaption of Augur to prioritize cell types in perturbation analysis.

A Python adaption of Augur to prioritize cell types in perturbation analysis.

Theis Lab 2 Mar 29, 2022
Zipline, a Pythonic Algorithmic Trading Library

Zipline is a Pythonic algorithmic trading library. It is an event-driven system for backtesting. Zipline is currently used in production as the backte

Quantopian, Inc. 15.7k Jan 07, 2023
Desafio proposto pela IGTI em seu bootcamp de Cloud Data Engineer

Desafio Modulo 4 - Cloud Data Engineer Bootcamp - IGTI Objetivos Criar infraestrutura como código Utuilizando um cluster Kubernetes na Azure Ingestão

Otacilio Filho 4 Jan 23, 2022
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
This module is used to create Convolutional AutoEncoders for Variational Data Assimilation

VarDACAE This module is used to create Convolutional AutoEncoders for Variational Data Assimilation. A user can define, create and train an AE for Dat

Julian Mack 23 Dec 16, 2022
This repo contains a simple but effective tool made using python which can be used for quality control in statistical approach.

This repo contains a powerful tool made using python which is used to visualize, analyse and finally assess the quality of the product depending upon the given observations

SasiVatsal 8 Oct 18, 2022
BigDL - Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems

Evaluate the performance of BigDL (Distributed Deep Learning on Apache Spark) in big data analysis problems.

Vo Cong Thanh 1 Jan 06, 2022
Implementation in Python of the reliability measures such as Omega.

OmegaPy Summary Simple implementation in Python of the reliability measures: Omega Total, Omega Hierarchical and Omega Hierarchical Total. Name Link O

Rafael Valero Fernández 2 Apr 27, 2022
BasstatPL is a package for performing different tabulations and calculations for descriptive statistics.

BasstatPL is a package for performing different tabulations and calculations for descriptive statistics. It provides: Frequency table constr

Angel Chavez 1 Oct 31, 2021
A forecasting system dedicated to smart city data

smart-city-predictions System prognostyczny dedykowany dla danych inteligentnych miast Praca inżynierska realizowana przez Michała Stawikowskiego and

Kevin Lai 1 Nov 08, 2021
sportsdataverse python package

sportsdataverse-py See CHANGELOG.md for details. The goal of sportsdataverse-py is to provide the community with a python package for working with spo

Saiem Gilani 37 Dec 27, 2022
PyStan, a Python interface to Stan, a platform for statistical modeling. Documentation: https://pystan.readthedocs.io

PyStan PyStan is a Python interface to Stan, a package for Bayesian inference. Stan® is a state-of-the-art platform for statistical modeling and high-

Stan 229 Dec 29, 2022
wikirepo is a Python package that provides a framework to easily source and leverage standardized Wikidata information

Python based Wikidata framework for easy dataframe extraction wikirepo is a Python package that provides a framework to easily source and leverage sta

Andrew Tavis McAllister 35 Jan 04, 2023
Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather

Tuplex 791 Jan 04, 2023
Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

2 Nov 20, 2021
Learn machine learning the fun way, with Oracle and RedBull Racing

Red Bull Racing Analytics Hands-On Labs Introduction Are you interested in learning machine learning (ML)? How about doing this in the context of the

Oracle DevRel 55 Oct 24, 2022
INF42 - Topological Data Analysis

TDA INF421(Conception et analyse d'algorithmes) Projet : Topological Data Analysis SphereMin Etant donné un nuage des points, ce programme contient de

2 Jan 07, 2022
Repository created with LinkedIn profile analysis project done

EN/en Repository created with LinkedIn profile analysis project done. The datase

Mayara Canaver 4 Aug 06, 2022
Pip install minimal-pandas-api-for-polars

Minimal Pandas API for Polars Install From PyPI: pip install minimal-pandas-api-for-polars Example Usage (see tests/test_minimal_pandas_api_for_polars

Austin Ray 6 Oct 16, 2022