
Pandas Diff


Get differences between two pandas dataframes

Installation

Install pandas_diff with pip

pip install pandas_diff

Usage/Examples

import pandas_diff as pd_diff

import pandas as pd

# Create two example dataframes
df_infinity_war = pd.DataFrame([
                {"hero" : "hulk" , "power" : "strength"},
                {"hero" : "black_widow" , "power" : "spy"},
                {"hero" : "thor" , "hammers" : 0 },
                {"hero" : "thor" , "hammers" : 1 } ] )
df_endgame = pd.DataFrame([
                {"hero" : "hulk" , "power" : "smart"},
                {"hero" : "captain marvel" , "power" : "strength"},
                {"hero" : "thor" , "hammers" : 2 } ] )

# Get differences, using the key "hero"
df = pd_diff.get_diffs(df_infinity_war, df_endgame, "hero")

df

  operation object_keys  object_values                     object_json                     attribute_changed old_value new_value
0   create     [hero]    captain marvel  {'hero': 'captain marvel', 'power': 'strength'...           NaN           NaN      NaN
1   delete     [hero]       black_widow  {'hero': 'black_widow', 'power': 'spy', 'hamme...           NaN           NaN      NaN
2   modify     [hero]              thor  {'hero': 'thor', 'power': nan, 'hammers': 2.0}          hammers             1        2
3   modify     [hero]              hulk  {'hero': 'hulk', 'power': 'smart', 'hammers': ...         power      strength    smart
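
The result is a regular pandas DataFrame, so it can be filtered with ordinary pandas operations, for example to inspect only the modifications:

# Keep only the rows describing modifications and show what changed
modifications = df[df["operation"] == "modify"]
print(modifications[["object_values", "attribute_changed", "old_value", "new_value"]])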

Why pandas_diff? Use cases

Migrating from batch to an event-driven architecture

In my work, we run many data pipelines that pull information from external platforms (Active Directory, GitHub, Jira). Each run loads the new data, replacing the entire table.

With pandas_diff we detect how the infrastructure changes between executions and stream those change events into a Kafka cluster, so other teams can subscribe to the events they care about. By adding a pandas_diff step to the master pipeline, every item in our project has its life-cycle events tracked, as sketched below.
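
A minimal sketch of that pattern, assuming the kafka-python package and a broker reachable at localhost:9092 (both are assumptions, not part of pandas_diff):

import json

import pandas_diff as pd_diff
from kafka import KafkaProducer  # assumption: kafka-python is installed

# Assumption: a Kafka broker listens on localhost:9092
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event, default=str).encode("utf-8"),
)

def stream_changes(previous_df, current_df, key, topic="infrastructure-changes"):
    """Diff two pipeline executions and publish one event per detected change."""
    diffs = pd_diff.get_diffs(previous_df, current_df, key)
    for event in diffs.to_dict(orient="records"):
        producer.send(topic, value=event)
    producer.flush()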

Events log

For every item in a table, pandas_diff gives you an event log you can audit, listing how the resources are being consumed; see the sketch below.
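
For example, each run's diff can be appended to a timestamped history file (the CSV path and the detected_at column are illustrative choices, not part of the library):

import os
from datetime import datetime, timezone

import pandas_diff as pd_diff

def append_to_event_log(previous_df, current_df, key, log_path="events_log.csv"):
    """Append this run's changes to a CSV event log, tagged with the run timestamp."""
    diffs = pd_diff.get_diffs(previous_df, current_df, key)
    diffs["detected_at"] = datetime.now(timezone.utc).isoformat()
    # Write the header only the first time the log file is created
    diffs.to_csv(log_path, mode="a", header=not os.path.exists(log_path), index=False)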

Reconciliation of information

Reconcile one data source against the source of truth. E.g.: you have a CMDB holding information about your virtual machines. Since those VMs can be created in several different ways, you use pandas_diff to reconcile the actual state of the infrastructure against the CMDB, as in the sketch below.
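
A sketch with assumed data: the snapshots and the vm_name key column are illustrative; in practice they would come from your hypervisor API and your CMDB export.

import pandas as pd
import pandas_diff as pd_diff

# Toy snapshots standing in for real exports
real_vms = pd.DataFrame([
    {"vm_name": "web-01", "cpu": 4, "env": "prod"},
    {"vm_name": "db-01",  "cpu": 8, "env": "prod"},
])
cmdb_vms = pd.DataFrame([
    {"vm_name": "web-01", "cpu": 2, "env": "prod"},  # outdated CPU count
    {"vm_name": "old-01", "cpu": 1, "env": "prod"},  # VM that no longer exists
])

# "create" rows are VMs missing from the CMDB, "delete" rows are stale CMDB entries,
# and "modify" rows are attributes that have drifted
drift = pd_diff.get_diffs(cmdb_vms, real_vms, "vm_name")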

Disaster recovery environments

E.g.: you have a disaster recovery environment for your platform. You can keep the two platforms, production and disaster recovery, in sync by using their APIs together with pandas_diff to propagate changes (in objects, users, permissions) from production to disaster recovery, as in the sketch below.
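
A rough sketch of that loop; prod_users, dr_users, the username key and the dr_client wrapper are all hypothetical placeholders for your own API exports and client:

import pandas_diff as pd_diff

# prod_users and dr_users would be DataFrames exported from each platform's API
changes = pd_diff.get_diffs(dr_users, prod_users, "username")

for change in changes.to_dict(orient="records"):
    if change["operation"] == "create":
        dr_client.create_user(change["object_json"])
    elif change["operation"] == "delete":
        dr_client.delete_user(change["object_values"])
    elif change["operation"] == "modify":
        dr_client.update_user(change["object_values"],
                              {change["attribute_changed"]: change["new_value"]})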

Features

  • Multi-column keys
  • Column blacklist (columns to ignore)

Roadmap

  • Support for a standalone app

Documentation

Full documentation is available at the project's documentation site.