Data exploration done quick.

Overview

Pandas Tab

Implementation of Stata's tabulate command in Pandas for extremely easy to type one-way and two-way tabulations.

Support:

  • Python 3.7 and 3.8: Pandas >=0.23.x
  • Python 3.9: Pandas >=1.0.x

Background & Purpose

As someone who made the move from Stata to Python, one thing I noticed is that I end up doing fewer tabulations of my data when working in Pandas. I believe that the reason for this has a lot to do with API differences that make it slightly less convenient to run tabulations extremely quickly.

For example, if you want to look at values counts in column "foo", in Stata it's merely tab foo. In Pandas, it's df["foo"].value_counts(). This is over twice the amount of typing.

It's not just a brevity issue. If you want to add one more column and to go from one-way to two-way tabulation (e.g. look at "foo" and "bar" together), this isn't as simple as adding one more column:

  • df[["foo", "bar"]].value_counts().unstack() requires one additional transformation to move away from a multi-indexed series.
  • pd.crosstab(df["foo"], df["bar"]) is a totally different interface from the one-way tabulation.

Pandas Tab attempts to solve these issues by creating an interface more similar to Stata: df.tab("foo") and df.tab("foo", "bar") give you, respectively, your one-way and two-way tabulations.

Example

# using IPython integration:
# ! pip install pandas-tab[full]
# ! pandas_tab init

import pandas as pd

df = pd.DataFrame({
    "foo":  ["a", "a", "b", "a", "b", "c", "a"],
    "bar":  [4,   5,   7,   6,   7,   7,   5],
    "fizz": [12,  63,  23,  36,  21,  28,  42]
})

# One-way tabulation
df.tab("foo")

# Two-way tabulation
df.tab("foo", "bar")

# One-way with aggregation
df.tab("foo", values="fizz", aggfunc=pd.Series.mean)

# Two-way with aggregation
df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean)

Outputs:

>> # Two-way tabulation >>> df.tab("foo", "bar") bar 4 5 6 7 foo a 1 2 1 0 b 0 0 0 2 c 0 0 0 1 >>> # One-way with aggregation >>> df.tab("foo", values="fizz", aggfunc=pd.Series.mean) mean foo a 38.25 b 22.00 c 28.00 >>> # Two-way with aggregation >>> df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean) bar 4 5 6 7 foo a 12.0 52.5 36.0 NaN b NaN NaN NaN 22.0 c NaN NaN NaN 28.0 ">
>>> # One-way tabulation
>>> df.tab("foo")

     size  percent
foo               
a       4    57.14
b       2    28.57
c       1    14.29

>>> # Two-way tabulation
>>> df.tab("foo", "bar")

bar  4  5  6  7
foo            
a    1  2  1  0
b    0  0  0  2
c    0  0  0  1

>>> # One-way with aggregation
>>> df.tab("foo", values="fizz", aggfunc=pd.Series.mean)

      mean
foo       
a    38.25
b    22.00
c    28.00

>>> # Two-way with aggregation
>>> df.tab("foo", "bar", values="fizz", aggfunc=pd.Series.mean)

bar     4     5     6     7
foo                        
a    12.0  52.5  36.0   NaN
b     NaN   NaN   NaN  22.0
c     NaN   NaN   NaN  28.0

Setup

Full Installation (IPython / Jupyter Integration)

The full installation includes a CLI that adds a startup script to IPython:

pip install pandas-tab[full]

Then, to enable the IPython / Jupyter startup script:

pandas_tab init

You can quickly remove the startup script as well:

pandas_tab delete

More on the startup script in the section IPython / Jupyter Integration.

Simple installation:

If you don't want the startup script, you don't need the extra dependencies. Simply install with:

pip install pandas-tab

IPython / Jupyter Integration

The startup script auto-loads pandas_tab each time you load up a new IPython kernel (i.e. each time you fire up or restart your Jupyter Notebook).

You can run the startup script in your terminal with pandas_tab init.

Without the startup script:

# WITHOUT STARTUP SCRIPT
import pandas as pd
import pandas_tab

df = pd.read_csv("foo.csv")
df.tab("x", "y")

Once you install the startup script, you don't need to do import pandas_tab:

# WITH PANDAS_TAB STARTUP SCRIPT INSTALLED
import pandas as pd

df = pd.read_csv("foo.csv")
df.tab("x", "y")

The IPython startup script is convenient, but there are some downsides to using and relying on it:

  • It needs to load Pandas in the background each time the kernel starts up. For typical data science workflows, this should not be a problem, but you may not want this if your workflows ever avoid Pandas.
  • The IPython integration relies on hidden state that is environment-dependent. People collaborating with you may be unable to replicate your Jupyter notebooks if there are any df.tab()'s in there and you don't import pandas_tab manually.

For that reason, I recommend the IPython integration for solo exploratory analysis, but for collaboration you should still import pandas_tab in your notebook.

Limitations / Known Issues

  • No tests or guarantees for 3+ way cross tabulations. Both pd.crosstab and pd.Series.value_counts support multi-indexing, however this behavior is not yet tested for pandas_tab.
  • Behavior for dropna kwarg mimics pd.crosstab (drops blank columns), not pd.value_counts (include NaN/None in the index), even for one-way tabulations.
  • No automatic hook into Pandas; you must import pandas_tab in your code to register the extensions. Pandas does not currently search entry points for extensions, other than for plotting backends, so it's not clear that there's a clean way around this.
  • Does not mimic Stata's behavior of taking unambiguous abbreviations of column names, and there is no option to turn this on/off.
  • Pandas 0.x is incompatible with Numpy 1.20.x. If using Pandas 0.x, you need Numpy 1.19.x.
  • (Add more stuff here?)
Owner
W.D.
memes
W.D.
Gathering data of likes on Tinder within the past 7 days

tinder_likes_data Gathering data of Likes Sent on Tinder within the past 7 days. Versions November 25th, 2021 - Functionality to get the name and age

Alex Carter 12 Jan 05, 2023
Sample code for Harry's Airflow online trainng course

Sample code for Harry's Airflow online trainng course You can find the videos on youtube or bilibili. I am working on adding below things: the slide p

102 Dec 30, 2022
ETL flow framework based on Yaml configs in Python

ETL framework based on Yaml configs in Python A light framework for creating data streams. Setting up streams through configuration in the Yaml file.

Павел Максимов 18 Jul 06, 2022
Stitch together Nanopore tiled amplicon data without polishing a reference

Stitch together Nanopore tiled amplicon data using a reference guided approach Tiled amplicon data, like those produced from primers designed with pri

Amanda Warr 14 Aug 30, 2022
A notebook to analyze Amazon Recommendation Review Dataset.

Amazon Recommendation Review Dataset Analyzer A notebook to analyze Amazon Recommendation Review Dataset. Features Calculates distinct user count, dis

isleki 3 Aug 22, 2022
Uses MIT/MEDSL, New York Times, and US Census datasources to analyze per-county COVID-19 deaths.

Covid County Executive summary Setup Install miniconda, then in the command line, run conda create -n covid-county conda activate covid-county conda i

Ahmed Fasih 1 Dec 22, 2021
LynxKite: a complete graph data science platform for very large graphs and other datasets.

LynxKite is a complete graph data science platform for very large graphs and other datasets. It seamlessly combines the benefits of a friendly graphical interface and a powerful Python API.

124 Dec 14, 2022
Helper tools to construct probability distributions built from expert elicited data for use in monte carlo simulations.

Elicited Helper tools to construct probability distributions built from expert elicited data for use in monte carlo simulations. Credit to Brett Hoove

Ryan McGeehan 3 Nov 04, 2022
Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly

Table of contents Introduction Dataset Model & Metrics How to Run Quickstart Install Training Evaluation Detection DATA COMPETITION The COVID-19 pande

Thanh Dat Vu 1 Feb 27, 2022
A real data analysis and modeling project - restaurant inspections

A real data analysis and modeling project - restaurant inspections Jafar Pourbemany 9/27/2021 This project represents data analysis and modeling of re

Jafar Pourbemany 2 Aug 21, 2022
Python script to automate the plotting and analysis of percentage depth dose and dose profile simulations in TOPAS.

topas-create-graphs A script to automatically plot the results of a topas simulation Works for percentage depth dose (pdd) and dose profiles (dp). Dep

Sebastian Schäfer 10 Dec 08, 2022
Data Intelligence Applications - Online Product Advertising and Pricing with Context Generation

Data Intelligence Applications - Online Product Advertising and Pricing with Context Generation Overview Consider the scenario in which advertisement

Manuel Bressan 2 Nov 18, 2021
This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

Ishan Hegde 1 Nov 17, 2021
An extension to pandas dataframes describe function.

pandas_summary An extension to pandas dataframes describe function. The module contains DataFrameSummary object that extend describe() with: propertie

Mourad 450 Dec 30, 2022
A python package which can be pip installed to perform statistics and visualize binomial and gaussian distributions of the dataset

GBiStat package A python package to assist programmers with data analysis. This package could be used to plot : Binomial Distribution of the dataset p

Rishikesh S 4 Oct 17, 2022
Used for data processing in machine learning, and help us to construct ML model more easily from scratch

Used for data processing in machine learning, and help us to construct ML model more easily from scratch. Can be used in linear model, logistic regression model, and decision tree.

ShawnWang 0 Jul 05, 2022
Wafer Fault Detection - Wafer circleci with python

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

Avnish Yadav 14 Nov 21, 2022
Calculate multilateral price indices in Python (with Pandas and PySpark).

IndexNumCalc Calculate multilateral price indices using the GEKS-T (CCDI), Time Product Dummy (TPD), Time Dummy Hedonic (TDH), Geary-Khamis (GK) metho

Dr. Usman Kayani 3 Apr 27, 2022
Python for Data Analysis, 2nd Edition

Python for Data Analysis, 2nd Edition Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media Buy

Wes McKinney 18.6k Jan 08, 2023
Processo de ETL (extração, transformação, carregamento) realizado pela equipe no projeto final do curso da Soul Code Academy.

Processo de ETL (extração, transformação, carregamento) realizado pela equipe no projeto final do curso da Soul Code Academy.

Débora Mendes de Azevedo 1 Feb 03, 2022