functional data manipulation for pandas

Overview

pandas-ply: functional data manipulation for pandas

pandas-ply is a thin layer which makes it easier to manipulate data with pandas. In particular, it provides elegant, functional, chainable syntax in cases where pandas would require mutation, saved intermediate values, or other awkward constructions. In this way, it aims to move pandas closer to the "grammar of data manipulation" provided by the dplyr package for R.

For example, take the dplyr code below:

flights %>%
  group_by(year, month, day) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 & dep > 30)

The most common way to express this in pandas is probably:

grouped_flights = flights.groupby(['year', 'month', 'day'])
output = pd.DataFrame()
output['arr'] = grouped_flights.arr_delay.mean()
output['dep'] = grouped_flights.dep_delay.mean()
filtered_output = output[(output.arr > 30) & (output.dep > 30)]

pandas-ply lets you instead write:

(flights
  .groupby(['year', 'month', 'day'])
  .ply_select(
    arr = X.arr_delay.mean(),
    dep = X.dep_delay.mean())
  .ply_where(X.arr > 30, X.dep > 30))

In our opinion, this pandas-ply code is cleaner, more expressive, more readable, more concise, and less error-prone than the original pandas code.

Explanatory notes on the pandas-ply code sample above:

  • pandas-ply's methods (like ply_select and ply_where above) are attached directly to pandas objects and can be used immediately, without any wrapping or redirection. They start with a ply_ prefix to distinguish them from built-in pandas methods.
  • pandas-ply's methods are named for (and modelled after) SQL's operators. (But keep in mind that these operators will not always appear in the same order as they do in a SQL statement: SELECT a FROM b WHERE c GROUP BY d probably maps to b.ply_where(c).groupby(d).ply_select(a).)
  • pandas-ply includes a simple system for building "symbolic expressions" to provide as arguments to its methods. X above is an instance of ply.symbolic.Symbol. Operations on this symbol produce larger compound symbolic expressions. When pandas-ply receives a symbolic expression as an argument, it converts it into a function. So, for instance, X.arr > 30 in the above code could have instead been provided as lambda x: x.arr > 30. Use of symbolic expressions allows the lambda x: to be left off, resulting in less cluttered code.
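
For instance, the filtering step of the example above can be written either way. In the small sketch below, summary is just a stand-in name for the result of the .ply_select(...) step in that example:

# Symbolic expressions, as in the example above:
summary.ply_where(X.arr > 30, X.dep > 30)

# The equivalent explicit-lambda form ('summary' is a placeholder name):
summary.ply_where(lambda x: x.arr > 30, lambda x: x.dep > 30)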

Warning

pandas-ply is new, and in an experimental stage of its development. The API is not yet stable. Expect the unexpected.

(Pull requests are welcome. Feel free to contact us at [email protected].)

Using pandas-ply

Install pandas-ply with:

$ pip install pandas-ply

Typical use of pandas-ply starts with:

import pandas as pd
from pandas_ply import install_ply, X, sym_call

install_ply(pd)

After calling install_ply, all pandas objects have pandas-ply's methods attached.
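
As a quick smoke test, here is a minimal sketch with a made-up DataFrame (the names df, g, v, and mean_v are placeholders, chosen to mirror the grouped example above):

# Toy data: two groups, three rows.
df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, 2.0, 5.0]})

result = (df
  .groupby('g')
  .ply_select(mean_v = X.v.mean())  # one aggregate column per group
  .ply_where(X.mean_v > 2))         # keep groups whose mean exceeds 2 (only 'b' here)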

API reference

Full API reference is available at http://pythonhosted.org/pandas-ply/.

Possible TODOs

  • Extend pandas' native groupby to support symbolic expressions?
  • Extend pandas' native apply to support symbolic expressions?
  • Add .ply_call to pandas objects to extend chainability?
  • Version of ply_select which supports later computed columns relying on earlier computed columns?
  • Version of ply_select which supports careful column ordering?
  • Better handling of indices?

License

Copyright 2015 Coursera Inc.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Comments
  • python3 support ?

    import pandas as pd
    from ply import install_ply, X, sym_call
    install_ply(pd)
    

    gives

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-1-f35c480251ef> in <module>()
          1 import pandas as pd
    ----> 2 from ply import install_ply, X, sym_call
          3 
          4 install_ply(pd)
    
    D:\result_tests\WinPython-64bit-3.4.2.3_build3\python-3.4.2.amd64\lib\site-packages\ply\__init__.py in <module>()
    ----> 1 from methods import install_ply
          2 from symbolic import X, sym_call
    
    ImportError: No module named 'methods'
    
    opened by stonebig 6
  • Continuous Integration

    Hello,

    maybe you should add CI to this project. Travis-CI can help. You might use miniconda to install Pandas on Travis.

    Here is an example https://github.com/scls19fr/pandas_confusion/blob/master/.travis.yml

    A much more complex .travis.yml file can be found at https://github.com/pydata/pandas/blob/master/.travis.yml; it shows how to define a build matrix (http://docs.travis-ci.com/user/customizing-the-build/#Build-Matrix).

    Kind regards

    opened by scls19fr 2
  • Outputs for README example don't match

    There doesn't seem to be a __version__ in the code, but I installed via pip semi-recently. The filtered_output and the pandas-ply output in the README don't match: the pandas-ply results are missing January. This is on Python 3.4.

    import pandas as pd
    from ply import install_ply, X
    install_ply(pd)
    
    %load_ext rpy2.ipython.rmagic
    from pandas.rpy import common as com
    %R library("nycflights13")
    flights = com.load_data("flights")
    
    grouped_flights = flights.groupby(['year', 'month', 'day'])
    output = pd.DataFrame()
    output['arr'] = grouped_flights.arr_delay.mean()
    output['dep'] = grouped_flights.arr_delay.mean()
    filtered_output = output[(output.arr > 30) & (output.dep > 30)]
    
    print(filtered_output)
    
    (flights
      .groupby(['year', 'month', 'day'])
      .ply_select(
        arr = X.arr_delay.mean(),
        dep = X.dep_delay.mean())
      .ply_where(X.arr > 30, X.dep > 30))
    

    Produces

    [42]: print(filtered_output)
                          arr        dep
    year month day                      
    2013 1     16   34.247362  34.247362
               31   32.602854  32.602854
         2     11   36.290094  36.290094
               27   31.252492  31.252492
         3     8    85.862155  85.862155
               18   41.291892  41.291892
         4     10   38.412311  38.412311
               12   36.048140  36.048140
               18   36.028481  36.028481
               19   47.911697  47.911697
               22   37.812166  37.812166
               25   33.681250  33.681250
         5     8    39.609183  39.609183
               23   61.970899  61.970899
         6     13   63.753689  63.753689
               18   37.648026  37.648026
               24   51.176808  51.176808
               25   41.513684  41.513684
               27   44.783296  44.783296
               28   44.976852  44.976852
               30   43.510278  43.510278
         7     1    58.280502  58.280502
               7    40.306378  40.306378
               9    31.334365  31.334365
               10   59.626478  59.626478
               22   62.763403  62.763403
               23   44.959821  44.959821
               28   49.831776  49.831776
         8     1    35.989259  35.989259
               8    55.481163  55.481163
               9    43.313641  43.313641
               28   35.203074  35.203074
         9     2    45.518430  45.518430
               12   58.912418  58.912418
         10    7    39.017260  39.017260
         12    5    51.666255  51.666255
               8    36.911801  36.911801
               9    42.575556  42.575556
               10   44.508796  44.508796
               14   46.397504  46.397504
               17   55.871856  55.871856
               23   32.226042  32.226042
    

    and

                          dep        arr
    year month day                      
    2013 2     11   39.073598  36.290094
               27   37.763274  31.252492
         3     8    83.536921  85.862155
               18   30.117960  41.291892
         4     10   33.023675  38.412311
               12   34.838428  36.048140
               18   34.915361  36.028481
               19   46.127828  47.911697
               22   30.642553  37.812166
         5     8    43.217778  39.609183
               23   51.144720  61.970899
         6     13   45.790828  63.753689
               18   35.950766  37.648026
               24   47.157418  51.176808
               25   43.063025  41.513684
               27   40.891232  44.783296
               28   48.827784  44.976852
               30   44.188179  43.510278
         7     1    56.233825  58.280502
               7    36.617450  40.306378
               9    30.711499  31.334365
               10   52.860702  59.626478
               22   46.667047  62.763403
               23   44.741685  44.959821
               28   37.710162  49.831776
         8     1    34.574034  35.989259
               8    43.349947  55.481163
               9    34.691898  43.313641
               28   40.526894  35.203074
         9     2    53.029551  45.518430
               12   49.958750  58.912418
         10    7    39.146710  39.017260
         12    5    52.327990  51.666255
               9    34.800221  42.575556
               17   40.705602  55.871856
               23   32.254149  32.226042
    
    opened by jseabold 2
  • pipe

    Hello,

    maybe you should mention the new pipe method http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pipe.html#pandas.DataFrame.pipe

    Kind regards

    opened by scls19fr 1
  • Main page example flights.groupby(['year', 'month', 'day']) using pandas only

    The main example in the readme could be rewritten using only pandas, as follows:

    import pandas
    flights = pandas.read_csv('~/downloads/flights.csv')
    df = (flights
          .groupby(['year', 'month', 'day'])
          .agg({'arr_delay': 'mean',
                'dep_delay': 'mean'})
          .query("arr_delay>30 & dep_delay>30")
         )
    

    Note I exported the flights data set from R with

    library(nycflights13)
    write.csv(flights, "~/downloads/flights.csv", row.names=FALSE)
    
    opened by paulrougieux 1
  • `ply_select` doesn't work for grouped mutate

    With dplyr, I often find myself using mutate to calculate an item-level value using a grouped aggregate. For example:

    flights %>%
      group_by(year) %>%
      mutate(mean_delay = mean(arr_delay),
             std_delay = sd(arr_delay),
             z_delay = (arr_delay - mean_delay)/std_delay)
    

    From the docs, I thought that the first step of the pandas-ply equivalent would be:

    (flights
      .groupby('year')
      .ply_select('*',
        mean_delay = X.arr_delay.mean(),
        std_delay = X.arr_delay.std())
    )
    

    But when I try this I get the following error:

    Traceback (most recent call last):
      File "<pyshell#17>", line 5, in <module>
        sd = X.arr_delay.std()))
    TypeError: _ply_select_for_groups() takes exactly 1 argument (4 given)
    

    The problem appears to be the '*' argument not working when ply_select operates on a group.

    opened by jkeirstead 0
  • Sample

    Hello,

    I'm trying your package, but it would be nice if the docs said where to find the flights sample dataframe. I have been looking inside the dplyr package http://cran.r-project.org/web/packages/dplyr/index.html but I wasn't able to find it. Thanks

    Kind regards

    opened by scls19fr 3
Releases (v0.2.1)