Aggregating gridded data (xarray) to polygons

Related tags

Data Analysis, xagg
Overview

xagg

Binder

A package to aggregate gridded data in xarray to polygons in geopandas using area-weighting from the relative area overlaps between pixels and polygons. Check out the binder link above for a sample code run!

Installation

The easiest way to install xagg is using pip. Beware though - xagg is still a work in progress; I suggest you install it to a virtual environment first (using e.g. venv, or just creating a separate environment in conda for projects using xagg).

pip install xagg

Intro

Science often happens on grids: gridded weather products, interpolated pollution data, nighttime lights, and remote sensing data all approximate the continuous real world for reasons of data resolution, processing time, or ease of calculation.

However, living things don't live on grids, and rarely play, act, or observe data on grids either. Instead, humans tend to work on the county, state, township, okrug, or city level; birds tend to fly along complex migratory corridors; and rain- and watersheds follow valleys and mountains.

So, whenever we need to work with both gridded and geographic data products, we need ways of getting them to match up. We may be interested, for example, in the average temperature over a county, or the average rainfall rate over a watershed.

Enter xagg.

xagg provides an easy-to-use (2 lines!), standardized way of aggregating raster data to polygons. All you need is some gridded data in an xarray Dataset or DataArray and some polygon data in a geopandas GeoDataFrame. Both of these are easy to use for the purposes of xagg - for example, all you need to use a shapefile is to open it:

import xarray as xr
import geopandas as gpd
 
# Gridded data file (netcdf/climate data)
ds = xr.open_dataset('file.nc')

# Shapefile
gdf = gpd.read_file('file.shp')

xagg will then figure out the geographic grid (lat/lon) in ds, create polygons for each pixel, and calculate the intersections between every polygon in the shapefile and every pixel. For each polygon in the shapefile, the relative area of each overlapping pixel is calculated - so, for example, if a polygon (say, a US county) is the size and shape of a grid pixel but is split halfway between two pixels, the weight for each pixel will be 0.5, and the value of the gridded variable on that polygon will just be the average of the two [TO-DO: add visual example of this].
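The area-weighting itself boils down to a weighted average over the overlapping pixels. Here is a minimal numpy sketch of the halfway-split example above (an illustration of the idea only, not xagg's internal code):

```python
import numpy as np

# Toy example: a polygon split evenly between two pixels.
# overlap_frac[i] is the relative area of the polygon covered by pixel i.
overlap_frac = np.array([0.5, 0.5])
pixel_values = np.array([10.0, 20.0])  # e.g. temperature in each pixel

# Area-weighted aggregate over the polygon
poly_value = np.sum(pixel_values * overlap_frac) / np.sum(overlap_frac)
print(poly_value)  # 15.0 - the average of the two pixels
```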

The two lines mentioned before?

import xagg as xa

# Get overlap between pixels and polygons
weightmap = xa.pixel_overlaps(ds,gdf)

# Aggregate data in [ds] onto polygons
aggregated = xa.aggregate(ds,weightmap)

# aggregated can now be converted into an xarray dataset (using aggregated.to_dataset()), 
# or a geopandas geodataframe (using aggregated.to_dataframe()), or directly exported 
# to netcdf, csv, or shp files using aggregated.to_csv()/.to_netcdf()/.to_shp()

Researchers often need to weight data by more than just its relative area overlap with a polygon (for example, you may want to give more-populated pixels a larger weight). xagg has built-in support for adding an additional weight grid (another xarray DataArray) in xagg.pixel_overlaps().
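Conceptually, the extra weight grid just multiplies the area weights before averaging. A toy numpy sketch with made-up numbers (not xagg's actual implementation):

```python
import numpy as np

# Relative area overlaps between a polygon and two pixels
overlap_frac = np.array([0.5, 0.5])
# Hypothetical extra weight grid, e.g. population per pixel
population = np.array([1000.0, 250.0])
pixel_values = np.array([10.0, 20.0])

# Combined weight = area overlap * extra weight
w = overlap_frac * population
poly_value = np.sum(pixel_values * w) / np.sum(w)
print(poly_value)  # 12.0 - pulled toward the more populous pixel
```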

Finally, xagg allows for direct exporting of the aggregated data in several commonly used data formats (please open issues if you'd like support for something else!):

  • netcdf
  • csv for STATA, R
  • shp for QGIS, further spatial processing

Best of all, xagg is flexible. Multiple variables in your dataset? xagg will aggregate them all, as long as they have at least lat/lon dimensions. Fields in your shapefile that you'd like to keep? xagg keeps all fields (for example FIPS codes from county datasets) all the way through the final export. Weird dimension names? xagg recognizes all the versions of "lat", "Latitude", "Y", "nav_lat", "Latitude_1", etc. that the author has run into over years of working with climate data; and this list is easily expandable through a keyword argument if needed.
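If your coordinate names are too exotic even for that list, you can always rename them yourself with plain xarray before calling xagg (the variable and coordinate names below are made up for illustration):

```python
import numpy as np
import xarray as xr

# A tiny dataset with nonstandard coordinate names
ds = xr.Dataset(
    {"tas": (("Y", "X"), np.zeros((3, 4)))},
    coords={"Y": np.linspace(-1.0, 1.0, 3), "X": np.linspace(0.0, 3.0, 4)},
)

# Rename to the lat/lon convention xagg expects
ds = ds.rename({"Y": "lat", "X": "lon"})
print(sorted(ds.dims))  # ['lat', 'lon']
```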

Please contribute - let me know what works and what doesn't, whether you think this is useful, and if so - please share!

Use Cases

Climate econometrics

Many climate econometrics studies use societal data (mortality, crop yields, etc.) at a political or administrative level (for example, counties) but climate and weather data on grids. Oftentimes, further weighting by population or agricultural density is needed.

Area-weighting of pixels onto polygons ensures that aggregating weather and climate data onto polygons occurs in a robust way. Consider a (somewhat contrived) example: an administrative region sits in relatively flat lowlands, but a pixel that slightly overlaps the polygon primarily covers a wholly different climate (mountainous, desert, etc.). A simple mask would weight that pixel equally, even though its information is not necessarily relevant to the climate of the region. Population-weighting may not always be sufficient either; consider Los Angeles, which has multiple significantly different climates, all with high population densities.

xagg allows simple population- and area-weighted averaging, in addition to export functions that turn the aggregated data into output easily used in STATA or R for further calculations.

Left to do

  • Testing, bug fixes, stability checks, etc.
  • Share widely! I hope this will be helpful to a wide group of natural and social scientists who have to work with both gridded and polygon data!

Comments
  • Speedup for large grids - mod gdf_pixels in create_raster_polgons

    In create_raster_polygons, the for loop that assigns individual polygons to gdf_pixels essentially renders xagg unusable for larger high res grids because it goes so slow. Here I propose elimination of the for loop and replacement with a lambda apply. Big improvement for large grids!

    opened by kerriegeil 10
  • dot product implementation

    Starting this pull request. This is code that implements a dot-product approach for doing the aggregation. See #2

    This works for my application but I have not run the tests on this yet.

    opened by jsadler2 9
  • work for one geometry?

    I ran into IndexError: single positional indexer is out-of-bounds (Traceback below)

    I have a dataset with one variable over CONUS and I'm trying to weight to one geom e.g. a county.

I'll try to make a reproducible example

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-83-5cd8fd54cbfc> in <module>
          1 weightmap = xa.pixel_overlaps(ds, gdf, subset_bbox=True)
    ----> 2 aggregated = xa.aggregate(ds, weightmap)
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xagg/core.py in aggregate(ds, wm)
        434                 #   the grid have just nan values for this variable
        435                 # in both cases; the "aggregated variable" is just a vector of nans.
    --> 436                 if not np.isnan(wm.agg.iloc[poly_idx,:].pix_idxs).all():
        437                     # Get the dimensions of the variable that aren't "loc" (location)
        438                     other_dims = [k for k in np.atleast_1d(ds[var].dims) if k != 'loc']
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in __getitem__(self, key)
        887                     # AttributeError for IntervalTree get_value
        888                     return self.obj._get_value(*key, takeable=self._takeable)
    --> 889             return self._getitem_tuple(key)
        890         else:
        891             # we by definition only have the 0th axis
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
       1448     def _getitem_tuple(self, tup: Tuple):
       1449 
    -> 1450         self._has_valid_tuple(tup)
       1451         with suppress(IndexingError):
       1452             return self._getitem_lowerdim(tup)
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
        721         for i, k in enumerate(key):
        722             try:
    --> 723                 self._validate_key(k, i)
        724             except ValueError as err:
        725                 raise ValueError(
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
       1356             return
       1357         elif is_integer(key):
    -> 1358             self._validate_integer(key, axis)
       1359         elif isinstance(key, tuple):
       1360             # a tuple should already have been caught by this point
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_integer(self, key, axis)
       1442         len_axis = len(self.obj._get_axis(axis))
       1443         if key >= len_axis or key < -len_axis:
    -> 1444             raise IndexError("single positional indexer is out-of-bounds")
       1445 
       1446     # -------------------------------------------------------------------
    
    IndexError: single positional indexer is out-of-bounds
    
    opened by raybellwaves 5
  • dot product implementation for overlaps breaks xagg for high res grids

I'm finding that the implementation of the dot product for computing weighted averages in core.py/aggregate eats up way too much memory for high res grids. It's the wm.overlap_da that requires way too much memory. I am unable to allocate enough memory to make it through core.py/aggregate for many of the datasets I'm processing on an HPC system. I had no issue with the previous aggregate function before commit 4c5cc6503efde05153181e15bc5f7fe6bb92bd07. Looks like the dot product method is a lot cleaner in the code, but is there another benefit?

    opened by kerriegeil 3
  • work with xr.DataArray's

    In providing an xr.DataArray to xa.pixel_overlaps(da, gdf) you get the Traceback below.

    Couple of ideas for fixes:

    • in the code parse it to an xr.Dataset
    • Don't use .keys() but use .dims() instead
    AttributeError                            Traceback (most recent call last)
    <ipython-input-74-f5cd39618cec> in <module>
    ----> 1 weightmap = xa.pixel_overlaps(da, gdf)
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xagg/wrappers.py in pixel_overlaps(ds, gdf_in, weights, weights_target, subset_bbox)
         58     print('creating polygons for each pixel...')
         59     if subset_bbox:
    ---> 60         pix_agg = create_raster_polygons(ds,subset_bbox=gdf_in,weights=weights)
         61     else:
         62         pix_agg = create_raster_polygons(ds,subset_bbox=None,weights=weights)
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xagg/core.py in create_raster_polygons(ds, mask, subset_bbox, weights, weights_target)
        148     # Standardize inputs
        149     ds = fix_ds(ds)
    --> 150     ds = get_bnds(ds)
        151     #breakpoint()
        152     # Subset by shapefile bounding box, if desired
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xagg/aux.py in get_bnds(ds, edges, wrap_around_thresh)
        196         # to [0,360], but it's not tested yet.
        197 
    --> 198     if ('lat' not in ds.keys()) | ('lon' not in ds.keys()):
        199         raise KeyError('"lat"/"lon" not found in [ds]. Make sure the '+
        200                        'geographic dimensions follow this naming convention.')
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xarray/core/common.py in __getattr__(self, name)
        237                 with suppress(KeyError):
        238                     return source[name]
    --> 239         raise AttributeError(
        240             "{!r} object has no attribute {!r}".format(type(self).__name__, name)
        241         )
    
    AttributeError: 'DataArray' object has no attribute 'keys'
    
    opened by raybellwaves 2
  • fix export to dataset issue, insert export tests

.to_dataset() was not working due to too many layers of lists in the agg.agg geodataframe. This issue has been fixed by replacing an index with np.squeeze() instead. The broader problem may be that there are too many unnecessary layers of lists in the agg.agg geodataframe, which should be simplified in the next round of backend cleanup.

    Furthermore, there are now tests for .to_dataset() and .to_dataframe()

    opened by ks905383 1
  • speed improvement for high res grids in create_raster_polygons

    Hi there, I'm a first timer when it comes to contributing to someone else's repo so please let me know if I need to fix/change anything. I've got a handful of small changes that greatly impact the speed of xagg when using high resolution grids. Planning to submit one at a time when I have the time to spend on it. It may take me a while...

    This first one removes the hard coded 0.1 degree buffer for selecting a subset bounding box in create_raster_polygons. For high res grids this will select a much larger area than desired. The solution I propose is to change the 0.1 degree buffer to twice the max grid spacing.

    opened by kerriegeil 1
  • Rename aux for windows

    As aux is a protected filename on Windows I could not install the package, and not even clone the repo without renaming the file first. This is a fix inspired by NeuralEnsemble/PyNN#678.

    opened by Hugovdberg 1
  • use aggregated.to_dataset().to_dataframe() within aggregated.to_dataframe()

    When dealing with time data aggregated.to_dataframe() will return columns as data_var0, data_var1.

    xarray has a method to convert to a dataframe http://xarray.pydata.org/en/stable/generated/xarray.DataArray.to_dataframe.html which moves coords to an multiindex.

    You would just have to add in the geometry and crs from the incoming geopandas to make it a geopandas dataframe.

    opened by raybellwaves 1
  • return geometry key in aggregated.to_dataframe()

    When doing aggregated.to_dataframe() it drops the geometry column that is in the original geopandas.DataFrame.

    It would be nice if it was returned to be used for things such as visualization.

    Code:

    import geopandas as gpd
    import pooch
    import xagg as xa
    import xarray as xr
    import hvplot.pandas
    
    # Load in example data and shapefiles
    ds = xr.tutorial.open_dataset("air_temperature").isel(time=0)
    file = pooch.retrieve(
        "https://pubs.usgs.gov/of/2006/1187/basemaps/continents/continents.zip", None
    )
    continents = gpd.read_file("zip://" + file)
    continents
    
    wm = xa.pixel_overlaps(ds, continents)
    aggregated = xa.aggregate(ds, wm)
    aggregated.to_dataframe()
    
    pd.merge(aggregated.to_dataframe(), continents, on="CONTINENT").hvplot(c="air")
    
    opened by raybellwaves 1
  • fix index error if input gdf has own index [issue #8]

    xa.get_pixel_overlaps() creates a poly_idx column in the gdf that takes as its value the index of the input gdf. However, if there is a pre-existing index, this can lead to bad behavior, since poly_idx is used as an .iloc indexer in the gdf. This update instead makes poly_idx np.arange(0,len(gdf)), which will avoid this indexing issue (and hopefully not cause any more? I figured there would've been a reason I used the existing index if not a new one... fingers crossed).

    opened by ks905383 1
  • silence functions

    Hi, thank you a lot for the great package.

    I was wondering if it is possible to add an argument to the functions (pixel_overlaps and aggregate) to silence them if we want? I am doing aggregations for many geometries and sometimes it becomes too crowded, especially if I try to print other things along while the functions are executed.

    Thanks !

    opened by khalilT 0
  • Mistaken use of ds.var() in `core.py`?

    In core.py, there are a few loops of the form: for var in ds.var():.

    This tries to compute a variance across all dimensions, for each variable. Is that the intention? I think you just mean for var in ds:.

    Note that if any variables are of a type for which var cannot be computed (e.g., timedelta64[ns]) then aggregate fails.

    opened by jrising 3
  • Odd errors from using pixel_overlaps with a weights option

    This issue is sort of three issues that I encountered while trying to solve a problem. Fixes to any of these would work for me.

    I'm trying to use xagg with some fairly large files including a weights file, and I was getting an error during the regridding process:

    >>> weightmap = xa.pixel_overlaps(ds_tas, gdf_regions, weights=ds_pop.Population, subset_bbox=False)
    creating polygons for each pixel...
    lat/lon bounds not found in dataset; they will be created.
    regridding weights to data grid...
    Create weight file: bilinear_1800x3600_1080x2160.nc
    zsh: illegal hardware instruction  python
    

    (at which point, python crashes)

    I decided to do the regridding myself and save the result. Here are what the data file (ds_tas) and weights file (ds_pop) look like:

    >>> ds_tas
    <xarray.Dataset>
    Dimensions:      (band: 12, x: 2160, y: 1080)
    Coordinates:
      * band         (band) int64 1 2 3 4 5 6 7 8 9 10 11 12
      * x            (x) float64 -179.9 -179.8 -179.6 -179.4 ... 179.6 179.7 179.9
      * y            (y) float64 89.92 89.75 89.58 89.42 ... -89.58 -89.75 -89.92
        spatial_ref  int64 ...
    Data variables:
        band_data    (band, y, x) float32 ...
    
    >>> ds_pop
    <xarray.Dataset>
    Dimensions:     (longitude: 2160, latitude: 1080)
    Coordinates:
      * longitude   (longitude) float64 -179.9 -179.8 -179.6 ... 179.6 179.8 179.9
      * latitude    (latitude) float64 89.92 89.75 89.58 ... -89.58 -89.75 -89.92
    Data variables:
        crs         int32 ...
        Population  (latitude, longitude) float32 ...
    Attributes:
        Conventions:  CF-1.4
        created_by:   R, packages ncdf4 and raster (version 3.4-13)
        date:         2022-02-05 22:14:16
    

    The dimensions line up exactly. But xagg still wanted to regrid my weights file. My guess is that this is because the dimensions are labeled differently (and so an np.allclose fails because taking a difference between the coordinates results in a 2-D matrix).

    So I relabeled my coordinates and dimensions. This results in a new error:

    >>> weightmap = xa.pixel_overlaps(ds_tas, gdf_regions, weights=ds_pop.Population, subset_bbox=False)
    creating polygons for each pixel...
    lat/lon bounds not found in dataset; they will be created.
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xagg/wrappers.py", line 50, in pixel_overlaps
        pix_agg = create_raster_polygons(ds,subset_bbox=None,weights=weights)
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xagg/core.py", line 127, in create_raster_polygons
        ds = get_bnds(ds)
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xagg/aux.py", line 190, in get_bnds
        bnds_tmp[1:,:] = xr.concat([ds[var]-0.5*ds[var].diff(var),
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xarray/core/_typed_ops.py", line 209, in __sub__
        return self._binary_op(other, operator.sub)
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xarray/core/dataarray.py", line 3081, in _binary_op
        self, other = align(self, other, join=align_type, copy=False)
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xarray/core/alignment.py", line 349, in align
        f"arguments without labels along dimension {dim!r} cannot be "
    ValueError: arguments without labels along dimension 'lat' cannot be aligned because they have different dimension sizes: {1080, 1079}
    

    To be clear, neither of my datasets has a dimension of size 1079.

    opened by jrising 11
  • weightmap (pixel_overlaps) warnings and errors

    Hi, thanks for the helpful package.

    On a Windows machine, I'm using the package successfully on ERA5 reanalysis data although I do get a user warning when calling pixel_overlaps. It occurs after the output "calculating overlaps between pixels and output polygons...". The warning is:

    "/home/kgeil/miniconda3/envs/xagg/lib/python3.9/site-packages/xagg/core.py:308: UserWarning: keep_geom_type=True in overlay resulted in 1 dropped geometries of different geometry types than df1 has. Set keep_geom_type=False to retain all geometries overlaps = gpd.overlay(gdf_in.to_crs(epsg_set),'

    When I try generating a weight map with the exact same shapefile but on AVHRR NDVI data instead I get a full error at the same-ish location:

    "ValueError: GeoDataFrame does not support setting the geometry column where the column name is shared by multiple columns."

    It looks like something is going wrong in get_pixel_overlaps around line 323 overlap_info...

    I've tried rewriting the NDVI netcdf to be as identical as possible as the ERA5 file (same coord and dim names, etc) and both files are epsg:4326.

    Any ideas how to get past this error?

    opened by kerriegeil 3
  • Trying to install outside target directory

    Get this error when trying to install with pip on windows. Have tried to install from pypi, github, and zip. Same error in each instance. I've tried with a base python install using virtual env and with conda. ERROR: The zip file (C:\Users\profile\Downloads\xagg-main.zip) has a file (C:\Users\khafen\AppData\Local\Temp\3\pip-req-build-cok0yin6\xagg/aux.py) trying to install outside target directory (C:\Users\profile\AppData\Local\Temp\3\pip-req-build-cok0yin6)

    opened by konradhafen 2
  • add to_geodataframe

    Closes https://github.com/ks905383/xagg/issues/17

    Open to feedback here.

    I believe https://github.com/ks905383/xagg/blob/main/xagg/classes.py#L62 should say geopandas.GeoDataFrame

    but I was thinking to_dataframe could return a pandas dataframe (no geometry and no crs). and to_geodataframe returns the geomety and crs

    opened by raybellwaves 2
Releases(v0.3.0.2)
  • v0.3.0.2(Apr 10, 2022)

    Bug fixes

    • .to_dataset() functions again
    • .read_wm() is now loaded by default

    What's Changed

    • fix export to dataset issue, insert export tests by @ks905383 in https://github.com/ks905383/xagg/pull/35
    • add read_wm() to init by @ks905383 in https://github.com/ks905383/xagg/pull/36

    Full Changelog: https://github.com/ks905383/xagg/compare/v0.3.0.1...v0.3.0.2

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0.1(Apr 7, 2022)

    Fixes dependency error in setup.py that was preventing publication of v0.3* on conda-forge.

    Full Changelog: https://github.com/ks905383/xagg/compare/v0.3.0...v0.3.0.1

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Apr 2, 2022)

    Performance upgrades

    Various performance upgrades, particularly for working with high resolution grids.

    In create_raster_polygons:

    • replacing the for-loop assigning pixels to polygons with a lambda apply
    • creating flexible buffer for subsetting to bounding box, replacing the hardcoded 0.1 degrees used previously with twice the max grid spacing

    In aggregate:

    • an optional replacement of the aggregating calculation with a dot-product implementation (impl='dot_product' in pixel_overlaps() and aggregate()), which may improve performance in certain situations

    Expanded functionality

    Weightmaps can now be saved using wm.to_file() and loaded using xagg.core.read_wm(), and no longer have to be regenerated with each code run.

    Bug fixes

    Various bug fixes

    What's Changed

    • speed improvement for high res grids in create_raster_polygons by @kerriegeil in https://github.com/ks905383/xagg/pull/29
    • dot product implementation by @jsadler2 in https://github.com/ks905383/xagg/pull/4
    • Speedup for large grids - mod gdf_pixels in create_raster_polgons by @kerriegeil in https://github.com/ks905383/xagg/pull/30
    • implement making dot product optional, restoring default agg behavior by @ks905383 in https://github.com/ks905383/xagg/pull/32
    • Implement a way to save weightmaps (output from pixel_overlaps) by @ks905383 in https://github.com/ks905383/xagg/pull/33

    New Contributors

    • @kerriegeil made their first contribution in https://github.com/ks905383/xagg/pull/29
    • @jsadler2 made their first contribution in https://github.com/ks905383/xagg/pull/4

    Full Changelog: https://github.com/ks905383/xagg/compare/v0.2.6...v0.3.0

    Source code(tar.gz)
    Source code(zip)
  • v0.2.6(Jan 26, 2022)

    Bug fixes:

    • #11 pixel_overlaps no longer changes the gdf_in outside of the function

    Functionality tweaks

    • added agg.to_geodataframe(), similar to agg.to_dataframe(), but keeping the geometries from the original shapefile
    • adapted xarray's ds.to_dataframe() in agg.to_dataframe(), which has better functionality
    • .csvs now export long instead of wide, using the output from ds.to_dataframe() above
    Source code(tar.gz)
    Source code(zip)
  • v0.2.5(Jul 24, 2021)

  • v0.2.4(May 14, 2021)

Owner
Kevin Schwarzwald
Researching climate variability + impacts by profession, urban expansion by studies, and transit/land use policy by interest. Moonlight as rock violinist.