Aggregating gridded data (xarray) to polygons

Related tags: Data Analysis, xagg
Overview

xagg

Binder

A package to aggregate gridded data in xarray onto polygons in geopandas, using area weighting based on the relative area overlaps between pixels and polygons. Check out the Binder link above for a sample code run!

Installation

The easiest way to install xagg is using pip. Beware though - xagg is still a work in progress; I suggest you install it into a virtual environment first (using e.g. venv, or a separate conda environment for projects using xagg).

pip install xagg

Intro

Science often happens on grids - gridded weather products, interpolated pollution data, nighttime lights, and remote sensing products all approximate the continuous real world for reasons of data resolution, processing time, or ease of calculation.

However, living things don't live on grids, and rarely play, act, or observe data on grids either. Instead, humans tend to work on the county, state, township, okrug, or city level; birds tend to fly along complex migratory corridors; and rain- and watersheds follow valleys and mountains.

So, whenever we need to work with both gridded and geographic data products, we need ways of getting them to match up. We may be interested, for example, in the average temperature over a county, or the average rainfall rate over a watershed.

Enter xagg.

xagg provides an easy-to-use (2 lines!), standardized way of aggregating raster data to polygons. All you need is some gridded data in an xarray Dataset or DataArray and some polygon data in a geopandas GeoDataFrame. Both of these are easy to use for the purposes of xagg - for example, all you need to use a shapefile is to open it:

import xarray as xr
import geopandas as gpd
 
# Gridded data file (netcdf/climate data)
ds = xr.open_dataset('file.nc')

# Shapefile
gdf = gpd.read_file('file.shp')

xagg will then figure out the geographic grid (lat/lon) in ds, create a polygon for each pixel, and compute the intersection between every polygon in the shapefile and every pixel. For each polygon in the shapefile, the relative area of each overlapping pixel is calculated - so, for example, if a polygon (say, a US county) is the size and shape of a grid pixel but is split halfway between two pixels, each pixel gets a weight of 0.5, and the value of the gridded variables on that polygon is just the average of the two [TO-DO: add visual example of this].
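
To make the weighting concrete, here is a minimal sketch (with made-up numbers, not xagg code) of the area-weighted average for the half-and-half county example above:

import numpy as np

# Hypothetical: a polygon split evenly between two pixels
pixel_values = np.array([10.0, 20.0])   # e.g., temperature in each pixel
overlap_areas = np.array([0.5, 0.5])    # relative area of the polygon in each pixel

weights = overlap_areas / overlap_areas.sum()
polygon_value = (weights * pixel_values).sum()  # 15.0, the area-weighted average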

The two lines mentioned before?

import xagg as xa

# Get overlap between pixels and polygons
weightmap = xa.pixel_overlaps(ds,gdf)

# Aggregate data in [ds] onto polygons
aggregated = xa.aggregate(ds,weightmap)

# aggregated can now be converted into an xarray dataset (using aggregated.to_dataset()), 
# or a geopandas geodataframe (using aggregated.to_dataframe()), or directly exported 
# to netcdf, csv, or shp files using aggregated.to_csv()/.to_netcdf()/.to_shp()

Researchers often need to weight their data by more than just its relative area overlap with a polygon (for example, you may want to give more weight to pixels with a larger population). xagg has built-in support for adding an additional weight grid (another xarray DataArray) through xagg.pixel_overlaps().
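
For example, a minimal sketch of adding a population weight grid (here, a hypothetical DataArray pop on its own lat/lon grid):

import xagg as xa

# pop is an xarray DataArray of population; xagg combines it with the
# area weights when computing the polygon averages
weightmap = xa.pixel_overlaps(ds, gdf, weights=pop)
aggregated = xa.aggregate(ds, weightmap)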

Finally, xagg allows for direct exporting of the aggregated data in several commonly used data formats (please open issues if you'd like support for something else!):

  • netcdf
  • csv for STATA, R
  • shp for QGIS, further spatial processing
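
A quick sketch of the export methods mentioned above (filenames are hypothetical, and this assumes each method takes an output path):

aggregated.to_netcdf('aggregated.nc')   # netcdf
aggregated.to_csv('aggregated.csv')     # csv, for STATA/R workflows
aggregated.to_shp('aggregated.shp')     # shp, for QGIS or further spatial processing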

Best of all, xagg is flexible. Multiple variables in your dataset? xagg will aggregate them all, as long as they have at least lat/lon dimensions. Fields in your shapefile that you'd like to keep? xagg keeps all fields (for example FIPS codes from county datasets) all the way through the final export. Weird dimension names? xagg is trained to recognize all versions of "lat", "Latitude", "Y", "nav_lat", "Latitude_1", etc. that the author has run into over the years of working with climate data, and this list is easily expandable as a keyword argument if needed.

Please contribute - let me know what works and what doesn't, whether you think this is useful, and if so - please share!

Use Cases

Climate econometrics

Many climate econometrics studies use societal data (mortality, crop yields, etc.) at a political or administrative level (for example, counties) but climate and weather data on grids. Oftentimes, further weighting by population or agricultural density is needed.

Area-weighting of pixels onto polygons ensures that aggregating weather and climate data onto polygons occurs in a robust way. Consider a (somewhat contrived) example: an administrative region lies in relatively flat lowlands, but a pixel that slightly overlaps the polygon primarily covers a wholly different climate (mountainous, desert, etc.). Using a simple mask would weight that pixel the same as any other, though its information is not necessarily relevant to the climate of the region. Population-weighting may not always be sufficient either; consider Los Angeles, which has multiple significantly different climates, all with high population densities.

xagg allows simple population- and area-weighted averaging, in addition to export functions that turn the aggregated data into output easily used in STATA or R for further calculations.

Left to do

  • Testing, bug fixes, stability checks, etc.
  • Share widely! I hope this will be helpful to a wide group of natural and social scientists who have to work with both gridded and polygon data!
Comments
  • Speedup for large grids - mod gdf_pixels in create_raster_polgons

    In create_raster_polygons, the for loop that assigns individual polygons to gdf_pixels essentially renders xagg unusable for larger high-res grids because it is so slow. Here I propose eliminating the for loop and replacing it with a lambda apply. Big improvement for large grids!
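
    A minimal, self-contained sketch of the pattern being proposed (column names are invented for illustration; this is not the actual xagg code):

    import geopandas as gpd
    import pandas as pd
    from shapely.geometry import box

    # Hypothetical table of pixel corner coordinates
    corners = pd.DataFrame({
        'lon0': [0.0, 1.0], 'lon1': [1.0, 2.0],
        'lat0': [0.0, 0.0], 'lat1': [1.0, 1.0],
    })

    # Build one polygon per pixel with a single apply instead of assigning
    # geometries row-by-row in a Python for loop
    geoms = corners.apply(lambda r: box(r['lon0'], r['lat0'], r['lon1'], r['lat1']), axis=1)
    gdf_pixels = gpd.GeoDataFrame(corners, geometry=geoms, crs='EPSG:4326')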

    opened by kerriegeil 10
  • dot product implementation

    Starting this pull request. This is code that implements a dot-product approach for doing the aggregation. See #2

    This works for my application but I have not run the tests on this yet.
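
    For context, a minimal sketch (with made-up shapes, not the actual PR code) of what a dot-product aggregation looks like: normalized overlap areas form a (polygon x pixel) matrix that is multiplied against the stacked pixel values.

    import numpy as np

    pixel_data = np.random.default_rng(0).random((4, 6))     # (time, pixel) gridded values
    overlaps = np.array([[0.2, 0.8, 0.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.1, 0.3, 0.3, 0.3]])    # (polygon, pixel) overlap areas

    weights = overlaps / overlaps.sum(axis=1, keepdims=True)  # normalize per polygon
    aggregated = pixel_data @ weights.T                       # (time, polygon) weighted means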

    opened by jsadler2 9
  • work for one geometry?

    I ran into IndexError: single positional indexer is out-of-bounds (Traceback below)

    I have a dataset with one variable over CONUS and I'm trying to weight to one geom e.g. a county.

    I'll try to make a reproducible example.

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-83-5cd8fd54cbfc> in <module>
          1 weightmap = xa.pixel_overlaps(ds, gdf, subset_bbox=True)
    ----> 2 aggregated = xa.aggregate(ds, weightmap)
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xagg/core.py in aggregate(ds, wm)
        434                 #   the grid have just nan values for this variable
        435                 # in both cases; the "aggregated variable" is just a vector of nans.
    --> 436                 if not np.isnan(wm.agg.iloc[poly_idx,:].pix_idxs).all():
        437                     # Get the dimensions of the variable that aren't "loc" (location)
        438                     other_dims = [k for k in np.atleast_1d(ds[var].dims) if k != 'loc']
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in __getitem__(self, key)
        887                     # AttributeError for IntervalTree get_value
        888                     return self.obj._get_value(*key, takeable=self._takeable)
    --> 889             return self._getitem_tuple(key)
        890         else:
        891             # we by definition only have the 0th axis
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in _getitem_tuple(self, tup)
       1448     def _getitem_tuple(self, tup: Tuple):
       1449 
    -> 1450         self._has_valid_tuple(tup)
       1451         with suppress(IndexingError):
       1452             return self._getitem_lowerdim(tup)
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in _has_valid_tuple(self, key)
        721         for i, k in enumerate(key):
        722             try:
    --> 723                 self._validate_key(k, i)
        724             except ValueError as err:
        725                 raise ValueError(
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_key(self, key, axis)
       1356             return
       1357         elif is_integer(key):
    -> 1358             self._validate_integer(key, axis)
       1359         elif isinstance(key, tuple):
       1360             # a tuple should already have been caught by this point
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/pandas/core/indexing.py in _validate_integer(self, key, axis)
       1442         len_axis = len(self.obj._get_axis(axis))
       1443         if key >= len_axis or key < -len_axis:
    -> 1444             raise IndexError("single positional indexer is out-of-bounds")
       1445 
       1446     # -------------------------------------------------------------------
    
    IndexError: single positional indexer is out-of-bounds
    
    opened by raybellwaves 5
  • dot product implementation for overlaps breaks xagg for high res grids

    I'm finding that the implementation of the dot product for computing weighted averages in core.py/aggregate eats up way too much memory for high-res grids. It's the wm.overlap_da that requires way too much memory. I am unable to allocate enough memory to make it through core.py/aggregate for many of the datasets I'm processing on an HPC system. I had no issue with the previous aggregate function before commit 4c5cc6503efde05153181e15bc5f7fe6bb92bd07. Looks like the dot product method is a lot cleaner in the code, but is there another benefit?

    opened by kerriegeil 3
  • work with xr.DataArray's

    When providing an xr.DataArray to xa.pixel_overlaps(da, gdf), you get the traceback below.

    Couple of ideas for fixes:

    • in the code parse it to an xr.Dataset
    • Don't use .keys() but use .dims() instead
    AttributeError                            Traceback (most recent call last)
    <ipython-input-74-f5cd39618cec> in <module>
    ----> 1 weightmap = xa.pixel_overlaps(da, gdf)
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xagg/wrappers.py in pixel_overlaps(ds, gdf_in, weights, weights_target, subset_bbox)
         58     print('creating polygons for each pixel...')
         59     if subset_bbox:
    ---> 60         pix_agg = create_raster_polygons(ds,subset_bbox=gdf_in,weights=weights)
         61     else:
         62         pix_agg = create_raster_polygons(ds,subset_bbox=None,weights=weights)
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xagg/core.py in create_raster_polygons(ds, mask, subset_bbox, weights, weights_target)
        148     # Standardize inputs
        149     ds = fix_ds(ds)
    --> 150     ds = get_bnds(ds)
        151     #breakpoint()
        152     # Subset by shapefile bounding box, if desired
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xagg/aux.py in get_bnds(ds, edges, wrap_around_thresh)
        196         # to [0,360], but it's not tested yet.
        197 
    --> 198     if ('lat' not in ds.keys()) | ('lon' not in ds.keys()):
        199         raise KeyError('"lat"/"lon" not found in [ds]. Make sure the '+
        200                        'geographic dimensions follow this naming convention.')
    
    /opt/userenvs/ray.bell/main/lib/python3.9/site-packages/xarray/core/common.py in __getattr__(self, name)
        237                 with suppress(KeyError):
        238                     return source[name]
    --> 239         raise AttributeError(
        240             "{!r} object has no attribute {!r}".format(type(self).__name__, name)
        241         )
    
    AttributeError: 'DataArray' object has no attribute 'keys'
    
    opened by raybellwaves 2
  • fix export to dataset issue, insert export tests

    .to_dataset() was not working due to too many layers of lists in the agg.agg geodataframe. This issue has been fixed by replacing an index with np.squeeze() instead. The broader problem may be that there are too many unnecessary layers of lists in the agg.agg geodataframe, which should be simplified in the next round of backend cleanup.

    Furthermore, there are now tests for .to_dataset() and .to_dataframe()

    opened by ks905383 1
  • speed improvement for high res grids in create_raster_polygons

    Hi there, I'm a first timer when it comes to contributing to someone else's repo so please let me know if I need to fix/change anything. I've got a handful of small changes that greatly impact the speed of xagg when using high resolution grids. Planning to submit one at a time when I have the time to spend on it. It may take me a while...

    This first one removes the hard coded 0.1 degree buffer for selecting a subset bounding box in create_raster_polygons. For high res grids this will select a much larger area than desired. The solution I propose is to change the 0.1 degree buffer to twice the max grid spacing.
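
    A minimal sketch of the proposed buffer calculation (variable names and grid spacing are hypothetical):

    import numpy as np

    lat = np.arange(-89.875, 90, 0.25)   # a hypothetical 0.25-degree grid
    lon = np.arange(-179.875, 180, 0.25)

    # Pad the shapefile bounding box by twice the coarsest grid spacing
    # instead of a hard-coded 0.1 degrees
    buffer = 2 * max(np.diff(lat).max(), np.diff(lon).max())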

    opened by kerriegeil 1
  • Rename aux for windows

    As aux is a protected filename on Windows, I could not install the package, or even clone the repo, without renaming the file first. This is a fix inspired by NeuralEnsemble/PyNN#678.

    opened by Hugovdberg 1
  • use aggregated.to_dataset().to_dataframe() within aggregated.to_dataframe()

    When dealing with time data, aggregated.to_dataframe() will return columns as data_var0, data_var1.

    xarray has a method to convert to a dataframe (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.to_dataframe.html) which moves the coords to a MultiIndex.

    You would just have to add in the geometry and crs from the incoming geopandas object to make it a GeoDataFrame.

    opened by raybellwaves 1
  • return geometry key in aggregated.to_dataframe()

    When doing aggregated.to_dataframe() it drops the geometry column that is in the original geopandas.GeoDataFrame.

    It would be nice if it was returned to be used for things such as visualization.

    Code:

    import geopandas as gpd
    import pandas as pd
    import pooch
    import xagg as xa
    import xarray as xr
    import hvplot.pandas
    
    # Load in example data and shapefiles
    ds = xr.tutorial.open_dataset("air_temperature").isel(time=0)
    file = pooch.retrieve(
        "https://pubs.usgs.gov/of/2006/1187/basemaps/continents/continents.zip", None
    )
    continents = gpd.read_file("zip://" + file)
    continents
    
    wm = xa.pixel_overlaps(ds, continents)
    aggregated = xa.aggregate(ds, wm)
    aggregated.to_dataframe()
    
    pd.merge(aggregated.to_dataframe(), continents, on="CONTINENT").hvplot(c="air")
    
    opened by raybellwaves 1
  • fix index error if input gdf has own index [issue #8]

    xa.get_pixel_overlaps() creates a poly_idx column in the gdf that takes as its value the index of the input gdf. However, if there is a pre-existing index, this can lead to bad behavior, since poly_idx is used as an .iloc indexer in the gdf. This update instead makes poly_idx np.arange(0,len(gdf)), which will avoid this indexing issue (and hopefully not cause any more? I figured there would've been a reason I used the existing index if not a new one... fingers crossed).

    opened by ks905383 1
  • silence functions

    Hi, thank you a lot for the great package.

    I was wondering if it is possible to add an argument to the functions (pixel_overlaps and aggregate) to silence them if we want? I am doing aggregations for many geometries and sometimes the output becomes too crowded, especially if I try to print other things while the functions are executed.

    Thanks !

    opened by khalilT 0
  • Mistaken use of ds.var() in `core.py`?

    In core.py, there are a few loops of the form: for var in ds.var():.

    This tries to compute a variance across all dimensions, for each variable. Is that the intention? I think you just mean for var in ds:.

    Note that if any variables are of a type for which var cannot be computed (e.g., timedelta64[ns]) then aggregate fails.
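
    A small illustration of the difference (a toy Dataset, not xagg's code):

    import numpy as np
    import xarray as xr

    ds = xr.Dataset({'tas': ('x', np.arange(3.0)), 'pr': ('x', np.ones(3))})

    # ds.var() first reduces every variable to its variance (and raises if a
    # variable's dtype, e.g. timedelta64, has no variance); iterating over the
    # result only incidentally yields the variable names
    print(list(ds.var()))   # ['tas', 'pr'], but via an unnecessary reduction

    # Iterating over the Dataset itself gives the variable names directly
    for var in ds:
        print(var)          # 'tas', then 'pr'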

    opened by jrising 3
  • Odd errors from using pixel_overlaps with a weights option

    This issue is sort of three issues that I encountered while trying to solve a problem. Fixes to any of these would work for me.

    I'm trying to use xagg with some fairly large files including a weights file, and I was getting an error during the regridding process:

    >>> weightmap = xa.pixel_overlaps(ds_tas, gdf_regions, weights=ds_pop.Population, subset_bbox=False)
    creating polygons for each pixel...
    lat/lon bounds not found in dataset; they will be created.
    regridding weights to data grid...
    Create weight file: bilinear_1800x3600_1080x2160.nc
    zsh: illegal hardware instruction  python
    

    (at which point, python crashes)

    I decided to do the regridding myself and save the result. Here are what the data file (ds_tas) and weights file (ds_pop) look like:

    >>> ds_tas
    <xarray.Dataset>
    Dimensions:      (band: 12, x: 2160, y: 1080)
    Coordinates:
      * band         (band) int64 1 2 3 4 5 6 7 8 9 10 11 12
      * x            (x) float64 -179.9 -179.8 -179.6 -179.4 ... 179.6 179.7 179.9
      * y            (y) float64 89.92 89.75 89.58 89.42 ... -89.58 -89.75 -89.92
        spatial_ref  int64 ...
    Data variables:
        band_data    (band, y, x) float32 ...
    
    >>> ds_pop
    <xarray.Dataset>
    Dimensions:     (longitude: 2160, latitude: 1080)
    Coordinates:
      * longitude   (longitude) float64 -179.9 -179.8 -179.6 ... 179.6 179.8 179.9
      * latitude    (latitude) float64 89.92 89.75 89.58 ... -89.58 -89.75 -89.92
    Data variables:
        crs         int32 ...
        Population  (latitude, longitude) float32 ...
    Attributes:
        Conventions:  CF-1.4
        created_by:   R, packages ncdf4 and raster (version 3.4-13)
        date:         2022-02-05 22:14:16
    

    The dimensions line up exactly. But xagg still wanted to regrid my weights file. My guess is that this is because the dimensions are labeled differently (and so an np.allclose fails because taking a difference between the coordinates results in a 2-D matrix).

    So I relabeled my coordinates and dimensions. This results in a new error:

    >>> weightmap = xa.pixel_overlaps(ds_tas, gdf_regions, weights=ds_pop.Population, subset_bbox=False)
    creating polygons for each pixel...
    lat/lon bounds not found in dataset; they will be created.
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xagg/wrappers.py", line 50, in pixel_overlaps
        pix_agg = create_raster_polygons(ds,subset_bbox=None,weights=weights)
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xagg/core.py", line 127, in create_raster_polygons
        ds = get_bnds(ds)
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xagg/aux.py", line 190, in get_bnds
        bnds_tmp[1:,:] = xr.concat([ds[var]-0.5*ds[var].diff(var),
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xarray/core/_typed_ops.py", line 209, in __sub__
        return self._binary_op(other, operator.sub)
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xarray/core/dataarray.py", line 3081, in _binary_op
        self, other = align(self, other, join=align_type, copy=False)
      File "/Users/admin/opt/anaconda3/envs/ccenv2/lib/python3.7/site-packages/xarray/core/alignment.py", line 349, in align
        f"arguments without labels along dimension {dim!r} cannot be "
    ValueError: arguments without labels along dimension 'lat' cannot be aligned because they have different dimension sizes: {1080, 1079}
    

    To be clear, neither of my datasets has a dimension of size 1079.

    opened by jrising 11
  • weightmap (pixel_overlaps) warnings and errors

    Hi, thanks for the helpful package.

    On a Windows machine, I'm using the package successfully on ERA5 reanalysis data although I do get a user warning when calling pixel_overlaps. It occurs after the output "calculating overlaps between pixels and output polygons...". The warning is:

    "/home/kgeil/miniconda3/envs/xagg/lib/python3.9/site-packages/xagg/core.py:308: UserWarning: keep_geom_type=True in overlay resulted in 1 dropped geometries of different geometry types than df1 has. Set keep_geom_type=False to retain all geometries overlaps = gpd.overlay(gdf_in.to_crs(epsg_set),'

    When I try generating a weight map with the exact same shapefile but on AVHRR NDVI data instead I get a full error at the same-ish location:

    "ValueError: GeoDataFrame does not support setting the geometry column where the column name is shared by multiple columns."

    It looks like something is going wrong in get_pixel_overlaps around line 323 overlap_info...

    I've tried rewriting the NDVI netcdf to be as identical as possible to the ERA5 file (same coord and dim names, etc.) and both files are epsg:4326.

    Any ideas how to get past this error?

    opened by kerriegeil 3
  • Trying to install outside target directory

    Get this error when trying to install with pip on windows. Have tried to install from pypi, github, and zip. Same error in each instance. I've tried with a base python install using virtual env and with conda. ERROR: The zip file (C:\Users\profile\Downloads\xagg-main.zip) has a file (C:\Users\khafen\AppData\Local\Temp\3\pip-req-build-cok0yin6\xagg/aux.py) trying to install outside target directory (C:\Users\profile\AppData\Local\Temp\3\pip-req-build-cok0yin6)

    opened by konradhafen 2
  • add to_geodataframe

    Closes https://github.com/ks905383/xagg/issues/17

    Open to feedback here.

    I believe https://github.com/ks905383/xagg/blob/main/xagg/classes.py#L62 should say geopandas.GeoDataFrame

    but I was thinking to_dataframe could return a pandas DataFrame (no geometry and no crs), and to_geodataframe returns the geometry and crs

    opened by raybellwaves 2
Releases(v0.3.0.2)
  • v0.3.0.2(Apr 10, 2022)

    Bug fixes

    • .to_dataset() functions again
    • .read_wm() is now loaded by default

    What's Changed

    • fix export to dataset issue, insert export tests by @ks905383 in https://github.com/ks905383/xagg/pull/35
    • add read_wm() to init by @ks905383 in https://github.com/ks905383/xagg/pull/36

    Full Changelog: https://github.com/ks905383/xagg/compare/v0.3.0.1...v0.3.0.2

  • v0.3.0.1(Apr 7, 2022)

    Fixes dependency error in setup.py that was preventing publication of v0.3* on conda-forge.

    Full Changelog: https://github.com/ks905383/xagg/compare/v0.3.0...v0.3.0.1

  • v0.3.0(Apr 2, 2022)

    Performance upgrades

    Various performance upgrades, particularly for working with high resolution grids.

    In create_raster_polygons:

    • replacing the for-loop assigning pixels to polygons with a lambda apply
    • creating a flexible buffer for subsetting to the bounding box, replacing the previously hard-coded 0.1 degrees with twice the max grid spacing

    In aggregate:

    • an optional replacement of the aggregating calculation with a dot-product implementation (impl='dot_product' in pixel_overlaps() and aggregate()), which may improve performance in certain situations
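
    A hedged usage sketch of opting into the dot-product backend described above (ds and gdf are the gridded data and polygons from the earlier examples):

    import xagg as xa

    weightmap = xa.pixel_overlaps(ds, gdf, impl='dot_product')
    aggregated = xa.aggregate(ds, weightmap, impl='dot_product')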

    Expanded functionality

    Weightmaps can now be saved using wm.to_file() and loaded using xagg.core.read_wm(), and no longer have to be regenerated with each code run.
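
    For example (the filename is hypothetical), a weightmap can be reused across runs roughly like this:

    import xagg as xa

    # save the weightmap from pixel_overlaps after the first run...
    wm.to_file('my_weightmap')
    # ...and reload it later instead of recomputing the overlaps
    wm = xa.core.read_wm('my_weightmap')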

    Bug fixes

    Various bug fixes

    What's Changed

    • speed improvement for high res grids in create_raster_polygons by @kerriegeil in https://github.com/ks905383/xagg/pull/29
    • dot product implementation by @jsadler2 in https://github.com/ks905383/xagg/pull/4
    • Speedup for large grids - mod gdf_pixels in create_raster_polgons by @kerriegeil in https://github.com/ks905383/xagg/pull/30
    • implement making dot product optional, restoring default agg behavior by @ks905383 in https://github.com/ks905383/xagg/pull/32
    • Implement a way to save weightmaps (output from pixel_overlaps) by @ks905383 in https://github.com/ks905383/xagg/pull/33

    New Contributors

    • @kerriegeil made their first contribution in https://github.com/ks905383/xagg/pull/29
    • @jsadler2 made their first contribution in https://github.com/ks905383/xagg/pull/4

    Full Changelog: https://github.com/ks905383/xagg/compare/v0.2.6...v0.3.0

  • v0.2.6(Jan 26, 2022)

    Bug fixes:

    • #11 pixel_overlaps no longer changes the gdf_in outside of the function

    Functionality tweaks

    • added agg.to_geodataframe(), similar to agg.to_dataframe(), but keeping the geometries from the original shapefile
    • adapted xarray's ds.to_dataframe() in agg.to_dataframe(), which has better functionality
    • .csvs now export long instead of wide, using the output from ds.to_dataframe() above
  • v0.2.5(Jul 24, 2021)

  • v0.2.4(May 14, 2021)

Owner
Kevin Schwarzwald
Researching climate variability + impacts by profession, urban expansion by studies, and transit/land use policy by interest. Moonlight as rock violinist.