PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Last update: Dec 24, 2022

Overview

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PySpark bindings for the H3 core library.

For available functions, please see the vanilla Python binding documentation at:

uber.github.io/h3-py

Installation

From PyPI:

pip install h3-pyspark

From conda:

conda config --add channels conda-forge
conda install h3-pyspark

Usage


>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+

Publishing

Bump version in setup.cfg
Publish:

python3 -m build
python3 -m twine upload --repository pypi dist/*

Comments

'TypeError: must be real number, not NoneType' when using h3_pyspark

Hi, I have the following Spark dataframe. The column of H3 indices is created by passing the lat, lng pairs and the resolution to the h3_pyspark.geo_to_h3(lat, lng, resolution) function. However, I encountered the following error when I tried to check whether there are any nulls in the index column. It is not only isNull() that fails: every other subsetting operation throws the same error. Could anyone provide some insight into what the issue might be and how to fix it? Thanks in advance!

dataframe:

errors:
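
The message in the title is what CPython raises when None reaches a function that expects a float, so one plausible cause (an assumption, since the dataframe and traceback are not reproduced here) is a null lat or lng in some row. Spark evaluates UDFs lazily, which would explain why the error only surfaces once isNull() or another subsetting operation triggers computation. A minimal pure-Python sketch, using math.radians as a stand-in for the float-expecting call; to_radians_or_none is a hypothetical helper and not part of h3_pyspark:

```python
import math


def to_radians_or_none(value):
    """Return value converted to radians, or None when the input is missing."""
    if value is None:
        return None
    return math.radians(value)


# An unguarded call with None raises a TypeError like the one reported.
try:
    math.radians(None)
except TypeError as exc:
    message = str(exc)

print(message)
print(to_radians_or_none(None))
print(to_radians_or_none(180.0))
```

If this reading is right, dropping or filling rows with null coordinates before calling geo_to_h3 should avoid the failure.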

opened by Tingmi 5
Fix indexing for polygons and lines

Catches some edge cases that h3_line and polyfill would miss. The result could be overbroad, which is why the docstrings now say "superset", but at least it should be complete.

opened by rwaldman 1
Better error handling when null values are passed in
Currently the behavior for all UDFs is that if any row in your dataframe has a null value, the entire build will fail.

This type of behavior would be better and more resilient:

@F.udf(T.ArrayType(T.StringType()))
def index_shape(geometry, resolution):
    if geometry is None:
        return None
    return _index_shape(geometry, resolution)
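
The same null check could be factored into a decorator so that every UDF shares it. The sketch below is plain Python with no Spark dependency. The name handle_nulls matches a decorator that appears elsewhere on this page, but this body is an assumption about how such a decorator might work, not the library's actual code. describe_cell is a hypothetical stand-in for a real indexing function.

```python
from functools import wraps


def handle_nulls(func):
    """Return None instead of calling func when any positional argument is None."""

    @wraps(func)
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)

    return wrapper


@handle_nulls
def describe_cell(lat, lng, resolution):
    # Hypothetical stand-in for an indexing call; it only formats its inputs.
    return f"{lat},{lng}@{resolution}"


print(describe_cell(None, -122.388903, 9))
print(describe_cell(37.769377, -122.388903, 9))
```

With this shape, a row containing a null yields a null result rather than failing the whole job.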
opened by kevinschaich 1
Fix bug in index_shape function which missed hexes for long line segments

Fixes #8

Previous behavior for problematic line:

New behavior for same line:

Previous behavior for problematic polygon:

New behavior for same polygon:

cc: @deankieserman @rwaldman

opened by kevinschaich 0
Bug in index_shape function which misses several hexes

Reported by @rwaldman – we can miss several hexes in the worst case if a line's start and endpoints are east-to-west and towards the north or south edge:

Proposed solution is for long line segments (≥ s where s = hex side length) to interpolate several points along the line based on the selected resolution, so that we catch the ones in between:
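
The interpolation idea can be sketched in plain Python. This is a simplified illustration, not the fix that was merged: densify_segment is a hypothetical helper, max_step stands in for the hex side length s at the chosen resolution, and distance is measured in the plane in coordinate units rather than along the sphere.

```python
import math


def densify_segment(start, end, max_step):
    """Return points from start to end spaced at most max_step apart.

    Points are (x, y) tuples and spacing uses planar Euclidean distance.
    """
    if max_step <= 0:
        raise ValueError("max_step must be positive")
    (x0, y0), (x1, y1) = start, end
    length = math.hypot(x1 - x0, y1 - y0)
    # Number of equal pieces needed so that no piece is longer than max_step.
    pieces = max(1, math.ceil(length / max_step))
    return [
        (x0 + (x1 - x0) * i / pieces, y0 + (y1 - y0) * i / pieces)
        for i in range(pieces + 1)
    ]


points = densify_segment((0.0, 0.0), (1.0, 0.0), 0.25)
print(points)
```

Indexing each returned point, instead of only the two endpoints, is what would pick up the hexes lying between them.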

opened by kevinschaich 0

polyfill fails with valid multipolygon geojson

h3_pyspark.polyfill fails when a valid MultiPolygon GeoJSON is provided. This is expected behavior when using the native h3 library.

However, I thought it would be helpful if this library could accept MultiPolygons. Could I get permission to push a PR?

implementation in src/h3_pyspark/__init__.py

@F.udf(returnType=T.ArrayType(T.StringType()))
@handle_nulls
def polyfill(polygons, res, geo_json_conformant):
    # NOTE: this behavior differs from default
    # h3-pyspark expects the `polygons` argument to be a valid GeoJSON string
    polygons = json.loads(polygons)
    type_ = polygons["type"].lower()
    if type_ == "multipolygon":
        output = []
        for i in polygons["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
        return sanitize_types(output)
    return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
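
The splitting step above can be exercised without Spark or h3 by passing the fill function in as a parameter. The sketch below is an illustration under that assumption: fill_geojson and fake_fill are hypothetical names, and fake_fill merely stands in for a real polyfill call. It also removes duplicate cells, which the proposal above does not do and which might matter if member polygons overlap.

```python
import json


def fill_geojson(geojson_str, fill_polygon):
    """Apply fill_polygon to each Polygon in a Polygon or MultiPolygon string.

    Returns the cells in first-seen order with duplicates removed.
    """
    geometry = json.loads(geojson_str)
    kind = geometry["type"].lower()
    if kind == "polygon":
        parts = [geometry]
    elif kind == "multipolygon":
        parts = [
            {"type": "Polygon", "coordinates": coords}
            for coords in geometry["coordinates"]
        ]
    else:
        raise ValueError(f"unsupported geometry type: {geometry['type']}")
    seen = {}
    for part in parts:
        for cell in fill_polygon(part):
            seen.setdefault(cell, None)  # dicts keep insertion order
    return list(seen)


def fake_fill(polygon):
    # Hypothetical stand-in for a polyfill call: labels a polygon by ring size.
    return [f"ring-{len(polygon['coordinates'][0])}", "shared"]


multi = json.dumps({
    "type": "MultiPolygon",
    "coordinates": [
        [[[0, 0], [1, 0], [1, 1], [0, 0]]],
        [[[2, 2], [3, 2], [3, 3], [2, 3], [2, 2]]],
    ],
})
cells = fill_geojson(multi, fake_fill)
print(cells)
```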

test in tests/test_core.py

multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'

def test_polyfill_multipolygon(self):
        h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
        print(h3_pyspark_test_args)
        integer = 12
        data = {
            "res": integer,
            "geo_json_conformant": True,
            "geojson": multipolygon,
        }
        df = spark.createDataFrame([data])
        actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
        actual = actual.collect()[0]["actual"]
        print(actual)
        expected = []
        for i in json.loads(multipolygon)["coordinates"]:
            _polygon = {"type": "Polygon", "coordinates": i}
            expected.extend(list(h3.polyfill(_polygon, integer, True)))
        expected = sanitize_types(expected)
        assert sort(actual) == sort(expected)

opened by kangeugine 0

Releases(1.2.6)

1.2.6(Mar 10, 2022)
Add edge cases for lines (#11)

Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.5...1.2.6
1.2.4(Mar 4, 2022)
What's Changed

Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4
1.2.3(Feb 24, 2022)
What's Changed

Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3

Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

New Contributors

@deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3
1.2.2(Jan 5, 2022)

1.1.0(Dec 8, 2021)
What's Changed

Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1

Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

New Contributors

@kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

Owner

Kevin Schaich

Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.

GitHub Repository https://uber.github.io/h3-py/intro.html
