PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

Overview

H3 Logo

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark

PyPI version PyPI downloads conda version

Tests

PySpark bindings for the H3 core library.

For available functions, please see the vanilla Python binding documentation at:

Installation

From PyPI:

pip install h3-pyspark

From conda

conda config --add channels conda-forge
conda install h3-pyspark

Usage

>> >>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution')) >>> df.show() +---------+-----------+----------+---------------+ | lat| lng|resolution| h3_9| +---------+-----------+----------+---------------+ |37.769377|-122.388903| 9|89283082e73ffff| +---------+-----------+----------+---------------+ ">
>>> from pyspark.sql import SparkSession, functions as F
>>> import h3_pyspark
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>> df = spark.createDataFrame([{"lat": 37.769377, "lng": -122.388903, 'resolution': 9}])
>>>
>>> df = df.withColumn('h3_9', h3_pyspark.geo_to_h3('lat', 'lng', 'resolution'))
>>> df.show()

+---------+-----------+----------+---------------+
|      lat|        lng|resolution|           h3_9|
+---------+-----------+----------+---------------+
|37.769377|-122.388903|         9|89283082e73ffff|
+---------+-----------+----------+---------------+

Publishing

  1. Bump version in setup.cfg
  2. Publish:
python3 -m build
python3 -m twine upload --repository pypi dist/*
Comments
  • 'TypeError: must be real number, not NoneType' when using h3_pyspark

    'TypeError: must be real number, not NoneType' when using h3_pyspark

    Hi, I have the following spark dataframe and the column of h3 indices is created by applying the lat, lng pairs and the resolution to h3_pypark.geo_to_h3(lat, lng, resolution) function. However I encountered the following error when I tried to check if there's any null in the index column. And it's not only isNull() not working but also any other subsetting operations which all throw me the same error, could anyone provide some insights on what might be the issue and how to fix it? Thanks in advance!

    dataframe: image

    errors: image

    opened by Tingmi 5
  • Fix indexing for polygons and lines

    Fix indexing for polygons and lines

    Catches some edge cases where h3_line and polyfill would miss. Could be overbroad, which is why the docstrings are changed to say superset, but at least it should be complete

    opened by rwaldman 1
  • Better error handling when null values are passed in

    Better error handling when null values are passed in

    Currently the behavior for all UDFs is that if any row in your dataframe has a null value, the entire build will fail.

    This type behavior would be better/more resilient:

    @F.udf(T.ArrayType(T.StringType()))
    def index_shape(geometry, resolution):
        if geometry is None:
            return None
        return _index_shape(geometry, resolution)
    
    opened by kevinschaich 1
  • Fix bug in index_shape function which missed hexes for long line segments

    Fix bug in index_shape function which missed hexes for long line segments

    Fixes #8

    Previous behavior for problematic line:

    Screen Shot 2022-02-24 at 3 40 36 PM

    New behavior for same line:

    Screen Shot 2022-02-24 at 4 02 47 PM

    Previous behavior for problematic polygon:

    Screen Shot 2022-02-24 at 4 34 59 PM

    New behavior for same polygon:

    Screen Shot 2022-02-24 at 4 35 46 PM

    cc: @deankieserman @rwaldman

    opened by kevinschaich 0
  • Bug in index_shape function which misses several hexes

    Bug in index_shape function which misses several hexes

    Reported by @rwaldman – we can miss several hexes in the worst case if a line's start and endpoints are east-to-west and towards the north or south edge:

    image

    Proposed solution is for long line segments (≥ s where s = hex side length) to interpolate several points along the line based on the selected resolution, so that we catch the ones in between:

    image
    opened by kevinschaich 0
  • polyfill fails with valid multipolygon geojson

    polyfill fails with valid multipolygon geojson

    h3_pyspark.polyfill fails when a valid multipolygon geojson is provided this is expected behavior when utilizing the h3 native library.

    however, i thought it would be helpful if this library is able to accept multipolygons. could I get permission to push a PR?

    implementation in src/h3_pyspark/__init__.py

    @F.udf(returnType=T.ArrayType(T.StringType()))
    @handle_nulls
    def polyfill(polygons, res, geo_json_conformant):
        # NOTE: this behavior differs from default
        # h3-pyspark expect `polygons` argument to be a valid GeoJSON string
        polygons = json.loads(polygons)
        type_ = polygons["type"].lower()
        if type_ == "multipolygon":
            output = []
            for i in polygons["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                output.extend(list(h3.polyfill(_polygon, res, geo_json_conformant)))
            return sanitize_types(output)
        return sanitize_types(h3.polyfill(polygons, res, geo_json_conformant))
    

    test in tests/test_core.py

    multipolygon = '{"type": "MultiPolygon","coordinates": [[[[108.98309290409088,13.240363245242063],[108.98343622684479,13.240363245242063],[108.98343622684479,13.240634779729014],[108.98309290409088,13.240634779729014],[108.98309290409088,13.240363245242063]]],[[[108.98349523544312,13.240002939397714],[108.98389220237732,13.240002939397714],[108.98389220237732,13.240269252464502],[108.98349523544312,13.240269252464502],[108.98349523544312,13.240002939397714]]]]}'
    
    def test_polyfill_multipolygon(self):
            h3_test_args, h3_pyspark_test_args = get_test_args(h3.polyfill)
            print(h3_pyspark_test_args)
            integer = 12
            data = {
                "res": integer,
                "geo_json_conformant": True,
                "geojson": multipolygon,
            }
            df = spark.createDataFrame([data])
            actual = df.withColumn("actual", h3_pyspark.polyfill(*h3_pyspark_test_args))
            actual = actual.collect()[0]["actual"]
            print(actual)
            expected = []
            for i in json.loads(multipolygon)["coordinates"]:
                _polygon = {"type": "Polygon", "coordinates": i}
                expected.extend(list(h3.polyfill(_polygon, integer, True)))
            expected = sanitize_types(expected)
            assert sort(actual) == sort(expected)
    
    opened by kangeugine 0
Releases(1.2.6)
  • 1.2.6(Mar 10, 2022)

  • 1.2.4(Mar 4, 2022)

    What's Changed

    • Handle null values in inputs to UDFs by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/10

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.3...1.2.4

    Source code(tar.gz)
    Source code(zip)
  • 1.2.3(Feb 24, 2022)

    What's Changed

    • Add error handling for bad geometries by @deankieserman in https://github.com/kevinschaich/h3-pyspark/pull/3
    • Fix bug in index_shape function which missed hexes for long line segments by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/9

    New Contributors

    • @deankieserman made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/3

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/compare/1.2.2...1.2.3

    Source code(tar.gz)
    Source code(zip)
  • 1.1.0(Dec 8, 2021)

    What's Changed

    • Create LICENSE by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/1
    • Add extension functions (index_shape, k_ring_distinct) for spatial indexing & buffers by @kevinschaich in https://github.com/kevinschaich/h3-pyspark/pull/2

    New Contributors

    • @kevinschaich made their first contribution in https://github.com/kevinschaich/h3-pyspark/pull/1

    Full Changelog: https://github.com/kevinschaich/h3-pyspark/commits/1.1.0

    Source code(tar.gz)
    Source code(zip)
Owner
Kevin Schaich
Solving awesome problems @palantir. Part-time open source junkie. Purveyor of hot coffee and thoughtful photographs.
Kevin Schaich
Semi-Automated Data Processing

Perform semi automated exploratory data analysis, feature engineering and feature selection on provided dataset by visualizing every possibilities on each step and assisting the user to make a meanin

Arun Singh Babal 1 Jan 17, 2022
Making the DAEN information accessible.

The purpose of this repository is to make the information on Australian COVID-19 adverse events accessible. The Therapeutics Goods Administration (TGA) keeps a database of adverse reactions to medica

10 May 10, 2022
Collections of pydantic models

pydantic-collections The pydantic-collections package provides BaseCollectionModel class that allows you to manipulate collections of pydantic models

Roman Snegirev 20 Dec 26, 2022
Falcon: Interactive Visual Analysis for Big Data

Falcon: Interactive Visual Analysis for Big Data Crossfilter millions of records without latencies. This project is work in progress and not documente

Vega 803 Dec 27, 2022
Nobel Data Analysis

Nobel_Data_Analysis This project is for analyzing a set of data about people who have won the Nobel Prize in different fields and different countries

Mohammed Hassan El Sayed 1 Jan 24, 2022
Top 50 best selling books on amazon

It's a dashboard that shows the detailed information about each book in the top 50 best selling books on amazon over the last ten years

Nahla Tarek 1 Nov 18, 2021
Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle.

2019-indian-election-eda Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle. This project is a part of the Cou

Souradeep Banerjee 5 Oct 10, 2022
PyEmits, a python package for easy manipulation in time-series data.

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Thompson 5 Sep 23, 2022
My first Python project is a simple Mad Libs program.

Python CLI Mad Libs Game My first Python project is a simple Mad Libs program. Mad Libs is a phrasal template word game created by Leonard Stern and R

Carson Johnson 1 Dec 10, 2021
Analyzing Covid-19 Outbreaks in Ontario

My group and I took Covid-19 outbreak statistics from ontario, and analyzed them to find different patterns and future predictions for the virus

Vishwaajeeth Kamalakkannan 0 Jan 20, 2022
scikit-survival is a Python module for survival analysis built on top of scikit-learn.

scikit-survival scikit-survival is a Python module for survival analysis built on top of scikit-learn. It allows doing survival analysis while utilizi

Sebastian Pölsterl 876 Jan 04, 2023
MS in Data Science capstone project. Studying attacks on autonomous vehicles.

Surveying Attack Models for CAVs Guide to Installing CARLA and Collecting Data Our project focuses on surveying attack models for Connveced Autonomous

Isabela Caetano 1 Dec 09, 2021
Techdegree Data Analysis Project 2

Basketball Team Stats Tool In this project you will be writing a program that reads from the "constants" data (PLAYERS and TEAMS) in constants.py. Thi

2 Oct 23, 2021
Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Theano

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) an

PyMC 7.2k Dec 30, 2022
A lightweight, hub-and-spoke dashboard for multi-account Data Science projects

A lightweight, hub-and-spoke dashboard for cross-account Data Science Projects Introduction Modern Data Science environments often involve many indepe

AWS Samples 3 Oct 30, 2021
A script to "SHUA" H1-2 map of Mercenaries mode of Hearthstone

lushi_script Introduction This script is to "SHUA" H1-2 map of Mercenaries mode of Hearthstone Installation Make sure you installed python=3.6. To in

210 Jan 02, 2023
Shot notebooks resuming the main functions of GeoPandas

Shot notebooks resuming the main functions of GeoPandas, 2 notebooks written as Exercises to apply these functions.

1 Jan 12, 2022
First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we want to understand column level lineage and automate impact analysis.

dbt-osmosis First and foremost, we want dbt documentation to retain a DRY principle. Every time we repeat ourselves, we waste our time. Second, we wan

Alexander Butler 150 Jan 06, 2023
ASOUL直播间弹幕抓取&&数据分析

ASOUL直播间弹幕抓取&&数据分析(更新中) 这些文件用于爬取ASOUL直播间的弹幕(其他直播间也可以)和其他信息,以及简单的数据分析生成。

159 Dec 10, 2022
Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier Fancy data functions that will make your life as a data scientist easier. Installing To ins

WhiteBox 3 Oct 03, 2022