geobeam - adds GIS capabilities to your Apache Beam and Dataflow pipelines.

Overview

geobeam adds GIS capabilities to your Apache Beam pipelines.

What does geobeam do?

geobeam enables you to ingest and analyze massive amounts of geospatial data in parallel using Dataflow. It provides a set of FileBasedSource classes that make it easy to read, process, and write geospatial data, along with Apache Beam transforms and utilities that simplify processing GIS data in your Dataflow pipelines.

See the Full Documentation for the complete API specification.

Requirements

  • Apache Beam 2.27+
  • Python 3.7+

Note: Make sure the Python version used to run the pipeline matches the version in the built container.
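For example, a quick guard at the top of your pipeline script can catch a mismatch early. A minimal sketch, assuming the -py37 container variant (adjust the expected version to match your image):

import sys

# The expected version below assumes the -py37 container variant.
expected = (3, 7)
actual = sys.version_info[:2]
assert actual == expected, \
    'local Python %d.%d does not match container Python %d.%d' % (actual + expected)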

Supported input types

| File format | Data type | Geobeam class |
|-------------|-----------|---------------|
| tiff | raster | GeotiffSource |
| shp | vector | ShapefileSource |
| gdb | vector | GeodatabaseSource |
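For illustration, a hypothetical helper (not part of geobeam) that picks a source class by file extension. Note that zip archives can contain either shapefiles or geodatabases, so this simple check is an assumption, not a rule:

from geobeam.io import GeotiffSource, ShapefileSource, GeodatabaseSource

def source_for(gcs_url, **kwargs):
  # Hypothetical helper: route a file to the matching geobeam source class.
  if gcs_url.endswith(('.tif', '.tiff')):
    return GeotiffSource(gcs_url, **kwargs)      # raster
  if gcs_url.endswith('.shp') or gcs_url.endswith('.zip'):
    return ShapefileSource(gcs_url, **kwargs)    # vector (zipped shapefile)
  if gcs_url.endswith('.gdb'):
    return GeodatabaseSource(gcs_url, **kwargs)  # vector
  raise ValueError('unsupported file type: ' + gcs_url)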

Included libraries

geobeam includes several Python modules that allow you to perform a wide variety of operations and analyses on your geospatial data.

| Module | Version | Description |
|--------|---------|-------------|
| gdal | 3.2.1 | Python bindings for GDAL |
| rasterio | 1.1.8 | reads and writes geospatial raster data |
| fiona | 1.8.18 | reads and writes geospatial vector data |
| shapely | 1.7.1 | manipulation and analysis of geometric objects in the Cartesian plane |
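Because these libraries are bundled in the geobeam containers, your own transforms can use them directly. A minimal sketch (add_area is a hypothetical transform; it assumes the (props, geom) tuples emitted by geobeam sources, where geom is a GeoJSON-like mapping):

from shapely.geometry import shape

def add_area(element):
  # Hypothetical transform: annotate each record with its planar area.
  props, geom = element              # (props, geom) tuple from a geobeam FileSource
  props['area'] = shape(geom).area   # shapely accepts GeoJSON-like mappings
  return props, geom

Use it like any other transform, e.g. | 'AddArea' >> beam.Map(add_area), between the read and format steps.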

How to Use

1. Install the module

pip install geobeam

2. Write your pipeline

Write a normal Apache Beam pipeline using one of geobeam's file sources; a minimal sketch of the argument wiring follows below. See geobeam/examples for complete pipelines.
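A minimal sketch of the argument wiring, following the parse_known_args pattern used by the bundled examples; the run() function is assumed to build the pipeline, like the ones in the Examples section below:

import argparse
from apache_beam.options.pipeline_options import PipelineOptions

if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument('--gcs_url')   # input file on Cloud Storage
  known_args, pipeline_args = parser.parse_known_args()

  # Unrecognized flags (--runner, --temp_location, ...) become Beam options.
  options = PipelineOptions(pipeline_args)
  run(options, known_args.gcs_url)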

3. Run

Run locally

python -m geobeam.examples.geotiff_dem \
  --gcs_url gs://geobeam/examples/dem-clipped-test.tif \
  --dataset=examples \
  --table=dem \
  --band_column=elev \
  --centroid_only=true \
  --runner=DirectRunner \
  --temp_location <temp gs://> \
  --project <project_id>

You can also run "locally" in Cloud Shell using the -py37 container variants.

Note: Some of the provided examples may take a very long time to run locally...

Run in Dataflow

Write a Dockerfile

This will run in Dataflow as a custom container based on the dataflow-geobeam/base image. See geobeam/examples/Dockerfile for an example that installs the latest geobeam from source.

FROM gcr.io/dataflow-geobeam/base
# FROM gcr.io/dataflow-geobeam/base-py37

RUN pip install geobeam

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
# build locally with docker
docker build -t gcr.io/<project_id>/example .
docker push gcr.io/<project_id>/example

# or build with Cloud Build
gcloud builds submit --tag gcr.io/<project_id>/<name> --timeout=3600s --machine-type=n1-highcpu-8

Start the Dataflow job

Note on Python versions

If you are starting a Dataflow job from a machine running Python 3.7, you must use the images suffixed with -py37. (Cloud Shell runs Python 3.7 by default, as of Feb 2021.) A separate version of the base image is built for Python 3.7 and is available at gcr.io/dataflow-geobeam/base-py37. The Python 3.7-compatible examples image is similarly named: gcr.io/dataflow-geobeam/example-py37.

# run the geotiff_soilgrid example in dataflow
python -m geobeam.examples.geotiff_soilgrid \
  --gcs_url gs://geobeam/examples/AWCh3_M_sl1_250m_ll.tif \
  --dataset=examples \
  --table=soilgrid \
  --band_column=h3 \
  --runner=DataflowRunner \
  --worker_harness_container_image=gcr.io/dataflow-geobeam/example \
  --experiment=use_runner_v2 \
  --temp_location=<temp bucket> \
  --service_account_email <service account> \
  --region us-central1 \
  --max_num_workers 2 \
  --machine_type c2-standard-30 \
  --merge_blocks 64

Examples

Polygonize Raster

def run(options, gcs_url):
  import apache_beam as beam
  from geobeam.io import GeotiffSource
  from geobeam.fn import format_record

  with beam.Pipeline(options=options) as p:
    (p  | 'ReadRaster' >> beam.io.Read(GeotiffSource(gcs_url))
        | 'FormatRecord' >> beam.Map(format_record, 'elev', 'float')
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.dem'))
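Here 'elev' and 'float' are forwarded by beam.Map to format_record as its band_column and band_type arguments: the name and type of the column that receives the raster band value.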

Validate and Simplify Shapefile

def run(options, gcs_url):
  import apache_beam as beam
  from geobeam.io import ShapefileSource
  from geobeam.fn import make_valid, filter_invalid, format_record

  with beam.Pipeline(options=options) as p:
    (p  | 'ReadShapefile' >> beam.io.Read(ShapefileSource(gcs_url))
        | 'Validate' >> beam.Map(make_valid)
        | 'FilterInvalid' >> beam.Filter(filter_invalid)
        | 'FormatRecord' >> beam.Map(format_record)
        | 'WriteToBigquery' >> beam.io.WriteToBigQuery('geo.parcel'))
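Note the ordering: make_valid first attempts to repair each geometry, then filter_invalid drops only the geometries that could not be repaired, so no invalid rows reach BigQuery.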

See geobeam/examples/ for complete examples.

A number of example pipelines are available in the geobeam/examples/ folder. To run them in your Google Cloud project, apply the included Terraform file to set up the BigQuery dataset and tables used by the example pipelines.

Open up BigQuery GeoViz to visualize your data.

Shapefile Example

The National Flood Hazard Layer loaded from a shapefile. Example pipeline at geobeam/examples/shapefile_nfhl.py.

Raster Example

The Digital Elevation Model (DEM) provides elevation measurements at 1-meter resolution, with values converted to centimeters. Example pipeline: geobeam/examples/geotiff_dem.py.

Included Transforms

The geobeam.fn module includes several Beam Transforms that you can use in your pipelines.

| Transform | Description |
|-----------|-------------|
| geobeam.fn.make_valid | Attempt to make all geometries valid. |
| geobeam.fn.filter_invalid | Filter out invalid geometries that cannot be made valid. |
| geobeam.fn.format_record | Format the (props, geom) tuple received from a FileSource into a dict that can be inserted into the destination table. |
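As a hedged sketch (the keyword names are taken from format_record's docstring), the positional arguments in the raster example above can also be passed by name:

p | 'FormatRecord' >> beam.Map(format_record, band_column='elev', band_type='float')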

Execution parameters

Each FileSource accepts several parameters that you can use to configure how your data is loaded and processed. These can be parsed as pipeline arguments and passed into the respective FileSources as seen in the examples pipelines.

| Parameter | Input type | Description | Default | Required? |
|-----------|------------|-------------|---------|-----------|
| skip_reproject | All | True to skip reprojection during read | False | No |
| in_epsg | All | An EPSG integer to override the input source CRS to reproject from | | No |
| band_number | Raster | The raster band to read from | 1 | No |
| include_nodata | Raster | True to include nodata values | False | No |
| centroid_only | Raster | True to only read pixel centroids | False | No |
| merge_blocks | Raster | Number of block windows to combine during read. Larger values will generate larger, better-connected polygons. | | No |
| layer_name | Vector | Name of the layer to read | | Yes |
| gdb_name | Vector | Name of the geodatabase directory in a gdb zip archive | | Yes, for GDB files |
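These parameters map directly onto keyword arguments of the source constructors. A sketch with illustrative values (the layer name and EPSG code below are placeholders, not defaults):

from geobeam.io import GeotiffSource, ShapefileSource

# Raster read: first band, pixel centroids only, larger merged windows.
raster_source = GeotiffSource(gcs_url, band_number=1,
                              centroid_only=True, merge_blocks=64)

# Vector read: a named layer, overriding the source CRS with an EPSG code.
vector_source = ShapefileSource(gcs_url, layer_name='my_layer', in_epsg=4269)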

License

This is not an officially supported Google product, though support will be provided on a best-effort basis.

Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Comments
  • Add get_bigquery_schema_dataflow

    Added get_bigquery_schema_dataflow, a utility that reads files from Google Cloud Storage and generates schemas that can be used natively with Google Dataflow.

    opened by mbforr 7
  • Unable to get worker harness container image

    Unable to get the worker harness container images used in the examples.

    ~~➜ docker pull gcr.io/dataflow-geobeam/example-py37~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/base-py37~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/example~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ~~➜ docker pull gcr.io/dataflow-geobeam/base~~

    ~~Using default tag: latest Cannot connect to the Docker daemon at unix:///Users/jorgen.skontorp/.docker/run/docker.sock. Is the docker daemon running?~~

    ➜ docker pull gcr.io/dataflow-geobeam/base
    Using default tag: latest
    Error response from daemon: manifest for gcr.io/dataflow-geobeam/base:latest not found: manifest unknown: Failed to fetch "latest" from request "/v2/dataflow-geobeam/base/manifests/latest"
    
    opened by jskontorp 5
  • Encountering errors while trying to run the geotiff examples

    tl;dr This may be an issue with my environment (I'm running Python 3.9.13), but I've had no success getting any of the examples involving gridded data (e.g., geobeam.examples.geotiff_dem) to run locally. Have these been tested recently?

    I was bumping up against TypeError: only size-1 arrays can be converted to Python scalars [while running 'ElevToCentimeters'], which were fixed by using x.astype(int) where appropriate. Then I hit TypeError: format_record() takes from 1 to 2 positional arguments but 3 were given [while running 'FormatRecords']. (Maybe this line needs to be something like 'FormatRecords' >> beam.Map(format_rasterpolygon_record, 'int', known_args.band_column) instead?) Then I got TypeError: Object of type 'ndarray' is not JSON serializable [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)'] and stopped to jump in here. If it's just my environment, then I'm making changes needlessly. If it's the code, it seemed better for these fixes to be applied directly in the upstream repo.

    Any advice you have would be much appreciated!!

    bug documentation 
    opened by lzachmann 3
  • ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    I am fairly new to Python and Apache Beam. I used shapefile_nfhl.py as an example to create a reader for GeoJSON files, so I imported GeoJSONSource (as per the documentation) from geobeam.io, but when I run the application I get the following error: ImportError: cannot import name 'GeoJSONSource' from 'geobeam.io'

    Am I missing something? I did follow the instructions to install geobeam (pip install geobeam).

    I have tried this with Python 3.7, 3.9, and 3.10. Versions 3.7 and 3.9 give this error, whereas 3.10 does not work at all due to issues installing rasterio.

    I am also running this on macOS Monterey (12.2.1)

    Here is my code:

    def run(pipeline_args, known_args): 
        import apache_beam as beam
        from apache_beam.io.gcp.internal.clients import bigquery as beam_bigquery
        from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
        from geobeam.io import GeoJSONSource
        from geobeam.fn import format_record, make_valid, filter_invalid
    
        pipeline_options = PipelineOptions([
            '--experiments', 'use_beam_bq_sink',
        ] + pipeline_args)
    
        with beam.Pipeline(options=pipeline_options) as p:
            (p
             | beam.io.Read(GeoJSONSource(known_args.gcs_url,
                 layer_name=known_args.layer_name))
             | 'MakeValid' >> beam.Map(make_valid)
             | 'FilterInvalid' >> beam.Filter(filter_invalid)
             | 'FormatRecords' >> beam.Map(format_record)
             | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
                 beam_bigquery.TableReference(
                     datasetId=known_args.dataset,
                     tableId=known_args.table),
                 method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
                 write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                 create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER))
    
    
    if __name__ == '__main__':
        import logging
        import argparse
    
        logging.getLogger().setLevel(logging.INFO)
    
        parser = argparse.ArgumentParser()
        parser.add_argument('--gcs_url')
        parser.add_argument('--dataset')
        parser.add_argument('--table')
        parser.add_argument('--layer_name')
        parser.add_argument('--in_epsg', type=int, default=None)
        known_args, pipeline_args = parser.parse_known_args()
    
        run(pipeline_args, known_args)
    opened by migaelhartzenberg 3
  • Issue installing geobeam on GCP Cloud Shell

    Seeing the below issue while installing geobeam on GCP Cloud Shell.

    Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-07s73wh5/orjson/
    

    Version of Python used from venv is 3.7.3

    (env) [email protected]:~ $ python
    Python 3.7.3 (default, Jan 22 2021, 20:04:44)
    [GCC 8.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    

    Detailed error message

    Collecting orjson<4.0; python_version >= "3.6" (from apache-beam[gcp]>=2.27.0->geobeam)
      Using cached https://files.pythonhosted.org/packages/75/cd/eac8908d0b4a4b08067bc68c04e52d7601b0ed86bf2a6a3264c46dd51a84/orjson-3.6.3.tar.gz
      Installing build dependencies ... done
        Complete output from command python setup.py egg_info:
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/usr/lib/python3.7/tokenize.py", line 447, in open
            buffer = _builtin_open(filename, 'rb')
        FileNotFoundError: [Errno 2] No such file or directory: '/tmp/pip-install-07s73wh5/orjson/setup.py'
    
    opened by hashkanna 1
  • Adding ESRIServerSource and GeoJSONSource

    Hey @tjwebb wanted to send these over - still need to do some testing but wanted to run them by you first.

    GeoJSONSource - this one should be fairly straightforward as it is a single file and Fiona can read the file natively

    ESRIServerSource - I added a package that can handle the transformation of ESRI JSON to GeoJSON, as well as loop through a layer request since the ESRI REST API generally limits features that can be requested to 1000 or 2000. I can write some of this code natively or we can use the package, but not sure if we want to limit the dependencies. The package in question is here.

    https://github.com/openaddresses/pyesridump

    Also any tips for testing locally would be great!

    opened by mbforr 1
  • Unable to load 5GB tif file to BigQuery

    It works fine for a 1GB tif file. While trying to load a 2GB to 5GB tif file, it fails with multiple errors during the write to BigQuery.

    If you would like to reproduce the errors, you can get these datasets from https://files.isric.org/soilgrids/former/2017-03-10/data/ : BDRLOG_M_250m_ll.tif, OCDENS_M_sl1_250m_ll.tif, ORCDRC_M_sl1_250m_ll.tif

    "Error processing instruction process_bundle-1256. Original traceback is Traceback (most recent call last): File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) BrokenPipeError: [Errno 32] Broken pipe

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last): File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 289, in _execute response = task() File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 362, in lambda: self.create_worker().do_instruction(request), request) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 606, in do_instruction return getattr(self, request_type)( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 644, in process_bundle bundle_processor.process_bundle(instruction_id)) File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 999, in process_bundle input_op_by_transform_id[element.transform_id].process_encoded( File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 228, in process_encoded self.output(decoded_value) File "apache_beam/runners/worker/operations.py", line 357, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 359, in apache_beam.runners.worker.operations.Operation.output File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 829, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/worker/operations.py", line 838, in apache_beam.runners.worker.operations.SdfProcessSizedElements.process File "apache_beam/runners/common.py", line 1247, in apache_beam.runners.common.DoFnRunner.process_with_sized_restriction File "apache_beam/runners/common.py", line 748, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 886, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File 
"apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1306, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 587, in apache_beam.runners.common.SimpleInvoker.invoke_process File "apache_beam/runners/common.py", line 1401, in apache_beam.runners.common._OutputProcessor.process_outputs File "apache_beam/runners/worker/operations.py", line 221, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/worker/operations.py", line 719, in 
apache_beam.runners.worker.operations.DoOperation.process File "apache_beam/runners/common.py", line 1241, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 1321, in apache_beam.runners.common.DoFnRunner._reraise_augmented File "/usr/local/lib/python3.8/site-packages/future/utils/init.py", line 446, in raise_with_traceback raise exc.with_traceback(traceback) File "apache_beam/runners/common.py", line 1239, in apache_beam.runners.common.DoFnRunner.process File "apache_beam/runners/common.py", line 768, in apache_beam.runners.common.PerWindowInvoker.invoke_process File "apache_beam/runners/common.py", line 891, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window File "apache_beam/runners/common.py", line 1374, in apache_beam.runners.common._OutputProcessor.process_outputs File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_file_loads.py", line 248, in process writer.write(row) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/bigquery_tools.py", line 1396, in write return self._file_handle.write(self._coder.encode(row) + b'\n') File "/usr/local/lib/python3.8/site-packages/apache_beam/io/filesystemio.py", line 205, in write self._uploader.put(b) File "/usr/local/lib/python3.8/site-packages/apache_beam/io/gcp/gcsio.py", line 663, in put self._conn.send_bytes(data.tobytes()) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes self._send(buf) File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send n = write(self._handle, buf) RuntimeError: BrokenPipeError: [Errno 32] Broken pipe [while running 'WriteToBigQuery/BigQueryBatchFileLoads/ParDo(WriteRecordsToFile)/ParDo(WriteRecordsToFile)-ptransform-3851']

    opened by aswinramakrishnan 1
  • Not able to docker or gcloud submit

    Hi Travis, thank you for creating the geobeam package for our requirement.

    I am raising an issue here to just keep track.

    While using docker build -

     ---> 56341244044b
    Step 9/23 : RUN wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}
     ---> Running in 72bc532c27b8
    The command '/bin/sh -c wget -q https://curl.haxx.se/download/curl-${CURL_VERSION}.tar.gz     && tar -xzf curl-${CURL_VERSION}.tar.gz && cd curl-${CURL_VERSION}     && ./configure --prefix=/usr/local     && echo "building CURL ${CURL_VERSION}..."     && make --quiet -j$(nproc) && make --quiet install     && cd $WORKDIR && rm -rf curl-${CURL_VERSION}.tar.gz curl-${CURL_VERSION}' returned a non-zero code: 5
    (global-env) [email protected] dataflow-geobeam % 
    (global-env) [email protected] dataflow-geobeam % 
    (global-env) [email protected] dataflow-geobeam % docker image ls                                                                                                        
    REPOSITORY                                                      TAG        IMAGE ID       CREATED         SIZE
    <none>                                                          <none>     56341244044b   5 minutes ago   2.55GB
    

    While trying to run the gcloud builds submit command -

    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp  -fPIC -DPIC -o .libs/BufferOp.o
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferOp.lo -MD -MP -MF .deps/BufferOp.Tpo -c BufferOp.cpp -o BufferOp.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferBuilder.lo -MD -MP -MF .deps/BufferBuilder.Tpo -c BufferBuilder.cpp -o BufferBuilder.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp  -fPIC -DPIC -o .libs/BufferParameters.o
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferParameters.lo -MD -MP -MF .deps/BufferParameters.Tpo -c BufferParameters.cpp -o BufferParameters.o >/dev/null 2>&1
    libtool: compile:  g++ -std=c++11 -DHAVE_CONFIG_H -I. -I../../../include -I../../../include -DGEOS_INLINE -Wsuggest-override -pedantic -Wall -Wno-long-long -ffp-contract=off -DUSE_UNSTABLE_GEOS_CPP_API -g -O2 -MT BufferSubgraph.lo -MD -MP -MF .deps/BufferSubgraph.Tpo -c BufferSubgraph.cpp  -fPIC -DPIC -o .libs/BufferSubgraph.o
    ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     
    Your build timed out. Use the [--timeout=DURATION] flag to change the timeout threshold.
    ERROR: (gcloud.builds.submit) build ee748b6f-5347-4061-a81b-7f46959086c5 completed with status "TIMEOUT"
    
    documentation customer-reported 
    opened by aswinramakrishnan 1
  • Docstring type of fn.format_record is type but takes in string

    Hi! Cool project that I want to test out but I noticed an inconsistency with the docstring and the code. Not sure which should be followed.

    https://github.com/GoogleCloudPlatform/dataflow-geobeam/blob/21479252be373b795a5c7d6626021b01a042e5de/geobeam/fn.py#L67-L91

    The docstring should be

            band_type (str, optional): Default to int. The data type of the
                raster band column to store in the database.
    ...
        p | beam.Map(geobeam.fn.format_record,
            band_column='elev', band_type='float')
    

    or the code should be

    def format_record(element, band_column=None, band_type=int):
        import json
    
        props, geom = element
        cast = band_type
    

    Thanks!

    opened by jtmiclat 1
  • Create BQ table from shapefile

    Changes covering creation of the BQ table from the shp file schema, plus examples of loading a "generic shapefile". By doing so we avoid the headache of creating the table beforehand, so any SHP can be loaded.

    opened by Vadoid 0
  • [BEAM-12879] Issue affecting examples

    [BEAM-12879] Issue - https://issues.apache.org/jira/browse/BEAM-12879?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

    Affecting the examples when reading files from the gs://geobeam bucket. Workaround: download the zip files and put them in your own bucket.

    opened by Vadoid 0
  • Create BQ table from SHP schema

    Changes covering creation of the BQ table from the shp file schema, plus examples of loading a "generic shapefile". By doing so we avoid the headache of creating the table beforehand, so any SHP can be loaded.

    opened by Vadoid 0
  • Geobeam Util get_bigquery_schema_dataflow only for GDB files

    As far as I understand, the current implementation of get_bigquery_schema_dataflow only works for GDB files and doesn't work for SHP files. Should we make the autoschema work for shapefiles as well via Fiona?

    opened by Vadoid 0
  • Updated the following: Dockerfile to get more recent gdal version

    • beam sdk 2.36.0 --> 2.40.0
    • addition of metview binary with the fix for debian
    • curl 7.73.0 --> 7.83.1 (OpenSSL needed)
    • geos 3.9.0 --> 3.10.3
    • sqlite 3330000 --> 3380500
    • proj 7.2.1 --> 9.0.0 (using cmake)
    • openjpeg 2.3.1 --> 2.5.0
    • addition of hdf5 1.10.5
    • addition of netcdf-c 4.9.0
    • gdal 3.2.1 --> 3.5.1 (using cmake)
    • making sure numpy always gets installed with gdal
    • addition of gcloud components alpha

    Added a longer timeout to cloudbuild.yaml because the initial build takes at least 1h20min.

    opened by bahmandar 0
  • get_bigquery_schema_dataflow() issue and questions

    Hi, I am trying to use geobeam to ingest a shapefile into BigQuery, creating the table with a schema derived from the shapefile if the table does not exist. I came across a few issues and questions.

    I attempted this using a modified example, shapefile_nfhl.py, and ran it with this command:

    python -m shapefile_nfhl --runner DataflowRunner --project my-project --temp_location gs://mybucket-geobeam/data --region australia-southeast1 --worker_harness_container_image gcr.io/dataflow-geobeam/example --experiment use_runner_v2 --service_account_email [email protected] --gcs_url gs://geobeam/examples/510104_20170217.zip --dataset examples --table output_table
    

    Using get_bigquery_schema_dataflow() from geobeam.util throws an error due to an undefined variable:

    NameError: name 'gcs_url' is not defined
    

    I have opened a PR to fix this. #38

    Once the function is fixed, it seems that it does not accept a shapefile. Passing in the GCS URL of the zipped shapefile throws this error:

    Traceback (most recent call last):
      File "fiona/_shim.pyx", line 83, in fiona._shim.gdal_open_vector
      File "fiona/_err.pyx", line 291, in fiona._err.exc_wrap_pointer
    fiona._err.CPLE_OpenFailedError: '/vsigs/geobeam/examples/510104_20170217.zip' not recognized as a supported file format.
    

    Am I using the function in the wrong way, or are (zipped) shapefiles not supported for this? For reference, this is the modified template. Thank you!

    opened by muazamkamal 0
  • centroid_only = false error for a particular GeoTIFF dataset

    When ingesting this cropland dataset https://developers.google.com/earth-engine/datasets/catalog/USDA_NASS_CDL?hl=en#citations: if I set the centroid_only parameter to false, I get the following error: Failed 'Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 319; errors: 1. Please look into the errors[] collection for more details.' reason: 'invalid'> [while running 'WriteToBigQuery/BigQueryBatchFileLoads/WaitForDestinationLoadJobs-ptransform-31'

    Full steps to reproduce are in the 'Ingesting EE data to BQ' blog.

    opened by remylouisew 0