Template for a Dataflow Flex Template in Python

Overview

Dataflow Flex Template in Python

This repository contains a template for a Dataflow Flex Template written in Python that can be used to build Dataflow jobs that run in STOIX using the Dataflow runner.

The code is based on the same example data as the Google Cloud Python Quickstart: "King Lear", a tragedy written by William Shakespeare.

The Dataflow job reads the file content, counts the occurrences of each word, and inserts the results into a BigQuery table. The scheduled date is also added to the table name, producing a sharded table for the output.

Source data: gs://dataflow-samples/shakespeare/kinglear.txt

Template maintained by STOIX.
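The pipeline described above roughly follows the structure sketched below. This is a minimal illustration of the standard Apache Beam word-count pattern, not the exact contents of main.py; the options class name, the word-splitting regex, the BigQuery schema, and the YYYYMMDD shard suffix are assumptions.

import re

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class CountWordsOptions(PipelineOptions):
    """Pipeline options listed in the Configuration section (sketch)."""

    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_argument("--stoix_scheduled")  # RFC 3339 datetime
        parser.add_argument("--input_file")
        parser.add_argument("--output_dataset")
        parser.add_argument("--output_table_prefix")


def run():
    options = PipelineOptions()
    opts = options.view_as(CountWordsOptions)

    # Shard the output table by the scheduled date, e.g. kinglear_20210101
    # (the suffix format is an assumption).
    shard = opts.stoix_scheduled[:10].replace("-", "")
    table = f"{opts.output_dataset}.{opts.output_table_prefix}_{shard}"

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read" >> beam.io.ReadFromText(opts.input_file)
            | "Split" >> beam.FlatMap(lambda line: re.findall(r"[\w']+", line))
            | "Count" >> beam.combiners.Count.PerElement()
            | "ToRow" >> beam.Map(lambda kv: {"word": kv[0], "count": kv[1]})
            | "Write" >> beam.io.WriteToBigQuery(
                table, schema="word:STRING,count:INTEGER"
            )
        )


if __name__ == "__main__":
    run()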

Configuration

The job is configured with the following pipeline options:

  • stoix_scheduled - Scheduled datetime in RFC 3339 format
  • input_file - Text file to read
  • output_dataset - BigQuery dataset for output table
  • output_table_prefix - BigQuery output table name prefix
  • project - Google Cloud project id

When using Dataflow runner, stoix_scheduled is automatically set and other pipeline options can be added as described in the Dataflow runner README.

Test the code

Tox is used to format, test and lint the code. Make sure to install it with pip install tox and then just run tox within the project folder.
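The two commands from the paragraph above:

$ pip install tox
$ tox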

Run pipeline

In order to work with the code locally, you can use a Python virtual environment. Make sure to use Python version 3.7.10 as it is the version supported by Google Dataflow.

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e .

Run on local machine

See the Python quickstart for a further description of the arguments.

python -m main \
    --region europe-north1 \
    --runner DirectRunner \
    --stoix_scheduled 2021-01-01T00:00:00Z \
    --input_file gs://dataflow-samples/shakespeare/kinglear.txt \
    --output_table_prefix kinglear \
    --output_dataset DATASET \
    --project PROJECT_ID \
    --temp_location gs://BUCKET/tmp/

Replace DATASET, PROJECT_ID, and BUCKET with your BigQuery dataset, Google Cloud project id, and Cloud Storage bucket.

Build Docker image for STOIX

In order to run the pipeline, the Flex Template needs to be packaged as a Docker image and pushed to a Docker image repository. In this example Docker Hub is used.
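A Flex Template image is typically built on top of Google's Python template launcher base image, with environment variables pointing the launcher at the pipeline entry point. The Dockerfile below is a hypothetical sketch, not necessarily the one shipped in this repository; the main.py and setup.py file names are assumptions based on the commands in this README.

# Hypothetical sketch of a Python Flex Template Dockerfile.
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base

ARG WORKDIR=/dataflow/template
WORKDIR ${WORKDIR}

COPY . ${WORKDIR}/

# Tell the Flex Template launcher where to find the pipeline (assumed file names).
ENV FLEX_TEMPLATE_PYTHON_SETUP_FILE="${WORKDIR}/setup.py"
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"

RUN pip install --no-cache-dir -e .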

Set the tag to the name and version of your pipeline, e.g. stoix/count-words:1.0.0.

$ docker build --tag stoix/count-words:1.0.0 .

Then upload the image to the Docker image repository.

$ docker push stoix/count-words:1.0.0

Run Dataflow on STOIX

Now the Dataflow Flex Template job can be run using Dataflow runner. Add a new job with the image stoix/dataflow-runner and the following environment variables:

  • GCP_PROJECT_ID: Google Cloud project id
  • GCP_REGION: europe-north1
  • GCP_SERVICE_ACCOUNT: BASE64 encoded service account JSON
  • JOB_IMAGE: stoix/count-words:1.0.0
  • JOB_NAME_PREFIX: count-words
  • JOB_PARAM_INPUT_FILE: gs://dataflow-samples/shakespeare/kinglear.txt
  • JOB_PARAM_OUTPUT_DATASET: dataflow
  • JOB_PARAM_OUTPUT_TABLE_PREFIX: kinglear
  • JOB_SDK_LANGUAGE: python

Note: When running this in production, set GCP_SERVICE_ACCOUNT as a secret instead of an environment variable.
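For example, the Base64 value can be produced from a downloaded service account key (assuming GNU coreutils; the file name is a placeholder):

$ base64 -w 0 service-account.json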

License

MIT

Owner
STOIX