Template for a Dataflow Flex Template in Python

Overview

This repository contains a template for a Dataflow Flex Template written in Python. It can be used to build Dataflow jobs that run in STOIX using Dataflow runner.

The code uses the same example data as the Google Cloud Python quickstart: "King Lear", a tragedy written by William Shakespeare.

The Dataflow job reads the file content, counts the occurrences of each word, and inserts the counts into a BigQuery table. The scheduled date is also added to the table name, producing a sharded table for the output.
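
The two steps described above can be sketched in plain Python, without the Beam pipeline around them. This is only an illustration under stated assumptions: the function names are hypothetical, the tokenization rule (runs of letters and apostrophes, case-sensitive) is a guess, and the table suffix assumes the common BigQuery date-sharding convention of prefix_YYYYMMDD. The template's own code may differ.

```python
import re
from collections import Counter
from datetime import datetime


def count_words(text):
    """Count how often each word occurs in a text (hypothetical helper)."""
    # Assumption: a word is a run of letters or apostrophes, case-sensitive.
    return Counter(re.findall(r"[A-Za-z']+", text))


def sharded_table_name(prefix, scheduled):
    """Append the scheduled date to the table prefix (hypothetical helper)."""
    # datetime.fromisoformat rejects a trailing "Z" before Python 3.11,
    # so it is rewritten as an explicit UTC offset first.
    moment = datetime.fromisoformat(scheduled.replace("Z", "+00:00"))
    # Assumption: the shard suffix follows the prefix_YYYYMMDD convention.
    return "{}_{}".format(prefix, moment.strftime("%Y%m%d"))
```

With the prefix kinglear and the scheduled datetime 2021-01-01T00:00:00Z, sharded_table_name would return kinglear_20210101 under these assumptions.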

Source data:

Template maintained by STOIX.

Configuration

The job is configured with the following pipeline options:

  • stoix_scheduled - Scheduled datetime as RFC3339
  • input_file - Text to read
  • output_dataset - BigQuery dataset for output table
  • output_table_prefix - BigQuery output table name prefix
  • project - Google Cloud project id

When using Dataflow runner, stoix_scheduled is automatically set and other pipeline options can be added as described in the Dataflow runner README.
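
As a rough sketch of how these options could be declared, the example below uses the standard library's argparse as a stand-in. In a Beam pipeline the same add_argument calls would typically live in the _add_argparse_args class method of a PipelineOptions subclass. The function names are hypothetical, and marking every option as required is an assumption, not something the template states.

```python
import argparse


def build_parser():
    """Declare the pipeline options listed above (hypothetical helper)."""
    parser = argparse.ArgumentParser(allow_abbrev=False)
    # Assumption: all five options are mandatory.
    parser.add_argument("--stoix_scheduled", required=True,
                        help="Scheduled datetime as RFC3339")
    parser.add_argument("--input_file", required=True,
                        help="Text to read")
    parser.add_argument("--output_dataset", required=True,
                        help="BigQuery dataset for output table")
    parser.add_argument("--output_table_prefix", required=True,
                        help="BigQuery output table name prefix")
    parser.add_argument("--project", required=True,
                        help="Google Cloud project id")
    return parser


def parse_options(argv):
    """Split argv into the known options and the remaining runner flags."""
    # parse_known_args leaves unknown flags such as --runner untouched,
    # so they can be passed on to the runner.
    return build_parser().parse_known_args(argv)
```

Using parse_known_args rather than parse_args means flags the sketch does not declare, such as --runner or --region, are returned separately instead of causing an error.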

Test the code

Tox is used to format, test and lint the code. Install it with pip install tox, then run tox in the project folder.

Run pipeline

In order to work with the code locally, you can use Python virtual environments. Make sure to use Python version 3.7.10 as it is the version supported by Google Dataflow.

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -e .

Run on local machine

See the Python quickstart for a further description of the arguments.

python -m main \
    --region europe-north1 \
    --runner DirectRunner \
    --stoix_scheduled 2021-01-01T00:00:00Z \
    --input_file gs://dataflow-samples/shakespeare/kinglear.txt \
    --output_table_prefix kinglear \
    --output_dataset <dataset> \
    --project <project> \
    --temp_location gs://<bucket>/tmp/

Build Docker image for STOIX

To run the pipeline, the Flex Template needs to be packaged in a Docker image and pushed to a Docker image repository. This example uses Docker Hub.

Set the tag to the name and version of your pipeline, e.g. stoix/count-words:1.0.0.

$ docker build --tag stoix/count-words:1.0.0 .

Then upload the image to the Docker image repository.

$ docker push stoix/count-words:1.0.0

Run Dataflow on STOIX

Now the Dataflow Flex Template job can be run using Dataflow runner. Add a new job with the image stoix/dataflow-runner and the following environment variables:

  • GCP_PROJECT_ID:
  • GCP_REGION: europe-north1
  • GCP_SERVICE_ACCOUNT: BASE64 encoded service account JSON
  • JOB_IMAGE: stoix/count-words:1.0.0
  • JOB_NAME_PREFIX: count-words
  • JOB_PARAM_INPUT_FILE: gs://dataflow-samples/shakespeare/kinglear.txt
  • JOB_PARAM_OUTPUT_DATASET: dataflow
  • JOB_PARAM_OUTPUT_TABLE_PREFIX: kinglear
  • JOB_SDK_LANGUAGE: python
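
The JOB_PARAM_ variables appear to line up with the pipeline options from the Configuration section: JOB_PARAM_INPUT_FILE with input_file, and so on. The sketch below shows one way such a mapping could work. It is an assumption inferred from the naming only, not Dataflow runner's actual implementation, and job_params_to_args is a hypothetical name.

```python
def job_params_to_args(env):
    """Turn JOB_PARAM_* variables into pipeline arguments (hypothetical)."""
    prefix = "JOB_PARAM_"
    args = []
    # Sorted only to make the output order deterministic.
    for name in sorted(env):
        if name.startswith(prefix):
            # Assumption: the option name is the lowercased remainder.
            option = name[len(prefix):].lower()
            args.append("--{}={}".format(option, env[name]))
    return args
```

Variables without the prefix, such as JOB_NAME_PREFIX or JOB_IMAGE, are ignored by this sketch because they configure the job rather than the pipeline.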

Note: When running this in production, set GCP_SERVICE_ACCOUNT as a secret instead of a plain environment variable.
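
The value for GCP_SERVICE_ACCOUNT can be produced from the service account key with a few lines of Python. This is a minimal sketch assuming standard (not URL-safe) Base64 over the UTF-8 JSON text. The README does not specify the variant, and encode_service_account is a hypothetical helper name.

```python
import base64
import json


def encode_service_account(key_json):
    """Base64-encode the text of a service account JSON key (hypothetical)."""
    # Parse first so that a malformed key fails here rather than at job start.
    json.loads(key_json)
    # Assumption: standard Base64 alphabet, returned as an ASCII string.
    return base64.b64encode(key_json.encode("utf-8")).decode("ascii")
```

Read the downloaded key file as text, pass its content to the function, and store the returned string as the secret.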

License

MIT
