Data Orchestration Platform

Related tags

Miscellaneousdop
Overview

Table of contents

What is DOP

Design Concept

DOP is designed to simplify the orchestration effort across many connected components using a configuration file without the need to write any code. We have a vision to make orchestration easier to manage and more accessible to a wider group of people.

Here are some of the key design concept behind DOP,

  • Built on top of Apache Airflow - Utilises it’s DAG capabilities with interactive GUI
  • DAGs without code - YAML + SQL
  • Native capabilities (SQL) - Materialisation, Assertion and Invocation
  • Extensible via plugins - DBT job, Spark job, Egress job, Triggers, etc
  • Easy to setup and deploy - fully automated dev environment and easy to deploy
  • Open Source - open sourced under the MIT license

Please note that this project is heavily optimised to run with GCP (Google Cloud Platform) services which is our current focus. By focusing on one cloud provider, it allows us to really improve on end user experience through automation

A Typical DOP Orchestration Flow

Typical DOP Flow

Prerequisites - Run in Docker

Note that all the IAM related prerequisites will be available as a Terraform template soon!

For DOP Native Features

  1. Download and install Docker https://docs.docker.com/get-docker/ (if you are on Windows, please follow instruction here as there are some additional steps required for it to work https://docs.docker.com/docker-for-windows/install/)
  2. Download and install Google Cloud Platform (GCP) SDK following instructions here https://cloud.google.com/sdk/docs/install.
  3. Create a dedicated service account for docker with limited permissions for the development GCP project, the Docker instance is not designed to be connected to the production environment
    1. Call it dop-docker-user@<your GCP project id> and create it in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  4. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role on thedevelopment project just for the dop-docker-user service account in order to enable Service Account Impersonation.
    Grant service account user
  5. Authenticating with your GCP environment by typing in gcloud auth application-default login in your terminal and following instructions. Make sure you proceed to the stage where application_default_credentials.json is created on your machine (For windows users, make a note of the path, this will be required on a later stage)
  6. Clone this repository to your machine.

For DBT

  1. Setup a service account for your GCP project called dop-dbt-user in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
  2. Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
  3. Your GCP user / group will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role on the development project just for the dop-dbt-user service account in order to enable Service Account Impersonation.

Instructions for Setting things up

Run Airflow with DOP in Docker - Mac

See README in the service project setup and follow instructions.

Once it's setup, you should see example DOP DAGs such as dop__example_covid19 Airflow in Docker

Run Airflow with DOP in Docker - Windows

This is currently working in progress, however the instructions on what needs to be done is in the Makefile

Run on Composer

Prerequisites

  1. Create a dedicate service account for Composer and call it dop-composer-user with following roles at project level
    • roles/bigquery.dataEditor
    • roles/bigquery.jobUser
    • roles/composer.worker
    • roles/compute.viewer
  2. Create a dedicated service account for DBT with limited permissions.
    1. [Already done in here if it’s DEV] Call it dop-dbt-user@<GCP project id> and create in https://console.cloud.google.com/iam-admin/serviceaccounts?project=<your GCP project id>
    2. [Already done in here if it’s DEV] Assign the roles/bigquery.dataEditor and roles/bigquery.jobUser role to the service account at project level under https://console.cloud.google.com/iam-admin/iam?project=<your GCP project id>
    3. The dop-composer-user will need to be given the roles/iam.serviceAccountUser and the roles/iam.serviceAccountTokenCreator role just for the dop-dbt-user service account in order to enable Service Account Impersonation.

Create Composer Cluster

  1. Use the service account already created dop-composer-user instead of the default service account
  2. Use the following environment variables
    DOP_PROJECT_ID : {REPLACE WITH THE GCP PROJECT ID WHERE DOP WILL PERSIST ALL DATA TO}
    DOP_LOCATION : {REPLACE WITH GCP REGION LOCATION WHRE DOP WILL PERSIST ALL DATA TO}
    DOP_SERVICE_PROJECT_PATH := {REPLACE WITH THE ABSOLUTE PATH OF THE Service Project, i.e. /home/airflow/gcs/dags/dop_{service project name}
    DOP_INFRA_PROJECT_ID := {REPLACE WITH THE GCP INFRASTRUCTURE PROJECT ID WHERE BUILD ARTIFACTS ARE STORED, i.e. a DBT docker image stored in GCR}
    
    and optionally
    DOP_GCR_PULL_SECRET_NAME:= {This maybe needed if the project storing the gcr images are not he same as where Cloud Composer runs, however this might be a better alternative https://medium.com/google-cloud/using-single-docker-repository-with-multiple-gke-projects-1672689f780c}
    
  3. Add the following Python Packages
    dataclasses==0.7
    
  4. Finally create a new node pool with the following k8 label
    key: cloud.google.com/gke-nodepool
    value: kubernetes-task-pool
    

Deployment

See Service Project README

Misc

Service Account Impersonation

Impersonation is a GCP feature allows a user / service account to impersonate as another service account.
This is a very useful feature and offers the following benefits

  • When doing development locally, especially with automation involved (i.e using Docker), it is very risky to interact with GCP services by using your user account directly because it may have a lot of permissions. By impersonate as another service account with less permissions, it is a lot safer (least privilege)
  • There is no credential needs to be downloaded, all permissions are linked to the user account. If an employee leaves the company, access to GCP will be revoked immediately because the impersonation process is no longer possible

The following diagram explains how we use Impersonation in DOP when it runs in Docker DOP Docker Account Impersonation

And when running DBT jobs on production, we are also using this technique to use the composer service account to impersonate as the dop-dbt-user service account so that service account keys are not required.

There are two very google articles explaining how impersonation works and why using it

You might also like...
Cross-platform config and manager for click console utilities.

climan Help the project financially: Donate: https://smartlegion.github.io/donate/ Yandex Money: https://yoomoney.ru/to/4100115206129186 PayPal: https

YourCity is a platform to match people to their prefect city.
YourCity is a platform to match people to their prefect city.

YourCity YourCity is a city matching App that matches users to their ideal city. It is a fullstack React App made with a Redux state manager and a bac

A multi-platform fuzzer for poking at userland binaries and servers

litefuzz A multi-platform fuzzer for poking at userland binaries and servers litefuzz intro why how it works what it does what it doesn't do support p

A platform for developers 👩‍💻  who wants to share their programs and projects.
A platform for developers 👩‍💻 who wants to share their programs and projects.

Fest-Practice-2021 This project is excluded from Hacktoberfest 2021. Please use this as a testing repo/project. A platform for developers 👩‍💻 who wa

Speed up your typing by some exercises in the multi-platform(Windows/Ubuntu).

Introduction This project purpose is speed up your typing by some exercises in the multi-platform(Windows/Ubuntu). Build Environment Software Environm

An Airdrop alternative for cross-platform users only for desktop with Python

PyDrop An Airdrop alternative for cross-platform users only for desktop with Python, -version 1.0 with less effort, just as a practice. ##############

Platform Tree for Xiaomi Redmi Note 7/7S (lavender)
Platform Tree for Xiaomi Redmi Note 7/7S (lavender)

The Xiaomi Redmi Note 7 (codenamed "lavender") is a mid-range smartphone from Xiaomi announced in January 2019. Device specifications Device Xiaomi Re

A Classroom Engagement Platform

Project Introduction This is project introduction Setup Setting up Postgres This is the most tricky part when setting up the application. You will nee

Traffic flow test platform, especially for reinforcement learning
Traffic flow test platform, especially for reinforcement learning

Traffic Flow Test Platform Traffic flow test platform, especially for reinforcement learning, named TFTP. A traffic signal control framework that can

Comments
  • Release DOP v0.3.0

    Release DOP v0.3.0

    A number of new features where added in this version

    DOP v0.3.0 — 2021-08-11

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt tasks artifacts run_results created by dbt tasks saved to BigQuery: This json file contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure python, yaml and some support for plain text file consistency in formatting and style throughout DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: Safety feature to safely allow selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. This avoids creating inter-dependant workflows to avoid overriding each other's artifacts, since they will share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning of datetime type properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: with the introduction of "regular" in the yaml config Airflow Operators, it is now possible to run compute intensive Dataflow jobs. Check example_dataflow_template for an example on how to implement a Dataflow pipeline.

    opened by dinigo 0
Releases(v0.3.0)
  • v0.3.0(Aug 17, 2021)

    Features

    • Support for "generic" airflow operators: you can now use regular python operators as part of your config files.

    • Support for “dbt docs” command to generate documentation for all dbt tasks: Users can now add “docs generate” as a target in their DOP configuration and additionally specify a GCS bucket with the --bucket and --bucket-path options where documents are copied to.

    • Serve dbt docs: Documents generated by dbt can be served as a web page by deploying the provided app on GAE. Note that deploying is an additional step that needs to be carried out after docs have been generated. See infrastructure/dbt-docs/README.md for details.

    • dbt tasks artifacts run_results created by dbt tasks saved to BigQuery: This json file contains information on completed dbt invocations and is saved in the BQ table “run_results” for analysis and debugging.

    • Add support for Airflow v1.10.14 and v1.10.15 local environments: Users can specify which version they want to use by setting the AIRFLOW_VERSION environment variable.

    • Pre-commit linters: added pre-commit hooks to ensure python, yaml and some support for plain text file consistency in formatting and style throughout DOP codebase.

    Changes

    • Ensure DAGs using the same DBT project do not run concurrently: Safety feature to safely allow selective execution of workflows by calling specific commands or tags (e.g. dbt run --m) within a single dbt project. This avoids creating inter-dependant workflows to avoid overriding each other's artifacts, since they will share the same target location (within the dbt container).

    • Test time-partitioning: Time-partitioning of datetime type properly validated as part of schema validation.

    • Use Python 3.7 and dbt 0.19.1 in Composer K8s Operator

    • Add Dataflow example task: with the introduction of "regular" in the yaml config Airflow Operators, it is now possible to run compute intensive Dataflow jobs. Check example_dataflow_template for an example on how to implement a Dataflow pipeline.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Mar 30, 2021)

Owner
Datatonic
We accelerate business impact through Machine Learning and Analytics
Datatonic
Modern API wrapper for Genshin Impact built on asyncio and pydantic.

genshin.py Modern API wrapper for Genshin Impact built on asyncio and pydantic.

sadru 212 Jan 06, 2023
Module for working with the site dnevnik.ru with python

dnevnikru Module for working with the site dnevnik.ru with python Dnevnik object accepts login and password from the dnevnik.ru account Methods: homew

Aleksandr 21 Nov 21, 2022
A napari plugin to inspect data within a cisTEM project

napari-cistem A plugin to inspect data within a cisTEM project This napari plugin was generated with Cookiecutter using with @napari's cookiecutter-na

Johannes Elferich 1 Nov 07, 2021
使用京东cookie一键生成所有退会链接

JDMemberCloseLinks 本项目旨在使用京东cookie一键生成所有退会链接

hyzaw 68 Jun 10, 2022
MiniJVM is simple java virtual machine written by python language, it can load class file from file system and run it.

MiniJVM MiniJVM是一款使用python编写的简易JVM,能够从本地加载class文件并且执行绝大多数指令。 支持的功能 1.从本地磁盘加载class并解析 2.支持绝大多数指令集的执行 3.支持虚拟机内存分区以及对象的创建 4.支持方法的调用和参数传递 5.支持静态代码块的初始化 不支

keguoyu 60 Apr 01, 2022
Ingestinator is my personal VFX pipeline tool for ingesting folders containing frame sequences that have been pulled and downloaded to a local folder

Ingestinator Ingestinator is my personal VFX pipeline tool for ingesting folders containing frame sequences that have been pulled and downloaded to a

Henry Wilkinson 2 Nov 18, 2022
Here You will Find CodeChef Challenge Solutions

Here You will Find CodeChef Challenge Solutions

kanishk kashyap 1 Sep 03, 2022
RestMapper takes the pain out of integrating with RESTful APIs.

python-restmapper RestMapper takes the pain out of integrating with RESTful APIs. It removes all of the complexity with writing API-specific code, and

Lionheart Software 8 Oct 31, 2020
Utility functions for working with data from Nix in Python

Pynixutil - Utility functions for working with data from Nix in Python Examples Base32 encoding/decoding import pynixutil input = "v5sv61sszx301i0x6x

Tweag 11 Dec 16, 2022
kurwa deska ADB

kurwa-deska-ADB kurwa-deska Запуск Linux -- python3 kurwa_deska.py Termux -- python3 kurwa_deska.py Встановлення cd kurwa_deska ADB і зразу запуск pyt

1 Jan 21, 2022
A collection of convenient parsers for Advent of Code problems.

Advent of Code Parsers A collection of convenient Python parsers for Advent of Code problems. Installation pip install aocp Quickstart You can import

Miguel Blanco Marcos 3 Dec 13, 2021
Hypothesis strategies for generating Python programs, something like CSmith

hypothesmith Hypothesis strategies for generating Python programs, something like CSmith. This is definitely pre-alpha, but if you want to play with i

Zac Hatfield-Dodds 73 Dec 14, 2022
Solutions to the language assignment for Internship in JALA Technologies.

Python Assignment Solutions (JALA Technologies) Solutions to the language assignment for Internship in JALA Technologies. Features Properly formatted

Samyak Jain 2 Jan 17, 2022
This is a Saleae Logic custom high level analyzer that allows you to search and mark specific packets.

SaleaePacketParser This is a Saleae Logic custom high level analyzer that allows you to search and mark specific packets. Field "Search For" is used f

1 Dec 16, 2021
Python-geoarrow - Storing geometry data in Apache Arrow format

geoarrow Storing geometry data in Apache Arrow format Installation $ pip install

Joris Van den Bossche 11 Mar 03, 2022
Generate PNG filles from NFO files.

Installation git clone https://github.com/pcroland/nfopng cd nfopng pip install -r requirements.txt Usage ❯ ./nfopng.py usage: nfopng.py [-h] [-v] [-i

4 Jun 26, 2022
This is a library to do functional programming in Python.

Fpylib This is a library to do functional programming in Python. Index Fpylib Index Features Intelligents Ranges with irange Lazyness to functions Com

Fabián Vega Alcota 4 Jul 17, 2022
The tool helps to find hidden parameters that can be vulnerable or can reveal interesting functionality that other hunters miss.

The tool helps to find hidden parameters that can be vulnerable or can reveal interesting functionality that other hunters miss. Greater accuracy is achieved thanks to the line-by-line comparison of

197 Nov 14, 2022
management tool for systemd-nspawn containers

nspctl nspctl, management tool for systemd-nspawn containers. Why nspctl? There are different tools for systemd-nspawn containers. You can use native

Emre Eryilmaz 5 Nov 27, 2022
GibMacOS - Py2/py3 script that can download macOS components direct from Apple

Py2/py3 script that can download macOS components direct from Apple Can also now build Internet Recovery USB installers from Windows using dd and 7zip

CorpNewt 4.8k Jan 02, 2023