Viewflow is an Airflow-based framework that allows data scientists to create data models without writing Airflow code.

Overview

Viewflow

Viewflow is a framework built on top of Airflow that enables data scientists to create materialized views. It allows data scientists to focus on the logic of the view creation in their preferred tool (e.g., SQL, Python).

Viewflow automatically creates Airflow DAGs and tasks based on SQL or Python files.

One of Viewflow's major features is its ability to manage task dependencies, i.e., the views used to create another view. Viewflow automatically extracts the internal and external dependencies from the code (SQL query or Python script). An internal dependency is a view that belongs to the same DAG as the view being created; an external dependency is a view that belongs to a different DAG. The benefits of automatic dependency management are twofold: first, data scientists don't have to list dependencies manually, which is usually an error-prone process; second, it makes sure that no view is built on stale data, because all upstream views are updated beforehand.
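For example, consider the hypothetical Python view below. The view names are borrowed from the demo, but the DAG assignment is an assumption for illustration, and the required metadata docstring is omitted (see Section Python views). Viewflow scans the tables the code reads to infer the dependencies:

import pandas as pd

# Hypothetical view living in DAG "dag-a"
def user_leaderboard(db_engine):
    # user_xp is produced by a view in the same DAG -> internal dependency
    xp = pd.read_sql_table("user_xp", db_engine, schema="viewflow_demo")
    # course_enriched is produced by a view in another DAG -> external dependency
    courses = pd.read_sql_table("course_enriched", db_engine, schema="viewflow_demo")
    top = xp.sort_values("xp", ascending=False).head(10).copy()
    top["courses_available"] = len(courses)  # simple use of the external view
    return top

Viewflow would then make sure that user_xp runs before user_leaderboard in the same DAG, and that course_enriched is up to date before user_leaderboard is built.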

Currently, Viewflow supports SQL and Python views, with PostgreSQL/Redshift as a destination. We will continue improving Viewflow by adding new view types (e.g., R, Jupyter Notebooks, ...) and destinations (e.g., Snowflake, BigQuery, ...).

Viewflow demo

We created a demo that shows how Viewflow works. The demo creates two DAGs: viewflow-demo-1 and viewflow-demo-2. These DAGs create a total of four views in a local Postgres database. Check out the view files in demo/dags/.

Run the demo

We use docker-compose to instantiate an Apache Airflow instance and a Postgres database. The Airflow container and the Postgres container are defined in the docker-compose.yml file. The first time you run the demo, you have to build the Apache Airflow Docker image that embeds Viewflow:

docker-compose build

Then run the docker containers:

docker-compose up

Go to your local Apache Airflow instance at http://localhost:8080. You should see two DAGs, viewflow-demo-1 and viewflow-demo-2.

By default, the DAGs are disabled. Turn the DAGs on by clicking the Off toggle; this triggers them.

Query the views

Once the DAGs have run and all tasks have completed, you can query the views that Viewflow created in the local Postgres database spun up by Docker. You can use any Postgres client (note that Postgres runs locally on port 5433):

psql -h localhost -p 5433 -U airflow -d airflow

Use airflow when psql asks you for the user password.

There are two schemas: viewflow_raw and viewflow_demo. The first contains three tables: users, courses, and user_course; they hold the raw data. The second, viewflow_demo, is the schema in which Viewflow stores the views it creates.

\dn

+---------------+---------+
| Name          | Owner   |
|---------------+---------|
| public        | airflow |
| viewflow_demo | airflow |
| viewflow_raw  | airflow |
+---------------+---------+

Viewflow created four views: user_xp (SQL), user_enriched (SQL), course_enriched (SQL), and top_3_user_xp (Python).

\dt viewflow_demo.

+---------------+-----------------+--------+---------+
| Schema        | Name            | Type   | Owner   |
|---------------+-----------------+--------+---------|
| viewflow_demo | course_enriched | table  | airflow |
| viewflow_demo | top_3_user_xp   | table  | airflow |
| viewflow_demo | user_enriched   | table  | airflow |
| viewflow_demo | user_xp         | table  | airflow |
+---------------+-----------------+--------+---------+

You can query these tables to see their data:

select * from viewflow_demo.user_xp;

+-----------+------+-----------------------+
| user_id   | xp   | __view_generated_at   |
|-----------+------+-----------------------|
| 1         | 750  | 2021-03-17            |
| 2         | 200  | 2021-03-17            |
| 3         | 550  | 2021-03-17            |
| 4         | 500  | 2021-03-17            |
| 5         | 650  | 2021-03-17            |
| 6         | 430  | 2021-03-17            |
| 7         | 300  | 2021-03-17            |
| 8         | 280  | 2021-03-17            |
| 9         | 100  | 2021-03-17            |
| 10        | 350  | 2021-03-17            |
+-----------+------+-----------------------+

You can also access the table and column comments:

select obj_description('viewflow_demo.user_enriched'::regclass) as view_description;

+---------------------------------------------+
| view_description                            |
|---------------------------------------------|
| A table that enriched information on a user |
+---------------------------------------------+

select
  column_name,
  col_description((table_schema || '.' || table_name)::regclass::oid, ordinal_position) as column_comment
from information_schema.columns
where table_schema = 'viewflow_demo'
  and table_name = 'user_enriched';

+--------------------------+-----------------------------------------------+
| column_name              | column_comment                                |
|--------------------------+-----------------------------------------------|
| user_id                  | The user id                                   |
| xp                       | The user amount of XP                         |
| last_course_completed_at | When was the last course completed by a user  |
| last_course_completed    | Name of the latest completed course by a user |
| number_courses_completed | Number of completed courses by a user         |
| __view_generated_at      | <null>                                        |
+--------------------------+-----------------------------------------------+

And that's it! Congrats on running the demo 🚀 If you want to play more with Viewflow, follow the installation instructions below.

Installation instructions

✉️ If you have any issue with the installation, configuration, or creation of your DAGs, do not hesitate to contact us!

The current installation process requires you to install Viewflow from the GitHub repository:

pip install git+https://github.com/datacamp/viewflow.git

(If you install Viewflow in a custom Airflow Docker image, this is the command to put in a RUN instruction in your Dockerfile.)

Create a new DAG

Viewflow creates the DAGs automatically based on configuration files.

Here are the steps to create a DAG for the first time.

Create the Viewflow main script

In your Airflow DAG directory (usually $AIRFLOW_HOME/dags), create a Python script called viewflow-dags.py that contains the following code:

from viewflow import create_dags

DAG = create_dags("./dags", globals(), "<views_schema_name>")

This script is executed by Airflow. It calls the main Viewflow function that creates your DAGs. The first parameter is the directory in which your DAG folders are located. The second parameter, globals(), is the module namespace in which Viewflow registers the generated DAGs so that Airflow can discover them. The third parameter is the schema name in your data warehouse, where your views will be materialized.

Create an Airflow connection to your destination

Viewflow needs to know where to write the views. It uses an Airflow connection that is referred to in the view files. Currently, Viewflow supports Postgres (or Redshift) data warehouses. Please look at the Airflow documentation to create a Postgres connection (for example, in the Airflow UI under Admin > Connections).

Create your DAG directories

In Viewflow, the DAGs are created based on a configuration file and on the SQL and Python files in the same directory.

In $AIRFLOW_HOME/dags/, create a directory called my-first-viewflow-dag. In this directory, create a config.yml file that contains the following yml fields:

default_args:
    owner: <owner>@dag.com
    retries: 1
schedule_interval: 0 6 * * *
start_date: "2021-01-01"

Adapt the values of each element to what suits you. The default_args element contains the Airflow default DAG parameters.

The schedule_interval and start_date elements are the Viewflow counterparts of Airflow's schedule_interval and start_date.

You can now add your SQL and Python files to this directory (see the sections below). This will create a new DAG in Airflow called my-first-viewflow-dag that is triggered every day at 6 AM UTC as of January 1, 2021. All failed tasks will be retried once.
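The resulting layout would look roughly like this (the view file names are placeholders):

$AIRFLOW_HOME/dags/
├── viewflow-dags.py            # the Viewflow main script (see above)
└── my-first-viewflow-dag/
    ├── config.yml              # the DAG configuration above
    ├── my_view.sql             # a SQL view (see Section SQL views)
    └── my_python_view.py       # a Python view (see Section Python views)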

SQL views

A SQL view is created by a SQL file whose name determines the view name. This file must contain the SQL query of your view (as a SELECT statement) and some metadata about your view. Here's an example:

/* 
---
owner: name-of-the-view-owner
description: A description of your view. It's used as the view's description in the database
fields:
  email: Description of your column -- used as the view column's description in the database
schema: schema_name_in_your_destination (e.g. viewflow_demo)
connection_id: airflow_destination_connection
--- 
*/

SELECT DISTINCT email FROM viewflow_raw.users

Python views

Please note that the implementation of the Python view should be considered beta. It is a newer implementation of the Python views we use at DataCamp.

A Python view is created from a Python script. The script must contain at least one function that returns a Pandas DataFrame and carries the view's metadata in its docstring.

Here's an example of a Python view:

import pandas as pd

def python_view(db_engine):
    """
    ---
    owner: name-of-the-view-owner
    description: A description of your view. It's used as the view's description in the database
    fields:
        email: Description of your column -- used as the view column's description in the database
    schema: schema_name_in_your_destination (e.g. viewflow_demo)
    connection_id: airflow_destination_connection
    ---
    """
    df = pd.read_sql_table("users", db_engine, schema="viewflow_raw")
    return df[["email"]]

Please note that Viewflow expects the Python function that creates the view to have the parameter db_engine (used to connect to the database). You don't have to set db_engine anywhere. Viewflow takes care of setting this variable.
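If you want to try a view function outside Airflow, you can build the engine yourself. Here is a minimal sketch, assuming the demo database from Section Run the demo and the python_view function above (the connection string is derived from the demo settings):

import sqlalchemy

# Build by hand the engine that Viewflow would normally inject as db_engine
engine = sqlalchemy.create_engine("postgresql://airflow:airflow@localhost:5433/airflow")

df = python_view(engine)  # the DataFrame that Viewflow would materialize
print(df.head())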

View metadata

Viewflow expects some metadata. Here are the fields that should be included, in YAML format:

  • owner: The owner of the view (i.e., who is responsible for the view). The owner appears in Airflow and lets users know whom to talk to if they have questions about the view.
  • description: What the view is about. Viewflow uses this field as the view's comment in the database. The description can be retrieved in SQL (see Section Query the views).
  • fields (list): Description of each column of the view. Viewflow uses these fields as column comments in the database. The column descriptions can be retrieved in SQL (see Section Query the views).
  • schema: The name of the schema in which Viewflow creates the view. It's also used by Viewflow to create the dependencies.
  • connection_id: Airflow connection name used to connect to the database (See Section Create an Airflow connection to your destination).
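Put together, a complete metadata block could look like this (the values are illustrative, not taken from the demo):

owner: data-team@example.com          # shown as the task owner in Airflow
description: Users enriched with XP   # stored as the table comment
fields:
  user_id: The user id                # stored as a column comment
schema: viewflow_demo                 # destination schema, also used for dependencies
connection_id: postgres_demo         # Airflow connection to the destination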

Contributing to Viewflow

We welcome all sorts of contributions, be it new features, bug fixes, or documentation; we encourage you to create a new PR. To create a new PR or to report new bugs, please read how to contribute to Viewflow.

In the remainder of this section, we show you how to prepare your environment to contribute to Viewflow.

Install Poetry

See https://python-poetry.org/docs/#osx-linux-bashonwindows-install-instructions for comprehensive documentation.

curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python

Install the dependencies

poetry install

Prepare your environment to run the tests

Postgres

Use docker-compose to set up a PostgreSQL database locally (password: passw0rd):

docker-compose -f docker-compose-test.yml up

If you get a message saying that port 5432 is in use, you have another PostgreSQL server running on your machine. If you installed it with Homebrew, you can stop it with brew services stop postgresql.

Import the fixtures into the local database:

psql -U user -W -h localhost -f tests/fixtures/load_postgres.sql -d viewflow

Run Pytest

Before you can run the tests, you will need an Airflow SQLite database. Run

poetry run airflow initdb

then,

poetry run pytest

Other useful commands include:

poetry run airflow resetdb # In case the database connection is set up incorrectly

Viewflow architecture

We built Viewflow around three main components: the parser, the adapter, and the dependency extractor.

The parser transforms a source file (e.g., SQL, Rmd, Python) that contains the view's metadata (e.g., the view's owner, the view's description, and the columns' descriptions) and the view's code into a Viewflow-specific data structure. This data structure is used by the other components of the Viewflow architecture: the adapter and the dependency extractor.

The adapter is the translation layer of Viewflow's views to their corresponding Airflow counterpart. It uses the data structure objects created by the parser to create an Airflow task object (i.e., an Airflow operator).

Finally, the dependency extractor uses the parser's data structure objects to set the internal and external dependencies to the Airflow task object created by the adapter.
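To make these components concrete, here is a minimal sketch of what the parser's output could look like. The names and the exact structure are assumptions for illustration, not Viewflow's actual internal data structure:

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ParsedView:
    # Hypothetical sketch of a parsed view, not Viewflow's real structure
    name: str          # view name, derived from the source file name
    schema: str        # destination schema, from the metadata
    owner: str         # view owner, from the metadata
    description: str   # stored as the table comment in the database
    fields: Dict[str, str] = field(default_factory=dict)   # column comments
    dependencies: List[str] = field(default_factory=list)  # views this view reads from

In this picture, the adapter would translate a ParsedView into an Airflow operator, and the dependency extractor would use its dependencies field to wire the tasks together.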

This architecture makes it easy to add new source file types in the future (e.g., Python notebooks, R Markdown).

Acknowledgments

Today's version of Viewflow is the result of a joint effort of current and former DataCampers. We would like to thank in particular the following people, who significantly contributed to Viewflow:

Comments
  • Callbacks for tasks

    A description of the new functionality has been added to the README under the subtitle 'Configuring callbacks'. In a nutshell, any number of callback functions can be defined in viewflow/task_callbacks.py. The default callbacks can be overwritten at the DAG level, but also for a specific task.

    The functionality has been tested successfully (in tests/test_callbacks.py).

    opened by pietervans
  • R and Rmd adapters

    Full support for R and Rmd scripts is implemented.

    The R and Rmd scripts are converted as follows:

    1. Parsing
    2. The scripts are extended: R code for connecting to the database, reading the necessary tables, and writing the new view to the database is added automatically. The extended scripts are saved in the $AIRFLOW_HOME/data directory
    3. The ROperator and RmdOperator simply execute the extended scripts

    The constraints on the scripts are analogous to the existing SQL and Python adapters:

    • the filename of the script = the name of the view
    • each script generates exactly one view

    The Rmd scripts are an exception to these constraints. By default, an Rmd script is simply executed as is; this is what DataCamp currently expects. By adding automate_read_write: True to the metadata, Rmd scripts will behave just like R scripts.

    Both adapters will by default expect views to be referenced as <schema_name>.<table_name>. This default behaviour can be changed by adding a new function in dependencies_r_patterns and adding a line dependency_function: custom_function to the metadata of the R(md) script.

    The format of the referenced tables is important for two applications:

    1. Extending the script to automatically read from/write to the database (optional for Rmd)
    2. Figuring out the dependencies for the DAG

    The task top_3_user_xp_duplicate.Rmd contains an explanation of the implemented mechanism for Rmd scripts.

    Two simple DAGs have been added to demonstrate the R(md) adapters: viewflow-demo-[3|4]. Some unit tests have also been added. The README.md does not yet explain anything about the R(md) adapters, as changes are still likely.

    opened by pietervans
  • Compatibility with Airflow 2 and Airflow 1.10

    The demo and tests work with both Airflow 2 and Airflow 1.10. The instructions in README.md have been updated. I also added a few links to documentation/posts for troubleshooting that have been useful to me.

    Since a new user will use Airflow 2, this is the default option. The files that differ for Airflow 1.10 have been put in the [email protected] directory.

    opened by pietervans
  • Compatibility with Airflow 1.10 & 2

    The tests now run successfully on multiple Airflow versions; I tested Airflow 1.10.12 and 2.1.1.

    There are still a few remaining warnings, but they are caused by the other packages that are used. If using Airflow 2, there's only one warning, caused by the sqlalchemy package.

    The README.md file has also been updated to contain valid instructions for Airflow versions 1.10.12 through 2.1.1.

    opened by pietervans
  • Incremental updates for materialized views

    Implementation of incremental updates built on the existing SQL Postgres operator. An explanation of the use case and usage of the operator can be found in the README.md file (subtitle: 'Incremental updates').

    Successfully tested: tests/adapters/test_postgres_incremental.py

    An example task is also included in the Docker demo: demo/dags/viewflow-demo-2/emails_blog.sql. I changed the folder structure of the DAGs in the demo a bit to make it more intuitive.

    opened by pietervans
Releases
  • v0.1.0 (Jul 23, 2021)

    Viewflow allows creating materialized views based on scripts without having to write any Airflow code!

    • Support for SQL and Python scripts
    • Compatibility with Airflow 1.10 and Airflow 2