Introduction

This repository shows how to integrate Zeppelin with Airflow. The philosophy behind the integration is to make the transition from the development stage to the production stage as smooth as possible.
Zeppelin is good at data pipeline development (Spark, Flink, Hive, Python, Shell, etc.), while Airflow is the de facto standard for job orchestration.

How to run it

Step 1. Initialize the environment.

Run the following commands to initialize the environment. They do two things:

Download Spark, which is used by Zeppelin
Download the Zeppelin Airflow plugins

git clone https://github.com/zjffdu/zeppelin_airflow.git
cd zeppelin_airflow
./init.sh

Step 2. Start Zeppelin + Airflow via docker-compose

docker-compose -f docker-compose-LocalExecutor.yml up -d

Step 3. Use Zeppelin + Airflow

Open http://localhost:8085 for Zeppelin and http://localhost:8083 for Airflow.
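
If the pages do not load right away, the containers may still be starting. A minimal sketch for checking both addresses from Python, using only the standard library, is shown here. The two URLs come from this step; the helper name is_up and the 3-second timeout are assumptions made for illustration and are not part of this repository.

```python
import urllib.error
import urllib.request

# URLs taken from Step 3 of this README.
SERVICES = {
    "Zeppelin": "http://localhost:8085",
    "Airflow": "http://localhost:8083",
}


def is_up(url, timeout=3.0):
    """Return True if the server at `url` answers an HTTP request at all.

    `is_up` is a hypothetical helper, not something shipped with this repo.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or similar: not reachable yet.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        state = "up" if is_up(url) else "not reachable"
        print(f"{name}: {state} ({url})")
```

Running the script a few times after docker-compose returns shows when both services have finished starting.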

Airflow contains one DAG, zeppelin_example. This DAG runs three Zeppelin notes:

Python Tutorial/01. IPython Basics
Spark Tutorial/02. Spark Basics Features
Spark Tutorial/03. Spark SQL (PySpark)

Once you enable the DAG, Airflow runs these Zeppelin notes.

Note that Zeppelin does not run these notes directly; instead, it clones each note and runs the clone.
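
The clone-then-run behavior can be illustrated with Zeppelin's notebook REST API, which offers one endpoint to clone a note and another to run all of its paragraphs. The sketch below is an assumption about how such a flow could look; the plugin used by the zeppelin_example DAG may do it differently. The helper names are hypothetical, the note ID must be supplied by you, and the response fields read in clone_and_run follow the shape described in the Zeppelin REST API documentation.

```python
import json
import urllib.request

ZEPPELIN_URL = "http://localhost:8085"  # Zeppelin address from Step 3


def build_clone_request(note_id, new_name, base_url=ZEPPELIN_URL):
    """Build the request that clones `note_id` into a note named `new_name`."""
    body = json.dumps({"name": new_name}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/notebook/{note_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_run_request(note_id, base_url=ZEPPELIN_URL):
    """Build the request that runs every paragraph of `note_id`."""
    return urllib.request.Request(
        f"{base_url}/api/notebook/job/{note_id}", data=b"", method="POST"
    )


def clone_and_run(note_id, new_name, base_url=ZEPPELIN_URL):
    """Clone a note, run the clone, and return (cloned_id, run_status).

    Hypothetical helper: the original note is left untouched.
    """
    clone_req = build_clone_request(note_id, new_name, base_url)
    with urllib.request.urlopen(clone_req) as resp:
        # Assumed response shape: the new note ID is in the "body" field.
        cloned_id = json.load(resp)["body"]
    with urllib.request.urlopen(build_run_request(cloned_id, base_url)) as resp:
        return cloned_id, json.load(resp)["status"]
```

Running a clone keeps the original note unchanged, so a scheduled run does not overwrite the output of a note that someone is still developing.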

More features are coming soon; stay tuned.

Owner

Jeff Zhang
