A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Last update: Dec 30, 2022

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project streams events generated by a fake music streaming service (like Spotify) and builds a data pipeline that consumes this real-time data. Incoming events represent actions such as a user listening to a song, navigating the website, or authenticating. The data is processed in real time and stored to the data lake periodically (every two minutes). An hourly batch job then consumes this data, applies transformations, and creates the tables our dashboard needs to generate analytics. We analyze metrics such as popular songs, active users, and user demographics.
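To make the two-minute cadence concrete, here is a minimal pure-Python sketch of how events could be bucketed into two-minute windows and mapped to hourly folders in the data lake. The project itself does this with Spark Streaming; the topic name `listen_events`, the `ts` field (epoch milliseconds), and the folder layout are assumptions made only for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

WINDOW_SECONDS = 120  # two-minute flush interval from the description above


def window_start(ts_ms: int, window_seconds: int = WINDOW_SECONDS) -> datetime:
    """Round an epoch-millisecond timestamp down to the start of its window."""
    seconds = ts_ms // 1000
    floored = seconds - (seconds % window_seconds)
    return datetime.fromtimestamp(floored, tz=timezone.utc)


def lake_path(topic: str, start: datetime) -> str:
    """Hypothetical lake layout: one folder per topic, date, and hour."""
    return f"{topic}/date={start:%Y-%m-%d}/hour={start:%H}/batch_{start:%H%M}.json"


def group_into_batches(events, topic="listen_events"):
    """Group events by the file their two-minute window would be written to."""
    batches = defaultdict(list)
    for event in events:
        batches[lake_path(topic, window_start(event["ts"]))].append(event)
    return dict(batches)


# Made-up sample events; only the "ts" field matters for the bucketing.
events = [
    {"ts": 1_672_401_600_000, "userId": 1},  # 2022-12-30 12:00:00 UTC
    {"ts": 1_672_401_659_000, "userId": 2},  # 12:00:59, same window
    {"ts": 1_672_401_725_000, "userId": 1},  # 12:02:05, next window
]
batches = group_into_batches(events)
for path, rows in sorted(batches.items()):
    print(path, len(rows))
```

Partitioning by date and hour in this way would let an hourly batch job read exactly one folder per run, which is one plausible reason to pair a short streaming flush interval with an hourly batch.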

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music website. The results look like real usage data but are entirely fake. The Docker image is borrowed from viirya's fork, as the original project has gone unmaintained for a few years.

Eventsim uses song data from the Million Song Dataset to generate events. I have used a subset of 10,000 songs.
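As a rough illustration of what can be derived from such events, the sketch below counts song plays and distinct users over a handful of made-up records. The field names (`userId`, `page`, `song`, `artist`) and the `NextSong` page value are assumptions about the event shape, not a specification of Eventsim's output, and the records themselves are invented.

```python
from collections import Counter

# Made-up events in an assumed Eventsim-like shape.
sample_events = [
    {"userId": 1, "page": "NextSong", "song": "Song A", "artist": "Artist X"},
    {"userId": 2, "page": "NextSong", "song": "Song A", "artist": "Artist X"},
    {"userId": 2, "page": "Home", "song": None, "artist": None},
    {"userId": 3, "page": "NextSong", "song": "Song B", "artist": "Artist Y"},
]


def popular_songs(events, top_n=10):
    """Count plays per (artist, song), considering only song-play events."""
    plays = Counter(
        (e["artist"], e["song"]) for e in events if e["page"] == "NextSong"
    )
    return plays.most_common(top_n)


def active_users(events):
    """Number of distinct users that produced at least one event."""
    return len({e["userId"] for e in events})


print(popular_songs(sample_events, top_n=1))
print(active_users(sample_events))
```

In the actual project these aggregations would live in the warehouse models rather than in Python; the sketch only shows the kind of logic involved.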

Tools & Technologies

Cloud - Google Cloud Platform
Infrastructure as Code software - Terraform
Containerization - Docker, Docker Compose
Stream Processing - Kafka, Spark Streaming
Orchestration - Airflow
Transformation - dbt
Data Lake - Google Cloud Storage
Data Warehouse - BigQuery
Data Visualization - Data Studio
Language - Python

Architecture

Final Result

Setup

WARNING: You will be charged for all the infrastructure you set up. You can get $300 in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working Terraform setup, you can skip the prerequisite steps.

Google Cloud Platform
- GCP Account and Access Setup
- gcloud alternate installation method - Windows
Terraform
- Setup Terraform

Get Going!

A video walkthrough of how I run my project - YouTube Video

Provision infrastructure on GCP with Terraform - Setup
(Extra) SSH into your VMs and forward ports - Setup
Set up the Kafka Compute Instance and start sending messages from Eventsim - Setup
Set up the Spark cluster for stream processing - Setup
Set up Airflow on a Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, check whether this debug guide covers them.

How can I make this better?!

A lot can still be done :).

Choose managed infrastructure
- Cloud Composer for Airflow
- Confluent Cloud for Kafka
Create your own VPC network
Build dimensions and facts incrementally instead of with a full refresh
Write data quality tests
Create dimensional models for additional business processes
Include CI/CD
Add more visualizations
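To illustrate the "incremental instead of full refresh" idea from the list above, here is a minimal sketch that contrasts the two approaches using a timestamp high-water mark. It is plain Python over lists of dicts, not the project's dbt code; the function names and the `ts` key are hypothetical. In dbt, the comparable feature is the incremental materialization.

```python
def full_refresh(source_rows):
    """Rebuild the target from scratch on every run."""
    return list(source_rows)


def incremental_load(target_rows, source_rows, ts_key="ts"):
    """Append only source rows newer than the latest row already in the target."""
    high_water_mark = max((row[ts_key] for row in target_rows), default=None)
    new_rows = [
        row
        for row in source_rows
        if high_water_mark is None or row[ts_key] > high_water_mark
    ]
    return target_rows + new_rows, len(new_rows)


# Made-up rows: the target already holds ts 1 and 2, the source has grown to 3.
target = [{"ts": 1}, {"ts": 2}]
source = [{"ts": 1}, {"ts": 2}, {"ts": 3}]

updated, added = incremental_load(target, source)
print(added, updated)
```

The trade-off is that a simple high-water mark skips late-arriving rows whose timestamps fall below the mark, so a real incremental model usually needs a lookback window or a unique key to merge on.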

Special Mentions

I'd like to thank DataTalks.Club for offering this Data Engineering course completely free. Everything I learnt there enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner

Ankur Chavda
