A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
Multiple GNOME terminals in one window

Terminator by Chris Jones [email protected] and others. Description Terminator was

GNOME Terminator 1.5k Jan 01, 2023
Small tool to use hero .json files created with Optolith for The Dark Eye/ Das Schwarze Auge 5 to perform talent probes.

DSA5-ProbeMaker A little tool for The Dark Eye 5th Edition (Das Schwarze Auge 5) to load .json from Optolith character generation and easily perform t

2 Jan 06, 2022
Use `forge` and `cast` commands in Python scripts

foundrycli.py ( 🔥 , 🐍 ) foundrycli.py is a Python library I've made for personal use; now open source. It lets you access forge and cast CLIs from P

Zero Ekkusu 17 Jul 17, 2022
An example project which contains the Unity components necessary to complete Navigation2's SLAM tutorial with a Turtlebot3, using a custom Unity environment in place of Gazebo.

Navigation 2 SLAM Example This example provides a Unity Project and a colcon workspace that, when used together, allows a user to substitute Unity as

Unity Technologies 183 Jan 04, 2023
Install Firefox from Mozilla.org easily, complete with .desktop file creation.

firefox-installer Install Firefox from Mozilla.org easily, complete with .desktop file creation. Dependencies Python 3 Python LXML Debian/Ubuntu: sudo

rany 7 Nov 04, 2022
InfiniPy has some neat features - like the endpoint for function

InfiniPy has some neat features - like the endpoint for function

ZeroTwo 7 Nov 20, 2022
A 100% python file organizer. Keep your computer always organized!

PythonOrganizer A 100% python file organizer. Keep your computer always organized! To run the project, just clone the folder and run the installation

3 Dec 02, 2022
Function Plotter✨

Function-Plotter Build With : Python PyQt5 unittest matplotlib Getting Started This is an list of needed instructions to set up your project locally,

Ahmed Lotfy 3 Jan 06, 2022
Lagrange Interpolation Method-Python

Lagrange Interpolation Method-Python The Lagrange interpolation formula is a way to find a polynomial, called Lagrange polynomial, that takes on certa

Motahare Soltani 2 Jul 05, 2022
A Python tool to check ASS subtitles for common mistakes and errors.

A Python tool to check ASS subtitles for common mistakes and errors.

1 Dec 18, 2021
Ingestinator is my personal VFX pipeline tool for ingesting folders containing frame sequences that have been pulled and downloaded to a local folder

Ingestinator Ingestinator is my personal VFX pipeline tool for ingesting folders containing frame sequences that have been pulled and downloaded to a

Henry Wilkinson 2 Nov 18, 2022
Final Fantasy XIV Auto House Clicker

Final Fantasy XIV Auto House Clicker

KanameS 0 Mar 31, 2022
NORETURN is an esoteric programming language, based around the idea of not going back

NORETURN NORETURN is an esoteric programming language, based around the idea of not going back Concept Program coded in noreturn runs over one array,

1 Dec 15, 2021
Pseudometa's dotfiles

pseudometa's dotfiles Table of Contents Why this repository? How this Repository works Special Explanations Got an idea for an improvement? Contact Wh

pseudometa 23 Dec 27, 2022
Module-based cryptographic tool

Cryptosploit A decryption/decoding/cracking tool using various modules. To use it, you need to have basic knowledge of cryptography. Table of Contents

/SNESE_AR\ 33 Nov 27, 2022
A python script that will automate the boring task of login to the captive portal again and again

A python script that will automate the boring task of login to the captive portal again and again

Rakib Hasan 2 Feb 09, 2022
An advanced NFT Generator

NFT Generator An advanced NFT Generator Free software: GNU General Public License v3 Documentation: https://nft-generator.readthedocs.io. Features TOD

NFT Generator 5 Apr 21, 2022
Visualize Data From Stray Scanner https://keke.dev/blog/2021/03/10/Stray-Scanner.html

StrayVisualizer A set of scripts to work with data collected using Stray Scanner. Usage Installing Dependencies Install dependencies with pip -r requi

Kenneth Blomqvist 45 Dec 30, 2022
A very terrible python-based programming language that uses folders instead of text files

PYFolders by Lewis L. Foster PYFolders is a very terrible python-based programming language that uses folders instead of regular text files. In this r

Lewis L. Foster 5 Jan 08, 2022
Wrapper for the undocumented CodinGame API. Can be used both synchronously and asynchronlously.

codingame API wrapper Pythonic wrapper for the undocumented CodinGame API. Installation Python 3.6 or higher is required. Install codingame with pip:

Takos 19 Jun 20, 2022