A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
Chalice - A tool to facilitate Python based lambda deployment

Chalice is a tool to facilitate Python based lambda deployment. This repo contains the output of my basic exploration of this tool.

Csilla Bessenyei 1 Feb 03, 2022
Msgpack serialization/deserialization library for Python, written in Rust using PyO3 and rust-msgpack. Reboot of orjson. msgpack.org[Python]

ormsgpack ormsgpack is a fast msgpack library for Python. It is a fork/reboot of orjson It serializes faster than msgpack-python and deserializes a bi

Aviram Hassan 139 Dec 30, 2022
Media Cloud Outlet Filtering

Using ABYZ and Media-Bias Fact-Check outlet databases, I've provided outlet CSV files for both and scripts to intended to match Media Cloud files to respective outlets.

Stephen Scarano 1 Feb 02, 2022
fast_bss_eval is a fast implementation of the bss_eval metrics for the evaluation of blind source separation.

fast_bss_eval Do you have a zillion BSS audio files to process and it is taking days ? Is your simulation never ending ? Fear no more! fast_bss_eval i

Robin Scheibler 99 Dec 13, 2022
An osu! cheat made in c++ rewritten in python and currently undetected.

megumi-python An osu! cheat made in c++ rewritten in python and currently undetected. Installation Guide Download python 3.9 from https://python.org C

Elaina 2 Nov 18, 2022
Print 'text color' and 'text format' on Term with Python

term-printer Print 'text color' and 'text format' on Term with Python ※ It may not work depending on the OS and shell used. PIP $ pip install term-pri

ななといつ 10 Nov 12, 2022
Basic Hspice runner with Python

HSpicePy Bilgisayarınıza PATH değişkenlerine eklediğiniz HSPICE programını python ile çalıştırmanızı sağlayan basit bir araç. A simple tool that allow

1 Nov 16, 2021
This is a simple SV calling package for diploid assemblies.

dipdiff This is a simple SV calling package for diploid assemblies. It uses a modified version of svim-asm. The package includes its own version minim

Mikhail Kolmogorov 11 Jan 05, 2023
Swubcase - The shitty programming language

What is Swubcase? Swubcase is easy-to-use programming language that can fuck you

5 Jun 19, 2022
Neptune client library - integrate your Python scripts with Neptune

Lightweight experiment tracking tool for AI/ML individuals and teams. Fits any workflow. Neptune is a lightweight experiment logging/tracking tool tha

neptune.ai 353 Jan 04, 2023
A tutorial presents several practical examples of how to build DAGs in Apache Airflow

Apache Airflow - Python Brasil 2021 Este tutorial apresenta vários exemplos práticos de como construir DAGs no Apache Airflow. Background Apache Airfl

Jusbrasil 14 Jun 03, 2022
Python3 Interface to numa Linux library

py-libnuma is python3 interface to numa Linux library so that you can set task affinity and memory affinity in python level for your process which can help you to improve your code's performence.

Dalong 13 Nov 10, 2022
🗽 Like yarn outdated/upgrade, but for pip. Upgrade all your pip packages and automate your Python Dependency Management.

pipupgrade The missing command for pip Table of Contents Features Quick Start Usage Basic Usage Docker Environment Variables FAQ License Features Upda

Achilles Rasquinha 529 Dec 31, 2022
通过简单的卷积神经网络直接预测出验证码图片中滑块的位置

使用说明 1. 在本地测试 运行python3 prdict_one.py即可,默认需要预测的图片路径位于testImg文件夹下的test1.png 运行python3 predict_folder.py预测testImg下的所有图片 2. 部署到服务器 运行python3 run_a_server

12 Mar 08, 2022
Set of tools to analyze Tinynuke samples

tinynuke-toolset You'll find in that repository a set of tools and scripts I developped to analyze Tinynuke samples. Dll extractor: script used to ext

Heat Miser 14 Aug 18, 2022
Small projects for python beginners.

Python Mini Projects For Beginners I recently started doing the #100DaysOfCode Challenge in Python. I've used Python before, but I had switched to JS

Sreekesh Iyer 10 Dec 12, 2022
ticguide: quick + painless TESS observing information

ticguide: quick + painless TESS observing information Complementary to the TESS observing tool tvguide (see also WTV), which tells you if your target

Ashley Chontos 5 Nov 05, 2022
Sacred is a tool to help you configure, organize, log and reproduce experiments developed at IDSIA.

Sacred Every experiment is sacred Every experiment is great If an experiment is wasted God gets quite irate Sacred is a tool to help you configure, or

IDSIA 4k Jan 02, 2023
It was created to conveniently respond to events such as donation, follow, and hosting using the Alert Box provided by twip to streamers

This library is not an official library of twip. It was created to conveniently respond to events such as donation, follow, and hosting using the Alert Box provided by twip to streamers.

junah201 8 Nov 19, 2022
A desktop app to check the unlocked courses bases on previously done courses.

Course Picker A desktop app to check the unlocked courses bases on previously done courses. Table of contents About the Project Built with What it doe

Ahmed Symum Swapno 3 Feb 07, 2022