A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Overview

Streamify

A data pipeline with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Description

Objective

The project will stream events generated from a fake music streaming service (like Spotify) and create a data pipeline that consumes the real-time data. The data coming in would be similar to an event of a user listening to a song, navigating on the website, authenticating. The data would be processed in real-time and stored to the data lake periodically (every two minutes). The hourly batch job will then consume this data, apply transformations, and create the desired tables for our dashboard to generate analytics. We will try to analyze metrics like popular songs, active users, user demographics etc.

Dataset

Eventsim is a program that generates event data to replicate page requests for a fake music web site. The results look like real use data, but are totally fake. The docker image is borrowed from viirya's fork of it, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from Million Songs Dataset to generate events. I have used a subset of 10000 songs.

Tools & Technologies

Architecture

streamify-architecture

Final Result

dashboard

Setup

WARNING: You will be charged for all the infra setup. You can avail 300$ in credit by creating a new account on GCP.

Pre-requisites

If you already have a Google Cloud account and a working terraform setup, you can skip the pre-requisite steps.

Get Going!

A video walkthrough of how I run my project - YouTube Video

  • Procure infra on GCP with Terraform - Setup
  • (Extra) SSH into your VMs, Forward Ports - Setup
  • Setup Kafka Compute Instance and start sending messages from Eventsim - Setup
  • Setup Spark Cluster for stream processing - Setup
  • Setup Airflow on Compute Instance to trigger the hourly data pipeline - Setup

Debug

If you run into issues, see if you find something in this debug guide.

How can I make this better?!

A lot can still be done :).

  • Choose managed Infra
    • Cloud Composer for Airflow
    • Confluent Cloud for Kafka
  • Create your own VPC network
  • Build dimensions and facts incrementally instead of full refresh
  • Write data quality tests
  • Create dimensional models for additional business processes
  • Include CI/CD
  • Add more visualizations

Special Mentions

I'd like to thank the DataTalks.Club for offering this Data Engineering course for completely free. All the things I learnt there, enabled me to come up with this project. If you want to upskill on Data Engineering technologies, please check out the course. :)

Owner
Ankur Chavda
Data Engineer-ish
Ankur Chavda
Consulta cpf fds

Consulta-cpf Consulta cpf fds Instalação: apt-get update -y

Moleey 1 Nov 24, 2021
A good Tool to comment on xmw

A good Tool to comment on xmw

1 Feb 10, 2022
WGGCommute - Adding Commute Times to WG-Gesucht Listings

WGGCommute - Adding Commute Times to WG-Gesucht Listings This is a barebones implementation of a chrome extension that can be used to add commute time

Jannis 2 Jul 20, 2022
Perform oocyst segmentation in mercurochrome stained mosquito midgut

Midgut_oocyst_segmentation Perform oocyst segmentation in mercurochrome stained mosquito midguts This oocyst segmentation model also powers the webtoo

Duo Peng 3 Oct 27, 2021
Kubernetes-native workflow automation platform for complex, mission-critical data and ML processes at scale. It has been battle-tested at Lyft, Spotify, Freenome, and others and is truly open-source.

Flyte Flyte is a workflow automation platform for complex, mission-critical data, and ML processes at scale Home Page · Quick Start · Documentation ·

Flyte 3k Jan 01, 2023
A script to download all the challenges and files from the CTFd instance.

Python CTFd Downloader A script to download all the challenges and files from the CTFd instance. Installation Clone this repo: git clone https://githu

Jacob Elliott 19 Dec 16, 2022
Antchain-MPC is a library of MPC (Multi-Parties Computation)

Antchain-MPC Antchain-MPC is a library of MPC (Multi-Parties Computation). It include Morse-STF: A tool for machine learning using MPC. Others: Commin

Alipay 37 Nov 22, 2022
Search and Find Jobs in Ethiopia

✨ EthioJobs ✨ Search and Find Jobs in Ethiopia Easy start critical warning Use pycharm No vscode No sublime No Vim No nothing when you want to use

Abdimk 12 Nov 09, 2022
A project for Perotti's MGIS350 for incorporating Flask

MGIS350_5 This is our project for Perotti's MGIS350 for incorporating Flask... RIT Dev Biz Apps Web Project A web-based Inventory system for company o

1 Nov 07, 2021
Datamol is a python library to work with molecules.

Datamol is a python library to work with molecules. It's a layer built on top of RDKit and aims to be as light as possible.

datamol 276 Dec 19, 2022
Project issue to website data transformation toolkit

braintransform Project issue to website data transformation toolkit. Introduction The purpose of these scripts is to be able to dynamically generate t

Brainhack 1 Nov 19, 2021
Dashboard to view a stock's basic information, RSI, Bollinger bands, EMA, SMA, sentiment analysis via Python

Your One And Only Trading Bot No seriously, we mean it! Contributors Jihad Al-Hussain John Gaffney Shanel Kuchera Kazuki Takehashi Patrick Thornquist

5 May 21, 2022
Un Assistente Vocale scritto in Python e altamente personalizzabile

Un Assistente Vocale scritto in Python e altamente personalizzabile

Marco 2 May 06, 2022
Creates infinite amount of guilded accounts in seconds.

Guilded Cookie Creator [fuck guilded i quit working on this, they patch like every fucking method after 2/3 days i release shit] Optimizations Asynchr

scripted 7 Feb 28, 2022
Anonymous Dark Web Tool

Anonymous Dark Web Tool v1.0 Features Anonymous Mode Darkweb Search Engines Check Onion Url/s Scanning Host/IP Keep eyes on v2.0 soon. Requirement Deb

Mounib Kamhaz 11 Apr 10, 2022
Bad Apple printed out on the console with Python!

bad-apple Bad Apple printed out on the console with Python! Preface A word of disclaimer, while the final code is somewhat original, this project is a

CalvinLoke 186 Dec 01, 2022
Unofficial Python Library to communicate with SESAME 3 series products from CANDY HOUSE, Inc.

pysesame3 Unofficial Python Library to communicate with SESAME 3 series products from CANDY HOUSE, Inc. This project aims to control SESAME 3 series d

Masaki Tagawa 18 Dec 12, 2022
LPCV Winner Solution of Spring Team

LPCV Winner Solution of Spring Team

22 Jul 20, 2022
Imitate Moulinette written in Python

Imitate Moulinette written in Python

Pumidol Leelerdsakulvong 2 Jul 26, 2022
Expense Tracker is a very good tool to keep track of your expenseditures and the total money you saved.

Expense Tracker is a very good tool to keep track of your expenseditures and the total money you saved.

Shreejan Dolai 9 Dec 31, 2022