In this project, an ETL pipeline is built on a data warehouse hosted on AWS Redshift.

Overview

ETL Pipeline for AWS

Project Description

In this project, an ETL pipeline is built on a data warehouse hosted on AWS Redshift. The data is loaded from S3 into staging tables on Redshift, and SQL queries then create the analytics tables from the staging tables.

Dataset Structure

The dataset consists of two parts, the song data and the log data, both stored in an S3 bucket.

Song Data

The song data is a dataset with millions of entries. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of the song ID. A single entry of the song dataset looks like this:

    {
        "num_songs": 1,
        "artist_id": "ARJIE2Y1187B994AB7",
        "artist_latitude": null,
        "artist_longitude": null,
        "artist_location": "",
        "artist_name": "Line Renaud",
        "song_id": "SOUPIRU12A6D4FA1E1",
        "title": "Der Kleine Dompfaff",
        "duration": 152.92036,
        "year": 0
    }
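
As a rough illustration of how one song record feeds the analytics tables, the sketch below parses the sample entry above and splits it into one row for songs and one row for artists, using the column lists from the schema section. The project itself does this with SQL on Redshift; the helper name `split_song_record` is hypothetical and is not taken from the project code.

```python
import json

# Sample record copied from the song dataset entry shown above.
SAMPLE = """
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
"""


def split_song_record(raw):
    """Hypothetical helper: turn one song JSON document into (songs row, artists row)."""
    rec = json.loads(raw)
    # songs: song_id, title, artist_id, year, duration
    song_row = (rec["song_id"], rec["title"], rec["artist_id"],
                rec["year"], rec["duration"])
    # artists: artist_id, name, location, latitude, longitude
    artist_row = (rec["artist_id"], rec["artist_name"], rec["artist_location"],
                  rec["artist_latitude"], rec["artist_longitude"])
    return song_row, artist_row


song_row, artist_row = split_song_record(SAMPLE)
print(song_row)
print(artist_row)
```

JSON `null` values arrive in Python as `None`, so missing coordinates stay empty rather than becoming zero.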
    

The second dataset consists of log files in JSON format, generated by an event simulator from the songs in the dataset above. They simulate activity logs from an imaginary music streaming app, based on configuration settings.

Logs Data

The log data is also in JSON format and is produced by the event simulator from the song dataset. Each entry records one activity event from the music app. A single entry looks like this:

    {
        "artist": "Pavement",
        "auth": "Logged in",
        "firstName": "Sylvie",
        "gender": "F",
        "itemInSession": 0,
        "lastName": "Cruz",
        "length": 99.16036,
        "level": "free",
        "location": "Klamath Falls, OR",
        "method": "PUT",
        "page": "NextSong",
        "registration": 1.540266e+12,
        "sessionId": 345,
        "song": "Mercy: The Laundromat",
        "status": 200,
        "ts": 1541990258796,
        "userAgent": "Mozilla/5.0...",
        "userId": 10
    }
    

Data Warehouse schema

There are two staging tables:

Event table:
*artist VARCHAR, auth VARCHAR, firstName VARCHAR, gender VARCHAR, itemInSession INT, lastName VARCHAR, length DOUBLE PRECISION, level VARCHAR, location VARCHAR, method VARCHAR, page VARCHAR, registration VARCHAR, sessionId INT, song VARCHAR, status INT, ts VARCHAR, userAgent VARCHAR, userId INT*

Song table:
*num_songs INTEGER, artist_id VARCHAR, artist_latitude VARCHAR, artist_longitude VARCHAR, artist_location VARCHAR, artist_name VARCHAR, song_id VARCHAR, title VARCHAR, duration NUMERIC NOT NULL, year INTEGER*
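
A minimal sketch of how the two staging tables could be declared as SQL strings in Python, in the style that the sql_queries file suggests. The table names `staging_events` and `staging_songs` and all variable names are assumptions; the column lists are the ones given above.

```python
# Hypothetical table and variable names; columns follow the lists above.
staging_events_table_drop = "DROP TABLE IF EXISTS staging_events;"
staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs;"

staging_events_table_create = """
CREATE TABLE IF NOT EXISTS staging_events (
    artist        VARCHAR,
    auth          VARCHAR,
    firstName     VARCHAR,
    gender        VARCHAR,
    itemInSession INT,
    lastName      VARCHAR,
    length        DOUBLE PRECISION,
    level         VARCHAR,
    location      VARCHAR,
    method        VARCHAR,
    page          VARCHAR,
    registration  VARCHAR,
    sessionId     INT,
    song          VARCHAR,
    status        INT,
    ts            VARCHAR,
    userAgent     VARCHAR,
    userId        INT
);
"""

staging_songs_table_create = """
CREATE TABLE IF NOT EXISTS staging_songs (
    num_songs        INTEGER,
    artist_id        VARCHAR,
    artist_latitude  VARCHAR,
    artist_longitude VARCHAR,
    artist_location  VARCHAR,
    artist_name      VARCHAR,
    song_id          VARCHAR,
    title            VARCHAR,
    -- Type as listed above. In Redshift, NUMERIC without precision and scale
    -- defaults to DECIMAL(18,0), so fractional seconds would be rounded.
    duration         NUMERIC NOT NULL,
    year             INTEGER
);
"""

# Ordered lists that a setup script could iterate over: drop first, then create.
drop_table_queries = [staging_events_table_drop, staging_songs_table_drop]
create_table_queries = [staging_events_table_create, staging_songs_table_create]
```

Keeping the statements in ordered lists lets a setup script drop every table before recreating it, which matches the behavior described for create_tables.py.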

The dimension and fact tables are built from these staging tables:

Dimension Tables:
users:
*user_id, first_name, last_name, gender, level*
songs:
*song_id, title, artist_id, year, duration*
artists:
*artist_id, name, location, latitude, longitude*
time:
*start_time, hour, day, week, month, year, weekday*
Fact table:
songplays:
*songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*

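Every dimension and fact table has a primary key so that each row can be uniquely identified. Note that Redshift treats primary key constraints as informational and does not enforce them, so the load queries themselves have to avoid inserting duplicates.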
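
To illustrate how the columns of the time table relate to the log data, the sketch below derives them from the `ts` value of the sample log entry. It assumes that `ts` is a Unix epoch in milliseconds and interprets it in UTC; neither is stated explicitly in this document. The project itself would do this in SQL on Redshift, and the helper `time_row` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def time_row(ts_ms):
    """Hypothetical helper: expand a millisecond epoch into the time table columns."""
    # int() also accepts the string form, since ts is staged as VARCHAR.
    start_time = EPOCH + timedelta(milliseconds=int(ts_ms))
    return {
        "start_time": start_time,
        "hour": start_time.hour,
        "day": start_time.day,
        "week": start_time.isocalendar()[1],  # ISO week number
        "month": start_time.month,
        "year": start_time.year,
        "weekday": start_time.weekday(),      # Monday is 0
    }


row = time_row(1541990258796)  # ts of the sample log entry
print(row)
```

Using a `timedelta` with integer milliseconds avoids floating-point rounding of the sub-second part.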

ETL Process

The ETL process comprises two steps:

  • Load the data from the S3 bucket into the staging tables
  • Insert the data from the staging tables into the dimension and fact tables, which follow a star schema
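
The two steps above might look roughly like the following. This is a sketch under stated assumptions, not the project's actual queries: the table name `staging_events`, the bucket path, the role ARN, and the helper `build_copy_query` are placeholders, and the JSON option passed to COPY depends on how the keys in the files line up with the column names. Only the SQL text is built here; running it requires a connection to the Redshift cluster.

```python
def build_copy_query(table, s3_path, iam_role_arn, json_option="auto"):
    """Step 1 (hypothetical helper): build a Redshift COPY statement for JSON files in S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# Step 2: populate a dimension table from a staging table.
# Column names come from the schema section; DISTINCT avoids repeating
# identical rows for users who appear in many log events.
user_table_insert = """
INSERT INTO users (user_id, first_name, last_name, gender, level)
SELECT DISTINCT userId, firstName, lastName, gender, level
FROM staging_events
WHERE userId IS NOT NULL;
"""

copy_events = build_copy_query(
    "staging_events",
    "s3://example-bucket/log_data",            # placeholder path
    "arn:aws:iam::123456789012:role/example",  # placeholder role ARN
)
print(copy_events)
print(user_table_insert)
```

In a script such as etl.py, each statement would then be passed to a database cursor, for example through a PostgreSQL driver, with the COPY statements executed before the INSERT statements.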

Files Description

- create_tables.py: Drops the tables if they already exist, then creates them.
- etl.py: Reads and processes the data files.
- dwh.cfg: Contains the data warehouse settings for AWS, namely the CLUSTER, IAM_ROLE, and S3 settings for the ETL pipeline.
- sql_queries: Contains the SQL queries for dropping tables, creating tables, and selecting data from tables.
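
For illustration, a configuration file with the three sections named above could be read with Python's standard `configparser` module. The key names (`HOST`, `DB_PORT`, `ARN`, `LOG_DATA`, and so on) and all values below are placeholders, not the project's actual settings.

```python
import configparser

# Placeholder contents; only the section names come from the description above.
EXAMPLE_CFG = """
[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME = dwh
DB_USER = dwhuser
DB_PASSWORD = change-me
DB_PORT = 5439

[IAM_ROLE]
ARN = arn:aws:iam::123456789012:role/example

[S3]
LOG_DATA = s3://example-bucket/log_data
SONG_DATA = s3://example-bucket/song_data
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CFG)  # a real script would call config.read("dwh.cfg")

role_arn = config["IAM_ROLE"]["ARN"]
port = config.getint("CLUSTER", "DB_PORT")
log_data = config.get("S3", "LOG_DATA")
print(config.sections(), role_arn, port, log_data)
```

Keeping credentials and endpoints in a separate file like this means the scripts themselves contain no secrets, provided the real dwh.cfg is kept out of version control.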

Owner

Mobeen Ahmed