ETL Pipeline for AWS

Project Description

In this project, an ETL pipeline is built for a data warehouse hosted on AWS Redshift. Data is loaded from S3 into staging tables on Redshift, and SQL queries then transform those staging tables into analytics tables.

Dataset Structure

The dataset is composed of two parts, the song data and the log data, both stored in an S3 bucket.

Song Data

The song data is a dataset with about a million entries. Each file is in JSON format and contains metadata about one song and its artist. The files are partitioned by the first three letters of each song ID. A single entry of the song dataset looks like:

  • {
       "num_songs":1,
       "artist_id":"ARJIE2Y1187B994AB7",
       "artist_latitude":null,
       "artist_longitude":null,
       "artist_location":"",
       "artist_name":"Line Renaud",
       "song_id":"SOUPIRU12A6D4FA1E1",
       "title":"Der Kleine Dompfaff",
       "duration":152.92036,
       "year":0
    }
    

Logs Data

The logs dataset also consists of JSON files, generated by an event simulator based on the songs in the dataset above. It simulates activity logs from an imaginary music streaming app based on configuration settings. A single entry of the logs dataset looks like:

  • {
        "artist": "Pavement",
        "auth": "Logged in",
        "firstName": "Sylvie",
        "gender": "F",
        "iteminSession": 0,
        "lastName": "Cruz",
        "length": 99.16036,
        "level": "free",
        "location": "Kiamath Falls, OR",
        "method": "PUT",
        "page": "NextSong",
        "registration": 1.540266e+12,
        "sessionId": 345,
        "song": "Mercy: The Laundromat",
        "status": 200,
        "ts": 1541990258796,
        "userAgent": "Mozzilla/5.0...",
        "userId": 10
    }
    

Data Warehouse Schema

There are two staging tables.

Event table:
artist VARCHAR, auth VARCHAR, firstName VARCHAR, gender VARCHAR, itemInSession INT, lastName VARCHAR, length DOUBLE PRECISION, level VARCHAR, location VARCHAR, method VARCHAR, page VARCHAR, registration VARCHAR, sessionId INT, song VARCHAR, status INT, ts VARCHAR, userAgent VARCHAR, userId INT

Song table:
num_songs INTEGER, artist_id VARCHAR, artist_latitude VARCHAR, artist_longitude VARCHAR, artist_location VARCHAR, artist_name VARCHAR, song_id VARCHAR, title VARCHAR, duration NUMERIC NOT NULL, year INTEGER
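
A minimal sketch of how these staging tables might be declared in sql_queries.py is shown below. The table names staging_events and staging_songs are assumptions; the columns and types are copied from the lists above.

```python
# Hypothetical excerpt from sql_queries.py. Table names are assumed;
# columns and types follow the staging schema described above.
staging_events_table_create = """
    CREATE TABLE IF NOT EXISTS staging_events (
        artist        VARCHAR,
        auth          VARCHAR,
        firstName     VARCHAR,
        gender        VARCHAR,
        itemInSession INT,
        lastName      VARCHAR,
        length        DOUBLE PRECISION,
        level         VARCHAR,
        location      VARCHAR,
        method        VARCHAR,
        page          VARCHAR,
        registration  VARCHAR,
        sessionId     INT,
        song          VARCHAR,
        status        INT,
        ts            VARCHAR,
        userAgent     VARCHAR,
        userId        INT
    );
"""

staging_songs_table_create = """
    CREATE TABLE IF NOT EXISTS staging_songs (
        num_songs        INTEGER,
        artist_id        VARCHAR,
        artist_latitude  VARCHAR,
        artist_longitude VARCHAR,
        artist_location  VARCHAR,
        artist_name      VARCHAR,
        song_id          VARCHAR,
        title            VARCHAR,
        duration         NUMERIC NOT NULL,
        year             INTEGER
    );
"""
```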

These staging tables are used to populate the dimension and fact tables:

Dimension Tables:
users:
*user_id, first_name, last_name, gender, level*
songs:
*song_id, title, artist_id, year, duration*
artists:
*artist_id, name, location, latitude, longitude*
time:
*start_time, hour, day, week, month, year, weekday*
Fact Tables:
songplays:
*songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*

Every table has a primary key so that each row can be uniquely identified.
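
To illustrate how the staging tables feed the star schema, the songplays fact table can be populated by joining the event and song staging tables on song title and artist name. The query below is only a sketch: the staging table names, the join condition, and the epoch-millisecond conversion of ts are assumptions, and songplay_id is assumed to be an auto-generated IDENTITY column.

```python
# Hypothetical songplays load from the staging tables (names assumed).
# songplay_id is omitted on the assumption that it is an IDENTITY column.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT
        TIMESTAMP 'epoch' + (e.ts::BIGINT / 1000) * INTERVAL '1 second' AS start_time,
        e.userId,
        e.level,
        s.song_id,
        s.artist_id,
        e.sessionId,
        e.location,
        e.userAgent
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title
     AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```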

ETL Process

The ETL process comprises two steps:

  • Copying data from the S3 bucket into the staging tables (sketched below)
  • Inserting data from the staging tables into the dimension and fact tables of the star schema
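
The first step is typically done with Redshift's COPY command, which loads the JSON files from S3 into the staging tables in parallel. A sketch for the event logs follows; the S3 path, IAM role ARN, region, and JSONPaths file are placeholders that would come from dwh.cfg.

```python
# Hypothetical COPY for the log data. All quoted values are placeholders
# that would be read from the CLUSTER/IAM_ROLE/S3 sections of dwh.cfg.
staging_events_copy = """
    COPY staging_events
    FROM 's3://<log-data-path>'
    IAM_ROLE '<iam-role-arn>'
    FORMAT AS JSON 's3://<log-jsonpaths-file>'
    REGION '<aws-region>';
"""
```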

Files Description

- create_tables.py: Drops the tables if they already exist and then creates them; run this to reset the schema before running etl.py.
- etl.py: Loads the data from S3 into the staging tables and then inserts it into the dimension and fact tables.
- dwh.cfg: Contains the data warehouse settings for AWS, i.e. the CLUSTER, IAM_ROLE, and S3 sections used by the ETL pipeline.
- sql_queries.py: Contains the SQL queries for dropping, creating, and populating the tables.
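
Both scripts typically follow the same pattern: read dwh.cfg, connect to the Redshift cluster, and execute the query lists defined in sql_queries.py. The sketch below shows that flow for create_tables.py; the config keys and the query list names are assumptions.

```python
import configparser

import psycopg2

# Assumed to be defined in sql_queries.py as lists of SQL strings.
from sql_queries import create_table_queries, drop_table_queries

# Read the Redshift connection settings; section/key names are assumptions.
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Drop any existing tables, then create them fresh.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```
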
Owner
Mobeen Ahmed