Projects that implement various aspects of Data Engineering.

Last update: Oct 14, 2021

Overview

DATAWAREHOUSE ON AWS

The purpose of this project is to build a data warehouse to accommodate user activity data for the music streaming application 'Sparkify'. The data model is implemented on AWS cloud infrastructure with the following components -

AWS S3 - Source datasets.

AWS Redshift
>for staging extracted data
>for storing the resultant data model (facts and dimensions)

The data model designed for this project is a star schema.

Table and attribute details are -

Fact Table
songplays: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent

Dimension Tables
users: user_id, first_name, last_name, gender, level
songs: song_id, title, artist_id, year, duration
artists: artist_id, name, location, lattitude, longitude
time: start_time, hour, day, week, month, year, weekday
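
As an illustration of how the tables above might be declared in sql_queries.py, here is a minimal sketch. Only the table and column names come from the list above; the column types, the IDENTITY key and the constraints are assumptions made for the example, not the project's actual definitions.

```python
# Hypothetical excerpt of sql_queries.py.
# Table and column names follow the README; all types and constraints
# below are assumptions chosen for illustration.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INTEGER IDENTITY(0, 1) PRIMARY KEY,  -- auto-generated key (assumed)
    start_time  TIMESTAMP NOT NULL,
    user_id     INTEGER NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INTEGER,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INTEGER PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

# A driver script can iterate over this list to create every table in order.
create_table_queries = [songplay_table_create, user_table_create]
```

The remaining dimension tables (songs, artists, time) would follow the same pattern.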

The source datasets to be extracted into the dimensional model are -

There are two JSON datasets:

Song data: s3://udacity-dend/song_data - Data for all songs, with their respective artists, available in the application library.

Log data: s3://udacity-dend/log_data - Data for user events and activity on the application.
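
Loading these S3 datasets into Redshift staging tables is typically done with a COPY statement. The sketch below only builds the SQL text. The staging table names, the IAM role ARN, the region and the 'auto' JSON option are all assumptions for illustration; the project's real values are not given here.

```python
def build_copy_statement(table, s3_path, iam_role_arn,
                         json_option="auto", region="us-west-2"):
    """Return a Redshift COPY statement that loads JSON files from S3.

    The default region and JSON option are assumptions; adjust them to
    match the actual bucket and file layout.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# Placeholder role ARN and hypothetical staging table names.
EXAMPLE_ROLE = "arn:aws:iam::123456789012:role/example-redshift-role"

staging_songs_copy = build_copy_statement(
    "staging_songs", "s3://udacity-dend/song_data", EXAMPLE_ROLE
)
staging_events_copy = build_copy_statement(
    "staging_events", "s3://udacity-dend/log_data", EXAMPLE_ROLE
)
```

If the log files do not map cleanly onto column names, a JSONPaths file can be passed as json_option instead of 'auto'.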

The data warehouse is implemented on AWS Redshift, whose SQL dialect is based on PostgreSQL.

The ETL pipeline that extracts and loads data from source to target is implemented in Python.
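
A minimal sketch of how such a Python pipeline can execute its SQL steps in order is shown below. It works with any DB-API style connection. Here an in-memory sqlite3 database stands in for Redshift so the example runs locally; in the real pipeline the connection would presumably come from a PostgreSQL driver such as psycopg2 (an assumption). The helper name run_queries and the tiny staging table are hypothetical.

```python
import sqlite3


def run_queries(conn, queries):
    """Execute each SQL statement in order, committing after each one."""
    cur = conn.cursor()
    for query in queries:
        cur.execute(query)
        conn.commit()
    cur.close()
    return len(queries)


def demo():
    """Stage a few made-up rows, then load a deduplicated users dimension."""
    conn = sqlite3.connect(":memory:")  # stand-in for a Redshift connection
    run_queries(conn, [
        "CREATE TABLE staging_events (user_id INTEGER, level TEXT)",
        "CREATE TABLE users (user_id INTEGER, level TEXT)",
        # Illustrative rows only, not real project data.
        "INSERT INTO staging_events VALUES (1, 'free'), (1, 'free'), (2, 'paid')",
        # Staging-to-dimension load: keep one row per distinct user.
        "INSERT INTO users SELECT DISTINCT user_id, level FROM staging_events",
    ])
    rows = conn.execute(
        "SELECT user_id, level FROM users ORDER BY user_id"
    ).fetchall()
    conn.close()
    return rows
```

The same run_queries helper could be reused for both the staging loads and the inserts into the fact and dimension tables.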

TODO steps:

Create sql_queries.py - to design and build the tables for the proposed data model

Run create_tables.py - to create the tables by executing the queries from sql_queries.py

Run etl.py - to run the data pipeline built over the data model, which extracts, stages and loads data from AWS S3 into the DWH on AWS Redshift

Design and run analytical queries on the populated data model to gain insights into user events on the streaming application
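
As one example of the kind of analytical query the last step refers to, the sketch below counts plays per song by joining the songplays fact table to the songs dimension. The query and the helper name top_songs are illustrative assumptions, not queries taken from the project.

```python
# Hypothetical analytical query: the five most played songs.
top_songs_query = """
SELECT s.title, COUNT(*) AS plays
FROM songplays sp
JOIN songs s ON sp.song_id = s.song_id
GROUP BY s.title
ORDER BY plays DESC
LIMIT 5;
"""


def top_songs(conn):
    """Run the query on a DB-API connection and return (title, plays) rows."""
    cur = conn.cursor()
    cur.execute(top_songs_query)
    rows = cur.fetchall()
    cur.close()
    return rows
```

Similar queries can group by level, location or the columns of the time table to explore other aspects of user activity.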
