The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system.

Last update: Jul 13, 2021

Songplays User activity datamart

This document describes the model used to build the songplays datamart table and its ETL process.

About
Getting Started
Data Model and Schema
Deployment
Built Using
Authors

About

The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system.

This document describes the model of the songplays datamart table in the sparkify_app schema, hosted in the sparkify_postgres container, and the Python code that loads new data. The production directory and data must be similar to those under the mnt/data/log_data and mnt/data/song_data paths in this repository.

🏁 Getting Started

First, you need the right permissions to read the source files and write them into the sparkify_app tables that generate the songplays datamart table. Contact the owners or your team leader for more information.

Data Model and Schema

Source files and owners

File or table	Description	Directory	Owner
YYYY-MM-DD-events.json	User events.	mnt/data/log_data/YYYY/11	Person 1
.json	Song data.	mnt/data/song_data/a	Person 2
`songplays`	Datamart for the recommendation system.	`sparkify_app.songplays`	Person 3
`artists`	Dimension table for artists.	`sparkify_app.artists`	Person 1
`songs`	Dimension table for songs.	`sparkify_app.songs`	Person 1
`time`	Dimension table for streaming start time for a given song.	`sparkify_app.time`	Person 2
`users`	Dimension table for users.	`sparkify_app.users`	Person 3
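
As a rough illustration of how the source files listed above could be located, the sketch below walks a data directory and collects every .json file in it. The function name `find_json_files` is hypothetical and not taken from the repository, and the recursive search is an assumption about how the directories are laid out.

```python
from pathlib import Path


def find_json_files(root):
    """Return every .json file below root, sorted for a stable order."""
    # rglob descends into nested folders such as YYYY/11 under log_data.
    return sorted(p for p in Path(root).rglob("*.json") if p.is_file())


if __name__ == "__main__":
    # Directories taken from the table above; adjust them to your environment.
    for directory in ("mnt/data/log_data", "mnt/data/song_data"):
        files = find_json_files(directory)
        print(f"{directory}: {len(files)} JSON files found")
```

Sorting the result keeps the processing order reproducible between runs, which makes the progress output easier to compare.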

Prerequisites

To run this project, first install Docker Engine for your operating system and Docker Compose.

After installing and configuring the Docker tools, download this repository and create a folder named postgres that will store all sparkify_postgres service data. To build the proper images and run the services, just execute the following command inside this repository:

docker-compose up

If the service runs successfully you should see something like this:

...
sparkify_python      | 28/30 files processed.
sparkify_python      | 29/30 files processed.
sparkify_python      | 30/30 files processed.
sparkify_python exited with code 0
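
Progress lines like the ones above could come from a loop along these lines. This is a minimal sketch, not the repository's actual code: `process_files` and the `handler` callback are hypothetical names, and the message format is simply copied from the output shown.

```python
def process_files(paths, handler, report=print):
    """Apply handler to each path and report progress after every file."""
    total = len(paths)
    for done, path in enumerate(paths, start=1):
        handler(path)  # e.g. parse one JSON file and insert its rows
        report(f"{done}/{total} files processed.")
    return total
```

Passing `report` as a parameter is a design choice made here so the messages can be sent to a logger or captured in a test instead of being printed.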

You can also check the job by following these steps:

Open your browser and go to localhost:16543:
- Enter the following credentials to authenticate:
  - e-mail: [email protected]
  - password: sp4rk1fy
After you log in, click the Servers option in the upper-left corner:
- You will be asked to enter the PostgreSQL credentials:
  - User: sparkifypsql
  - Password: p4ssw0rd
Select Query Tool under the Tools menu.

Under the Query Editor, run the following query:

SELECT * FROM sparkify_app.songplays WHERE song_id IS NOT NULL AND artist_id IS NOT NULL;

You should get only 5 rows.
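
The same check can be scripted instead of run by hand in pgAdmin. The sketch below assumes an already open DB-API 2.0 connection; `count_matched_songplays` and `CHECK_QUERY` are hypothetical names, and the query is a COUNT variant of the one above.

```python
CHECK_QUERY = (
    "SELECT COUNT(*) FROM sparkify_app.songplays "
    "WHERE song_id IS NOT NULL AND artist_id IS NOT NULL"
)


def count_matched_songplays(conn):
    """Count songplays rows that have both a song_id and an artist_id.

    conn is any open DB-API 2.0 connection, for example one returned by
    psycopg2.connect() (connection parameters are not covered here).
    """
    cur = conn.cursor()
    try:
        cur.execute(CHECK_QUERY)
        (count,) = cur.fetchone()
    finally:
        cur.close()
    return count
```

With the services running, a psycopg2 connection opened with the PostgreSQL credentials above should make this function return 5. The host, port and database name are not stated in this document, so take them from the repository's Docker Compose configuration.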

Microservice architecture

The following image represents the microservice architecture for this project:

Where:

sparkify_python: runs all Python scripts and stores raw data.
sparkify_postgres: runs PostgreSQL and stores the database.
sparkify_pgadmin: runs the pgAdmin tool to monitor the sparkify_postgres service.

⛏️ Built Using

DBeaver - Database tool.
Docker Compose - Tool to run multi-container applications.
Docker Engine - Container engine.
pandas - Data analysis and data wrangling tool.
pgAdmin - PostgreSQL tool.
psycopg2 - Database adapter for Python.
PostgreSQL - Relational database management system.

✍️ Authors

@lkellermann - Idea & Initial work

Owner

Leandro Kellermann de Oliveira
