The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system.

Last update: Jul 13, 2021

Songplays User activity datamart

This document describes the model used to build the songplays datamart table and the ETL process that loads it.

About
Getting Started
Data Model and Schema
Deployment
Built Using
Authors

About

The songplays datamart provides details about the musical taste of our customers and can help us improve our recommendation system.

This document describes the model of the songplays datamart table in the sparkify_app schema, hosted in the sparkify_postgres container, and the Python code that loads new data. The production directories and data must be similar to those under the mnt/data/log_data and mnt/data/song_data paths in this repository.

🏁 Getting Started

First, you need permission to read the source files and to write to the sparkify_app tables that feed the songplays datamart table. Contact the owners or your team leader for more information.

Data Model and Schema

Source files and owners

File or table	Description	Directory	Owner
YYYY-MM-DD-events.json	User events.	mnt/data/log_data/YYYY/11	Person 1
.json	Song data.	mnt/data/song_data/a	Person 2
`songplays`	Datamart for the recommendation system.	`sparkify_app.songplays`	Person 3
`artists`	Dimension table for artists.	`sparkify_app.artists`	Person 1
`songs`	Dimension table for songs.	`sparkify_app.songs`	Person 1
`time`	Dimension table for streaming start time for a given song.	`sparkify_app.time`	Person 2
`users`	Dimension table for users.	`sparkify_app.users`	Person 3
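The table above describes one datamart table (`songplays`) surrounded by dimension tables. As a rough illustration of how such a layout is queried, the sketch below joins `songplays` to `artists`. It runs on an in-memory SQLite database as a stand-in for PostgreSQL. Only `song_id` and `artist_id` are column names taken from this document; every other column, the sample rows, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical, trimmed-down columns: only song_id and artist_id are
# named in this document; everything else is illustrative.
SCHEMA = """
CREATE TABLE artists (artist_id TEXT PRIMARY KEY, name TEXT);
CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT, artist_id TEXT);
CREATE TABLE songplays (
    songplay_id INTEGER PRIMARY KEY,
    song_id TEXT REFERENCES songs (song_id),
    artist_id TEXT REFERENCES artists (artist_id)
);
"""

# Count plays per artist by joining the datamart to a dimension table.
PLAYS_PER_ARTIST = """
SELECT a.name, COUNT(*) AS plays
FROM songplays AS sp
JOIN artists AS a ON a.artist_id = sp.artist_id
GROUP BY a.name
ORDER BY plays DESC, a.name;
"""


def build_demo():
    """Create the in-memory stand-in and fill it with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO artists VALUES (?, ?)",
        [("A1", "Artist One"), ("A2", "Artist Two")],
    )
    conn.executemany(
        "INSERT INTO songs VALUES (?, ?, ?)",
        [("S1", "Song One", "A1"), ("S2", "Song Two", "A2")],
    )
    conn.executemany(
        "INSERT INTO songplays (song_id, artist_id) VALUES (?, ?)",
        [("S1", "A1"), ("S1", "A1"), ("S2", "A2"), (None, None)],
    )
    return conn


def plays_per_artist(conn):
    """Return (artist name, play count) pairs, most played first."""
    return conn.execute(PLAYS_PER_ARTIST).fetchall()


if __name__ == "__main__":
    for name, plays in plays_per_artist(build_demo()):
        print(name, plays)
```

Rows in `songplays` with no matching artist drop out of the inner join, so unmatched plays do not appear in the counts.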

Prerequisites

To run this project, first install Docker Engine for your operating system, along with Docker Compose.

After installing and configuring the Docker tools, download this repository and create a folder named postgres that will store all sparkify_postgres service data. To build the proper images and run the services, just execute the following command inside this repository:

docker-compose up

If the service runs successfully you should see something like this:

...
sparkify_python      | 28/30 files processed.
sparkify_python      | 29/30 files processed.
sparkify_python      | 30/30 files processed.
sparkify_python exited with code 0
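The progress lines above suggest a loop that visits each input file in turn. A minimal sketch of such a loop follows, using only the standard library. It is an illustration rather than the repository's actual script: `list_json_files`, `process_files`, and the `handle_file` callback are hypothetical names.

```python
from pathlib import Path


def list_json_files(root):
    """Return every *.json file under root, sorted for a stable order."""
    return sorted(Path(root).rglob("*.json"))


def process_files(root, handle_file):
    """Call handle_file(path) for each JSON file and print progress."""
    files = list_json_files(root)
    total = len(files)
    for done, path in enumerate(files, start=1):
        handle_file(path)
        print(f"{done}/{total} files processed.")
    return total
```

For example, `process_files("mnt/data/log_data", handle_file)` would walk the log directory named in this document, with `handle_file` standing in for whatever parses a file and inserts its rows.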

You can also check the job by following these steps:

Open your browser and go to localhost:16543:
- Log in with the following credentials:
  - e-mail: [email protected]
  - password: sp4rk1fy
After you log in, click the Servers option in the upper-left corner:
- When prompted, enter the PostgreSQL credentials:
  - User: sparkifypsql
  - Password: p4ssw0rd
Select Query Tool under the Tools menu.

Under the Query Editor, run the following query:

SELECT * FROM sparkify_app.songplays WHERE song_id IS NOT NULL AND artist_id IS NOT NULL;

You should get only 5 rows.
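The same check can be scripted instead of run through pgAdmin. The sketch below wraps the query in a function that accepts any Python DB-API connection. Against the real service you would pass a connection opened with `psycopg2.connect(...)`. The host, port, and database name are not given in this document, so the demo uses an in-memory SQLite database with made-up rows as a stand-in. `count_matched_songplays` and `demo_connection` are hypothetical names.

```python
import sqlite3

CHECK_QUERY = (
    "SELECT * FROM sparkify_app.songplays "
    "WHERE song_id IS NOT NULL AND artist_id IS NOT NULL;"
)


def count_matched_songplays(conn):
    """Run the check query on a DB-API connection and return the row count."""
    cur = conn.cursor()
    try:
        cur.execute(CHECK_QUERY)
        return len(cur.fetchall())
    finally:
        cur.close()


def demo_connection():
    """Build an in-memory SQLite stand-in with made-up rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so the schema-qualified
    # name sparkify_app.songplays resolves as it would in PostgreSQL.
    conn.execute("ATTACH DATABASE ':memory:' AS sparkify_app")
    conn.execute(
        "CREATE TABLE sparkify_app.songplays (song_id TEXT, artist_id TEXT)"
    )
    conn.executemany(
        "INSERT INTO sparkify_app.songplays VALUES (?, ?)",
        [("S1", "A1"), (None, None), ("S2", None)],
    )
    return conn


if __name__ == "__main__":
    print(count_matched_songplays(demo_connection()))
```

Keeping the query in one constant means the scripted check and the pgAdmin check cannot drift apart.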

Microservice architecture

The following image represents the microservice architecture for this project:

Where:

sparkify_python: runs all Python scripts and stores raw data.
sparkify_postgres: runs PostgreSQL and stores the database.
sparkify_pgadmin: runs the pgAdmin tool to monitor the sparkify_postgres service.

⛏️ Built Using

DBeaver - Database tool.
Docker Compose - Tool to run multi-container applications.
Docker Engine - Container engine.
pandas - Data analysis and data wrangling tool.
pgAdmin - PostgreSQL tool.
psycopg2 - Database adapter for Python.
PostgreSQL - Relational database management system.

✍️ Authors

@lkellermann - Idea & Initial work

Owner

Leandro Kellermann de Oliveira
