songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Overview

Sparkify

Songplays User activity datamart

Status GitHub Issues GitHub Pull Requests License


The following document describes the model used to build the songplays datamart table and the respective ETL process.

Table of Contents

About

The songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system.

This document describes the model of songplays table datamart on sparkify_app schema inside a container sparkify_postgres, and the Python code to load new data. The production directory and data must be simmilar to those in mnt/data/log_data and mnt/data/song_data paths in this repository.

🏁 Getting Started

First you need to have the right permissions to access the source files and write them into sparkify_app tables that generates the songplays datamart table. Contact the owners or your team leader for more information.

Data Model and Schema


songplays datamart

Source files and owners

File or table Description Directory Owner
YYYY-MM-DD-events.json User events. mnt/data/log_data/YYYY/11 Person 1
.json Song data. mnt/data/song_data/a Person 2
songplays Datamart for recomendation system. sparkify_app.songplays Person 3
artists Dimension table for artists. sparkify_app.artists Person 1
songs Dimension table for songs. sparkify_app.songs Person 1
time Dimension table for streaming start time for a given song. sparkify_app.time Person 2
users Dimension table for users. sparkify_app.users Person 3

Prerequisites


To run this project first you need to install the Docker Engine for your operational system and Docker Compose.

After installing and configuring the Docker tools, download this repository and create a folder named postgres that will store all sparkify_postgres service data. To build the proper images and run the services, just execute the following command inside this repository:

docker-compose up

If the service runs successfully you should see something like this:

...
sparkify_python      | 28/30 files processed.
sparkify_python      | 29/30 files processed.
sparkify_python      | 30/30 files processed.
sparkify_python exited with code 0

You can also check the job by following these steps:

  • Open your browser and access localhost:16543: pga1

    • Enter with the following credentials to authenticate:
  • After you log in, click on the Servers option at the upper corner on the left: pga2

    • You will be asked to enter with the PostgreSQL credentials:
      • User: sparkifypsql
      • Password: p4ssw0rd
  • Select the Query Tools under the Tools menu: pga3

  • Under the Query Editor, run the following query:

    SELECT * FROM sparkify_app.songplays WHERE song_id is NOT NULL and artist_id is NOT NULL;
    • You should get only 5 rows. pga3

Microservice architecture

The following image represents the microservice architecture for this project: topology

Where:

  • sparkify_python: runs all Python scripts and stores raw data.
  • sparkify_postgres: runs Postgre and stores the database.
  • sparkify_pgadmin: runs the pgAdmin tool to monitor the sparkify_postgres service.

⛏️ Built Using

✍️ Authors

Owner
Leandro Kellermann de Oliveira
Leandro Kellermann de Oliveira
Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your in

Blue Collar Bioinformatics 917 Jan 03, 2023
Senator Trades Monitor

Senator Trades Monitor This monitor will grab the most recent trades by senators and send them as a webhook to discord. Installation To use the monito

Yousaf Cheema 5 Jun 11, 2022
Probabilistic reasoning and statistical analysis in TensorFlow

TensorFlow Probability TensorFlow Probability is a library for probabilistic reasoning and statistical analysis in TensorFlow. As part of the TensorFl

3.8k Jan 05, 2023
Data Analysis for First Year Laboratory at Imperial College, London.

Data Analysis for First Year Laboratory at Imperial College, London. For personal reference only, and to reference in lab reports and lab books.

Martin He 0 Aug 29, 2022
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well

Orchest 3.6k Jan 09, 2023
CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images.

cleanX CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological

Candace Makeda Moore, MD 20 Jan 05, 2023
A notebook to analyze Amazon Recommendation Review Dataset.

Amazon Recommendation Review Dataset Analyzer A notebook to analyze Amazon Recommendation Review Dataset. Features Calculates distinct user count, dis

isleki 3 Aug 22, 2022
simple way to build the declarative and destributed data pipelines with python

unipipeline simple way to build the declarative and distributed data pipelines. Why you should use it Declarative strict config Scaffolding Fully type

aliaksandr-master 0 Jan 26, 2022
MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

SisonkeBiotik 6 Nov 30, 2022
DefAP is a program developed to facilitate the exploration of a material's defect chemistry

DefAP is a program developed to facilitate the exploration of a material's defect chemistry. A large number of features are provided and rapid exploration is supported through the use of autoplotting

6 Oct 25, 2022
Single machine, multiple cards training; mix-precision training; DALI data loader.

Template Script Category Description Category script comparison script train.py, loader.py for single-machine-multiple-cards training train_DP.py, tra

2 Jun 27, 2022
Tools for the analysis, simulation, and presentation of Lorentz TEM data.

ltempy ltempy is a set of tools for Lorentz TEM data analysis, simulation, and presentation. Features Single Image Transport of Intensity Equation (SI

McMorran Lab 1 Dec 26, 2022
Anomaly Detection with R

AnomalyDetection R package AnomalyDetection is an open-source R package to detect anomalies which is robust, from a statistical standpoint, in the pre

Twitter 3.5k Dec 27, 2022
Office365 (Microsoft365) audit log analysis tool

Office365 (Microsoft365) audit log analysis tool The header describes it all WHY?? The first line of code was written long time before other colleague

Anatoly 1 Jul 27, 2022
Retail-Sim is python package to easily create synthetic dataset of retaile store.

Retailer's Sale Data Simulation Retail-Sim is python package to easily create synthetic dataset of retaile store. Simulation Model Simulator consists

Corca AI 7 Sep 30, 2022
Python script to automate the plotting and analysis of percentage depth dose and dose profile simulations in TOPAS.

topas-create-graphs A script to automatically plot the results of a topas simulation Works for percentage depth dose (pdd) and dose profiles (dp). Dep

Sebastian Schäfer 10 Dec 08, 2022
Minimal working example of data acquisition with nidaqmx python API

Data Aquisition using NI-DAQmx python API Based on this project It is a minimal working example for data acquisition using the NI-DAQmx python API. It

Pablo 1 Nov 05, 2021
This repo contains a simple but effective tool made using python which can be used for quality control in statistical approach.

📈 Statistical Quality Control 📉 This repo contains a simple but effective tool made using python which can be used for quality control in statistica

SasiVatsal 8 Oct 18, 2022
Full ELT process on GCP environment.

Rent Houses Germany - GCP Pipeline Project: The goal of the project is to extract data about house rentals in Germany, store, process and analyze it u

Felipe Demenech Vasconcelos 2 Jan 20, 2022
A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

The leading use-case for the staircase package is for the creation and analysis of step functions. Pretty exciting huh. But don't hit the close button

48 Dec 21, 2022