songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system

Overview

Sparkify

Songplays User activity datamart

Status GitHub Issues GitHub Pull Requests License


The following document describes the model used to build the songplays datamart table and the respective ETL process.

Table of Contents

About

The songplays datamart provide details about the musical taste of our customers and can help us to improve our recomendation system.

This document describes the model of songplays table datamart on sparkify_app schema inside a container sparkify_postgres, and the Python code to load new data. The production directory and data must be simmilar to those in mnt/data/log_data and mnt/data/song_data paths in this repository.

🏁 Getting Started

First you need to have the right permissions to access the source files and write them into sparkify_app tables that generates the songplays datamart table. Contact the owners or your team leader for more information.

Data Model and Schema


songplays datamart

Source files and owners

File or table Description Directory Owner
YYYY-MM-DD-events.json User events. mnt/data/log_data/YYYY/11 Person 1
.json Song data. mnt/data/song_data/a Person 2
songplays Datamart for recomendation system. sparkify_app.songplays Person 3
artists Dimension table for artists. sparkify_app.artists Person 1
songs Dimension table for songs. sparkify_app.songs Person 1
time Dimension table for streaming start time for a given song. sparkify_app.time Person 2
users Dimension table for users. sparkify_app.users Person 3

Prerequisites


To run this project first you need to install the Docker Engine for your operational system and Docker Compose.

After installing and configuring the Docker tools, download this repository and create a folder named postgres that will store all sparkify_postgres service data. To build the proper images and run the services, just execute the following command inside this repository:

docker-compose up

If the service runs successfully you should see something like this:

...
sparkify_python      | 28/30 files processed.
sparkify_python      | 29/30 files processed.
sparkify_python      | 30/30 files processed.
sparkify_python exited with code 0

You can also check the job by following these steps:

  • Open your browser and access localhost:16543: pga1

    • Enter with the following credentials to authenticate:
  • After you log in, click on the Servers option at the upper corner on the left: pga2

    • You will be asked to enter with the PostgreSQL credentials:
      • User: sparkifypsql
      • Password: p4ssw0rd
  • Select the Query Tools under the Tools menu: pga3

  • Under the Query Editor, run the following query:

    SELECT * FROM sparkify_app.songplays WHERE song_id is NOT NULL and artist_id is NOT NULL;
    • You should get only 5 rows. pga3

Microservice architecture

The following image represents the microservice architecture for this project: topology

Where:

  • sparkify_python: runs all Python scripts and stores raw data.
  • sparkify_postgres: runs Postgre and stores the database.
  • sparkify_pgadmin: runs the pgAdmin tool to monitor the sparkify_postgres service.

⛏️ Built Using

✍️ Authors

Owner
Leandro Kellermann de Oliveira
Leandro Kellermann de Oliveira
Data and code accompanying the paper Politics and Virality in the Time of Twitter

Politics and Virality in the Time of Twitter Data and code accompanying the paper Politics and Virality in the Time of Twitter. In specific: the code

Cardiff NLP 3 Jul 02, 2022
Anomaly Detection with R

AnomalyDetection R package AnomalyDetection is an open-source R package to detect anomalies which is robust, from a statistical standpoint, in the pre

Twitter 3.5k Dec 27, 2022
a tool that compiles a csv of all h1 program stats

h1stats - h1 Program Stats Scraper This python3 script will call out to HackerOne's graphql API and scrape all currently active programs for informati

Evan 40 Oct 27, 2022
Analyzing Earth Observation (EO) data is complex and solutions often require custom tailored algorithms.

eo-grow Earth observation framework for scaled-up processing in Python. Analyzing Earth Observation (EO) data is complex and solutions often require c

Sentinel Hub 18 Dec 23, 2022
Projeto para realizar o RPA Challenge . Utilizando Python e as bibliotecas Selenium e Pandas.

RPA Challenge in Python Projeto para realizar o RPA Challenge (www.rpachallenge.com), utilizando Python. O objetivo deste desafio é criar um fluxo de

Henrique A. Lourenço 1 Apr 12, 2022
ELFXtract is an automated analysis tool used for enumerating ELF binaries

ELFXtract ELFXtract is an automated analysis tool used for enumerating ELF binaries Powered by Radare2 and r2ghidra This is specially developed for PW

Monish Kumar 49 Nov 28, 2022
Randomisation-based inference in Python based on data resampling and permutation.

Randomisation-based inference in Python based on data resampling and permutation.

67 Dec 27, 2022
MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

SisonkeBiotik 6 Nov 30, 2022
Performance analysis of predictive (alpha) stock factors

Alphalens Alphalens is a Python Library for performance analysis of predictive (alpha) stock factors. Alphalens works great with the Zipline open sour

Quantopian, Inc. 2.5k Jan 09, 2023
Demonstrate a Dataflow pipeline that saves data from an API into BigQuery table

Overview dataflow-mvp provides a basic example pipeline that pulls data from an API and writes it to a BigQuery table using GCP's Dataflow (i.e., Apac

Chris Carbonell 1 Dec 03, 2021
Manage large and heterogeneous data spaces on the file system.

signac - simple data management The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, and reproduc

Glotzer Group 109 Dec 14, 2022
CSV database for chihuahua (HUAHUA) blockchain transactions

super-fiesta Shamelessly ripped components from https://github.com/hodgerpodger/staketaxcsv - Thanks for doing all the hard work. This code does only

Arlene Macciaveli 1 Jan 07, 2022
SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

SNV Pipeline SNV calling pipeline developed explicitly to process individual or trio vcf files obtained from Illumina based pipeline (grch37/grch38).

East Genomics 1 Nov 02, 2021
Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Fast Laplacian Eigenmaps in python Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python. Comes with an wrapper for NMS

17 Jul 09, 2022
Data cleaning tools for Business analysis

Datacleaning datacleaning tools for Business analysis This program is made for Vicky's work. You can use it, too. 数据清洗 该数据清洗工具是为了商业分析 这个程序是为了Vicky的工作而

Lin Jian 3 Nov 16, 2021
For making Tagtog annotation into csv dataset

tagtog_relation_extraction for making Tagtog annotation into csv dataset How to Use On Tagtog 1. Go to Project Downloads 2. Download all documents,

hyeong 4 Dec 28, 2021
Monitor the stability of a pandas or spark dataframe ⚙︎

Population Shift Monitoring popmon is a package that allows one to check the stability of a dataset. popmon works with both pandas and spark datasets.

ING Bank 403 Dec 07, 2022
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
Statistical package in Python based on Pandas

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. F

Raphael Vallat 1.2k Dec 31, 2022
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 09, 2023