Pyspark Spotify ETL

Description

This is my first data engineering project. It extracts the user's recently played tracks from Spotify's API, transforms the data, and loads it into PostgreSQL through a SQLAlchemy engine. The data is displayed as a Spark DataFrame before loading, and the whole ETL job is scheduled with crontab. The script requests a fresh access token with an HTTP POST to Spotify's token API at the start of every run, so the job never runs with an expired token.
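As a rough illustration of the token refresh mentioned above, the sketch below builds the POST request for Spotify's token endpoint using the refresh-token grant. The function names and the credential arguments are hypothetical and are not taken from this repository's script. Only the standard library is used here, and the real script may use a different HTTP client.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_refresh_request(client_id, client_secret, refresh_token):
    """Build (but do not send) the POST request that exchanges a refresh token."""
    # Spotify expects the client credentials as HTTP Basic auth.
    credentials = f"{client_id}:{client_secret}".encode("utf-8")
    basic_auth = base64.b64encode(credentials).decode("ascii")
    # Form-encoded body for the refresh-token grant.
    body = urllib.parse.urlencode(
        {"grant_type": "refresh_token", "refresh_token": refresh_token}
    ).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic_auth}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_access_token(request):
    """Send the request and return the short-lived access token from the JSON reply."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["access_token"]
```

Calling `fetch_access_token(build_refresh_request(...))` at the top of the ETL script would give each run its own access token, which is the behaviour the description relies on.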

The purpose of this project is to help those who, like me, want to become data engineers create their first project.

Essentials

In addition to the Pipfile dependencies, the script imports these standard-library modules: sys, json, datetime.
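To show where json and datetime might come in, here is a minimal sketch of a transform step. It assumes the response shape of Spotify's recently played endpoint (an "items" list whose entries hold a "track" object and a "played_at" timestamp). The function names and the output column names are hypothetical, not the ones used in this repository.

```python
import json
from datetime import datetime, timedelta, timezone


def flatten_recently_played(payload):
    """Turn the API's JSON payload into one flat dict per played track."""
    rows = []
    for item in payload.get("items", []):
        track = item["track"]
        rows.append(
            {
                "song_name": track["name"],
                "artist_name": track["artists"][0]["name"],
                "played_at": item["played_at"],
                # The first 10 characters of an ISO 8601 timestamp are the date.
                "played_date": item["played_at"][:10],
            }
        )
    return rows


def yesterday_unix_ms(now=None):
    """Unix timestamp in milliseconds for 24 hours before `now` (default: current UTC time)."""
    now = now or datetime.now(timezone.utc)
    return int((now - timedelta(days=1)).timestamp() * 1000)


# Hypothetical sample payload, trimmed to the fields used above.
sample = json.loads(
    '{"items": [{"track": {"name": "Song A", "artists": [{"name": "Artist A"}]},'
    ' "played_at": "2021-05-01T10:15:30.000Z"}]}'
)
rows = flatten_recently_played(sample)
```

The value from `yesterday_unix_ms()` is meant for the endpoint's `after` query parameter, which takes a Unix timestamp in milliseconds. A list of dicts like `rows` should be usable as input to `SparkSession.createDataFrame` when the data is shown before loading.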

ETL Execution

Install all the necessary libraries from the Pipfile.
Read the "Token_request_instructions" file to get your own refresh token. If you would rather not, you can get a temporary token from https://developer.spotify.com/console/get-recently-played/, but it has to be replaced every hour.
Add your PostgreSQL credentials to the engine variable. If you use another RDBMS, see https://docs.sqlalchemy.org/en/14/core/engines.html for the matching connection URL format.
Create SQL Database/Table (Optional).
Create a bash file. This file is where you write the paths to Spark, Python, and your script. Cron jobs run with a minimal environment and do not load your ~/.bash_profile, so without this file you'll get a "ModuleNotFoundError" for each module your script imports. (Think of it as the ETL's own ~/.bash_profile.)
Create a new crontab, or use the existing one, if you want the job to run at midnight every day.
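For the engine step above, the following sketch shows one way the connection URL could be assembled. The helper names and the example credentials are hypothetical. The URL layout follows the format described in the SQLAlchemy engine documentation, and the password is escaped because characters such as "@" or "/" would otherwise break the URL.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Assemble a SQLAlchemy-style PostgreSQL URL, escaping special characters in the password."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{database}"


def make_engine(url):
    """Create the engine; SQLAlchemy is imported lazily so the URL helper works without it."""
    from sqlalchemy import create_engine

    return create_engine(url)


# Placeholder values for illustration only.
url = build_postgres_url("etl_user", "p@ss/word", "localhost", 5432, "spotify")
```

Reading the user name and password from environment variables instead of writing them into the script is a common choice, and it fits the bash file from the steps above, which could export them before the job runs.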

Extras

To verify that your scheduled job is working, you can temporarily change the crontab schedule to "* * * * *", which runs the job every minute.
See https://developer.spotify.com/documentation/general/guides/scopes/ for other Spotify scopes in case you don't want to use "recently played tracks".
Thank you, Karolina Sowinska, for your DE Beginners guide.
