ETL pipeline on movie data using Python and PostgreSQL

Last update: Jul 07, 2021

Movies-ETL

Overview

This project consisted of an automated Extract, Transform, Load (ETL) pipeline. The pipeline extracted movie data from Wikipedia, Kaggle, and MovieLens, then cleaned, transformed, and merged it using Pandas. The product was a merged table of movies and ratings loaded into PostgreSQL.
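A minimal sketch of the merge step, not the project's actual code. It assumes the cleaned Wikipedia and Kaggle tables share an `imdb_id` column and that ratings reference movies through a MovieLens-style `movieId`. All column names and sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical cleaned inputs; the real column names may differ.
wiki_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "director": ["Director A", "Director B"],
})
kaggle_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "id": [1, 2],  # assumed to match movieId in the ratings data
    "title": ["Movie A", "Movie B"],
})
ratings_df = pd.DataFrame({
    "movieId": [1, 1, 1, 2],
    "rating": [5.0, 5.0, 3.0, 4.0],
})

# Inner join keeps only movies present in both sources.
movies_df = pd.merge(wiki_df, kaggle_df, on="imdb_id", how="inner")

# Count how many times each movie received each rating value.
# fill_value=0 covers movies that never received a given rating level.
rating_counts = (
    ratings_df.groupby(["movieId", "rating"]).size()
    .unstack(fill_value=0)
)
rating_counts.columns = [f"rating_{value}" for value in rating_counts.columns]

# Left join so movies with no ratings at all are kept, then zero-fill them.
movies_with_ratings = movies_df.merge(
    rating_counts, left_on="id", right_index=True, how="left"
)
count_columns = list(rating_counts.columns)
movies_with_ratings[count_columns] = movies_with_ratings[count_columns].fillna(0)
```

The left join is a deliberate choice here: an inner join would silently drop movies that have no ratings.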

Resources

Data sources:
- movies_metadata.csv
- ratings.csv
- wikipedia_movies.json
Software:
- Python
- PostgreSQL
- Pandas
- SQLAlchemy
- Regular Expressions

Results

Final output table: FINAL_Merged_Movies_and_Ratings.csv
Datasets uploaded to PostgreSQL for other users to analyze movie data (hackathon).
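Uploading a large file such as ratings.csv can be done in chunks so the whole file never has to fit in memory. The sketch below is one possible approach, not the project's actual loader. The function name, chunk size, and sample columns are assumptions, and the demo uses an in-memory SQLite database so it runs without a PostgreSQL server.

```python
import io
import sqlite3

import pandas as pd


def load_csv_in_chunks(csv_source, table_name, connection, chunk_size=1_000_000):
    """Append a CSV to a database table one chunk at a time; return rows loaded."""
    rows_loaded = 0
    for chunk in pd.read_csv(csv_source, chunksize=chunk_size):
        chunk.to_sql(table_name, connection, if_exists="append", index=False)
        rows_loaded += len(chunk)
    return rows_loaded


# Demo with an in-memory SQLite database and a tiny stand-in for ratings.csv.
# For PostgreSQL, `connection` would instead be a SQLAlchemy engine, e.g.
# sqlalchemy.create_engine("postgresql://user:password@host:5432/dbname")
# (placeholder credentials).
sample_csv = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,1,5.0,1000\n"
    "1,2,4.0,1001\n"
    "2,1,3.0,1002\n"
)
connection = sqlite3.connect(":memory:")
total_rows = load_csv_in_chunks(sample_csv, "ratings", connection, chunk_size=2)
```

Returning the running row count makes it easy to log progress between chunks during a long load.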

Summary

The pipeline was created under the following assumptions:

- I was able to join the Wikipedia, Kaggle, and ratings movie data on the IMDb ID column.
- The Wikipedia dataset didn't have an IMDb ID column, so I had to extract the ID from the URL given.
- Each dataset had to be cleaned on its own because of overlapping columns (such as 'Director' and 'Directed By'), unnecessary columns, many null values, TV shows, outliers, duplicates, incorrect data types, formatting issues, and other errors.
- The Wikipedia movie data was in JSON format.
- Not every movie had a rating for each rating level.
- The ratings dataset had more than 26 million entries, which created a time constraint and a data processing challenge.
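Extracting the IMDb ID from a link can be sketched with a regular expression. This is an illustration under assumptions, not the project's actual code: the field name `imdb_link` and the sample records are hypothetical, and the pattern assumes IDs of the form `tt` followed by at least seven digits.

```python
import re

# Assumed ID format: "tt" followed by at least seven digits.
IMDB_ID_PATTERN = re.compile(r"tt\d{7,}")


def extract_imdb_id(imdb_link):
    """Return the IMDb ID found in a link, or None if there is none."""
    if not isinstance(imdb_link, str):
        return None
    match = IMDB_ID_PATTERN.search(imdb_link)
    return match.group(0) if match else None


# Hypothetical records shaped like entries in wikipedia_movies.json.
wiki_movies = [
    {"title": "Movie A", "imdb_link": "https://www.imdb.com/title/tt0000001/"},
    {"title": "Movie B", "imdb_link": None},
    {"title": "Movie A (duplicate)", "imdb_link": "https://www.imdb.com/title/tt0000001/"},
]

# Keep the first record seen for each ID; drop records without an ID.
movies_by_id = {}
for movie in wiki_movies:
    imdb_id = extract_imdb_id(movie.get("imdb_link"))
    if imdb_id is not None:
        movies_by_id.setdefault(imdb_id, {**movie, "imdb_id": imdb_id})
```

With the data in a DataFrame, the same pattern could be passed to pandas `Series.str.extract` (wrapped in a capture group) to fill an ID column in one step.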

Owner

Juan Nicolas Serrano
