ETL pipeline on movie data using Python and PostgreSQL

Last update: Jul 07, 2021

Movies-ETL

Overview

This project consisted of an automated Extraction, Transformation, and Load (ETL) pipeline. The pipeline extracted movie data from Wikipedia, Kaggle, and MovieLens, then cleaned, transformed, and merged it using Pandas. The product was a merged table of movies and ratings loaded into PostgreSQL.
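
As an illustration of the extraction step, here is a minimal sketch that reads the Wikipedia JSON file and the Kaggle CSV file into Pandas DataFrames. The function name `extract` and the assumption that the JSON file holds an array of movie records are mine, not taken from the project code.

```python
import json

import pandas as pd


def extract(wiki_json_path, kaggle_csv_path):
    """Read the raw Wikipedia JSON and Kaggle CSV files into DataFrames."""
    # Assumption: the Wikipedia file is a JSON array with one object per movie.
    with open(wiki_json_path, encoding="utf-8") as f:
        wiki_records = json.load(f)
    wiki_df = pd.DataFrame(wiki_records)

    # low_memory=False lets pandas infer each column's dtype from the whole file
    # instead of chunk by chunk, which avoids mixed-type columns.
    kaggle_df = pd.read_csv(kaggle_csv_path, low_memory=False)
    return wiki_df, kaggle_df
```

With the files listed under Resources, a call might look like `extract("wikipedia_movies.json", "movies_metadata.csv")`, assuming both files sit in the working directory.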

Resources

Data sources:
- movies_metadata.csv
- ratings.csv
- wikipedia_movies.json
Software:
- Python
- PostgreSQL
- Pandas
- SQLAlchemy
- Regular Expressions

Results

Final output table: FINAL_Merged_Movies_and_Ratings.csv
The datasets were uploaded to PostgreSQL so that other users could analyze the movie data (hackathon).
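
One possible way to do the load step is to append the data to a table in chunks, which keeps memory use bounded for a large file such as ratings.csv. The sketch below is an assumption about how this could be done, not the project's actual code: the function name `load_csv_in_chunks`, the default chunk size, and the table name are all hypothetical.

```python
import pandas as pd


def load_csv_in_chunks(csv_path, table_name, con, chunksize=1_000_000):
    """Append a CSV file to a database table one chunk at a time.

    `con` is whatever pandas' DataFrame.to_sql accepts: a SQLAlchemy engine
    or connection (for PostgreSQL) or a sqlite3 connection (for local tests).
    """
    rows_loaded = 0
    # With chunksize set, read_csv yields DataFrames instead of one big frame.
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        # if_exists="append" creates the table on the first chunk and
        # adds rows to it on every later chunk.
        chunk.to_sql(name=table_name, con=con, if_exists="append", index=False)
        rows_loaded += len(chunk)
    return rows_loaded
```

For PostgreSQL, `con` would typically be an engine built with SQLAlchemy's `create_engine` and a connection string of the form `postgresql://user:password@host:port/database`; the actual credentials and database name are not given in this README.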

Summary

The pipeline was created under the following assumptions:

- I was able to join the Wikipedia, Kaggle, and ratings movie data on the IMDb ID column.
- The Wikipedia dataset didn't have an IMDb ID column, so I had to extract the ID from the URL given.
- Each dataset had to be cleaned on its own because of overlapping columns (such as 'Director' and 'Directed By'), unnecessary columns, many null values, TV shows, outliers, duplicates, incorrect data types, formatting issues, and other errors.
- The Wikipedia movie data was in JSON format.
- Not every movie had a rating for each rating level.
- The ratings dataset had more than 26 million entries, which created a time constraint and a data-processing challenge.
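
The IMDb ID extraction and the join described above can be sketched as follows. This is a minimal example under stated assumptions: the column names `imdb_link` and `imdb_id`, the helper names, and the ID pattern ("tt" followed by at least seven digits) are illustrative and may differ from the project's code.

```python
import re

import pandas as pd

# Assumption: IMDb title IDs look like "tt" followed by at least seven digits.
IMDB_ID_PATTERN = re.compile(r"(tt\d{7,})")


def extract_imdb_id(url):
    """Return the IMDb ID found in a URL, or None if there is none."""
    if not isinstance(url, str):
        return None
    match = IMDB_ID_PATTERN.search(url)
    return match.group(1) if match else None


def merge_on_imdb_id(wiki_df, kaggle_df, link_column="imdb_link"):
    """Derive imdb_id for the Wikipedia rows, then inner-join with Kaggle."""
    wiki_df = wiki_df.copy()  # leave the caller's DataFrame untouched
    wiki_df["imdb_id"] = wiki_df[link_column].map(extract_imdb_id)
    # Rows with no usable ID, or with a repeated ID, cannot be joined reliably.
    wiki_df = wiki_df.dropna(subset=["imdb_id"]).drop_duplicates(subset="imdb_id")
    # Suffixes keep overlapping columns from both sources distinguishable.
    return wiki_df.merge(
        kaggle_df, on="imdb_id", how="inner", suffixes=("_wiki", "_kaggle")
    )
```

An inner join keeps only movies present in both sources. Columns that exist in both tables come back with `_wiki` and `_kaggle` suffixes, so they can be compared and consolidated in a later cleaning step.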

Owner

Juan Nicolas Serrano
