Processing NYC Taxi Data using PySpark ETL pipeline

Description

This project extracts, transforms, and loads large amounts of data from the NYC Taxi Rides dataset (hosted on AWS S3). It reads large CSV files (~2 GB per month), applies transformations such as datatype conversions and dropping rows and columns that are not useful, and writes the result back in Parquet format. This saves time for downstream tasks such as machine learning. It also saves a huge amount of space (~97% reduction from CSV to Parquet), which makes the data easy to store for those tasks.
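The flow described above (read CSV, cast datatypes, drop unneeded rows and columns, write Parquet) might look roughly like the sketch below. It is not the project's actual code: the column names, the type mapping, the dropped column, and the row filter are all assumptions chosen for illustration, and the real mapping presumably lives in configs/cols_config.json.

```python
"""Sketch of a CSV -> Parquet ETL step in PySpark. All names are illustrative."""

# Hypothetical column -> Spark SQL type mapping (assumption, not the project's config).
COLUMN_TYPES = {
    "tpep_pickup_datetime": "timestamp",
    "tpep_dropoff_datetime": "timestamp",
    "passenger_count": "int",
    "trip_distance": "double",
    "fare_amount": "double",
}

# Hypothetical list of columns that are not needed downstream.
DROP_COLUMNS = ["store_and_fwd_flag"]


def cast_expressions(columns, column_types):
    """Build SQL expressions that cast configured columns and pass the rest through."""
    return [
        f"CAST(`{c}` AS {column_types[c]}) AS `{c}`" if c in column_types else f"`{c}`"
        for c in columns
    ]


def run_etl(input_path, output_path):
    """Read one raw CSV file, clean it, and write it out as Parquet."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nyc-taxi-etl-sketch").getOrCreate()

    # Extract: without a schema, every CSV column is read as a string.
    raw = spark.read.csv(input_path, header=True)

    # Transform: drop unneeded columns, then cast the remaining ones.
    kept = [c for c in raw.columns if c not in DROP_COLUMNS]
    typed = raw.selectExpr(*cast_expressions(kept, COLUMN_TYPES))

    # Transform: drop rows without timestamps or with a non-positive distance.
    clean = typed.dropna(
        subset=["tpep_pickup_datetime", "tpep_dropoff_datetime"]
    ).filter("trip_distance > 0")

    # Load: write the cleaned data in Parquet format.
    clean.write.mode("overwrite").parquet(output_path)
```

Calling run_etl with a gs:// input and output path would process one monthly file, provided the cluster has a GCS connector configured (Dataproc clusters normally do).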

How to use it (Using GCP as the cloud service of choice)

Set up a bucket on Google Cloud Storage (GCS)
Use get_raw_data.sh to download the raw CSV files from S3 to the GCS bucket
Set up a GCP Dataproc cluster
SSH into the master node and copy the entire project folder to the persistent disk
Edit the application configuration file (configs/app_config.json)
Submit the job: spark-submit main.py --filename [raw_data_filename], or run submit_job.sh with the appropriate arguments
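As an illustration of the last two steps, the sketch below shows one way an entry point could read the --filename argument and a JSON configuration file, then derive input and output paths. Only --filename comes from this README; the --config flag and the config keys (bucket, raw_prefix, parquet_prefix) are hypothetical and may not match the real app_config.json.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the command-line arguments of the job."""
    parser = argparse.ArgumentParser(description="NYC taxi ETL job (sketch)")
    parser.add_argument("--filename", required=True,
                        help="name of the raw CSV file in the bucket")
    # Hypothetical flag: lets the config location be overridden.
    parser.add_argument("--config", default="configs/app_config.json",
                        help="path to the application config file")
    return parser.parse_args(argv)


def load_config(path):
    """Load the application configuration from a JSON file."""
    with open(path) as fh:
        return json.load(fh)


def build_paths(config, filename):
    """Return (input_path, output_path) for one raw file.

    The keys used here are assumptions made for this sketch.
    """
    bucket = config["bucket"]
    stem = filename.rsplit(".", 1)[0]  # file name without its extension
    input_path = f"gs://{bucket}/{config['raw_prefix']}/{filename}"
    output_path = f"gs://{bucket}/{config['parquet_prefix']}/{stem}"
    return input_path, output_path
```

An entry point would then call parse_args(), pass args.config to load_config, and hand the result of build_paths to the ETL routine.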

Project structure

root/
|---bash/
|   |---create_cluster.sh
|   |---install.sh
|---configs/
|   |---app_config.json
|   |---cols_config.json
|---jobs/
|   |---etl_tasks.py
|   |---transformations.py
|---get_raw_data.sh
|---main.py
|---requirements.txt
|---submit_job.sh

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Owner

Unnikrishnan
