Processing NYC Taxi Data using a PySpark ETL pipeline

Description

This is a project to extract, transform, and load a large amount of data from the NYC Taxi Rides database (hosted on AWS S3). It extracts data from large CSV files (~2 GB per month) and applies transformations such as datatype conversions and dropping rows and columns that are not useful. Finally, the data is written back in Parquet format. This saves time in downstream tasks such as machine learning. It also saves a huge amount of space (~97% space reduction from CSV to Parquet), which makes the data easy to store for those tasks.
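The extract, transform, and load steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the column names, the `CASTS` and `DROP_COLS` settings, the "positive trip distance" filter, and every function name here are assumptions chosen for the example (the real settings presumably live in configs/cols_config.json).

```python
# Hypothetical column -> Spark SQL type mapping (assumed, not taken from the project).
CASTS = {
    "tpep_pickup_datetime": "timestamp",
    "passenger_count": "int",
    "trip_distance": "double",
    "fare_amount": "double",
}
# Hypothetical list of columns that are not needed downstream.
DROP_COLS = ["store_and_fwd_flag"]


def select_exprs(columns):
    """Build Spark SQL select expressions: drop unwanted columns, cast the rest."""
    exprs = []
    for name in columns:
        if name in DROP_COLS:
            continue  # column is dropped entirely
        if name in CASTS:
            exprs.append(f"CAST(`{name}` AS {CASTS[name]}) AS `{name}`")
        else:
            exprs.append(f"`{name}`")  # passed through unchanged
    return exprs


def transform(df):
    """Apply the casts and column drops, then remove rows that failed to parse."""
    cleaned = df.selectExpr(*select_exprs(df.columns))
    required = [c for c in CASTS if c in cleaned.columns]
    cleaned = cleaned.dropna(subset=required)
    if "trip_distance" in cleaned.columns:
        # Assumed example of an "unuseful row" rule: keep only positive distances.
        cleaned = cleaned.filter("trip_distance > 0")
    return cleaned


def build_session(app_name="nyc-taxi-etl"):
    """Create the Spark session; imported lazily so the helpers above work without Spark."""
    from pyspark.sql import SparkSession
    return SparkSession.builder.appName(app_name).getOrCreate()


def run(spark, src_csv, dst_parquet):
    """Extract a CSV file, transform it, and load it as Parquet."""
    df = spark.read.csv(src_csv, header=True)  # every column is read as a string
    transform(df).write.mode("overwrite").parquet(dst_parquet)
```

On a cluster this would be driven by something like `run(build_session(), "<input CSV path>", "<output Parquet path>")`. Reading every column as a string and casting explicitly keeps the schema under the job's control instead of relying on schema inference over multi-gigabyte files.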

How to use it (Using GCP as the cloud service of choice)

Set up a bucket on Google Cloud Storage (GCS)
Use get_raw_data.sh to download the raw data from S3 as CSV files into the GCS bucket
Set up a GCP Dataproc cluster
SSH into the master node and copy the entire project folder to the Persistent Disk
Edit the configuration file for the application
Submit the job: spark-submit main.py --filename [raw_data_filename], or execute submit_job.sh with the appropriate args
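As a rough illustration of the last two steps, the entry point might read the `--filename` argument and the application config along these lines. Only `--filename` and configs/app_config.json come from this README; the `--config` flag, the `input_dir` and `output_dir` keys, and the function names are hypothetical.

```python
import argparse
import json
from pathlib import Path


def parse_args(argv=None):
    """Parse the command-line arguments passed after main.py."""
    parser = argparse.ArgumentParser(description="NYC Taxi ETL job")
    parser.add_argument("--filename", required=True,
                        help="name of the raw CSV file to process")
    # Hypothetical flag: lets the config location be overridden.
    parser.add_argument("--config", default="configs/app_config.json",
                        help="path to the application config file")
    return parser.parse_args(argv)


def load_config(path):
    """Load the JSON application config from disk."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def resolve_paths(config, filename):
    """Combine the assumed 'input_dir' and 'output_dir' settings with the file name."""
    src = f"{config['input_dir'].rstrip('/')}/{filename}"
    dst = f"{config['output_dir'].rstrip('/')}/{Path(filename).stem}.parquet"
    return src, dst
```

With a config such as `{"input_dir": "gs://my-bucket/raw", "output_dir": "gs://my-bucket/clean"}` (bucket name invented for the example), `resolve_paths` yields the CSV location to read and the Parquet location to write for the file named on the command line.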

Project structure

root/
|---bash/
    |---create_cluster.sh
    |---install.sh
|---configs/
    |---app_config.json
    |---cols_config.json
|---jobs/
    |---etl_tasks.py
    |---transformations.py
|   get_raw_data.sh
|   main.py
|   requirements.txt
|   submit_job.sh

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Owner

Unnikrishnan
