Vaex-Big-Data-Analytics-for-Airline-data

A Python notebook (.ipynb) created in Jupyter Notebook, which utilizes the Vaex library for Big Data Analytics of an Airline dataset.

Author: Nikolas Petrou, MSc in Data Science

Overview

The main part of the work focuses on the exploration of a big dataset of 17 GB. Specifically, the dataset contains information on flights within the United States between 1988 and 2018. It can be directly downloaded from: vaex.s3.us-east-2.amazonaws.com.

In addition, this project employs Vaex, a Python library for Out-of-Core DataFrames, in order to visualize, explore and calculate statistics of this big tabular dataset.

The goal of this project is to utilize Vaex to perform an Exploratory Data Analysis (EDA), as well as to predict the arrival delay of a flight using Machine Learning models (regression task).

What is Vaex and why Vaex?

Vaex is a Python library for lazy Out-of-Core DataFrames (similar to Pandas), used to visualize and explore big tabular datasets. It can calculate statistics such as the mean, sum, count and standard deviation on an N-dimensional grid, at a rate of up to a billion (10^9) objects/rows per second. Visualization is done using histograms, density plots and 3D volume rendering, allowing interactive exploration of big data. Furthermore, Vaex provides wrappers around powerful predictive-modelling libraries (e.g. Scikit-learn, xgboost), making them work efficiently with Vaex DataFrames. Vaex also implements a variety of standard data transformers (e.g. PCA, numerical scalers, categorical encoders) and a very efficient KMeans algorithm, all of which take full advantage of its out-of-core design. Finally, Vaex uses memory mapping, a zero-memory-copy policy, and lazy computations for best performance (no memory is wasted).
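
As a minimal sketch of this workflow (the file name airline.hdf5 and the column name DepDelay are illustrative assumptions, not taken from this repository), computing such statistics looks like this:

    import vaex

    # Memory-map the (previously converted) HDF5 file; nothing is read into RAM yet
    df = vaex.open('airline.hdf5')  # hypothetical path

    # Single-pass statistics over hundreds of millions of rows
    print(df.count())            # total number of rows
    print(df.mean(df.DepDelay))  # mean departure delay (assumed column name)
    print(df.std(df.DepDelay))   # standard deviation of the departure delay

    # Counts on a regular 1D grid (the "N-dimensional grid" mentioned above)
    counts = df.count(binby=df.DepDelay, limits=[-20, 120], shape=64)
    print(counts)  # a NumPy array of 64 bin counts, computed out-of-core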

Advantage of using Vaex over using Pandas with a more powerful machine

Switching to a more powerful machine (with more RAM and/or a better CPU) may solve some memory issues, but Pandas will still only use one of the 32 cores of your fancy machine. With Vaex, all operations are executed out-of-core, in parallel, and are lazily evaluated, allowing you to crunch through a billion-row dataset effortlessly.
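
A short sketch of what this laziness looks like in practice (the column names Distance and AirTime are assumptions based on the dataset description):

    import vaex

    df = vaex.open('airline.hdf5')  # memory-mapped, not copied into RAM

    # A virtual column: only the expression is stored, so no extra memory
    # is allocated, no matter how many rows the dataset has
    df['speed'] = df.Distance / (df.AirTime / 60)  # assumed columns; miles per hour

    # Nothing has been computed so far; this line triggers the actual work,
    # which runs in parallel across all available cores
    print(df.mean(df.speed))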

Data

The dataset is relatively big (17 GB) and contains information on flights within the United States between 1988 and 2018. It can be directly downloaded from: vaex.s3.us-east-2.amazonaws.com

Each row/record of the dataset represents an individual flight. Specifically, each record contains information about the airline (UniqueCarrier), the airports (origin and destination), and flight-level information such as the time schedule (day of week, day of month, month, year), flight distance, departure time and delay, and arrival time and delay.
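
To illustrate both project goals on this schema, here is a hedged sketch of a group-by aggregation and a simple regression using Vaex's Scikit-learn wrapper; apart from UniqueCarrier, the column names (ArrDelay, DepDelay, Distance, Month, DayOfWeek) are assumptions based on the description above, and the file path is hypothetical:

    import vaex
    from sklearn.linear_model import LinearRegression
    from vaex.ml.sklearn import Predictor

    df = vaex.open('airline.hdf5')  # hypothetical path to the converted dataset

    # EDA sketch: mean arrival delay per carrier, computed out-of-core
    per_carrier = df.groupby(by='UniqueCarrier',
                             agg={'mean_arr_delay': vaex.agg.mean('ArrDelay')})
    print(per_carrier)

    # Regression sketch: predict the arrival delay from a few numeric features
    features = ['Month', 'DayOfWeek', 'Distance', 'DepDelay']  # assumed columns
    df = df.dropna(column_names=features + ['ArrDelay'])
    df_train, df_test = df.ml.train_test_split(test_size=0.2)

    model = Predictor(model=LinearRegression(),
                      features=features,
                      target='ArrDelay',
                      prediction_name='ArrDelay_pred')
    model.fit(df_train)
    df_test = model.transform(df_test)  # adds a virtual ArrDelay_pred column
    print(df_test[['ArrDelay', 'ArrDelay_pred']].head(5))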

Owner

Nikolas Petrou
M.Sc. Data Science student, University of Cyprus (UCY); Research Assistant at the Laboratory of Internet Computing (LInC); B.Sc. degree in Computer Science