Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Last update: Nov 20, 2021

Related tags

Data Analysis Covid-datapipeline-using-pyspark-and-mysql

Overview

Covid-datapipeline-using-pyspark-and-mysql

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Tools used : PySpark , MySQL

Procedure

Fetch latest data from API using requests & pandas module of python.
Apply some data processing and filtering to generate summarized information.
Store that summarized information into database using MySQL.

To build above pipeline i had used pyspark

{IMPORTANT}

Before move to the execution part please read below sentences

Use correct connector and drivername while making connection with MySQL db if you are going to use different db then procedure may differ.
change login credentials (username & password) in covid-config.json.
Make sure that mentioned database and table is already created.

How to use

clone Covid-datapipeline-using-pyspark-and-mysql repo.
start MySQL server
execute following command

  python main.py

Results:

command line output:

Database status after execution:

Owner

GitHub Repository

Visions provides an extensible suite of tools to support common data analysis operations

Visions And these visions of data types, they kept us up past the dawn. Visions provides an extensible suite of tools to support common data analysis

168 Dec 28, 2022

Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier Fancy data functions that will make your life as a data scientist easier. Installing To ins

3 Oct 03, 2022

CubingB is a timer/analyzer for speedsolving Rubik's cubes, with smart cube support

CubingB is a timer/analyzer for speedsolving Rubik's cubes (and related puzzles). It focuses on supporting "smart cubes" (i.e. bluetooth cubes) for recording the exact moves of a solve in real time.

5 Sep 18, 2022

BErt-like Neurophysiological Data Representation

BENDR BErt-like Neurophysiological Data Representation This repository contains the source code for reproducing, or extending the BERT-like self-super

114 Dec 23, 2022

Falcon: Interactive Visual Analysis for Big Data

Falcon: Interactive Visual Analysis for Big Data Crossfilter millions of records without latencies. This project is work in progress and not documente

803 Dec 27, 2022

A Python adaption of Augur to prioritize cell types in perturbation analysis.

A Python adaption of Augur to prioritize cell types in perturbation analysis.

2 Mar 29, 2022

Streamz helps you build pipelines to manage continuous streams of data

Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelines that involve branching, joining, flow control, feedbac

1.1k Dec 28, 2022

Randomisation-based inference in Python based on data resampling and permutation.

Randomisation-based inference in Python based on data resampling and permutation.

67 Dec 27, 2022

A crude Hy handle on Pandas library

Quickstart Hyenas is a curde Hy handle written on top of Pandas API to allow for more elegant access to data-scientist's powerhouse that is Pandas. In

4 Sep 05, 2022

WaveFake: A Data Set to Facilitate Audio DeepFake Detection

WaveFake: A Data Set to Facilitate Audio DeepFake Detection This is the code repository for our NeurIPS 2021 (Track on Datasets and Benchmarks) paper

27 Dec 22, 2022

The micro-framework to create dataframes from functions.

The micro-framework to create dataframes from functions.

762 Jan 07, 2023

A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

TennisBusinessIntelligenceProject - A project consists in a set of assignements corresponding to a BI process: data integration, construction of an OLAP cube, qurying of a OPLAP cube and reporting.

1 Jan 02, 2022

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is an project to extract, transform, and load large amount of data from NYC Taxi

2 Dec 12, 2021

COVID-19 deaths statistics around the world

COVID-19-Deaths-Dataset COVID-19 deaths statistics around the world This is a daily updated dataset of COVID-19 deaths around the world. The dataset c

4 Jul 10, 2022

Leverage Twitter API v2 to analyze tweet metrics such as impressions and profile clicks over time.

Tweetmetric Tweetmetric allows you to track various metrics on your most recent tweets, such as impressions, retweets and clicks on your profile. The

29 Oct 18, 2022

Weather analysis with Python, SQLite, SQLAlchemy, and Flask

Surf's Up Weather analysis with Python, SQLite, SQLAlchemy, and Flask Overview The purpose of this analysis was to examine weather trends (precipitati

1 Sep 05, 2021

Evaluation of a Monocular Eye Tracking Set-Up

Evaluation of a Monocular Eye Tracking Set-Up As part of my master thesis, I implemented a new state-of-the-art model that is based on the work of Che

19 Dec 17, 2022

MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

6 Nov 30, 2022

Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format

Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format.

2 Dec 01, 2021

TE-dependent analysis (tedana) is a Python library for denoising multi-echo functional magnetic resonance imaging (fMRI) data

tedana: TE Dependent ANAlysis TE-dependent analysis (tedana) is a Python library for denoising multi-echo functional magnetic resonance imaging (fMRI)

136 Dec 22, 2022