Spark Python

by Denny Imanuel

Overview

This mini project showcases how to build and debug an Apache Spark application using the Python programming language. There is also an option to run the Spark application on a Spark container.

Spark on Localhost

Requirement

  1. PyCharm IDE - You need to install the PyCharm IDE
  2. Java JDK - You need to install the Java JDK and set the JAVA_HOME environment variable
  3. Python - You need to install Python and set the PYTHONPATH environment variable
  4. Spark Hadoop - You need to install Spark (pre-built for Hadoop) and set the HADOOP_HOME and SPARK_HOME environment variables

For more info: https://dotnet.microsoft.com/en-us/learn/data/spark-tutorial/install-spark
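
As a quick sanity check, the short sketch below (not part of the original project) prints the environment variables listed above, so a missing JAVA_HOME, PYTHONPATH, HADOOP_HOME, or SPARK_HOME shows up before the first spark-submit:

import os

# Print the environment variables required by the local Spark setup above.
for var in ("JAVA_HOME", "PYTHONPATH", "HADOOP_HOME", "SPARK_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))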

Run Config

To run the Spark app, run the Spark Submit command or create a new 'Run Config' under Shell Script as follows:

set PYSPARK_PYTHON="\SparkPython\venv\Scripts\python.exe"
spark-submit --class SparkPython SparkPython.py
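
For reference, SparkPython.py can be as small as the sketch below (an illustrative example, not the project's actual source). It starts a SparkSession, builds a small DataFrame, and prints it as a table:

from pyspark.sql import SparkSession

# Start a local SparkSession named after the application.
spark = SparkSession.builder.appName("SparkPython").getOrCreate()

# Build a small DataFrame and print it as a table.
df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "value"])
df.show()

spark.stop()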

    

Build Config

To build the Spark app, run the Spark Submit command or create a new 'Build Config' under Python Debug Server as follows:

venv\Scripts\activate
pip install pydevd-pycharm~=

Debug Config

To debug the Spark app, create a 'Debug Config' using the standard Python run configuration and then insert the following code. To debug, run the above 'Build Config' first, set a breakpoint, and then run this 'Debug Config':

import pydevd_pycharm
pydevd_pycharm.settrace('localhost', port=8888, stdoutToServer=True, stderrToServer=True)
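
In context, the debug hook typically sits near the top of SparkPython.py, before any Spark work starts. The sketch below is only an illustration of that placement; the port (8888 here) must match the one configured in the Python Debug Server config:

from pyspark.sql import SparkSession

# Attach to the PyCharm debug server before any Spark work starts.
# These are the debug lines to comment out later when submitting to the cluster.
import pydevd_pycharm
pydevd_pycharm.settrace('localhost', port=8888, stdoutToServer=True, stderrToServer=True)

spark = SparkSession.builder.appName("SparkPython").getOrCreate()
# ... rest of the application ...
spark.stop()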

Spark on Docker

Requirement

  1. Rider IDE / Visual Studio - You need to install Rider IDE or Visual Studio
  2. Docker Desktop - You need to install Docker Desktop to run Docker
  3. Spark Image - Make sure you pull the same version of the Spark image as your local Spark installation:

docker pull bitnami/spark:3.1.2
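
To confirm the versions line up, the local PySpark version can be checked with the small sketch below (it assumes pyspark is installed in the active environment) and compared against the image tag:

import pyspark

# The printed version should match the tag of the pulled Spark image (3.1.2 above).
print(pyspark.__version__)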

Spark Clusters

The Docker Compose setup below runs a Spark cluster with a master node and a worker node. First comment out the debug lines (6, 7), then pack the venv folder into venv.tar.gz, and then submit both the SparkPython.py file and venv.tar.gz to the Spark cluster:

docker-compose up
spark-submit --master spark://localhost:7070 --class SparkPython SparkPython.py --archives venv.tar.gz
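
The README does not say how the venv folder is packed; one possible approach (an assumption, not from the original) is the venv-pack package, which can build the archive from Python while the venv is activated:

import venv_pack  # assumes: pip install venv-pack

# Archive the activated virtual environment so it can be shipped
# to the cluster with the --archives option above.
venv_pack.pack(output='venv.tar.gz')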

Output Result

If the Spark application builds and runs successfully, it should print out the result table.
