A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
A modern,feature-rich, and async ready API wrapper for Discord written in Python

discord.io A modern, easy to use, feature-rich, and async ready API wrapper for Discord written in Python. Key Features Modern Pythonic API using asyn

Vincent 18 Jan 02, 2023
Another secured and Yet Fastest telegram userbot

Vision-UserBot A stable, simple Telegram UserBot in Pyrogram! Support Variables ➨ TG_APP_ID - Your Telegram Api id. ➨ TG_API_HASH - Your Telegram Api

TeamVision 40 Oct 24, 2022
A library for demo trading | backtest and forward test simulation

Trade Engine a library for demo trading | backtest and forward test simulation Features Limit/Market orders: you can place a Limit or Market order in

Ali Moradi 7 Jul 02, 2022
Crypto-trading-simulator - Cryptocurrency trading simulator using Python, Streamlit

Crypto Trading Simulator Run streamlit run main.py Dependency Python 3 streamli

Brad 12 Jul 02, 2022
Pythonic and easy iCalendar library (rfc5545)

ics.py 0.8.0-dev : iCalendar for Humans Original repository (GitHub) - Bugtracker and issues (GitHub) - PyPi package (ics) - Documentation (Read The D

ics.py 513 Jan 02, 2023
Python Client for Instagram API

This project is not actively maintained. Proceed at your own risk! python-instagram A Python 2/3 client for the Instagram REST and Search APIs Install

Facebook Archive 2.9k Dec 30, 2022
The best Fortnite all-in-one lobby bot!

Recommended to use on Python v3.8 stable for bot. FLB The best free Fortnite lobby bot experience! Discord server: PDennSploit Softworks LLC Getting S

Payson Holmes 2 May 11, 2022
Rocks vc Userbot: A Telegram Bot Project That's Allow You To Play Audio And Video Music On Telegram Voice Chat Group

⭐️ Rocks VC Userbot ⭐️ Telegram Userbot To Play Audio And Video Song On VC Chat

Dr Asad Ali 10 Jul 18, 2022
Student-Management-System-in-Python - Student Management System in Python

Student-Management-System-in-Python Student Management System in Python

G.Niruthian 3 Jan 01, 2022
Fast discord token checker with high cpm

Discord-Token-checker Fast discord token checker with high cpm preivew Download git clone https://github.com/TusTusDev/Discord-Token-checker pip insta

Tustus 1 Oct 15, 2021
A discord program that will send a message to nearly every user in a discord server

Discord Mass DM Scrapes users from a discord server to promote/mass dm Report Bug · Request Feature Features Asynchronous Easy to use Free Auto scrape

dropout 56 Jan 02, 2023
• Create Your Own YouTube Info Api.

youtube_data_api • Create Your Own YouTube Info Api. Deploy How to Use https://{ Heroku App Name }.herokuapp.com/api?link={YouTube link} In local Host

lokaman chendekar 12 Oct 02, 2022
Discord heximals: More colors for your bot

DISCORD-HEXIMALS More colors for your bot ! Support : okimii#0434 COLORS ( 742 )

4 Feb 04, 2022
Github repository started notify 💕

Github repository started notify 💕

4 Aug 06, 2022
A Python interface module to the SAS System. It works with Linux, Windows, and mainframe SAS. It supports the sas_kernel project (a Jupyter Notebook kernel for SAS) or can be used on its own.

A Python interface to MVA SAS Overview This module creates a bridge between Python and SAS 9.4. This module enables a Python developer, familiar with

SAS Software 319 Dec 19, 2022
ANKIT-OS/TG-SESSION-HACK-BOT: A Special Repository.Telegram Bot Which Can Hack The Victim By Using That Victim Session

🔰 ᵀᴱᴸᴱᴳᴿᴬᴹ ᴴᴬᶜᴷ ᴮᴼᵀ 🔰 The owner would not be responsible for any kind of bans due to the bot. • ⚡ INSTALLING ⚡ • • 🛠️ Lᴀɴɢᴜᴀɢᴇs Aɴᴅ Tᴏᴏʟs 🔰 • If

ANKIT KUMAR 2 Dec 24, 2021
Python API to interact with Uwazi

Python Uwazi API Quick Start To use the API install the requirements pip3 install -r requirements.txt and use it like this: uwazi_adapter = UwaziAdap

HURIDOCS 2 Dec 16, 2021
Detects members having unicode names. Public bot: @scarletwitchprobot

✨ Scarletwitch bot ✨ Detects unicode names members in a tg chat & provides a option to take action on that user ! Public bot: @scarletwitchprobot Supp

ÁÑÑÍHÌLÅTØR SPÄRK 18 Nov 12, 2022
Modified Version Of Media Search bot

Modified Version Of Media Search bot

1 Oct 09, 2021
Automatically Message From Discord Account

Discord-AutoMessage A robust and versatile solution for automated social interactions HOW TO INSTALL Open cmd cd into your project directory Run the f

13 Jul 11, 2022