A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
Change the name and pfp of ur accounts, uses tokens.txt for ur tokens.

Change the name and pfp of ur accounts, uses tokens.txt for ur tokens. Also scrapes the pfps+names from a server chosen by you. For hq tokens go to discord.gg/tokenshop or t.me/praisetelegram

cChimney 36 Dec 09, 2022
Reverse engineered connection to the TradingView ticker in Python

Tradingview-ticker Reverse engineered connection to the TradingView ticker in Python. Makes a websocket connection to the Tradeview website and receiv

Aaron 20 Dec 02, 2022
Some random bot for Discord which was created just for fun (Made with Discord.py library)

Ghosty Previously known as 'secondthunder-py-bot' This is repository of some random bot for Discord which was created just for fun and for some educat

Владислав 8 Oct 02, 2022
Send embeds using your discord personal account

Welcome to Embed Sender 👋 Send embeds using your discord personal account Install pip install -r requirements.txt Usage Put your discord token in ./

SkydenFly 11 Sep 07, 2022
Script Crack Facebook, and Instagram 🚶‍♂

in-mbf Script Crack Facebook, and Instagram 🚶‍♂ Bukti Install Script $ pkg update && pkg upgrade $ pkg install git $ pkg install python2 $ pip2 insta

Yumasaa 5 Dec 27, 2021
Pythonic event-processing library based on decorators

Process Events In Style This library aims to simplify the common pattern of event processing. It simplifies the process of filtering, dispatching and

Nicolas Marier 3 Sep 01, 2022
A bot which provides online/offline and player status for Thicc SMP, using Replit.

AlynaaStatus A bot which provides online/offline and player status for Thicc SMP. Currently being hosted on Replit. How to use? Create a repl on Repli

QuanTrieuPCYT 8 Dec 15, 2022
A MassDM selfbot which is working in 2021

mass-dm-discord - Little preview of the Logger and the Spammer Features Logging User IDS Sending DMs to the logged IDs Blacklist IDs (add the ID of th

karma.meme 88 Dec 26, 2022
Uses discords api to see if a token has a valid payment method.

Discord Payment Checker Uses discords api to see if a token has a valid payment method. Report Bug · Request Feature Features Checks tokens Checks all

dropout 10 Dec 01, 2022
A Tᴇʟᴇɢʀᴀᴍ Vɪᴅᴇᴏ Pʟᴀʏᴇʀ Bᴏᴛ Tᴏ Pʟᴀʏ YT Vɪᴅᴇᴏs & Lɪᴠᴇ Sᴛʀᴇᴀᴍ.

Tuktuky_Music Telegram bot to stream videos in telegram voicechat for both groups and channels. Supports live strams, YouTube videos and telegram medi

TᑌKTᑌKY ᖇᗩᕼᗰᗩᑎ 3 Sep 14, 2021
Hellomotoot - PSTN Mastodon Client using Mastodon.py and the Twilio API

Hello MoToot PSTN Mastodon Client using Mastodon.py and the Twilio API. Allows f

Lorenz Diener 9 Nov 22, 2022
This checks that your credit card is valid or not

Credit_card_Validator This checks that your credit card is valid or not. Where is the app ? main.exe is the application to run and main.py is the file

Ritik Ranjan 1 Dec 21, 2021
Translator based on Google API

Yakusu Toshiko Translator based on Google API. Instance of this bot is running as @yakusubot. Features Add a plus to a language's name to show an orig

Arisu W. 2 Sep 21, 2022
A custom discord bot maker in python

custom-discord-bot-maker Sorry for using Translator. Each description may be inaccurate. how to use 1. Make new application at https://discord.com/dev

2 Nov 29, 2021
Live Weather Updates using Flask and OpenWeather

AuraX Live Weather Updates using Flask and OpenWeather Installation To setup this project on your local machine, first clone this repository and insta

Ayush Gupta 3 Nov 02, 2021
Python wrapper for WhatsApp web-based on selenium

alright Python wrapper for WhatsApp web made with selenium inspired by PyWhatsApp Why alright ? I was looking for a way to control and automate WhatsA

Jordan Kalebu 193 Jan 06, 2023
WhatsAppCrashingToolv1.1 - WhatsApp Crashing Tool v1.1

WhatsAppCrashingTool v1.1 This is just for Educational Purpose WhatsApp Crashing

E4crypt3d 3 Dec 20, 2022
Properly-formatted dynamic timestamps for Discord messages

discord-timestamps discord-timestamps generates properly-formatted dynamic timestamps for Discord messages, with support for Arrow objects. format

Ben Soyka 2 Mar 10, 2022
A Telegram Bot to manage your music channel with some cool features.

Music Channel Manager V2 A Telegram Bot to manage your music channel with some cool features like appending your predefined username to the musics tag

11 Oct 21, 2022
tgEasy's Official Assistant Bot and Example Bot

tgEasy Assistant The assistant bot that helps people with tgEasy directly on Telegram. This repository contains the source code of @tgEasyRobot and th

Divide Projects™ 4 Dec 26, 2022