A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
Instagram GiftShop Scam Killer

Instagram GiftShop Scam Killer A basic tool for Windows which kills acess to any giftshop scam from Instagram. Protect your Instagram account from the

1 Mar 31, 2022
Spotify Top Lists - get the current top lists of a user from the Spotify API and display them in a Flask app

Spotify Top Lists This is a simple script that will get the current top lists of a user from the Spotify API and display them in a Flask app. Requirem

Yasin 0 Oct 16, 2022
Official API documentation for Highrise

Highrise API The Highrise API is implemented as vanilla XML over HTTP using all four verbs (GET/POST/PUT/DELETE). Every resource, like Person, Deal, o

Basecamp 128 Dec 06, 2022
A comand-line utility for taking automated screenshots of websites

shot-scraper A comand-line utility for taking automated screenshots of websites For background on this project see shot-scraper: automated screenshots

Simon Willison 837 Jan 07, 2023
Indian Space Research Organisation API With Python

ISRO Indian Space Research Organisation API Installation pip install ISRO Usage import isro isro.spacecrafts() # returns spacecrafts data isro.lau

Fayas Noushad 5 Aug 11, 2022
The public discord bot, created by: primitt, further developed by: duino-coin team.

Duino Stats Mini A public Duino-Stats Discord bot. Click this link to invite the bot to your server. License Duino Stats Mini distributed under the MI

primboi 8 Mar 14, 2022
UniHub API is my solution to bringing students and their universities closer

🎓 UniHub API UniHub API is my solution to bringing students and their universities closer... By joining UniHub, students will be able to join their r

Abdelbaki Boukerche 5 Nov 21, 2021
A Discord Bot coded using Python. Open to collaboration

DisPy-Bot A Discord Bot coded using Python. Open to collaboration La syntax pour intégrer le bot (imaginons la fonction lol_reponse dans le fichier au

BiMathAx 2 Mar 03, 2022
A Telegram Bot which will ask new Group Members to verify them by solving an emoji captcha.

Emoji-Captcha-Bot A Telegram Bot which will ask new Group Members to verify them by solving an emoji captcha. About API: Using api.abirhasan.wtf/captc

Abir Hasan 52 Dec 11, 2022
Mushahid Ali 1 Dec 31, 2021
stories-matiasucker created by GitHub Classroom

Stories do Instagram Este projeto tem como objetivo desenvolver uma pequena aplicação que simule os efeitos e funcionalidades ao estilo Instagram. A a

1 Dec 20, 2021
Modified Version Of Media Search bot

Modified Version Of Media Search bot

1 Oct 09, 2021
Notion4ever - Python tool for export all your content of Notion page using official Notion API

NOTION4EVER Notion4ever is a small python tool that allows you to free your cont

50 Dec 30, 2022
A Python bot that uses the Reddit API to send users inspiring messages.

AnonBot By Edric Antoine A Python bot that uses the Reddit API to send users inspiring messages. When a message includes 'What would Anon do?', the bo

1 Jan 05, 2022
Simple Similarities Service

simsity Simsity is a Super Simple Similarities Service[tm]. It's all about building a neighborhood. Literally! This repository contains simple tools t

vincent d warmerdam 95 Dec 25, 2022
Based on falcondai and fenhl's Python snowflake tool, but with documentation and simliarities to Discord.

python-snowflake-2 Based on falcondai and fenhl's Python snowflake tool, but with documentation and simliarities to Discord. Docs make_snowflake This

2 Mar 19, 2022
Download song lyrics and metadata from Genius.com 🎶🎤

LyricsGenius: a Python client for the Genius.com API lyricsgenius provides a simple interface to the song, artist, and lyrics data stored on Genius.co

John W. Miller 738 Jan 04, 2023
Secure Tunnel Manager

Making life easy of those who are in need of OpenSource alternative of AWS Secure Tunnel.

Suyash Chavan 1 Sep 27, 2022
Build a better understanding of your data in PostgreSQL.

Data Fluent for PostgreSQL Build a better understanding of your data in PostgreSQL. The following shows an example report generated by this tool. It g

Mark Litwintschik 28 Aug 30, 2022
An unofficial client library for Google Music.

gmusicapi: an unofficial API for Google Play Music gmusicapi allows control of Google Music with Python. from gmusicapi import Mobileclient api = Mob

Simon Weber 2.5k Dec 15, 2022