A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
(@Tablada32BOT is my bot in twitter) This is a simple bot, its main and only function is to reply to tweets where they mention their bot with their @

Remember If you are going to host your twitter bot on a page where they can read your code, I recommend that you create an .env file and put your twit

3 Jun 04, 2021
Thread-safe Python RabbitMQ Client & Management library

AMQPStorm Thread-safe Python RabbitMQ Client & Management library. Introduction AMQPStorm is a library designed to be consistent, stable and thread-sa

Erik Olof Gunnar Andersson 167 Nov 20, 2022
Slack Developer Kit for Python

Python Slack SDK The Slack platform offers several APIs to build apps. Each Slack API delivers part of the capabilities from the platform, so that you

SlackAPI 3.5k Jan 02, 2023
This is RequestTrackerBot and it used for tracking request made by user in a group

This is a Request Tracker Bot repo, It is for those who upload content like movies, anime, etc. It can be used for tracking request of content that your members asked for.

Abhijeet 27 Dec 29, 2022
NFTs Upload to OpenSea CuseEdition

NFTs-Upload-to-OpenSea-CuseEdition YOUTUBE VIDEO - Soon... Download Python and

Lil Cuse 2 Jan 04, 2022
Discord bot for playing Werewolf game on League of Legends.

LoLWolf LoL人狼をプレイするときのDiscord用botです。 (Discord bot for playing Werewolf game on League of Legends.) 以下のボタンを押してbotをあなたのDiscordに招待することで誰でも簡単に使用することができます。

Hatsuka 4 Oct 18, 2021
PRAW, an acronym for "Python Reddit API Wrapper", is a python package that allows for simple access to Reddit's API.

PRAW: The Python Reddit API Wrapper PRAW, an acronym for "Python Reddit API Wrapper", is a Python package that allows for simple access to Reddit's AP

Python Reddit API Wrapper Development 3k Dec 29, 2022
Python wrapper for the Intercom API.

python-intercom Not officially supported Please note that this is NOT an official Intercom SDK. The third party that maintained it reached out to us t

Intercom 215 Dec 22, 2022
The Python SDK for the Rackspace Cloud

pyrax Python SDK for OpenStack/Rackspace APIs DEPRECATED: Pyrax is no longer being developed or supported. See openstacksdk and the rackspacesdk plugi

PyContribs 238 Sep 21, 2022
Dados Públicos de CNPJ disponibilizados pela Receita Federal do Brasil

Dados Públicos CNPJ Fonte oficial da Receita Federal do Brasil, aqui. Layout dos arquivos, aqui. A Receita Federal do Brasil disponibiliza bases com o

Aphonso Henrique do Amaral Rafael 102 Dec 28, 2022
Collection of AWS Fault Injection Simulator (FIS) experiment templates.

Collection of AWS Fault Injection Simulator (FIS) experiment templates. These templates let you perform chaos engineering experiments on resources (applications, network, and infrastructure) in the A

Adrian Hornsby 8 Nov 27, 2022
The official Pushy SDK for Python apps.

pushy-python The official Pushy SDK for Python apps. Pushy is the most reliable push notification gateway, perfect for real-time, mission-critical app

Pushy 1 Dec 21, 2021
Efetuar teste de automação usando linguagem gherkin

🚀 Teste-de-Automação - QA---CI-T 🚀 Descrição • Primeira Parte • Segunda Parte • Terceira Parte Contributors Descrição Efetuamos testes de automação

Eliel martins 6 Dec 07, 2021
Python library for the DeepL language translation API.

The DeepL API is a language translation API that allows other computer programs to send texts and documents to DeepL's servers and receive high-quality translations. This opens a whole universe of op

DeepL 535 Jan 04, 2023
Programa de código abierto para probar el API de Bitso, el exchange más importante de América Latina.

Bitso Semiautomático Programa de código abierto para probar el API de Bitso, el exchange más importante de América Latina. Desarrollador Fernando Mire

Fernando Mireles 17 Dec 07, 2022
A bot that can play songs in Telegram group voice chats like AK 47

🎧 47Music Player 🎧 A bot that can play songs in Telegram group voice chats like AK 47 ✨ Easy To Deploy Pyrogram Session Config Vars API_ID : Assista

Janindu Malshan 23 Dec 07, 2022
A simple Discord Token Grabber sending the new token if the victim changes his password.

💎 Riot 💎 Riot is a simple Discord token grabber written in Python3 running in background and executing when the victim start their computer. If the

Billy 66 Dec 26, 2022
Eva Maria Telegram Bot

Eva Maria Bot Features Auto Filter Manuel Filter IMDB Admin Commands Broadcast Index IMDB search Inline Search Random pics ids and User info Stats, Us

Eva Maria TG 477 Dec 31, 2022
Framework for creating and running trading strategies. Blatantly stolen copy of qtpylib to make it work for Indian markets.

_• Kinetick Trade Bot Kinetick is a framework for creating and running trading strategies without worrying about integration with broker and data str

Vinay 41 Dec 31, 2022
A python library for creating Slack slash commands using AWS Lambda Functions

slashbot Slashbot makes it easy to create slash commands using AWS Lambda functions. These can be handy for creating a secure way to execute automated

Eric Brassell 17 Oct 21, 2022