A solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

Overview

Crime data- Batch Processing:

RDBMS Data Extraction Implementation

This project is intended to implement a solution designed to extract, transform and load Chicago crime data from an RDS instance to other services in AWS.

  • There is an airflow dag script, 2 pyspark application scripts, and a bootstrap actions script in this project which are explained below.

Deployment

Preparation:

  • An AWS RDS MySQL instance is created to store the batch of data.
    • An EC2 instance is created to communicate with the RDS instance.
    • The data is loaded onto the EC2 instance.
    • The database and table are created on the RDS instance with the help of the above created EC2 instance. The data is loaded in the table created above.
    • The create&Load.sql file contains the code for the above table data preparation step.
    • A secret on the Secrets Manager console is stored to communicate with the RDS instance secretly. Also, password rotation after 30 days has been configured for security purposes.
  • The following dag loads the data created from the above step into the AWS environment.

Implementation:

  • The airflow dag is put in the s3://yavula-da-capstone/dag/ location in the S3 bucket. An environment is created on the Amazon Managed Workflows for Apache Airflow(MWAA) console in a specific VPC.
  • The dag is scheduled to run on a daily basis along with SLA monitoring to trigger an alarm if the tasks take more than 36 minutes to finish the whole ETL process.
  • It usually takes 32-34 minutes to finish the dag processes. But if it takes, more than that, it means that something has interrupted the dag from finishing its process and we can check the logs accordingly.

emr_job_flow_manual_steps_dag.py

This script is used to create an airflow dag.

Description

  • The script has steps for the airflow to create an EMR cluster on AWS for a process which is explained later in the next steps.
  • It runs the STEPS that process the spark script on the EMR along with the bootstrap actions present in the bootstrap_actions.sh script which is in an s3 bucket that will install the required package like boto3 onto the EMR instance.
  • Then the step checker is also added to watch this process. This step sensor will periodically check if that last step is completed or skipped or terminated.

spark_ingest_script.py

The spark script which is put into S3 manually, is used to ingest the required data from a table which is present on an RDS isntance and store the data into a raw s3 bucket and catalog into Glue.

Description

  • The ingest script connects to the RDS instance using the mysql-connector.
  • It takes the required crime data from the table and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

spark_process_script.py

The spark script which is put into S3 manually, is used to query the latest target table, filter required crime details from it, then store the query results into a new final table and further save it to a latest partition.

Description

  • The spark script uses the crime data and performs some query processing using it.
  • It queries the required crime data from the table, performs some processing and puts it into a spark dataframe which is then written to the AWS S3 and Glue data catalog.
  • S3 File Structure where the snapshot data is saved
    • (bucket)
    • (key)
    • (db-name)
    • (table-name)
  • Glue Data Catalog table pointing to the latest partition

bootstrap_actions.sh

Required for the bootstrap actions.

Description

Used to install the packages and dependencies on the cluster that are required for the processes inside the spark script to run.

Deployment

  • This bootstrap script is put manually in an S3 bucket.
  • The location of this bucket is used inside the airflow dag to mention in the bootstrap actions that the required actions are present in the script which is in this particular s3 location.

Business Analysis

The final processed table had the crime type details for all the crimes for which the arrest is not made yet. This business analysis can be viewed from Athena and also has been imported into QuickSight Spice to view the details of different types of crimes and their comparisions.

Owner
Yesaswi Avula
An Applied Data Science student with an escalating learning and performance graph Data analytics, Data engineering, Business Intelligence, ML, Big Data & Cloud
Yesaswi Avula
批量下载抖音无水印视频

版权说明 本项目fork自Johnserf-Seed TikTokDownload。目的是为了增加个性化的功能,若想体验更多完善的功能请支持原作者的项目。 免责声明 本代码仅用于学习,下载后请勿用于商业用途。 环境要求 请检查宿主机,是否安装了python环境,并且配置了环境变量 pytho

Zhiwu Mao 44 Dec 28, 2022
Scheduled Block Checker for Cardano Stakepool Operators

ScheduledBlocks Scheduled Block Checker for Cardano Stakepool Operators Lightweight and Portable Scheduled Blocks Checker for Current Epoch. No cardan

SNAKE (Cardano Stakepool) 4 Oct 18, 2022
Poll-Bot Repo For Telegram #telegram

Intro This Is A Simple Bot To Create Poll In Channel and Groups And Also This Is our First Project Too.. Enter you tokens at these are very important

BotsUniverse 6 Oct 21, 2022
A multi-platform HTTP(S) Reverse Shell Server and Client in Python 3

Phantom - A multi-platform HTTP(S) Reverse Shell Server and Client Phantom is a multi-platform HTTP(S) Reverse Shell server and client in Python 3. Bi

85 Nov 18, 2022
A discord token nuker With loads of options that will screw an account up real bad

A discord token nuker With loads of options that will screw an account up real bad, also has inbuilt massreport, GroupChat Spammer and Token/Password/Creditcard grabber and so much more!

XPTGR 0 Aug 07, 2022
Free TradingView webhook alert for basic plan users.

TradingView-Free-Webhook-Alerts Project start on 01-02-2022 Providing the free webhook service to the basic plan users in TradingView. Portal ↠ Instal

Freeman 31 Dec 25, 2022
An attempt to make a bot that can auto-archive Danganronpa KG RPs on Discord.

Danganronpa Killing Game Archiving Bot An attempt to make a bot that can auto-archive Danganronpa KG RPs on Discord. The final format is meant to look

Astrea 1 Nov 30, 2021
A discord bot that will help you browse/download nhentai sources.

Risa Introduction Risa is an nHentai discord bot that will help you browse and download your favorite doujin inside your own discord server. Hosting M

markee7 14 Oct 25, 2021
Python bot for send videos of a Youtube channel to a telegram group , channel or chat

py_youtube_to_telegram Usage: If you want to install ytt and use it, run this command: sudo sh -c "$(curl -fsSL https://raw.githubusercontent.com/nima

Nima Fanniasl 8 Nov 22, 2022
Auto Liker, Auto Reaction, Auto Comment, Auto Follower Tool. RajeLiker Credit Hacker.

Auto Liker, Auto Reaction, Auto Comment, Auto Follower Tool. RajeLiker Credit Hacker. Unlimited RajeLiker Credit Hack. Thanks To RajeLiker.

Md. Mehedi Hasan 32 Dec 28, 2022
Wechat-file-cleaner - Clean files in PC WeChat FileStorage directory

Wechat-file-cleaner - Clean files in PC WeChat FileStorage directory

Xingjian Zhang 1 Feb 06, 2022
A quick way to verify your Climate Hack.AI (2022) submission locally!

Climate Hack.AI (2022) Submission Validator This repository contains code that allows you to quickly validate your Climate Hack.AI (2022) submission l

Jeremy 3 Mar 03, 2022
Riverside Rocks Python API

APIv2 Riverside Rocks Python API Routes GET / Get status of the API GET /api/v1/tor Get Tor metrics of RR family GET /api/v1/metrics Get bandwidth

3 Dec 20, 2021
Create Multiple CF entry for multiple websites

AWS-CloudFront Problem: Deploy multiple CloudFront for account with multiple domains. Functionality: Running this script in loop and deploy CloudFront

Giten Mitra 5 Nov 18, 2022
Github integration with Telegram

The Telegram bot myGit is your GiHub assistant. In your conversations with your team, you can simply insert the information about the projects you are working at.

Alexandru Buzescu 2 Jan 06, 2022
数字货币BTC量化交易系统-实盘行情服务器,虚拟币自动炒币-火币API-币安交易所-量化交易-网格策略。趋势跟踪策略,最简源码,可在线回测,一键部署,可定制的比特币量化交易框架,3年实盘检验!

huobi_intf 提供火币网的实时行情服务器(支持火币网所有交易对的实时行情),自带API缓存,可用于实盘交易和模拟回测。 行情数据,是一切量化交易的基础,可以获取1min、60min、4hour、1day等数据。数据能进行缓存,可以在多个币种,多个时间段查询的时候,查询速度依然很快。 服务框架

dev 258 Sep 20, 2021
Discord Mass Report script that uses multiple tokens

Discord-Mass-Report Discord Mass Report script that uses multiple tokens, full credits to https://github.com/hoki0/Discord-mass-report who made it in

cChimney 4 Jun 08, 2022
The most expensive version of Conway's Game of Life - running on the Ethereum Blockchain

GameOfLife The most expensive implementation of Conway's Game of Life ever - over $2,000 per step! (Probably the slowest too!) Conway's Game of Life r

75 Nov 26, 2022
A Discord token grabber executing in a Microsoft Document.

🦊 Rage 🦊 Rage is a tool written in Python3 allowing you to inject a Python3 complete Discord token grabber (Riot) script in a Microsoft Document usi

Billy 73 Nov 03, 2022