
Overview

Project Scope

The scope of this project is to build a data warehouse on Google Cloud Platform that helps answer common business questions and powers dashboards. To do that, a conceptual data model and a data pipeline will be defined.

Architecture

Data are uploaded to a Google Cloud Storage bucket. GCS acts as the data lake where all raw files are stored. Data are then loaded into staging tables on BigQuery. The ETL process takes data from those staging tables and creates data mart tables. An Airflow instance can be deployed on a Google Compute Engine instance or locally to orchestrate the pipeline.
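
To make this flow concrete, below is a minimal sketch of how the orchestration could be wired up in Airflow. The task names are illustrative, not the project's actual task IDs, and the imports assume the Airflow 1.x layout implied by the /admin/ UI path used later in this README.

from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Illustrative DAG skeleton: GCS -> BigQuery staging -> data marts.
dag = DAG('cloud_data_lake_sketch', start_date=datetime(2021, 1, 1), schedule_interval=None)

start_pipeline = DummyOperator(task_id='start_pipeline', dag=dag)
load_staging = DummyOperator(task_id='load_gcs_to_staging', dag=dag)    # BigQuery load jobs from GCS
check_staging = DummyOperator(task_id='check_staging_tables', dag=dag)  # row-count checks
build_marts = DummyOperator(task_id='build_data_marts', dag=dag)        # SQL transforms in BigQuery
finish_pipeline = DummyOperator(task_id='finish_pipeline', dag=dag)

start_pipeline >> load_staging >> check_staging >> build_marts >> finish_pipeline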

Here are the justifications for the technologies used:

  • Google Cloud Storage: acts as the data lake; object storage that scales elastically to hold all the raw files.
  • Google BigQuery: acts as the database engine for the data warehouse, data marts, and ETL processes. BigQuery is a serverless solution that can easily and effectively process petabyte-scale datasets.
  • Apache Airflow: orchestrates the workflow by issuing commands to load data into BigQuery and SQL queries for the ETL process. Airflow does not process any data itself, which allows the architecture to scale.

Data Model

The database is designed following the star-schema principle, with 1 fact table and 7 dimension tables; a sample query against the schema follows the list below.

[Image: star-schema data model]

  • F_IMMIGRATION_DATA: contains immigration information such as arrival date, departure date, visa type, gender, country of origin, etc.
  • D_TIME: contains date dimension attributes for the date columns
  • D_PORT: contains port_id and port_name
  • D_AIRPORT: contains airports within a state
  • D_STATE: contains state_id and state_name
  • D_COUNTRY: contains country_id and country_name
  • D_WEATHER: contains average weather for a state
  • D_CITY_DEMO: contains demographic information for a city
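
As an example of the business questions this schema can answer, the sketch below counts arrivals by state by joining the fact table to D_STATE using the google-cloud-bigquery client. The join column (state_id) and the fully qualified table names are assumptions based on the model above and the DAG parameters shown later.

from google.cloud import bigquery

# Sketch: top 10 states by immigration arrivals.
client = bigquery.Client(project='cloud-data-lake')

query = """
SELECT s.state_name, COUNT(*) AS arrivals
FROM `cloud-data-lake.IMMIGRATION_DWH.F_IMMIGRATION_DATA` f
JOIN `cloud-data-lake.IMMIGRATION_DWH.D_STATE` s
  ON f.state_id = s.state_id
GROUP BY s.state_name
ORDER BY arrivals DESC
LIMIT 10
"""

for row in client.query(query).result():
    print(row.state_name, row.arrivals)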

Data pipeline

This project uses Airflow for orchestration.

[Image: Airflow DAG graph of the pipeline]

A DummyOperator, start_pipeline, kicks off the pipeline, followed by 4 load operations. Those operations load data from the GCS bucket into BigQuery tables. The immigration data is loaded as Parquet files while the others are CSV formatted. Row-count check operations run after each load to BigQuery.
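
One such load-and-check pair could look like the sketch below, assuming the Airflow 1.x contrib operators (on Airflow 2 the same operators live in the Google provider packages). The object paths and table names are assumptions based on the bucket layout and the DAG parameters shown later in this README.

from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

project_id = 'cloud-data-lake'
staging_dataset = 'IMMIGRATION_DWH_STAGING'
gs_bucket = 'cloud-data-lake-gcp'

# Load the Parquet immigration files from GCS into a staging table.
# Assumes a `dag` object as in the orchestration sketch above.
load_immigration_data = GoogleCloudStorageToBigQueryOperator(
    task_id='load_immigration_data',
    bucket=gs_bucket,
    source_objects=['immigration_data/*.parquet'],
    destination_project_dataset_table=f'{project_id}:{staging_dataset}.immigration_data',
    source_format='PARQUET',
    write_disposition='WRITE_TRUNCATE',
    autodetect=True,
    dag=dag,
)

# Fail the pipeline if the staging table ends up empty.
check_immigration_data = BigQueryCheckOperator(
    task_id='check_immigration_data',
    sql=f'SELECT COUNT(*) FROM `{project_id}.{staging_dataset}.immigration_data`',
    use_legacy_sql=False,
    dag=dag,
)

load_immigration_data >> check_immigration_data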

Next, the pipeline loads 3 master data objects from the I94 data dictionary. Then the F_IMMIGRATION_DATA table is created and checked to make sure there are no duplicates. The other dimension tables are also created, and the pipeline finishes.
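
A sketch of the fact-table step and its duplicate check follows. Treating cicid as the unique record key is an assumption about the I94 data, and the simplified SELECT stands in for the project's actual transformation SQL.

from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator

project_id = 'cloud-data-lake'
staging_dataset = 'IMMIGRATION_DWH_STAGING'
dwh_dataset = 'IMMIGRATION_DWH'

# Build F_IMMIGRATION_DATA from the staging table (simplified SELECT).
create_f_immigration_data = BigQueryOperator(
    task_id='create_f_immigration_data',
    sql=f'SELECT * FROM `{project_id}.{staging_dataset}.immigration_data`',
    destination_dataset_table=f'{project_id}.{dwh_dataset}.F_IMMIGRATION_DATA',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,
    dag=dag,
)

# Passes only when every record key appears exactly once (no duplicates).
check_f_immigration_data = BigQueryCheckOperator(
    task_id='check_f_immigration_data',
    sql=f'''
        SELECT COUNT(*) = COUNT(DISTINCT cicid)
        FROM `{project_id}.{dwh_dataset}.F_IMMIGRATION_DATA`
    ''',
    use_legacy_sql=False,
    dag=dag,
)

create_f_immigration_data >> check_f_immigration_data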

Scenarios

Data increase by 100x

The current infrastructure can easily support a 100x increase in data size. GCS and BigQuery can handle petabyte-scale data. Airflow is not a bottleneck since it only issues commands to other services.

Pipelines run daily at 7 AM. How do dashboards stay updated, and would it still work?

Schedule the DAG to run daily at 7 AM, and set up DAG retries plus email/Slack notifications on failures.
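
A sketch of those settings in the DAG definition is shown below; the retry counts and email address are placeholders, and Slack alerts would typically be wired up through an on_failure_callback.

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    'owner': 'airflow',
    'retries': 3,                          # retry failed tasks a few times
    'retry_delay': timedelta(minutes=5),
    'email': ['data-team@example.com'],    # placeholder address
    'email_on_failure': True,
}

dag = DAG(
    'cloud-data-lake-pipeline',
    default_args=default_args,
    schedule_interval='0 7 * * *',  # every day at 7 AM
    start_date=datetime(2021, 1, 1),
    catchup=False,
)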

Make it available to 100+ people

BigQuery auto-scales, so it can easily handle 100+ people accessing it. If even more people or services need access to the data, we can add steps to write to a NoSQL database such as Datastore, Cassandra, or Bigtable, or to a SQL database that supports horizontal scaling, such as Cloud Spanner.

Project Instructions

GCP setup

Follow these steps:

  • Create a project on GCP
  • Enable billing by adding a credit card (new GCP accounts come with $300 in free credits)
  • Navigate to IAM and create a service account
  • Grant the account the Project Owner role. This is convenient for this project, but it is not recommended for a production system. Create a JSON key for the account and keep it somewhere safe.

Create a bucket on your project and upload the data with the following structure:

gs://cloud-data-lake-gcp/airports/:
gs://cloud-data-lake-gcp/airports/airport-codes_csv.csv
gs://cloud-data-lake-gcp/airports/airport_codes.json

gs://cloud-data-lake-gcp/cities/:
gs://cloud-data-lake-gcp/cities/us-cities-demographics.csv
gs://cloud-data-lake-gcp/cities/us_cities_demo.json

gs://cloud-data-lake-gcp/immigration_data/:
gs://cloud-data-lake-gcp/immigration_data/part-00000-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00001-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00002-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00003-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00004-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00005-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00006-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00007-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00008-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00009-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00010-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00011-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00012-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet
gs://cloud-data-lake-gcp/immigration_data/part-00013-b9542815-7a8d-45fc-9c67-c9c5007ad0d4-c000.snappy.parquet

gs://cloud-data-lake-gcp/master_data/:
gs://cloud-data-lake-gcp/master_data/I94ADDR.csv
gs://cloud-data-lake-gcp/master_data/I94CIT_I94RES.csv
gs://cloud-data-lake-gcp/master_data/I94PORT.csv

gs://cloud-data-lake-gcp/weather/:
gs://cloud-data-lake-gcp/weather/GlobalLandTemperaturesByCity.csv
gs://cloud-data-lake-gcp/weather/temperature_by_city.json

You can copy the data to your own bucket by running the following:

gsutil cp -r gs://cloud-data-lake-gcp/ gs://{your_bucket_name}
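
If the dataset is large, the copy can be parallelized with gsutil's top-level -m flag:

gsutil -m cp -r gs://cloud-data-lake-gcp/ gs://{your_bucket_name}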

Local setup

Clone the project and set up a local Airflow environment as follows:

Install Docker if it's not already installed. You can find the resources to do that here.

Install the Astronomer CLI following the instructions here.

Run the following commands to bring up the Airflow instance:

astro d start

You can look at the logs by running make logs if you need to debug something. You can access and manage the pipeline by typing the following address into a browser:

localhost:8080/admin/

If everything is set up correctly, you will see the following screen:

[Image: Airflow UI with the pipeline DAG listed]

Navigate to Admin -> Connections and paste in the credentials for the following two connections: bigquery_default and google_cloud_default
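
For both connections, a common approach is to paste the contents of the service account's JSON key into the connection's Keyfile JSON field and fill in your project ID; the exact field names can vary between Airflow versions.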

[Image: Airflow connection configuration screen]

Navigate to the main DAG at dags/cloud-data-lake-pipeline.py and change the following parameters to match your own setup:

project_id = 'cloud-data-lake'
staging_dataset = 'IMMIGRATION_DWH_STAGING'
dwh_dataset = 'IMMIGRATION_DWH'
gs_bucket = 'cloud-data-lake-gcp'

You can then trigger the DAG and the pipeline will run.

The data warehouse

The final data warehouse looks like this:

[Image: the final data warehouse tables in BigQuery]
