Scrapping malaysianpaygap & Extracting data from the Instagram posts

Overview

Scrapping malaysianpaygap & Extracting data from the posts

Recently @malaysianpaygap has gotten quite famous as a platform that enables workers throughout Malaysia to anonymously share their salaries amongst other Malaysians. Its a great initiative and I am fully supportive behind ensuring that Malaysians are not taken advantage of by companies and get a liveable wage(especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

  1. Run the following to get conda environment setup
  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt
  1. Next we will need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding I am too lazy so I will be using InstaLoader to do all the heavy lifting for me. The conda environment will have it installed for you already.
# you might need to pass in your username to login
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure, it will break the entire pipeline.

  1. You should have everything ready to run the preprocessing scripts that I have made! I have a bash script that runs everything in the correct order.
# make bash script runnable
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new directory data will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under a post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB

posts_text.csv - Contains all the posts with their text extracted through their image using OCR(Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB

FAQ

I am getting a ModuleNotFoundError: No module named 'src' error what can I do?

This is an issue with your PYTHONPATH, setting it to something like export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO" should fix it.

Optimizations

  1. So currently the entire project isn't repoducible therefore I will dockerise it soon and allow anyone to run it locally without any issues.
  2. If you notice there is a slow apply() used for binarizing the images and extracting the text from it using OCR. I am using swifter to speed it up as it is.
Owner
Yudhiesh Ravindranath
Data Scientist @MoneyLion
Yudhiesh Ravindranath
Discord.py Bot Series With Python

Discord.py Bot Series YouTube Playlist: https://www.youtube.com/playlist?list=PL9nZZVP3OGOAx2S75YdBkrIbVpiSL5oc5 Installation pip install -r requireme

Step 1 Dec 17, 2021
Wechat-file-cleaner - Clean files in PC WeChat FileStorage directory

Wechat-file-cleaner - Clean files in PC WeChat FileStorage directory

Xingjian Zhang 1 Feb 06, 2022
Unofficial YooMoney API python library

API Yoomoney - unofficial python library This is an unofficial YooMoney API python library. Summary Introduction Features Installation Quick start Acc

Aleksey Korshuk 136 Dec 30, 2022
BleachBit system cleaner for Windows and Linux

BleachBit BleachBit cleans files to free disk space and to maintain privacy. Running from source To run BleachBit without installation, unpack the tar

1.9k Jan 06, 2023
Python script to download WAX transactions

WAXtax Python script to download WAX transactions WAXtax uses the CoinGecko API and the WAX Blockchain History API to download csvs for each account y

SixPM Software 11 Oct 09, 2022
Telegram Auto Filter Bot

Pro Auto Filter Bot V2.o Hey Mo Tech, I'm an Autofilter bot v2.O and you can not Add Me to your Group. I was made for this one group. So don't waste y

14 Oct 20, 2021
Download videos from Youtube and other platforms through a Telegram Bot

ytdl-bot Download videos from YouTube and other platforms through a Telegram Bot Usage: https://t.me/benny_ytdlbot Send link from YouTube directly to

Telegram Bot Collection 289 Jan 03, 2023
A project that alerts me when there's a dog outside so I can go look at it.

Dog Detector A project that alerts me when there's a dog outside so I can go look at it. Tech Specs This script uses the YOLOv3 object detection model

rydercalmdown 58 Jul 29, 2022
Easy to use reaction role Discord bot written in Python.

Reaction Light - Discord Role Bot Light yet powerful reaction role bot coded in Python. Key Features Create multiple custom embedded messages with cus

eibex 109 Dec 20, 2022
Verkehrsunfälle in Deutschland, aufgeschlüsselt nach Verkehrsmittel des Hauptverursachers und Nebenverursachers

How-To Einfach ./main.py ausführen mit der Statistik-Datei aus dem Ordner "Unfälle_mit_mehreren_Beteiligten" als erstem Argument. Requirements python,

4 Oct 12, 2022
Telegram bot that search for the classrooms status of the chosen day and then return all the free classrooms using your preferred time slot

Aule Libere Polimi Since the PoliMi site no longer allows people to search for free classrooms this bot was necessary! It simply search for the classr

Daniele Ferrazzo 16 Nov 09, 2022
Bavera is an extensive and extendable Python 3.x library for the Discord API

Bavera is an extensive and extendable Python 3.x library for the Discord API. Bavera boasts the following major features: Expressive, functiona

Bavera 1 Nov 17, 2021
A pypi packages finder telegram bot.

PyPi-Bot A pypi packages information finder telegram bot. Made with Python3 (C) @FayasNoushad Copyright permission under MIT License License - https:

Fayas Noushad 17 Oct 21, 2022
A simple url uploader bot with permenent thumbnail support

URL-Uploader A simple url uploader bot with permenent thumbnail support Scrapped some code from @SpEcHIDe's AnyDLBot Repository Please fork this repos

Fayas Noushad 40 Nov 29, 2021
Pixiv 爬虫,使用 Python 实现。支持批量下载、上传到图床。

用 Python 实现的 Pixiv 爬虫,支持批量下载和上传。 随机图片 API: https://loliapi.ml/ Deploy Github Action 集成部署 建议使用本方法部署,相较于本地部署,无需搭建环境,全程在线上完成。并且使用国外服务器下载、上传,网络更加通畅。 Fork

18 Feb 26, 2022
Python API for Photoshop.

Python API for Photoshop. The example above was created with Photoshop Python API.

Hal 372 Jan 02, 2023
Connect your Nintendo Switch playing status to Discord!

Disclaimer: Unfortunately, it appears that Nintendo has removed returning self-Presence in their API as of recently, making this project near obsolete

Deltaion Lee 145 Dec 30, 2022
Request based Python module(s) to help with the Newegg raffle.

Newegg Shuffle Python module(s) to help you with the Newegg raffle How to use $ git clone https://github.com/Matthew17-21/Newegg-Shuffle $ cd Newegg-S

Matthew 45 Dec 01, 2022
Throttle and debounce add-on for Pyrogram

pyrothrottle Throttle and debounce add-on for Pyrogram Quickstart implementation on decorators from pyrogram import Client, filters from pyrogram.type

7 Oct 01, 2022
An anime themed telegram group management bot based on sqlalchemy database running on python3.

Kazuko Robot A Telegram Python bot running on python3 forked with saitama and DiasyX with a sqlalchemy database and an entirely themed persona to make

heyaaman 22 Dec 07, 2022