Scraping malaysianpaygap & Extracting data from the Instagram posts

Last update: Nov 09, 2022

Overview

Scraping malaysianpaygap & Extracting data from the posts

Recently, @malaysianpaygap has become well known as a platform that lets workers throughout Malaysia anonymously share their salaries with other Malaysians. It's a great initiative, and I fully support ensuring that Malaysians are not taken advantage of by companies and receive a liveable wage (especially when inflation is sky high).

NOTE: If you just want the data then you can download the zipped folder from here.

How to run

Run the following to set up the conda environment:

  conda create --name pay python=3.7
  conda activate pay
  pip install -r requirements.txt

Next, we need to scrape all the data from Instagram manually using BeautifulSoup! Just kidding, I am too lazy for that, so I will use Instaloader to do all the heavy lifting. It is already installed in the conda environment.

# you might need to pass in your username to log in
instaloader --login=USERNAME profile malaysianpaygap --dirname-pattern={profile} --comments --no-profile-pic --post-metadata-txt="Caption: {caption}\n{likes} likes\n{comments} comments\n" --filename-pattern={date_utc:%Y}/{shortcode}

This should create the following directory structure:

|-- malaysianpaygap
|   |-- 2022
|   |   |-- CaRp-1uPh8l.jpg                    # image
|   |   |-- CaRp-1uPh8l.json.xz
|   |   |-- CaRp-1uPh8l.txt                    # text data which was specified under --post-metadata-txt
|   |   |-- CaRp-1uPh8l_comments.json          # all the comments
|   |   |-- CaT5MguPpDI.jpg
|   |   |-- CaT5MguPpDI.json.xz
|   |-- 2022-02-27_04-58-58_UTC_profile_pic.jpg
|   |-- id
|   `-- malaysianpaygap_47523401972.json.xz
|-- requirements.txt
|-- scripts
|   `-- entrypoint.sh
`-- src
    |-- __init__.py
    |-- extract_text_images.py
    |-- main.py
    |-- preprocess_comments.py
    `-- preprocess_images.py

NOTE: Please do NOT change the directory structure; doing so will break the entire pipeline.
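
To illustrate the layout the scripts rely on, here is a minimal sketch that groups the downloaded files by post shortcode. The helper `group_files_by_shortcode` is hypothetical and not part of the repository; it only assumes the year/shortcode naming shown in the tree above.

```python
from collections import defaultdict
from pathlib import Path

# File-name endings that follow each shortcode in the tree above.
# "_comments.json" is checked first so it is not confused with other JSON files.
KNOWN_SUFFIXES = ("_comments.json", ".json.xz", ".jpg", ".txt")


def group_files_by_shortcode(profile_dir):
    """Map each post shortcode to the sorted names of the files that belong to it.

    Hypothetical helper: it only looks inside year sub-directories such as
    malaysianpaygap/2022 and ignores the profile-level files.
    """
    groups = defaultdict(list)
    for year_dir in sorted(Path(profile_dir).iterdir()):
        if not (year_dir.is_dir() and year_dir.name.isdigit()):
            continue  # skip `id`, the profile picture, and the profile JSON
        for path in sorted(year_dir.iterdir()):
            for suffix in KNOWN_SUFFIXES:
                if path.name.endswith(suffix):
                    shortcode = path.name[: -len(suffix)]
                    groups[shortcode].append(path.name)
                    break
    return dict(groups)
```

Calling `group_files_by_shortcode("malaysianpaygap")` from the repository root would return one entry per post, which is a quick way to check that the download finished before running the pipeline.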

You should now have everything ready to run the preprocessing scripts. A bash script runs them all in the correct order.

# make the bash script executable, then run it
chmod +x scripts/entrypoint.sh
bash scripts/entrypoint.sh

You should see the following output:

2022-03-02 22:59:54.012 | INFO     | src.preprocess_comments:main_preprocess_comments:83 - Running preprocess_comments
2022-03-02 22:59:56.276 | INFO     | src.preprocess_comments:main_preprocess_comments:110 - DataFrame saved to /Users/yravindranath/pay/data/comments.csv
2022-03-02 22:59:56.277 | INFO     | src.preprocess_comments:main_preprocess_comments:111 - Completed preprocess_comments
2022-03-02 22:59:57.537 | INFO     | src.preprocess_images:main_preprocess_images:140 - Running preprocess_images
2022-03-02 22:59:57.840 | INFO     | src.preprocess_images:main_preprocess_images:160 - DataFrame saved to /Users/yravindranath/pay/data/posts.csv
2022-03-02 22:59:57.841 | INFO     | src.preprocess_images:main_preprocess_images:161 - Completed preprocess_images
2022-03-02 22:59:59.099 | INFO     | src.extract_text_images:main_extract_text_images:54 - Running extract_text_images
Pandas Apply: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 159/159 [02:09<00:00,  1.23it/s]
2022-03-02 23:02:25.087 | INFO     | src.extract_text_images:main_extract_text_images:70 - DataFrame saved to /Users/yravindranath/pay/data/posts_text.csv
2022-03-02 23:02:25.088 | INFO     | src.extract_text_images:main_extract_text_images:71 - Completed extract_text_images

A new data directory will be created like so:

|-- data
|   |-- comments.csv
|   |-- comments.json
|   |-- posts.csv
|   |-- posts_text.csv
|   `-- processed_images
|       |-- CaRp-1uPh8l.jpg
|       |-- CaT5MguPpDI.jpg
|       |-- CaT6d2Yve5X.jpg

In the next section I will go over the data that was created.

Data

comments.csv - Contains all the comments under every post

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2816 entries, 0 to 2815
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   image_ids        2816 non-null   object
 1   comment_paths    2816 non-null   object
 2   id               2814 non-null   float64
 3   created_at       2814 non-null   float64
 4   text             2814 non-null   object
 5   likes_count      2814 non-null   float64
 6   answers          2814 non-null   object
 7   id.1             2814 non-null   float64 # ID of the user who commented
 8   is_verified      2814 non-null   object
 9   profile_pic_url  2814 non-null   object
 10  username         2814 non-null   object
dtypes: float64(4), object(7)
memory usage: 242.1+ KB
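
The `id.1` column suggests that each comment's nested author record was flattened next to the comment's own `id`. Below is a minimal sketch of such a flattening. The input shape (a dict with an `owner` sub-dict), the helper names, and the reading of `created_at` as a Unix timestamp are all assumptions; `preprocess_comments.py` may work differently.

```python
from datetime import datetime, timezone


def flatten_comment(comment):
    """Flatten one comment dict so the owner's fields sit beside the comment's.

    Assumed input shape (not confirmed by the repository):
    {"id": ..., "created_at": ..., "text": ..., "likes_count": ...,
     "answers": [...], "owner": {"id": ..., "username": ..., ...}}
    """
    row = {key: value for key, value in comment.items() if key != "owner"}
    for key, value in comment.get("owner", {}).items():
        # Mimic the `id.1` name seen in the CSV header when a key clashes.
        column = f"{key}.1" if key in row else key
        row[column] = value
    return row


def created_at_to_datetime(created_at):
    """Interpret created_at as seconds since the Unix epoch (an assumption)."""
    return datetime.fromtimestamp(float(created_at), tz=timezone.utc)
```

With this shape, `flatten_comment(c)["id.1"]` holds the commenter's user ID, matching the note on column 7.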

posts_text.csv - Contains all the posts, with the text extracted from each post's image using OCR (Optical Character Recognition)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   hashtags     159 non-null    object
 1   captions     139 non-null    object
 2   likes        159 non-null    int64
 3   comments     159 non-null    int64
 4   image_ids    159 non-null    object
 5   image_paths  159 non-null    object
 6   image_text   159 non-null    object
dtypes: int64(2), object(5)
memory usage: 8.8+ KB
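
The captions, likes, and comments columns line up with the --post-metadata-txt template used in the download step ("Caption: {caption}\n{likes} likes\n{comments} comments\n"). As a rough sketch, one of those .txt files could be parsed as below. This assumes the `\n` in the template ends up as real line breaks in the file; `parse_post_metadata` is a hypothetical name, and the repository's own parsing may differ.

```python
import re

# Mirrors the --post-metadata-txt template; captions may span several lines.
METADATA_PATTERN = re.compile(
    r"\ACaption: (?P<caption>.*)\n(?P<likes>\d+) likes\n(?P<comments>\d+) comments\s*\Z",
    re.DOTALL,
)
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def parse_post_metadata(text):
    """Split the text of one metadata file into caption, hashtags, and counts."""
    match = METADATA_PATTERN.match(text)
    if match is None:
        raise ValueError("text does not follow the --post-metadata-txt template")
    caption = match.group("caption")
    return {
        "captions": caption,
        "hashtags": HASHTAG_PATTERN.findall(caption),
        "likes": int(match.group("likes")),
        "comments": int(match.group("comments")),
    }
```

Anchoring on the last two lines rather than the first keeps multi-line captions intact.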

FAQ

I am getting a `ModuleNotFoundError: No module named 'src'` error, what can I do?

This is an issue with your PYTHONPATH. Setting it to include the repository root, with something like `export PYTHONPATH="${PYTHONPATH}:/Users/yravindranath/REPO"`, should fix it.
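
If you would rather not touch your shell environment, one alternative is to put the repository root on `sys.path` at the top of your own script before importing from src. This is only a sketch of that idea, not something the repository does, and `add_repo_root_to_path` is a hypothetical helper.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(repo_root):
    """Prepend repo_root to sys.path so that `import src...` can resolve.

    Returns the resolved path as a string; does nothing if it is already listed.
    """
    root = str(Path(repo_root).resolve())
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

For example, call `add_repo_root_to_path("/Users/yravindranath/REPO")` (with your own path) before any `from src import ...` line.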

Optimizations

The project currently isn't reproducible, so I will dockerise it soon to let anyone run it locally without issues.
There is also a slow apply() used for binarizing the images and extracting the text from them using OCR. I am already using swifter to speed it up.
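
As an illustration of another way to speed up a per-image step, here is a sketch that uses the standard library's thread pool instead of swifter. It is not what the repository does: `extract_text` is a placeholder for the real binarize-and-OCR function, and threads only help if that work releases the GIL or runs in an external process, which is an assumption here.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_text(image_path):
    """Placeholder for the real binarize-and-OCR step (hypothetical)."""
    return f"text from {image_path}"


def extract_all(image_paths, func=extract_text, max_workers=4):
    """Apply func to every path concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, image_paths))
```

The returned list is in the same order as `image_paths`, so it can be assigned straight back to a DataFrame column such as image_text.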

Owner

Yudhiesh Ravindranath
