Overview

Toxicity comments crawler

Crawler job that scrapes comments from social media posts and saves them in an S3 bucket.

Twitter

Tweets and replies are scraped from the Twitter API for a given list of users.
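As a rough illustration, the snippet below sketches how tweets and replies could be pulled with Tweepy using the TWITTER_* variables listed under Usage. The user list and overall structure are hypothetical; the release notes only confirm that wait_on_rate_limit (a Tweepy option) is enabled for the Twitter client.

    import os
    import tweepy

    # Illustrative sketch only; the crawler's actual implementation may differ.
    auth = tweepy.OAuthHandler(
        os.environ["TWITTER_CONSUMER_KEY"], os.environ["TWITTER_CONSUMER_SECRET"]
    )
    auth.set_access_token(
        os.environ["TWITTER_ACCESS_TOKEN"], os.environ["TWITTER_ACCESS_TOKEN_SECRET"]
    )
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Collect up to TWITTER_MAX_TWEETS tweets/replies per user on the list.
    max_tweets = int(os.environ["TWITTER_MAX_TWEETS"])
    users = ["some_user"]  # hypothetical user list
    for username in users:
        cursor = tweepy.Cursor(api.user_timeline, screen_name=username, tweet_mode="extended")
        for tweet in cursor.items(max_tweets):
            print(tweet.full_text)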

Twitch

Coming soon.

YouTube

Coming soon.

Facebook

Coming soon.

Instagram

Coming soon.

The toxicity level of each comment is calculated using the Perspective API.
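For context, the Perspective API is commonly called through google-api-python-client (which appears in the project's dependencies). The snippet below is a minimal sketch of scoring a single comment against PERSPECTIVE_THRESHOLD, not the crawler's actual code.

    import os
    from googleapiclient import discovery

    # Minimal sketch of a Perspective API toxicity request (illustrative only).
    client = discovery.build(
        "commentanalyzer",
        "v1alpha1",
        developerKey=os.environ["PERSPECTIVE_API_KEY"],
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False,
    )

    request = {
        "comment": {"text": "some scraped comment"},
        "requestedAttributes": {"TOXICITY": {}},
    }
    response = client.comments().analyze(body=request).execute()
    score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

    # Comments scoring above the threshold can then be kept or filtered out,
    # depending on FILTER_TOXIC_COMMENTS.
    is_toxic = score >= float(os.environ.get("PERSPECTIVE_THRESHOLD", "0.5"))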

Architecture

Usage

To run the crawler, you need to provide the following environment variables:

Variable | Description | Default | Required
AWS_ROLE_ARN | AWS Role ARN | None | Optional
AWS_WEB_IDENTITY_TOKEN_FILE | AWS Web Identity Token File | None | Optional
AWS_ACCESS_KEY_ID | AWS Access Key ID | None | Optional
AWS_SECRET_ACCESS_KEY | AWS Secret Access Key | None | Optional
AWS_S3_BUCKET | AWS S3 Bucket | None | Required
AWS_S3_BUCKET_PREFIX | AWS S3 Bucket Prefix | None | Required
LOG_LEVEL | Log level | INFO | Optional
PERSPECTIVE_API_KEY | Perspective API Key | None | Required
PERSPECTIVE_THRESHOLD | Perspective toxicity threshold | 0.5 | Required
FILTER_TOXIC_COMMENTS | Filter toxic comments | True | Required
TWITTER_CONSUMER_KEY | Twitter Consumer Key | None | Required
TWITTER_CONSUMER_SECRET | Twitter Consumer Secret | None | Required
TWITTER_ACCESS_TOKEN | Twitter Access Token | None | Required
TWITTER_ACCESS_TOKEN_SECRET | Twitter Access Token Secret | None | Required
TWITTER_MAX_TWEETS | Maximum number of tweets or replies to fetch per user | None | Required

If AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are provided, the crawler will use them to assume a role and will ignore AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
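A minimal sketch of that credential precedence with boto3 (the object key and payload below are placeholders, not the crawler's actual output layout):

    import os
    import boto3

    # boto3 picks up AWS_ROLE_ARN / AWS_WEB_IDENTITY_TOKEN_FILE automatically and
    # assumes the role via STS; otherwise fall back to static access keys.
    if os.getenv("AWS_ROLE_ARN") and os.getenv("AWS_WEB_IDENTITY_TOKEN_FILE"):
        session = boto3.Session()
    else:
        session = boto3.Session(
            aws_access_key_id=os.getenv("AWS_ACCESS_KEY_ID"),
            aws_secret_access_key=os.getenv("AWS_SECRET_ACCESS_KEY"),
        )

    s3 = session.client("s3")
    s3.put_object(
        Bucket=os.environ["AWS_S3_BUCKET"],
        Key=f"{os.environ['AWS_S3_BUCKET_PREFIX']}/comments.json",  # placeholder key
        Body=b"[]",  # placeholder payload
    )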

Running

Prerequisites

You need Docker installed and a .env file containing the environment variables described above.

Then, you can run the crawler with the following command:

docker run --env-file .env -d dougtrajano/toxicity-crawler:latest
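For reference, the .env file passed via --env-file could look like the following (all values are placeholders):

    AWS_S3_BUCKET=my-toxicity-bucket
    AWS_S3_BUCKET_PREFIX=twitter
    AWS_ACCESS_KEY_ID=...
    AWS_SECRET_ACCESS_KEY=...
    PERSPECTIVE_API_KEY=...
    PERSPECTIVE_THRESHOLD=0.5
    FILTER_TOXIC_COMMENTS=True
    TWITTER_CONSUMER_KEY=...
    TWITTER_CONSUMER_SECRET=...
    TWITTER_ACCESS_TOKEN=...
    TWITTER_ACCESS_TOKEN_SECRET=...
    TWITTER_MAX_TWEETS=100
    LOG_LEVEL=INFO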

License

The project is licensed under the Apache 2.0 License.

Releases
  • 0.2.1(Dec 27, 2021)

    What's Changed

    • Add wait_on_rate_limit in TwitterAPI by @DougTrajano in https://github.com/DougTrajano/toxicity-crawler/pull/29

    Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.2.0...0.2.1

  • 0.2.0(Dec 25, 2021)

    What's Changed

    • Fixed an issue with tweet content in TwitterAPI by @DougTrajano
    • Added an exploratory notebook to test TwitterAPI by @DougTrajano
    • Bump pyyaml from 5.4.1 to 6.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/12
    • Bump google-api-python-client from 2.22.0 to 2.33.0 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/26
    • Bump metaflow from 2.3.6 to 2.4.7 by @dependabot in https://github.com/DougTrajano/toxicity-crawler/pull/28

    Full Changelog: https://github.com/DougTrajano/toxicity-crawler/compare/0.1.4...0.2.0

  • 0.1.4(Sep 26, 2021)

  • 0.1.3(Sep 24, 2021)

  • 0.1.2(Sep 24, 2021)

  • 0.1.1(Sep 24, 2021)

  • 0.1.0(Sep 24, 2021)
