A training task for web scraping using Python multithreading and a real-time-updated list of available proxy servers.

Last update: Feb 10, 2022

Overview

Parallel web scraping

The project is a training task for web scraping using Python multithreading and a list of available proxy servers that is updated in real time.

Goal

The script extracts the names and prices of the Top-100 crypto coins and stores the data in a database.

Disclaimer

The task is quite contrived and serves mainly for study purposes. There are numerous mature sources containing both real-time and historical cryptocurrency data.

Solved problems within the project

Multiple pages with one level of nesting are scraped. The crawl gathers internal links from the main page and then loops over them.
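The link-gathering step can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code: the `LinkCollector` class, the `internal_links` helper, the `example.com` URLs and the sample markup are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(html, base_url):
    """Return absolute, de-duplicated links that stay on base_url's host."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for href in parser.hrefs:
        url = urljoin(base_url, href)
        if urlparse(url).netloc == host and url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical main page; the real site and its markup are not given here.
MAIN_PAGE = """
<a href="/coins/bitcoin">Bitcoin</a>
<a href="/coins/ethereum">Ethereum</a>
<a href="https://other.example.org/ad">Ad</a>
<a href="/coins/bitcoin">Bitcoin again</a>
"""

links = internal_links(MAIN_PAGE, "https://example.com/")
for link in links:
    # In a real run, each detail page would be downloaded and parsed here.
    print(link)
```

Resolving every `href` against the base URL before comparing hosts keeps relative and absolute links to the same page from being visited twice.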
To avoid getting banned by the remote server, a mechanism for working with proxy servers was implemented.
Free public proxy servers are commonly considered unreliable in terms of availability. To overcome this issue:
- another scraping script extracts a list of free public proxy servers from a website.
- on each launch of the script, the list of 10 proxy servers is refreshed with currently available proxy servers.
- during script execution, some proxy servers become unavailable. Thus, each scraping query goes through this list and searches for a live proxy server to execute the query.
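The "search the list for a live proxy" step might look roughly like the sketch below. It is an assumption about the general shape, not the project's implementation: `fetch_via_proxy`, `fetch_with_fallback`, the `host:port` proxy format and the sample addresses are all hypothetical. The fetcher is passed in as a parameter so the fallback logic can be exercised without network access.

```python
import urllib.request


def fetch_via_proxy(url, proxy, timeout=5):
    """Fetch url through one HTTP proxy given as 'host:port'; raises on failure."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_fallback(url, proxies, fetch=fetch_via_proxy):
    """Try each proxy in order; return (proxy, body) from the first that works."""
    for proxy in proxies:
        try:
            return proxy, fetch(url, proxy)
        except OSError:
            # URLError, timeouts and connection errors are OSError subclasses.
            continue
    raise RuntimeError(f"no live proxy available for {url}")


# Offline demonstration with a stand-in fetcher; the addresses are made up.
def fake_fetch(url, proxy):
    if proxy != "10.0.0.3:8080":
        raise ConnectionError(f"{proxy} is down")
    return "<html>ok</html>"


used, body = fetch_with_fallback(
    "https://example.com/coins/bitcoin",
    ["10.0.0.1:8080", "10.0.0.2:3128", "10.0.0.3:8080"],
    fetch=fake_fetch,
)
print(used)
```

A refinement worth considering is to drop a proxy from the list after it fails, so later queries do not pay the timeout for the same dead server again.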
To speed up scraping of the 101 web pages in total, multithreading is used. The work is divided among 4 threads running almost simultaneously.
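One way to split 101 pages among 4 threads is a thread pool, sketched below. The original script may manage `threading.Thread` objects directly. `ThreadPoolExecutor` is used here only because it is compact, and `scrape_page` and the URLs are placeholders. Threads suit this job because the workers spend most of their time waiting on network I/O.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_page(url):
    """Stand-in for the real download-and-parse step (hypothetical)."""
    return url, len(url)


# 101 placeholder URLs, assumed here to be one main page plus 100 coin pages.
urls = ["https://example.com/"] + [
    f"https://example.com/coins/{i}" for i in range(1, 101)
]

# Four worker threads take URLs from a shared queue; map() preserves input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(scrape_page, urls))

print(len(results))
```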
The extracted data is written directly to a database.
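The database engine is not named here, so the following sketch assumes SQLite through the standard `sqlite3` module. The table name, column names and sample rows are placeholders rather than real prices.

```python
import sqlite3

# In-memory database for illustration; a file path would make it persistent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coins (name TEXT PRIMARY KEY, price REAL NOT NULL)")

# Placeholder rows standing in for scraped (name, price) pairs.
rows = [("Bitcoin", 100.0), ("Ethereum", 10.0)]

# The connection as a context manager commits on success, rolls back on error.
with conn:
    conn.executemany(
        "INSERT OR REPLACE INTO coins (name, price) VALUES (?, ?)", rows
    )

stored = conn.execute("SELECT name, price FROM coins ORDER BY name").fetchall()
print(stored)
conn.close()
```

By default, a `sqlite3` connection may only be used from the thread that created it, so in a multithreaded scraper it is simplest to collect results from the workers and write them from a single thread.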

Owner

Kushal Shingote

Web scraper built using Python.

A Python script to scrape overviews and reviews of companies from Glassdoor.

A Python package that scrapes Google News article data while remaining undetected by Google.

Bilibili scraper: centered on individual users

Library to scrape and clean web pages to create massive datasets.

Automated Linkedin bot that will improve your visibility and increase your network.

DaProfiler allows you to get emails, social media, addresses, works and more on your target using web scraping and Google dorking techniques

Scrape Twitter for Tweets

Script for grabbing Moutai on JD.com: scheduled automatic triggering, automatic reservation, automatic stop

robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

A Smart, Automatic, Fast and Lightweight Web Scraper for Python

Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Unja is a fast & light tool for fetching known URLs from Wayback Machine

Simple proxy scraper made by using ProxyScrape's api.

Scraping Top Repositories for Topics on GitHub

SkyScrapers: A collection of a variety of Scraping Apps

Extract gene TSS sites from a gencode/ensembl/gencode database GTF file and export a BED format file.

crypto currency scraping

Screenhook is a script that captures an image of a web page and sends it to a Discord webhook.

Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js