A high-performance, lightweight and human-friendly serving engine for scrapy

Last update: Mar 01, 2022

Overview

scrapy-x (X)

A distributed, scalable and lightweight environment for deploying and running scrapy spiders and projects on commodity hardware with no hassle. It is also compatible with the scrapyd /schedule.json and /daemonstatus.json endpoints.

Installation

$ pip install -U git+https://github.com/speakol-ads/scrapy-x.git

Usage

Let's assume that you have a scrapy project called TestCrawler:

cd into the TestCrawler directory
run scrapy x
that is all!

Default Settings

X reads its configuration from your project's default settings.py file. The available settings and their default values are:

# needed for the os.cpu_count() defaults below
import os

# whether to enable debug mode or not
X_DEBUG = True

# the default queue name that the system will use
# actually it will be used as a prefix for its internal
# queues, currently there is only one queue called `X_QUEUE_NAME + '.BACKLOG'`
# which holds all jobs that should be crawled.
X_QUEUE_NAME = 'SCRAPY_X_QUEUE'

# the queue workers
# by default it uses the cpu cores count
# try to adjust it based on your resources & needs
X_QUEUE_WORKERS_COUNT = os.cpu_count()

# the web server workers count
# the number of workers uvicorn should spawn
# defaults to the available cpu count
# try to adjust it based on your resources & needs
X_SERVER_WORKERS_COUNT = os.cpu_count()

# the port the http server should listen on
X_SERVER_LISTEN_PORT = 6800

# the host used by the http server to listen on
X_SERVER_LISTEN_HOST = '0.0.0.0'

# whether to enable access log or not
X_ENABLE_ACCESS_LOG = True

# redis host
X_REDIS_HOST = 'localhost'

# redis port
X_REDIS_PORT = 6379

# redis db
X_REDIS_DB = 0

# redis password
X_REDIS_PASSWORD = ''

# the maximum allowed wait time for a running task
# it will be killed after that time.
X_TASK_TIMEOUT = 25
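
As an illustration of how these defaults might be overridden, here is a minimal sketch of a few lines you could add to your project's settings.py. The values and the `BACKLOG_QUEUE` helper variable are assumptions made for the example, not recommendations or part of scrapy-x itself; only the setting names and the `'.BACKLOG'` suffix come from the list above.

```python
import os

# Hypothetical overrides for a small single-machine deployment.
X_DEBUG = False
X_QUEUE_NAME = 'MY_PROJECT_QUEUE'

# os.cpu_count() may return None on some platforms, so fall back to 1,
# then use half of the cores for queue workers (never fewer than one).
X_QUEUE_WORKERS_COUNT = max(1, (os.cpu_count() or 1) // 2)

# Listen on the loopback interface only instead of every interface.
X_SERVER_LISTEN_HOST = '127.0.0.1'

# Illustrative only: the backlog queue name as described in the
# X_QUEUE_NAME comment (prefix + '.BACKLOG').
BACKLOG_QUEUE = X_QUEUE_NAME + '.BACKLOG'
print(BACKLOG_QUEUE)
```

Any setting you do not override keeps the default value listed above.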

Available Endpoints

In addition to the scrapyd-compatible core endpoints (schedule.json, daemonstatus.json), the following endpoints are available:

GET /

returns some info about the engine like the available spiders and backlog queue length

GET|POST /run/{spider_name}

Executes the spider named in {spider_name} and waits for it to return its result. P.S.: any query parameter or JSON POST data is passed to the spider as an argument, like -a key=value.

GET|POST /enqueue/{spider_name}

Adds the spider named in {spider_name} to the backlog to be executed later. P.S.: any query parameter or JSON POST data is used as a spider argument.
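
The two endpoints above can be called from any HTTP client. The following is a minimal sketch using only the Python standard library. It assumes the server runs locally on the default port 6800, and the spider name `quotes` and the argument `category` are hypothetical. The helper only builds the requests, since the response format is not documented here.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

# Assumption: the server runs locally with the default X_SERVER_LISTEN_PORT.
BASE_URL = "http://localhost:6800"


def build_request(action, spider_name, query=None, body=None):
    """Build (but do not send) a request for /run/{spider} or /enqueue/{spider}."""
    if action not in ("run", "enqueue"):
        raise ValueError("action must be 'run' or 'enqueue'")
    url = f"{BASE_URL}/{action}/{quote(spider_name, safe='')}"
    if query:
        # Query parameters become spider arguments.
        url += "?" + urlencode(query)
    if body is None:
        return Request(url, method="GET")
    # JSON POST data is also passed to the spider as arguments.
    data = json.dumps(body).encode("utf-8")
    return Request(
        url,
        data=data,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical spider "quotes" with a hypothetical argument "category".
run_req = build_request("run", "quotes", query={"category": "books"})
enqueue_req = build_request("enqueue", "quotes", body={"category": "books"})
print(run_req.get_method(), run_req.full_url)
print(enqueue_req.get_method(), enqueue_req.full_url)
```

To actually send a request, pass it to `urllib.request.urlopen`. Keep in mind that /run waits for the spider to finish, so a client-side timeout longer than X_TASK_TIMEOUT may be appropriate.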

Author

I'm Mohamed, a software engineer who enjoys writing code in his free time. I work with Python, PHP, Go, Rust and JS.

P.S: star the project if you liked it ^_^

Owner

Speakol Ads
