Async Python 3.6+ web scraping micro-framework based on asyncio

Last update: Jan 01, 2023

Ruia

🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio.

⚡ Write less, run faster.

Overview

Ruia is an async web scraping micro-framework, written with asyncio and aiohttp, that aims to make crawling URLs as convenient as possible.

Write less, run faster:

Documentation: Chinese documentation | documentation
Organization: python-ruia
Plugin: awesome-ruia (Any contributions you make are greatly appreciated!)

Features

Easy: Declarative programming
Fast: Powered by asyncio
Extensible: Middlewares and plugins
Powerful: JavaScript support

Installation

# For Linux & Mac
pip install -U ruia[uvloop]

# For Windows
pip install -U ruia

# New features
pip install git+https://github.com/howie6879/ruia

Tutorials

TODO

Cache for debugging, to reduce the impact of request limits: ruia-cache
Provide an easy way to debug the script, ruia-shell
Distributed crawling/scraping

Contribution

Ruia is still under development; feel free to open issues and pull requests:

Report or fix bugs
Request or publish plugins
Write or fix documentation
Add test cases

Notice: We use black to format the code.

Thanks

Comments

Add rtds support.

I notice that you have tried to use mkdocs to generate the website.

Here's an example at readthedocs.org, powered by sphinx.

There's a little bug, but it is still great.

RTDs

opened by panhaoyu 32
Log crucial information regardless of log-level
I've lowered the log verbosity of a Spider in my script because I find it too verbose, but doing so also filters out crucial info, particularly the after-completion info (number of requests, time, etc.) - https://github.com/howie6879/ruia/blob/651fac54540fe0030d3a3d5eefca6c67d0dcb3c3/ruia/spider.py#L280-L287

This is code I currently use to reduce verbosity:

import logging

# Disable logging (for speed)
logging.root.setLevel(logging.ERROR)

I'm thinking of changing the code so that it shows regardless of log level, but will there ever be a case where you wouldn't want to see it?
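One way to keep a completion summary visible while the rest of the output stays quiet is to route it through a dedicated logger with its own level. The sketch below uses only the standard `logging` module; the logger name `spider.summary` and the `report` helper are assumptions made for illustration, not part of Ruia.

```python
import logging

# Quiet everything by default, as in the snippet above.
logging.root.setLevel(logging.ERROR)

# Hypothetical dedicated logger for the end-of-run summary (name is an assumption).
summary_logger = logging.getLogger("spider.summary")
summary_logger.setLevel(logging.INFO)   # explicit level, so the root level is not inherited
summary_logger.propagate = False        # keep records away from the root handlers
_handler = logging.StreamHandler()
_handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
summary_logger.addHandler(_handler)


def report(requests_made: int, seconds: float) -> str:
    """Log the completion summary and return the message that was logged."""
    message = f"Total requests: {requests_made}, time: {seconds:.2f}s"
    summary_logger.info(message)
    return message
```

Loggers without an explicit level still inherit `ERROR` from the root, so only the summary gets through.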
opened by abmyii 13
`DELAY` attribute specifically for retries

I assumed the DELAY attr would set the delay for retries but instead it applies to all requests. I would appreciate it if there was a DELAY attr specifically for retries (RETRY_DELAY). I'd be happy to implement it if given the go-ahead.

Thank you for this great library!
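The distinction being requested can be sketched with plain asyncio: one delay applied before every request and a separate one applied only between retries. `fetch_with_retries` and its parameters are hypothetical names for illustration; this is not Ruia's implementation.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def fetch_with_retries(
    fetch: Callable[[], Awaitable[T]],
    retries: int = 3,
    delay: float = 0.0,
    retry_delay: float = 0.1,
) -> T:
    """Wait `delay` before the first attempt and `retry_delay` before each retry."""
    await asyncio.sleep(delay)
    for attempt in range(retries + 1):
        try:
            return await fetch()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            await asyncio.sleep(retry_delay)
    raise AssertionError("unreachable when retries >= 0")
```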

opened by abmyii 13

Calling `self.start` as an instance method for a `Spider`

I have the following parent class which has reusable code for all the spiders in my project (this is just a basic example):

class Downloader(Spider):
    concurrency = 15
    worker_numbers = 2

    # RETRY_DELAY (secs) is time between retries
    request_config = {
        "RETRIES": 10,
        "DELAY": 0,
        "RETRY_DELAY": 0.1
    }

    db_name = "DB"
    db_url = "postgresql://..."
    main_table = "test"

    def __init__(self, *args, **kwargs):
        # Initialise DB connection
        self.db = DB(self.db_url, self.db_name, self.main_table)

    def download(self):
        self.start()

        # After completion, commit to DB
        self.db.commit()

I use it by subclassing it for each spider. However, it seems that self.start cannot be called as an instance method on a spider (since it's a classmethod), and instantiating the spider gives this error:

Traceback (most recent call last):
  File "src/scraper.py", line 107, in <module>
    scraper = Scraper()
  File "src/downloader.py", line 31, in __init__
    super(Downloader, self).__init__(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/ruia/spider.py", line 159, in __init__
    self.request_session = ClientSession()
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 210, in __init__
    loop = get_running_loop(loop)
  File "/usr/lib/python3.8/site-packages/aiohttp/helpers.py", line 269, in get_running_loop
    loop = asyncio.get_event_loop()
  File "/usr/lib/python3.8/asyncio/events.py", line 639, in get_event_loop
    raise RuntimeError('There is no current event loop in thread %r.'
RuntimeError: There is no current event loop in thread 'MainThread'.
Exception ignored in: <function ClientSession.__del__ at 0x7f28875e8b80>
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 302, in __del__
    if not self.closed:
  File "/usr/lib/python3.8/site-packages/aiohttp/client.py", line 916, in closed
    return self._connector is None or self._connector.closed
AttributeError: 'ClientSession' object has no attribute '_connector'

Any idea how I can solve this issue whilst maintaining the structure I am trying to implement?
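The traceback suggests the failure happens when the spider is instantiated outside a running event loop. A possible restructuring, sketched here with a stand-in class, is to let the classmethod build the instance inside the loop and to do the post-run work in another classmethod. `MiniSpider` is a hypothetical stand-in rather than Ruia's `Spider`, and the assumption that `start` returns the finished instance may not hold for the real library.

```python
import asyncio


class MiniSpider:
    """Hypothetical stand-in for a spider base class whose `start` is a classmethod."""

    def __init__(self):
        # Loop-bound resources are created here, so an instance can only be
        # built while an event loop is running.
        self.loop = asyncio.get_running_loop()
        self.results = []

    async def crawl(self):
        self.results.append("page")

    @classmethod
    def start(cls):
        async def _main():
            spider = cls()  # instantiated inside the running loop
            await spider.crawl()
            return spider

        return asyncio.run(_main())


class Downloader(MiniSpider):
    committed = False

    @classmethod
    def download(cls):
        spider = cls.start()   # the class, not an instance, drives the run
        cls.committed = True   # placeholder for committing to the DB afterwards
        return spider
```

Subclasses then call `Downloader.download()` on the class and never construct an instance themselves.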

opened by abmyii 11

asyncio `RuntimeError`

ERROR asyncio Exception in callback BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)
handle: <Handle BaseSelectorEventLoop._sock_write_done(150)(<Future finished result=None>)>
Traceback (most recent call last):
  File "/usr/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 516, in _sock_write_done
    self.remove_writer(fd)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 346, in remove_writer
    self._ensure_fd_no_transport(fd)
  File "/usr/lib/python3.8/asyncio/selector_events.py", line 251, in _ensure_fd_no_transport
    raise RuntimeError(
RuntimeError: File descriptor 150 is used by transport <_SelectorSocketTransport fd=150 read=idle write=<polling, bufsize=0>>

Getting this quite a bit still. I don't think it's ruia directly, but aiohttp. Any ideas?

One thing that may be causing it is that in clean functions I call other functions synchronously, i.e.:

    async def clean_<...>(self, value):
        return <function>(value)

Could that be causing it? I tried doing return await ... but the error still persisted.
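On the narrower question of calling ordinary functions from a coroutine: a plain function call needs no `await`, and only a blocking call needs to be moved off the event loop. The sketch below shows both cases with the standard library; the helper names are hypothetical, and it is not presented as a fix for the file descriptor error above.

```python
import asyncio
import time


def normalise(value: str) -> str:
    """Cheap helper: safe to call directly from a coroutine."""
    return value.strip().lower()


def slow_lookup(value: str) -> str:
    """Blocking helper (simulated with sleep): would stall the loop if called directly."""
    time.sleep(0.05)
    return value.upper()


async def clean_title(value: str) -> str:
    # A plain function call needs no await.
    return normalise(value)


async def clean_code(value: str) -> str:
    # Offload the blocking call to the default thread pool executor.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, slow_lookup, value)
```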

opened by abmyii 11

Show URL in Error for easier debugging

I think errors would be more useful if they also showed the URL of the parsed page. Example:

ERROR Spider <Item: extract ... error, please check selector or set parameter named default>, https://...

I hacked a solution together by passing around the url parameter, but I can't think of a clean solution ATM. Any ideas? I can also push my changes if you would like to see them (very hacky).
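One alternative to threading the url parameter through every call is to attach it to the exception itself, so whoever logs the error can include it. This is a minimal sketch with hypothetical names (`ItemExtractionError`, `extract_title`); it does not reflect how Ruia raises or logs its errors.

```python
class ItemExtractionError(Exception):
    """Hypothetical error type that remembers which page it came from."""

    def __init__(self, message: str, url: str = ""):
        super().__init__(message)
        self.url = url

    def __str__(self) -> str:
        base = super().__str__()
        return f"{base}, {self.url}" if self.url else base


def extract_title(html: str, url: str) -> str:
    """Toy extractor: raises with the URL attached when <title> is missing."""
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        raise ItemExtractionError("extract title error", url=url)
    return html[start + len("<title>"):end]
```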

opened by abmyii 11

Running the example code raises an error

I followed the code in https://github.com/howie6879/ruia/blob/master/docs/en/tutorials/item.md

import asyncio
from ruia import Item, TextField, AttrField


class PythonDocumentationItem(Item):
    title = TextField(css_select='title')
    tutorial_link = AttrField(xpath_select="//a[text()='Tutorial']", attr='href')


async def main():
    url = 'https://docs.python.org/3/'
    item = await PythonDocumentationItem.get_item(url=url)
    print(item.title)
    print(item.tutorial_link)


if __name__ == '__main__':
    # Python 3.7 required
    asyncio.run(main())

Running it returns the expected result (3.9.5 Documentation, tutorial/index.html), but it also reports RuntimeError: Event loop is closed. The full output is shown below:

[2021:05:06 14:11:55] INFO  Request <GET: https://docs.python.org/3/>
3.9.5 Documentation
tutorial/index.html
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x0000018664A679D0>
Traceback (most recent call last):
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 746, in call_soon
    self._check_closed()
  File "C:\Users\lenovo\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

PC: Windows 10 64-bit; Python: 3.9.4 64-bit
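The traceback comes from the proactor transport being cleaned up after the loop has closed. A commonly used workaround on Windows is to switch to the selector event loop policy before calling `asyncio.run`. The sketch below assumes that workaround applies here, which the thread does not confirm; `use_selector_loop_on_windows` is a hypothetical helper name.

```python
import asyncio
import sys


def use_selector_loop_on_windows() -> bool:
    """Switch to the selector event loop policy on Windows; return True if switched."""
    if sys.platform.startswith("win") and hasattr(asyncio, "WindowsSelectorEventLoopPolicy"):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        return True
    return False


async def main() -> str:
    await asyncio.sleep(0)  # placeholder for the scraping coroutine
    return "done"
```

Calling `use_selector_loop_on_windows()` once before `asyncio.run(main())` is enough; on other platforms it does nothing.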

opened by qgyhd1234 10

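I'd like to compare my distributed function scheduling framework with yours, to see which needs less code and gives more freedom for crawling arbitrary websites. Discussion is welcome.

https://github.com/ydf0509/distributed_framework/blob/master/test_frame/car_home_crawler_sample/car_home_consumer.py

You're welcome to compare. If you'd rather not test against Autohome (汽车之家), you can pick a crawl of any two-level website, and we can see whose code is shorter, whose style is faster and freer to write, who offers more means of control, and whose crawler runs faster.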

opened by ydf0509 9
`TextField` strips strings which may not be desirable

https://github.com/howie6879/ruia/blob/8a91c0129d38efd8fcd3bee10b78f694a1c37213/ruia/field.py#L120

My use case is extracting paragraphs which have newlines between them, and these are stripped out by TextField. Should a new field be introduced (I have already made one for my scraper), or should the stripping be optional? Perhaps both is best.

opened by abmyii 9

Trouble scraping deck.tk/deckstats.net

For example:

import asyncio
from ruia import Request


async def request_example():
    url = "https://deck.tk/07Pw8tfr"
    params = {
        'name': 'ruia',
    }
    headers = {
        'User-Agent': 'Python3.6',
    }
    request = Request(url=url, method='GET', params=params, headers=headers)
    response = await request.fetch()
    json_result = await response.json()
    print(json_result)


if __name__ == '__main__':
    asyncio.get_event_loop().run_until_complete(request_example())

This simply hangs without resolution. That is, the request is never resolved, and I must Ctrl-C out of it. Scrapy handles this without issue, but I was hoping to transition to ruia. Any ideas?
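While the cause is being investigated, a hang like this can at least be turned into an explicit error by bounding the wait with `asyncio.wait_for`. The sketch is standard library only; `fetch_with_timeout` and `never_finishes` are hypothetical names, and the latter merely simulates a request that never resolves.

```python
import asyncio


async def fetch_with_timeout(coro, seconds: float):
    """Await `coro`, raising asyncio.TimeoutError if it takes longer than `seconds`."""
    return await asyncio.wait_for(coro, timeout=seconds)


async def never_finishes():
    await asyncio.sleep(3600)  # stand-in for a request that hangs
```

In the example above, the same wrapper could be applied to `request.fetch()`.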

bug enhancement

opened by Triquetra 7

Would be nice to be able to pass in "start_urls"

Ruia seems like a brilliant way to write simple and elegant web scrapers, but I can't figure out how to supply a different "start_urls" value at runtime. I want a web scraper that can check all links on any GIVEN web page, not just whatever the start_urls lead me to, but also with the simplicity and asynchronous power that Ruia provides. Maybe this is already a feature, but I can't tell from the documentation or code.
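If `start_urls` is read from the class, one generic Python approach is to build a subclass at runtime with the desired value. The sketch below demonstrates the idea on a stand-in class; `BaseLinkChecker` and `make_checker` are hypothetical, and whether this works with Ruia's `Spider` is an assumption.

```python
class BaseLinkChecker:
    """Hypothetical stand-in for a spider class that reads `start_urls` from the class."""

    start_urls: list = []

    @classmethod
    def start(cls):
        # Placeholder for the real crawl: just report which URLs would be visited.
        return [f"visited {url}" for url in cls.start_urls]


def make_checker(urls):
    """Build a subclass whose start_urls are chosen at runtime."""
    return type("LinkChecker", (BaseLinkChecker,), {"start_urls": list(urls)})
```

For example, `make_checker(["https://example.com"]).start()` runs against the given page without editing the class definition.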

opened by JacobJustice 7
Improve Chinese documentation
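Toc: Ruia Chinese documentation

[x] Quick start

[ ] Getting started guide

[ ] 1. Overview

[ ] 2. Installation

[ ] 3. Defining an Item

[ ] 4. Running a Spider

[ ] 5. Customization

[ ] 6. Plugins

[ ] 7. Help

[ ] Basic concepts

[ ] 1. Request

[ ] 2. Response

[ ] 3. Item

[ ] 4. Field

[ ] 5. Spider

[ ] 6. Middleware

[ ] Development guide

[ ] 1. Setting up the development environment

[ ] 2. Ruia architecture

[ ] 3. Writing plugins for Ruia

[ ] 4. Contributing code

[ ] Practice guide

[ ] 1. Thoughts on understanding Python crawlers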

enhancement
opened by howie6879 0

Releases(v0.8.0)

v0.8.0(Jan 7, 2021)

Source code(tar.gz)
Source code(zip)
v0.6.9(Aug 15, 2020)

Source code(tar.gz)
Source code(zip)
ruia-0.6.9.tar.gz(22.53 KB)
v0.5.0(Feb 14, 2019)

Source code(tar.gz)
Source code(zip)
v0.4.6(Feb 9, 2019)
Changes:

Improved codebase test coverage from 93% to 96%

Add response hook

Add json() text() read() for Response

Update docs

Source code(tar.gz)
Source code(zip)
v0.4.0(Feb 9, 2019)

Source code(tar.gz)
Source code(zip)

Owner

howie.hu

Remarkable writings we enjoy together; doubtful points we analyze together.

GitHub Repository https://docs.python-ruia.org/

Scraping followers of an instagram account

ScrapInsta A script to scraping data from Instagram Install First of all you can run: pip install scrapinsta After that you need to install these requ

1 Sep 05, 2021

Free-Game-Scraper is a useful script that allows you to track down free games and DLCs on many platforms.

Game Scraper Free-Game-Scraper is a useful script that allows you to track down free games and DLCs on many platforms. Join the discord About The Proj

2 Mar 28, 2022

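Crawl the employment websites of several major universities in Jiangsu with Python and notify users in 3 ways: via WeChat, direct command-line output, and Windows balloon notifications.

crawler_for_university Crawl the employment websites of several major universities in Jiangsu with Python and notify users in 3 ways: via WeChat, direct command-line output, and Windows balloon notifications. Environment dependencies: wxpy, requests, bs4 and other libraries. Feature description: this project is based on Python and uses a crawler to crawl each university's employment information site, crawling recruitment info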

8 Aug 16, 2021

Scrape plants scientific name information from Agroforestry Species Switchboard 2.0.

Agroforestry Species Switchboard 2.0 Scraper Scrape plants scientific name information from Species Switchboard 2.0. Requirements python = 3.10 (you

2 Dec 23, 2021

script to scrape direct download links (ddls) from google drive index.

bhadoo Google Personal/Shared Drive Index scraper. A small script to scrape direct download links (ddls) of downloadable files from bhadoo google driv

53 Dec 16, 2022

A simplistic scraper made to download tons of random screenshots made by people.

printStealer 1.1 What is this tool? This tool is developed to show the insecurity of the screenshot utility called prnt sc. It is a site that stores s

4 Jul 26, 2022

Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine .

TwitterScraper Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine . Screenshot Data Users Only

19 Nov 17, 2022

An Web Scraping API for MDL(My Drama List) for Python.

PyMDL An API for MyDramaList(MDL) based on webscraping for python. Description An API for MDL to make your life easier in retriving and working on dat

6 Dec 10, 2022

A repository with scraping code and soccer dataset from understat.com.

UNDERSTAT - SHOTS DATASET As many people interested in soccer analytics know, Understat is an amazing source of information. They provide Expected Goa

48 Jan 03, 2023

IGLS - Instagram Like Scraper CLI tool

IGLS - Instagram Like Scraper It's a web scraping command line tool based on python and selenium. Description This is a trial tool for learning purpos

5 Oct 29, 2021

OSTA web scraper, for checking the status of school buses in Ottawa

OSTA-La-Vista OSTA web scraper, for checking the status of school buses in Ottawa. Getting Started Using a Raspberry Pi, download Python 3, and option

1 Jan 28, 2022

TikTok Username Swapper/Claimer/etc

TikTok-Turbo TikTok Username Swapper/Claimer/etc I wanted to create it as fast as possible but i eventually gave up and recoded it many many many many

12 Dec 19, 2022

This is a sport analytics project that combines the knowledge of OOP and Webscraping

This is a sport analytics project that combines the knowledge of Object Oriented Programming (OOP) and Webscraping, the weekly scraping of the English Premier league table is carried out to assess th

1 Nov 26, 2021

An helper library to scrape data from TikTok in one line, using the Influencer Hunters APIs.

TikTok Scraper An utility library to scrape data from TikTok hassle-free Go to the website » View Demo · Report Bug · Request Feature About The Projec

6 Jan 08, 2023

WebScrapping Project - G1 Latest News

Web Scrapping com Python Esse projeto consiste em um código para o usuário buscar as últimas nóticias sobre um termo qualquer, no site G1. Para esse p

2 Feb 13, 2022

Github scraper app is used to scrape data for a specific user profile created using streamlit and BeautifulSoup python packages

Github Scraper Github scraper app is used to scrape data for a specific user profile. Github scraper app gets a github profile name and check whether

6 Apr 05, 2022

Scrape all the media from an OnlyFans account - Updated regularly

Batch download all of a Douyin user's videos without watermarks

Automatically scrapes all menu items from the Taco Bell website

A Python module to bypass Cloudflare's anti-bot page.