Web crawling framework based on asyncio.

Overview


Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp.

Requirements

  • Python 3.5+

Installation

pip install gain

pip install uvloop (Linux only)
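uvloop is a faster drop-in replacement for asyncio's default event loop, and gain is written against it. For reference, this is how uvloop is normally enabled in plain asyncio code (standard uvloop usage, independent of gain's own setup):

import asyncio

import uvloop

# Swap asyncio's default event loop policy for uvloop's faster one;
# any event loop created after this call is a uvloop loop.
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())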

Usage

  1. Write spider.py:
from gain import Css, Item, Parser, Spider
import aiofiles


class Post(Item):
    # CSS selectors for the fields to extract from each matched page.
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append each parsed title to a local file, asynchronously.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'])


class MySpider(Spider):
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep the regex escapes (\d) intact.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()
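The save coroutine above persists only the title. A sketch like the following would persist both declared fields, assuming (as the title lookup above suggests) that self.results maps each declared field name to its extracted text:

import aiofiles

from gain import Css, Item


class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # self.results is assumed to hold one entry per declared field.
        async with aiofiles.open('scrapinghub.txt', 'a+') as f:
            await f.write(self.results['title'] + '\n')
            await f.write(self.results['content'] + '\n')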

Or use XPathParser:

from gain import Css, Item, Spider, XPathParser


class Post(Item):
    title = Css('.breadcrumb_last')

    async def save(self):
        print(self.title)


class MySpider(Spider):
    start_url = 'https://mydramatime.com/europe-and-us-drama/'
    concurrency = 5
    headers = {'User-Agent': 'Google Spider'}
    # Each XPath expression yields URLs to follow; the last parser
    # hands matched pages to the Post item.
    parsers = [
               XPathParser('//span[@class="category-name"]/a/@href'),
               XPathParser('//div[contains(@class, "pagination")]/ul/li/a[contains(@href, "page")]/@href'),
               XPathParser('//div[@class="mini-left"]//div[contains(@class, "mini-title")]/a/@href', Post)
              ]
    proxy = 'https://localhost:1234'

MySpider.run()

You can add a proxy setting to the spider as shown above.
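For example, the proxy address can be kept out of the source by reading it from the environment (a minimal sketch; PROXY_URL is a hypothetical variable name, not part of gain's API):

import os

from gain import Parser, Spider


class ProxySpider(Spider):
    start_url = 'https://blog.scrapinghub.com/'
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/')]
    # Read the proxy address from the environment instead of hard-coding it;
    # fall back to the local proxy used in the example above.
    proxy = os.environ.get('PROXY_URL', 'https://localhost:1234')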

  2. Run python spider.py.

  3. Result: the spider crawls the matching pages and each parsed item is saved via its save coroutine.

Example

The examples are in the /example/ directory.

Contribution

  • Open a pull request.
  • Open an issue.
Owner

Jiuli Gao, Python developer.