A pure-python HTML screen-scraping library

Related tags

Web Crawlingscrapely
Overview

Scrapely

https://api.travis-ci.org/scrapy/scrapely.svg?branch=master

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

Overview

Scrapinghub wrote a nice blog post explaining how scrapely works and how it's used in Portia.

Installation

Scrapely works in Python 2.7 or 3.3+. It requires numpy and w3lib Python packages.

To install scrapely on any platform use:

pip install scrapely

If you're using Ubuntu (9.10 or above), you can install scrapely from the Scrapy Ubuntu repos. Just add the Ubuntu repos as described here: http://doc.scrapy.org/en/latest/topics/ubuntu.html

And then install scrapely with:

aptitude install python-scrapely

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows is a quick example of the simplest possible usage, that you can run in a Python shell.

Start by importing and instantiating the Scraper class:

>>> from scrapely import Scraper
>>> s = Scraper()

Then, proceed to train the scraper by adding some page and the data you expect to scrape from there (note that all keys and values in the data you pass must be strings):

>>> url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
>>> data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
>>> s.train(url1, data)

Finally, tell the scraper to scrape any other similar page and it will return the results:

>>> url2 = 'http://pypi.python.org/pypi/Django/1.3'
>>> s.scrape(url2)
[{u'author': [u'Django Software Foundation <foundation at djangoproject com>'],
  u'description': [u'A high-level Python Web framework that encourages rapid development and clean, pragmatic design.'],
  u'name': [u'Django 1.3']}]

That's it! No xpaths, regular expressions, or hacky python code.

Usage (command line tool)

There is also a simple script to create and manage Scrapely scrapers.

It supports a command-line interface, and an interactive prompt. All commands supported on interactive prompt are also supported in the command-line interface.

To enter the interactive prompt type the following without arguments:

python -m scrapely.tool myscraper.json

Example:

$ python -m scrapely.tool myscraper.json
scrapely> help

Documented commands (type help <topic>):
========================================
a  al  s  ta  td  tl

scrapely>

To create a scraper and add a template:

scrapely> ta http://pypi.python.org/pypi/w3lib/1.1
[0] http://pypi.python.org/pypi/w3lib/1.1

This is equivalent as typing the following in one command:

python -m scrapely.tool myscraper.json ta http://pypi.python.org/pypi/w3lib/1.1

To list available templates from a scraper:

scrapely> tl
[0] http://pypi.python.org/pypi/w3lib/1.1

To add a new annotation, you usually test the selection criteria first:

scrapely> t 0 w3lib 1.1
[0] u'<h1>w3lib 1.1</h1>'
[1] u'<title>Python Package Index : w3lib 1.1</title>'

You can also quote the text, if you need to specify an arbitrary number of spaces, for example:

scrapely> t 0 "w3lib 1.1"

You can refine by position. To take the one in position [0]:

scrapely> a 0 w3lib 1.1 -n 0
[0] u'<h1>w3lib 1.1</h1>'

To annotate some fields on the template:

scrapely> a 0 w3lib 1.1 -n 0 -f name
[new] (name) u'<h1>w3lib 1.1</h1>'
scrapely> a 0 Scrapy project -n 0 -f author
[new] u'<span>Scrapy project</span>'

To list annotations on a template:

scrapely> al 0
[0-0] (name) u'<h1>w3lib 1.1</h1>'
[0-1] (author) u'<span>Scrapy project</span>'

To scrape another similar page with the already added templates:

scrapely> s http://pypi.python.org/pypi/Django/1.3
[{u'author': [u'Django Software Foundation'], u'name': [u'Django 1.3']}]

Tests

tox is the preferred way to run tests. Just run: tox from the root directory.

Support

Scrapely is created and maintained by the Scrapy group, so you can get help through the usual support channels described in the Scrapy community page.

Architecture

Unlike most scraping libraries, Scrapely doesn't work with DOM trees or xpaths so it doesn't depend on libraries such as lxml or libxml2. Instead, it uses an internal pure-python parser, which can accept poorly formed HTML. The HTML is converted into an array of token ids, which is used for matching the items to be extracted.

Scrapely extraction is based upon the Instance Based Learning algorithm [1] and the matched items are combined into complex objects (it supports nested and repeated objects), using a tree of parsers, inspired by A Hierarchical Approach to Wrapper Induction [2].

[1] Yanhong Zhai , Bing Liu, Extracting Web Data Using Instance-Based Learning, World Wide Web, v.10 n.2, p.113-132, June 2007
[2] Ion Muslea , Steve Minton , Craig Knoblock, A hierarchical approach to wrapper induction, Proceedings of the third annual conference on Autonomous Agents, p.190-197, April 1999, Seattle, Washington, United States

Known Issues

The training implementation is currently very simple and is only provided for references purposes, to make it easier to test Scrapely and play with it. On the other hand, the extraction code is reliable and production-ready. So, if you want to use Scrapely in production, you should use train() with caution and make sure it annotates the area of the page you intended.

Alternatively, you can use the Scrapely command line tool to annotate pages, which provides more manual control for higher accuracy.

How does Scrapely relate to Scrapy?

Despite the similarity in their names, Scrapely and Scrapy are quite different things. The only similarity they share is that they both depend on w3lib, and they are both maintained by the same group of developers (which is why both are hosted on the same Github account).

Scrapy is an application framework for building web crawlers, while Scrapely is a library for extracting structured data from HTML pages. If anything, Scrapely is more similar to BeautifulSoup or lxml than Scrapy.

Scrapely doesn't depend on Scrapy nor the other way around. In fact, it is quite common to use Scrapy without Scrapely, and viceversa.

If you are looking for a complete crawler-scraper solution, there is (at least) one project called Slybot that integrates both, but you can definitely use Scrapely on other web crawlers since it's just a library.

Scrapy has a builtin extraction mechanism called selectors which (unlike Scrapely) is based on XPaths.

License

Scrapely library is licensed under the BSD license.

Owner
Scrapy project
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
Scrapy project
Telegram group scraper tool

Telegram Group Scrapper

Wahyusaputra 2 Jan 11, 2022
This is a webscraper for a specific website

This is a webscraper for a specific website. It is tuned to extract the headlines of that website. With some little adjustments the webscraper is able to extract any part of the website.

Rahul Siyanwal 1 Dec 13, 2021
Docker containerized Python Flask API that uses selenium to scrape and interact with websites

Docker containerized Python Flask API that uses selenium to scrape and interact with websites

Christian Gracia 0 Jan 22, 2022
A way to scrape sports streams for use with Jellyfin.

Sportyfin Description Stream sports events straight from your Jellyfin server. Sportyfin allows users to scrape for live streamed events and watch str

axelmierczuk 38 Nov 05, 2022
Find papers by keywords and venues. Then download it automatically

paper finder Find papers by keywords and venues. Then download it automatically. How to use this? Search CLI python search.py -k "knowledge tracing,kn

Jiahao Chen (TabChen) 2 Dec 15, 2022
High available distributed ip proxy pool, powerd by Scrapy and Redis

高可用IP代理池 README | 中文文档 本项目所采集的IP资源都来自互联网,愿景是为大型爬虫项目提供一个高可用低延迟的高匿IP代理池。 项目亮点 代理来源丰富 代理抓取提取精准 代理校验严格合理 监控完备,鲁棒性强 架构灵活,便于扩展 各个组件分布式部署 快速开始 注意,代码请在release

SpiderClub 5.2k Jan 03, 2023
👨🏼‍⚖️ reddit bot that turns comment chains into ace attorney scenes

Ace Attorney reddit bot 👨🏼‍⚖️ Reddit bot that turns comment chains into ace attorney scenes. You'll need to sign up for streamable and reddit and se

763 Nov 17, 2022
Automated data scraper for Thailand COVID-19 data

The Researcher COVID data Automated data scraper for Thailand COVID-19 data Accessing the Data 1st Dose Provincial Vaccination Data 2nd Dose Provincia

Porames Vatanaprasan 31 Apr 17, 2022
An utility library to scrape data from TikTok, Instagram, Twitch, Youtube, Twitter or Reddit in one line!

Social Media Scraper An utility library to scrape data from TikTok, Instagram, Twitch, Youtube, Twitter or Reddit in one line! Go to the website » Vie

2 Aug 03, 2022
Google Developer Profile Badge Scraper

Google Developer Profile Badge Scraper GDev Profile Badge Scraper is a Google Developer Profile Web Scraper which scrapes for specific badges in a use

Siddhant Lad 7 Jan 10, 2022
A simple app to scrap data from Twitter.

Twitter-Scraping-App A simple app to scrap data from Twitter. Available Features Search query. Select number of data you want to fetch from twitter. C

Davis David 2 Oct 31, 2022
A command-line program to download media, like and unlike posts, and more from creators on OnlyFans.

onlyfans-scraper A command-line program to download media, like and unlike posts, and more from creators on OnlyFans. Installation You can install thi

185 Jul 23, 2022
京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

MaoTai 129 Dec 14, 2022
An arxiv spider

An Arxiv Spider 做为一个cser,杰出男孩深知内核对连接到计算机上的硬件设备进行管理的高效方式是中断而不是轮询。每当小伙伴发来一篇刚挂在arxiv上的”热乎“好文章时,杰出男孩都会感叹道:”师兄这是每天都挂在arxiv上呀,跑的好快~“。于是杰出男孩找了找 github,借鉴了一下其

Jie Liu 11 Sep 09, 2022
This was supposed to be a web scraping project, but somehow I've turned it into a spamming project

Introduction This was supposed to be a web scraping project, but somehow I've turned it into a spamming project.

Boss Perry (Pez) 1 Jan 23, 2022
Minimal set of tools to conduct stealthy scraping.

Stealthy Scraping Tools Do not use puppeteer and playwright for scraping. Explanation. We only use the CDP to obtain the page source and to get the ab

Nikolai Tschacher 88 Jan 04, 2023
爬虫案例合集。包括但不限于《淘宝、京东、天猫、豆瓣、抖音、快手、微博、微信、阿里、头条、pdd、优酷、爱奇艺、携程、12306、58、搜狐、百度指数、维普万方、Zlibraty、Oalib、小说、招标网、采购网、小红书》

lxSpider 爬虫案例合集。包括但不限于《淘宝、京东、天猫、豆瓣、抖音、快手、微博、微信、阿里、头条、pdd、优酷、爱奇艺、携程、12306、58、搜狐、百度指数、维普万方、Zlibraty、Oalib、小说网站、招标采购网》 简介: 时光荏苒,记不清写了多少案例了。

lx 793 Jan 05, 2023
Web Scraping COVID 19 Meta Portal with Python

Web-Scraping-COVID-19-Meta-Portal-with-Python - Requests API and Beautiful Soup to scrape real-time COVID statistics from worldometer website and perform data cleaning and visual analysis in Jupyter

Aarif Munwar Jahan 1 Jan 04, 2022
PS5 bot to find a console in france for chrismas 🎄🎅🏻 NOT FOR SCALPERS

Une PS5 pour Noël Python + Chrome --headless = une PS5 pour noël MacOS Installer chrome Tweaker le .yaml pour la listes sites a scrap et les criteres

Olivier Giniaux 3 Feb 13, 2022
🕷 Phone Crawler with multi-thread functionality

Phone Crawler: Phone Crawler with multi-thread functionality Disclaimer: I'm not responsible for any illegal/misuse actions, this program was made for

Kmuv1t 3 Feb 10, 2022