A pure-python HTML screen-scraping library

Related tags

Web Crawlingscrapely
Overview

Scrapely

https://api.travis-ci.org/scrapy/scrapely.svg?branch=master

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

Overview

Scrapinghub wrote a nice blog post explaining how scrapely works and how it's used in Portia.

Installation

Scrapely works in Python 2.7 or 3.3+. It requires numpy and w3lib Python packages.

To install scrapely on any platform use:

pip install scrapely

If you're using Ubuntu (9.10 or above), you can install scrapely from the Scrapy Ubuntu repos. Just add the Ubuntu repos as described here: http://doc.scrapy.org/en/latest/topics/ubuntu.html

And then install scrapely with:

aptitude install python-scrapely

Usage (API)

Scrapely has a powerful API, including a template format that can be edited externally, that you can use to build very capable scrapers.

What follows is a quick example of the simplest possible usage, that you can run in a Python shell.

Start by importing and instantiating the Scraper class:

>>> from scrapely import Scraper
>>> s = Scraper()

Then, proceed to train the scraper by adding some page and the data you expect to scrape from there (note that all keys and values in the data you pass must be strings):

>>> url1 = 'http://pypi.python.org/pypi/w3lib/1.1'
>>> data = {'name': 'w3lib 1.1', 'author': 'Scrapy project', 'description': 'Library of web-related functions'}
>>> s.train(url1, data)

Finally, tell the scraper to scrape any other similar page and it will return the results:

>>> url2 = 'http://pypi.python.org/pypi/Django/1.3'
>>> s.scrape(url2)
[{u'author': [u'Django Software Foundation <foundation at djangoproject com>'],
  u'description': [u'A high-level Python Web framework that encourages rapid development and clean, pragmatic design.'],
  u'name': [u'Django 1.3']}]

That's it! No xpaths, regular expressions, or hacky python code.

Usage (command line tool)

There is also a simple script to create and manage Scrapely scrapers.

It supports a command-line interface, and an interactive prompt. All commands supported on interactive prompt are also supported in the command-line interface.

To enter the interactive prompt type the following without arguments:

python -m scrapely.tool myscraper.json

Example:

$ python -m scrapely.tool myscraper.json
scrapely> help

Documented commands (type help <topic>):
========================================
a  al  s  ta  td  tl

scrapely>

To create a scraper and add a template:

scrapely> ta http://pypi.python.org/pypi/w3lib/1.1
[0] http://pypi.python.org/pypi/w3lib/1.1

This is equivalent as typing the following in one command:

python -m scrapely.tool myscraper.json ta http://pypi.python.org/pypi/w3lib/1.1

To list available templates from a scraper:

scrapely> tl
[0] http://pypi.python.org/pypi/w3lib/1.1

To add a new annotation, you usually test the selection criteria first:

scrapely> t 0 w3lib 1.1
[0] u'<h1>w3lib 1.1</h1>'
[1] u'<title>Python Package Index : w3lib 1.1</title>'

You can also quote the text, if you need to specify an arbitrary number of spaces, for example:

scrapely> t 0 "w3lib 1.1"

You can refine by position. To take the one in position [0]:

scrapely> a 0 w3lib 1.1 -n 0
[0] u'<h1>w3lib 1.1</h1>'

To annotate some fields on the template:

scrapely> a 0 w3lib 1.1 -n 0 -f name
[new] (name) u'<h1>w3lib 1.1</h1>'
scrapely> a 0 Scrapy project -n 0 -f author
[new] u'<span>Scrapy project</span>'

To list annotations on a template:

scrapely> al 0
[0-0] (name) u'<h1>w3lib 1.1</h1>'
[0-1] (author) u'<span>Scrapy project</span>'

To scrape another similar page with the already added templates:

scrapely> s http://pypi.python.org/pypi/Django/1.3
[{u'author': [u'Django Software Foundation'], u'name': [u'Django 1.3']}]

Tests

tox is the preferred way to run tests. Just run: tox from the root directory.

Support

Scrapely is created and maintained by the Scrapy group, so you can get help through the usual support channels described in the Scrapy community page.

Architecture

Unlike most scraping libraries, Scrapely doesn't work with DOM trees or xpaths so it doesn't depend on libraries such as lxml or libxml2. Instead, it uses an internal pure-python parser, which can accept poorly formed HTML. The HTML is converted into an array of token ids, which is used for matching the items to be extracted.

Scrapely extraction is based upon the Instance Based Learning algorithm [1] and the matched items are combined into complex objects (it supports nested and repeated objects), using a tree of parsers, inspired by A Hierarchical Approach to Wrapper Induction [2].

[1] Yanhong Zhai , Bing Liu, Extracting Web Data Using Instance-Based Learning, World Wide Web, v.10 n.2, p.113-132, June 2007
[2] Ion Muslea , Steve Minton , Craig Knoblock, A hierarchical approach to wrapper induction, Proceedings of the third annual conference on Autonomous Agents, p.190-197, April 1999, Seattle, Washington, United States

Known Issues

The training implementation is currently very simple and is only provided for references purposes, to make it easier to test Scrapely and play with it. On the other hand, the extraction code is reliable and production-ready. So, if you want to use Scrapely in production, you should use train() with caution and make sure it annotates the area of the page you intended.

Alternatively, you can use the Scrapely command line tool to annotate pages, which provides more manual control for higher accuracy.

How does Scrapely relate to Scrapy?

Despite the similarity in their names, Scrapely and Scrapy are quite different things. The only similarity they share is that they both depend on w3lib, and they are both maintained by the same group of developers (which is why both are hosted on the same Github account).

Scrapy is an application framework for building web crawlers, while Scrapely is a library for extracting structured data from HTML pages. If anything, Scrapely is more similar to BeautifulSoup or lxml than Scrapy.

Scrapely doesn't depend on Scrapy nor the other way around. In fact, it is quite common to use Scrapy without Scrapely, and viceversa.

If you are looking for a complete crawler-scraper solution, there is (at least) one project called Slybot that integrates both, but you can definitely use Scrapely on other web crawlers since it's just a library.

Scrapy has a builtin extraction mechanism called selectors which (unlike Scrapely) is based on XPaths.

License

Scrapely library is licensed under the BSD license.

Owner
Scrapy project
An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
Scrapy project
Audio media crawler for lbry.

Audio media crawler for lbry. Requirements Python 3.8 Poetry 1.1.7 Elasticsearch 7.14.0 Lbry-sdk 0.99.0 Development This project uses poetry as a depe

Hound.fm 4 Dec 03, 2022
Lovely Scrapper

Lovely Scrapper

Tushar Gadhe 2 Jan 01, 2022
Screenhook is a script that captures an image of a web page and send it to a discord webhook.

screenshot from the web for discord webhooks screenhook is a script that captures an image of a web page and send it to a discord webhook.

Toast Energy 3 Jun 04, 2022
Pseudo API for Google Trends

pytrends Introduction Unofficial API for Google Trends Allows simple interface for automating downloading of reports from Google Trends. Only good unt

General Mills 2.6k Dec 28, 2022
Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers

Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers.

Louie Cai 13 Oct 15, 2022
Python scrapper scrapping torrent website and download new movies Automatically.

torrent-scrapper Python scrapper scrapping torrent website and download new movies Automatically. If you like it Put a ⭐ on this repo 😇 Run this git

Fazil vk 1 Jan 08, 2022
Web scraper for Zillow

Zillow-Scraper Instructions All terminal commands are highlighted. Make sure you first have python 3 installed. You can check this by running "python

Ali Rastegar 1 Nov 23, 2021
Automatically download and crop key information from the arxiv daily paper.

Arxiv daily 速览 功能:按关键词筛选arxiv每日最新paper,自动获取摘要,自动截取文中表格和图片。 1 测试环境 Ubuntu 16+ Python3.7 torch 1.9 Colab GPU 2 使用演示 首先下载权重baiduyun 提取码:il87,放置于code/Pars

HeoLis 20 Jul 30, 2022
New World Market Scraper

Bean Seller A New Worlds market scraper. Deployment This must be installed on Windows as it uses the Windows api to do its stuff Install Prerequisites

4 Sep 21, 2022
A simple, configurable and expandable combined shop scraper to minimize the costs of ordering several items

combined-shop-scraper A simple, configurable and expandable combined shop scraper to minimize the costs of ordering several items. Features Define an

2 Dec 13, 2021
A web scraper that exports your entire WhatsApp chat history.

WhatSoup 🍲 A web scraper that exports your entire WhatsApp chat history. Table of Contents Overview Demo Prerequisites Instructions Frequen

Eddy Harrington 87 Jan 06, 2023
Scrape and display grades onto the console

WebScrapeGrades About The Project This Project is a personal project where I learned how to webscrape using python requests. Being able to get request

Cyrus Baybay 1 Oct 23, 2021
用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。

crawler_for_university 用python爬取江苏几大高校的就业网站,并提供3种方式通知给用户,分别是通过微信发送、命令行直接输出、windows气泡通知。 环境依赖 wxpy,requests,bs4等库 功能描述 该项目基于python,通过爬虫爬各高校的就业信息网,爬取招聘信

8 Aug 16, 2021
Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

Amit 6 Aug 26, 2022
Minimal set of tools to conduct stealthy scraping.

Stealthy Scraping Tools Do not use puppeteer and playwright for scraping. Explanation. We only use the CDP to obtain the page source and to get the ab

Nikolai Tschacher 88 Jan 04, 2023
simple http & https proxy scraper and checker

simple http & https proxy scraper and checker

Neospace 11 Nov 15, 2021
Example of scraping a paginated API endpoint and dumping the data into a DB

Provider API Scraper Example Example of scraping a paginated API endpoint and dumping the data into a DB. Pre-requisits Python = 3.9 Pipenv Setup # i

Alex Skobelev 1 Oct 20, 2021
抢京东茅台脚本,定时自动触发,自动预约,自动停止

jd_maotai 抢京东茅台脚本,定时自动触发,自动预约,自动停止 小白信用 99.6,暂时还没抢到过,朋友 80 多抢到了一瓶,所以我感觉是跟信用分没啥关系,完全是看运气的。

Aruelius.L 117 Dec 22, 2022
A powerful annex BUBT, BUBT Soft, and BUBT website scraping script.

Annex Bubt Scraping Script I think this is the first public repository that provides free annex-BUBT, BUBT-Soft, and BUBT website scraping API script

Md Imam Hossain 4 Dec 03, 2022
一个m3u8视频流下载脚本

一个Python的m3u8流视频下载脚本 介绍 m3u8流视频日益常见,目前好用的下载器也有很多,我把之前自己写的一个小脚本分享出来,供广大网友使用。写此程序的目的在于给视频下载爱好者提供一个下载样例,可直接调用,勿再重复造轮子。 使用方法 在python中直接运行程序或进行外部调用 import

Nchu 0 Oct 10, 2021