PyQuery-based scraping micro-framework.

Last update: Jul 20, 2022

Overview

demiurge

PyQuery-based scraping micro-framework. Supports Python 2.x and 3.x.

Documentation: http://demiurge.readthedocs.org

Installing demiurge

$ pip install demiurge

Quick start

Define items to be scraped using a declarative (Django-inspired) syntax:

import demiurge

class TorrentDetails(demiurge.Item):
    label = demiurge.TextField(selector='strong')
    value = demiurge.TextField()

    def clean_value(self, value):
        unlabel = value[value.find(':') + 1:]
        return unlabel.strip()

    class Meta:
        selector = 'div#specifications p'

class Torrent(demiurge.Item):
    url = demiurge.AttributeValueField(
        selector='td:eq(2) a:eq(1)', attr='href')
    name = demiurge.TextField(selector='td:eq(2) a:eq(2)')
    size = demiurge.TextField(selector='td:eq(3)')
    details = demiurge.RelatedItem(
        TorrentDetails, selector='td:eq(2) a:eq(2)', attr='href')

    class Meta:
        selector = 'table.maintable:gt(0) tr:gt(0)'
        base_url = 'http://www.mininova.org'


>>> t = Torrent.one('/search/ubuntu/seeds')
>>> t.name
'Ubuntu 7.10 Desktop Live CD'
>>> t.size
u'695.81\xa0MB'
>>> t.url
'/get/1053846'
>>> t.html
u'<td>19\xa0Dec\xa007</td><td><a href="/cat/7">Software</a></td><td>...'

>>> results = Torrent.all('/search/ubuntu/seeds')
>>> len(results)
116
>>> for t in results[:3]:
...     print t.name, t.size
...
Ubuntu 7.10 Desktop Live CD 695.81 MB
Super Ubuntu 2008.09 - VMware image 871.95 MB
Portable Ubuntu 9.10 for Windows 559.78 MB
...

>>> t = Torrent.one('/search/ubuntu/seeds')
>>> for detail in t.details:
...     print detail.label, detail.value
... 
Category: Software > GNU/Linux
Total size: 695.81 megabyte
Added: 2467 days ago by Distribution
Share ratio: 17 seeds, 2 leechers
Last updated: 35 minutes ago
Downloads: 29,085
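
The `clean_value` hook in `TorrentDetails` is what turns each paragraph into the bare values printed above. A minimal standalone sketch of that same string logic follows, with no demiurge import needed. The raw input string is an assumption inferred from the printed output, not text captured from the site.

```python
def clean_value(value):
    """Drop everything up to and including the first colon, then trim whitespace."""
    unlabel = value[value.find(':') + 1:]
    return unlabel.strip()


# Assumed raw paragraph text (label and value together), for illustration only.
raw = "Total size: 695.81 megabyte"
print(clean_value(raw))  # prints: 695.81 megabyte

# With no colon present, str.find returns -1, so the slice starts at 0
# and the whole (stripped) string comes back unchanged.
print(clean_value("  no label here  "))  # prints: no label here
```

Because `str.find` returns -1 when the colon is missing, the method degrades gracefully on paragraphs that carry no label.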

See documentation for details: http://demiurge.readthedocs.org

Why demiurge?

Plato, as the speaker Timaeus, refers to the Demiurge frequently in the Socratic dialogue Timaeus, c. 360 BC. The main character refers to the Demiurge as the entity who "fashioned and shaped" the material world. Timaeus describes the Demiurge as unreservedly benevolent, and hence desirous of a world as good as possible. The world remains imperfect, however, because the Demiurge created the world out of a chaotic, indeterminate non-being.

http://en.wikipedia.org/wiki/Demiurge

Contributors

Martín Gaitán (@mgaitan)

Comments

Reusable cleaning functions
You can now add a "clean" kwarg containing a function to a field.

This makes it easy to use quick filtering (I want this data to be an int) and to re-use functions such as parsedatetime.

score = demiurge.TextField(selector=".score .upvoted", clean=int)
opened by traverseda 5
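
As a rough illustration of the reusable cleaning functions discussed in this issue, the sketch below defines a small helper that tolerates thousands separators and non-breaking spaces, which plain `int` would reject. The name `to_int` is hypothetical. Passing it as `clean=to_int` assumes the `clean` kwarg calls the function with the extracted text, as the description above suggests.

```python
def to_int(text):
    """Parse an integer from scraped text such as '29,085' or u'\xa042 '."""
    # Remove thousands separators and non-breaking spaces before converting.
    cleaned = text.replace(',', '').replace(u'\xa0', '').strip()
    return int(cleaned)


print(to_int("29,085"))  # prints: 29085

# Hypothetical usage, assuming the `clean` kwarg described in this issue:
# downloads = demiurge.TextField(selector=".downloads", clean=to_int)
```

Keeping such helpers as plain functions means they can be unit-tested without fetching any page.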
proof of concept: subitem field

short rationale: Sometimes I need to scrape a page to retrieve the actual links where the items are. I would like a way to nest Item classes, analogous (in some way) to a ForeignKey / ManyToManyField in Django.

This is a first PR as a proof of concept, to discuss the idea and its API.

opened by mgaitan 5
RelatedItems only work across urls

An obvious use of RelatedItems (or a similar construct) is recursively mapping a comment tree. Right now there's no elegant way to do that.

An example

http://pastebin.com/WDL4RjkE

Reading through the actual code, I think I might be wrong about this. I'll try and make the docs clearer.

opened by traverseda 2
Use lib "requests" for downloading

I'm right now making use of https://pypi.python.org/pypi/requests-cache which creates a cache of the downloaded stuff magically, and it's awesome. So, I would like to be able to take advantage of it using demiurge.

I don't know whether it should be just an option or a replacement for the pyquery downloader.

What do you think?

opened by jmansilla 2
docs: fix simple typo, ocurrence -> occurrence

There is a small typo in docs/index.rst.

Should read occurrence rather than ocurrence.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

opened by timgates42 1
Fix when no selector defined
The default selector is the whole page ('html'), but it is applied through PyQuery.find, which only searches descendants. Example:

In [2]: PyQuery('<html>hello</html>').find('html')
Out[2]: []

In [3]: PyQuery('<html>hello</html>')('html')
Out[3]: [<html>]

opened by mgaitan 1
support self reference in RelatedItem

RelatedItem('self'). Also, the RelatedItem's item class could be given by its name (i.e. RelatedItem("ItemClass")). A typical use case is a listing page with a "next page" link.

opened by mgaitan 0

Releases (v0.2)

v0.2 (Sep 20, 2014)
Added docs.

Added RelatedItem.

Added field clean support.


Owner

Matias Bordese

GitHub Repository http://demiurge.readthedocs.org

