🥫 The simple, fast, and modern web scraping library

Overview

gazpacho


About

gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies.

Install

Install with pip at the command line:

pip install -U gazpacho

Quickstart

Give this a try:

from gazpacho import get, Soup

url = 'https://scrape.world/books'
html = get(url)
soup = Soup(html)
books = soup.find('div', {'class': 'book-'}, partial=True)

def parse(book):
    name = book.find('h4').text
    price = float(book.find('p').text[1:].split(' ')[0])
    return name, price

[parse(book) for book in books]
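
The comprehension yields (name, price) tuples, so ordinary list operations apply from there. For example, to sort the parsed books by price (a sketch; actual titles and prices depend on the live page):

books_parsed = [parse(book) for book in books]
cheapest_first = sorted(books_parsed, key=lambda pair: pair[1])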

Tutorial

Import

Import gazpacho following the convention:

from gazpacho import get, Soup

get

Use the get function to download raw HTML:

url = 'https://scrape.world/soup'
html = get(url)
print(html[:50])
# '<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <met'

Adjust get requests with optional params and headers:

get(
    url='https://httpbin.org/anything',
    params={'foo': 'bar', 'bar': 'baz'},
    headers={'User-Agent': 'gazpacho'}
)
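
Because https://httpbin.org/anything responds with JSON, get returns a parsed dictionary rather than an HTML string (the json-to-dictionary behaviour added in 0.9.4). A quick check (assuming httpbin echoes the query string back under 'args'):

response = get('https://httpbin.org/anything', params={'foo': 'bar'})
print(response['args'])
# {'foo': 'bar'}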

Soup

Use the Soup wrapper on raw HTML to enable parsing:

soup = Soup(html)

Soup objects can alternatively be initialized with the .get classmethod:

soup = Soup.get(url)

.find

Use the .find method to target and extract HTML tags:

h1 = soup.find('h1')
print(h1)
# <h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>

attrs=

Use the attrs argument to isolate tags that contain specific HTML element attributes:

soup.find('div', attrs={'class': 'section-'})

partial=

Element attributes are partially matched by default. Turn this off by setting partial to False:

soup.find('div', {'class': 'soup'}, partial=False)
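
With partial matching, a query value only needs to appear within the element's attribute value; with partial=False the values must match exactly. An illustrative sketch (the HTML is made up for the example):

html = '<div class="section-title">Intro</div>'
soup = Soup(html)
soup.find('div', {'class': 'sect'})                  # matches "section-title" (partial)
soup.find('div', {'class': 'sect'}, partial=False)   # no exact match, so nothing is returned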

mode=

Override the mode argument {'auto', 'first', 'all'} to guarantee return behaviour:

print(soup.find('span', mode='first'))
# <span class="navbar-toggler-icon"></span>
len(soup.find('span', mode='all'))
# 8
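
With the default mode='auto', find returns a single Soup object when exactly one tag matches and a list when several match (a behaviour several of the issues below discuss). A small sketch with made-up HTML:

soup = Soup('<p>one</p>')
soup.find('p')                 # one match: a single Soup object
soup = Soup('<p>one</p><p>two</p>')
soup.find('p')                 # several matches: a list of Soup objects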

dir()

Soup objects have html, tag, attrs, and text attributes:

dir(h1)
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Use them accordingly:

print(h1.html)
# '<h1 id="firstHeading" class="firstHeading" lang="en">Soup</h1>'
print(h1.tag)
# h1
print(h1.attrs)
# {'id': 'firstHeading', 'class': 'firstHeading', 'lang': 'en'}
print(h1.text)
# Soup
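
Since attrs is an ordinary dictionary, individual attribute values can be pulled out directly:

print(h1.attrs['id'])
# firstHeading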

Support

If you use gazpacho, consider adding the scraper: gazpacho badge to your project README.md:

[![scraper: gazpacho](https://img.shields.io/badge/scraper-gazpacho-C6422C)](https://github.com/maxhumber/gazpacho)

Contribute

For feature requests or bug reports, please use GitHub Issues

For PRs, please read the CONTRIBUTING.md document

Comments
  • .text is empty on Soup creation

    Describe the bug

    When I create a soup object...

    To Reproduce

    Calling .text returns an empty string:

    from gazpacho import Soup
    
    html = """<p>&pound;682m</p>"""
    
    soup = Soup(html)
    print(soup.text)
    ''
    

    Expected behavior

    Should output:

    print(soup.text)
    '£682m'
    

    Environment:

    • OS: macOS
    • Version: 1.1

    Additional context

    Inspired by this S/O question
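
    A possible interim workaround (an untested sketch, not part of gazpacho): find the inner tag first and unescape the entity with the standard library:

    from html import unescape
    from gazpacho import Soup

    html = "<p>&pound;682m</p>"
    p = Soup(html).find('p')       # .text is populated on found tags
    print(unescape(p.text))
    # £682m  (assuming .text holds the raw inner text '&pound;682m')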

    bug hacktoberfest 
    opened by maxhumber 15
  • API suggestion: soup.all("div") and soup.first("div")

    The default auto behavior of .find() doesn't work for me, because it means I can't trust my code not to start throwing errors if the page I am scraping adds another matching element, or drops the number of elements down to one (triggering a change in return type).

    I know I can do this:

    div = soup.find("div", mode="first")
    # Or this:
    divs = soup.find("div", mode="all")
    

    But having function parameters that change the return type is still a bit weird - not great for code hinting and suchlike.

    Changing how .find() works would be a backwards incompatible change, which isn't good now that you're past the 1.0 release. I suggest adding two new methods instead:

    div = soup.first("div") # Returns a single element
    # Or:
    divs = soup.all("div") # Returns a list of elements
    

    This would be consistent with your existing API design (promoting the mode arguments to first class method names) and could be implemented without breaking existing code.
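
    A minimal sketch of the proposed wrappers, assuming they simply delegate to the existing find with a fixed mode (hypothetical helpers, not part of gazpacho):

    def first(soup, tag, attrs=None, partial=True):
        # hypothetical helper: takes any Soup object; always returns a single element (or None)
        return soup.find(tag, attrs, partial=partial, mode='first')

    def all_elements(soup, tag, attrs=None, partial=True):
        # hypothetical helper: always returns a list of elements
        return soup.find(tag, attrs, partial=partial, mode='all')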

    opened by simonw 6
  • A select function similar to soups.

    Is your feature request related to a problem? Please describe. It's great to be able to run find and then find within the initial result, but it seems more readable to be able to find based on CSS selectors.

    Describe the solution you'd like

    selector = '.foo img.bar'
    soup.select(selector) # this would return any img item with the class "bar" inside of an object with the class "foo"
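
    Until something like that exists, a selector such as '.foo img.bar' can be approximated by chaining find calls, as noted above (a sketch; the HTML is made up, and '.foo' is narrowed to div.foo for the example):

    from gazpacho import Soup

    soup = Soup('<div class="foo"><img class="bar" src="a.png"></div>')
    matches = []
    for el in soup.find('div', {'class': 'foo'}, mode='all'):
        matches.extend(el.find('img', {'class': 'bar'}, mode='all'))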
    
    opened by kjaymiller 5
  • separate find into find and find_one

    Is your feature request related to a problem? Please describe. Right now it's hard to reason about the behaviour of the find method. If it finds one element it will return a Soup object, if it finds more than one it will return a list of Soup objects.

    Describe the solution you'd like Separate find into a find method and find_one method.

    Describe alternatives you've considered Keep it and YOLO?

    Additional context Conversation with Michael Kennedy:

    If I were designing the api, i'd have that always return a List[Node] (or whatever the class is). Then add two methods:

    • find() -> List[Node]
    • find_one() -> Optional[Node]
    • one() -> Node (exception if there are zero, or two or more, nodes); a rough sketch follows below
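
    A rough sketch of the strictest variant, built on the current mode='all' behaviour (a hypothetical helper, not part of gazpacho):

    def one(soup, tag, attrs=None):
        # hypothetical helper: takes any Soup object; exactly one matching node, or an exception
        nodes = soup.find(tag, attrs, mode='all') or []
        if len(nodes) != 1:
            raise ValueError(f"expected exactly 1 <{tag}>, found {len(nodes)}")
        return nodes[0]
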
    enhancement hacktoberfest 
    opened by maxhumber 5
  • Format/Pretty Print can't handle void tags

    Describe the bug

    Soup can handle and format matched tags no problem:

    from gazpacho import Soup
    html = """<ul><li>Item 1</li><li>Item 2</li></ul>"""
    Soup(html)
    

    Which correctly formats to:

    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
    

    But it can't handle void tags (like img)...

    To Reproduce

    For example, this bit of html:

    html = """<ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">"""
    Soup(html)
    

    Will fail to format on print:

    <ul><li>Item 1</li><li>Item 2</li></ul><img src="image.png">
    

    Expected behavior

    Ideally Soup formats it as:

    <ul>
      <li>Item 1</li>
      <li>Item 2</li>
    </ul>
    <img src="image.png">
    

    Environment:

    • OS: macOS
    • Version: 1.1

    Additional context

    The problem has to do with the underlying parseString function being unable to handle void tags:

    from xml.dom.minidom import parseString as string_to_dom
    string_to_dom(html)
    

    Possible solution: turn void tags into self-closing tags on input, and then transform them back to void tags on print.
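
    A rough sketch of that idea: a small regex pre-pass that self-closes void tags before the markup reaches parseString (illustrative only, not gazpacho's actual code; the reverse transform on print is omitted, and the img is nested here because minidom also needs a single root element):

    import re
    from xml.dom.minidom import parseString as string_to_dom

    VOID_TAGS = ('img', 'br', 'hr', 'input', 'meta', 'link')  # partial list

    def close_void_tags(html):
        # rewrite e.g. <img src="a.png"> as <img src="a.png" /> so minidom accepts it
        pattern = r'<({})\b([^>]*?)\s*/?>'.format('|'.join(VOID_TAGS))
        return re.sub(pattern, r'<\1\2 />', html)

    html = '<ul><li>Item 1</li><li><img src="image.png"></li></ul>'
    dom = string_to_dom(close_void_tags(html))  # parses; fails without the pre-pass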

    help wanted hacktoberfest 
    opened by maxhumber 4
  • Add release versions to GitHub?

    $ git tag v0.7.2 && git push --tags 🎉 🎈

    I really like this project. I think that adding releases to the repository can help the project grow in popularity. I'd like to see that!

    opened by naltun 4
  • User Agent Rotation / Faking

    Is your feature request related to a problem? Please describe.

    It might be nice if gazpacho had the ability to rotate/fake a user agent

    Describe the solution you'd like

    Sort of like this but more primitive. (Importantly gazpacho does not want to take on any dependencies)

    Additional context

    Right now gazpacho just spoofs the latest Firefox User Agent
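
    Until something like that lands, the existing headers= hook already allows a dependency-free approximation (a sketch; the user-agent strings are illustrative):

    import random
    from gazpacho import get

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36',
    ]

    html = get('https://httpbin.org/anything', headers={'User-Agent': random.choice(USER_AGENTS)})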

    enhancement hacktoberfest 
    opened by maxhumber 3
  • Enable strict matching for find

    Describe the bug Right now match has the ability to be strict. This functionality is presently not enabled for find.

    To Reproduce Code to reproduce the behaviour:

    from gazpacho import Soup, match
    
    match({'foo': 'bar'}, {'foo': 'bar baz'})
    # True
    
    match({'foo': 'bar'}, {'foo': 'bar baz'}, strict=True)
    # False
    

    Expected behavior The find method should be forgiving (partial match) to protect ease of use, and maintain backwards compatibility, but there should be an argument to enable strict/exact matching that piggybacks on match

    Environment:

    • OS: macOS
    • Version: 0.7.2
    hacktoberfest 
    opened by maxhumber 3
  • Get all the child elements of a Soup object

    Is your feature request related to a problem? Please describe. I would like to try adding a .children() method to the Soup object that can list all of its child elements.

    Describe the solution you'd like I would make a regex pattern to match each inner element and return a list of Soup() objects with those elements. I might also try to make an option for recurse or not.

    Describe alternatives you've considered All that I can think of is doing the same thing mentioned above in the scraping code

    Additional context None

    opened by Vthechamp22 2
  • Improve issue and feature request templates

    Is your feature request related to a problem? Please describe. Improve the .github issue template

    Describe the solution you'd like I would like a better issue and feature request template in the .github folder. The format I would like is for the bolded headings to become proper sections, and for the help lines below them to become comments.

    Describe alternatives you've considered None

    Additional context What I would like is instead of:

    ---
    name: Bug report
    about: Create a report to help gazpacho improve
    title: ''
    labels: ''
    assignees: ''
    ---
    
    **Describe the bug**
    A clear and concise description of what the bug is.
    
    **To Reproduce**
    Code to reproduce the behaviour:
    
    ```python
    
    \```
    
    **Expected behavior**
    A clear and concise description of what you expected to happen.
    
    **Environment:**
     - OS: [macOS, Linux, Windows]
     - Version: [e.g. 0.8.1]
    
    **Additional context**
    Add any other context about the problem here.
    

    It should be something like:

    ---
    name: Bug report
    about: Create a report to help gazpacho improve
    title: ''
    labels: ''
    assignees: ''
    ---
    
    ## Describe the bug
    <!-- A clear and concise description of what the bug is. -->
    
    ## To Reproduce
    <!-- Code to reproduce the behaviour: -->
    
    ```python
    # code
    \```
    
    ## Expected behavior
    <!-- A clear and concise description of what you expected to happen. -->
    
    **Environment:**
     - OS: [macOS, Linux, Windows]
     - Version: [e.g. 0.8.1]
    
    ## Additional context
    <!-- Add any other context about the problem here. Delete this section if not applicable -->
    

    Or something like this

    opened by Vthechamp22 2
  • Needs a Render method (like Requests-Html) to allow pulling text rendered by Javascript...

    Need support for dynamic text rendering...

    Need a method that triggers the JavaScript on a page to fire (see https://github.com/psf/requests-html, r.html.render()).

    opened by jasonvogel 0
  • Can't parse some HTML entries

    Describe the bug

    Can't parse some entries: there are 40 entries on every page, but some are not being parsed correctly.

    Steps to reproduce the issue

    from gazpacho import get, Soup
    
    for i in range(1, 15):
        link = f'https://1337x.to/category-search/aladdin/Movies/{i}/'
        html = get(link)
        soup = Soup(html)
        body = soup.find("tbody")
    
        # extracting all the entries in the body,
        # there are 40 entries on every page; the last one can have fewer,
        entries = body.find("tr", mode='all')[::-1]
    
        # but for some pages it can't retrieve all the entries, for some reason
        print(f'{len(entries)} entries -> {link}')
    

    Expected behavior

    See 40 entries for every page

    Environment:

    Arch Linux - 5.13.10-arch1-1 Python - 3.9.6 Gazpacho - 1.1

    opened by NicKoehler 0
  • Finding tags returns the entire HTML

    Describe the bug

    Using soup.find on particular website(s) returns the entire HTML instead of the matching tag(s)

    Steps to reproduce the issue

    Look for ul tag with attribute class="cves" (<ul class="cves">) on https://mariadb.com/kb/en/security/

    from gazpacho import get, Soup
    endpoint = "https://mariadb.com/kb/en/security/"
    html_dump = Soup.get(endpoint)
    sample = html_dump.find('ul', attrs={'class': 'cves'}, mode='all')
    

    sample contains the contents of an entire html

    Expected behavior

    sample should contain the contents of the tag <ul class="cves">, which in this case would be rows of <li>-s listing the CVEs and the corresponding fixed versions in MariaDB, something like:

    <ul class="cves">
      <li>..</li>
      ...
      <li>..</li>
    </ul>
    

    Environment:

    • OS: Ubuntu Linux 18.04
    • Version: gazpacho 1.1, python 3.6.9

    Additional information

    Using BeautifulSoup on the same html_dump did get the job done, although the <li>-tags are weirdly nested together.

    from bs4 import BeautifulSoup
    # html_dump from above Soup.get(endpoint)
    bs_soup = BeautifulSoup(html_dump.html, 'html.parser')
    ul_cves = bs_soup.find_all('ul','cves')
    

    ul_cves contains strangely nested <li>-s, from which it was still possible to extract the rows of <li>-s I was looking for.

    <ul class="cves">
      <li>
        <li>
        ...
      </li></li>
    </ul>
    
    opened by jz-ang 0
  • Support non-UTF-8 encodings

    Thank you for your nice project!

    Please add an encoding argument so that pages that are not UTF-8 encoded can be decoded. https://github.com/maxhumber/gazpacho/blob/ecd53aff4e3d8bdf9eaaea4e0244a75cbabf6259/gazpacho/get.py#L51

    I tried an EUC-KR encoded page and got an error message.

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 95: invalid start byte
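
    A possible stopgap until such an argument exists (a sketch that bypasses gazpacho's get, fetches the page manually, and decodes it before handing it to Soup; the URL is hypothetical):

    from urllib.request import urlopen
    from gazpacho import Soup

    raw = urlopen('http://example.com/euc-kr-page').read()  # hypothetical EUC-KR page
    soup = Soup(raw.decode('euc-kr'))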
    
    opened by KwangYeol 0
  • attrs method output is changed when using find

    find changes the content of attrs

    When using the find method on a Soup object, the content of attrs is overwritten by the parameter attrs in find.

    Steps to reproduce the issue

    Try the following:

    from gazpacho import Soup
    
    div = Soup("<div id='my_id' />").find("div")
    print(div.attrs)
    div.find("span", {"id": "invalid_id"})
    print(div.attrs)
    

    The expected output would be the following, because we print the attributes of div twice:

    {'id': 'my_id'}
    {'id': 'my_id'}
    

    But instead you actually receive:

    {'id': 'my_id'}
    {'id': 'invalid_id'}
    

    which is wrong.

    Environment:

    • OS: Linux
    • Version: 1.1

    My current workaround is to save the attributes before I execute find.
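
    Following that workaround, a minimal sketch (copy the dictionary before the nested find mutates it):

    from gazpacho import Soup

    div = Soup("<div id='my_id' />").find("div")
    saved_attrs = dict(div.attrs)   # copy taken before the nested find
    div.find("span", {"id": "invalid_id"})
    print(saved_attrs)
    # {'id': 'my_id'}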

    opened by cfrahnow 1
  • Can't install whl files

    Describe the bug

    Hi,

    There was a pull request (https://github.com/maxhumber/gazpacho/pull/48) to add whl publishing but it appears to have been lost somewhere in a merge on October 31st, 2020. (https://github.com/maxhumber/gazpacho/compare/v1.1...master). Therefore, no wheels have been published for 1.1.

    This causes the installation error on my system that the PR was meant to address.

    Expected behavior

    Install gazpacho with a wheel, not a tar.gz. Please re-add the whl publishing.

    Environment:

    • OS: Windows 10
    opened by daddycocoaman 0
Releases(v1.1)
  • v1.1(Oct 9, 2020)

  • v1.0(Sep 24, 2020)

    1.0 (2020-09-24)

    • Feature: gazpacho is now fully baked with type hints (thanks for the suggestion @ju-sh!)
    • Feature: Soup.get("url") alternative initializer
    • Fixed: .find is now able to capture malformed void tags (<img />, vs. <img>) (thanks for the Issue @mallegrini!)
    • Renamed: .find(..., strict=) is now find(..., partial=)
    • Renamed: .remove_tags is now .strip
  • v0.9.4(Jul 7, 2020)

    0.9.4 (2020-07-07)

    • Feature: automagical json-to-dictionary return behaviour for get
    • Improvement: automatic missing URL protocol inference for get
    • Improvement: condensed HTTPError Exceptions
  • v0.9.3(Apr 29, 2020)

  • v0.9.2(Apr 21, 2020)

  • v0.9.1(Feb 16, 2020)

  • v0.9(Nov 25, 2019)

  • v0.8.1(Oct 11, 2019)

  • v0.8(Oct 7, 2019)

    Changelog

    • Added mode argument to the find method to adjust return behaviour (defaults to mode='auto')
    • Enabled strict attribute matching for the find method (defaults to strict=False)
Create a crawler and get some new products with maximum discount on the banimode website

crawler-banimode creates a crawler and gets some new products with maximum discount on the banimode website. This small project is for learning and working with the Selenium tool.

nourollah rezaei 2 Feb 17, 2022
VG-Scraper is a Python program using the BeautifulSoup module, which allows anyone to scrape content from a website. This program lets you enter a number through an input, where each number corresponds to one news article.

VG-Scraper VG-Scraper is a convenient program where you can find all the news articles instead of finding one yourself. Installing [Linux] Open a term

3 Feb 13, 2022
Google Scholar Web Scraping

Google Scholar Web Scraping This is a python script that asks for a user to input the url for a google scholar profile, and then it writes publication

Suzan M 1 Dec 12, 2021
jd_maotai rpa: a Selenium-driven RPA bot for JD Moutai flash purchases

jd_maotai rpa: a Selenium-driven RPA bot for JD Moutai flash purchases, built for scatterbrained people like us so we don't forget the purchase window; wishing everyone a bottle of Moutai at New Year. Special note: the jd_maotai_rpa project published in this repository is defined as an automation RPA project, meant to prevent forgetting to take part in JD's Moutai event (since I often forget myself), rather than for flash-sale sniping and

35 Nov 18, 2022
This is my CS 20 final assessment.

eeeeeSpider This is my CS 20 final assessment. How to use: Open program Run to your heart's content! There are no external dependencies that you will ha

1 Jan 17, 2022
A simplistic scraper made to download tons of random screenshots made by people.

printStealer 1.1 What is this tool? This tool is developed to show the insecurity of the screenshot utility called prnt sc. It is a site that stores s

appelsiensam 4 Jul 26, 2022
Dictionary - Application focused on word search through web scraping

Dictionary - Application focused on word search through web scraping, in addition to other functions such as dictation, spell and conjugation of syllables.

Juan Manuel 2 May 09, 2022
Use Flask API to wrap Facebook data. Grab the wrapper of Facebook public pages without an API key.

Facebook Scraper Use Flask API to wrap Facebook data. Grab the wrapper of Facebook public pages without an API key. (Currently working 2021) Setup Befo

Encore Shao 2 Dec 27, 2021
Genshin Impact crawler: captures artifact information from the game interface

Genshin Impact artifact semi-automatic crawler. Description: captures artifact data directly from the Genshin Impact interface; currently only the inventory page is supported. Accuracy: 97.5% (standard generic API, 40 random artifacts recognized, 39 counted fully correct). Accuracy: 100% (4K screen, standard generic API, 110 artifacts recognized, 110 counted fully correct). Small errors cannot be ruled out

hwa 28 Oct 10, 2022
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

Scrapy project 859 Dec 29, 2022
A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com

TDTV2-Direct Version 1.00.1 • A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com :) How it Works?? install all dependencies v

Danushka-Madushan 1 Nov 28, 2021
Danbooru scraper with python

Danbooru Version: 0.0.1 License under: MIT License Dependencies Python: = 3.9.7 beautifulsoup4 cloudscraper Example of use Danbooru from danbooru imp

Sugarbell 2 Oct 27, 2022
A collection of crawler examples, including but not limited to Taobao, JD, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, PDD, Youku, iQiyi, Ctrip, 12306, 58, Sohu, Baidu Index, Weipu/Wanfang, Zlibraty, Oalib, novel sites, tender sites, procurement sites, Xiaohongshu

lxSpider A collection of crawler examples, including but not limited to Taobao, JD, Tmall, Douban, Douyin, Kuaishou, Weibo, WeChat, Alibaba, Toutiao, PDD, Youku, iQiyi, Ctrip, 12306, 58, Sohu, Baidu Index, Weipu/Wanfang, Zlibraty, Oalib, novel sites, tender and procurement sites. Brief: time flies; I can't remember how many examples I have written.

lx 793 Jan 05, 2023
Bigdata - This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster

Scrapy Cluster This Scrapy project uses Redis and Kafka to create a distributed

Hanh Pham Van 0 Jan 06, 2022
Goblyn is a Python tool focused on the enumeration and capture of website file metadata.

Goblyn Metadata Enumeration What's Goblyn? Goblyn is a tool focused on the enumeration and capture of website file metadata. How it works? Goblyn will se

Gustavo 46 Nov 22, 2022
Web-based automatic check-in with python+selenium + daily email + iciba daily sentence + 'toxic chicken soup' quotes (running stably since February)

Web-based automatic check-in with python+selenium. Description: this check-in script is for Zhengzhou University's health check-in; other web-based check-ins can also learn from it. (For my own use; running stably since February.) For learning and exchange only, please do not rely on it. The developer takes no responsibility for any problems caused by using this script, makes no guarantee about its effect, and in principle provides no technical support of any kind. To prevent

Sunday 1 Aug 27, 2022
An IpVanish Proxies Scraper

EzProxies Tired of searching for good proxies for hours? Just get an IpVanish account and get thousands of good proxies in few seconds! Showcase Watch

11 Nov 13, 2022
Google Developer Profile Badge Scraper

Google Developer Profile Badge Scraper GDev Profile Badge Scraper is a Google Developer Profile Web Scraper which scrapes for specific badges in a use

Siddhant Lad 7 Jan 10, 2022
This is a simple website crawler which asks for a website link from the user to crawl and find specific data from the given website address.

This is a simple website crawler which asks for a website link from the user to crawl and find specific data from the given website address.

Faisal Ahmed 1 Jan 10, 2022
Signature and CAPTCHA cracking for crawlers

cracking4crawling Signature and CAPTCHA cracking related to crawlers; current scripts: Xiaohongshu App API signature (shield) (2020.12.02), Xiaohongshu slider (Shumei) CAPTCHA cracking (2020.12.02), Hainan Airlines App API signature (hnairSign) (2020.12.05). Note: scripts are named by target website/App

XNFA 90 Feb 09, 2021