mlscraper: Scrape data from HTML pages automatically with Machine Learning

Overview

Badges: Travis CI build status · PyPI version · supported Python versions

mlscraper allows you to extract structured data from HTML automatically with Machine Learning. You train it by providing a few examples of your desired output; it then figures out the extraction rules on its own, and you can extract data from any new page you provide.

[Image: how it works (.github/how-it-works-wide.png)]

Background Story

Many services for crawling and scraping automation allow you to select data in a browser and get JSON results in return. No need to specify CSS selectors or anything else.

I've been wondering for a long time why there's no open-source solution that does something like this, so here's my attempt at creating a Python library to enable automatic scraping.

All you have to do is define some examples of scraped data. mlscraper will figure out everything else and return clean data.

Currently, this is a proof of concept with a simplistic solution.

How it works

After you've defined the data you want to scrape, mlscraper will:

  • find your samples inside the HTML DOM
  • determine which rules/methods to apply for extraction
  • extract the data for you and return it in a dictionary

import requests

from mlscraper import RuleBasedSingleItemScraper
from mlscraper.training import SingleItemPageSample

# the items found on the training page
targets = {
    "https://test.com/article/1": {"title": "One great result!", "description": "Some description"},
    "https://test.com/article/2": {"title": "Another great result!", "description": "Another description"},
    "https://test.com/article/3": {"title": "Result to be found", "description": "Description to crawl"},
}

# fetch html and create samples
samples = [SingleItemPageSample(requests.get(url).content, targets[url]) for url in targets]

# training the scraper with the items
scraper = RuleBasedSingleItemScraper.build(samples)

# apply the learned rules and extract new item automatically
result = scraper.scrape(requests.get('https://test.com/article/4').content)

print(result)
# results in something like:
# {'title': 'Article four', 'description': 'Scraped automatically'}

You can find working scrapers, such as a Stack Overflow scraper and a quotes scraper, in the examples folder.

Getting started

Install the library via pip install mlscraper; the 1.0 release candidates with the new API are available via pip install --pre mlscraper. You can then import mlscraper and use it as shown in the examples.

Development

See CONTRIBUTING.rst

Related work

If you're interested in the underlying research, I can highly recommend these publications:

I originally called this project autoscraper, but while I was working on it, someone else released a library with exactly that name. Check it out here: autoscraper.

Comments
  • missing mlscraper.html

    Followed the readme and was testing the code after pip install --pre mlscraper.

    But got a module-not-found error:

    from mlscraper.html import Page
    ModuleNotFoundError: No module named 'mlscraper.html'

    Checking the installed library, only the following files were present: ml.py, parser.py, training.py, util.py

    For people checking out the library, it would be convenient if the readme listed all required imports:

    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet
    from mlscraper.training import train_scraper

    opened by appsec-airito 6
  • Find and fix issue with GitHub profile pages

    • follower counts have no unique selector (need nth or something else)
    • image width and height get matched when searching for 20 followers (as icons have manually set dimensions)
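
    A hedged sketch of the first point: with bs4's soupsieve-backed selectors, a positional pseudo-class can disambiguate nodes that share all of their classes. The markup below is made up for illustration, not GitHub's actual HTML:

    from bs4 import BeautifulSoup

    # hypothetical profile snippet: both counts share the same classes
    html = '''
    <div class="user-stats">
      <a class="link"><span class="text-bold">20</span> followers</a>
      <a class="link"><span class="text-bold">5</span> following</a>
    </div>
    '''
    soup = BeautifulSoup(html, "html.parser")

    # a class-based selector alone matches both numbers ...
    print([n.get_text() for n in soup.select("a .text-bold")])  # ['20', '5']
    # ... so a positional pseudo-class is needed to single out the follower count
    print(soup.select_one("a:nth-child(1) .text-bold").get_text())  # '20'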
    opened by lorey 4
  • Stackoverflow example not working

    This is the code

    import logging
    
    import requests
    
    from mlscraper import SingleItemPageSample, RuleBasedSingleItemScraper
    
    
    items = {
        "https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array": {
            "title": "Why is processing a sorted array faster than processing an unsorted array?"
        },
        "https://stackoverflow.com/questions/927358/how-do-i-undo-the-most-recent-local-commits-in-git": {
            "title": "How do I undo the most recent local commits in Git?"
        },
        "https://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do": {
            "title": "What does the “yield” keyword do?"
        },
    }
    
    results = {url: requests.get(url) for url in items.keys()}
    
    # train scraper
    samples = [
        SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
    ]
    scraper = RuleBasedSingleItemScraper.build(samples)
    
    print("Scraping new question")
    html = requests.get(
        "https://stackoverflow.com/questions/2003505/how-do-i-delete-a-git-branch-locally-and-remotely"
    ).content
    result = scraper.scrape(html)
    
    print("Result: %s" % result)
    
    

    Output

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-11-9f646dab1fca> in <module>()
         24     SingleItemPageSample(results[url].content, items[url]) for url in items.keys()
         25 ]
    ---> 26 scraper = RuleBasedSingleItemScraper.build(samples)
         27 
         28 print("Scraping new question")
    
    4 frames
    /usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in build(samples)
         89                     matches_per_page_right = [
         90                         len(m) == 1 and m[0].get_text() == s.item[attr]
    ---> 91                         for m, s in zip(matches_per_page, samples)
         92                     ]
         93                     score = sum(matches_per_page_right) / len(samples)
    
    /usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <listcomp>(.0)
         88                     matches_per_page = (s.page.select(selector) for s in samples)
         89                     matches_per_page_right = [
    ---> 90                         len(m) == 1 and m[0].get_text() == s.item[attr]
         91                         for m, s in zip(matches_per_page, samples)
         92                     ]
    
    /usr/local/lib/python3.7/dist-packages/mlscraper/__init__.py in <genexpr>(.0)
         86                 if selector not in selector_scoring:
         87                     logging.info("testing %s (%d/%d)", selector, i, len(selectors))
    ---> 88                     matches_per_page = (s.page.select(selector) for s in samples)
         89                     matches_per_page_right = [
         90                         len(m) == 1 and m[0].get_text() == s.item[attr]
    
    /usr/local/lib/python3.7/dist-packages/mlscraper/parser.py in select(self, css_selector)
         28     def select(self, css_selector):
         29         try:
    ---> 30             return [SoupNode(res) for res in self._soup.select(css_selector)]
         31         except NotImplementedError:
         32             logging.warning(
    
    /usr/local/lib/python3.7/dist-packages/bs4/element.py in select(self, selector, _candidate_generator, limit)
       1495                 if tag_name == '':
       1496                     raise ValueError(
    -> 1497                         "A pseudo-class must be prefixed with a tag name.")
       1498                 pseudo_attributes = re.match(r'([a-zA-Z\d-]+)\(([a-zA-Z\d]+)\)', pseudo)
       1499                 found = []
    
    ValueError: A pseudo-class must be prefixed with a tag name.
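
    For context: the old CSS engine built into BeautifulSoup rejects selector candidates that start with a bare pseudo-class, and the select method in mlscraper/parser.py shown above only catches NotImplementedError, not the ValueError raised here. A minimal sketch of a workaround, assuming one patches that method to also swallow ValueError (illustrative, not an official fix):

    def select(self, css_selector):
        try:
            return [SoupNode(res) for res in self._soup.select(css_selector)]
        except (NotImplementedError, ValueError):
            # skip selector candidates the CSS engine cannot parse
            logging.warning("ignoring unsupported selector: %s", css_selector)
            return []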
    
    opened by rish-hyun 2
  • Example from docs does not work

    This example from the README unfortunately does not work. Perhaps I'm doing something wrong.

    Example:

    import requests
    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet
    from mlscraper.training import train_scraper
    
    # fetch the page to train
    einstein_url = 'http://quotes.toscrape.com/author/Albert-Einstein/'
    resp = requests.get(einstein_url)
    assert resp.status_code == 200
    
    # create a sample for Albert Einstein
    training_set = TrainingSet()
    page = Page(resp.content)
    sample = Sample(page, {'name': 'Albert Einstein', 'born': 'March 14, 1879'})
    training_set.add_sample(sample)
    
    # train the scraper with the created training set
    scraper = train_scraper(training_set)
    
    # scrape another page
    resp = requests.get('http://quotes.toscrape.com/author/J-K-Rowling')
    result = scraper.get(Page(resp.content))
    print(result)
    

    Error:

    File ~/miniconda3/envs/colbert/lib/python3.8/site-packages/sklearn/externals/joblib/externals/cloudpickle/cloudpickle.py:133, in _make_cell_set_template_code()
        116     return types.CodeType(
        117         co.co_argcount,
        118         co.co_nlocals,
       (...)
        130         (),
        131     )
        132 else:
    --> 133     return types.CodeType(
        134         co.co_argcount,
        135         co.co_kwonlyargcount,
        136         co.co_nlocals,
        137         co.co_stacksize,
        138         co.co_flags,
        139         co.co_code,
        140         co.co_consts,
        141         co.co_names,
        142         co.co_varnames,
        143         co.co_filename,
        144         co.co_name,
        145         co.co_firstlineno,
        146         co.co_lnotab,
        147         co.co_cellvars,  # this is the trickery
        148         (),
        149     )
    
    TypeError: an integer is required (got type bytes)
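
    For context: this failure comes from the cloudpickle copy vendored inside sklearn.externals.joblib, which predates the types.CodeType signature change in Python 3.8, so the TypeError is an environment incompatibility rather than an mlscraper bug. A hedged guess at a fix is upgrading to versions of scikit-learn and joblib that support your Python interpreter:

    pip install --upgrade scikit-learn joblib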
    
    opened by creatorrr 1
  • Bump lxml from 4.5.1 to 4.6.5

    Bumps lxml from 4.5.1 to 4.6.5.

    Changelog

    Sourced from lxml's changelog.

    4.6.5 (2021-12-12)

    Bugs fixed

    • A vulnerability (GHSL-2021-1038) in the HTML cleaner allowed sneaking script content through SVG images.

    • A vulnerability (GHSL-2021-1037) in the HTML cleaner allowed sneaking script content through CSS imports and other crafted constructs.

    4.6.4 (2021-11-01)

    Features added

    • GH#317: A new property system_url was added to DTD entities. Patch by Thirdegree.

    • GH#314: The STATIC_* variables in setup.py can now be passed via env vars. Patch by Isaac Jurado.

    4.6.3 (2021-03-21)

    Bugs fixed

    • A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 formaction attribute.

    4.6.2 (2020-11-26)

    Bugs fixed

    • A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.1 (2020-10-18)

    ... (truncated)

    Commits
    • a9611ba Fix a test in Py2.
    • a3eacbc Prepare release of 4.6.5.
    • b7ea687 Update changelog.
    • 69a7473 Cleaner: cover some more cases where scripts could sneak through in specially...
    • 54d2985 Fix condition in test decorator.
    • 4b220b5 Use the non-depcrecated TextTestResult instead of _TextTestResult (GH-333)
    • d85c6de Exclude a test when using the macOS system libraries because it fails with li...
    • cd4bec9 Add macOS-M1 as wheel build platform.
    • fd0d471 Install automake and libtool in macOS build to be able to install the latest ...
    • f233023 Cleaner: Remove SVG image data URLs since they can embed script content.
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 1
  • Bump pip from 19.2.3 to 21.1

    Bumps pip from 19.2.3 to 21.1.

    Changelog

    Sourced from pip's changelog.

    21.1 (2021-04-24)

    Process

    • Start installation scheme migration from distutils to sysconfig. A warning is implemented to detect differences between the two implementations to encourage user reports, so we can avoid breakages before they happen.

    Features

    • Add the ability for the new resolver to process URL constraints. ([#8253](https://github.com/pypa/pip/issues/8253) <https://github.com/pypa/pip/issues/8253>_)
    • Add a feature --use-feature=in-tree-build to build local projects in-place when installing. This is expected to become the default behavior in pip 21.3; see Installing from local packages <https://pip.pypa.io/en/stable/user_guide/#installing-from-local-packages>_ for more information. ([#9091](https://github.com/pypa/pip/issues/9091) <https://github.com/pypa/pip/issues/9091>_)
    • Bring back the "(from versions: ...)" message, that was shown on resolution failures. ([#9139](https://github.com/pypa/pip/issues/9139) <https://github.com/pypa/pip/issues/9139>_)
    • Add support for editable installs for project with only setup.cfg files. ([#9547](https://github.com/pypa/pip/issues/9547) <https://github.com/pypa/pip/issues/9547>_)
    • Improve performance when picking the best file from indexes during pip install. ([#9748](https://github.com/pypa/pip/issues/9748) <https://github.com/pypa/pip/issues/9748>_)
    • Warn instead of erroring out when doing a PEP 517 build in presence of --build-option. Warn when doing a PEP 517 build in presence of --global-option. ([#9774](https://github.com/pypa/pip/issues/9774) <https://github.com/pypa/pip/issues/9774>_)

    Bug Fixes

    • Fixed --target to work with --editable installs. ([#4390](https://github.com/pypa/pip/issues/4390) <https://github.com/pypa/pip/issues/4390>_)
    • Add a warning, discouraging the usage of pip as root, outside a virtual environment. ([#6409](https://github.com/pypa/pip/issues/6409) <https://github.com/pypa/pip/issues/6409>_)
    • Ignore .dist-info directories if the stem is not a valid Python distribution name, so they don't show up in e.g. pip freeze. ([#7269](https://github.com/pypa/pip/issues/7269) <https://github.com/pypa/pip/issues/7269>_)
    • Only query the keyring for URLs that actually trigger error 401. This prevents an unnecessary keyring unlock prompt on every pip install invocation (even with default index URL which is not password protected). ([#8090](https://github.com/pypa/pip/issues/8090) <https://github.com/pypa/pip/issues/8090>_)
    • Prevent packages already-installed alongside with pip to be injected into an isolated build environment during build-time dependency population. ([#8214](https://github.com/pypa/pip/issues/8214) <https://github.com/pypa/pip/issues/8214>_)
    • Fix pip freeze permission denied error in order to display an understandable error message and offer solutions. ([#8418](https://github.com/pypa/pip/issues/8418) <https://github.com/pypa/pip/issues/8418>_)
    • Correctly uninstall script files (from setuptools' scripts argument), when installed with --user. ([#8733](https://github.com/pypa/pip/issues/8733) <https://github.com/pypa/pip/issues/8733>_)
    • New resolver: When a requirement is requested both via a direct URL (req @ URL) and via version specifier with extras (req[extra]), the resolver will now be able to use the URL to correctly resolve the requirement with extras. ([#8785](https://github.com/pypa/pip/issues/8785) <https://github.com/pypa/pip/issues/8785>_)
    • New resolver: Show relevant entries from user-supplied constraint files in the error message to improve debuggability. ([#9300](https://github.com/pypa/pip/issues/9300) <https://github.com/pypa/pip/issues/9300>_)
    • Avoid parsing version to make the version check more robust against lousily debundled downstream distributions. ([#9348](https://github.com/pypa/pip/issues/9348) <https://github.com/pypa/pip/issues/9348>_)
    • --user is no longer suggested incorrectly when pip fails with a permission error in a virtual environment. ([#9409](https://github.com/pypa/pip/issues/9409) <https://github.com/pypa/pip/issues/9409>_)
    • Fix incorrect reporting on Requires-Python conflicts. ([#9541](https://github.com/pypa/pip/issues/9541) <https://github.com/pypa/pip/issues/9541>_)

    ... (truncated)

    Commits
    • 2b2a268 Bump for release
    • ea761a6 Update AUTHORS.txt
    • 2edd3fd Postpone a deprecation to 21.2
    • 3cccfbf Rename mislabeled news fragment
    • 21cd124 Fix NEWS.rst placeholder position
    • e46bdda Merge pull request #9827 from pradyunsg/fix-git-improper-tag-handling
    • 0e4938d :newspaper:
    • ca832b2 Don't split git references on unicode separators
    • 1320bac Merge pull request #9814 from pradyunsg/revamp-ci-apr-2021-v2
    • e9cc23f Skip checks on PRs only
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Bump urllib3 from 1.25.9 to 1.26.5

    Bumps urllib3 from 1.25.9 to 1.26.5.

    Release notes

    Sourced from urllib3's releases.

    1.26.5

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed deprecation warnings emitted in Python 3.10.
    • Updated vendored six library to 1.16.0.
    • Improved performance of URL parser when splitting the authority component.

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.4

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.3

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed bytes and string comparison issue with headers (Pull #2141)

    • Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme (Pull #2107)

    If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors

    1.26.2

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

    1.26.1

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

    1.26.0

    :warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap

    • Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

    • Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning should opt-in explicitly by setting ssl_version=ssl.PROTOCOL_TLSv1_1 (Pull #2002) Starting in urllib3 v2.0: Connections that receive a DeprecationWarning will fail

    • Deprecated Retry options Retry.DEFAULT_METHOD_WHITELIST, Retry.DEFAULT_REDIRECT_HEADERS_BLACKLIST and Retry(method_whitelist=...) in favor of Retry.DEFAULT_ALLOWED_METHODS, Retry.DEFAULT_REMOVE_HEADERS_ON_REDIRECT, and Retry(allowed_methods=...) (Pull #2000) Starting in urllib3 v2.0: Deprecated options will be removed

    ... (truncated)

    Changelog

    Sourced from urllib3's changelog.

    1.26.5 (2021-05-26)

    • Fixed deprecation warnings emitted in Python 3.10.
    • Updated vendored six library to 1.16.0.
    • Improved performance of URL parser when splitting the authority component.

    1.26.4 (2021-03-15)

    • Changed behavior of the default SSLContext when connecting to HTTPS proxy during HTTPS requests. The default SSLContext now sets check_hostname=True.

    1.26.3 (2021-01-26)

    • Fixed bytes and string comparison issue with headers (Pull #2141)

    • Changed ProxySchemeUnknown error message to be more actionable if the user supplies a proxy URL without a scheme. (Pull #2107)

    1.26.2 (2020-11-12)

    • Fixed an issue where wrap_socket and CERT_REQUIRED wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)

    1.26.1 (2020-11-11)

    • Fixed an issue where two User-Agent headers would be sent if a User-Agent header key is passed as bytes (Pull #2047)

    1.26.0 (2020-11-10)

    • NOTE: urllib3 v2.0 will drop support for Python 2. Read more in the v2.0 Roadmap <https://urllib3.readthedocs.io/en/latest/v2-roadmap.html>_.

    • Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)

    • Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning

    ... (truncated)

    Commits
    • d161647 Release 1.26.5
    • 2d4a3fe Improve performance of sub-authority splitting in URL
    • 2698537 Update vendored six to 1.16.0
    • 07bed79 Fix deprecation warnings for Python 3.10 ssl module
    • d725a9b Add Python 3.10 to GitHub Actions
    • 339ad34 Use pytest==6.2.4 on Python 3.10+
    • f271c9c Apply latest Black formatting
    • 1884878 [1.26] Properly proxy EOF on the SSLTransport test suite
    • a891304 Release 1.26.4
    • 8d65ea1 Merge pull request from GHSA-5phf-pp7p-vc2r
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Bump py from 1.8.1 to 1.10.0

    Bumps py from 1.8.1 to 1.10.0.

    Changelog

    Sourced from py's changelog.

    1.10.0 (2020-12-12)

    • Fix a regular expression DoS vulnerability in the py.path.svnwc SVN blame functionality (CVE-2020-29651)
    • Update vendored apipkg: 1.4 => 1.5
    • Update vendored iniconfig: 1.0.0 => 1.1.1

    1.9.0 (2020-06-24)

    • Add type annotation stubs for the following modules:

      • py.error
      • py.iniconfig
      • py.path (not including SVN paths)
      • py.io
      • py.xml

      There are no plans to type other modules at this time.

      The type annotations are provided in external .pyi files, not inline in the code, and may therefore contain small errors or omissions. If you use py in conjunction with a type checker, and encounter any type errors you believe should be accepted, please report it in an issue.

    1.8.2 (2020-06-15)

    • On Windows, py.path.locals which differ only in case now have the same Python hash value. Previously, such paths were considered equal but had different hashes, which is not allowed and breaks the assumptions made by dicts, sets and other users of hashes.
    Commits
    • e5ff378 Update CHANGELOG for 1.10.0
    • 94cf44f Update vendored libs
    • 5e8ded5 testing: comment out an assert which fails on Python 3.9 for now
    • afdffcc Rename HOWTORELEASE.rst to RELEASING.rst
    • 2de53a6 Merge pull request #266 from nicoddemus/gh-actions
    • fa1b32e Merge pull request #264 from hugovk/patch-2
    • 887d6b8 Skip test_samefile_symlink on pypy3 on Windows
    • e94e670 Fix test_comments() in test_source
    • fef9a32 Adapt test
    • 4a694b0 Add GitHub Actions badge to README
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Bump lxml from 4.5.1 to 4.6.3

    Bumps lxml from 4.5.1 to 4.6.3.

    Changelog

    Sourced from lxml's changelog.

    4.6.3 (2021-03-21)

    Bugs fixed

    • A vulnerability (CVE-2021-28957) was discovered in the HTML Cleaner by Kevin Chung, which allowed JavaScript to pass through. The cleaner now removes the HTML5 formaction attribute.

    4.6.2 (2020-11-26)

    Bugs fixed

    • A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.1 (2020-10-18)

    Bugs fixed

    • A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.0 (2020-10-17)

    Features added

    • GH#310: lxml.html.InputGetter supports __len__() to count the number of input fields. Patch by Aidan Woolley.

    • lxml.html.InputGetter has a new .items() method to ease processing all input fields.

    • lxml.html.InputGetter.keys() now returns the field names in document order.

    • GH-309: The API documentation is now generated using sphinx-apidoc. Patch by Chris Mayo.

    Bugs fixed

    ... (truncated)

    Commits
    • a5f9cb5 Prepare release of lxml 4.6.3.
    • 2d01a1b Add HTML-5 "formaction" attribute to "defs.link_attrs" (GH-316)
    • e986a9c Fix reference in docs.
    • 4cb5736 Work around Py2's lack of "re.ASCII".
    • c30106f Prepare release of 4.6.2.
    • a105ab8 Prevent combinations of <math/svg> and <style> to sneak JavaScript through th...
    • c053dc1 Add a recipe for a look-ahead generator to allow modifications during tree it...
    • b083124 lxml actually works in Py3.9.
    • 0f80590 lxml actually works in Py3.9.
    • fd8893c Add a doc note that the .find() methods are usually faster than one might exp...
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Bump lxml from 4.5.1 to 4.6.2

    Bumps lxml from 4.5.1 to 4.6.2.

    Changelog

    Sourced from lxml's changelog.

    4.6.2 (2020-11-26)

    Bugs fixed

    • A vulnerability (CVE-2020-27783) was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.1 (2020-10-18)

    Bugs fixed

    • A vulnerability was discovered in the HTML Cleaner by Yaniv Nizry, which allowed JavaScript to pass through. The cleaner now removes more sneaky "style" content.

    4.6.0 (2020-10-17)

    Features added

    • GH#310: lxml.html.InputGetter supports __len__() to count the number of input fields. Patch by Aidan Woolley.

    • lxml.html.InputGetter has a new .items() method to ease processing all input fields.

    • lxml.html.InputGetter.keys() now returns the field names in document order.

    • GH-309: The API documentation is now generated using sphinx-apidoc. Patch by Chris Mayo.

    Bugs fixed

    • LP#1869455: C14N 2.0 serialisation failed for unprefixed attributes when a default namespace was defined.

    • TreeBuilder.close() raised AssertionError in some error cases where it should have raised XMLSyntaxError. It now raises a combined exception to keep up backwards compatibility, while switching to XMLSyntaxError as an interface.

    4.5.2 (2020-07-09)

    ... (truncated)

    Commits
    • 4cb5736 Work around Py2's lack of "re.ASCII".
    • c30106f Prepare release of 4.6.2.
    • a105ab8 Prevent combinations of <math/svg> and <style> to sneak JavaScript through th...
    • c053dc1 Add a recipe for a look-ahead generator to allow modifications during tree it...
    • b083124 lxml actually works in Py3.9.
    • 0f80590 lxml actually works in Py3.9.
    • fd8893c Add a doc note that the .find() methods are usually faster than one might exp...
    • eb6df27 Update release version on homepage.
    • 69b5c9b Automate the build artefact downloading from github and appveyor.
    • 61432a8 Prepare release of lxml 4.6.1.
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 1
  • Include fetching in scrapers

    Scrapers should be able to deal with URLs, HTML, and parsed DOMs (even requests response objects?) to enable flexible library usage.

    • scrapers accept HTML, DOM, and response objects as input, e.g. via scrape_soup, scrape_html, scrape_url (see the sketch below)
    • examples can be created via static methods, e.g. via SingleItemPageSample.from_soup, etc.
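
    A minimal sketch of what those entry points could look like (the method names come from this issue; the wiring via requests and BeautifulSoup is an assumption, not the implemented API):

    import requests
    from bs4 import BeautifulSoup

    class FetchingScraperMixin:
        def scrape_url(self, url):
            # accept plain URLs by fetching them first
            return self.scrape_html(requests.get(url).content)

        def scrape_html(self, html):
            # accept raw HTML by parsing it into a DOM
            return self.scrape_soup(BeautifulSoup(html, "lxml"))

        def scrape_soup(self, soup):
            # concrete scrapers implement extraction on the parsed DOM
            raise NotImplementedError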
    enhancement 
    opened by lorey 1
  • Bump wheel from 0.37.1 to 0.38.1 in /requirements

    Bumps wheel from 0.37.1 to 0.38.1.

    Changelog

    Sourced from wheel's changelog.

    Release Notes

    UNRELEASED

    • Updated vendored packaging to 22.0

    0.38.4 (2022-11-09)

    • Fixed PKG-INFO conversion in bdist_wheel mangling UTF-8 header values in METADATA (PR by Anderson Bravalheri)

    0.38.3 (2022-11-08)

    • Fixed install failure when used with --no-binary, reported on Ubuntu 20.04, by removing setup_requires from setup.cfg

    0.38.2 (2022-11-05)

    • Fixed regression introduced in v0.38.1 which broke parsing of wheel file names with multiple platform tags

    0.38.1 (2022-11-04)

    • Removed install dependency on setuptools
    • The future-proof fix in 0.36.0 for converting PyPy's SOABI into an abi tag was faulty. Fixed so that future changes in the SOABI will not change the tag.

    0.38.0 (2022-10-21)

    • Dropped support for Python < 3.7
    • Updated vendored packaging to 21.3
    • Replaced all uses of distutils with setuptools
    • The handling of license_files (including glob patterns and default values) is now delegated to setuptools>=57.0.0 (#466). The package dependencies were updated to reflect this change.
    • Fixed potential DoS attack via the WHEEL_INFO_RE regular expression
    • Fixed ValueError: ZIP does not support timestamps before 1980 when using SOURCE_DATE_EPOCH=0 or when on-disk timestamps are earlier than 1980-01-01. Such timestamps are now changed to the minimum value before packaging.

    0.37.1 (2021-12-22)

    • Fixed wheel pack duplicating the WHEEL contents when the build number has changed (#415)
    • Fixed parsing of file names containing commas in RECORD (PR by Hood Chatham)

    0.37.0 (2021-08-09)

    • Added official Python 3.10 support
    • Updated vendored packaging library to v20.9

    ... (truncated)

    Commits
    • 6f1608d Created a new release
    • cf8f5ef Moved news item from PR #484 to its proper place
    • 9ec2016 Removed install dependency on setuptools (#483)
    • 747e1f6 Fixed PyPy SOABI parsing (#484)
    • 7627548 [pre-commit.ci] pre-commit autoupdate (#480)
    • 7b9e8e1 Test on Python 3.11 final
    • a04dfef Updated the pypi-publish action
    • 94bb62c Fixed docs not building due to code style changes
    • d635664 Updated the codecov action to the latest version
    • fcb94cd Updated version to match the release
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 0
  • Bump certifi from 2022.6.15 to 2022.12.7 in /requirements

    Bumps certifi from 2022.6.15 to 2022.12.7.

    Commits

    dependencies 
    opened by dependabot[bot] 0
  • Find better selectors

    Currently, we just use the next best selector we find, going from generic to specific. But overly generic selectors are bad, e.g. div most likely carries no meaning, while overly specific selectors like the full path are brittle and will break.

    Maybe there's a heuristic for good selectors. An idea: what if we compute the selectivity of each selector, i.e. how unique it is on the whole page? This would prefer ids and unique classes and discourage generic selectors. We would then take the most selective but simplest selector.
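
    A rough sketch of that idea (the scoring formula is an illustration, not mlscraper's implementation): rank each candidate by how few nodes it matches on the whole page, breaking ties by simplicity:

    def selector_score(soup, selector):
        # soup is a parsed bs4 page; fewer matches = more selective,
        # shorter selector = simpler
        matches = soup.select(selector)
        if not matches:
            return (0.0, 0.0)  # useless selector
        return (1 / len(matches), 1 / len(selector))

    # best = max(candidate_selectors, key=lambda s: selector_score(soup, s))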

    opened by lorey 0
  • Match substrings

    Often, users do not want to match the full attribute or text of a node, but a specific substring.

    Solutions:

    • generate extractors that use appropriate rules to transform node.text into the desired outcome.
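
    A hedged sketch of one such extractor generator: record the prefix/suffix around the sample value in the training node's text, and reapply them on new pages (illustrative only, not the implemented solution):

    import re

    def make_substring_extractor(node_text, wanted):
        # derive a rule from a single training example
        start = node_text.index(wanted)
        prefix, suffix = node_text[:start], node_text[start + len(wanted):]
        pattern = re.compile(
            re.escape(prefix) + r"(.+?)" + (re.escape(suffix) if suffix else r"$")
        )
        return lambda text: m.group(1) if (m := pattern.search(text)) else None

    extract = make_substring_extractor("Born: March 14, 1879", "March 14, 1879")
    print(extract("Born: July 31, 1965"))  # 'July 31, 1965'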
    enhancement 
    opened by lorey 1
Releases (v1.0.0rc3)
  • v1.0.0rc3 (Jun 24, 2022)

    • improved training performance by 10x (again) by trying to generate scrapers for highly similar matches first
    • added first pseudo CSS selectors by implementing nth-child, e.g. div a:nth-child(1)
    • added child selector generation, e.g. .user-box > a
    • added attribute-based css selectors, e.g. a[itemprop="user"]
    • added automated tests for GitHub profile pages
    • added lazy hashing for node elements
    • extended text matching to also include parent elements that contain the same text
    • fixed a bug where searching for values resulted in image dimensions being matched
    • fixed a bug where text did not exactly match the sample provided but was selected anyway
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0rc2 (Jun 21, 2022)

  • v1.0.0rc1 (Jun 21, 2022)

    mlscraper has been rewritten from the ground up and is now easier to use, more flexible, and faster than ever. This is the first release candidate for the upcoming 1.0 version. Feel free to try it out with pip install --pre mlscraper.

    • scrapers can extract arbitrary data structures (lists, dicts, lists of dicts and even lists of lists); see the sketch below
    • depending on the page, one example might be enough to train a scraper
    • the generation of CSS selectors has been overhauled and is now more efficient
    Source code(tar.gz)
    Source code(zip)
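
    Building on the 1.0 API shown in the "Example from docs does not work" issue above, here is a hedged sketch of the new arbitrary-data-structure support. The URL is fictional (as in the README example), and whether a list-of-dicts sample trains successfully depends on the page:

    import requests
    from mlscraper.html import Page
    from mlscraper.samples import Sample, TrainingSet
    from mlscraper.training import train_scraper

    # one sample whose value is a list of dicts (hypothetical listing page)
    page = Page(requests.get("https://test.com/articles").content)
    sample = Sample(page, [
        {"title": "One great result!", "description": "Some description"},
        {"title": "Another great result!", "description": "Another description"},
    ])

    training_set = TrainingSet()
    training_set.add_sample(sample)
    scraper = train_scraper(training_set)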
Owner
Karl Lorey
I build, therefore I am. ML, Crawling, Scraping. Using whatever gets the job done, mostly Python, SQL, HTML/CSS, JS.
JD.com Maotai flash purchase

As of 2021/2/1, this project no longer works! JD.com: available while reservations last, limited to real-name-verified JD users purchasing via the app; reservations open Feb 1 at 10:00 and purchasing starts Feb 1 at 12:00 (the JD app must be upgraded to version 8.5.6 or above). Preface: this project comes from huanghyw - jd_seckill; I can no longer find the author's project page and will link it once found

abee 73 Dec 03, 2022
Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage import lassie lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 570 Dec 19, 2022
A web crawler for recording posts in "sina weibo"

Web Crawler for "sina weibo" A web crawler for recording posts in "sina weibo" Introduction This script helps collect attributes of posts in "sina wei

4 Aug 20, 2022
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

Aditya Gupta 15 May 17, 2022
Trending-topics leaderboards: Python crawler + regex (re) + beautifulsoup + xpath

Repo overview: Weibo trending list, parameter wb; Baidu trending list, parameter bd; 360 hot list, parameter 360; CSDN hot-list API, see below; other trending lists to be added. How to use? Register on Vercel and fork to your own repository, then click the deploy link in the top right (one-click deploy). Request parameters: your configured Vercel address + api?tit= + parameter (parameter info is in the repo overview

Harry 3 Jul 08, 2022
Python script to check if there is any differences in responses of an application when the request comes from a search engine's crawler.

crawlersuseragents This Python script can be used to check if there is any differences in responses of an application when the request comes from a se

Podalirius 13 Dec 27, 2022
Snowflake database loading utility with Scrapy integration

Snowflake Stage Exporter Snowflake database loading utility with Scrapy integration. Meant for streaming ingestion of JSON serializable objects into S

Oleg T. 0 Dec 06, 2021
This is a web scraper, using Python framework Scrapy, built to extract data from the Deals of the Day section on Mercado Livre website.

Deals of the Day This is a web scraper, using the Python framework Scrapy, built to extract data such as price and product name from the Deals of the

David Souza 1 Jan 12, 2022
An IpVanish Proxies Scraper

EzProxies Tired of searching for good proxies for hours? Just get an IpVanish account and get thousands of good proxies in few seconds! Showcase Watch

11 Nov 13, 2022
mlscraper: Scrape data from HTML pages automatically with Machine Learning

🤖 Scrape data from HTML websites automatically with Machine Learning

Karl Lorey 798 Dec 29, 2022
Simple proxy scraper made by using ProxyScrape's api.

What is Moon? Moon is a lightweight and fast proxy scraper made by using ProxyScrape's api. What can i do with this? You can use proxies for varietys

1 Jul 04, 2022
JD.com Maotai flash purchase, latest version as of April 2021

Jd_Seckill special statement: all scripts in the jd_seckill project published in this repository are intended for testing, learning, and research only; commercial use is forbidden, and their legality, accuracy, completeness, and effectiveness cannot be guaranteed, so please judge for yourself. No resource files in this project may be reposted or published in any form by any public account or self-media outlet. For any script issues, huanghyw accepts no

45 Dec 14, 2022
This is a simple website crawler which asks for a website link from the user to crawl and find specific data from the given website address.

This is a simple website crawler which asks for a website link from the user to crawl and find specific data from the given website address.

Faisal Ahmed 1 Jan 10, 2022
A multithreaded tool for searching and downloading images from popular search engines. It is straightforward to set up and run!

🕳️ CygnusX1 Code by Trong-Dat Ngo. Overviews 🕳️ CygnusX1 is a multithreaded tool 🛠️ , used to search and download images from popular search engine

DatNgo 32 Dec 31, 2022
A Pixiv web crawler module

Pixiv-spider A Pixiv spider module WARNING It's an unfinished work, browsing the code carefully before using it. Features 0004 - Readme.md updated, co

Uzuki 1 Nov 14, 2021
Taobao and Tmall half-price flash purchases: grab TVs, grab Maotai, beat the scalpers

taobao_seckill: Taobao and Tmall half-price flash purchases; grab TVs, grab Maotai, beat the scalpers. Dependencies: install the Chrome browser, then download and install the chromedriver matching your browser version. Web version instructions: 1. Calibrate your local time before purchasing, then add the items you want to your cart. 2. To package it as an executable, you can use pyinstalle

2k Jan 05, 2023
Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

Md Rashidul Islam 1 Nov 03, 2021
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 07, 2023
A web scraper that exports your entire WhatsApp chat history.

WhatSoup 🍲 A web scraper that exports your entire WhatsApp chat history. Table of Contents Overview Demo Prerequisites Instructions Frequen

Eddy Harrington 87 Jan 06, 2023
This tool can be used to extract information from any website

WEB-INFO- This tool can be used to extract information from any website Install Termux and run the command --- $ apt-get update $ apt-get upgrade $ pk

1 Oct 24, 2021