Scrapy, a fast high-level web crawling & scraping framework for Python.

Last update: Jan 07, 2023

Scrapy

Overview

Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

Scrapy is maintained by Zyte (formerly Scrapinghub) and many other contributors.

Check the Scrapy homepage at https://scrapy.org for more information, including a list of features.

Requirements

Python 3.7+
Works on Linux, Windows, macOS, BSD

Install

The quick way:

pip install scrapy

See the install section in the documentation at https://docs.scrapy.org/en/latest/intro/install.html for more details.

Documentation

Documentation is available online at https://docs.scrapy.org/ and in the docs directory.

Releases

You can check https://docs.scrapy.org/en/latest/news.html for the release notes.

Community (blog, twitter, mail list, IRC)

See https://scrapy.org/community/ for details.

Contributing

See https://docs.scrapy.org/en/master/contributing.html for details.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct (see https://github.com/scrapy/scrapy/blob/master/CODE_OF_CONDUCT.md).

By participating in this project you agree to abide by its terms. Please report unacceptable behavior to [email protected].

Companies using Scrapy

See https://scrapy.org/companies/ for a list.

Commercial Support

See https://scrapy.org/support/ for details.

Comments

CSS selectors

I'm a web developer and designer turned into a web scraper. Python is easy and I love it. Scrapy is wonderful. But XPath... it's a foreign language that mostly does what CSS selectors do — but those are second language to me. Anyone coming from jQuery or CSS will already be used to CSS selectors.

So I decided to improve scrapy to support CSS selectors, using the cssselect package that got extracted from the lxml package. I've created two Selector classes, HtmlCSSSelector and XmlCSSSelector, that basically translate the CSS selectors into XPath selectors and run the parent class select method. Easy Peasy.

I'm looking into how to provide tests for these new classes, but would love some guidance. This is my first contribution to a Python package.

opened by barraponto 67

Scrapy.selector Enhancement Proposal

Make scrapy.selector into a separate project.

When scraping web pages, the most common task is extracting data from the HTML source. Several libraries are available for this, such as BeautifulSoup and lxml. Although Scrapy selectors are built on top of lxml, they offer a better way to extract data from HTML. In my opinion, people would prefer selectors to BeautifulSoup and lxml if scrapy.selector were made into a separate project.

opened by zwidny 60

New selector method: extract_first()

I have a suggestion to improve the Scrapy Selector. I've seen this construction in many projects:

result = sel.xpath('//div/text()').extract()[0]

And what about the if result: / else: or try: / except: handling, which should always be there? When we don't want ItemLoaders, the most common use of a selector is retrieving a single element. Maybe there should be a method such as xpath1, xpath_one, or xpath_first that returns the first matched element or None?
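
The behaviour being asked for can be sketched without Scrapy. The helper name first_or_default and the sample lists below are hypothetical; the lists only stand in for what .extract() would return.

```python
def first_or_default(matches, default=None):
    """Return the first element of ``matches``, or ``default`` if there is none."""
    # next() with a default avoids the IndexError that matches[0] raises
    # when nothing matched.
    return next(iter(matches), default)


# Stand-ins for the lists that sel.xpath(...).extract() would return.
found = ["Hello", "World"]
missing = []

print(first_or_default(found))    # first match
print(first_or_default(missing))  # nothing matched, so the default is returned
```

With a helper like this the caller needs neither an index nor a try/except around it.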
opened by shirk3y 54

[GSoC 2019] Support for Different robots.txt Parsers

This issue is a single place for all students and mentors to discuss ideas and proposals for the Support for Different robots.txt Parsers GSoC project. First of all, every student involved should make a contribution to the https://github.com/scrapy/scrapy project. It does not need to be big, just enough to get your hands dirty and get accustomed to the processes and tools. The contribution should take the form of an open pull request that solves a problem not related to the robots.txt project. You can read open issues or open PRs and choose one for yourself, or you can ask here and mentors and contributors will give some recommendations.

Problems with the current robots.txt implementation can be traced in the relevant issues:

https://github.com/scrapy/scrapy/issues/754 (this is a main one)

#892

https://github.com/scrapy/scrapy/issues/2443

https://github.com/scrapy/scrapy/issues/3637

Previous attempts to fix issues can be seen in relevant PRs:

https://github.com/scrapy/scrapy/pull/2669

https://github.com/scrapy/scrapy/pull/2385
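
For orientation, this is roughly what checking a URL against robots.txt rules looks like with one candidate parser, Python's standard-library urllib.robotparser. It is only a sketch of the kind of interface a pluggable parser would have to expose. The wrapper function is_allowed, the sample rules, and the user agent name are assumptions made for illustration and are not part of Scrapy.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if ``user_agent`` may fetch ``url`` under ``robots_txt``."""
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "mybot", "http://www.example.com/private/page"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "mybot", "http://www.example.com/index.html"))
```

A different parser could be swapped in behind the same is_allowed signature, which is the kind of seam the project is about.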

Ask for more details in comments
opened by whalebot-helmsman 48

Issue with running scrapy spider from script.

Hi, I'm trying to run scrapy from a script like this:

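import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.loader import ItemLoader

# PropertiesItem is the poster's own Item class; its import is not shown in the report.

class MySpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Populate the item's title field from the first <h1> on the page.
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//h1[1]/text()')

        return l.load_item()

# Run the spider in-process with a custom user agent.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start()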

However, when I run this script I get the following error:

File "/Library/Python/2.7/site-packages/Twisted-16.7.0rc1-py2.7-macosx-10.11-
intel.egg/twisted/internet/_sslverify.py", line 38, in <module>
TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1,
AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'

Does anyone know how to fix this? Thanks in advance.

enhancement docs

opened by tituskex 42

' error: command 'x86_64-linux-gnu-gcc' failed with exit status 1 '

I wanted to install Scrapy in a virtualenv using pip (Python 3.5), but I get the following error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

I tried with Python 2.7, but I get the same error.
install

opened by euler16 41

Centralized Request fingerprints

It is very easy to introduce a subtle bug when using a custom duplicates filter that changes how the request fingerprint is calculated.

The duplicates filter checks the request fingerprint and makes the Scheduler drop the request if it is a duplicate.

The cache storage checks the request fingerprint and fetches the response from the cache if it is a duplicate.

If the fingerprint algorithms differ, we're in trouble.

The problem is that there is no way to override the request fingerprint globally; to make Scrapy always take something extra into account (an HTTP header, a meta option), the user must override the duplicates filter and all cache storages that are in use.

Ideas about how to fix it:

use the duplicates filter's request_fingerprint method in the cache storage if this method is available;

create a special Request.meta key that the request_fingerprint function will take into account;

create a special Request.meta key that allows providing a pre-calculated fingerprint;

add a settings.py option to override request fingerprint function globally.
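
One way to picture the last idea is a single fingerprint function handed to both components, so they cannot drift apart. The sketch below is not Scrapy's implementation: fingerprint, DupeFilter, CacheStorage, and shared are hypothetical names, and requests are modelled as plain dicts to keep the example self-contained.

```python
import hashlib


def fingerprint(request, include_headers=()):
    """Hash the method, URL, and any explicitly listed headers of a request."""
    h = hashlib.sha1()
    h.update(request["method"].upper().encode())
    h.update(b"\n")
    h.update(request["url"].encode())
    for name in sorted(include_headers):
        value = request.get("headers", {}).get(name, "")
        h.update(b"\n" + name.lower().encode() + b":" + value.encode())
    return h.hexdigest()


class DupeFilter:
    """Drops requests whose fingerprint has been seen before."""

    def __init__(self, fingerprint_func):
        self.fingerprint_func = fingerprint_func
        self.seen = set()

    def is_duplicate(self, request):
        fp = self.fingerprint_func(request)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


class CacheStorage:
    """Stores responses keyed by request fingerprint."""

    def __init__(self, fingerprint_func):
        self.fingerprint_func = fingerprint_func
        self.responses = {}

    def store(self, request, response):
        self.responses[self.fingerprint_func(request)] = response

    def retrieve(self, request):
        return self.responses.get(self.fingerprint_func(request))


def shared(request):
    # The one place where "what makes a request unique" is decided.
    return fingerprint(request, include_headers=("Accept-Language",))


dupefilter = DupeFilter(shared)
cache = CacheStorage(shared)

english = {"method": "GET", "url": "http://www.example.com/",
           "headers": {"Accept-Language": "en"}}
german = {"method": "GET", "url": "http://www.example.com/",
          "headers": {"Accept-Language": "de"}}

cache.store(english, "english page")
print(dupefilter.is_duplicate(english))  # first time this request is seen
print(dupefilter.is_duplicate(german))   # header differs, so not a duplicate
print(cache.retrieve(german))            # nothing cached for this variant
```

Because both objects receive the same callable, changing which headers count changes the duplicates filter and the cache together.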

enhancement
opened by kmike 40

Support relative urls better

Building a URL relative to the current URL is a very common task; currently users are required to do that themselves: import urlparse and then call urlparse.urljoin(response.url, href).

What do you think about adding a shortcut for that - something like response.urljoin(href) or response.absolute_url(href)?

Another crazy idea is to support relative urls directly, by assuming they are relative to response.url. I think it can be implemented as a spider middleware. I actually like this idea :)
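
For reference, this is what the proposed shortcut would wrap. In Python 3 the function lives in urllib.parse rather than the Python 2 urlparse module. The helper name absolute_url and the sample URLs are illustrative assumptions.

```python
from urllib.parse import urljoin


def absolute_url(base_url, href):
    """Resolve ``href`` against ``base_url``, as a browser would for a link."""
    return urljoin(base_url, href)


page = "http://www.example.com/catalog/page1.html"
for href in ("item/42", "../about", "/contact", "http://other.example/x"):
    print(absolute_url(page, href))
```

An href that is already absolute is returned unchanged, so the helper can be applied to every extracted link without checking first.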
enhancement

opened by kmike 38

Scrapy chokes on HTTP response status lines without a Reason phrase

Try fetching this page:

$ scrapy fetch 'http://www.gidroprofmontag.ru/bassein/sbornue_basseynu'

Output:

2013-07-11 09:15:37+0400 [scrapy] INFO: Scrapy 0.17.0-304-g3fe2a32 started (bot: amon)
/home/tonal/amon/amon/amon/downloadermiddleware/blocked.py:6: ScrapyDeprecationWarning: Module `scrapy.stats` is deprecated, use `crawler.stats` attribute instead
  from scrapy.stats import stats
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider opened
2013-07-11 09:15:37+0400 [amon_ra] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-11 09:15:37+0400 [amon_ra] ERROR: Error downloading <GET http://www.gidroprofmontag.ru/bassein/sbornue_basseynu>: [<twisted.python.failure.Failure <class 'scrapy.xlib.tx._newclient.ParseError'>>]
2013-07-11 09:15:37+0400 [amon_ra] INFO: Closing spider (finished)
2013-07-11 09:15:37+0400 [amon_ra] INFO: Dumping Scrapy stats:
        {'downloader/exception_count': 1,
         'downloader/exception_type_count/scrapy.xlib.tx._newclient.ResponseFailed': 1,
         'downloader/request_bytes': 256,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 512010),
         'log_count/ERROR': 1,
         'log_count/INFO': 4,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2013, 7, 11, 5, 15, 37, 257898)}
2013-07-11 09:15:37+0400 [amon_ra] INFO: Spider closed (finished)
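
To make the failure mode concrete, here is a minimal sketch of a status-line parser that tolerates a missing Reason phrase. It is not the parser Scrapy or Twisted uses; the function name parse_status_line and the sample lines are assumptions for illustration.

```python
def parse_status_line(line):
    """Split an HTTP status line into (version, status code, reason phrase).

    The reason phrase is optional here: "HTTP/1.1 200" is accepted and
    yields an empty reason instead of raising a parse error.
    """
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        raise ValueError(f"malformed status line: {line!r}")
    version, code = parts[0], int(parts[1])
    reason = parts[2] if len(parts) == 3 else ""
    return version, code, reason


print(parse_status_line("HTTP/1.1 404 Not Found"))
print(parse_status_line("HTTP/1.1 200"))  # no reason phrase
```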

bug

opened by tonal 37

sslv3 alert handshake failure when making a request

Hi there, I recently upgraded to the latest Scrapy, and on some SSL-enabled sites I get an exception when trying to make requests, while on previous Scrapy versions I didn't have this issue. The issue can be seen by making a request with scrapy shell:

scrapy shell "https://www.gohastings.com/"

The error I get is: Retrying <GET https://www.gohastings.com/> (failed 1 times): <twisted.python.failure.Failure OpenSSL.SSL.Error: ('SSL routines', 'SSL3_READ_BYTES', 'sslv3 alert handshake failure'), ('SSL routines', 'SSL3_WRITE_BYTES', 'ssl handshake failure')>
bug https

opened by lagenar 36

GSoC 2021: Feeds enhancements

This is a single issue to discuss feeds enhancements as a project for GSoC 2021.

My idea here is to make a project to work on 3 (or more) improvements detailed below.

1 - filter items on export based on custom rules.

Relevant issues:

https://github.com/scrapy/scrapy/issues/4607

https://github.com/scrapy/scrapy/issues/4575

https://github.com/scrapy/scrapy/issues/4786

https://github.com/scrapy/scrapy/issues/3193

There is already a PR for this one (note my last comment there) https://github.com/scrapy/scrapy/pull/4576

However, if the author doesn't reply on time, we can continue the work from the branch and only finish the feature.

2 - Compress feeds

This is an old feature request, and there's an issue for it here: https://github.com/scrapy/scrapy/issues/2174. The API has changed a bit since then, but I think it'd be something like

FEEDS = { "myfile.jl": { "compression": "gzip" } }

I think gzip is a good starting point, but we should put some effort into designing an API that will be extensible and allow different formats.
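
As a rough illustration of what a gzip-compressed JSON lines feed involves, the sketch below wraps the output file object in a gzip stream. It uses only the standard library; the function name write_compressed_jsonlines and the sample items are hypothetical, and nothing here reflects how Scrapy would implement the feature.

```python
import gzip
import io
import json


def write_compressed_jsonlines(items, fileobj):
    """Write ``items`` as gzip-compressed JSON lines into ``fileobj``."""
    # Closing the GzipFile flushes the gzip trailer but leaves fileobj open.
    with gzip.GzipFile(fileobj=fileobj, mode="wb") as gz:
        for item in items:
            gz.write(json.dumps(item).encode("utf-8") + b"\n")


buffer = io.BytesIO()
write_compressed_jsonlines([{"name": "a"}, {"name": "b"}], buffer)

# Round-trip to confirm the content survives compression.
restored = gzip.decompress(buffer.getvalue()).decode("utf-8").splitlines()
print(restored)
```

Because the compressor only wraps a file-like object, another format could be substituted at the same point, which is one way to keep the API extensible.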

3 - Spider open/close a batch

Recently we added support for batch delivery in Scrapy. Say, every X items, we deliver a file and open a new file. Sometimes, we don't know the threshold upfront or it can be based on an external signal. In this case, we should be able to trigger a batch delivery from the spider. I have two possible ideas for it:

Create a new signal: scrapy.signals.close_batch

Add a method that can be called from the spider spider.feed_exporter.close_batch()

Note that this can be tricky, as we allow multiple feeds (so it may require an argument specifying which feed batch to close).
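
The second idea can be pictured with a toy exporter that closes a batch either when a count threshold is reached or when the caller asks for it. BatchingExporter and its attributes are hypothetical names for this sketch, and a single feed is assumed.

```python
class BatchingExporter:
    """Collects items and delivers them in batches."""

    def __init__(self, batch_size=None):
        self.batch_size = batch_size  # None disables the automatic threshold
        self.current = []             # items in the batch that is still open
        self.delivered = []           # batches that have been closed

    def export_item(self, item):
        self.current.append(item)
        if self.batch_size and len(self.current) >= self.batch_size:
            self.close_batch()

    def close_batch(self):
        """Deliver the open batch, if it has any items, and start a new one."""
        if self.current:
            self.delivered.append(self.current)
            self.current = []


exporter = BatchingExporter(batch_size=3)
for n in range(4):
    exporter.export_item(n)

# Triggered manually, for example in response to an external signal.
exporter.close_batch()
print(exporter.delivered)
```

With several feeds, close_batch would need an argument that identifies the feed, as noted above.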
enhancement
opened by ejulio 35

Fix overwriting repeated headers in HTTP2

This commit fixes a bug where repeated header values were overwritten when using HTTP2.

raw_response = \
"""
HTTP 1.1 200 OK
Set-Cookie: first=1
Set-Cookie: second=2
"""

# Old logic
>>> response.headers.getlist(b'Set-Cookie')
[b'second=2']

# New logic
>>> response.headers.getlist(b'Set-Cookie')
[b'first=1', b'second=2']
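
The difference between the two behaviours comes down to assigning versus appending when the same header name appears twice. The following sketch shows both; the function names are hypothetical and the code is not taken from the commit.

```python
def collect_overwriting(pairs):
    """Old behaviour: a repeated header name replaces the earlier value."""
    headers = {}
    for name, value in pairs:
        headers[name] = [value]
    return headers


def collect_appending(pairs):
    """New behaviour: values of a repeated header name are accumulated."""
    headers = {}
    for name, value in pairs:
        headers.setdefault(name, []).append(value)
    return headers


received = [(b"Set-Cookie", b"first=1"), (b"Set-Cookie", b"second=2")]
print(collect_overwriting(received)[b"Set-Cookie"])
print(collect_appending(received)[b"Set-Cookie"])
```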

opened by Malkiz223 3

AsyncioTest failures

There are two new failing tests (AsyncioTest.test_install_asyncio_reactor and AsyncioTest.test_is_asyncio_reactor_installed) on 3.10 and 3.11, and it looks like they started failing after updating to 3.10.9 and 3.11.1 (it may be a coincidence though). I see an asyncio-related change in those versions: "asyncio.get_event_loop() now only emits a deprecation warning when a new event loop was created implicitly. It no longer emits a deprecation warning if the current event loop was set." and it looks like it's the cause for the test_install_asyncio_reactor failure as that checks for a warning and maybe there is some new warning. But the second test is about installing the reactor and checking its type and the failure looks like the asyncio reactor is installed when it shouldn't.
bug CI asyncio

opened by wRAR 1

Spelling

This PR corrects misspellings identified by the check-spelling action.

The misspellings have been reported at https://github.com/jsoref/scrapy/actions/runs/3727974993

The action reports that the changes in this PR would make it happy: https://github.com/jsoref/scrapy/actions/runs/3727977653

Note: this PR does not include the action. If you're interested in running a spell check on every PR and push, that can be offered separately.

opened by jsoref 1

Finish exporting for each start exporting

Figured I'd take a crack at #5537 since it's been stale for a while.

This patch moves the call to slot.finish_exporting() up to before the conditional that checks if slot.item_count > 0 so that if an exception occurs when an exporter's export_item() method is called, slot.finish_exporting() will still be called (as discussed in the issue).

I've included 4 new tests that each check that:

if slot.start_exporting() was called that there is a call to slot.finish_exporting() after

slot.finish_exporting() is not called before a call to slot.start_exporting(). I.e. they assert that slot.finish_exporting() isn't called twice in a row without a call to slot.start_exporting() in between and that slot.finish_exporting() isn't called before an initial call to slot.start_exporting.

test_start_finish_exporting_items_exception fails before the patch and passes after the patch. The other 3 tests pass before and after.
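
The invariant the tests assert, that every start_exporting() is followed by a finish_exporting() even when an export fails, can be illustrated in isolation. Note that this sketch uses try/finally, which is a different technique from the reordering done in the patch, and RecordingSlot, export_items, and failing_export are hypothetical names.

```python
class RecordingSlot:
    """Stand-in for a feed slot that records the order of calls."""

    def __init__(self):
        self.calls = []

    def start_exporting(self):
        self.calls.append("start")

    def finish_exporting(self):
        self.calls.append("finish")


def export_items(slot, items, export_item):
    slot.start_exporting()
    try:
        for item in items:
            export_item(item)
    finally:
        # Runs whether or not export_item raised.
        slot.finish_exporting()


def failing_export(item):
    raise ValueError("export failed")


slot = RecordingSlot()
try:
    export_items(slot, [1], failing_export)
except ValueError:
    pass
print(slot.calls)
```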
opened by MattyMay 1

Releases(2.7.1)

2.7.1(Nov 2, 2022)
Relaxed the restriction introduced in 2.6.2 so that the Proxy-Authorization header can again be set explicitly in certain cases, restoring compatibility with scrapy-zyte-smartproxy 2.1.0 and older.

Bug fixes

See the full changelog
2.7.0(Oct 17, 2022)
Added Python 3.11 support, dropped Python 3.6 support

Improved support for asynchronous callbacks

Asyncio support is enabled by default on new projects

Output names of item fields can now be arbitrary strings

Centralized request fingerprinting configuration is now possible

See the full changelog
2.6.3(Sep 27, 2022)

Makes pip install Scrapy work again.

It required making changes to support pyOpenSSL 22.1.0. We had to drop support for SSLv3 as a result.

We also upgraded the minimum versions of some dependencies.

See the changelog.
2.6.2(Jul 25, 2022)

Fixes a security issue around HTTP proxy usage, and addresses a few regressions introduced in Scrapy 2.6.0.

See the changelog.
1.8.3(Jul 25, 2022)

Fixes a security issue around HTTP proxy usage. See the changelog for details.
2.6.1(Mar 1, 2022)

Fixes a regression introduced in 2.6.0 that would unset the request method when following redirects.
2.6.0(Mar 1, 2022)
Security fixes for cookie handling (see details below)

Python 3.10 support

asyncio support is no longer considered experimental, and works out-of-the-box on Windows regardless of your Python version

Feed exports now support pathlib.Path output paths and per-feed item filtering and post-processing

See the full changelog

Security bug fixes

When a Request object with cookies defined gets a redirect response causing a new Request object to be scheduled, the cookies defined in the original Request object are no longer copied into the new Request object.

If you manually set the Cookie header on a Request object and the domain name of the redirect URL is not an exact match for the domain of the URL of the original Request object, your Cookie header is now dropped from the new Request object.

The old behavior could be exploited by an attacker to gain access to your cookies. Please, see the cjvr-mfj7-j4j8 security advisory for more information.

Note: It is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. example.com and any subdomain) by defining the shared domain suffix (e.g. example.com) as the cookie domain when defining your cookies. See the documentation of the Request class for more information.

When the domain of a cookie, either received in the Set-Cookie header of a response or defined in a Request object, is set to a public suffix (https://publicsuffix.org/), the cookie is now ignored unless the cookie domain is the same as the request domain.

The old behavior could be exploited by an attacker to inject cookies from a controlled domain into your cookiejar that could be sent to other domains not controlled by the attacker. Please, see the mfjm-vh54-3f96 security advisory for more information.
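
The rule for a manually set Cookie header on redirects can be sketched as a small pure function. This is an illustration of the described behaviour only, not Scrapy's code; headers_for_redirect and the sample URLs are hypothetical, and "exact match" is taken here to mean equal hostnames.

```python
from urllib.parse import urlparse


def headers_for_redirect(original_url, redirect_url, headers):
    """Return the headers to send with the redirected request.

    The Cookie header is kept only when the redirect stays on exactly the
    same host as the original request.
    """
    same_host = urlparse(original_url).hostname == urlparse(redirect_url).hostname
    if same_host:
        return dict(headers)
    return {k: v for k, v in headers.items() if k.lower() != "cookie"}


sent = {"Cookie": "session=abc", "Accept": "text/html"}
print(headers_for_redirect("https://example.com/a", "https://example.com/b", sent))
print(headers_for_redirect("https://example.com/a", "https://attacker.example/b", sent))
```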

1.8.2(Mar 1, 2022)
Security bug fixes

When a Request object with cookies defined gets a redirect response causing a new Request object to be scheduled, the cookies defined in the original Request object are no longer copied into the new Request object.

If you manually set the Cookie header on a Request object and the domain name of the redirect URL is not an exact match for the domain of the URL of the original Request object, your Cookie header is now dropped from the new Request object.

The old behavior could be exploited by an attacker to gain access to your cookies. Please, see the cjvr-mfj7-j4j8 security advisory for more information.

Note: It is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. example.com and any subdomain) by defining the shared domain suffix (e.g. example.com) as the cookie domain when defining your cookies. See the documentation of the Request class for more information.

When the domain of a cookie, either received in the Set-Cookie header of a response or defined in a Request object, is set to a public suffix (https://publicsuffix.org/), the cookie is now ignored unless the cookie domain is the same as the request domain.

The old behavior could be exploited by an attacker to inject cookies from a controlled domain into your cookiejar that could be sent to other domains not controlled by the attacker. Please, see the mfjm-vh54-3f96 security advisory for more information.

2.5.1(Oct 5, 2021)

Security bug fix:

If you use HttpAuthMiddleware (i.e. the http_user and http_pass spider attributes) for HTTP authentication, any request exposes your credentials to the request target.

To prevent unintended exposure of authentication credentials to unintended domains, you must now additionally set a new spider attribute, http_auth_domain, and point it to the specific domain to which the authentication credentials must be sent.

If the http_auth_domain spider attribute is not set, the domain of the first request will be considered the HTTP authentication target, and authentication credentials will only be sent in requests targeting that domain.

If you need to send the same HTTP authentication credentials to multiple domains, you can use w3lib.http.basic_auth_header instead to set the value of the Authorization header of your requests.

If you really want your spider to send the same HTTP authentication credentials to any domain, set the http_auth_domain spider attribute to None.

Finally, if you are a user of scrapy-splash, know that this version of Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will need to upgrade scrapy-splash to a greater version for it to continue to work.
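
The effect of http_auth_domain can be pictured as a check performed before credentials are attached to a request. The sketch below is an assumption-laden illustration, not the middleware's code: should_send_credentials is a hypothetical name, and treating subdomains of the configured domain as matching is an assumption of this example.

```python
from urllib.parse import urlparse


def should_send_credentials(url, auth_domain):
    """Decide whether HTTP auth credentials may accompany a request to ``url``.

    ``auth_domain`` plays the role of the http_auth_domain spider attribute;
    None means credentials are sent to any domain.
    """
    if auth_domain is None:
        return True
    host = (urlparse(url).hostname or "").lower()
    auth_domain = auth_domain.lower()
    # Assumption of this sketch: subdomains of auth_domain also match.
    return host == auth_domain or host.endswith("." + auth_domain)


print(should_send_credentials("https://example.com/login", "example.com"))
print(should_send_credentials("https://other.example/login", "example.com"))
print(should_send_credentials("https://other.example/login", None))
```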
1.8.1(Oct 5, 2021)

Security bug fix:

If you use HttpAuthMiddleware (i.e. the http_user and http_pass spider attributes) for HTTP authentication, any request exposes your credentials to the request target.

To prevent unintended exposure of authentication credentials to unintended domains, you must now additionally set a new spider attribute, http_auth_domain, and point it to the specific domain to which the authentication credentials must be sent.

If the http_auth_domain spider attribute is not set, the domain of the first request will be considered the HTTP authentication target, and authentication credentials will only be sent in requests targeting that domain.

If you need to send the same HTTP authentication credentials to multiple domains, you can use w3lib.http.basic_auth_header instead to set the value of the Authorization header of your requests.

If you really want your spider to send the same HTTP authentication credentials to any domain, set the http_auth_domain spider attribute to None.

Finally, if you are a user of scrapy-splash, know that this version of Scrapy breaks compatibility with scrapy-splash 0.7.2 and earlier. You will need to upgrade scrapy-splash to a greater version for it to continue to work.
2.5.0(Apr 6, 2021)
Official Python 3.9 support

Experimental HTTP/2 support

New get_retry_request() function to retry requests from spider callbacks

New headers_received signal that allows stopping downloads early

New Response.protocol attribute

See the full changelog
2.4.1(Nov 17, 2020)
Fixed feed exports overwrite support

Fixed the asyncio event loop handling, which could make code hang

Fixed the IPv6-capable DNS resolver CachingHostnameResolver for download handlers that call reactor.resolve

Fixed the output of the genspider command showing placeholders instead of the import part of the generated spider module (issue 4874)

2.4.0(Oct 11, 2020)
Highlights:

Python 3.5 support has been dropped.

The file_path method of media pipelines can now access the source item.

This allows you to set a download file path based on item data.

The new item_export_kwargs key of the FEEDS setting allows defining keyword parameters to pass to item exporter classes.

You can now choose whether feed exports overwrite or append to the output file.

For example, when using the crawl or runspider commands, you can use the -O option instead of -o to overwrite the output file.

Zstd-compressed responses are now supported if zstandard is installed.

In settings, where the import path of a class is required, it is now possible to pass a class object instead.

See the full changelog
2.3.0(Aug 4, 2020)
Highlights:

Feed exports now support Google Cloud Storage as a storage backend

The new FEED_EXPORT_BATCH_ITEM_COUNT setting allows delivering output items in batches of up to the specified number of items.

It also serves as a workaround for delayed file delivery, which causes Scrapy to only start item delivery after the crawl has finished when using certain storage backends (S3, FTP, and now GCS).

The base implementation of item loaders has been moved into a separate library, itemloaders, allowing usage from outside Scrapy and a separate release schedule

See the full changelog
2.2.1(Jul 17, 2020)

The startproject command no longer makes unintended changes to the permissions of files in the destination folder, such as removing execution permissions.
2.2.0(Jun 24, 2020)
Highlights:

Python 3.5.2+ is required now

dataclass objects and attrs objects are now valid item types

New TextResponse.json method

New bytes_received signal that allows canceling response download

CookiesMiddleware fixes

See the full changelog
2.1.0(Apr 24, 2020)
Highlights:

New FEEDS setting to export to multiple feeds

New Response.ip_address attribute

See the full changelog
2.0.1(Mar 18, 2020)
Response.follow_all now supports an empty URL iterable as input (#4408, #4420)

Removed top-level reactor imports to prevent errors about the wrong Twisted reactor being installed when setting a different Twisted reactor using TWISTED_REACTOR (#4401, #4406)

2.0.0(Mar 3, 2020)
Highlights:

Python 2 support has been removed

Partial coroutine syntax support and experimental asyncio support

New Response.follow_all method

FTP support for media pipelines

New Response.certificate attribute

IPv6 support through DNS_RESOLVER

See the full changelog
1.7.4(Oct 21, 2019)

Revert the fix for #3804 (#3819), which has a few undesired side effects (#3897, #3976).
1.7.3(Aug 1, 2019)

Enforce lxml 4.3.5 or lower for Python 3.4 (#3912, #3918)
1.7.2(Jul 23, 2019)

Fix Python 2 support (#3889, #3893, #3896)
1.7.0(Jul 18, 2019)
Highlights:

Improvements for crawls targeting multiple domains

A cleaner way to pass arguments to callbacks

A new class for JSON requests

Improvements for rule-based spiders

New features for feed exports

See the full change log
1.6.0(Feb 11, 2019)
Highlights:

Better Windows support

Python 3.7 compatibility

Big documentation improvements, including a switch from .extract_first() + .extract() API to .get() + .getall() API

Feed exports, FilePipeline and MediaPipeline improvements

Better extensibility: item_error and request_reached_downloader signals; from_crawler support for feed exporters, feed storages and dupefilters.

scrapy.contracts fixes and new features

Telnet console security improvements, first released as a backport in Scrapy 1.5.2 (2019-01-22)

Clean-up of the deprecated code

Various bug fixes, small new features and usability improvements across the codebase.

Full changelog is in the docs.
1.5.0(Dec 30, 2017)
This release brings small new features and improvements across the codebase. Some highlights:

Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.

Crawling with proxy servers becomes more efficient, as connections to proxies can be reused now.

Warnings, exception and logging messages are improved to make debugging easier.

The scrapy parse command now allows setting custom request meta via the --meta argument.

Compatibility with Python 3.6, PyPy and PyPy3 is improved; PyPy and PyPy3 are now supported officially, by running tests on CI.

Better default handling of HTTP 308, 522 and 524 status codes.

Documentation is improved, as usual.

Full changelog is in the docs.
1.4.0(Dec 29, 2017)

Release notes at https://doc.scrapy.org/en/latest/news.html#scrapy-1-4-0-2017-05-18
1.3.3(Dec 29, 2017)

Release notes at https://doc.scrapy.org/en/latest/news.html#scrapy-1-3-3-2017-03-10
1.2.2(Dec 8, 2016)
Bug fixes

Fix a cryptic traceback when a pipeline fails on open_spider() (#2011)

Fix embedded IPython shell variables (fixing #396 that re-appeared in 1.2.0, fixed in #2418)

A couple of patches when dealing with robots.txt:

handle (non-standard) relative sitemap URLs (#2390)

handle non-ASCII URLs and User-Agents in Python 2 (#2373)

Documentation

Document "download_latency" key in Request's meta dict (#2033)

Remove page on (deprecated & unsupported) Ubuntu packages from ToC (#2335)

A few fixed typos (#2346, #2369, #2369, #2380) and clarifications (#2354, #2325, #2414)

Other changes

Advertize conda-forge as Scrapy’s official conda channel (#2387)

More helpful error messages when trying to use .css() or .xpath() on non-Text Responses (#2264)

startproject command now generates a sample middlewares.py file (#2335)

Add more dependencies’ version info in scrapy version verbose output (#2404)

Remove all *.pyc files from source distribution (#2386)

1.2.1(Dec 8, 2016)
Bug fixes

Include OpenSSL’s more permissive default ciphers when establishing TLS/SSL connections (#2314).

Fix “Location” HTTP header decoding on non-ASCII URL redirects (#2321).

Documentation

Fix JsonWriterPipeline example (#2302).

Various notes: #2330 on spider names, #2329 on middleware methods processing order, #2327 on getting multi-valued HTTP headers as lists.

Other changes

Removed www. from start_urls in built-in spider templates (#2299).

1.2.0(Oct 3, 2016)
New Features

New FEED_EXPORT_ENCODING setting to customize the encoding used when writing items to a file. This can be used to turn off \uXXXX escapes in JSON output. This is also useful for those wanting something else than UTF-8 for XML or CSV output (#2034).
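
The \uXXXX escapes mentioned here are what Python's json module produces by default for non-ASCII text. The snippet below shows the two output styles using only the standard library; it illustrates the effect, not Scrapy's exporter code, and the sample item is made up.

```python
import json

item = {"name": "café"}

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(item)

# With ensure_ascii=False the characters are written as-is, so the result
# must later be encoded (for example as UTF-8) when written to a file.
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation differs.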

startproject command now supports an optional destination directory to override the default one based on the project name (#2005).

New SCHEDULER_DEBUG setting to log requests serialization failures (#1610).

JSON encoder now supports serialization of set instances (#2058).

Interpret application/json-amazonui-streaming as TextResponse (#1503).

scrapy is imported by default when using shell tools (shell, inspect_response) (#2248).

Bug fixes

DefaultRequestHeaders middleware now runs before UserAgent middleware (#2088). Warning: this is technically backwards incompatible, though we consider this a bug fix.

HTTP cache extension and plugins that use the .scrapy data directory now work outside projects (#1581). Warning: this is technically backwards incompatible, though we consider this a bug fix.

Selector does not allow passing both response and text anymore (#2153).

Fixed logging of wrong callback name with scrapy parse (#2169).

Fix for an odd gzip decompression bug (#1606).

Fix for selected callbacks when using CrawlSpider with scrapy parse (#2225).

Fix for invalid JSON and XML files when spider yields no items (#872).

Implement flush() for StreamLogger avoiding a warning in logs (#2125).

Refactoring

canonicalize_url has been moved to w3lib.url (#2168).

Tests & Requirements

Scrapy's new requirements baseline is Debian 8 "Jessie". It was previously Ubuntu 12.04 Precise. What this means in practice is that we run continuous integration tests with these (main) packages versions at a minimum: Twisted 14.0, pyOpenSSL 0.14, lxml 3.4.

Scrapy may very well work with older versions of these packages (the code base still has switches for older Twisted versions for example) but it is not guaranteed (because it's not tested anymore).

Documentation

Grammar fixes: #2128, #1566.

Download stats badge removed from README (#2160).

New scrapy architecture diagram (#2165).

Updated Response parameters documentation (#2197).

Reworded misleading RANDOMIZE_DOWNLOAD_DELAY description (#2190).

Add StackOverflow as a support channel (#2257).


Owner

Scrapy project

An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

GitHub Repository https://scrapy.org

This code will be able to scrape movies from a movie website and also provide download links to newly uploaded movies.

Movies-Scraper You are probably tired of navigating through a movie website to get the right movie you'd want to watch during the weekend. There may e

1 Jan 31, 2022

A web scraper that exports your entire WhatsApp chat history.

WhatSoup 🍲 A web scraper that exports your entire WhatsApp chat history. Table of Contents Overview Demo Prerequisites Instructions Frequen

87 Jan 06, 2023

SmartScraper: a simple, automatic, and fast Python web crawler

SmartScraper: a simple, automatic, and fast Python web crawler. Note: the original developer of SmartScraper is Alireza Mika; I only changed a little of AutoScraper's code. SmartScraper

9 Apr 16, 2022

This is my CS 20 final assessment.

eeeeeSpider This is my CS 20 final assessment. How to use: Open program Run to your heart's content! There are no external dependencies that you will ha

1 Jan 17, 2022

This is a python api to scrape search results from a url.

googlescrape Installation Installation is simple! # Stable version pip install googlescrape Examples from googlescrape import client scrapeClient=cli

1 Dec 15, 2022

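JD Cloud 无线宝 (Wuxianbao) points push notifications, with support for viewing points usage across multiple devices

JDRouterPush Project introduction: this project calls the JD Cloud 无线宝 API and can push a scheduled daily report of points earnings, helping you keep track of the key information. Changelog 2021-03-02: query the bound JD account; notification layout improvements; script update checks; support for Server酱 Turbo. 2021-02-25: implemented multi-device queries; query today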

199 Dec 12, 2022

Script used to download data for stocks.

This script is useful for downloading stock market data for a wide range of companies specified by their respective tickers. The script reads in the d

71 Oct 04, 2022

🥫 The simple, fast, and modern web scraping library

About gazpacho is a simple, fast, and modern web scraping library. The library is stable, actively maintained, and installed with zero dependencies. I

692 Dec 22, 2022

A Python module to bypass Cloudflare's anti-bot page.

cloudscraper A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests.

2.6k Dec 31, 2022

This tool can be used to extract information from any website

WEB-INFO- This tool can be used to extract information from any website Install Termux and run the command --- $ apt-get update $ apt-get upgrade $ pk

1 Oct 24, 2021

A database scraper created with mechanical soup and sqlite

WebscrapingDatabases a database scraper created with mechanical soup and sqlite author: Mariya Sha Watch on YouTube: This repository was created to su

30 Aug 08, 2022

A distributed crawler for Weibo, built with Celery and Requests.

4.8k Jan 03, 2023

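iQIYI membership, Tencent Video, Bilibili, Baidu, and other daily check-ins

My-Actions A personal collection of assorted check-in scripts adapted for GitHub Actions. Please don't fork; a ⭐️ star is enough. Usage: create a new repository and sync the code; click Settings - Secrets - click the green button (if there is no green button, it is already activated; go straight to the next step); add a new secret and set Secr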

280 Dec 30, 2022

Unja is a fast & light tool for fetching known URLs from Wayback Machine

Unja Fetch Known Urls What's Unja? Unja is a fast & light tool for fetching known URLs from Wayback Machine, Common Crawl, Virus Total & AlienVault's

10 Aug 07, 2022

Web Scraping Framework

Grab Framework Documentation Installation $ pip install -U grab See details about installing Grab on different platforms here http://docs.grablib.

2.3k Jan 04, 2023

HTML Content / Article Extractor, web scraping lib in Python

Python-Goose - Article Extractor Intro Goose was originally an article extractor written in Java that has most recently (Aug2011) been converted to a

3.8k Jan 02, 2023

FilmMikirAPI - A simple REST API for scraping the Kincir website, built with Python and the Flask package

1 Nov 17, 2022

A package that provides you Latest Cyber/Hacker News from website using Web-Scraping.

cybernews A package that provides you Latest Cyber/Hacker News from website using Web-Scraping. Latest Cyber/Hacker News Using Webscraping Developed b

4 Jun 02, 2022

Crawl the information of a given keyword on Google search engine

4 Nov 09, 2022

simple http & https proxy scraper and checker

11 Nov 15, 2021
