A client interface for Scrapinghub's API

Overview

Client interface for Scrapinghub API

https://secure.travis-ci.org/scrapinghub/python-scrapinghub.svg?branch=master

The scrapinghub is a Python library for communicating with the Scrapinghub API.

Requirements

  • Python 2.7 or above

Installation

The quick way:

pip install scrapinghub

You can also install the library with MessagePack support, it provides better response time and improved bandwidth usage:

pip install scrapinghub[msgpack]

Documentation

Documentation is available online via Read the Docs or in the docs directory.

Comments
  • msgpack errors when using iter() with intervals between each batch call

    msgpack errors when using iter() with intervals between each batch call

    Good Day!

    I've encountered this peculiar issue when trying to save up memory by processing the items in chunks. Here's a strip down version of the code for reproduction of the issue:

    import pandas as pd
    
    from scrapinghub import ScrapinghubClient
    
    def read_job_items_by_chunk(jobkey, chunk=10000):
        """In order to prevent OOM issues, the jobs' data must be read in
        chunks.
    
        This will return a generator of pandas DataFrames.
        """
    
        client = ScrapinghubClient("APIKEY123")
    
        item_generator = client.get_job(jobkey).items.iter()
    
        while item_generator:
            yield pd.DataFrame(
                [next(item_generator) for _ in range(chunk)]
            )
    
    for df_chunk in read_job_items_by_chunk('123/123/123'):
        # having a small chunk-size like 10000 won't have any problems
    
    for df_chunk in read_job_items_by_chunk('123/123/123', chunk=25000):
        # having a bug chunk-size like 25000 will throw out errors like the one below
    

    Here's the common error it throws:

    <omitted stack trace above>
    
        [next(item_generator) for _ in range(chunk)]
      File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
        _path, requests_params, **apiparams
      File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
        for obj in unpacker:
      File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
      File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
      File "/usr/local/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 67: invalid start byte
    

    Moreover, it throws out a different error when using a much bigger chunk-size, like 50000:

    <omitted stack trace above>
    
        [next(item_generator) for _ in range(chunk)]
      File "/usr/local/lib/python2.7/site-packages/scrapinghub/client/proxy.py", line 115, in iter
        _path, requests_params, **apiparams
      File "/usr/local/lib/python2.7/site-packages/scrapinghub/hubstorage/serialization.py", line 33, in mpdecode
        for obj in unpacker:
      File "msgpack/_unpacker.pyx", line 459, in msgpack._unpacker.Unpacker.__next__ (msgpack/_unpacker.cpp:459)
      File "msgpack/_unpacker.pyx", line 390, in msgpack._unpacker.Unpacker._unpack (msgpack/_unpacker.cpp:390)
    TypeError: unhashable type: 'dict'
    

    I find that the workaround/solution for this would be to have a lower value for chunk. So far, 1000 works great.

    This uses scrapy:1.5 stack in Scrapy Cloud.

    I'm guessing this might have something to do with the long waiting time that happens when processing the pandas DataFrame chunk, and when the next batch of items are being iterated, the server might have deallocated the pointer to it or something.

    May I ask if there might be a solution for this? since a much bigger chunk size will help with the speed of our jobs.

    I've marked it as bug for now as this is quite an unexpected/undocumented behavior.

    Cheers!

    bug 
    opened by BurnzZ 10
  • basic py3.3 compatibility while keeping py2.7 compatibility

    basic py3.3 compatibility while keeping py2.7 compatibility

    makes the api callable from py2.x and py3.x. Since scrapy itself is not yet python3 compatible this might be still useful if one has a control application/api written in py3 which should be able to control scrapy crawlers.

    opened by ms5 9
  • UnicodeDecodeError while fetching items

    UnicodeDecodeError while fetching items

    It seems like I randomly get errors like this:

     UnicodeDecodeError: 'utf-8' codec can't decode byte 0xde in position 174: invalid continuation byte
    
            at msgpack._cmsgpack.Unpacker._unpack (_unpacker.pyx:443)
            at msgpack._cmsgpack.Unpacker.__next__ (_unpacker.pyx:518)
            at mpdecode (/usr/local/lib/python3.7/site-packages/scrapinghub/hubstorage/serialization.py:33)
            at iter (/usr/local/lib/python3.7/site-packages/scrapinghub/client/proxy.py:115) 
    

    This happens while iterating the items through last_job.items.iter() It seems to happen about 50% of the time from what I see. I scrape the same website every day and run that function and sometimes it works fine, sometimes raise that error. I am not sure if this is an issue with this library or with the ScrapingHub API though but it is very problematic.

    This happens on the latest (2.3.1) version

    opened by mijamo 8
  • Use SHUB_JOBAUTH environment variable in utils.parse_auth method

    Use SHUB_JOBAUTH environment variable in utils.parse_auth method

    Currently, the parse_auth method tries to get the API Key from the SH_APIKEY environment variable, which needs to be manually set either in spider's code or in the [docker] image's code. A regular practice is to create dummy users and associate them with the project so that real contributors don't have to share their API Keys.

    Another option is to use the credentials provided by the SHUB_JOBAUTH, defined during runtime when executing jobs in the Scrapy Cloud platform.

    Although it's possible to use Collections and Frontera, this is not a regular Dash API Key but a JWT Token generated in runtime by JobQ service, which works only for a part of our API endpoints (JobQ/Hubstorage).

    I'd like to contribute with a Pull Request adding support for this ephemeral API Key.

    opened by victor-torres 8
  • Avoid races for hubstorage frontier tests

    Avoid races for hubstorage frontier tests

    Looks like there're races in sh.hubstorage.frontier tests, it's relatively easy to reproduce by rerunning the Travis job (I can't reproduce it locally) https://travis-ci.org/scrapinghub/python-scrapinghub/jobs/172664296

    After checking internals, my guess is that as Batchuploader works in a separate thread trying to upload next messages batch from queue, and frontier.flush() operation waits for the queue being empty (by doing queue.join()), there's a probability that on context switch the queue is empty, but Batchuploader hasn't called callback yet, and the frontier.newcounter is not updated yet. In this case simple short delay should fix this, at least I wasn't able to reproduce the issue after the fix.

    Could you please confirm/disapprove my finding please?

    opened by vshlapakov 7
  • Add truncate method to collections

    Add truncate method to collections

    This makes possible to delete an entire Collection with a single API request, without having to iterate through records and therefore making multiple API requests.

    opened by victor-torres 6
  • How to run a job?

    How to run a job?

    I can't see how to run a job. There's two examples in the docs. In the project section:

    For example, to schedule a spider run (it returns a job object):
    
    >>> project.jobs.run('spider1', job_args={'arg1':'val1'})
    <scrapinghub.client.Job at 0x106ee12e8>>
    
    

    and in the spider section:

    Like project instance, spider instance has jobs field to work with the spider's jobs.
    
    To schedule a spider run:
    
    >>> spider.jobs.run(job_args={'arg1:'val1'})
    <scrapinghub.client.Job at 0x106ee12e8>>
    

    Neither works, both throw AttributeError: 'Jobs' object has no attribute 'run'

    opened by ollieglass 6
  • Some imports from standard lib collections are breaking on python 3.10

    Some imports from standard lib collections are breaking on python 3.10

    Hi everyone,

    Based on an issue from another repo (https://github.com/okfn-brasil/querido-diario/issues/502), I noticed that scrapinghub is using some imports from standard lib collections that are deprecated and not working on Python 3.10.

    In Python 3.8 I have these results on ipython console:

    In [1]: from collections import Iterator
    <ipython-input-1-4fb967d2a9f8>:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
      from collections import Iterator
    
    In [2]: from collections import Iterable
    <ipython-input-2-c0513a1e6784>:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
      from collections import Iterable
    
    In [3]: from collections import MutableMapping
    <ipython-input-3-069a7babadbf>:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.9 it will stop working
      from collections import MutableMapping
    

    According to this, it is necessary to change the imports of Iterable, Iterator and MutableMapping to get these items from "collections.abc" instead of just "collections"

    Here are the list of imports that I found:

    • tests/client/test_job.py - from collections import Iterator
    • tests/client/test_frontiers.py - from collections import Iterable
    • tests/client/test_projects.py - from collections import defaultdict, Iterator
    • scrapinghub/hubstorage/resourcetype.py - from collections import MutableMapping
    opened by lbmendes 5
  • Collections key not found with library

    Collections key not found with library

    I'm curious about the difference between Collection.get() and Collection.iter(key=[KEY])

    >>> key = '456/789'
    >>> store = project.collections.get_store('trump')
    >>> store.set({'_key': key, 'value': 'abc'})
    >>> print(store.list(key=[key]))
    
    [{'value': 'abc', '_key': '456/789'}]  # https://storage.scrapinghub.com/collections/9328/s/trump?key=456%2F789&meta=_key
    
    >>> try:
    >>>     print(store.get(key))
    >>> except scrapinghub.client.exceptions.NotFound as e:
    >>>     print(getattr(e, 'http_error', e))
    
    404 Client Error: Not Found for url: https://storage.scrapinghub.com/collections/9328/s/trump/456/789
    

    I assume that Collection.get() is a handy shortcut for the key-filtered .iter() function so I guess the point of my issue is that .get() will raise an exception if given bad input, for example slashes

    opened by stav 5
  • project.jobs close_reason support needed

    project.jobs close_reason support needed

    I would like to get the last "finished" job for a spider.

    But if I do:

    project.jobs(spider='myspider', state='finished', count=-1)
    

    I will only get jobs with a state of finished but this may include jobs with a close_reason of shutdown or something other than "finished".

    I would like to be able to do:

    project.jobs(spider='myspider', close_reason='finished', count=-1)
    

    which would of course assume that state is finished as well.

    opened by stav 5
  • Drop versions earlier than Python 3.7 and update requirements

    Drop versions earlier than Python 3.7 and update requirements

    library upgrades

    This update nominally drops support for Python 2.7, 3.5, and 3.6, and tests support for 3.10 to avoid libraries being pinned to very old versions, many of them with bugs or with security issues.

    It's "nominally" because the code hasn't been changed except for deprecations enforced in Python 3.9 or 3.10.

    disabled tests

    Tests that required running test servers are disabled:

    • Running the servers locally is too complicated
    • There are no changes to the library's logic. Only required library versions were changed

    The tests can be re-enabled by someone with access to test servers.

    maintenance 
    opened by apalala 4
  • Add retry logic to Job Tag Update function

    Add retry logic to Job Tag Update function

    Description

    An Internal Server Error pops up whenever a large number of tag updates run parallel or sequentially.

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/project-1.0-py3.10.egg/XX/utils/workflow/__init__.py", line 930, in run
        start_stage, active_stage, ran_stages = self.setup_continuation(
      File "/usr/local/lib/python3.10/site-packages/project-1.0-py3.10.egg/XX/utils/workflow/__init__.py", line 667, in setup_continuation
        self._discard_jobs(start_stage, ran_stages)
      File "/usr/local/lib/python3.10/site-packages/project-1.0-py3.10.egg/XX/utils/workflow/__init__.py", line 705, in _discard_jobs
        self.get_job(jobinfo["key"]).update_tags(
      File "/usr/local/lib/python3.10/site-packages/scrapinghub/client/[jobs.py](http://jobs.py/)", line 503, in update_tags
        self._client._connection._post('jobs_update', 'json', params)
      File "/usr/local/lib/python3.10/site-packages/scrapinghub/[legacy.py](http://legacy.py/)", line 120, in _post
        return self._request(url, params, headers, format, raw, files)
      File "/usr/local/lib/python3.10/site-packages/scrapinghub/client/[exceptions.py](http://exceptions.py/)", line 98, in wrapped
        raise ServerError(http_error=exc)
    scrapinghub.client.exceptions.ServerError: Internal server error
    

    This is not a problem if you are doing updates for a couple of jobs, but if you want to mass update this error will pop up eventually.

    Adding adaptable retry logic to the update_tags function through that ServerError exception would make it easier to debug and implement large-scale workflows.

    opened by ftadao 0
  • Incorrect information for Samples in Job documentation

    Incorrect information for Samples in Job documentation

    At this link - https://python-scrapinghub.readthedocs.io/en/latest/client/overview.html#job-data-1, the description for samples refer to the job stats which is confusing and seems incorrect.

    I think it should be runtime samples that the job uploaded;

    Please correct me if I have misinterpreted this.

    image

    opened by gutsytechster 0
  • Jobs.iter() is unable to accept has_tag as a list.

    Jobs.iter() is unable to accept has_tag as a list.

    From the docs:

    jobs_summary = project.jobs.iter( ... has_tag=['new', 'verified'], lacks_tag='obsolete')

    has_tag accepts a string but no a list. lacks_tag works perfectly fine with both.

    opened by PeteRoyAlex 0
  • KeyError: 'status' when trying to schedule spider

    KeyError: 'status' when trying to schedule spider

    I am getting this error when trying to schedule a spider. This is happening with version 2.3.1

    Traceback (most recent call last):
      File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/legacy.py", line 157, in _decode_response
        if data['status'] == 'ok':
    KeyError: 'status'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/client/exceptions.py", line 69, in wrapped
        return method(*args, **kwargs)
      File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/client/__init__.py", line 19, in _request
        return super(Connection, self)._request(*args, **kwargs)
      File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/legacy.py", line 143, in _request
        return self._decode_response(response, format, raw)
      File "/home/molveyra/.local/share/virtualenvs/mollie-AtuAN_AE/lib/python3.8/site-packages/scrapinghub/legacy.py", line 169, in _decode_response
        raise APIError("JSON response does not contain status")
    scrapinghub.legacy.APIError: JSON response does not contain status
    
    enhancement 
    opened by kalessin 6
  • collections.get_store is not working as documented

    collections.get_store is not working as documented

    Upon going through collections doc we see 2. call .get_store(<somename>) to create or access the named collection you want (the collection will be created automatically if it doesn't exist) ; you get a "store" object back, But when you try this

    >>> store = collections.get_store('store_which_does_not_exist')
    >>> store.get('key_which_does_not_exist')
    DEBUG:https://storage.scrapinghub.com:443 "GET /collections/462630/s/store_which_does_not_exist/key_which_does_not_exist HTTP/1.1" 404 46
    2021-02-04 13:33:20 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "GET /collections/462630/s/store_which_does_not_exist/key_which_does_not_exist HTTP/1.1" 404 46
    DEBUG:<Response [404]>: b'unknown collection store_which_does_not_exist\n'
    2021-02-04 13:33:20 [HubstorageClient] DEBUG: <Response [404]>: b'unknown collection store_which_does_not_exist\n'
    *** scrapinghub.client.exceptions.NotFound: unknown collection store_which_does_not_exist
    

    When we .set some value to store which doesn’t exist, store is created and then the values are stored.

    >>> store.set({'_key': 'some_key', 'value': 'some_value'})
    DEBUG:https://storage.scrapinghub.com:443 "POST /collections/462630/s/store_which_does_not_exist HTTP/1.1" 200 0
    2021-02-04 13:36:56 [urllib3.connectionpool] DEBUG: https://storage.scrapinghub.com:443 "POST /collections/462630/s/store_which_does_not_exist HTTP/1.1" 200 0
    According to docs, shouldn’t the store be created when we call .get_store ?
    
    bug docs 
    opened by realslimshanky-sh 0
Releases(2.4.0)
  • 2.4.0(Mar 10, 2022)

    What's Changed

    • update iter() for better fallback in getting 'meta' argument by @BurnzZ in https://github.com/scrapinghub/python-scrapinghub/pull/146
    • switch from Travis to GH actions by @pawelmhm in https://github.com/scrapinghub/python-scrapinghub/pull/162
    • Python 3.10 compatibility by @elacuesta in https://github.com/scrapinghub/python-scrapinghub/pull/166

    New Contributors

    • @pawelmhm made their first contribution in https://github.com/scrapinghub/python-scrapinghub/pull/162

    Full Changelog: https://github.com/scrapinghub/python-scrapinghub/compare/2.3.1...2.4.0

    Source code(tar.gz)
    Source code(zip)
  • 2.3.1(Mar 13, 2020)

  • 2.3.0(Dec 17, 2019)

  • 2.1.1(Apr 25, 2019)

    • add Python 3.7 support
    • update msgpack dependency
    • fix iter logic for items/requests/logs
    • add truncate method to collections
    • improve documentation
    Source code(tar.gz)
    Source code(zip)
  • 2.2.1(Aug 7, 2019)

  • 2.2.0(Aug 7, 2019)

  • 2.1.0(Jan 14, 2019)

    • add an option to schedule jobs with custom environment variables
    • fallback to SHUB_JOBAUTH environment variable if SH_APIKEY is not set
    • provide a unified connection timeout used by both internal clients
    • increase a chunk size when working with the items stats endpoint

    Python 3.3 is considered unmaintained.

    Source code(tar.gz)
    Source code(zip)
  • 2.0.0(Mar 29, 2017)

    We're very happy to finally announce official major release of new Scrapinghub python client. Documentation is available online via Read The Docs http://python-scrapinghub.readthedocs.io/

    Source code(tar.gz)
    Source code(zip)
  • 1.9.0(Nov 29, 2016)

    python-scrapinghub 1.9.0

    • python-hubstorage merged into python-scrapinghub
    • all tests are improved and rewritten with py.test
    • hubstorage tests use vcrpy cassettes, work faster and don't require any external services to run

    python-hubstorage is going to be considered deprecated, its next version will contain a deprecation warning and a proposal to use python-scrapinghub >=1.9.0 instead.

    Source code(tar.gz)
    Source code(zip)
Owner
Scrapinghub
Turn web content into useful data
Scrapinghub
A bot that updates about the most subscribed artist' channels on YouTube

A bot that updates about the most subscribed artist' channels on YouTube. A weekly top chart report is provided every Monday. It posts updates on Twitter

Marco Fantauzzo 5 Dec 14, 2022
对hermit 的API进行简单的封装,做成了这个python moudle

hermit-py 对hermit 的API进行简单的封装,做成了这个Python Moudle,推荐通过wheel的方式安装。 目前对点击、滑动、模拟输入、找组件、等支持较好,支持查看页面的实时布局信息,再通过布局信息进行点击滑动等操作。 支持剪贴板相关的操作,支持设置剪贴的任意语言内容。

LookCos 40 Jun 25, 2022
troposphere - Python library to create AWS CloudFormation descriptions

troposphere - Python library to create AWS CloudFormation descriptions

4.8k Jan 06, 2023
This script books automatically a slot on Doctolib in one of the public vaccination centers in Berlin.

BOOKING IN BERLINS VACCINATION CENTERS This python script books automatically a slot on Doctolib in one of the public vaccination centers in Berlin. T

17 Jan 13, 2022
Linky bot, A open-source discord bot that allows you to add links to ur website, youtube url, etc for the people all around discord to see!

LinkyBot Linky bot, An open-source discord bot that allows you to add links to ur website, youtube url, etc for the people all around discord to see!

AlexyDaCoder 1 Sep 20, 2022
A Slash Commands Discord Bot created using Pycord!

Hey, I am Slash Bot. A Bot which works with Slash Commands! Prerequisites Python 3+ Check out. the requirements.txt and install all the pakages. Insta

Saumya Patel 18 Nov 15, 2022
Automatically changes your discord status

Automatically changes your discord status, Be careful as this may get you rate limited and banned

octo 5 Sep 20, 2022
A thin Python Wrapper for the Dark Sky (formerly forecast.io) weather API

Dark Sky Wrapper This is a wrapper for the Dark Sky (formerly forecast.io) API. It allows you to get the weather for any location, now, in the past, o

Ze'ev Gilovitz 414 Nov 16, 2022
HASOKI DDOS TOOL- powerful DDoS toolkit for penetration tests

DDoS Attack Panel includes CloudFlare Bypass (UAM, CAPTCHA, GS ,VS ,BFM, etc..) This is open source code. I am not responsible if you use it for malic

Rebyc 1 Dec 02, 2022
An Anime Themed Fast And Safe Group Managing Bot.

Ξ L I N Λ 👸 A Powerful, Smart And Simple Group Manager bot Avaiilable a latest version as Ξ L I N Λ 👸 on Telegram Self-hosting (For Devs) vps # Inst

7 Nov 12, 2022
Send alert to telegram use telegram cli

Run standalone: Rename conf.yml.example to conf.yml Change block cli(Add your api_id and hash) Install requirements.txt Run python AlertManagerTG.py I

Eugene Arkharov 1 Nov 12, 2021
Telegram Bot for generating and decoding QR-codes

Telegram openqrgen_bot Telegram Bot that generates from user's messages and decodes QR-codes from photos. Also contains rickroll detection :) Just typ

2 Nov 14, 2021
An information scroller Twitter trends, news, weather for raspberry pi and Pimoroni Unicorn Hat Mini and Scroll Phat HD.

uticker An information scroller Twitter trends, news, weather for raspberry pi and Pimoroni Unicorn Hat Mini and Scroll Phat HD. Features include: Twi

Tansu Şenyurt 1 Nov 28, 2021
A Twitch bot to provide information from the WebNowPlayingCompanion extension

WebNowPlayingTwitch A Twitch bot made using TwitchIO which displays information obtained from the WebNowPlaying Compaion browser extension. Image is o

NiceAesth 1 Mar 21, 2022
Python wrapper for the GitLab API

Python GitLab python-gitlab is a Python package providing access to the GitLab server API. It supports the v4 API of GitLab, and provides a CLI tool (

1.9k Dec 31, 2022
Sakura: an powerfull Autofilter bot that can be used in your groups

Sakura AutoFilter This Bot May Look Like Mwk_AutofilterBot And Its Because I Like Its UI, That's All Sakura is an powerfull Autofilter bot that can be

PaulWalker 12 Oct 21, 2022
Project for the discipline of Visual Data Analysis at EMAp FGV.

Analysis of the dissemination of fake news about COVID-19 on Twitter This project was the final work for the discipline of Visual Data Analysis of the

Giovani Valdrighi 2 Jan 17, 2022
A delightful and complete interface to GitHub's amazing API

ghapi A delightful and complete interface to GitHub's amazing API ghapi provides 100% always-updated coverage of the entire GitHub API. Because we aut

fast.ai 428 Jan 08, 2023
A Telegram Bot for searching any channel messages from Inline by @AbirHasan2005

Message-Search-Bot A Telegram Bot for searching any channel messages from Inline by @AbirHasan2005. I made this for @AHListBot. You can use this for s

Abir Hasan 44 Dec 27, 2022
Easy to use API Wrapper for somerandomapi.ml.

Overview somerandomapi is an API Wrapper for some-random-api.ml Examples Asynchronous from somerandomapi import Animal import asyncio async def main

Myxi 1 Dec 31, 2021