Overview

Scrapy Cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.

The goal is to distribute seed URLs among many waiting spider instances, whose requests are coordinated via Redis. Any other crawls those trigger, as a result of frontier expansion or depth traversal, will also be distributed among all workers in the cluster.

The input to the system is a set of Kafka topics and the output is a set of Kafka topics. Raw HTML and assets are crawled interactively, spidered, and output to the log. For easy local development, you can also disable the Kafka portions and work with the spider entirely via Redis, although this is not recommended due to the serialization of the crawl requests.
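
For example, here is a minimal sketch of submitting a crawl request straight onto the incoming Kafka topic with kafka-python. The demo.incoming topic name and the localhost:9092 broker address are the defaults that appear later on this page; adjust both for your deployment.

    import json
    from kafka import KafkaProducer

    # Serialize crawl requests as JSON, matching the request format shown later in this README
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    request = {"url": "http://dmoztools.net", "appid": "testapp", "crawlid": "abc123"}
    producer.send("demo.incoming", request)  # the cluster's incoming topic
    producer.flush()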

Dependencies

Please see the requirements.txt within each sub project for Pip package dependencies.

Other important components required to run the cluster include Redis, Zookeeper, and Kafka.

Core Concepts

This project tries to bring together a bunch of new concepts to Scrapy and large scale distributed crawling in general. Some bullet points include:

  • The spiders are dynamic and on demand, meaning that they allow the arbitrary collection of any web page that is submitted to the scraping cluster
  • Scale Scrapy instances across a single machine or multiple machines
  • Coordinate and prioritize their scraping effort for desired sites
  • Persist data across scraping jobs
  • Execute multiple scraping jobs concurrently
  • Allows for in depth access into the information about your scraping job, what is upcoming, and how the sites are ranked
  • Allows you to arbitrarily add/remove/scale your scrapers from the pool without loss of data or downtime
  • Utilizes Apache Kafka as a data bus for any application to interact with the scraping cluster (submit jobs, get info, stop jobs, view results)
  • Allows for coordinated throttling of crawls from independent spiders on separate machines, but behind the same IP Address
  • Enables completely different spiders to yield crawl requests to each other, giving flexibility to how the crawl job is tackled

Scrapy Cluster test environment

To set up a pre-canned Scrapy Cluster test environment, make sure you have Docker and Docker Compose installed.

Steps to launch the test environment:

  1. Build your containers (or omit --build to pull from Docker Hub)
docker-compose up -d --build
  2. Tail Kafka to view your future results
docker-compose exec kafka_monitor python kafkadump.py dump -t demo.crawled_firehose -ll INFO
  3. From another terminal, feed a request to Kafka
curl localhost:5343/feed -H "content-type:application/json" -d '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc123"}'
  4. Validate that you've got data!
# wait a couple of seconds; your terminal from step 2 should dump JSON data
{u'body': '...content...', u'crawlid': u'abc123', u'links': [], u'encoding': u'utf-8', u'url': u'http://dmoztools.net', u'status_code': 200, u'status_msg': u'OK', u'response_url': u'http://dmoztools.net', u'request_headers': {u'Accept-Language': [u'en'], u'Accept-Encoding': [u'gzip,deflate'], u'Accept': [u'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'], u'User-Agent': [u'Scrapy/1.5.0 (+https://scrapy.org)']}, u'response_headers': {u'X-Amz-Cf-Pop': [u'IAD79-C3'], u'Via': [u'1.1 82c27f654a5635aeb67d519456516244.cloudfront.net (CloudFront)'], u'X-Cache': [u'RefreshHit from cloudfront'], u'Vary': [u'Accept-Encoding'], u'Server': [u'AmazonS3'], u'Last-Modified': [u'Mon, 20 Mar 2017 16:43:41 GMT'], u'Etag': [u'"cf6b76618b6f31cdec61181251aa39b7"'], u'X-Amz-Cf-Id': [u'y7MqDCLdBRu0UANgt4KOc6m3pKaCqsZP3U3ZgIuxMAJxoml2HTPs_Q=='], u'Date': [u'Tue, 22 Dec 2020 21:37:05 GMT'], u'Content-Type': [u'text/html']}, u'timestamp': u'2020-12-22T21:37:04.736926', u'attrs': None, u'appid': u'testapp'}
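
Once the crawl data appears, you can also ask the cluster about the job itself. Below is a hedged sketch, assuming the REST /feed endpoint accepts the same info-action JSON that kafka_monitor.py feed accepts (the format appears in the comments further down); the uuid value is arbitrary and chosen by the caller.

    import requests

    # Ask the cluster for info about the crawl submitted in step 3
    info = {
        "action": "info",
        "appid": "testapp",
        "uuid": "someuuid",
        "crawlid": "abc123",
        "spiderid": "link",
    }
    resp = requests.post("http://localhost:5343/feed", json=info)
    print(resp.json())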

Documentation

Please check out the official Scrapy Cluster documentation for more information on how everything works!

Branches

The master branch of this repository contains the latest stable release code for Scrapy Cluster 1.2.

The dev branch contains bleeding edge code and is currently working towards Scrapy Cluster 1.3. Please note that not everything may be documented, finished, tested, or finalized but we are happy to help guide those who are interested.

Comments
  • CentOS 7 Compatibility. Merged with dev branch.

    The Travis build is running and all tests are passing for both OSes (CentOS 7 + Ubuntu Trusty) inside Docker.

    I am getting one error log that is causing the build to fail in my fork:

    You have to provide either repo_token in .coveralls.yml, or launch via Travis or CircleCI

    It looks like https://coveralls.io/ requires all Coveralls calls to come from the istresearch/scrapy-cluster fork.

    Let me know if anything is required to do the merge.

    opened by knirbhay 22
  • UI service MVP

    Pull request to provide the first steps to satisfying #25.

    A simple AngularJS + Flask application gives the user a UI to check the status of the Scrapy Cluster and easily submit crawl requests.

    The status checks and crawl requests are provided via the REST service.

    opened by damienkilgannon 20
  • Scrapy-Cluster UI

    PR to merge my UI branch into IST UI branch. For discussions and collaboration.

    This code has been lingering around on my computer for a while now; I think it's about time I share it and try to get it to a place where it can be merged in. Further work is still required on testing, but the core pieces of the ui_service are in place.

    opened by damienkilgannon 17
  • 1.1 Troubles

    Having a bit of trouble getting started. Below I've included commands and their outputs (note: some outputs are truncated):

    python kafka_monitor.py run
    2015-12-06 19:59:00,030 [kafka-monitor] INFO: Kafka Monitor Stats Dump:
    {
        "fail_21600": 0,
        "fail_3600": 0,
        "fail_43200": 0,
        "fail_604800": 0,
    ....
        "plugin_StatsHandler_lifetime": 0,
        "total_21600": 13,
        "total_3600": 13,
        "total_43200": 13,
        "total_604800": 13,
        "total_86400": 13,
        "total_900": 1,
        "total_lifetime": 13
    }
    
    python redis_monitor.py
    ....
        "total_604800": 6,
        "total_86400": 6,
        "total_900": 0,
        "total_lifetime": 6
    }
    2015-12-06 20:02:39,862 [redis-monitor] INFO: Crawler Stats Dump:
    {
        "total_spider_count": 0
    }
    
    
    scrapy runspider crawling/spiders/link_spider.py
    2015-12-06 19:56:46,817 [scrapy-cluster] INFO: Changed Public IP: None -> 52.91.192.73
    
    (scrapy_dev)[email protected]:~/scrapy-cluster/kafka-monitor$ python kafka_monitor.py feed '{"url": "http://dmoz.org", "appid":"testapp", "crawlid":"abc1234", "maxdepth":1}'
    No override settings found
    2015-12-06 19:58:44,573 [kafka-monitor] INFO: Feeding JSON into demo.incoming
    {
        "url": "http://dmoz.org",
        "maxdepth": 1,
        "crawlid": "abc1234",
        "appid": "testapp"
    }
    2015-12-06 19:58:44,580 [kafka-monitor] INFO: Successly fed item to Kafka
    
    python kafkadump.py dump -t demo.crawled_firehose
    
    
    (scrapy_dev)[email protected]:~/scrapy-cluster/kafka-monitor$ python kafkadump.py dump -t demo.outbound_firehose
    No override settings found
    2015-12-06 19:35:31,640 [kafkadump] INFO: Connected to localhost:9092
    {u'server_time': 1449430706, u'crawlid': u'abc1234', u'total_pending': 0, u'total_domains': 0, u'spiderid': u'link', u'appid': u'testapp', u'domains': {}, u'uuid': u'someuuid'}
    

    I haven't changed any of the default settings and I'm currently using the dev branch. However, I don't think my setup is working. I was expecting some updates in dump -t demo.crawled_firehose. So while I think I've successfully fed a URL to be crawled, Scrapy isn't doing the crawl? Any ideas?

    opened by quasiben 17
  • No output when dumping incoming or outbound_firehose

    I'm attempting to get started with 1.2.1 in Docker. I've downloaded the project and followed the Docker instructions in Getting Started. When doing the first scrape I can dump and get output from the crawl, but not from demo.incoming or demo.outbound_firehose.

    I don't think this is related but I ran into compatibility issues with the latest Kafka image so I set the version to 1.0.0 in the docker-compose.yml which seemed to be the latest when 1.2.1 was released. This got me past that issue. It's the only change I've made to the project.

    Also all the tests pass in the docker images. However in the redis monitor on the first run I get:

    OK
    test_process_item (__main__.TestRedisMonitor) ... No handlers could be found for logger "redis_lock"
    ok
    

    My steps are:

    1. docker-compose up -d
    2. [terminal 1] docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.crawled_firehose
    3. [terminal 2] docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.incoming
    4. [terminal 3] docker exec -i scrapycluster121_kafka_monitor_1 python kafkadump.py dump -t demo.outbound_firehose
    5. [terminal 4] docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"url": "http://dmoztools.net", "appid":"testapp", "crawlid":"abc1234", "maxdepth":1}'
    6. [terminal 4] docker exec -i scrapycluster121_kafka_monitor_1 python kafka_monitor.py feed '{"action":"info", "appid":"testapp", "uuid":"someuuid", "crawlid":"abc1234", "spiderid":"link"}'

    After step 5 I start getting Scrapy output on terminal 1. I never get output on terminal 2 or 3.

    opened by cliff-km 14
  • Scutils log callbacks

    This PR provides a starting point for registering callbacks using the LogFactory. This PR addresses Issue #91

    Usage

    Given a logging object logger, you can register a callback via

    logger.register_callback(log_level, callback_function, optional_criteria_dict)
    

    Some examples:

    logger.register_callback('ERROR', report)
    

    Explanation: The callback function report will fire when the .error() logging method is called

    logger.register_callback('<=INFO', add_1, {'key': 'val1'})
    

    Explanation: The callback function add_1 will fire when .debug() or .info() is called AND {'key': 'val1'} is a subdict of the extras dict passed to the logging functions.

    logger.register_callback('>INFO', negate, {'key': 'val2'})
    

    Explanation: The callback function negate will fire when .warning(), .error(), or .critical() are called AND {'key': 'val2'} is a subdict of extras passed to the logging functions.

    logger.register_callback('*', always_fire)
    

    Explanation: The callback function always_fire will fire for all log levels with no concern of the extras dict passed to the logging functions.
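
    Putting the above together, a minimal runnable sketch of the proposed API (the report callback and its (message, extras) signature are illustrative assumptions, not part of the existing LogFactory):

    from scutils.log_factory import LogFactory

    # Hypothetical callback; the exact arguments passed to callbacks are an assumption
    def report(message, extras):
        print("error callback fired: " + message)

    logger = LogFactory.get_instance(json=False, stdout=True, level='DEBUG')
    logger.register_callback('ERROR', report)

    # Triggers the callback registered for the ERROR level
    logger.error('Something went wrong', extra={'key': 'val1'})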

    Testing

    $ python -m unittest tests.test_log_factory
    

    Notes

    The callbacks respect the log level. If the log level for a logger is CRITICAL then a .debug() invocation will not trigger the callbacks registered for CRITICAL.

    opened by joequery 12
  • First part of Docker images optimizations

    The Docker containers are now based on the official Python image, which is itself based on Alpine Linux. OS packages that are used only while building some Python packages are removed once the installation finishes. Each subproject also contains its own requirements.txt to decrease container size.

    As a result, the new image sizes are (with a shared Python layer of 71.95 MB):

    crawler: 144.6 MB (own layers: 72.65 MB)
    kafka-monitor: 91.95 MB (own layers: 20 MB)
    redis-monitor: 88.67 MB (own layers: 16.72 MB)
    

    In contrast to the previous sizes (with a shared Python layer of 675.1 MB):

    crawler-dev: 780 MB (own layers: 104.9 MB)
    redis-monitor-dev: 746.8 MB (own layers: 71.7 MB)
    kafka-monitor-dev: 746.8 MB (own layers: 71.7 MB)
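
    The pattern described above, as a hedged Dockerfile sketch rather than the project's actual Dockerfiles: build-only OS packages go into a named virtual group and are removed once pip finishes, and each subproject copies only its own requirements.txt. The package names and base tag here are illustrative.

    # Sketch only; not the real scrapy-cluster Dockerfile
    FROM python:2.7-alpine

    WORKDIR /usr/src/app
    COPY requirements.txt .

    # Install compilers just for the pip build, then drop them to keep the image small
    RUN apk add --no-cache --virtual .build-deps gcc musl-dev libffi-dev \
        && pip install --no-cache-dir -r requirements.txt \
        && apk del .build-deps

    COPY . .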
    
    opened by tarhan 12
  • Add python 3 support.

    Use the decode_responses option on the Redis client and the value_deserializer/value_serializer options on the Kafka client to handle the unicode problem. Also fix several syntax errors and update several test cases for Python 3. And since scrapy-cluster 1.2 uses ujson instead of pickle, I think no migration is needed.
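
    A brief sketch of those two client options, assuming redis-py and kafka-python; the hosts, ports, and topic name below are the defaults used elsewhere on this page, not the project's settings.

    import json
    import redis
    from kafka import KafkaConsumer

    # decode_responses=True makes redis-py return str instead of bytes under Python 3
    r = redis.StrictRedis(host='localhost', port=6379, decode_responses=True)

    # value_deserializer turns each Kafka message back into a dict in one place
    consumer = KafkaConsumer(
        'demo.crawled_firehose',
        bootstrap_servers='localhost:9092',
        value_deserializer=lambda m: json.loads(m.decode('utf-8')),
    )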

    ready to merge 
    opened by gas1121 11
  • _get_bin takes hours with queue size 1M.

    I'm scraping etsy.com and the queue size has become more than 1M. When I query for info/statistics it gets stuck in the _get_bin function in the scrapy-cluster/redis-monitor/plugins/info_monitor.py file. It also takes 500 MB of memory for the redis-monitor at that moment.

    1. What is the best way to keep queue size small?
    2. Perhaps _get_bin should be rewritten in a more efficient way to calculate statistics in the database.
    opened by yrik 11
  • Python 3 Support

    With Scrapy soon supporting Python 3, we should consider supporting it as well. At first glance, most of the functionality changes do not affect the code within, but I am sure more work needs to be done.

    roadmap 
    opened by madisonb 11
  • ImportError: No module named online

    test_feed (__main__.TestKafkaMonitor) ... ERROR
    test_run (__main__.TestKafkaMonitor) ... ERROR

    ======================================================================
    ERROR: test_feed (__main__.TestKafkaMonitor)

    Traceback (most recent call last):
      File "tests/online.py", line 56, in setUp
        self.kafka_monitor._load_plugins()
      File "/root/scrapy-cluster/kafka-monitor/kafka_monitor.py", line 75, in _load_plugins
        the_class = self._import_class(key)
      File "/root/scrapy-cluster/kafka-monitor/kafka_monitor.py", line 59, in _import_class
        m = __import__(cl[0:d], globals(), locals(), [classname])
    ImportError: No module named online

    ======================================================================
    ERROR: test_run (__main__.TestKafkaMonitor)

    Traceback (most recent call last):
      File "tests/online.py", line 56, in setUp
        self.kafka_monitor._load_plugins()
      File "/root/scrapy-cluster/kafka-monitor/kafka_monitor.py", line 75, in _load_plugins
        the_class = self._import_class(key)
      File "/root/scrapy-cluster/kafka-monitor/kafka_monitor.py", line 59, in _import_class
        m = __import__(cl[0:d], globals(), locals(), [classname])
    ImportError: No module named online


    Ran 2 tests in 0.600s

    opened by mohit0749 9
  • ui exception  No connection adapters were found

    When using UI mode, the following exception appears in the stack trace, although the browser works normally:

      File "ui_service.py", line 121, in _kafka_stats
        r = requests.post(self.settings['REST_HOST'] + "/feed", json=data)
      File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 110, in post
        return request('post', url, data=data, json=json, **kwargs)
      File "/usr/local/lib/python2.7/site-packages/requests/api.py", line 56, in request
        return session.request(method=method, url=url, **kwargs)
      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 488, in request
        resp = self.send(prep, **send_kwargs)
      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 603, in send
        adapter = self.get_adapter(url=request.url)
      File "/usr/local/lib/python2.7/site-packages/requests/sessions.py", line 685, in get_adapter
        raise InvalidSchema("No connection adapters were found for '%s'" % url)
    InvalidSchema: No connection adapters were found fo 5343

    opened by new-wxw 0
  • Upgrading the ELK stack

    Great project, thanks for sharing - and supporting for so long!

    I ran into a few problems running the ELK stack: the Elasticsearch container kept restarting with a java.lang.IllegalStateException (see docker-elk-logs.txt).

    I couldn't find the root cause for this, but in the end I switched to a later version of the ELK stack, v7.10, which gave good results, and used Filebeat rather than Logstash as there seemed to be more documentation around this use case. Not sure if this is a change you wanted to make to the project, but I have my files on a branch here and am happy to submit a pull request if you think it might be useful: https://github.com/4OH4/scrapy-cluster/tree/elk-update

    Haven't managed to properly import the Kibana dashboard configuration from export.json though - I guess a few things have changed between the different versions of Kibana.

    Cheers

    opened by 4OH4 2
  • TypeError: can't pickle thread.lock objects

    Hi.

    I don't know how often it happens or whether it has happened before, but one of my crawls failed with the error below. I ran a thousand requests and it happened to only one of them, but my crawl was still brought down. Here is the stack trace:

    2021-02-08 15:03:30 [scrapy.core.scraper] ERROR: Error downloading <GET https://pccomponentes-prod.mirakl.net/login>
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
        result = g.send(result)
      File "/usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py", line 66, in process_exception
        spider=spider)
      File "/usr/src/app/crawler/crawling/log_retry_middleware.py", line 89, in process_exception
        self._log_retry(request, exception, spider)
      File "/usr/src/app/crawler/crawling/log_retry_middleware.py", line 102, in _log_retry
        self.logger.error('Scraper Retry', extra=extras)
      File "/usr/src/app/crawler/scutils/log_factory.py", line 244, in error
        extras = self.add_extras(extra, "ERROR")
      File "/usr/src/app/crawler/scutils/log_factory.py", line 319, in add_extras
        my_copy = copy.deepcopy(dict)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 257, in _deepcopy_dict
        y[deepcopy(key, memo)] = deepcopy(value, memo)
      File "/usr/local/lib/python2.7/copy.py", line 190, in deepcopy
        y = _reconstruct(x, rv, 1, memo)
      File "/usr/local/lib/python2.7/copy.py", line 334, in _reconstruct
        state = deepcopy(state, memo)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 257, in _deepcopy_dict
        y[deepcopy(key, memo)] = deepcopy(value, memo)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 264, in _deepcopy_method
        return type(x)(x.im_func, deepcopy(x.im_self, memo), x.im_class)
      File "/usr/local/lib/python2.7/copy.py", line 190, in deepcopy
        y = _reconstruct(x, rv, 1, memo)
      File "/usr/local/lib/python2.7/copy.py", line 334, in _reconstruct
        state = deepcopy(state, memo)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 257, in _deepcopy_dict
        y[deepcopy(key, memo)] = deepcopy(value, memo)
      File "/usr/local/lib/python2.7/copy.py", line 190, in deepcopy
        y = _reconstruct(x, rv, 1, memo)
      File "/usr/local/lib/python2.7/copy.py", line 334, in _reconstruct
        state = deepcopy(state, memo)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 257, in _deepcopy_dict
        y[deepcopy(key, memo)] = deepcopy(value, memo)
      File "/usr/local/lib/python2.7/copy.py", line 190, in deepcopy
        y = _reconstruct(x, rv, 1, memo)
      File "/usr/local/lib/python2.7/copy.py", line 334, in _reconstruct
        state = deepcopy(state, memo)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 257, in _deepcopy_dict
        y[deepcopy(key, memo)] = deepcopy(value, memo)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 230, in _deepcopy_list
        y.append(deepcopy(a, memo))
      File "/usr/local/lib/python2.7/copy.py", line 190, in deepcopy
        y = _reconstruct(x, rv, 1, memo)
      File "/usr/local/lib/python2.7/copy.py", line 334, in _reconstruct
        state = deepcopy(state, memo)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 257, in _deepcopy_dict
        y[deepcopy(key, memo)] = deepcopy(value, memo)
      File "/usr/local/lib/python2.7/copy.py", line 190, in deepcopy
        y = _reconstruct(x, rv, 1, memo)
      File "/usr/local/lib/python2.7/copy.py", line 334, in _reconstruct
        state = deepcopy(state, memo)
      File "/usr/local/lib/python2.7/copy.py", line 163, in deepcopy
        y = copier(x, memo)
      File "/usr/local/lib/python2.7/copy.py", line 257, in _deepcopy_dict
        y[deepcopy(key, memo)] = deepcopy(value, memo)
      File "/usr/local/lib/python2.7/copy.py", line 182, in deepcopy
        rv = reductor(2)
    TypeError: can't pickle thread.lock objects

    Some help would be highly appreciated

    opened by benjaminelkrieff 2
  • Future of the project

    Hi. I've just come across this project and it is exactly what we need. However, I've noticed there haven't been any updates for a while now. Could you guys please share your vision for this project? Is it still being maintained? Thank you very much.

    opened by demisx 9
Releases

Latest release: v1.2.1