Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Django and Vue.js

Overview

Gerapy


Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js.

Documentation

Documentation is available online at https://docs.gerapy.com/ and https://github.com/Gerapy/Docs.

Support

Gerapy is developed based on Python 3.x. Python 2.x may be supported later.

Usage

Install Gerapy by pip:

pip3 install gerapy

After installation, do the following to run the Gerapy server:

If Gerapy is installed successfully, the gerapy command will be available. If not, check the installation.
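
If you also want a quick sanity check from Python, the snippet below is a minimal sketch; it only assumes that the gerapy package exposes a version attribute (the same attribute imported by Gerapy's own command-line entry point, as the tracebacks later on this page show):

    # Minimal installation check (sketch): import the package and print its version.
    from gerapy import version
    print(version)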

First use this command to initialize the workspace:

gerapy init

This creates a folder named gerapy. You can also specify the name of your workspace with this command:

gerapy init <workspace>

Then cd into this folder and run this command to initialize the database:

cd gerapy
gerapy migrate

Next, create a superuser with this command:

gerapy createsuperuser

Then start the server with this command:

gerapy runserver

Now you can visit http://localhost:8000 to use Gerapy. You can also visit http://localhost:8000/admin to access the admin management backend.

If you want to expose Gerapy publicly, run it like this:

gerapy runserver 0.0.0.0:8000

It will then listen on all interfaces on port 8000.

In Gerapy, you can create a configurable project and then configure it to generate Scrapy code automatically. This module is still unstable and is being refined.

You can also drop an existing Scrapy project into the projects folder. After refreshing the web page, it will appear on the Project Index Page as a non-configurable project, but you can still edit it through the web interface.

For deployment, go to the Deploy Page. First build your project and add a client on the Client Index Page; then you can deploy the project with a single click.

After deployment, you can manage the job on the Monitor Page.
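
Under the hood, deployment and job management go through Scrapyd (via the Scrapyd-API package listed above). If you prefer to script these steps instead of clicking through the UI, the sketch below is a minimal example under stated assumptions: the Scrapyd address, project name, and spider name are placeholders, and it talks to a Scrapyd client directly rather than through Gerapy itself.

    # Minimal sketch using python-scrapyd-api: schedule a spider on a Scrapyd
    # instance and list its jobs. 'quotes_project' and 'quotes_spider' are
    # placeholder names; replace them with your own project and spider.
    from scrapyd_api import ScrapydAPI

    scrapyd = ScrapydAPI('http://localhost:6800')   # address of one of your clients
    job_id = scrapyd.schedule('quotes_project', 'quotes_spider')
    print('started job:', job_id)
    print(scrapyd.list_jobs('quotes_project'))      # pending / running / finished jobs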

Docker

Just run this command:

docker run -d -v ~/gerapy:/app/gerapy -p 8000:8000 germey/gerapy

It will then run on port 8000. You can log in with the temporary admin account (username: admin, password: admin). Please change the password afterwards for safety.

Command Usage:

docker run -d -v <workspace>:/app/gerapy -p <public_port>:<container_port> germey/gerapy

Mount your Gerapy workspace with -v <workspace>:/app/gerapy and map the server port with -p <public_port>:<container_port>.

If you run Gerapy with Docker, you can visit the Gerapy site at http://localhost:8000 right away; no other initialization steps are needed.

TodoList

  • Add Visual Configuration of Spider with Previewing Website
  • Add Scrapyd Auth Management
  • Add Gerapy Auth Management
  • Add Timed Task Scheduler
  • Add Visual Configuration of Scrapy
  • Add Intelligent Analysis of Web Page

Communication

If you have any questions or ideas, please open an Issue or send a Pull Request. Your suggestions are really important to us; thanks for your contribution.

Comments
  • A few humble suggestions, Cui

    A few humble suggestions, Cui

    First of all, as one of Cui's students, I hope this project keeps getting better. Here are my humble suggestions.

    I looked at the TodoList. The most urgently needed features shouldn't be things like visual spider creation. This is a distributed deployment project, not an Octoparse-style visual crawler generator. In my opinion, the features that most need to be added right now are:

    1. Scrapyd authentication, or Gerapy authentication. With Scrapyd auth, Gerapy can stay on a local machine; with Gerapy auth, Gerapy can sit on a server. Either way gives basic security. Without it, this simply cannot be used in a production environment; it stays a toy.

    2. Scheduled tasks, the most basic feature in production. Without this, nobody will use Gerapy. That's just how it is.

    I hope Cui finishes these two basic features first. Everything else is just icing on the cake; the basics are the foundation. Finally, I hope Cui keeps up the good work, keeps growing Gerapy, and builds a widely used distributed crawler deployment and management platform.

    opened by Amos-x 13
  • gerapy init failed complaining cannot import name 'version'

    gerapy init failed complaining cannot import name 'version'

    I've recently installed gerapy on a Python 3.6.3 virtual environment. After successful installation I executed: gerapy init

    and got this:

        Traceback (most recent call last):
          File "/home/muhammad/development/virtualenvs/py3-pocs/bin/gerapy", line 7, in
            from gerapy.cmd import cmd
          File "/home/muhammad/development/virtualenvs/py3-pocs/lib/python3.6/site-packages/gerapy/cmd/init.py", line 14, in
            from gerapy import version
        ImportError: cannot import name 'version'

    opened by matifayaz 7
  • ModuleNotFoundError: No module named 'scrapy.settings.deprecated'

    ModuleNotFoundError: No module named 'scrapy.settings.deprecated'

    Describe the bug When running the gerapy command, the following appears: ModuleNotFoundError: No module named 'scrapy.settings.deprecated'

    To Reproduce Steps to reproduce the behavior:

    1. Running gerapy migrate produces the error above

    Traceback

        gerapy migrate
        Traceback (most recent call last):
          File "/Users/sven/opt/anaconda2/envs/spiders-manager/bin/gerapy", line 5, in
            from gerapy.cmd import cmd
          File "/Users/sven/opt/anaconda2/envs/spiders-manager/lib/python3.7/site-packages/gerapy/cmd/init.py", line 5, in
            from gerapy.cmd.parse import parse
          File "/Users/sven/opt/anaconda2/envs/spiders-manager/lib/python3.7/site-packages/gerapy/cmd/parse.py", line 2, in
            from gerapy.server.core.parser import get_start_requests, get_follow_requests_and_items
          File "/Users/sven/opt/anaconda2/envs/spiders-manager/lib/python3.7/site-packages/gerapy/server/core/parser.py", line 6, in
            from scrapy.settings.deprecated import check_deprecated_settings
        ModuleNotFoundError: No module named 'scrapy.settings.deprecated'

    Environment (please complete the following information):

    • OS: macOS Mojave
    • Python Version 3.7
    • Gerapy Version '0.9.2'
    • Scrapy version (2, 1, 0)

    Additional context Add any other context about the problem here.

    bug 
    opened by 474416133 6
  • Tasks created in Task Management cannot be deleted, and checking their status shows a loading failure.

    Tasks created in Task Management cannot be deleted, and checking their status shows a loading failure.

    Describe the bug I created a scheduled spider task in Task Management. When I check its status, it shows a loading failure, and the task cannot be deleted. The Gerapy backend log reports: django.core.exceptions.FieldError: Cannot resolve keyword 'name' into field. Choices are: djangojobexecution, id, job_state, next_run_time

    bug 
    opened by a2011318824 5
  • Why does the spider log view under Clients - Schedule show "Processing Failed"?

    Why does the spider log view under Clients - Schedule show "Processing Failed"?

    The spider runs normally, but where the Scrapy output should appear it shows this instead: <html><head><title>Processing Failed</title></head><body><b>Processing Failed</b></body></html>

    opened by whisperbb 5
  • fix bug in build.py

    fix bug in build.py

    There is a bug in build.py: the path passed to create_default_setup_py() is the folder path, not the absolute path of setup.py, which causes this function to never create setup.py. Related issues: #165, #164. See the sketch below.
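
    For illustration only, a minimal sketch of the fix described above. The helper below is a stand-in; the real create_default_setup_py in Gerapy's build.py may have a different signature.

        import os

        def create_default_setup_py(path):
            # Stand-in for Gerapy's helper; writes a minimal setup.py at the given path.
            with open(path, 'w') as f:
                f.write("from setuptools import setup\nsetup()\n")

        def ensure_setup_py(folder):
            # The gist of the fix: join the project folder with 'setup.py' first,
            # instead of passing the folder path itself to the helper.
            setup_py_path = os.path.join(folder, 'setup.py')
            if not os.path.exists(setup_py_path):
                create_default_setup_py(setup_py_path)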

    opened by MeepoAII 5
  • English Language Support Feature

    English Language Support Feature

    Hi @Germey ,

    Hope you are doing great. I am deeply happy to see you continuously working so hard to improve the performance & adding new feature of Gerapy.

    I know that this is probably not an ideal question to ask you hereon github issue section but I was wondering if you won't mind to let me know when you are expecting to have English support for such an excellent Framework Gerapy.

    "In our earlier conversation", you said that "I'm Chinese from Beijing, China. 😁 If you feel any inconvenience I'm glad to convert it in the next version.".

    I am patiently & enthusiastically looking forward to see support for English.

    Thank you so much for your dedication, time, effort in building such amazing Framework.

    Thank you.

    opened by mtaziz 5
  • Creating a client returns 401

    Creating a client returns 401

    Environment: Windows 10, Python 3.7. When I create a client in the web UI there is no response, and the log shows a 401 status code. What is going on, and how can it be fixed?

    Performing system checks...
    
    INFO - 2020-04-30 14:53:16,985 - process: 6236 - scheduler.py - gerapy.server.core.scheduler - 102 - scheduler - successfully synced task with jobs
    System check identified no issues (0 silenced).
    April 30, 2020 - 14:53:17
    Django version 1.11.26, using settings 'gerapy.server.server.settings'
    Starting development server at http://127.0.0.1:8000/
    Quit the server with CTRL-BREAK.
    [30/Apr/2020 14:53:21] "GET / HTTP/1.1" 200 2466
    [30/Apr/2020 14:53:21] "GET /static/js/app.65f12732.js HTTP/1.1" 304 0
    [30/Apr/2020 14:53:21] "GET /static/js/chunk-vendors.b39c50b5.js HTTP/1.1" 304 0
    [30/Apr/2020 14:53:21] "GET /static/img/loading.864753ef.svg HTTP/1.1" 304 0
    [30/Apr/2020 14:53:21] "GET /static/fonts/fontawesome-webfont.af7ae505.woff2 HTTP/1.1" 304 0
    [30/Apr/2020 14:53:21] "GET /static/img/logo.0aa9679a.png HTTP/1.1" 304 0
    [30/Apr/2020 14:53:21] "GET /static/fonts/element-icons.535877f5.woff HTTP/1.1" 304 0
    [30/Apr/2020 14:53:22] "GET /api/client/ HTTP/1.1" 401 27
    [30/Apr/2020 14:53:33] "POST /api/client/create HTTP/1.1" 401 27
    [30/Apr/2020 14:53:34] "POST /api/client/create HTTP/1.1" 401 27
    [30/Apr/2020 14:53:35] "POST /api/client/create HTTP/1.1" 401 27
    [30/Apr/2020 14:53:35] "POST /api/client/create HTTP/1.1" 401 27
    
    bug 
    opened by kanadeblisst 4
  • The scheduled task has been deleted, but the job keeps running.

    The scheduled task has been deleted, but the job keeps running.

    I am using gerapy 0.8.6b1. With guidance from @thsheep I finally got scheduled tasks running (thanks!), but then I hit a problem. The scheduled task I added was a simple test script, just to verify that scheduling works. After testing I deleted the scheduled task, but under Clients - (select a client) - Schedule I can still see job runs: even though the schedule was deleted, the task keeps executing on the old schedule... Also, could @thsheep give an example crontab configuration? It is not quite the same as crontab on Linux, and I am not sure what format Gerapy expects. Thanks!

    opened by AaronJny 4
  • Settings file read error when deploying a Scrapy spider to a client

    Settings file read error when deploying a Scrapy spider to a client

    A finished Scrapy project fails to deploy to a Gerapy client through the web UI. The backend error log is below; I do not know the cause.

    { message: 'scrapyd_api.exceptions.ScrapydResponseError: Traceback (most recent call last):\n File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main\n "main", fname, loader, pkg_name)\n File "/usr/lib/python2.7/runpy.py", line 72, in _run_code\n exec code in run_globals\n File "/usr/local/lib/python2.7/dist-packages/scrapyd/runner.py", line 40, in \n main()\n File "/usr/local/lib/python2.7/dist-packages/scrapyd/runner.py", line 37, in main\n execute()\n File "/usr/local/lib/python2.7/dist-packages/scrapy/cmdline.py", line 110, in execute\n settings = get_project_settings()\n File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/project.py", line 68, in get_project_settings\n settings.setmodule(settings_module_path, priority='project')\n File "/usr/local/lib/python2.7/dist-packages/scrapy/settings/init.py", line 292, in setmodule\n module = import_module(module)\n File "/usr/lib/python2.7/importlib/init.py", line 37, in import_module\n import(name)\nImportError: No module named CNKISpider.settings\n' } } }

    opened by FRANDAVID 4
  • AttributeError: cffi library '_openssl' has no function, constant or global variable named 'Cryptography_HAS_EVP_PKEY_get_set_tls_encodedpoint'

    AttributeError: cffi library '_openssl' has no function, constant or global variable named 'Cryptography_HAS_EVP_PKEY_get_set_tls_encodedpoint'

    $ gerapy init $ cd gerapy/ $ ls projects $ gerapy migrate Traceback (most recent call last): File "/home/datacrawl/.local/bin/gerapy", line 11, in sys.exit(cmd()) File "/home/datacrawl/.local/lib/python3.5/site-packages/gerapy/cmd/init.py", line 27, in cmd server() File "/home/datacrawl/.local/lib/python3.5/site-packages/gerapy/cmd/server.py", line 6, in server manage() File "/home/datacrawl/.local/lib/python3.5/site-packages/gerapy/server/manage.py", line 23, in manage execute_from_command_line(sys.argv) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/management/init.py", line 371, in execute_from_command_line utility.execute() File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/management/init.py", line 365, in execute self.fetch_command(subcommand).run_from_argv(self.argv) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/management/base.py", line 288, in run_from_argv self.execute(*args, **cmd_options) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/management/base.py", line 332, in execute self.check() File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/management/base.py", line 364, in check include_deployment_checks=include_deployment_checks, File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/management/commands/migrate.py", line 58, in _run_checks issues.extend(super()._run_checks(**kwargs)) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/management/base.py", line 351, in _run_checks return checks.run_checks(**kwargs) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/checks/registry.py", line 73, in run_checks new_errors = check(app_configs=app_configs) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/checks/urls.py", line 13, in check_url_config return check_resolver(resolver) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/core/checks/urls.py", line 23, in check_resolver return check_method() File "/home/datacrawl/.local/lib/python3.5/site-packages/django/urls/resolvers.py", line 397, in check for pattern in self.url_patterns: File "/home/datacrawl/.local/lib/python3.5/site-packages/django/utils/functional.py", line 36, in get res = instance.dict[self.name] = self.func(instance) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/urls/resolvers.py", line 536, in url_patterns patterns = getattr(self.urlconf_module, "urlpatterns", self.urlconf_module) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/utils/functional.py", line 36, in get res = instance.dict[self.name] = self.func(instance) File "/home/datacrawl/.local/lib/python3.5/site-packages/django/urls/resolvers.py", line 529, in urlconf_module return import_module(self.urlconf_name) File "/usr/lib/python3.5/importlib/init.py", line 126, in import_module return _bootstrap._gcd_import(name[level:], package, level) File "", line 986, in _gcd_import File "", line 969, in _find_and_load File "", line 958, in _find_and_load_unlocked File "", line 673, in _load_unlocked File "", line 665, in exec_module File "", line 222, in _call_with_frames_removed File "/home/datacrawl/.local/lib/python3.5/site-packages/gerapy/server/server/urls.py", line 21, in url(r'^', include('gerapy.server.core.urls')), File "/home/datacrawl/.local/lib/python3.5/site-packages/django/urls/conf.py", line 34, in include urlconf_module = import_module(urlconf_module) File "/usr/lib/python3.5/importlib/init.py", line 126, in import_module return 
_bootstrap._gcd_import(name[level:], package, level) File "", line 986, in _gcd_import File "", line 969, in _find_and_load File "", line 958, in _find_and_load_unlocked File "", line 673, in _load_unlocked File "", line 665, in exec_module File "", line 222, in _call_with_frames_removed File "/home/datacrawl/.local/lib/python3.5/site-packages/gerapy/server/core/urls.py", line 2, in from . import views File "/home/datacrawl/.local/lib/python3.5/site-packages/gerapy/server/core/views.py", line 1, in import json, os, requests, time, pytz, pymongo, string File "/home/datacrawl/.local/lib/python3.5/site-packages/requests/init.py", line 84, in from urllib3.contrib import pyopenssl File "/home/datacrawl/.local/lib/python3.5/site-packages/urllib3/contrib/pyopenssl.py", line 46, in import OpenSSL.SSL File "/home/datacrawl/.local/lib/python3.5/site-packages/OpenSSL/init.py", line 8, in from OpenSSL import crypto, SSL File "/home/datacrawl/.local/lib/python3.5/site-packages/OpenSSL/crypto.py", line 16, in from OpenSSL._util import ( File "/home/datacrawl/.local/lib/python3.5/site-packages/OpenSSL/_util.py", line 6, in from cryptography.hazmat.bindings.openssl.binding import Binding File "/home/datacrawl/.local/lib/python3.5/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 156, in Binding.init_static_locks() File "/home/datacrawl/.local/lib/python3.5/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 137, in init_static_locks cls._ensure_ffi_initialized() File "/home/datacrawl/.local/lib/python3.5/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 124, in _ensure_ffi_initialized cls.lib = build_conditional_library(lib, CONDITIONAL_NAMES) File "/home/datacrawl/.local/lib/python3.5/site-packages/cryptography/hazmat/bindings/openssl/binding.py", line 84, in build_conditional_library if not getattr(lib, condition): AttributeError: cffi library '_openssl' has no function, constant or global variable named 'Cryptography_HAS_EVP_PKEY_get_set_tls_encodedpoint'

    opened by Germey 4
  • Spider Works in Terminal Not in Gerapy

    Spider Works in Terminal Not in Gerapy

    Before I start I just want to say that you all have done a great job developing this project. I love gerapy. I will probably start contributing to the project. I will try to document this as well as I can so it can be helpful to others.

    Describe the bug I have a scrapy project which runs perfectly fine in terminal using the following command:

    scrapy crawl examplespider

    However, when I schedule it in a task and run it on my local scrapyd client it runs but immediately closes. I don't know why it opens and closes without doing anything. Throws no errors. I think it's a config file issue. When I view the results of the job it shows the following:

    `y.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider opened
    2022-12-15 07:03:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2022-12-15 07:03:21 [scrapy.core.engine] INFO: Closing spider (finished)
    2022-12-15 07:03:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'elapsed_time_seconds': 0.002359,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 314439),
     'log_count/DEBUG': 1,
     'log_count/INFO': 10,
     'log_count/WARNING': 1,
     'memusage/max': 63709184,
     'memusage/startup': 63709184,
     'start_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 312080)}
    2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider closed (finished)`
    

    In the logs it shows the following:

    /home/ubuntu/env/scrape/bin/logs/examplescraper/examplespider

    2022-12-15 07:03:21 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: examplescraper)
    2022-12-15 07:03:21 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.8.10 (default, Nov 14 2022, 12:59:47) - [GCC 9.4.0], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
    2022-12-15 07:03:21 [scrapy.crawler] INFO: Overridden settings:
    {'BOT_NAME': 'examplescraper', 
     'DOWNLOAD_DELAY': 0.1, 
     'LOG_FILE': 'logs/examplescraper/examplespider/8d623d447c4611edad0641137877ddff.log', 
     'NEWSPIDER_MODULE': 'examplespider.spiders', 
     'SPIDER_MODULES': ['examplespider.spiders'], 
     'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '               
     		'(KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'
    }
    
    2022-12-15 07:03:21 [py.warnings] WARNING: /home/ubuntu/env/scrape/lib/python3.8/site-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
    
    It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
    
    See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this 
      deprecation.  
        return cls(crawler)
        
    2022-12-15 07:03:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
    2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet Password: b11a24faee23f82c
    2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats', 
     'scrapy.extensions.telnet.TelnetConsole 
     'scrapy.extensions.memusage.MemoryUsage', 
     'scrapy.extensions.logstats.LogStats']
    2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 
     'scrapy.downloadermiddlewares.retry.RetryMiddleware', 
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 
     'scrapy.spidermiddlewares.referer.RefererMiddleware', 
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider opened
    2022-12-15 07:03:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2022-12-15 07:03:21 [scrapy.core.engine] INFO: Closing spider (finished)
    2022-12-15 07:03:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    
    {'elapsed_time_seconds': 0.002359, 
     'finish_reason': 'finished', 
     'finish_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 314439), 
     'log_count/DEBUG': 1, 
     'log_count/INFO': 10, 
     'log_count/WARNING': 1, 
     'memusage/max': 63709184, 
     'memusage/startup': 63709184, 
     'start_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 312080)
    }
     2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider closed (finished)
    

    /home/ubuntu/gerapy/logs

    [email protected]:~/gerapy/logs$ cat 20221215065310.log 
     INFO - 2022-12-15 14:53:18,043 - process: 480 - scheduler.py - gerapy.server.core.scheduler - 105 - scheduler - successfully synced task with jobs with force
     INFO - 2022-12-15 14:54:15,011 - process: 480 - scheduler.py - gerapy.server.core.scheduler - 34 - scheduler - execute job of client LOCAL, project examplescraper, spider examplespider
     [email protected]:~/gerapy/logs$ 
    

    To Reproduce Steps to reproduce the behavior:

    1. AWS Ubuntu 20.04 Instance
    2. Use python3 virtual environment and follow the installation instructions
    3. Create a systemd service for scrapyd and gerapy by doing the following:
        cd /lib/systemd/system
        sudo nano scrapyd.service
    

    paste the following:

         [Unit]
         Description=Scrapyd service
         After=network.target
    
         [Service]
         User=ubuntu
         Group=ubuntu
         WorkingDirectory=/home/ubuntu/env/scrape/bin
         ExecStart=/home/ubuntu/env/scrape/bin/scrapyd
    
         [Install]
         WantedBy=multi-user.target
    

    Issue the following commands:

          sudo systemctl enable scrapyd.service
          sudo systemctl start scrapyd.service
          sudo systemctl status scrapyd.service
    

    It should say: active (running). Then create a script to run Gerapy as a systemd service:

         cd ~/virtualenv/exampleproject/bin/
         nano runserver-gerapy.sh
    

    Paste the following:

         #!/bin/bash
         cd /home/ubuntu/virtualenv
         source exampleproject/bin/activate
         cd /home/ubuntu/gerapy
         gerapy runserver 0.0.0.0:8000
    

    Give this file execute permissions: sudo chmod +x runserver-gerapy.sh

    Navigate back to systemd and create a service to run runserver-gerapy.sh

         cd /lib/systemd/system
         sudo nano gerapy-web.service
    

    Paste the following:

         [Unit]
         Description=Gerapy Webserver Service
         After=network.target
    
         [Service]
         User=ubuntu
         Group=ubuntu
         WorkingDirectory=/home/ubuntu/virtualenv/exampleproject/bin
         ExecStart=/bin/bash /home/ubuntu/virtualenv/exampleproject/bin/runserver-gerapy.sh
    
         [Install]
         WantedBy=multi-user.target
    

    Again issue the following:

         sudo systemctl enable gerapy-web.service
         sudo systemctl start gerapy-web.service
         sudo systemctl status gerapy-web.service
    

    Look for active (running) and navigate to http://your.pub.ip.add:8000 or http://localhost:8000 or http://127.0.0.1:8000 to verify that it is running. Reboot the instance to verify that the services are running on system startup.

    5. Log in and create a client for the local scrapyd service. Use IP 127.0.0.1 and Port 6800. No Auth. Save it as "Local" or "Scrapyd".
    6. Create a project. Select Clone. For testing I used the following GitHub scrapy project: https://github.com/eneiromatos/NebulaEmailScraper (actually a pretty nice starter project). Save the project. Build the project. Deploy the project. (If you get an error when deploying, make sure you are running in the virtual env; you might need to reboot.)
    7. Create a task. Make sure the project name and spider name match what is in the scrapy.cfg and examplespider.py files, and save the task. Schedule the task. Run the task.

    Traceback See logs above ^^^

    Expected behavior It should run for at least 5 minutes and output to a file called emails.json in the project root folder (the folder with scrapy.cfg file)

    Screenshots I can upload screenshots if requested.

    Environment (please complete the following information):

    • OS: AWS Ubuntu 20.04
    • Browser Firefox
    • Python Version 3.8
    • Gerapy Version 0.9.11 (latest)

    Additional context Add any other context about the problem here.

    bug 
    opened by wmullaney 0
  • With multiple clients selected in a Gerapy task, only 2 are scheduled and the other 2 are not

    With multiple clients selected in a Gerapy task, only 2 are scheduled and the other 2 are not

    Problem description: Four clients are configured for a task in Task Management. When Gerapy runs the schedule it logs a dispatch entry for every client, but the Scrapyd web pages show that two of them never started. In the Gerapy metadata table (django_apscheduler_djangojobexecution), two rows are in state Executed and two are stuck in Started execution. (After restarting Gerapy, scheduling works normally.)

    Log:
    INFO - 2022-12-06 10:23:00,022 - process: 26463 - scheduler.py - gerapy.server.core.scheduler - 34 - scheduler - execute job of client crawler02_6805, project project_name, spider spider_name
    INFO - 2022-12-06 10:23:00,055 - process: 26463 - scheduler.py - gerapy.server.core.scheduler - 34 - scheduler - execute job of client crawler03_6806, project project_name, spider spider_name
    INFO - 2022-12-06 10:23:00,067 - process: 26463 - scheduler.py - gerapy.server.core.scheduler - 34 - scheduler - execute job of client crawler03_6807, project project_name, spider spider_name
    INFO - 2022-12-06 10:23:00,088 - process: 26463 - scheduler.py - gerapy.server.core.scheduler - 34 - scheduler - execute job of client crawler03_6808, project project_name, spider spider_name

    Rows in django_apscheduler_djangojobexecution:
    16 Started execution 2022-12-06 02:23:00.000000 crawler02_6805-project_name-spider_name
    17 Started execution 2022-12-06 02:23:00.000000 crawler03_6806-project_name-spider_name
    18 Executed 2022-12-06 02:23:00.000000 2.00 1670293382.00 crawler03_6807-project_name-spider_name
    19 Executed 2022-12-06 02:23:00.000000 1.98 1670293381.98 crawler03_6808-project_name-spider_name

    Environment:
    • OS: Ubuntu 20.04
    • Browser: Chrome 104 (64-bit)
    • Python: 3.9.15
    • Gerapy: 0.9.11
    • Scrapyd: 1.3.0

    bug 
    opened by Qingenmm 0
  • Clicking task status shows a loading failure

    Clicking task status shows a loading failure

    ERROR - 2022-11-03 22:09:06,050 - process: 14087 - utils.py - gerapy.server.core.utils - 564 - utils - invalid literal for int() with base 10: 'dev-supply_chain_fumei_data_gather-cement' Traceback (most recent call last): File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/gerapy/server/core/utils.py", line 562, in wrapper result = func(*args, **kwargs) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view return view_func(*args, **kwargs) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/views/generic/base.py", line 71, in view return self.dispatch(request, *args, **kwargs) File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 505, in dispatch response = self.handle_exception(exc) File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 465, in handle_exception self.raise_uncaught_exception(exc) File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 476, in raise_uncaught_exception raise exc File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 502, in dispatch response = handler(request, *args, **kwargs) File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/rest_framework/decorators.py", line 50, in handler return func(*args, **kwargs) File "/opt/rh/rh-python38/root/usr/local/lib/python3.8/site-packages/gerapy/server/core/views.py", line 943, in task_status jobs = DjangoJob.objects.filter(id=job_id) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/manager.py", line 82, in manager_method return getattr(self.get_queryset(), name)(*args, **kwargs) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/query.py", line 892, in filter return self._filter_or_exclude(False, *args, **kwargs) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/query.py", line 910, in _filter_or_exclude clone.query.add_q(Q(*args, **kwargs)) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/sql/query.py", line 1290, in add_q clause, _ = self._add_q(q_object, self.used_aliases) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/sql/query.py", line 1315, in _add_q child_clause, needed_inner = self.build_filter( File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/sql/query.py", line 1251, in build_filter condition = self.build_lookup(lookups, col, value) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/sql/query.py", line 1116, in build_lookup lookup = lookup_class(lhs, rhs) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/lookups.py", line 20, in init self.rhs = self.get_prep_lookup() File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/lookups.py", line 70, in get_prep_lookup return self.lhs.output_field.get_prep_value(self.rhs) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/db/models/fields/init.py", line 972, in get_prep_value return int(value) ValueError: invalid literal for int() with base 10: 'dev-supply_chain_fumei_data_gather-cement' Internal Server Error: /api/task/1/status Traceback (most recent call last): File 
"/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner response = get_response(request) File "/opt/rh/rh-python38/root/usr/local/lib64/python3.8/site-packages/django/core/handlers/base.py", line 124, in _get_response raise ValueError( ValueError: The view gerapy.server.core.utils.wrapper didn't return an HttpResponse object. It returned None instead. [03/Nov/2022 22:09:06] "GET /api/task/1/status HTTP/1.1" 500 14901

    opened by chenziyao 1
  • Installation fails on Python 3.7

    Installation fails on Python 3.7

    Describe the bug A clear and concise description of what the bug is.

    To Reproduce Steps to reproduce the behavior:

    1. Go to '...'
    2. Click on '....'
    3. Scroll down to '....'
    4. See error

    Traceback Copy traceback displayed in console to here.

    Expected behavior A clear and concise description of what you expected to happen.

    Screenshots If applicable, add screenshots to help explain your problem.

    Environment (please complete the following information):

    • OS: [e.g. macos 12]
    • Browser [e.g. Chrome 67]
    • Python Version [e.g. 3.7.9]
    • Gerapy Version [e.g. 0.8.6]

    Additional context Add any other context about the problem here.

    Gerapy fails to install. Error message: pip subprocess to install build dependencies did not run successfully.

    bug 
    opened by songsh 0
  • Cui, deployment fails when the asyncio TWISTED_REACTOR is enabled

    Cui, deployment fails when the asyncio TWISTED_REACTOR is enabled

    Describe the bug Deployment fails when the asyncio TWISTED_REACTOR is enabled.

    Traceback Traceback (most recent call last): File "D:\anaconda\envs\scrapy\lib\site-packages\twisted\web\http.py", line 2369, in allContentReceived req.requestReceived(command, path, version) File "D:\anaconda\envs\scrapy\lib\site-packages\twisted\web\http.py", line 1003, in requestReceived self.process() File "D:\anaconda\envs\scrapy\lib\site-packages\twisted\web\server.py", line 229, in process self.render(resrc) File "D:\anaconda\envs\scrapy\lib\site-packages\twisted\web\server.py", line 294, in render body = resrc.render(self) --- --- File "D:\anaconda\envs\scrapy\lib\site-packages\scrapyd\webservice.py", line 21, in render return JsonResource.render(self, txrequest).encode('utf-8') File "D:\anaconda\envs\scrapy\lib\site-packages\scrapyd\utils.py", line 21, in render r = resource.Resource.render(self, txrequest) File "D:\anaconda\envs\scrapy\lib\site-packages\twisted\web\resource.py", line 263, in render return m(request) File "D:\anaconda\envs\scrapy\lib\site-packages\scrapyd\webservice.py", line 88, in render_POST spiders = get_spider_list(project, version=version) File "D:\anaconda\envs\scrapy\lib\site-packages\scrapyd\utils.py", line 134, in get_spider_list raise RuntimeError(msg.encode('unicode_escape') if six.PY2 else msg) builtins.RuntimeError: D:\anaconda\envs\scrapy\lib\site-packages\scrapy\utils\project.py:81: ScrapyDeprecationWarning: Use of environment variables prefixed with S CRAPY_ to override settings is deprecated. The following environment variables are currently defined: EGG_VERSION warnings.warn( Traceback (most recent call last): File "D:\anaconda\envs\scrapy\lib\runpy.py", line 192, in _run_module_as_main return _run_code(code, main_globals, None, File "D:\anaconda\envs\scrapy\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "D:\anaconda\envs\scrapy\lib\site-packages\scrapyd\runner.py", line 46, in main() File "D:\anaconda\envs\scrapy\lib\site-packages\scrapyd\runner.py", line 43, in main execute() File "D:\anaconda\envs\scrapy\lib\site-packages\scrapy\cmdline.py", line 144, in execute cmd.crawler_process = CrawlerProcess(settings) File "D:\anaconda\envs\scrapy\lib\site-packages\scrapy\crawler.py", line 280, in init super().init(settings) File "D:\anaconda\envs\scrapy\lib\site-packages\scrapy\crawler.py", line 156, in init self._handle_twisted_reactor() File "D:\anaconda\envs\scrapy\lib\site-packages\scrapy\crawler.py", line 343, in _handle_twisted_reactor install_reactor(self.settings["TWISTED_REACTOR"], self.settings["ASYNCIO_EVENT_LOOP"]) File "D:\anaconda\envs\scrapy\lib\site-packages\scrapy\utils\reactor.py", line 66, in install_reactor asyncioreactor.install(eventloop=event_loop) File "D:\anaconda\envs\scrapy\lib\site-packages\twisted\internet\asyncioreactor.py", line 308, in install reactor = AsyncioSelectorReactor(eventloop) File "D:\anaconda\envs\scrapy\lib\site-packages\twisted\internet\asyncioreactor.py", line 63, in init raise TypeError( TypeError: ProactorEventLoop is not supported, got:

    Environment (please complete the following information):

    • OS: Windows 10
    • Python Version 3.8.2
    • Gerapy Version 0.9.10
    bug 
    opened by frshman 1
Releases(v0.9.12)