LSpider: a front-end crawler built for passive scanners

Last update: Dec 12, 2022


Overview

LSpider

LSpider - a front-end crawler built for passive scanners

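What is LSpider?

A front-end crawler made for passive scanners.

It consists of five parts: Chrome Headless, the LSpider controller, a MySQL database, RabbitMQ, and a passive scanner.

(1) Built on Chrome Headless, with simulated clicks and triggered events as its core principle; traffic is exported to the passive scanner through a configured proxy.

(2) Crawls divergently using built-in tasks plus a subdomain API, with the goal of triggering as much traffic for the target domain as possible.

(3) Manages tasks through RabbitMQ and supports a large number of threads working at the same time.

(4) Fills in and submits forms intelligently.

(5) Detects login forms through several heuristics and reports them to the user, who can then complete the login by adding a cookie.

(6) Provides a custom Webhook interface so that Webhook statistics can be sent to WeChat.

(7) Ships with built-in HackerOne and Bugcrowd crawlers; given an account, they can fetch the full scope of a target in one step.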
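Item (5) does not say which heuristics are used to recognize a login form. As a minimal sketch of one common heuristic (a form that contains a password input), and not LSpider's actual detection logic, the following uses only the Python standard library. The names `LoginFormFinder` and `find_login_forms` are hypothetical.

```python
from html.parser import HTMLParser


class LoginFormFinder(HTMLParser):
    """Collects the action of every <form> that contains a password input.

    Hypothetical helper for illustration; LSpider's own heuristics may differ.
    """

    def __init__(self):
        super().__init__()
        self._in_form = False
        self._has_password = False
        self._action = ""
        self.login_forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            # A new form starts: reset the per-form state.
            self._in_form = True
            self._has_password = False
            self._action = attrs.get("action") or ""
        elif tag == "input" and self._in_form:
            if (attrs.get("type") or "").lower() == "password":
                self._has_password = True

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            if self._has_password:
                self.login_forms.append(self._action)
            self._in_form = False


def find_login_forms(html):
    """Return the action attributes of forms that look like login forms."""
    finder = LoginFormFinder()
    finder.feed(html)
    finder.close()
    return finder.login_forms
```

A crawler could call `find_login_forms(page_source)` on each rendered page and report any non-empty result to the user, who then supplies a cookie for that target.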

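Why choose LSpider?

LSpider is a crawler built specifically for passive scanners, and many of its features exist to serve them.

The RabbitMQ-based task management system is quite stable and can keep crawling divergently for long periods without supervision.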
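To make the idea of queue-driven, divergent crawling concrete, here is a small sketch. It deliberately swaps RabbitMQ for Python's in-process `queue.Queue` so that it runs without a broker, and it is not LSpider's implementation. The function names, the `fetch_links` callback, and the scoping rule (same domain or any subdomain) are all assumptions made for illustration.

```python
import queue
import threading
from urllib.parse import urlparse


def in_scope(url, target_domain):
    """True if the URL's host is the target domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == target_domain or host.endswith("." + target_domain)


def crawl(seed_urls, target_domain, fetch_links, workers=5):
    """Crawl outward from the seeds, staying inside target_domain.

    fetch_links(url) is a caller-supplied (hypothetical) function that
    returns the links found on a page. queue.Queue stands in for RabbitMQ.
    """
    tasks = queue.Queue()
    seen = set()
    lock = threading.Lock()

    def submit(url):
        # Deduplicate and enforce scope before a URL becomes a task.
        with lock:
            if url in seen or not in_scope(url, target_domain):
                return
            seen.add(url)
        tasks.put(url)

    def worker():
        while True:
            url = tasks.get()
            if url is None:  # sentinel: shut this worker down
                tasks.task_done()
                return
            try:
                for link in fetch_links(url):
                    submit(link)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(workers)]
    for t in threads:
        t.start()
    for url in seed_urls:
        submit(url)
    tasks.join()  # returns once every queued URL has been processed
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return seen
```

New tasks are enqueued before their parent task is marked done, so `tasks.join()` cannot return while discovered URLs are still pending. With a real broker the queue survives restarts, which is presumably what makes long unattended runs practical.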

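What is the best practice for LSpider?

Server 1 (2 cores / 4 GB RAM or more): Nginx + MySQL + a MySQL admin interface (phpMyAdmin)

Set the passive scanner's output location to a path under the web root, so that results and tasks can be managed together through the web.

Deploy LSpider with 5 or more threads and point its proxy at the passive scanner (the passive scanner can be given a dedicated vulnerability-scanning proxy).

Server 2 (optional, but if RabbitMQ is deployed on server 1, server 1 needs a better configuration): RabbitMQ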
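The proxy hand-off to the passive scanner can be pictured as a set of Chrome command-line switches. The sketch below only builds the argument list; the function name is hypothetical, and the default address `127.0.0.1:7777` is taken from the log output quoted in the comments further down rather than from LSpider's configuration. `--headless`, `--proxy-server` and `--ignore-certificate-errors` are standard Chrome switches; the last one is included on the assumption that the scanner intercepts HTTPS with its own certificate.

```python
def chrome_headless_args(proxy_host="127.0.0.1", proxy_port=7777):
    """Build Chrome switches that send all browser traffic through a proxy.

    Illustrative only: LSpider's real option set is not shown in this document.
    """
    return [
        "--headless",
        # Route every request through the passive scanner's listening proxy.
        f"--proxy-server=http://{proxy_host}:{proxy_port}",
        # Assumption: the scanner re-signs HTTPS traffic with its own CA.
        "--ignore-certificate-errors",
    ]
```

With Selenium, each switch could be passed to `ChromeOptions.add_argument` before the driver is created.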

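What problems remain?

LSpider was designed from the start to work alongside passive scanners such as xray. Unfortunately, as the tool evolved, we came to realize that the crawler cannot really be separated from the passive scanner.

Forcing features that belong in the passive scanner into the crawler puts the cart before the horse, so we started a separate passive scanner project. If the opportunity arises, we will open-source it later.

Design rationale?

Building a crawler tailored for passive scanners: LSpider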

Usage

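Installation & usage

You can check whether the installation succeeded with the following command:

python3 manage.py SpiderCoreBackendStart --test

Note that the scripts below may depend on the project path; adjust the relevant configuration before using them.

Start the LSpider webhook (default port 2062)

./lspider_webhook.sh

Start LSpider

./lspider_start.sh

Shut down LSpider completely

./lspider_stop.sh

Start the passive scanner

./xray.sh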

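Key configuration

Configuration guide

How to configure scan tasks and other settings

This covers how to configure scan tasks, authentication information, and the webhook.

Note that for the Cookie setting mentioned there, the format is simply the Cookie value copied from a browser request.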
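A Cookie string copied from a browser request is a list of `name=value` pairs separated by semicolons. As a rough sketch of how such a string can be split safely (this is not LSpider's parser, and `parse_cookie_header` is a hypothetical name), splitting on the first `=` only and skipping fragments without one avoids the `IndexError` that a plain `split('=')[1]` raises on malformed input.

```python
def parse_cookie_header(raw):
    """Parse 'a=1; b=2' (optionally prefixed with 'Cookie:') into a dict."""
    raw = raw.strip()
    if raw.lower().startswith("cookie:"):
        raw = raw[len("cookie:"):]
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part:
            continue
        # partition() splits on the first '=' only, so values may contain '='.
        name, sep, value = part.partition("=")
        if not sep:
            continue  # fragment without '=': skip instead of raising
        cookies[name.strip()] = value.strip()
    return cookies
```

For example, `parse_cookie_header("Cookie: sid=abc123; token=a=b==")` keeps the full value `a=b==` for `token`.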


Getting targets with the built-in HackerOne and Bugcrowd crawlers

To use the HackerOne crawler, you first need to configure a HackerOne account

python3 manage.py HackeroneSpider {appname}

Likewise, for Bugcrowd use

python3 manage.py BugcrowdSpider {appname}

404StarLink

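LSpider is part of 404Team's StarLink project. If you have any questions about LSpider, or want to find others to discuss it with, see the StarLink project's instructions for joining the group.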

https://github.com/knownsec/404StarLink-Project#community

Comments

Ran into a problem while using it

[WARNING] [Thread-5] [00:33:08] [LReq.py:115] [LReq] something error, Traceback (most recent call last):
  File "/home/ubuntuvm/LSpider/utils/LReq.py", line 75, in get
    return method(url, args)
  File "/home/ubuntuvm/LSpider/utils/LReq.py", line 179, in getRespByChrome
    return self.cs.get_resp(url, cookies)
  File "/home/ubuntuvm/LSpider/core/chromeheadless.py", line 134, in get_resp
    self.add_cookie(cookies)
  File "/home/ubuntuvm/LSpider/core/chromeheadless.py", line 192, in add_cookie
    value = cookie.split('=')[1].strip()
IndexError: list index out of range

[WARNING] [Thread-5] [00:33:08] [htmlparser.py:86] [AST] something error, Traceback (most recent call last):
  File "/home/ubuntuvm/LSpider/core/htmlparser.py", line 42, in html_parser
    soup = BeautifulSoup(content, "html.parser")
  File "/usr/local/lib/python3.8/dist-packages/bs4/__init__.py", line 310, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'bool' has no len()

I get this error and don't know how to fix it.

opened by 294517102 3
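pika.exceptions.AMQPConnectionError error

Running lspider_start.sh reports pika.exceptions.AMQPConnectionError.

Environment: ubuntu20, python3.8, RabbitMQ 3.9.10, Erlang 24.1.7. http://ip:2062 is reachable, http://ip:15672 is reachable, and a new Virtual Host named lyspider has been created. lspider and rabbitmq are on the same machine, and rabbitmq runs in docker, started with: docker run -d --hostname rabbit --name some-rabbit -p 15672:15672 rabbitmq:3-management

The settings are as follows:

Screenshot of the error:

Even when I enter a random username and password, docker logs rabbit-log shows no related errors. I suspect an IP/port problem, but nothing looks wrong however I check it.

I have no experience with RabbitMQ or the related modules. After a day of fruitless searching on Baidu and Google, I'm asking here. Thanks for any reply!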

opened by KagamigawaMeguri 2
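One thing worth checking in the report above: the quoted `docker run` command publishes only port 15672 (the management UI), while AMQP clients such as pika connect to port 5672 by default. Whether that is the actual cause here is a guess, but reachability of the AMQP port is easy to test. The sketch below uses only the standard library; `port_open` is a hypothetical helper name.

```python
import socket

AMQP_DEFAULT_PORT = 5672  # RabbitMQ's default AMQP listener port


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

If `port_open("127.0.0.1", AMQP_DEFAULT_PORT)` returns False while the management UI on 15672 loads, adding `-p 5672:5672` to the `docker run` command would be the first thing to try.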
AttributeError: 'ChromeDriver' object has no attribute 'driver'

The first run works fine, but every run after that reports:

[email protected]:/home/tomato/LSpider-1.0.0.1# python3 manage.py SpiderCoreBackendStart --test
[INFO] [MainThread] [08:48:14] [SpiderCoreBackendStart.py:35] [Spider] start test spider.
[INFO] [MainThread] [08:48:14] [rabbitmqhandler.py:39] [Monitor][INIT][Rabbitmq] New Rabbitmq link to 127.0.0.1
[INFO] [MainThread] [08:48:14] [rabbitmqhandler.py:36] [Monitor][INIT] Rabbitmq init success...
[INFO] [MainThread] [08:48:14] [chromeheadless.py:100] [Chrome Headless] Proxy 127.0.0.1:7777 init
[ERROR] [MainThread] [08:48:15] [chromeheadless.py:45] [Chrome Headless] ChromeDriver load error.
[ERROR] [MainThread] [08:48:15] [SpiderCoreBackendStart.py:47] [Spider] something error, Traceback (most recent call last):
  File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 38, in __init__
    self.init_object()
  File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 119, in init_object
    desired_capabilities=desired_capabilities)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/chrome/webdriver.py", line 81, in __init__
    desired_capabilities=desired_capabilities)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 157, in __init__
    self.start_session(capabilities, browser_profile)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 252, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/webdriver.py", line 321, in execute
    self.error_handler.check_response(response)
  File "/usr/local/lib/python3.6/dist-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: unknown error: Chrome failed to start: exited abnormally.
  (unknown error: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /usr/bin/google-chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/tomato/LSpider-1.0.0.1/web/spider/management/commands/SpiderCoreBackendStart.py", line 40, in handle
    spidercore = SpiderCore(test_target_list)
  File "/home/tomato/LSpider-1.0.0.1/web/spider/controller/spider.py", line 239, in __init__
    self.req = LReq(is_chrome=True)
  File "/home/tomato/LSpider-1.0.0.1/utils/LReq.py", line 37, in __init__
    self.cs = ChromeDriver()
  File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 46, in __init__
    exit(0)
  File "/usr/lib/python3.6/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: 0

Exception ignored in: <bound method ChromeDriver.__del__ of <core.chromeheadless.ChromeDriver object at 0x7f1bb6c546d8>>
Traceback (most recent call last):
  File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 591, in __del__
    self.close_driver()
  File "/home/tomato/LSpider-1.0.0.1/core/chromeheadless.py", line 586, in close_driver
    self.driver.quit()
AttributeError: 'ChromeDriver' object has no attribute 'driver'

opened by LuckyT0mat0 2
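In the traceback above, the final `AttributeError` is a secondary symptom: the browser failed to start, so the `driver` attribute was never assigned, and the destructor then tried to use it. The root cause is the earlier "Chrome failed to start" error. The sketch below shows one general way to make such cleanup tolerant of a failed start. It uses a hypothetical `DriverWrapper` class with an injected `factory` callable and is not the code in `core/chromeheadless.py`.

```python
class DriverWrapper:
    """Owns a browser driver and releases it safely, even after a failed start."""

    def __init__(self, factory):
        # Assign the attribute first, so cleanup code can always read it,
        # even when factory() raises before a driver exists.
        self.driver = None
        self.driver = factory()

    def close_driver(self):
        # Detach the driver before quitting so a second call is a no-op.
        driver, self.driver = self.driver, None
        if driver is not None:
            driver.quit()

    def __del__(self):
        self.close_driver()
```

With this pattern a failed start still surfaces its original exception, but no follow-up `AttributeError` is printed during garbage collection.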
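Passing environment variables to the Docker rabbitmq image is deprecated

rabbitmq keeps failing and restarting. Error output from docker-compose:

rabbitmq | error: RABBITMQ_DEFAULT_PASS is set but deprecated
rabbitmq | error: RABBITMQ_DEFAULT_USER is set but deprecated
rabbitmq | error: RABBITMQ_DEFAULT_VHOST is set but deprecated
rabbitmq | error: deprecated environment variables detected

According to the description in the official image repository, this feature was indeed dropped starting with 3.9.

I pinned the version to 3.8 in docker-compose.yml, which seems to solve the problem. Alternatively, the author could switch to the configuration-file approach recommended for newer versions, hehe.

rabbitmq:
  image: rabbitmq:3.8
  container_name: rabbitmq
  hostname: rabbitmq
  restart: always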

opened by go1f 0

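After setting up with docker, running the command inside the lspider docker environment gives the error below. Could someone tell me what the cause is?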

/opt/LSpider # python3 manage.py SpiderCoreBackendStart --test
[INFO] [MainThread] [03:55:17] [SpiderCoreBackendStart.py:35] [Spider] start test spider.
[INFO] [MainThread] [03:55:17] [rabbitmqhandler.py:39] [Monitor][INIT][Rabbitmq] New Rabbitmq link to rabbitmq
[INFO] [MainThread] [03:55:17] [rabbitmqhandler.py:36] [Monitor][INIT] Rabbitmq init success...
[INFO] [MainThread] [03:55:17] [chromeheadless.py:100] [Chrome Headless] Proxy 127.0.0.1:7777 init
[ERROR] [MainThread] [03:55:17] [chromeheadless.py:45] [Chrome Headless] ChromeDriver load error.
[ERROR] [MainThread] [03:55:17] [SpiderCoreBackendStart.py:47] [Spider] something error, Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 76, in start
    stdin=PIPE)
  File "/usr/local/lib/python3.7/subprocess.py", line 800, in __init__
    restore_signals, start_new_session)
  File "/usr/local/lib/python3.7/subprocess.py", line 1551, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/opt/LSpider/bin/chromedriver': '/opt/LSpider/bin/chromedriver'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/LSpider/core/chromeheadless.py", line 38, in __init__
    self.init_object()
  File "/opt/LSpider/core/chromeheadless.py", line 119, in init_object
    desired_capabilities=desired_capabilities)
  File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
    self.service.start()
  File "/usr/local/lib/python3.7/site-packages/selenium/webdriver/common/service.py", line 83, in start
    os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/LSpider/web/spider/management/commands/SpiderCoreBackendStart.py", line 40, in handle
    spidercore = SpiderCore(test_target_list)
  File "/opt/LSpider/web/spider/controller/spider.py", line 239, in __init__
    self.req = LReq(is_chrome=True)
  File "/opt/LSpider/utils/LReq.py", line 37, in __init__
    self.cs = ChromeDriver()
  File "/opt/LSpider/core/chromeheadless.py", line 46, in __init__
    exit(0)
  File "/usr/local/lib/python3.7/_sitebuiltins.py", line 26, in __call__
    raise SystemExit(code)
SystemExit: 0

Exception ignored in: <function ChromeDriver.__del__ at 0x7f91f2b63680>
Traceback (most recent call last):
  File "/opt/LSpider/core/chromeheadless.py", line 591, in __del__
    self.close_driver()
  File "/opt/LSpider/core/chromeheadless.py", line 586, in close_driver
    self.driver.quit()
AttributeError: 'ChromeDriver' object has no attribute 'driver'

opened by uunnsec 3

Releases(1.0.2)

1.0.2(Feb 22, 2021)

LSpider v1.0.2:
- Added a Web mode to handle passive scanner output
- Added a docker environment for quick setup. Thanks @QGW

1.0.0.1(Jan 26, 2021)

Updated a large amount of related documentation

1.0.0(Jan 20, 2021)

LSpider v1.0.0 released

Owner

Knownsec, Inc.


