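feapder is a simple, fast, lightweight crawler framework, built to be quick to develop with, quick at crawling, easy to use, and feature-rich. It supports distributed crawlers, batch crawlers, and multi-template crawlers, and ships with a complete crawler alerting mechanism.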

Last update: Dec 29, 2022

FEAPDER

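Introduction

feapder is a simple, fast, lightweight crawler framework. The name is an abbreviation of fast, easy, air, pro, and spider. It aims to be quick to develop with, quick at crawling, easy to use, and feature-rich, and is the result of four years of dedicated work. It supports lightweight crawlers, distributed crawlers, batch crawlers, and crawler integration, along with a complete crawler alerting mechanism.

Until now it was used only inside the company, where it has collected data from 100+ sources at a rate on the order of ten million records per day. It is now open source for everyone to learn from and discuss.

Pronunciation: [ˈfiːpdə]

Official documentation: http://boris.org.cn/feapder/

Requirements:

Python 3.6.0+
Works on Linux, Windows, macOS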

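Installation

From PyPI:

pip3 install feapder

From Git:

pip3 install git+https://github.com/Boris-code/feapder.git

If the installation fails, see the "Installation problems" page.

Quick start

Create a spider

feapder create -s first_spider

The generated spider code looks like this: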

import feapder


class FirstSpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    FirstSpider().start()

Run it directly and it prints the following:

Thread-2|2021-02-09 14:55:11,373|request.py|get_response|line:283|DEBUG|
                -------------- FirstSpider.parser request for ----------------
                url  = https://www.baidu.com
                method = GET
                body = {'timeout': 22, 'stream': True, 'verify': False, 'headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.93 Safari/537.36'}}


Thread-2|2021-02-09 14:55:11,610|parser_control.py|run|line:415|INFO| parser 等待任务 ...
FirstSpider|2021-02-09 14:55:14,620|air_spider.py|run|line:80|DEBUG| 无任务，爬虫结束

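Feature overview

1. Periodic collection

Periodic crawling is a common requirement, for example fetching a product's sales figures once a day. We call each such period a batch.

The framework supports batch collection and introduces the concept of a batch table, which records the crawl status of every batch in detail.

2. Distributed collection

Distributed collection is essential when dealing with massive amounts of data. The framework supports distributed crawling, and a spider can be restarted at any time without losing tasks.

3. Crawler integration

This feature combines multiple spiders into one, plugin-style. It is typically used for projects that collect from several data sources with the same collection period and the same requirements.

4. Deduplication of massive data

The framework has three built-in deduplication mechanisms. With simple configuration, tasks and data are deduplicated automatically. The module can also be used on its own, and it supports batch deduplication.

Temporary deduplication: about 0.26 s per 10,000 records; deduplicating 100 million records uses about 1.43 GB of memory; an expiry period can be set
In-memory deduplication: about 0.5 s per 10,000 records; deduplicating 100 million records uses about 285 MB of memory
Permanent deduplication: about 3.5 s per 10,000 records; deduplicating 100 million records uses about 285 MB of memory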
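
The 285 MB figure is close to what a Bloom filter needs for 100 million keys. As a rough check, and assuming (the text above does not say so) that the in-memory and permanent modes are Bloom filters tuned for a false-positive rate of 1e-5, the standard sizing formula m = -n ln(p) / (ln 2)^2 lands on nearly the same number. The function name and the error rate are illustrative assumptions, not values taken from feapder.

```python
import math


def bloom_filter_size_mib(n_items: int, false_positive_rate: float) -> float:
    """Return the optimal Bloom filter size in MiB for n_items keys."""
    # Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    bits = -n_items * math.log(false_positive_rate) / (math.log(2) ** 2)
    return bits / 8 / 1024 / 1024


# Assumed parameters: 100 million keys, 1e-5 false-positive rate.
size_mib = bloom_filter_size_mib(100_000_000, 1e-5)
print(f"{size_mib:.1f} MiB")
```

Under these assumptions the result is roughly 286 MiB, which matches the quoted memory use. The larger figure for temporary deduplication would be consistent with a structure that also stores a timestamp per key, though that too is a guess.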

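5. Automatic data storage

Generate an item from the database table, assign values to the item's attributes, and yield it. The framework then writes the items to the database in batches.

6. Debug mode

Spiders support a debug mode in which, by default, data is not stored and task status is not modified. A single task can be debugged in isolation, which makes development easier.

7. Complete alerting mechanism

To keep the data complete, accurate, and timely, the framework has built-in alerts that let us track spider status in real time:

Computes the crawl speed in real time, estimates the remaining time, and predicts whether the crawl will overrun the specified period
Alerts when a spider hangs
Alerts when too many tasks fail, which may be caused by a site template change or by blocking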
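
The overrun prediction in the first alert can be pictured as simple arithmetic on progress so far. The sketch below is a plain-Python illustration under the assumption of a constant crawl speed; `predict_overrun` and its parameters are hypothetical names, not part of feapder.

```python
def predict_overrun(done: int, total: int, elapsed_s: float, period_s: float) -> dict:
    """Estimate crawl speed and remaining time, and flag a likely overrun."""
    if done <= 0 or elapsed_s <= 0:
        # No progress yet, so no speed estimate is possible.
        return {"speed": 0.0, "remaining_s": None, "will_overrun": None}
    speed = done / elapsed_s  # tasks per second so far
    remaining_s = (total - done) / speed
    will_overrun = elapsed_s + remaining_s > period_s
    return {"speed": speed, "remaining_s": remaining_s, "will_overrun": will_overrun}


# 6 hours into a daily batch with 200,000 of 1,000,000 tasks done.
status = predict_overrun(done=200_000, total=1_000_000, elapsed_s=6 * 3600, period_s=24 * 3600)
print(status["will_overrun"])
```

At that pace the remaining 800,000 tasks need another 24 hours, so the batch would finish 6 hours after its period ends and the check reports an overrun.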

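8. Download monitoring

The framework tracks the total number of requests, successes, failures, and parse exceptions, and writes the data points to InfluxDB. Combined with a Grafana dashboard, this makes it easy to keep an eye on the crawl.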

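Community

Official documentation: http://boris.org.cn/feapder/

Knowledge Planet (知识星球):

The group shares practical crawler know-how from time to time, covering topics including but not limited to JS reverse-engineering techniques, crawler framework internals, and crawling techniques in general.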

Comments

Assigning a datetime object to an item attribute is not supported: "Object of type datetime is not JSON serializable"

I created an object with the datetime library, but when the Item is written to MongoDB I get this:

2021-11-30 21:58:04.183 | DEBUG | feapder.core.parser_control:run:464 - parser 等待任务...
2021-11-30 21:58:04.197 | ERROR | feapder.utils.tools:dumps_json:844 - Object of type datetime is not JSON serializable
2021-11-30 21:58:04.201 | DEBUG | feapder.buffer.item_buffer:__add_item_to_db:323 -

One of the JSON payloads is:

[{'_id': 2508352, 'elapsed_mins': 1.0691154666666667, 'first_post_time': '11-30 21:57', 'first_post_time_obj': datetime.datetime(2021, 11, 30, 21, 57)...

Yet the record shows up in MongoDB as expected:

I searched online and found that json.dumps does not support this type. The blunt fix is to convert to str, for example by passing default=str at the relevant call:

json.dumps(my_dictionary, indent=4, sort_keys=True, default=str)

How do you think this should be solved? I would really rather not convert to str. Thanks!
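
The error in the log comes from a JSON dump, while the record itself still reaches MongoDB, so one option is to leave the datetime in the item and only convert it in the serialised copy. The sketch below uses nothing but the standard library; `json_default` is a hypothetical helper, and whether feapder lets you pass such a hook to its own `dumps_json` is not something the report confirms.

```python
import json
from datetime import date, datetime


def json_default(value):
    """Fallback for json.dumps: render dates as ISO 8601, reject anything else."""
    if isinstance(value, (datetime, date)):
        return value.isoformat()
    raise TypeError(f"Object of type {type(value).__name__} is not JSON serializable")


record = {"_id": 2508352, "first_post_time_obj": datetime(2021, 11, 30, 21, 57)}
text = json.dumps(record, default=json_default)
print(text)  # {"_id": 2508352, "first_post_time_obj": "2021-11-30T21:57:00"}
```

Unlike `default=str`, this converts only date types and still raises for anything unexpected, and the original `record` keeps its datetime object for the database driver.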

opened by kk-deng 14
Why does the connection keep failing after I configure a proxy API?

********** feapder begin ********** Thread-5|2021-03-31 15:51:30,281|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加5, 失效0, 当前代理数5, Thread-5|2021-03-31 15:51:30,809|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:33,600|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:33,600|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:33,600|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 10 Thread-5|2021-03-31 15:51:34,604|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:34,604|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:34,604|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 20 Thread-5|2021-03-31 15:51:35,608|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0, Thread-5|2021-03-31 15:51:35,608|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:35,608|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:35,608|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:35,608|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 30 Thread-5|2021-03-31 15:51:36,613|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:36,613|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 40 Thread-5|2021-03-31 15:51:37,617|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:37,617|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:37,617|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 50 Thread-5|2021-03-31 15:51:38,621|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:38,622|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 60 Thread-5|2021-03-31 15:51:39,626|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... 
Thread-5|2021-03-31 15:51:39,626|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 70 Thread-5|2021-03-31 15:51:40,629|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0, Thread-5|2021-03-31 15:51:40,629|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:40,629|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:40,629|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:40,629|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 80 Thread-5|2021-03-31 15:51:41,634|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:41,634|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:41,634|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 90 Thread-5|2021-03-31 15:51:42,638|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:42,638|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 100 Thread-5|2021-03-31 15:51:43,643|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:43,643|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:43,643|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 110 Thread-5|2021-03-31 15:51:44,646|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:44,646|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 120 Thread-5|2021-03-31 15:51:45,649|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0, Thread-5|2021-03-31 15:51:45,649|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:45,650|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:45,650|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:45,650|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 130 Thread-5|2021-03-31 15:51:46,654|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... 
Thread-5|2021-03-31 15:51:46,654|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:46,654|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 140 Thread-5|2021-03-31 15:51:47,659|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:47,659|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 150 Thread-5|2021-03-31 15:51:48,664|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:48,664|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 160 Thread-5|2021-03-31 15:51:49,666|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:49,666|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:49,666|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 170 Thread-5|2021-03-31 15:51:50,669|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0, Thread-5|2021-03-31 15:51:50,669|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:50,669|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:50,669|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:50,670|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 180 Thread-5|2021-03-31 15:51:51,674|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:51,675|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:51,675|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 190 Thread-5|2021-03-31 15:51:52,679|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:52,680|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 200 Thread-5|2021-03-31 15:51:53,682|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:53,682|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... 
Thread-5|2021-03-31 15:51:53,682|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 210 Thread-5|2021-03-31 15:51:54,687|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:54,687|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 220 Thread-5|2021-03-31 15:51:55,693|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0, Thread-5|2021-03-31 15:51:55,693|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:55,693|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:55,693|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:55,693|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 230 Thread-5|2021-03-31 15:51:56,698|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:56,698|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:56,698|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 240 Thread-5|2021-03-31 15:51:57,701|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:57,701|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 250 Thread-5|2021-03-31 15:51:58,706|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:58,706|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:58,706|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 260 Thread-5|2021-03-31 15:51:59,711|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:51:59,711|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 270 Thread-5|2021-03-31 15:52:00,715|proxy_pool.py|reset_proxy_pool|line:664|DEBUG| 重置代理池成功: 获取5, 成功添加0, 失效5, 当前代理数0, Thread-5|2021-03-31 15:52:00,715|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:52:00,715|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... 
Thread-5|2021-03-31 15:52:00,715|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 280 Thread-5|2021-03-31 15:52:01,720|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:52:01,720|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:52:01,720|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 290 Thread-5|2021-03-31 15:52:02,721|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:52:02,722|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:52:02,722|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 300 Thread-5|2021-03-31 15:52:03,723|request.py|get_response|line:272|DEBUG| 暂无可用代理 ... Thread-5|2021-03-31 15:52:03,723|proxy_pool.py|reset_proxy_pool|line:646|DEBUG| 代理池重置的太快了:) 310

opened by Mrqjj 9
URLs are missed when calling feapder.Request recursively
I want to crawl every URL on a target site. The code is as follows:

import feapder


class UrlSpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://example.cn")

    def parse(self, request, response):
        print(response.url)
        url_list = response.re("<a.*?href='(.*?)'")
        for url in url_list:
            # print(url)
            if response.url in url:
                yield feapder.Request(url, callback=self.parse)

Called this way, only some of the URLs in url_list make it into feapder.Request. Setting REQUEST_FILTER_ENABLE = True does not seem to take effect either. Am I using it incorrectly?

PS: I am after the same effect as Scrapy's link extractor.
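
Two details of the snippet could explain dropped URLs, though neither is a confirmed diagnosis: the regex only matches single-quoted href values, and the `response.url in url` test only passes for links that contain the full URL of the current page. A standard-library sketch of a more forgiving extraction step is shown here; `LinkCollector` and `same_site_links` are hypothetical helpers, not feapder APIs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, whatever the quoting style."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url: str, html: str) -> list:
    """Return absolute, de-duplicated links that stay on page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


html = """<a href="/about">About</a> <a href='https://example.cn/news'>News</a>
<a href="https://other.example/x">Elsewhere</a> <a href="/about">About again</a>"""
print(same_site_links("https://example.cn/index.html", html))
```

Comparing hosts rather than whole URLs keeps links from any page of the site, and each returned link could then be yielded as a new request from `parse`.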
opened by KerwinChen 7
Cannot create a spider of a specified type in 1.7.9

feapder create -s first_spider 3
usage: feapder [-h] [-p] [-s] [-i] [-t] [-init] [-j] [-sj] [-c] [--params] [--setting] [--host] [--port] [--username] [--password] [--db]
feapder: error: unrecognized arguments: 3

opened by fgetwewr 6
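Is it possible to open and parse a local HTML file?

The parse functions are very well written and ideal for beginners who want to analyze page data and locate nodes. However, every run has to send a request to the site and then parse the response it returns. Some sites have anti-crawling measures and keep popping up verification prompts that must be handled by hand. While writing code, a beginner needs to fetch the response (the site's HTML) over and over, so I would like to save the HTML locally and have feapder process it from there. Neither the documentation nor the examples I found online mention this, so how should I go about it? Thanks!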

opened by yxnwh 6
requests works, but feapder reports that the browser version is too old, and I don't understand why
import feapder


class CantonfairAirSpider(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.cantonfair.org.cn", verify=True)

    def parse(self, request, response):
        html = response.text
        print(html)


if __name__ == "__main__":
    CantonfairAirSpider().start()
opened by lscool66 6

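Proxy configured according to the documentation has no effect

Solution

I have just solved the problem with the help of this article, and I admire its author's determination to get to the bottom of things: https://zhuanlan.zhihu.com/p/350015032

The problem

Environment

Operating system: Windows 10

Python version: 3.10

feapder code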

import feapder
from feapder import Item 

class ddxpTest(feapder.AirSpider):
    def start_requests(self):
        for i in range(200):
            yield feapder.Request(f"https://www.google.com/#{i}")

    def download_midware(self, request):
        request.proxies = {'http':"http://127.0.0.1:10801",'https':"https://127.0.0.1:10801"}
        return request

    def parse(self, request, response):
        title = response.xpath("//title/text()").extract_first()
        item = Item() 
        item.table_name = "spider_data"  # table name
        item.title = title
        yield item

if __name__ == "__main__":
    ddxpTest(thread_count=30).start()

Excerpt of the error output:

2021-11-18 10:58:27.859 | ERROR    | feapder.core.parser_control:deal_requests:607 - 
                                -------------- ddxpTest.parse error -------------
                                error          HTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', TimeoutError('_ssl.c:980: The handshake operation timed out')))
                                response       None
                                deal request   {
                            "url": "https://www.google.com/#178",
                            "parser_name": "ddxpTest",
                            "proxies": {
                                                        "http": "http://127.0.0.1:10801",
                                                        "https": "https://127.0.0.1:10801",
                                                        "ftp": "ftp://127.0.0.1:10801"
                            }
}

With the proxy enabled, Baidu became unreachable as well:

2021-11-18 10:55:44.536 | ERROR    | feapder.core.parser_control:deal_requests:607 - 
                                -------------- ddxpTest.parse error -------------
                                error          HTTPSConnectionPool(host='www.baidu.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', TimeoutError('_ssl.c:980: The handshake operation timed out')))
                                response       None
                                deal request   {
                            "url": "https://www.baidu.com/#188",
                            "parser_name": "ddxpTest",
                            "proxies": {
                                                        "http": "http://127.0.0.1:10801",
                                                        "https": "https://127.0.0.1:10801",
                                                        "ftp": "ftp://127.0.0.1:10801"
                            }
}

Testing with the requests library through the proxy, Google loads fine.

import requests
proxies = {    'socks5': 'socks5://127.0.0.1:10800',    'socks5': 'socks5://127.0.0.1:10800'}
# proxies = {'http':"http://127.0.0.1:10801",'https':"https://127.0.0.1:10801"}
r = requests.get('http://www.google.com',proxies=proxies)
print(r.text[:30])

Output: <!doctype html><html itemscope
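
The handshake timeouts above are consistent with a behaviour change in urllib3 1.26: an `https://` proxy URL now makes the client open a TLS connection to the proxy itself, which fails against a local port that only speaks plain HTTP. That this is the cause here is an assumption, not something the report states. If so, pointing both keys at an `http://` proxy URL avoids it. `local_http_proxies` below is a hypothetical helper that only builds the mapping.

```python
def local_http_proxies(host: str, port: int) -> dict:
    """Build a requests-style proxies mapping for a proxy that speaks plain HTTP."""
    proxy_url = f"http://{host}:{port}"
    # The key names the scheme of the *target* site; the URL scheme says how
    # to talk to the *proxy*. Both keys therefore use the same http:// URL.
    return {"http": proxy_url, "https": proxy_url}


proxies = local_http_proxies("127.0.0.1", 10801)
print(proxies)
```

In the spider above, this mapping would take the place of the dictionary assigned to `request.proxies` in `download_midware`.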

opened by sirliu 4

How to use playwright in batchspider mode?

I copied the code from test_playwright.py into a BatchSpider script, and it failed with the warning: "It looks like you are using Playwright Sync API inside the asyncio loop. Please use the Async API instead." I know test_playwright works well in the AirSpider demo. Is there a BatchSpider demo that uses Playwright?

opened by lycanthropes 3
A JS file hangs the whole tab during Chrome rendering

There is a problem with Chrome rendering. When selenium + chrome visits https://baijiahao.baidu.com/s?id=1739368224007714423, the browser keeps loading one JS file, which hangs the tab. The browser then stops responding to anything else (refreshing, getting the page source, visiting other URLs, and so on). Could feapder add a parameter to disable loading JS?

If the link above no longer works, try any of the following: https://baijiahao.baidu.com/s?id=1739304797053642547&wfr=spider&for=pc https://baijiahao.baidu.com/s?id=1739377098661725506&wfr=spider&for=pc https://baijiahao.baidu.com/s?id=1739377692137820326&wfr=spider&for=pc https://baijiahao.baidu.com/s?id=1739377788266450267&wfr=spider&for=pc

opened by pgshow 3

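Items are deduplicated incorrectly when saved

Hello. After crawling with the framework and storing the data automatically, I found that only the first record of each batch was saved to MySQL. No error was reported while running.

I have enabled both item deduplication and request deduplication:

# Deduplication
ITEM_FILTER_ENABLE = True  # item deduplication
REQUEST_FILTER_ENABLE = True  # request deduplication

The storage code looks roughly like this:

for keyword_info in response.json["extendList"]['list']:
    keyword = keyword_info['word']
    word_id = keyword_info['word_id']
    keyword_relate = keyword_info['relate']

    # Prepare for storage
    item.keyword = request.keyword
    item.keyword_id = request.keyword_id
    item.recommend_keyword = keyword
    item.recommend_keyword_id = word_id
    item.keyword_relate = keyword_relate

    yield item  # store in MySQL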
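
The snippet does not show where `item` is created. If it is instantiated once before the loop (an assumption), every `yield` hands out the same object, so anything that buffers the items before writing them sees one record repeated, and deduplication would then collapse the batch to a single row. The plain-Python sketch below demonstrates the effect with a hypothetical `Record` class standing in for the item.

```python
class Record:
    """Hypothetical stand-in for an item: a plain attribute bag."""


def reuse_one_object(words):
    record = Record()  # created once, outside the loop
    for word in words:
        record.word = word
        yield record  # every yield hands out the same object


def fresh_object_each_time(words):
    for word in words:
        record = Record()  # a new object per row
        record.word = word
        yield record


words = ["a", "b", "c"]
reused = [r.word for r in list(reuse_one_object(words))]
fresh = [r.word for r in list(fresh_object_each_time(words))]
print(reused)  # ['c', 'c', 'c']
print(fresh)   # ['a', 'b', 'c']
```

If this matches the real code, creating the item inside the `for` loop should give each row its own object.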

opened by ShellMonster 3

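Item storage problem
Hello. While using the framework I ran into the following behaviour, which may be a latent hazard.

I instantiate the Item directly in the parse body and then assign values to its keys. The parse body has some conditional branches, as in the sample code below.

I did not notice at first, because most of the time things went as I expected: fields not assigned in the else branch were filled with None in the database.

Later I found that some records which should have had values for key_a, key_b, and key_c in fact only had the status field set to 1, with everything else empty.

I traced it to the code location shown in the screenshot. My guess at the cause: when several records are stored together and the SQL statement is generated, the keys of the first record in the list are used. If the first record was assigned in the else branch, only one field is used, so only one field is stored for the whole batch. I hope the author can see what I mean.

item = Item()
item.create_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(time.time()))
if 'hhh' in response.text:
    item.status = 1
    item.key_a = response.xpath('...')
    item.key_b = response.xpath('...')
    item.key_c = response.xpath('...')
else:
    item.status = 0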
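
If the reporter's guess is right, the batch needs one consistent column list before the SQL is built. The sketch below shows that idea in plain Python, taking the union of keys across the batch and filling gaps with None. `normalize_rows` is a hypothetical helper and not feapder's actual storage code.

```python
def normalize_rows(rows):
    """Give every row the same columns, filling missing ones with None."""
    # Union of keys across the batch, in first-seen order.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    return columns, [[row.get(col) for col in columns] for row in rows]


batch = [
    {"create_time": "2021-01-01 00:00:00", "status": 0},
    {"create_time": "2021-01-01 00:00:01", "status": 1, "key_a": "a", "key_b": "b", "key_c": "c"},
]
columns, values = normalize_rows(batch)
print(columns)
print(values[0])
```

On the spider side, a simpler workaround with the same effect is to assign every field in both branches, setting key_a, key_b, and key_c to None in the else branch.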
opened by QinJun-1998 3
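mysqldb close_connection problem
Note

Upgrade feapder and make sure you are on the latest version. If the bug persists, describe the problem in detail.

pip install --upgrade feapder

Problem

When the database connection fails, conn and cursor are never returned, so calling close_connection raises an error.

Screenshot

Code

File feapder.db.mysqldb.py

Suggestion: define conn and cursor before the try block, then add checks in close_connection as follows:

def close_connection(self, conn, cursor):
    # Close each handle only if it was actually created
    if cursor:
        cursor.close()
    if conn:
        conn.close()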
opened by danerlt 0
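Strange connection failures when using a proxy

Problem

Under Python 3.8.x, every feapder version has this issue: with certain IP-whitelisted SSL proxies, connections to the proxy server fail with a variety of errors.

There are many kinds of errors. The ones I ran into were 443 errors, 1133 errors, handshake timeouts, and so on.

Solution

After installing all packages, uninstall urllib3 and install a version below 1.26. SSL proxies then work normally.

pip uninstall urllib3
pip install urllib3==1.25.11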

opened by pgshow 0
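Fix response encoding errors
1. Fixed: when the content contains illegal characters, a custom encoding did not take effect. The text was parsed with UnicodeDammit instead, so response.text still came out garbled.
2. Fixed: after UnicodeDammit guessed the encoding and converted the text, encoding, apparent_encoding, and the actual text encoding were reported inconsistently.
3. Fixed: the encoding could not be read from the headers when the Content-Type was text/html.

Test code:

from feapder import Request

print(False or 0)
res2 = Request('''http://part.csmu.edu.cn:82/zcglc/index.php?_m=mod_article&_a=fullist&caa_id=19''').get_response()
res2.encoding = 'utf-8'
print(res2.text)
print(res2.encoding)
print(res2.apparent_encoding)
print(res2.encoding_errors)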
opened by dream2333 0
Discarding the current proxy has no effect: the retried request still uses the dead proxy IP

import feapder
from feapder.network.proxy_pool import ProxyPool
from feapder import Request
from datetime import date
from feapder.db.mysqldb import MysqlDB
import time, re
from utils.tengxunCos import tengxunCos
from utils.langconv import *

proxy_pool = ProxyPool(reset_interval_max=1200, reset_interval=300, check_valid=False)
feapder.Request.proxies_pool = proxy_pool

def parse(self, request, response):
    movie_id = request.movie_id
    if "可播放" in response.text:
        ...  # elided in the original report
    elif "登录跳转" in response.text:
        print("检测到有异常请求从您的IP发出，请登录再试")
        request.proxies_pool.tag_proxy(request.requests_kwargs.get("proxies"), -1)

The line request.proxies_pool.tag_proxy(request.requests_kwargs.get("proxies"), -1) has no effect.

opened by ddzyx 0

Releases(v1.8.4)

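v1.8.4(Dec 3, 2022)
Updates

TaskSpider can check the status of the spiders it depends on, and finishes only after they have finished

The command-line tool supports retry, which re-runs failed requests or items

Failed items can be re-imported

Batch spiders can be configured not to start the next batch automatically

Items support an update method

Bug Fixes

Fixed a bug in task spiders that depend on other spiders

Fixed a GoldUserPool bug

Fixed a bug where a new batch could not start when a dependency spider had not finished

v1.8.3(Nov 4, 2022)
Bug Fixes

Fixed an exception caused by the missing response.browser attribute when a download middleware returns a custom response

Fixed a default-UA bug, and the priority of UA and proxy settings in browser rendering mode

Fixed a selenium browser rendering bug

Adapted to parsel==1.7.0

v1.8.0(Oct 31, 2022)
Updates

Added Playwright support

exception_request and failed_request now receive the exception argument e

AirSpider supports deduplication

After a batch-timeout alert, if a later batch completes, a batch-completed alert is sent to signal that things are back to normal

The default spider concurrency is now 1

Bug Fixes

Fixed the up and down arrow keys not working for the feapder command in PyCharm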

v1.7.9(Aug 9, 2022)

Updates

Browser rendering supports passing more selenium parameters

WEBDRIVER = dict(
    pool_size=1,  # number of browsers
    load_images=True,  # whether to load images
    user_agent=None,  # a string, or a zero-argument function returning the user agent
    proxy=None,  # xxx.xxx.xxx.xxx:xxxx, or a zero-argument function returning the proxy address
    headless=False,  # whether to run headless
    driver_type="CHROME",  # CHROME, PHANTOMJS, FIREFOX
    timeout=30,  # request timeout
    window_size=(1024, 800),  # window size
    executable_path=None,  # browser path; None uses the default path
    render_time=0,  # render time: wait this long after opening the page before reading the source
    custom_argument=[
        "--ignore-certificate-errors",
        "--disable-blink-features=AutomationControlled",
    ],  # custom browser arguments
    xhr_url_regexes=None,  # intercept XHR endpoints; a list of regular expressions
    auto_install_driver=True,  # download the browser driver automatically; supports chrome and firefox
    use_stealth_js=True,  # use stealth.min.js to hide browser fingerprints
    xxxx=xxx,
    xxx2=xxx2
)

Bug Fixes

Fixed a proxy bug in browser rendering mode
Fixed a delete_keys bug

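v1.7.8(Aug 4, 2022)
Updates

The response HTML supports specifying whether links are converted to absolute URLs

Improved the command line, which can now create a TaskSpider

The download method has been split out on its own for easier extension

Improved the tools.del_html_tag function

v1.7.7(Jul 26, 2022)
Updates

AirSpider supports setting the maximum number of tasks cached in the in-memory task queue
# Maximum number of tasks cached in the in-memory task queue; unlimited by default. Only applies to AirSpider.
TASK_MAX_CACHED_SIZE = 0

Added the TaskSpider spider, which encapsulates the logic for fetching seed tasks. It has built-in support for reading tasks from redis or mysql, and other sources can be added through a custom implementation

Bug Fixes

Fixed a request.copy() bug

v1.7.6(Jun 9, 2022)
Bug Fixes

Fixed a bug in the deduplication library

v1.7.5(Jun 7, 2022)
Updates

Removed the lock so that cookies can be produced concurrently

Improved the collector

Changed the default webdriver configuration to avoid selenium being detected

Added Feishu alerts

response supports from_text

Automatic matching of the browser version is enabled by default

Changed the default spider concurrency to 32

Optimized the framework's core scheduling, making it faster while using less CPU

Bug Fixes

Fixed params not being appended to the URL in browser rendering mode

Fixed a redis lock bug

Fixed a serialization error when multiple download_midware functions are specified

v1.7.3(Mar 3, 2022)
Updates

Selenium drivers can be installed automatically

redisdb can report redis usage statistics

feapder supports a zip command that filters out useless files and folders such as .git and .pyc (handy for packaging a project for upload to feaplat)

The command-line tool now reads its input from the clipboard, which solves the problem of content too long to enter in the console

Added the xhr_data function to browser rendering

Bug Fixes

Fixed a redis connection problem in the deduplication library

v1.7.2(Feb 9, 2022)
Updates

Browser rendering mode (chrome) supports specifying the download save path

Improved email alerts: when there are several recipients, all of them are shown in the recipient field

Crawler integration supports passing parameters

Browser rendering mode supports intercepting XHR data

Bug Fixes

Fixed known issues in metrics monitoring

v1.7.1(Dec 22, 2021)
Updates

The cookie pool has been replaced by a user pool, which is easier to use. See https://boris.org.cn/feapder/#/source_code/UserPool

Compatible with MariaDB

A pgsql storage pipeline is provided as an extension. See https://github.com/Boris-code/feapder_pipelines

Bug Fixes

Fixed a mongo update bug

Fixed an item creation bug

v1.7.0(Nov 18, 2021)
Updates

Adapted to Python 3.10

Strengthened the time-formatting utility functions

v1.6.9(Oct 19, 2021)
Updates

Improved the LoginCookiePool cookie pool

mongo supports connecting by URL

mongodb: more robust extraction of the update condition when updating data

Bug Fixes

mysql: fixed a logic bug when to_json and limit=1 are used together

v1.6.8(Sep 22, 2021)
Updates

pipelines support a close method

Bug Fixes

Fixed a bug in the mongo pipeline when updating data

v1.6.7(Sep 13, 2021)
Updates

Improved the redis lock

Added cookie pool support

response supports reassigning text, for cases where browser rendering reloads the page source

log supports method hints

Exceptions in the framework's main thread are caught, so that a crashed thread cannot hang the spider

Finer-grained deduplication settings

Changed the main-function startup template

request supports getting the proxy and UA

The command line supports feapder create --params

Generated items specify table_name explicitly, to prevent errors when the table name is extracted automatically

Multiple download middlewares can be specified

Bug Fixes

Fixed the transaction parameter no longer being supported on redis clusters

v1.6.6(Aug 23, 2021)
Updates

The RedisDB wrapper supports calling all native redis methods

Improved the PerfectDict dictionary wrapper

The parameter that keeps a spider resident was renamed from auto_stop_when_spider_done to keep_alive; auto_stop_when_spider_done is still accepted

Failed data writes are retried automatically. Once the maximum number of retries is exceeded, the data is recorded in redis so that nothing is lost (not supported by AirSpider)

Without redis, alert rate limiting falls back to memory

v1.6.3(Aug 8, 2021)
Updates

Integrated metrics monitoring

Hides browser fingerprints

Added mobile request headers

Bug Fixes

Fixed a camelCase-to-snake_case conversion problem

Fixed the problem of a single mongodb bulk insert not being allowed to exceed 16 MB

Fixed SQL syntax errors caused by unusual table names when building SQL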

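v1.6.1(Jul 20, 2021)

Updates

DingTalk alerts support mentioning everyone

# DingTalk alerts
DINGDING_WARNING_URL = ""  # DingTalk robot API
DINGDING_WARNING_PHONE = ""  # people to alert; accepts a list, so several can be specified
DINGDING_WARNING_ALL = False  # whether to mention everyone; default False

Bug Fixes

Fixed SQL errors when building an update statement whose data contains single quotes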

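v1.6.0(Jul 15, 2021)
Updates

Strengthened response decoding

v1.5.9(Jul 15, 2021)
Updates

Batch spiders used to wait 1 minute when dispatching the tasks of a new batch, so that the batch time cached inside running spiders had time to refresh. The wait is now decided from the number of running spiders

Fixed lock acquisition failing because the lock was not released after an unexpected exit while dispatching tasks

After an unexpected exit, the task-loss-prevention policy used to mean a 10-minute wait before tasks could be fetched. The wait is now decided from the number of running spiders

Strengthened user_agent_pool: the UA type can be specified, including 'chrome', 'opera', 'firefox', 'internetexplorer', and 'safari'

v1.5.8(Jul 8, 2021)
Updates

Browser rendering mode supports closing the browser

def parse(self, request, response):
    response.close_browser(request)

After closing, a new browser instance is opened automatically

Bug Fixes

Fixed custom settings not taking effect for proxies

v1.5.7(Jul 6, 2021)
Bug Fixes

Fixed a bug in the redis zrangebyscore function

redis supports bool values

v1.5.6(Jul 2, 2021)
Updates

Temporary deduplication periodically deletes expired values

The misspelling eamil was corrected to email, and the email-alert keys in the settings file changed accordingly. When upgrading, remember to update the settings file in your own project

# Email alerts
EMAIL_SENDER = ""  # sender
EMAIL_PASSWORD = ""  # authorization code
EMAIL_RECEIVER = ""  # recipient; accepts a list, so several can be specified
EMAIL_SMTPSERVER = "smtp.163.com"  # mail server; defaults to 163 mail

Bug Fixes

Fixed custom settings not taking effect for alerts and proxies

v1.5.5(Jun 25, 2021)
Bug Fixes

Fixed a mongodb cursor bug that caused incomplete query results

Fixed inaccurate time extraction in format_time

v1.5.4(Jun 18, 2021)
Updates

Improved the tools.format_time function

v1.5.3(Jun 18, 2021)
Updates

Logs support colored output and more configuration options

Strengthened the tools.format_time function

v1.5.2(May 26, 2021)
Updates

The download interval can be randomized

Generated projects include a spider document and a data validation document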

v1.5.1(May 23, 2021)

Updates

Spiders can be run repeatedly

Example

import feapder


class AirSpiderDemo(feapder.AirSpider):
    def start_requests(self):
        yield feapder.Request("https://www.baidu.com")

    def parse(self, request, response):
        print(response)


if __name__ == "__main__":
    # Run in a loop: start the next run as soon as the current one ends
    spider = AirSpiderDemo()
    while True:
        spider.start()
        spider.join()  # wait for it to finish

    # Start 10 spiders directly
    # for i in range(10):
    #     spider = AirSpiderDemo()
    #     spider.start()

v1.5.0(May 13, 2021)
Bug Fixes

Fixed browser rendering not converting links to absolute URLs automatically

v1.4.9(May 10, 2021)
Updates

Email alerts support a custom mail server

Tidied up the settings file

Owner

boris

Crawler engineer, data engineer

GitHub Repository https://boris.org.cn/feapder

