SmartScraper: a simple, automatic, and fast Python web scraper

Last update: Apr 16, 2022

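Note: The original developer of AutoScraper, the project SmartScraper is based on, is Alireza Mika; I only changed a little of AutoScraper's code.

SmartScraper makes scraping page data easy. There is no need to learn locator packages such as pyquery or beautifulsoup: we only give it a URL and some sample data, and it learns the rules for locating that data on the page.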

1. Installation

pip install smartscraper

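2. Quick start

2.1 Getting similar results

For example, suppose we want the titles and publication info of the 20 books listed on the Douban Books 小说 (fiction) tag page.

We use the P1 (page 1) link to train two fields: title and publication info.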

from smartscraper import SmartScraper

# URL of the page to train on
url = 'https://book.douban.com/tag/小说?start=0&type=T'

# Define the wanted fields, with one sample value for each
wanted_dict = {"title": ["活着"],
               "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元"]
              }

# Train: learn the rules that locate the wanted_dict values on the page at url
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)

Running the code produces the following results:

{'title': ['活着', 
           '房思琪的初恋乐园', 
           '白夜行', 
           '索拉里斯星', 
           '鄙视',
           ...], 
 'pub': ['余华 / 作家出版社 / 2012-8-1 / 20.00元', 
         '林奕含 / 北京联合出版公司 / 2018-2 / 45.00元', 
         '[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50', 
         '[波] 斯坦尼斯瓦夫·莱姆 / 靖振忠 / 译林出版社 / 2021-8 / 49.00元', 
         '[意] 阿尔贝托·莫拉维亚 / 沈萼梅、刘锡荣 / 江苏凤凰文艺出版社 / 2021-7 / 62.00',
          ...]
}
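
The results above are two parallel lists, which is awkward to work with row by row. A minimal sketch of turning them into one record per book follows. It assumes the two lists are aligned by position and that every pub string ends with publisher / date / price, as in the sample output; the helper names parse_pub and to_records are hypothetical and not part of SmartScraper.

```python
def parse_pub(pub):
    """Split a pub string, reading the last three fields from the right.

    Assumption: the string ends with 'publisher / date / price' and
    everything before that is an author or translator name.
    """
    parts = [p.strip() for p in pub.split("/")]
    if len(parts) < 4:
        # Unexpected layout: keep the raw text instead of guessing.
        return {"people": [], "publisher": None, "date": None,
                "price": None, "raw": pub}
    return {"people": parts[:-3], "publisher": parts[-3],
            "date": parts[-2], "price": parts[-1], "raw": pub}


def to_records(results):
    """Pair each title with its parsed pub info (assumes aligned lists)."""
    return [{"title": title, **parse_pub(pub)}
            for title, pub in zip(results["title"], results["pub"])]


# Two entries copied from the output shown above.
sample = {
    "title": ["活着", "白夜行"],
    "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元",
            "[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50"],
}
records = to_records(sample)
for record in records:
    print(record["title"], record["publisher"], record["date"])
```

Reading the fields from the right keeps the parser working whether or not a translator is listed, since only the number of names at the front varies.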

Now use the scraper we just trained to get titles and publication info from the P2 (page 2) link:

scraper.get_result_similar('https://book.douban.com/tag/小说?start=20&type=T')
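
The P1 and P2 links differ only in the start parameter (0, then 20), so further page links can probably be generated rather than typed by hand. The sketch below assumes a page size of 20 and that later pages follow the same pattern; the function name page_urls is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://book.douban.com/tag/"


def page_urls(tag, pages, page_size=20):
    """Build listing URLs for the first `pages` pages of a tag.

    Assumption: each page shows `page_size` books and is addressed by
    the `start` offset, as in the P1 and P2 links above.
    """
    urls = []
    for page in range(pages):
        query = urlencode({"start": page * page_size, "type": "T"})
        urls.append(f"{BASE}{tag}?{query}")
    return urls


urls = page_urls("小说", 3)
for u in urls:
    print(u)

# With a trained scraper, each URL could then be passed along, e.g.:
# for u in urls:
#     print(scraper.get_result_similar(u))
```

When looping over many pages, pausing between requests (for example with time.sleep) is a sensible courtesy to the site.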

2.2 Saving the model

A trained SmartScraper model can be saved and loaded directly later.

scraper.save('douban_Book.pkl')

Code for loading the model:

scraper.load('douban_Book.pkl')
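
Saving and loading can be combined so that training runs only when no saved model exists yet. The following is a minimal sketch; load_or_build is a hypothetical helper, and it relies only on the build, save, and load calls shown above, taking the scraper object as a parameter.

```python
from pathlib import Path


def load_or_build(scraper, model_path, url, wanted_dict):
    """Load a saved model if the file exists; otherwise train and save.

    Returns the training results when a build happened, or None when
    the model was loaded from disk.
    """
    path = Path(model_path)
    if path.exists():
        scraper.load(str(path))
        return None
    results = scraper.build(url, wanted_dict=wanted_dict)
    scraper.save(str(path))
    return results


# Usage, assuming smartscraper is installed and url / wanted_dict are
# defined as in section 2.1:
# from smartscraper import SmartScraper
# scraper = SmartScraper()
# load_or_build(scraper, 'douban_Book.pkl', url, wanted_dict)
```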

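3. Miscellaneous

3.1 Additional project notes

SmartScraper makes only small modifications (a few lines of code) to AutoScraper, purely to simplify usage.
Original project: https://github.com/alirezamika/autoscraper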

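3.2 Related course

If you come from an economics, management, humanities, or social science background, are new to programming, and face the daunting task of collecting and analyzing large amounts of text data, I personally recommend the video course 《python网络爬虫与文本数据分析》 (Python Web Scraping and Text Data Analysis). As a liberal arts student myself, I also started out knowing nothing, and this course distills five years of experience. I think the explanations are easy to follow o(￣︶￣)o. It covers:

Python basics
Web scraping
Reading data
Introduction to text analysis
Machine learning and text analysis
Applications of text analysis in economics and management research

If you are interested, take a look at 《python网络爬虫与文本数据分析》.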

3.3 Social media

Bilibili: 大邓和他的python
WeChat official account: 大邓和他的python

Owner

DaDeng
