SmartScraper: a simple, automatic, and fast Python web scraper

Last update: Apr 16, 2022

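Note: The original developer is Alireza Mika, the author of AutoScraper. SmartScraper changes only a small amount of AutoScraper's code.

SmartScraper makes scraping page data easy. You no longer need to learn locator packages such as pyquery or beautifulsoup; you only provide a URL and some sample data, and SmartScraper learns the rules for locating that data in the page.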

1. Installation

pip install smartscraper

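2. Quick start

2.1 Getting similar results

For example, suppose we want to get the titles and publication info of 20 books from the Douban Books fiction (小说) tag page.

We use the P1 link (the first page) to train two fields: the title and the publication info.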

from smartscraper import SmartScraper

# URL of the page to train on
url = 'https://book.douban.com/tag/小说?start=0&type=T'

# Define the wanted fields, each with one sample value taken from the page
wanted_dict = {"title": ["活着"],
               "pub": ["余华 / 作家出版社 / 2012-8-1 / 20.00元"]
              }

# Train: find the rules that locate the wanted_dict values in the page at url
scraper = SmartScraper()
results = scraper.build(url, wanted_dict=wanted_dict)
print(results)

Running the code produces the following results:

{'title': ['活着', 
           '房思琪的初恋乐园', 
           '白夜行', 
           '索拉里斯星', 
           '鄙视',
           ...], 
 'pub': ['余华 / 作家出版社 / 2012-8-1 / 20.00元', 
         '林奕含 / 北京联合出版公司 / 2018-2 / 45.00元', 
         '[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50', 
         '[波] 斯坦尼斯瓦夫·莱姆 / 靖振忠 / 译林出版社 / 2021-8 / 49.00元', 
         '[意] 阿尔贝托·莫拉维亚 / 沈萼梅、刘锡荣 / 江苏凤凰文艺出版社 / 2021-7 / 62.00',
          ...]
}
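
The results above are two parallel lists, so a common next step is to pair each title with its publication string. The following is a minimal sketch, not part of SmartScraper: the helper names `parse_pub` and `to_records` are hypothetical, and it assumes that the two lists are aligned by position and that each `pub` string follows the "people / publisher / date / price" layout seen in the sample output.

```python
def parse_pub(pub):
    """Split a 'people / publisher / date / price' string into fields.

    Assumption: the last three slash-separated parts are publisher, date,
    and price; everything before them is authors and translators.
    """
    parts = [p.strip() for p in pub.split('/')]
    if len(parts) < 4:
        # Unexpected layout: keep the raw string rather than guess.
        return {'people': [], 'publisher': None, 'date': None,
                'price': None, 'raw': pub}
    *people, publisher, date, price = parts
    return {'people': people, 'publisher': publisher, 'date': date,
            'price': price, 'raw': pub}


def to_records(results):
    """Pair titles and pub strings by position into one dict per book."""
    return [{'title': title, **parse_pub(pub)}
            for title, pub in zip(results['title'], results['pub'])]


# Sample values copied from the output shown above.
sample = {
    'title': ['活着', '白夜行'],
    'pub': ['余华 / 作家出版社 / 2012-8-1 / 20.00元',
            '[日] 东野圭吾 / 刘姿君 / 南海出版公司 / 2013-1-1 / CNY 39.50'],
}
records = to_records(sample)
for record in records:
    print(record['title'], record['publisher'], record['price'])
```

If the two lists ever differ in length, `zip` silently stops at the shorter one, so it is worth comparing their lengths before trusting the pairing.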

Use the scraper we just trained to try getting the titles and publication info from the P2 link (the second page):

scraper.get_result_similar('https://book.douban.com/tag/小说?start=20&type=T')
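
The P1 and P2 links differ only in the `start` query parameter (0 and 20). A minimal sketch for generating such page URLs follows; the helper name `page_url` is hypothetical, and it assumes that later pages keep the same step of 20, which the source only shows for the first two pages.

```python
BASE_URL = 'https://book.douban.com/tag/小说'


def page_url(page, page_size=20):
    """Return the listing URL for a 1-based page number.

    Assumption: each page advances the `start` parameter by page_size.
    """
    if page < 1:
        raise ValueError('page must be 1 or greater')
    start = (page - 1) * page_size
    return f'{BASE_URL}?start={start}&type=T'


# URLs for the first three pages; each one could be passed to
# scraper.get_result_similar(...) in turn.
urls = [page_url(page) for page in range(1, 4)]
for u in urls:
    print(u)
```

When looping over many pages, pausing between requests keeps the load on the site low.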

2.2 Saving the model

A trained SmartScraper model can be saved and then loaded directly later.

scraper.save('douban_Book.pkl')

Code to load the model:

scraper.load('douban_Book.pkl')
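
Saving and loading can be combined into a "load if present, otherwise train" step. The sketch below is an illustration only: the function name `load_or_train` is hypothetical, and it relies solely on the `build`, `save`, and `load` calls shown in this document, so any object offering those three methods works with it.

```python
import os


def load_or_train(scraper, model_path, url, wanted_dict):
    """Load saved rules if model_path exists; otherwise train and save.

    Returns True when the rules were loaded from disk and False when the
    scraper was freshly trained.
    """
    if os.path.exists(model_path):
        scraper.load(model_path)
        return True
    scraper.build(url, wanted_dict=wanted_dict)
    scraper.save(model_path)
    return False


# Intended usage (requires smartscraper and network access):
# from smartscraper import SmartScraper
# scraper = SmartScraper()
# load_or_train(scraper, 'douban_Book.pkl', url, wanted_dict)
```

The first run trains on the page and writes the file; later runs skip training and reuse the saved rules.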

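3. Miscellaneous

3.1 Additional project notes

SmartScraper makes only a small modification (a few lines of code) to AutoScraper, purely to simplify usage.
Original project: https://github.com/alirezamika/autoscraper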

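3.2 Related course

If you come from an economics, management, humanities, or social science background, are new to programming, and face the daunting task of collecting, processing, and analyzing large volumes of text data, I personally recommend the video course 《python网络爬虫与文本数据分析》 (Python Web Scraping and Text Data Analysis). As a liberal arts student myself, I also started out knowing nothing, and this course condenses five years of experience. I believe it is explained in a very accessible way o(￣︶￣)o. It covers:

Python basics
Web scraping
Reading data
Introduction to text analysis
Machine learning and text analysis
Applications of text analysis in economics and management research

If you are interested, take a look at 《python网络爬虫与文本数据分析》.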

3.3 Social media

Bilibili: 大邓和他的python
WeChat official account: 大邓和他的python

Owner

DaDeng