A crawler of doubamovie

Overview

豆瓣电影

A crawler of doubamovie

一个小小的入门级scrapy框架的应用,选取豆瓣电影对排行榜前1000的电影数据进行爬取。

spider.py

start_requests方法为scrapy的方法,我们对它进行重写。

def start_requests(self):
    # 将start_url中的链接通过for循环进行遍历。
    for url in self.start_urls:
        # 通过yield发送Request请求。
        # 这里的Reques注意是scrapy下的Request类。注意不到导错类了。
        # 这里的有3个参数:
        #        1、url为遍历后的链接
        #        2、callback为发送完请求后通过什么方法进行处理,这里通过parse方法进行处理。
        #        3、如果网站设置了防爬措施,需要加上headers伪装浏览器发送请求。

        yield scrapy.Request(url=url, callback=self.parse, headers=self.headler)

使用scrapy的css选择器,定位选取范围div.item

重写parse对start_request()请求到的数据进行处理,通过yield对爬取到的网页数据进行封装

def parse(self, response):
    # 这里使用scrapy的css选择器,既然数据在class为item的div下,那么把选取范围定位div.item
    for quote in response.css('div.item'):
        # 通过yield对网页数据进行循环抓取
        yield {
            # 抓取排名、电影名、导演、主演、上映日期、制片国家 / 地区、类型,评分、评论数量、一句话评价以及电影链接
            "电影链接": quote.css('div.info div.hd a::attr(href)').extract_first(),
            "排名": quote.css('div.pic em::text').extract(),
            "电影名": quote.css('div.info div.hd a span.title::text')[0].extract(),
            "上映年份": quote.css('div.info div.bd p::text')[1].extract().split('/')[0].strip(),
            "制片国家": quote.css('div.info div.bd p::text')[1].extract().split('/')[1].strip(),
            "类型": quote.css('div.info div.bd p::text')[1].extract().split('/')[2].strip(),
            "评分": quote.css('div.info div.bd div.star span.rating_num::text').extract(),
            "评论数量": quote.css('div.info div.bd div.star span::text')[1].re(r'\d+'),
            "引言": quote.css('div.info div.bd p.quote span.inq::text').extract(),
        }
    next_url = response.css('div.paginator span.next a::attr(href)').extract()
    if next_url:
        next_url = "https://movie.douban.com/top250" + next_url[0]
        print(next_url)
        yield scrapy.Request(next_url, headers=self.headler)

pipelines.py

将其存入mongodb数据库中,不用提前创建表。

def __init__(self):
    self.client = pymongo.MongoClient('localhost', 27017)
    scrapy_db = self.client['doubanmovie']  # 创建数据库
    self.coll = scrapy_db['movie']  # 创建数据库中的表格

def process_item(self, item, spider):
    self.coll.insert_one(dict(item))
    return item

def close_spider(self, spider):
    self.client.close()

item.py

数据封装

url = scrapy.Field()
name = scrapy.Field()
order = scrapy.Field()
content = scrapy.Field()
contentnum = scrapy.Field()
year = scrapy.Field()
country = scrapy.Field()
score = scrapy.Field()
vary = scrapy.Field()
Owner
Cats without dried fish
Cats without dried fish
A web scraper that exports your entire WhatsApp chat history.

WhatSoup 🍲 A web scraper that exports your entire WhatsApp chat history. Table of Contents Overview Demo Prerequisites Instructions Frequen

Eddy Harrington 87 Jan 06, 2023
Generate a repository with mirror links for DriveDroid app

DriveDroid Repository Generator Generate a repository for the app that allow boot a PC using ISO files stored on your Android phone Check also an offi

Evgeny 11 Nov 19, 2022
Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes.

Pyrics Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes. ./test/run.py provides the full function in terminal cmd

MisterDK 1 Feb 12, 2022
Scrape and display grades onto the console

WebScrapeGrades About The Project This Project is a personal project where I learned how to webscrape using python requests. Being able to get request

Cyrus Baybay 1 Oct 23, 2021
Scrape puzzle scrambles from csTimer.net

Scroodle Selenium script to scrape scrambles from csTimer.net csTimer runs locally in your browser, so this doesn't strain the servers any more than i

Jason Nguyen 1 Oct 29, 2021
爬虫案例合集。包括但不限于《淘宝、京东、天猫、豆瓣、抖音、快手、微博、微信、阿里、头条、pdd、优酷、爱奇艺、携程、12306、58、搜狐、百度指数、维普万方、Zlibraty、Oalib、小说、招标网、采购网、小红书》

lxSpider 爬虫案例合集。包括但不限于《淘宝、京东、天猫、豆瓣、抖音、快手、微博、微信、阿里、头条、pdd、优酷、爱奇艺、携程、12306、58、搜狐、百度指数、维普万方、Zlibraty、Oalib、小说网站、招标采购网》 简介: 时光荏苒,记不清写了多少案例了。

lx 793 Jan 05, 2023
This is a python api to scrape search results from a url.

googlescrape Installation Installation is simple! # Stable version pip install googlescrape Examples from googlescrape import client scrapeClient=cli

1 Dec 15, 2022
Web Scraping images using Selenium and Python

Web Scraping images using Selenium and Python A propos de ce document This is a markdown document about Web scraping images and videos using Selenium

Nafaa BOUGRAINE 3 Jul 01, 2022
Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

Footballmapies - Football mapies for learning webscraping and use of gmplot module in python

1 Jan 28, 2022
Automated data scraper for Thailand COVID-19 data

The Researcher COVID data Automated data scraper for Thailand COVID-19 data Accessing the Data 1st Dose Provincial Vaccination Data 2nd Dose Provincia

Porames Vatanaprasan 31 Apr 17, 2022
SkyScrapers: A collection of variety of Scraping Apps

SkyScrapers Collection of variety of Web Scraping Apps The web-scrapers involved

Biplov Pokhrel 3 Feb 17, 2022
Facebook Group Scraping Using Beautiful Soup & Selenium

Extract Facebook group posts that are related to a specific topic and write them to a .json file.

Fatima Ghadieh 14 Aug 12, 2022
A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Hesam N 1 Dec 19, 2021
Web scraping library and command-line tool for text discovery and extraction (main content, metadata, comments)

trafilatura: Web scraping tool for text discovery and retrieval Description Trafilatura is a Python package and command-line tool which seamlessly dow

Adrien Barbaresi 704 Jan 06, 2023
Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a bot

Aliexpress to telegram post Python script that reads Aliexpress offers urls from a Excel filename (.csv) and post then in a Telegram channel using a b

Fernando 6 Dec 06, 2022
Scrape data on SpaceX: Capsules, Rockets, Cores, Roadsters, SpaceX Info

SpaceX Sofware I developed software to scrape data on SpaceX: Capsules, Rockets, Cores, Roadsters, SpaceX Info to use the software you need Python a

Maxence Rémy 16 Aug 02, 2022
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

Aditya Gupta 15 May 17, 2022
Command line program to download documents from web portals.

command line document download made easy Highlights list available documents in json format or download them filter documents using string matching re

16 Dec 26, 2022
Consulta de CPF e CNPJ na Receita Federal com Web-Scraping

Repositório contendo scripts Python que realizam a consulta de CPF e CNPJ diretamente no site da Receita Federal.

Josué Campos 5 Nov 29, 2021
Python Web Scrapper Project

Web Scrapper Projeto desenvolvido em python, sobre tudo com Selenium, BeautifulSoup e Pandas é um web scrapper que puxa uma tabela com as principais e

Jordan Ítalo Amaral 2 Jan 04, 2022