🐞 Douban Movie / Douban Book Scrapy

Last update: Dec 03, 2022

Overview

ScrapyDouban

A Python 3 based Scrapy crawler for Douban Movie and Douban Book. It downloads covers, crawls data, and stores comments in the database.

I maintain this project to share some of my practice with Scrapy. It covers about 80% of what I know about Scrapy, and I hope it helps people who are learning it. Please note that the project currently uses Scrapy 2.5.0.

Docker

The project contains three containers: douban_scrapyd, douban_db, and douban_adminer.

The douban_scrapyd container is based on python:3.9-slim-buster. The Python 3 libraries installed by default are scrapy, scrapyd, pymysql, pillow, and arrow. Port 6800 is mapped by default (6800:6800) so that you can reach the scrapyd management interface at host IP:6800. Login credentials: username scrapyd, password public.

The douban_db container is based on mysql:8. The root password is public, and on first initialization the docker/mysql/douban.sql file is imported into the douban database.

The douban_adminer container is based on adminer:4. Port 8080 is mapped by default (8080:8080) so that you can reach the database management interface at host IP:8080. Login parameters: server mysql, username root, password public.
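The connection parameters above can be collected in one place. The following is a minimal sketch, not the project's own code: it assumes the database is reachable under the host name mysql (as the Adminer login suggests) on the MySQL default port 3306, and the utf8mb4 charset is also an assumption.

```python
# Connection parameters taken from the Docker section above.
# Assumption: the host name "mysql" resolves inside the compose network,
# as suggested by the Adminer login (server: mysql).
DB_SETTINGS = {
    "host": "mysql",
    "port": 3306,          # MySQL default port; assumed, not stated in this README
    "user": "root",
    "password": "public",
    "database": "douban",
    "charset": "utf8mb4",  # assumed; a common choice for Chinese text
}


def connect(settings=DB_SETTINGS):
    """Open a pymysql connection using the settings above."""
    # Imported lazily so this module can be loaded without pymysql installed.
    import pymysql

    return pymysql.connect(**settings)
```

Calling connect() is only expected to work from a place where the mysql host name resolves, for example from inside the douban_scrapyd container.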

Project SQL

The path to the SQL file used by the project is docker/mysql/douban.sql.

Collection Process

First collect Subject IDs --> then crawl each detail page by Subject ID to collect data --> finally collect comments by Subject ID.
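The three stages can be sketched as an ordered list of commands. This is an illustrative helper, not part of the project: the function names crawl_commands and run_all are hypothetical, and the spider names follow the movie_/book_ pattern used in the usage commands of this README.

```python
import subprocess

# Collection order described above: IDs first, then details, then comments.
STAGES = ("subject", "meta", "comment")


def crawl_commands(kind):
    """Return the scrapy commands for one content type, in collection order."""
    if kind not in ("movie", "book"):
        raise ValueError(f"unknown kind: {kind!r}")
    return [["scrapy", "crawl", f"{kind}_{stage}"] for stage in STAGES]


def run_all(kind, cwd="/srv/ScrapyDouban/scrapy"):
    """Run the three spiders one after another, stopping at the first failure."""
    for command in crawl_commands(kind):
        subprocess.run(command, cwd=cwd, check=True)
```

Running the stages sequentially matters because the later spiders depend on the Subject IDs collected by the first one.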

Usage

$ git clone https://github.com/xjia77/ScrapyDouban.git
# Build and run containers
$ cd ./ScrapyDouban/docker
$ sudo docker-compose up --build -d
# Enter the douban_scrapyd container
$ sudo docker exec -it douban_scrapyd bash
# Enter the scrapy project directory
$ cd /srv/ScrapyDouban/scrapy
$ scrapy list
# Crawl movie data
$ scrapy crawl movie_subject # collect movie Subject IDs
$ scrapy crawl movie_meta # collect movie data
$ scrapy crawl movie_comment # collect movie comments
# Crawl book data
$ scrapy crawl book_subject # collect book Subject IDs
$ scrapy crawl book_meta # collect book data
$ scrapy crawl book_comment # collect book comments

If you want to change the code more easily while testing, you can mount the project's scrapy directory into the douban_scrapyd container. If you are used to working with scrapyd, you can deploy the project directly to the douban_scrapyd container via scrapyd-client.
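Once a project is deployed to scrapyd, a spider run can be scheduled through scrapyd's schedule.json endpoint. The sketch below only builds the request and does not send it. The helper name build_schedule_request is hypothetical, and the project name douban and the host localhost are assumptions; the port and credentials come from the Docker section.

```python
import base64
import urllib.parse
import urllib.request


def build_schedule_request(spider, project="douban", host="localhost", port=6800,
                           username="scrapyd", password="public"):
    """Build (but do not send) a POST request for scrapyd's schedule.json."""
    url = f"http://{host}:{port}/schedule.json"
    # scrapyd expects form-encoded "project" and "spider" fields.
    data = urllib.parse.urlencode({"project": project, "spider": spider}).encode()
    request = urllib.request.Request(url, data=data, method="POST")
    # HTTP basic auth header built from the scrapyd login.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    return request
```

Passing the returned object to urllib.request.urlopen would submit the job; replace the project name with whatever name you used when deploying.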

Proxy IP

Because of Douban's anti-crawler mechanism, using proxy IPs is currently the only way around it. The ProxyMiddleware middleware is not enabled in the default settings.py. If you really need Douban data for research, you can rent a paid proxy pool.
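For readers unfamiliar with how a proxy middleware works in Scrapy, here is a minimal sketch. It is not the project's ProxyMiddleware: the class name RandomProxyMiddleware and the PROXY_LIST setting are hypothetical. It relies on the Scrapy convention that a request's meta["proxy"] value selects the proxy used for that request.

```python
import random


class RandomProxyMiddleware:
    """Hypothetical downloader middleware that assigns a proxy to each request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical setting name, not one defined by Scrapy.
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware honors request.meta["proxy"].
        if self.proxies:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # returning None lets Scrapy continue handling the request
```

A middleware like this would be switched on through the DOWNLOADER_MIDDLEWARES setting in settings.py, with PROXY_LIST filled from your proxy provider.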

Image download

douban.pipelines.CoverPipeline handles cover downloads and decides whether to act by checking spider.name. Downloaded image files are saved to the /srv/ScrapyDouban/storage directory of the douban_scrapyd container.
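Filtering by spider.name inside a pipeline can be sketched as follows. This is not the actual CoverPipeline: the class name, the set of spider names, and the cover_requested marker are all assumptions made for illustration, and the real download logic is replaced by a placeholder.

```python
# Assumption: only the meta spiders produce items that carry a cover image.
COVER_SPIDERS = {"movie_meta", "book_meta"}


def wants_cover(spider_name, cover_spiders=COVER_SPIDERS):
    """Return True when items from this spider should trigger a cover download."""
    return spider_name in cover_spiders


class CoverFilterSketch:
    """Hypothetical item pipeline that only acts on selected spiders."""

    def process_item(self, item, spider):
        if not wants_cover(spider.name):
            return item  # other spiders' items pass through untouched
        # Placeholder for the download step; a real pipeline would fetch
        # the image here and store it under the storage directory.
        item["cover_requested"] = True
        return item
```

Keeping the check in one small function makes it easy to adjust which spiders trigger downloads without touching the pipeline body.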

Owner

Xingbo Jia
