
mogita/douban-crawler


Project Moved

This project has permanently moved to https://gitlab.com/mogita/douban-crawler

Introduction

A dead simple crawler for data scraping from Douban.

Heads-up: this project is under heavy development, and things may change or even break from time to time until further notice. You can track the progress on this Trello board.

Usage

This crawler depends on proxy_pool to make concurrent anonymous requests. It ships with built-in support that makes it easy to customize and run your own instance of the proxy pool service.
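
For reference, here is a minimal sketch of how a request through the pool might look, assuming the default proxy_pool HTTP API (a GET /get endpoint returning JSON with a "proxy" field); this is an illustration, not the crawler's actual code, and the host and URL are placeholders.

# Minimal sketch: ask proxy_pool for one proxy and route a request through it.
# Assumes the default proxy_pool HTTP API (GET /get returning {"proxy": "ip:port", ...}).
import os
import requests

PROXY_POOL_HOST = os.getenv("PROXY_POOL_HOST", "http://localhost:5010")

def get_proxy():
    # Ask the pool for one available proxy, e.g. "1.2.3.4:8080"
    resp = requests.get(f"{PROXY_POOL_HOST}/get", timeout=10)
    return resp.json().get("proxy")

def fetch_via_pool(url):
    proxy = get_proxy()
    if not proxy:
        raise RuntimeError("no proxy available in the pool")
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    return requests.get(url, proxies=proxies, timeout=15)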

Prerequisites

You will need Docker and Docker Compose installed to follow the steps below.

Steps

Without the Proxy Pool

  1. Create a .env file under the project root and add the following line:
WITHOUT_PROXY=yes
  2. Build and start:
# You can add `--no-cache` to always build a clean image
docker-compose build

# You can add `--force-recreate` if you want to recreate the container even when
# the configuration or the image hasn't changed.
docker-compose up -d

Not using proxies might lead to 403 error responses from the source site.
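
To make the effect of this flag concrete, the following is a hypothetical sketch (not this repository's actual code) of how a crawler could bypass the pool when WITHOUT_PROXY is set.

# Hypothetical sketch: request directly when WITHOUT_PROXY=yes, otherwise use a proxy.
import os
import requests

def fetch(url, get_proxy=None):
    if os.getenv("WITHOUT_PROXY", "").lower() == "yes" or get_proxy is None:
        # Request directly from the local network; the source site may answer with 403.
        return requests.get(url, timeout=15)
    proxy = get_proxy()
    return requests.get(
        url,
        proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
        timeout=15,
    )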

With the Proxy Pool

  1. Free IPs just don't work most of the time, so it's highly recommended that you choose a paid proxy provider and tweak the code under the proxy_pool directory to override the fetching logic to suit your needs (a sketch of such an override follows these steps). Taking Zhima (芝麻) HTTP Proxy as an example, create a .env file and put the API endpoint into it:
ZHIMA_PROXY_URL="https://..."
  2. Build and start:
# Start the proxy_pool containers first. You might want to wait a while and check
# the logs of the "douban-crawler-proxy-pool" container to make sure there are IPs
# available in the pool. Once IPs are available, you're good to run the next command.
docker-compose -f docker-compose.proxy.yml up -d

# You can add `--no-cache` to always build a clean image
docker-compose build

# You can add `--force-recreate` if you want to recreate the container even when
# the configuration or the image hasn't changed.
docker-compose up -d
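
The sketch below illustrates the kind of override mentioned in step 1: a custom fetcher that pulls fresh IPs from the Zhima endpoint. It assumes proxy_pool's convention of fetcher functions that yield "ip:port" strings and that the endpoint returns one address per line; adjust both to your provider's actual response format and to the proxy_pool docs.

# Rough sketch of a custom fetcher for proxy_pool (see proxy_pool's docs for the
# exact integration). Assumes ZHIMA_PROXY_URL returns one "ip:port" per line;
# adjust the parsing to your plan's actual response format.
import os
import requests

def zhima_fetcher():
    resp = requests.get(os.getenv("ZHIMA_PROXY_URL"), timeout=10)
    for line in resp.text.splitlines():
        proxy = line.strip()
        if proxy:
            yield proxy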

Development

Prerequisites

It might be more convenient to use Virtualenv or Anaconda to manage the environment, but this differs from case to case, so make sure you know what you're dealing with before going ahead.

Steps

  1. This project depends on proxy_pool for making anonymous requests. You should either set up your own instance or, if you'd rather not, try your luck with the free server set up by the proxy_pool project; be warned that the free IPs have very low usability. To customize the fetching method for different providers, tweak the code under the proxy_pool/fetcher directory (see the fetcher sketch in the Usage section above). Please refer to the proxy_pool docs to learn more.

Edit the .env file to set the proper environment variables:

# Adding the following line will make the scripts show verbose logs
DEBUG=yes

# As I'm using Zhima HTTP Proxy I'll put the API here so proxy_pool/fetcher knows 
# where to get new IPs to refresh the pool. 
ZHIMA_PROXY_URL="https://..."

# Put the host name and port (if needed) here for the "proxy_pool" instance so this
# crawler knows where the pool is.
PROXY_POOL_HOST="https://localhost:5010"

# If you don't need the proxy pool at all, e.g. you want the script to
# make requests directly from your network, you can add the following line and
# go to step 2:
WITHOUT_PROXY=yes
  2. Install dependencies:
pip install -r requirements.txt
  3. Migrate database schemas

First, install the golang-migrate/migrate tool to enable the migrate command. Follow the installation guide here: migrate CLI.

Then run the migration against your database (change the user, password, hostname and port accordingly):

migrate -database "postgres://user:pass@localhost:5432/crawler?sslmode=disable" -path migrations up
  4. Run the scripts in the following sequence:
# First, fetch as many tags as possible
python app.py get_tags

# Second, iterate through tags and fetch the links to the books
python app.py get_book_links

# Lastly, start crawling books from the links
python app.py crawl_books
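
For orientation only, here is a hypothetical sketch of how a three-stage command-line entry point like the one above could dispatch on the sub-command name; it is not this repository's actual app.py.

# Hypothetical illustration of a three-stage CLI dispatcher (not the real app.py).
import sys

def get_tags(): ...        # stage 1: collect tags
def get_book_links(): ...  # stage 2: collect book links per tag
def crawl_books(): ...     # stage 3: crawl book pages from the collected links

COMMANDS = {
    "get_tags": get_tags,
    "get_book_links": get_book_links,
    "crawl_books": crawl_books,
}

if __name__ == "__main__":
    command = sys.argv[1] if len(sys.argv) > 1 else ""
    if command not in COMMANDS:
        sys.exit(f"Usage: python app.py [{'|'.join(COMMANDS)}]")
    COMMANDS[command]()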

License

MIT © mogita
