Project moved to https://gitlab.com/mogita/douban-crawler permanently
A dead simple crawler for scraping data from Douban.
Heads-up: this project is under heavy development and things may change or even break from time to time until further notice. You can track the progress on this Trello board.
This crawler depends on `proxy_pool` to make concurrent anonymous requests, and ships with built-in support that makes it easy to customize and run your own instance of the proxy pool service.
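To give an idea of how the pool gets consumed, here is a minimal sketch that fetches one proxy from a running `proxy_pool` instance and routes a single request through it. The `/get/` endpoint and the JSON shape with a `proxy` field are assumptions based on `proxy_pool`'s default HTTP API on port 5010; verify them against the docs of the version you run.

```python
# Minimal sketch: grab one proxy from a local proxy_pool instance and route
# a single request through it. The endpoint shape ("/get/" returning JSON
# with a "proxy" field) is an assumption -- check your proxy_pool version.
import requests

POOL_HOST = "http://localhost:5010"

def get_proxy():
    # proxy_pool is assumed to answer with something like {"proxy": "ip:port"}
    return requests.get(f"{POOL_HOST}/get/", timeout=5).json().get("proxy")

if __name__ == "__main__":
    proxy = get_proxy()
    resp = requests.get(
        "https://book.douban.com",
        proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
        timeout=10,
    )
    print(resp.status_code)
```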
- Create a `.env` file under the project root and add the following line:

```
WITHOUT_PROXY=yes
```
- Build and start:

```bash
# You can add `--no-cache` to always build a clean image
docker-compose build

# You can add `--force-recreate` if you want to drop the container even when
# the configuration or the image hasn't changed.
docker-compose up -d
```
Not using proxies might lead to 403 error responses from the source site.
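If you do run into 403s, one common mitigation is to retry with a fresh proxy from the pool. A rough, self-contained sketch, again assuming the hypothetical `/get/` endpoint from the example above:

```python
import requests

POOL_HOST = "http://localhost:5010"

def get_proxy():
    # Same hypothetical proxy_pool endpoint as in the earlier sketch.
    return requests.get(f"{POOL_HOST}/get/", timeout=5).json().get("proxy")

def fetch_with_retry(url, retries=3):
    # Try up to `retries` different proxies before giving up.
    for _ in range(retries):
        proxy = get_proxy()
        try:
            resp = requests.get(
                url,
                proxies={"http": f"http://{proxy}", "https": f"http://{proxy}"},
                timeout=10,
            )
            if resp.status_code != 403:
                return resp
        except requests.RequestException:
            pass  # bad proxy, move on to the next one
    raise RuntimeError("all retries exhausted")
```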
- Free IPs just don't work most of the time. It's highly recommended that you choose a paid proxy provider and tweak the code under the `proxy_pool` directory to override the functionality to suit your needs (see the fetcher sketch after these steps). Taking Zhima (芝麻) HTTP Proxy as an example, create a `.env` file and put the API endpoint into it:

```
ZHIMA_PROXY_URL="https://..."
```
- Build and start:

```bash
# Start the proxy_pool containers first. You might want to wait for a while
# to make sure there are IPs available in the pool by looking at the logs of
# the "douban-crawler-proxy-pool" container. Once IPs are available you're
# good to go on to the next command.
docker-compose -f docker-compose.proxy.yml up -d

# You can add `--no-cache` to always build a clean image
docker-compose build

# You can add `--force-recreate` if you want to drop the container even when
# the configuration or the image hasn't changed.
docker-compose up -d
```
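As mentioned in the first step above, overriding the fetcher mostly boils down to adding a function that pulls fresh IPs from your provider's API. Here is a minimal sketch for a Zhima-style provider; the function name is made up, the plain-text one-`ip:port`-per-line response format is an assumption about the provider, and the exact registration mechanism should be taken from `proxy_pool`'s fetcher docs:

```python
import os

import requests

def fetch_zhima_proxies():
    # ZHIMA_PROXY_URL comes from your .env file. The plain-text,
    # one-"ip:port"-per-line response format is an assumption about the
    # provider -- adapt the parsing to whatever your plan actually returns.
    url = os.environ["ZHIMA_PROXY_URL"]
    resp = requests.get(url, timeout=10)
    for line in resp.text.splitlines():
        line = line.strip()
        if line:
            yield line
```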
To run the crawler manually (without Docker), you'll need:

- Python 3 with `pip`
- PostgreSQL
- Redis
- `proxy_pool`
It might be more convenient to use Virtualenv or Anaconda to manage the environment, but this differs from case to case, so know what you're dealing with before going ahead.
- This project depends on `proxy_pool` for making anonymous requests. Either set up your own instance or, if you're feeling lazy, try your luck with the free server set up by the `proxy_pool` project, with the warning that the free IPs have very low usability. To customize the fetching method for different providers, tweak the code under the `proxy_pool/fetcher` directory. Please refer to the `proxy_pool` docs to learn more.
- Edit the `.env` file to set the proper environment variables:

```
# Adding the following line will make the scripts show verbose logs
DEBUG=yes

# As I'm using Zhima HTTP Proxy, I'll put the API here so proxy_pool/fetcher
# knows where to get new IPs to refresh the pool.
ZHIMA_PROXY_URL="https://..."

# Put the host name and port (if needed) of the "proxy_pool" instance here so
# this crawler knows where the pool is.
PROXY_POOL_HOST="https://localhost:5010"

# If you don't need the proxy pool at all, e.g. you want the scripts to make
# requests directly from your network, add the following line and skip ahead
# to installing the dependencies.
WITHOUT_PROXY=yes
```
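Before kicking off a crawl, it can be worth checking that the pool actually holds usable IPs. A small sanity-check sketch, assuming `python-dotenv` for loading `.env` and the same hypothetical `/get/` endpoint as in the earlier examples:

```python
import os
import sys

import requests
from dotenv import load_dotenv

load_dotenv()  # reads .env from the project root

host = os.getenv("PROXY_POOL_HOST", "http://localhost:5010")
proxy = requests.get(f"{host}/get/", timeout=5).json().get("proxy")
if not proxy:
    sys.exit("proxy pool is empty -- wait for the fetcher to fill it up")
print("pool ready, sample proxy:", proxy)
```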
- Install dependencies:

```bash
pip install -r requirements.txt
```
- Migrate database schemas

First, install the golang-migrate/migrate tool to enable the `migrate` command. Follow the installation guide here: migrate CLI.

Then run the migration against your database (change the `user`, `pass` and/or hostname and port accordingly):

```bash
migrate -database "postgres://user:pass@localhost:5432/crawler?sslmode=disable" -path migrations up
```
- Run the scripts in the following sequence:

```bash
# First, get as many tags as possible
python app.py get_tags

# Second, iterate through the tags and fetch the links to the books
python app.py get_book_links

# Lastly, start crawling books from the links
python app.py crawl_books
```
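If you'd rather not run the three stages by hand, a small wrapper can chain them and stop on the first failure. This is only a convenience sketch using plain `subprocess` calls, not part of the project itself:

```python
import subprocess

# Run the three stages in order; check=True aborts the chain as soon as one
# stage exits with a non-zero status.
for stage in ("get_tags", "get_book_links", "crawl_books"):
    print(f"running {stage} ...")
    subprocess.run(["python", "app.py", stage], check=True)
```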
MIT © mogita