cloudy-sfu/Web-crawler-weibo

Web crawler weibo

The toolbox to collect posts from https://weibo.com

Debug

Linux system

Obtain the cookies with Google Chrome on Windows by running login_windows.py. Afterwards, you can copy this program and the generated database to a Linux machine and retrieve posts there until the cookies expire.

Contributions of login scripts for Linux, other operating systems, and other browsers are welcome.

Reasons for empty results

  1. Searching for non-Chinese strings usually returns nothing, because Weibo is a Chinese social media platform.
  2. Posts appear in the search engine with a delay of several hours. It is safer to search for posts from at least 2 days ago.
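One way to respect the indexing delay is to compute the search window relative to the current time. The sketch below builds a one-hour window that ends two days ago and formats it with the %Y-%m-%d-%H code that search.py expects for --start_time and --end_time. The helper name delayed_window is hypothetical, and the choice of UTC+8 is an assumption taken from the minimum example under Usage.

```python
from datetime import datetime, timedelta, timezone

# Assumption: search.py interprets hours in UTC+8, as in the minimum example.
UTC8 = timezone(timedelta(hours=8))
TIME_FORMAT = "%Y-%m-%d-%H"


def delayed_window(now, delay_days=2, hours=1):
    """Return (start_time, end_time) strings for a window ending `delay_days` ago.

    `now` should be a timezone-aware datetime.
    """
    # Truncate to the full hour in UTC+8, then step back by the delay.
    end = now.astimezone(UTC8).replace(minute=0, second=0, microsecond=0)
    end -= timedelta(days=delay_days)
    start = end - timedelta(hours=hours)
    return start.strftime(TIME_FORMAT), end.strftime(TIME_FORMAT)
```

Calling delayed_window(datetime.now(timezone.utc)) yields two strings that can be passed directly as --start_time and --end_time.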

Usage

If this is your first time using this program, create a Python virtual environment and run

pip install -r requirements.txt

The minimum example assumes that Google Chrome is installed in the default path. If Google Chrome is installed in a custom path, run the following command and set chrome_user_data manually.

Run python login_windows.py and follow the instructions on the command line. This step requires an operating system with a graphical interface, because the user has to open a web browser and log in to Weibo. The other steps can be carried out on a system without a graphical interface.


Minimum example

Search for posts containing "GitHub" published between 11:00 and 12:00 (UTC+8) on August 15, 2023. Retrieve at most 2 pages (10 posts per page).

python search.py --query="GitHub" --start_time=2023-08-15-11 --end_time=2023-08-15-12 --max_page=2

For more search options, see

python search.py --help

With the default database path, the results are saved in the posts.db database.

Open the search table to find the name of the table that holds the results. The search table has the following structure.

Name        Type  Description
query       text  The search words.
start_time  text  Posts from this hour onward are collected. The format code is %Y-%m-%d-%H.
end_time    text  Posts up to this hour are collected. The format is the same as start_time.
table       text  The name of the table in which this query's results are stored.

Then open the table with that name to view the results.
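The two-step lookup can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, assuming the index table is literally named search and has the columns listed above. The function names are hypothetical and not part of the toolbox.

```python
import sqlite3


def result_tables(conn, query):
    """Return the result table names recorded for `query` in the search table."""
    # "table" is an SQL keyword, so the column name must be quoted.
    rows = conn.execute(
        'SELECT "table" FROM search WHERE query = ?', (query,)
    ).fetchall()
    return [row[0] for row in rows]


def fetch_posts(conn, table_name):
    """Return all rows of one result table as a list of dictionaries."""
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table_name.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]
```

With the default database path, the connection would be opened with sqlite3.connect("posts.db"), and each name returned by result_tables can be passed to fetch_posts.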

The data structure of the search results:

Name         Type  Description
avatar       text  Link to the avatar of the post author.
nickname     text  Username of the post author.
user_id      text  User ID of the post author.
posted_time  text  The time when the post was published. Its format can be either seconds/hours/days ago (in Chinese) or an exact datetime with or without the year.
source       text  How the post author accessed Weibo. It can be either the device name or the topic (tag) name.
weibo_id     text
content      text  The main body of the post. This column is empty for fast reposts.
reposts      text  Number of reposts. The Chinese character "万" may appear in this field, as well as in comments and likes; it means "multiply by 10,000".
comments     text  Number of comments.
likes        text  Number of likes.
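Because reposts, comments and likes are stored as text and may carry the "万" suffix, they usually need converting before numeric analysis. The helper below is a hypothetical sketch, not part of the toolbox. It assumes the field holds either a plain number or a number followed by "万", and returns None for anything else (for example an empty string).

```python
from decimal import Decimal, InvalidOperation


def parse_count(text):
    """Convert a count such as '1.2万' or '345' to an int, or None if unparseable."""
    text = text.strip()
    multiplier = 1
    if text.endswith("万"):
        # "万" means "multiply by 10,000".
        multiplier = 10_000
        text = text[:-1]
    try:
        # Decimal avoids float rounding on values like 1.2 * 10000.
        return int(Decimal(text) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None
```

Returning None rather than 0 keeps "no number found" distinguishable from a real count of zero.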

SQL statements:

Name              Table           Description
User profile URL  search results  'https://weibo.com/u/' || user_id
Post URL          search results  'https://weibo.com/' || user_id || '/' || weibo_id
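The two expressions above can be evaluated directly in a query against a results table. The following sketch uses Python's sqlite3 module. The function name post_urls and the column aliases profile_url and post_url are hypothetical, and the table name must be the one recorded in the search table.

```python
import sqlite3


def post_urls(conn, table_name):
    """Return (profile_url, post_url) pairs for every row of a results table."""
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table_name.replace('"', '""') + '"'
    sql = (
        "SELECT 'https://weibo.com/u/' || user_id AS profile_url, "
        "'https://weibo.com/' || user_id || '/' || weibo_id AS post_url "
        f"FROM {quoted}"
    )
    return conn.execute(sql).fetchall()
```

In SQLite, || is the string concatenation operator, so each row yields two ready-to-open links.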