Web crawler weibo

A toolbox for collecting posts from https://weibo.com

Debug

Linux system

Cookies must be obtained with Google Chrome on Windows. After login_windows.py finishes, copy this program and the generated database to a Linux machine; posts can then be retrieved there until the cookies expire.

Contributions of login scripts for Linux, other operating systems, and other browsers are welcome.

Reasons for empty results

Searching for non-Chinese strings usually returns nothing, because Weibo is a Chinese-language social media platform.
Posts take several hours to appear in the search engine, so it is safer to search for posts that are at least 2 days old.
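
As a minimal sketch of the second point, the helper below computes a one-hour search window that ends 2 days before a given moment, expressed in UTC+8 and formatted with %Y-%m-%d-%H as search.py expects. The function name `search_window` and its parameters are hypothetical and not part of this repository.

```python
from datetime import datetime, timedelta, timezone

# The times in this README are given in UTC+8.
UTC8 = timezone(timedelta(hours=8))
FMT = "%Y-%m-%d-%H"


def search_window(now, days_back=2, hours=1):
    """Return (start_time, end_time) strings for a window ending days_back days ago.

    `now` must be a timezone-aware datetime. Hypothetical helper, not part of
    this repository.
    """
    # Convert to UTC+8, round down to the full hour, then step back in time.
    end = now.astimezone(UTC8).replace(minute=0, second=0, microsecond=0)
    end -= timedelta(days=days_back)
    start = end - timedelta(hours=hours)
    return start.strftime(FMT), end.strftime(FMT)


if __name__ == "__main__":
    start_time, end_time = search_window(datetime.now(timezone.utc))
    print(f"--start_time={start_time} --end_time={end_time}")
```

The two strings can be passed to the --start_time and --end_time options of search.py.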

Usage

If this is the first time you use this program, create a Python virtual environment and run

pip install -r requirements.txt

The minimum example assumes that Google Chrome is installed in the default path. If Google Chrome is installed in a custom path, run the following command and set chrome_user_data manually.

Run python login_windows.py and follow the instructions on the command line. This step requires a graphical operating system, because the user has to open a web browser and log in to Weibo. The other steps can run on a non-graphical operating system.

Minimum example

Search for posts containing "GitHub" between 11:00 and 12:00 (UTC+8) on August 15, 2023. Retrieve at most 2 pages (10 posts per page).

python search.py --query="GitHub" --start_time=2023-08-15-11 --end_time=2023-08-15-12 --max_page=2

For more search options, see

python search.py --help
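
To cover a longer period, one option is to issue one search per hour. The sketch below only builds the argument lists, using the four options shown in the minimum example; the function name `hourly_commands` is hypothetical, and whether hourly windows return more posts than one wide window is an assumption that has not been verified here.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d-%H"


def hourly_commands(query, first_hour, last_hour, max_page=2):
    """Build one search.py argument list per one-hour window (hypothetical helper)."""
    start = datetime.strptime(first_hour, FMT)
    stop = datetime.strptime(last_hour, FMT)
    commands = []
    while start < stop:
        end = start + timedelta(hours=1)
        commands.append([
            "python", "search.py",
            f"--query={query}",
            f"--start_time={start.strftime(FMT)}",
            f"--end_time={end.strftime(FMT)}",
            f"--max_page={max_page}",
        ])
        start = end
    return commands


for command in hourly_commands("GitHub", "2023-08-15-11", "2023-08-15-13"):
    # Print only; pass each list to subprocess.run(command, check=True) to execute it.
    print(" ".join(command))
```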

With the default database path, the results are saved in the posts.db database.

View the search table to find the name of the results table. The structure of the search table is as follows.

Name	Type	Description
query	text	The search words.
start_time	text	Posts from this hour onward are collected. The format code is %Y-%m-%d-%H
end_time	text	Posts up to this hour are collected. The format is the same as `start_time`
table	text	Name of the table in which this query's results are stored.
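
The lookup can be sketched with Python's built-in sqlite3 module. This assumes the table is literally named `search`; the demo runs on an in-memory database that mimics the columns above, and the results table name `results_1` is made up for illustration. Note that `table` is an SQL keyword, so the column name has to be quoted.

```python
import sqlite3


def list_searches(conn):
    """Return (query, start_time, end_time, table) rows from the search table."""
    # "table" is an SQL keyword, so the column name is wrapped in double quotes.
    sql = 'SELECT query, start_time, end_time, "table" FROM search'
    return conn.execute(sql).fetchall()


# Demo database; for real data, use sqlite3.connect("posts.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE search (query TEXT, start_time TEXT, end_time TEXT, "table" TEXT)'
)
conn.execute(
    "INSERT INTO search VALUES (?, ?, ?, ?)",
    ("GitHub", "2023-08-15-11", "2023-08-15-12", "results_1"),  # made-up row
)

for row in list_searches(conn):
    print(row)
```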

Using this table name, view the results in that table.

The data structure of the search results:

Name	Type	Description
avatar	text	Link to the avatar of the post author.
nickname	text	Username of the post author.
user_id	text	User ID of the post author.
posted_time	text	Time when the post was published. The format is either a relative time in Chinese (seconds, hours, or days ago) or an exact datetime, with or without the year.
source	text	How the post author accessed Weibo. It is either the device name or the topic (tag) name.
weibo_id	text
content	text	Main body of the post. This column is empty for fast reposts.
reposts	text	Number of reposts. The Chinese character "万", meaning "multiply by 10,000", may appear in this field and in `comments` and `likes`.
comments	text	Number of comments.
likes	text	Number of likes.
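
Because `reposts`, `comments`, and `likes` are stored as text, they need converting before any arithmetic. A minimal sketch, assuming the values are either plain digits or a decimal number followed by "万"; any other value (for example an empty string) is returned as None. The function name `parse_count` is hypothetical.

```python
import re
from decimal import Decimal

# A non-negative number, optionally followed by the character "万" (x 10,000).
_COUNT = re.compile(r"(\d+(?:\.\d+)?)(万?)")


def parse_count(text):
    """Convert a count such as '123' or '1.2万' to an int, or None if unrecognized."""
    match = _COUNT.fullmatch(text.strip())
    if match is None:
        return None
    number, unit = match.groups()
    factor = 10_000 if unit == "万" else 1
    # Decimal avoids binary floating-point rounding on values like "1.2".
    return int(Decimal(number) * factor)


print(parse_count("123"))
print(parse_count("1.2万"))
```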

SQL statements:

Name	Table	Description
User profile URL	search results	`'https://weibo.com/u/' || user_id`
Post URL	search results	`'https://weibo.com/' || user_id || '/' || weibo_id`
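
These expressions can be run from Python with the built-in sqlite3 module. The sketch below uses an in-memory table holding only the two columns involved; the table name `results_1`, the sample IDs, and the function name `post_urls` are all made up for illustration.

```python
import sqlite3


def post_urls(conn, table):
    """Return (user profile URL, post URL) pairs for every row of a results table."""
    # Table names cannot be bound as SQL parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    sql = (
        "SELECT 'https://weibo.com/u/' || user_id, "
        "'https://weibo.com/' || user_id || '/' || weibo_id "
        f"FROM {quoted}"
    )
    return conn.execute(sql).fetchall()


# Demo database; for real data, use sqlite3.connect("posts.db") instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results_1 (user_id TEXT, weibo_id TEXT)")
conn.execute("INSERT INTO results_1 VALUES (?, ?)", ("1234567890", "AbCdEfGhI"))

for profile_url, post_url in post_urls(conn, "results_1"):
    print(profile_url, post_url)
```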

cloudy-sfu/Web-crawler-weibo

Folders and files

Latest commit

History

Repository files navigation

Web crawler weibo

Debug

Usage

About

Topics

Resources

License

Stars

Watchers

Forks

Languages