A spider for Universal Online Judge(UOJ) system, converting problem pages to PDFs.

Overview

Universal Online Judge Spider

Introduction

This is a spider for Universal Online Judge (UOJ) system (https://uoj.ac/).

It also works for all other Online Judges using the UOJ system.

This spider is written in python3, using python selenium webdriver library and ChromeDriver.

It is only tested on Ubuntu 20.04, so the commands in the following section are only available for this system as well.

Features

  • Automatic login, no need to obtain cookies manually.
  • Convert pages into PDFs with reproducible text rather than simple screenshots.
  • Automatically detects the loading of MathJax to ensure that the mathematical formula within the results are displayed correctly.
  • Automatically skips pages that already exist (if the corresponding PDF file already exists locally).
  • Support for proxy.
  • Support for all websites using the UOJ system.

Installation

1. Install python3 and ChromeDriver:

apt install python3 python-pip3 chromium-browser chromium-chromedriver

2. Install selenium library for python3

pip3 install selenium

3. Download this program

Usage

Firstly you have to set these variables:

# [Basic settings]
url = ""
username = ""
password = ""
start_number = 1
end_number = 100
save_dir = "downloads"

# [Advanced settings]
proxy = ""
page_404_title = "404 - "
max_login_time = 60
max_mathjax_start_time = 60
max_mathjax_load_time = 60

Basic settings

  • url: the index URL of your target, e.g. https://uoj.ac/. Please note that the value must end in a slash /.
  • username: your username.
  • password: your password.
  • start_number: the number of the first problem crawled (minimum).
  • end_number: the number of the last problem crawled (maximum).
  • save_dir: the name of the folder where the result will be stored.

Advanced settings

If you don't know what the advanced settings are for, you're probably better not to change them.

  • proxy: the address of your proxy server, e.g. HTTP://127.0.0.1:1080, or SOCKS5://127.0.0.1:1081. Leave it blank (empty string) if you do not need to use a proxy.
  • page_404_title: the title of OJ's 404 page. You may use a substring of the title, like 404 - . If the program gets a page title that contains this string, the download of that page will be skipped.
  • max_login_time: the maximum waiting time for a login attempt, in seconds.
  • max_mathjax_start_time: the maximum wait time for a MathJax loading message to appear, in seconds.
  • max_mathjax_load_time: the maximum wait time for a MathJax loading message to disappear (i.e. MathJax rendering is finished), in seconds.

After completing the setup, run:

python3 main.py

Sample result

page1

page2

License

MIT License.

Owner
TriNitroTofu
QAQ...
TriNitroTofu
Dex-scrapper - Hobby project for scrapping dex data on VeChain

Folders /zumo_abis # abi extracted from zumo repo /zumo_pools # runtime e

3 Jan 20, 2022
Scraping followers of an instagram account

ScrapInsta A script to scraping data from Instagram Install First of all you can run: pip install scrapinsta After that you need to install these requ

Matheus Kolln 1 Sep 05, 2021
A simple code to fetch comments below an Instagram post and save them to a csv file

fetch_comments A simple code to fetch comments below an Instagram post and save them to a csv file usage First you have to enter your username and pas

2 Jul 14, 2022
Binance Smart Chain Contract Scraper + Contract Evaluator

Pulls Binance Smart Chain feed of newly-verified contracts every 30 seconds, then checks their contract code for links to socials.Returns only those with socials information included, and then submit

14 Dec 09, 2022
🐞 Douban Movie / Douban Book Scarpy

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

Xingbo Jia 1 Dec 03, 2022
Scraping weather data using Python to receive umbrella reminders

A Python package which scrapes weather data from google and sends umbrella reminders to specified email at specified time daily.

Edula Vinay Kumar Reddy 1 Aug 23, 2022
Using Selenium with Python to Web Scrap Popular Youtube Tech Channels.

Web Scrapping Popular Youtube Tech Channels with Selenium Data Mining, Data Wrangling, and Exploratory Data Analysis About the Data Web scrapi

David Rusho 0 Aug 18, 2021
Python scrapper scrapping torrent website and download new movies Automatically.

torrent-scrapper Python scrapper scrapping torrent website and download new movies Automatically. If you like it Put a ⭐ on this repo 😇 Run this git

Fazil vk 1 Jan 08, 2022
A python tool to scrape NFT's off of OpenSea

Right Click Bot A script to download NFT PNG's from OpenSea. All the NFT's you could ever want, no blockchain, for free. Usage Must Use Python 3! Auto

15 Jul 16, 2022
:arrow_double_down: Dumb downloader that scrapes the web

You-Get NOTICE: Read this if you are looking for the conventional "Issues" tab. You-Get is a tiny command-line utility to download media contents (vid

Mort Yao 46.4k Jan 03, 2023
一款利用Python来自动获取QQ音乐上某个歌手所有歌曲歌词的爬虫软件

QQ音乐歌词爬虫 一款利用Python来自动获取QQ音乐上某个歌手所有歌曲歌词的爬虫软件,默认去除了所有演唱会(Live)版本的歌曲。 使用方法 直接运行python run.py即可,然后输入你想获取的歌手名字,然后静静等待片刻。 output目录下保存生成的歌词和歌名文件。以周杰伦为例,会生成两

Yang Wei 11 Jul 27, 2022
API which uses discord to scrape NameMC searches/droptime/dropping status of minecraft names

NameMC Scrape API This is an api to scrape NameMC using message previews generated by discord. NameMC makes it a pain to scrape their website, but som

Twilak 2 Dec 22, 2021
The core packages of security analyzer web crawler

Security Analyzer 🐍 A large scale web crawler (considered also as vulnerability scanner tool) to take an overview about security of Moroccan sites Cu

Security Analyzer 10 Jul 03, 2022
Web crawling framework based on asyncio.

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. Requirements Python3.5+ Installation pip install gain pip install uvloo

Jiuli Gao 2k Jan 05, 2023
NASA APOD Discord Bot - Fetches information from NASA APOD site.

NASA APOD Discord Bot - Fetches information from NASA APOD site.

Astronomy Club IITK 4 Apr 23, 2022
A package that provides you Latest Cyber/Hacker News from website using Web-Scraping.

cybernews A package that provides you Latest Cyber/Hacker News from website using Web-Scraping. Latest Cyber/Hacker News Using Webscraping Developed b

Hitesh Rana 4 Jun 02, 2022
自动完成每日体温上报(Github Actions)

体温上报助手 简介 每天 10:30 GMT+8 自动完成体温上报,如想修改定时运行的时间,可修改 .github/workflows/SduHealthReport.yml 中 schedule 属性。 如果当日有异常,请手动在小程序端/PC 端填写!

Teng Zhang 23 Sep 15, 2022
Web Scraping Practica With Python

Web-Scraping-Practica Integrants: Guillem Vidal Pallarols. Lídia Bandrés Solé Fitxers: Aquest document és el primer que trobem. A continuació trobem u

2 Nov 08, 2021
Explore scraping with BeautifulSoup!

beautifulsoup-scrape Explore scraping with BeautifulSoup! Part One: Start from Shakespeare As my professor is a poet (yes, and he teaches me data and

Chuqin 2 Oct 05, 2022
Lovely Scrapper

Lovely Scrapper

Tushar Gadhe 2 Jan 01, 2022