北大选课网2021年春季验证码识别

Overview

北大选课网验证码识别 2021 年春季学期

Powered by Elector Quartet (@Rabbit, @xmcp, @SpiritedAwayCN, @gzz)

数据集描述

最初的数据集为 5130 张人工标记的验证码,之后利用早期训练好的模型在选课网上进行自动验证 (自举),又收集到 102741 张验证码,并对其中 2363 张识别错误的验证码进行人工标注。以上三部分合并为 GIF 验证码数据集

对 GIF 验证码数据集进行预处理,得到实际用于模型训练的数据集,它由裁剪好的单字母灰度图组成,图片尺寸 52x52,8 位色深,文件格式 PNG

训练集/测试集按照 8:2 进行划分,对训练集的数据进行数据增强,方式为旋转特定的几个角度,测试集不做特殊处理

在 const.py 中,DATA_RAW_DIR 对应最初的 5130 张人工标记的验证码,DATA_BOOTSTRAP_DIR 对应所有通过自举得到的验证码 (包含验证正确/错误的验证码),DATA_MANUALLY_LABELED_DIR 对应所有人工标注的验证码

数据集下载

百度网盘下载链接:https://pan.baidu.com/s/12GcfXYaLbHFC1WkzWhre2w (提取码:xsl3)

  1. 最初的 5130 张人工标注的验证码 xmcp_5130.zip,由 xmcp 提供,详见 xmcp/elective-dataset-2021spring
  2. 自举得到的验证码以 10000 张为单位,分为 11 个 bootstrap_all.x.zip
  3. 人工标注的验证码为 manually_labeled.zip,其中 manually_labeled_20210309/ 对应第一批加入到 CNN 模型训练的人工标注验证码,manually_labeled_20210310/manually_labeled_20210311/ 对应第二批加入到 CNN 模型训练的人工标注验证码
  4. 直接用于模型 210311_1 训练/测试的单字母灰度图 crop.20210311.1.tar.gz,共包含 354008 张单字母灰度图,对应于毫秒时间戳不超过 1615449512408 的 88502 张 GIF 验证码

图像预处理

基本采用 SpiritedAwayCN/ElectiveCaptCha 的方法进行预处理,相关算法的介绍参看 SpiritedAwayCN/ElectiveCaptCha 的说明

本项目对 SpiritedAwayCN/ElectiveCaptCha 的算法做了以下两点改进:

  1. 验证码的字母可能会变色,此时隔 4 帧作差,会有上一个字母的残留,因此加入 M_mask 的相关代码,每提取出一个字母,就将其加入到 M_mask 中,下一个字母提取的时候,会通过借助 M_mask 将之前已经提取出的字母对应的像素从当前字母的提取结果中抹去,以消除字母变色对后 3 个字母的提取所产生的影响
  2. 验证码的第一个字母可能在前 4 帧里颜色较浅,取第 4 帧转为灰度图后,第一个字母对应像素的灰度值大于全局平均灰度值,利用 OSTU 二值化时会被抹去,填充成白色,最后得到一张只剩下少量噪音残留的空白图像。改进的方法是,首先将 4 个关键帧的灰度图叠加,保留颜色最深的灰度值 (最小值),利用第 5-16 帧中第一个字母的颜色逐渐加深的特点,获得灰度值较深的包含第一个字母轮廓的 M_merge,此时再利用 OSTU 二值化,就基本上能留住第一个字母的信息,然后将 M_merge 的右三个字母抹去,仅保留包含第一个字母的轮廓,再根据此时的 M_merge,将第 1 个关键帧的灰度图中第一个字母对应像素的灰度值降低,此时再调用之前的处理函数,就可以更加准确地提取出第一个字母

模型训练与测试

以测试集的测试结果作为模型训练结果的参考,以模型在选课网上在线验证的结果作为模型的实际准确率

模型 测试集准确率 (单字母) 在线测试准确率 (四字母) / % 备注
210308_3 0.9968 (15818 / 15869) 0.9675 (17253 / 17832) 扩大数据集,进行数据增强
210309_2 0.9952 (33811 / 33973) 0.9799 (11659 / 11898) 扩大数据集,加入第一批人工标注的验证码
210309_3 0.9951 (33807 / 33973) 0.9788 (11847 / 12104) 同 210309_2,但数据增强的旋转角度范围缩小至 [-15, 15]
210309_4 0.9928 (33730 / 33973) 0.9727 (9527 / 9794) 同 210309_2,但不进行数据增强
210311_1 0.9977 (70467 / 70630) 0.9916 (10219 / 10306) 扩大数据集,加入第二批人工标注的验证码,优化第一个字母的图像处理算法
210311_2 0.9961 (70643 / 70918) 0.9838 (11230 / 11415) 同 210311_1,但使用未经优化的图像处理算法

目前准确率最高的模型为 210311_1,该模型的训练结果、训练时的终端输出、训练/测试的相关代码已在 dist/ 中提供

环境依赖

开发环境 Python 版本为 3.6.8,相关依赖包版本参考 requirements.txt,其中 PyTorch 版本应大于 1.4.x,否则无法读取 dist/ 中提供的模型

模型训练在阿里云服务器上进行,CUDA 版本 11.0,使用的 GPU 为 Tesla V100-SXM2-32GB,使用的 CPU 为 Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz,训练集一次性读入内存,以 210311_1 的训练/测试为例,数据量为 88502 张 GIF 验证码,数据增强的旋转角度范围为 [-30, 30],数据集扩增至 9 倍规模,占用的内存量为 32.9 GB,每个 Epoch 的耗时约为 309 秒

文件描述

preprocess.py

图像预处理的相关算法,实际有用的函数是 extract_c0_v1, extract_c0_v2, extract_c123, crop,处理流程参考 main 函数

cnn.py

与 CNN 模型相关的代码,包括数据集定义、数据增强、CNN 模型定义、训练集/测试集划分、模型训练、模型测试等

bootstrap.py

与模型自举相关的代码,在选课网上在线测试现有模型的准确率,并且将预测正确/错误的模型分类收集,用于扩增数据集

label_server.py

用于人工标注验证码的网站后端,前端代码在 web/ 中。提供图像预处理、输入检查等功能,为人工标注验证码提供方便,所使用的数据源为模型自举过程中收集到的识别错误的验证码

config.ini

与选课网账户相关的信息,包括学号、密码、是否为双学位账号,主要用于 bootstrap.py

致谢

Owner
Rabbit
CCME, Peking University
Rabbit
Automatically re-open threads when they get archived, no matter your boost level!

ThreadPersist Automatically re-open threads when they get archived, no matter your boost level! Installation You will need to install poetry to run th

7 Sep 18, 2022
Unfinished Python library based on ndspy, for Zelda: Phantom Hourglass and Spirit Tracks.

zed An unfinished library and toolset by me, for viewing and editing files from The Legend of Zelda: Phantom Hourglass and The Legend of Zelda: Spirit

4 Oct 13, 2022
Uma moeda simples e segura!

SecCoin - Documentação A SecCoin foi criada com intuito de ser uma moeda segura, de fácil investimento e mineração. A Criptomoeda está na sua primeira

Sec-Coin Team 5 Dec 09, 2022
Cvdl-hw2 - Find Contour, Camera Calibration, Augmented Reality and Stereo Disparity Map

opevcvdl-hw2 This project uses openCV and Qt to achieve the requirements. Version Python 3.7 opencv-contrib-python 3.4.2.17 Matplotlib 3.1.1 pyqt5 5.1

Kenny Cheng 3 Aug 17, 2022
because rico hates uuid's

terrible-uuid-lambda because rico hates uuid's sub 200ms response times! Try it out here: https://api.mathisvaneetvelde.com/uuid https://api.mathisvan

Mathis Van Eetvelde 2 Feb 15, 2022
Perform oocyst segmentation in mercurochrome stained mosquito midgut

Midgut_oocyst_segmentation Perform oocyst segmentation in mercurochrome stained mosquito midguts This oocyst segmentation model also powers the webtoo

Duo Peng 3 Oct 27, 2021
Custom Weapons 3 attribute support for Custom Weapons X

CW3toX Allows use of Custom Weapons 3 attributes in Custom Weapons X. Requiremen

2 Mar 01, 2022
Force you (or your user) annotate Python function type hints.

Must-typing Force you (or your user) annotate function type hints. Notice: It's more like a joke, use it carefully. If you call must_typing in your mo

Konge 13 Feb 19, 2022
List of Linux Tools I put on almost every linux / Debian host

Linux-Tools List of Linux Tools I put on almost every Linux / Debian host Installed: geany -- GUI editor/ notepad++ like chkservice -- TUI Linux ser

Stew Alexander 20 Jan 02, 2023
Backend/API for the Mumble.dev, an open source social media application.

Welcome to the Mumble Api Repository Getting Started If you are trying to use this project for the first time, you can get up and running by following

Dennis Ivy 189 Dec 27, 2022
Python implementation of the ASFLIP advection method

This is a python implementation of the ASFLIP advection method . We would like to hear from you if you appreciate this work.

Raymond Yun Fei 133 Nov 13, 2022
A Way to Use Python, Easier.

PyTools A Way to Use Python, Easier. How to Install Just copy this code, then make a new file in your project directory called PyTools.py, then paste

Kamran 2 Aug 15, 2022
A python package that computes an optimal motion plan for approaching a red light

redlight_approach redlight_approach is a Python package that computes an optimal motion plan during traffic light approach. RLA_demo.mov Given the par

Jonathan Roy 4 Oct 27, 2022
Dyson Sphere Program Blueprint Toolkit

dspbptk This is dspbptk, the Dyson Sphere Program Blueprint toolkit. Dyson Sphere Program is an amazing factory-building game by the incredibly talent

Johannes Bauer 22 Nov 15, 2022
Gmvault: Backup and restore your gmail account

Gmvault: Backup and restore your gmail account Gmvault is a tool for backing up your gmail account and never lose email correspondence. Gmvault is ope

Guillaume Aubert 3.5k Jan 01, 2023
Pacman - A suite of tools for manipulating debian packages

Overview Repository is a suite of tools for manipulating debian packages. At a h

Pardis Pashakhanloo 1 Feb 24, 2022
Decipher using Markov Chain Monte Carlo

Decipher using Markov Chain Monte Carlo

Science étonnante 43 Dec 24, 2022
Supply Chain will be a SAAS platfom to provide e-logistic facilites with most optimal

Shipp It Welcome To Supply Chain App [ Shipp It ] In "Shipp It" we are creating a full solution[web+app] for a entire supply chain from receiving orde

SAIKAT_CLAW 25 Dec 26, 2022
This is sample project needed for security course to connect web service to database

secufaku This is sample project needed for security course to "connect web service to database". Why it suits alignment purpose It connects to postgre

Mark Nicholson 6 May 15, 2022
Code for ML, domain generation, graph generation of ABC dataset

This is the repository for codes for ML, domain generation, graph generation of Asymmetric Buckling Columns (ABC) dataset in the paper "Learning Mechanically Driven Emergent Behavior with Message Pas

Peerasait Prachaseree (Jeffrey) 0 Jan 28, 2022