一些爬虫相关的签名、验证码破解

Overview

cracking4crawling

一些爬虫相关的签名、验证码破解,目前已有脚本:

说明:

脚本按目标网站、App命名归档,每个脚本一般都是可以单独导入使用(除非调用了额外的用于加解密的js文件),使用方法可阅读文档或参考其中的test函数。

使用方法:

小红书

小红书App接口签名(shield)

shield是小红书App接口主要的签名,由path、params、xy_common_params、xy_platform_info、data拼接并加密生成。原始加密在libshield.so中,已用python复现。

from urllib import parse

from xiaohongshu.shield import get_sign

# 对接口路径、url参数、header中的xy-common-params、xy-platform-info、请求的data进行签名
path = '/api/sns/v4/note/user/posted'

params = parse.urlencode({'user_id': '5eeb209d000000000101d84a'})

xy_common_params = parse.urlencode({})
    
xy_platform_info = parse.urlencode({})

data = parse.urlencode({})

# 生成签名
sign = get_sign(path=path, 
                params=params, 
                xy_common_params=xy_common_params, 
                xy_platform_info=xy_platform_info,
                data=data)
print(sign)

小红书滑块(数美)验证破解

小红书使用数美滑块验证码,验证过程(获取验证码配置>获取验证码>提交验证)在数美的服务器(数美使用organization来识别被验证的网站、App)上进行,完成后将通过的rid提交到小红书的接口。

具体实现细节:

  • 协议更新:数美会定期自动更新js和接口参数字段(接口里所有两个字母组成的字段名都会在更新修改),通过"/ca/v1/conf"接口返回的js路径可以判断协议版本(如"/pr/auto-build/v1.0.1-33/captcha-sdk.min.js",表示协议版本号为33),脚本会加载js,并通过匹配确认字段名,用于后续的接口请求。
  • 验证参数:验证主要需要三个参数:位移比率、时间、轨迹,使用opencv中的matchTemplate函数计算距离,并随机生成相应的轨迹。
  • 调用加密:提交验证的主要参数都需要加密,使用DES加密。
  • 加密过程:"/ca/v1/register"接口会返回一个参数k,使用"sshummei"作为key对它解密,结果为加密参数所需的key,再对参数进行加密。

注:当前的验证参数全部按照小红书App调整,用于其他验证(如小红书Web或其他网站、App),可能需要调整其中参数。

from xiaohongshu.shumei_slide_captcha import get_verify

# 表示小红书
organization = 'eR46sBuqF0fdw7KWFLYa'

# rid是验证过程中响应的标示,r是最后提交验证返回的响应
rid, r = get_verify(organization)

print(rid, r)

# riskLevel为PASS说明验证通过
if r['riskLevel'] == 'PASS':
    # 这里需要向小红书提交rid
    # 具体可抓包查看,接口:/api/sns/v1/system_service/slide_captcha_check
    pass

海南航空

海南航空App接口签名(hnairSign)

签名对象主要是请求的data,取common、data下的全部参数,按字典序排序进行拼接(list、dict类型不参与拼接),结尾加上slat,进行HMAC_SHA1加密生成。

注:"/user/"下的接口加签时,会在拼接的内容前加上token,同时HMAC_SHA1加密会使用服务器返回的secret

from hnair.hna_signature

# 对请求的data进行签名
data = {
    'common': {
        # common的内容
    },
    'data': {
        'adultCount': 1,
        'cabins': ['*'],
        'childCount': 0,
        'depDate': '2020-12-09',
        'dstCode': 'PEK',
        'infantCount': 0,
        'orgCode': 'YYZ',
        'tripType': 1,
        'type': 3
    }
}

# /user/ 路径下的接口需要登录,同时加签要传入token、secret(都由服务器返回)
# token = ''
# secret = ''

# 生成签名
sign = get_sign(data=data)
print(sign)
Owner
XNFA
XNFA
A web scraper for nomadlist.com, made to avoid website restrictions.

Gypsylist gypsylist.py is a web scraper for nomadlist.com, made to avoid website restrictions. nomadlist.com is a website with a lot of information fo

Alessio Greggi 5 Nov 24, 2022
京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

京东茅台抢购最新优化版本,京东茅台秒杀,优化了茅台抢购进程队列

MaoTai 129 Dec 14, 2022
Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine .

TwitterScraper Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine . Screenshot Data Users Only

Remax Alghamdi 19 Nov 17, 2022
Scrapy, a fast high-level web crawling & scraping framework for Python.

Scrapy Overview Scrapy is a fast high-level web crawling and web scraping framework, used to crawl websites and extract structured data from their pag

Scrapy project 45.5k Jan 07, 2023
Find thumbnails and original images from URL or HTML file.

Haul Find thumbnails and original images from URL or HTML file. Demo Hauler on Heroku Installation on Ubuntu $ sudo apt-get install build-essential py

Vinta Chen 150 Oct 15, 2022
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

Scrapy project 859 Dec 29, 2022
A training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.

Parallel web scraping The project is a training task for web scraping using python multithreading and a real-time-updated list of available proxy serv

Kushal Shingote 1 Feb 10, 2022
An experiment to deploy a serverless infrastructure for a scrapy project.

Serverless Scrapy project This project aims to evaluate the feasibility of an architecture based on serverless technology for a web crawler using scra

José Ferraz Neto 5 Jul 08, 2022
Web3 Pancakeswap Sniper bot written in python3

Pancakeswap_BSC_Sniper_Bot Web3 Pancakeswap Sniper bot written in python3, Please note the license conditions! The first Binance Smart Chain sniper bo

Treading-Tigers 295 Dec 31, 2022
Scraping weather data using Python to receive umbrella reminders

A Python package which scrapes weather data from google and sends umbrella reminders to specified email at specified time daily.

Edula Vinay Kumar Reddy 1 Aug 23, 2022
A crawler of doubamovie

豆瓣电影 A crawler of doubamovie 一个小小的入门级scrapy框架的应用,选取豆瓣电影对排行榜前1000的电影数据进行爬取。 spider.py start_requests方法为scrapy的方法,我们对它进行重写。 def start_requests(self):

Cats without dried fish 1 Oct 05, 2021
An introduction to free, automated web scraping with GitHub’s powerful new Actions framework.

An introduction to free, automated web scraping with GitHub’s powerful new Actions framework Published at palewi.re/docs/first-github-scraper/ Contrib

Ben Welsh 15 Nov 24, 2022
Extract embedded metadata from HTML markup

extruct extruct is a library for extracting embedded metadata from HTML markup. Currently, extruct supports: W3C's HTML Microdata embedded JSON-LD Mic

Scrapinghub 725 Jan 03, 2023
Jobinja.ir jobs scraper.

Jobinja.ir Dataset Introduction This project is a simple web scraper that scraps pages of jobinja.ir concurrently and writes and update (if file gets

Iman Kermani 3 Apr 15, 2022
Danbooru scraper with python

Danbooru Version: 0.0.1 License under: MIT License Dependencies Python: = 3.9.7 beautifulsoup4 cloudscraper Example of use Danbooru from danbooru imp

Sugarbell 2 Oct 27, 2022
Scrapes proxies and saves them to a text file

Proxy Scraper Scrapes proxies from https://proxyscrape.com and saves them to a file. Also has a customizable theme system Made by nell and Lamp

nell 2 Dec 22, 2021
Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX)

mcc-mnc.com-webscraper Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX) A Python script for web scraping mcc-mnc.com Link: mcc

Anton Ivarsson 1 Nov 07, 2021
原神爬虫 抓取原神界面圣遗物信息

原神圣遗物半自动爬虫 说明 直接抓取原神界面中的圣遗物数据 目前只适配了背包页面的抓取 准确率:97.5%(普通通用接口,对 40 件随机圣遗物识别,统计完全正确的数量为 39) 准确率:100%(4k 屏幕,普通通用接口,对 110 件圣遗物识别,统计完全正确的数量为 110) 不排除还有小错误的

hwa 28 Oct 10, 2022
Dailyiptvlist.com Scraper With Python

Dailyiptvlist.com scraper Info Made in python Linux only script Script requires to have wget installed Running script Clone repository with: git clone

1 Oct 16, 2021
Simple library for exploring/scraping the web or testing a website you’re developing

Robox is a simple library with a clean interface for exploring/scraping the web or testing a website you’re developing. Robox can fetch a page, click on links and buttons, and fill out and submit for

Dan Claudiu Pop 79 Nov 27, 2022