Haphazard scripts for scraping bitcoin/bitcoin data from GitHub

Overview

This is a quick-and-dirty tool used to scrape bitcoin/bitcoin pull request and commentary data.

Each output/<pr number> folder contains

  • comments.json: an aggregated list of both issue and review comments, in Github's original format
  • commits.json: a list of commit objects corresponding to the PR, in Github's original format
  • pr.json: the pull request object, in Github's original format
  • comments_abbrev.csv: abbreviated representation of each comment in CSV format
  • pr_abbrev.csv: abbreviated representation of the PR in CSV format
  • done: the datetime we retrieved the PR data

Limitations

Right now this doesn't really handle open PRs (or PRs that are expected to be updated) properly since it will not refresh data once the done sentinel is created. This could be fixed by comparing various timestamps to the done sentinel and overwriting.

See also

Owner
James O'Beirne
James O'Beirne
A simple django-rest-framework api using web scraping

Apicell You can use this api to search in google, bing, pypi and subscene and get results Method : POST Parameter : query Example import request url =

Hesam N 1 Dec 19, 2021
UsernameScraperTool - Username Scraper Tool With Python

UsernameScraperTool Username Scraper for 40+ Social sites. How To use git clone

E4crypt3d 1 Dec 20, 2022
A repository with scraping code and soccer dataset from understat.com.

UNDERSTAT - SHOTS DATASET As many people interested in soccer analytics know, Understat is an amazing source of information. They provide Expected Goa

douglasbc 48 Jan 03, 2023
腾讯课堂,模拟登陆,获取课程信息,视频下载,视频解密。

腾讯课堂脚本 要学一些东西,但腾讯课堂不支持自定义变速,播放时有水印,且有些老师的课一遍不够看,于是这个脚本诞生了。 时间比较紧张,只会不定时修复重大bug。多线程下载之类的功能更新短期内不会有,如果你想一起完善这个脚本,欢迎pr 2020.5.22测试可用 使用方法 很简单,三部完成 下载代码,

163 Dec 30, 2022
Collection of code files to scrap different kinds of websites.

STW-Collection Scrap The Web Collection; blog posts. This repo contains Scrapy sample code to scrap the following kind of websites: Do you want to lea

Tapasweni Pathak 15 Jun 08, 2022
LSpider 一个为被动扫描器定制的前端爬虫

LSpider LSpider - 一个为被动扫描器定制的前端爬虫 什么是LSpider? 一款为被动扫描器而生的前端爬虫~ 由Chrome Headless、LSpider主控、Mysql数据库、RabbitMQ、被动扫描器5部分组合而成。

Knownsec, Inc. 321 Dec 12, 2022
Command line program to download documents from web portals.

command line document download made easy Highlights list available documents in json format or download them filter documents using string matching re

16 Dec 26, 2022
Crawl BookCorpus

These are scripts to reproduce BookCorpus by yourself.

Sosuke Kobayashi 590 Jan 03, 2023
京东云无线宝积分推送,支持查看多设备积分使用情况

JDRouterPush 项目简介 本项目调用京东云无线宝API,可每天定时推送积分收益情况,帮助你更好的观察主要信息 更新日志 2021-03-02: 查询绑定的京东账户 通知排版优化 脚本检测更新 支持Server酱Turbo版 2021-02-25: 实现多设备查询 查询今

雷疯 199 Dec 12, 2022
a high-performance, lightweight and human friendly serving engine for scrapy

a high-performance, lightweight and human friendly serving engine for scrapy

Speakol Ads 30 Mar 01, 2022
robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

Joshua Carp 3.7k Dec 27, 2022
A pure-python HTML screen-scraping library

Scrapely Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely con

Scrapy project 1.8k Dec 31, 2022
This tool crawls a list of websites and download all PDF and office documents

This tool crawls a list of websites and download all PDF and office documents. Then it analyses the PDF documents and tries to detect accessibility issues.

AccessibilityLU 7 Sep 30, 2022
High available distributed ip proxy pool, powerd by Scrapy and Redis

高可用IP代理池 README | 中文文档 本项目所采集的IP资源都来自互联网,愿景是为大型爬虫项目提供一个高可用低延迟的高匿IP代理池。 项目亮点 代理来源丰富 代理抓取提取精准 代理校验严格合理 监控完备,鲁棒性强 架构灵活,便于扩展 各个组件分布式部署 快速开始 注意,代码请在release

SpiderClub 5.2k Jan 03, 2023
A training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.

Parallel web scraping The project is a training task for web scraping using python multithreading and a real-time-updated list of available proxy serv

Kushal Shingote 1 Feb 10, 2022
薅薅乐 - JD 测试脚本

薅薅乐 安裝 使用docker docker一键安装: docker run -d --name jd classmatelin/hhl:latest. 使用 进入容器: docker exec -it jd bash 获取JD_COOKIES: python get_jd_cookies.py,

ClassmateLin 575 Dec 28, 2022
Python web scrapper

Website scrapper Web scrapping project in Python. Created for learning purposes. Start Install python Update configuration with websites Launch script

Nogueira Vitor 1 Dec 19, 2021
Dex-scrapper - Hobby project for scrapping dex data on VeChain

Folders /zumo_abis # abi extracted from zumo repo /zumo_pools # runtime e

3 Jan 20, 2022
Scrape puzzle scrambles from csTimer.net

Scroodle Selenium script to scrape scrambles from csTimer.net csTimer runs locally in your browser, so this doesn't strain the servers any more than i

Jason Nguyen 1 Oct 29, 2021
Find thumbnails and original images from URL or HTML file.

Haul Find thumbnails and original images from URL or HTML file. Demo Hauler on Heroku Installation on Ubuntu $ sudo apt-get install build-essential py

Vinta Chen 150 Oct 15, 2022