A universal package of scraper scripts for humans

Overview


Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Sponsors
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides scraper scripts for the most commonly used machine learning and data science domains. Scrapera scrapes directly and asynchronously from public API endpoints, removing the heavy browser overhead; this makes it extremely fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous
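The overview above describes scraping public API endpoints asynchronously instead of driving a browser. A minimal, self-contained sketch of that general pattern (this is not Scrapera's actual internals; the endpoint URLs and response shape are hypothetical stand-ins, and the network call is stubbed):

```python
import asyncio

# Hypothetical stand-in for an async HTTP request (a real scraper would use
# an async client such as aiohttp); the JSON response is stubbed so the
# sketch runs without network access.
async def fetch_json(url):
    await asyncio.sleep(0)  # yield to the event loop, as a real request would
    return {"url": url, "items": ["a", "b"]}

async def scrape_many(urls):
    # All requests run concurrently on one event loop with no browser
    # process, which is where the speed advantage over browser-driven
    # scraping comes from.
    return await asyncio.gather(*(fetch_json(u) for u in urls))

results = asyncio.run(scrape_many([
    "https://example.com/api/items?page=1",
    "https://example.com/api/items?page=2",
]))
print(len(results))  # → 2
```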

The main aim of this package is to cluster common scraping tasks, making it more convenient for ML researchers and engineers to focus on their models rather than on the data collection process.

DISCLAIMER: The owner and contributors take no responsibility for misuse of data obtained through Scrapera. Contact the owner if any module provided by Scrapera violates copyright terms.

    Prerequisites

Prerequisites can be installed separately through the requirements.txt file, as shown below:

    pip install -r requirements.txt

    Installation

Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

To use any sub-module, just import it, instantiate the scraper, and execute:

    from scrapera.video.vimeo import VimeoScraper
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')
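The single-URL call above extends naturally to batches. A hedged sketch built around the scrape(url, quality) call shown in the example (the helper name, URL list, and error handling are illustrative additions, not part of Scrapera's API):

```python
# Illustrative helper around the scrape(url, quality) call shown above.
# `scraper` is any object exposing that method, e.g. VimeoScraper().
def scrape_all(scraper, urls, quality='540p'):
    """Scrape each URL, collecting failures instead of aborting the batch."""
    failed = []
    for url in urls:
        try:
            scraper.scrape(url, quality)
        except Exception as exc:  # one bad URL should not stop the rest
            failed.append((url, exc))
    return failed

# Usage (assumes scrapera is installed):
# from scrapera.video.vimeo import VimeoScraper
# failed = scrape_all(VimeoScraper(), ['https://vimeo.com/191955190'])
```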

For more examples, please refer to the test folders in the respective modules.

    Contributing

Scrapera welcomes any and all contributions and scraper requests. Please raise an issue if a scraper fails at any point. Feel free to fork the repository and add your own scrapers to help the community!
For more guidelines, refer to CONTRIBUTING.

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors


    Contact

Feel free to reach out for any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements
