A universal package of scraper scripts for humans

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. Sponsors
  6. License
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides scraper scripts for the most commonly used machine learning and data science domains. It scrapes directly and asynchronously from public API endpoints, which removes the heavy browser overhead and makes Scrapera fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous

  • The main aim of this package is to group common scraping tasks in one place, so that ML researchers and engineers can focus on their models rather than on the data collection process.

    DISCLAIMER: The owner and contributors take no responsibility for misuse of data obtained through Scrapera. Contact the owner if any module provided by Scrapera violates copyright terms.
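
To illustrate the asynchronous, browser-free approach described above, here is a minimal sketch of concurrent fetching with `asyncio`. It is not Scrapera's actual implementation: `fetch_all`, `fake_fetch`, `demo` and the `example.com` URLs are hypothetical names chosen for this example, and the network call is simulated so the snippet runs offline.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Tuple


async def fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], Awaitable[bytes]],
    max_concurrency: int = 8,
) -> Dict[str, bytes]:
    """Fetch every URL concurrently, with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Tuple[str, bytes]:
        # The semaphore caps how many fetches run at the same time.
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(pairs)


async def fake_fetch(url: str) -> bytes:
    """Hypothetical stand-in for an HTTP GET against a public API endpoint."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"payload for {url}".encode()


def demo() -> Dict[str, bytes]:
    # Placeholder URLs, used only to drive the simulated fetch.
    urls = [f"https://example.com/api/items/{i}" for i in range(5)]
    return asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=2))
```

In a real scraper, `fake_fetch` would be replaced by a coroutine that issues the HTTP request. Because no browser is involved, many requests can overlap while each one waits on the network.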

    Getting Started

    Prerequisites

    Prerequisites can be installed separately from the requirements.txt file:

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

    Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

    To use any sub-module, import it, instantiate its scraper class, and run it:

    from scrapera.video.vimeo import VimeoScraper
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')

    For more examples, refer to the test folders in the respective modules.
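
When scraping several URLs in a row, it can help to keep going after a failure and collect the errors for later. The helper below is a sketch, not part of Scrapera: `scrape_many` is a hypothetical name, and it only assumes the `scrape(url, quality)` call shown in the example above.

```python
from typing import Iterable, List, Tuple


def scrape_many(
    scraper, urls: Iterable[str], quality: str
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Call scraper.scrape(url, quality) for each URL and record failures.

    Returns (succeeded_urls, [(failed_url, error_text), ...]).
    """
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for url in urls:
        try:
            scraper.scrape(url, quality)
        except Exception as exc:  # keep going so one bad URL does not stop the batch
            failed.append((url, repr(exc)))
        else:
            succeeded.append(url)
    return succeeded, failed


# Usage with the scraper from the example above (requires scrapera to be installed):
# from scrapera.video.vimeo import VimeoScraper
# done, errors = scrape_many(VimeoScraper(), ['https://vimeo.com/191955190'], '540p')
```

The failed URLs and their error text can then be pasted into an issue report, as the Contributing section requests.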

    Contributing

    Scrapera welcomes all contributions and scraper requests. Please raise an issue if a scraper fails. Feel free to fork the repository and add your own scrapers to help the community!
    For more guidelines, refer to CONTRIBUTING.

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors

    Contact

    Feel free to reach out with any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements

    Proxy scraper. Format: IP | PORT | COUNTRY | TYPE

    proxy scraper 🔎 Installation: git clone https://github.com/ebankoff/proxy_scraper Required pip libraries (pip install library name): lxml beautifulso

    Eban'ko 19 Dec 07, 2022
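    Signature and CAPTCHA cracking scripts for crawlers

    cracking4crawling Signature and CAPTCHA cracking scripts for crawlers. Scripts available so far: Xiaohongshu app API signature (shield) (2020.12.02), Xiaohongshu slider CAPTCHA (Shumei) cracking (2020.12.02), Hainan Airlines app API signature (hnairSign) (2020.12.05). Notes: scripts are named by target website, app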

    XNFA 90 Feb 09, 2021
    A scrapy pipeline that provides an easy way to store files and images using various folder structures.

    scrapy-folder-tree This is a scrapy pipeline that provides an easy way to store files and images using various folder structures. Supported folder str

    Panagiotis Simakis 7 Oct 23, 2022
    Web Scraping COVID 19 Meta Portal with Python

    Web-Scraping-COVID-19-Meta-Portal-with-Python - Uses the Requests API and Beautiful Soup to scrape real-time COVID statistics from the Worldometer website and perform data cleaning and visual analysis in Jupyter

    Aarif Munwar Jahan 1 Jan 04, 2022
    Download images from forum threads

    Forum Image Scraper. Downloads images from forum threads. Only works with forums that don't require a login to view and have an incremental paginatio

    9 Nov 16, 2022
    Libextract: extract data from websites

    Libextract is a statistics-enabled data extraction library that works on HTML and XML documents and is written in Python

    499 Dec 09, 2022
    Dailyiptvlist.com Scraper With Python

    Dailyiptvlist.com scraper. Info: made in Python; Linux-only script; requires wget to be installed. Running the script: clone the repository with: git clone

    1 Oct 16, 2021
    Crawl BookCorpus

    These are scripts to reproduce BookCorpus by yourself.

    Sosuke Kobayashi 590 Jan 03, 2023
    FilmMikirAPI - A simple REST API for scraping the Kincir website, built with Python and Flask

    UserGhost411 1 Nov 17, 2022
    robobrowser - A simple, Pythonic library for browsing the web without a standalone web browser.

    RoboBrowser: Your friendly neighborhood web scraper Homepage: http://robobrowser.readthedocs.org/ RoboBrowser is a simple, Pythonic library for browsi

    Joshua Carp 3.7k Dec 27, 2022
    A LeetCode scraper that compiles all free-tier LeetCode questions into a text file; a PDF is also available.

    A LeetCode scraper that compiles all free-tier LeetCode questions into a text file; a PDF is also available. If new questions are added, run it again to fetch them.

    3 Dec 07, 2021
    Quick project made to help scrape Lexile and Atos (AR) levels from ISBN

    Lexile-Atos-Scraper. Quick project made to help scrape Lexile and Atos (AR) levels from ISBN. You will need to install the chrome webdriver if you have n

    1 Feb 11, 2022
    A web scraper built with Beautiful Soup that fetches Udemy course information and converts it to a JSON, CSV, or XML file

    Udemy Scraper. A web scraper built with Beautiful Soup that fetches Udemy course information. Installation: Virtual Environment: Firstly, it is recommen

    Aditya Gupta 15 May 17, 2022
    A Python script that uses selenium to locate the relevant elements and then triggers click events to automate the browser.

    N0el4kLs 5 Nov 19, 2021
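    Bilibili crawler: centered on individual users

    Open Bilibili Crawer. Bilibili is a social platform that is very rich in information, and we build a social network on top of it. In this network, nodes include users (uploaders) as well as creative works such as videos and columns; relations include: between users, follow relations (following/follower), reply relations (comment sections), and repost relations (reposting videos or posts); between users and crea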

    Boshen Shi 3 Oct 21, 2021
    Generate a repository with mirror links for DriveDroid app

    DriveDroid Repository Generator. Generate a repository for the app that allows booting a PC using ISO files stored on your Android phone. Check also an offi

    Evgeny 11 Nov 19, 2022
    ChromiumJniGenerator - Jni Generator module extracted from Chromium project

    allenxuan 4 Jun 12, 2022
    Meme-videos - Scrapes memes and turns them into video compilations

    Meme Videos. Scrapes memes from reddit using praw and request and then converts t

    Partho 12 Oct 28, 2022
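    JD.com Moutai flash-sale purchasing, latest version as of April 2021

    Jd_Seckill Special notice: the scripts in the jd_seckill project published in this repository are for testing, learning, and research only. Commercial use is prohibited, and their legality, accuracy, completeness, and validity cannot be guaranteed; please use your own judgment. No resource file in this project may be reposted or published in any form by any WeChat official account or self-media outlet. For any script issues, huanghyw accepts no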

    45 Dec 14, 2022
    Newsscraper - A simple Python 3 module to get crypto or news articles and their content from various RSS feeds.

    NewsScraper A simple Python 3 module to get crypto or news articles and their content from various RSS feeds. 🔧 Installation Clone the repo locally.

    Rokas 3 Jan 02, 2022