Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Overview

Video Games Web Scraper

Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

This project uses an open-source and collaborative framework named Scrapy.

Sources

Installation

I strongly recommend that you install this project in a dedicated virtual environment to avoid conflicting with your system packages.

See Virtual Environments and Packages on how to create and use your virtual environment.

Use the package manager pip to install the requirements of this project.

pip install -r requirements.txt

Usage

You can start crawling a source using a spider.

scrapy crawl <spider>

VideoGameGeek

Spiders

  • vgg-games
  • vgg-hotitems

Developer Resources

Initialize your Development Environment

pip install -r requirements.txt

Create and Run Tests

See the Spiders Contracts for more instructions on how to create tests for spiders and then run:

scrapy check

Scrapy Documentation

See the Scrapy Documentation for more instructions on how to create and modify spiders.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

Support

If you enjoy this repository, please star this repository. By starring a repository, it shows appreciation to the repository maintainer for their work. Many of GitHub's repository rankings depend on the number of stars a repository has.

License

MIT

You might also like...
Automated data scraper for Thailand COVID-19 data

The Researcher COVID data Automated data scraper for Thailand COVID-19 data Accessing the Data 1st Dose Provincial Vaccination Data 2nd Dose Provincia

A Smart, Automatic, Fast and Lightweight Web Scraper for Python
A Smart, Automatic, Fast and Lightweight Web Scraper for Python

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python This project is made for automatic web scraping to make scraping easy. It

A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file
A Web Scraper built with beautiful soup, that fetches udemy course information. Get udemy course information and convert it to json, csv or xml file

Udemy Scraper A Web Scraper built with beautiful soup, that fetches udemy course information. Installation Virtual Environment Firstly, it is recommen

A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country

A web Scraper for CSrankings.com that scrapes University and Faculty list for a particular country To run the file: Open terminal

Web and PDF Scraper Refactoring

Web and PDF Scraper Refactoring This repository contains the example code of the Web and PDF scraper code roast. Here are the links to the videos: Par

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)
Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

A web scraper that exports your entire WhatsApp chat history.
A web scraper that exports your entire WhatsApp chat history.

WhatSoup 🍲 A web scraper that exports your entire WhatsApp chat history. Table of Contents Overview Demo Prerequisites Instructions Frequen

A simple python web scraper.

Dissec A simple python web scraper. It gets a website and its contents and parses them with the help of bs4. Installation To install the requirements,

Web scraper for Zillow

Zillow-Scraper Instructions All terminal commands are highlighted. Make sure you first have python 3 installed. You can check this by running "python

Comments
  • Bump scrapy from 2.6.1 to 2.6.2

    Bump scrapy from 2.6.1 to 2.6.2

    Bumps scrapy from 2.6.1 to 2.6.2.

    Release notes

    Sourced from scrapy's releases.

    2.6.2

    Fixes a security issue around HTTP proxy usage, and addresses a few regressions introduced in Scrapy 2.6.0.

    See the changelog.

    Changelog

    Sourced from scrapy's changelog.

    Scrapy 2.6.2 (2022-07-25)

    Security bug fix:

    • When :class:~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware processes a request with :reqmeta:proxy metadata, and that :reqmeta:proxy metadata includes proxy credentials, :class:~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware sets the Proxy-Authentication header, but only if that header is not already set.

      There are third-party proxy-rotation downloader middlewares that set different :reqmeta:proxy metadata every time they process a request.

      Because of request retries and redirects, the same request can be processed by downloader middlewares more than once, including both :class:~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware and any third-party proxy-rotation downloader middleware.

      These third-party proxy-rotation downloader middlewares could change the :reqmeta:proxy metadata of a request to a new value, but fail to remove the Proxy-Authentication header from the previous value of the :reqmeta:proxy metadata, causing the credentials of one proxy to be sent to a different proxy.

      To prevent the unintended leaking of proxy credentials, the behavior of :class:~scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware is now as follows when processing a request:

      • If the request being processed defines :reqmeta:proxy metadata that includes credentials, the Proxy-Authorization header is always updated to feature those credentials.

      • If the request being processed defines :reqmeta:proxy metadata without credentials, the Proxy-Authorization header is removed unless it was originally defined for the same proxy URL.

        To remove proxy credentials while keeping the same proxy URL, remove the Proxy-Authorization header.

      • If the request has no :reqmeta:proxy metadata, or that metadata is a falsy value (e.g. None), the Proxy-Authorization header is removed.

        It is no longer possible to set a proxy URL through the :reqmeta:proxy metadata but set the credentials through the Proxy-Authorization header. Set proxy credentials through the :reqmeta:proxy metadata instead.

    ... (truncated)

    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump pillow from 9.0.0 to 9.0.1

    Bump pillow from 9.0.0 to 9.0.1

    Bumps pillow from 9.0.0 to 9.0.1.

    Release notes

    Sourced from pillow's releases.

    9.0.1

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

    Changes

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@​radarhere, @​hugovk]
    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]
    Changelog

    Sourced from pillow's changelog.

    9.0.1 (2022-02-03)

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

    Commits
    • 6deac9e 9.0.1 version bump
    • c04d812 Update CHANGES.rst [ci skip]
    • 4fabec3 Added release notes for 9.0.1
    • 02affaa Added delay after opening image with xdg-open
    • ca0b585 Updated formatting
    • 427221e In show_file, use os.remove to remove temporary images
    • c930be0 Restrict builtins within lambdas for ImageMath.eval
    • 75b69dd Dont need to pin for GHA
    • cd938a7 Autolink CWE numbers with sphinx-issues
    • 2e9c461 Add CVE IDs
    • See full diff in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump pillow from 9.0.0 to 9.0.1

    Bump pillow from 9.0.0 to 9.0.1

    Bumps pillow from 9.0.0 to 9.0.1.

    Release notes

    Sourced from pillow's releases.

    9.0.1

    https://pillow.readthedocs.io/en/stable/releasenotes/9.0.1.html

    Changes

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [@​radarhere, @​hugovk]
    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]
    Changelog

    Sourced from pillow's changelog.

    9.0.1 (2022-02-03)

    • In show_file, use os.remove to remove temporary images. CVE-2022-24303 #6010 [radarhere, hugovk]

    • Restrict builtins within lambdas for ImageMath.eval. CVE-2022-22817 #6009 [radarhere]

    Commits
    • 6deac9e 9.0.1 version bump
    • c04d812 Update CHANGES.rst [ci skip]
    • 4fabec3 Added release notes for 9.0.1
    • 02affaa Added delay after opening image with xdg-open
    • ca0b585 Updated formatting
    • 427221e In show_file, use os.remove to remove temporary images
    • c930be0 Restrict builtins within lambdas for ImageMath.eval
    • 75b69dd Dont need to pin for GHA
    • cd938a7 Autolink CWE numbers with sphinx-issues
    • 2e9c461 Add CVE IDs
    • See full diff in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    dependencies 
    opened by dependabot[bot] 0
  • Bump scrapy from 2.5.1 to 2.6.0

    Bump scrapy from 2.5.1 to 2.6.0

    Bumps scrapy from 2.5.1 to 2.6.0.

    Release notes

    Sourced from scrapy's releases.

    2.6.0

    • Security fixes for cookie handling (see details below)
    • Python 3.10 support
    • asyncio support is no longer considered experimental, and works out-of-the-box on Windows regardless of your Python version
    • Feed exports now support pathlib.Path output paths and per-feed item filtering and post-processing

    See the full changelog

    Security bug fixes

    • When a Request object with cookies defined gets a redirect response causing a new Request object to be scheduled, the cookies defined in the original Request object are no longer copied into the new Request object.

      If you manually set the Cookie header on a Request object and the domain name of the redirect URL is not an exact match for the domain of the URL of the original Request object, your Cookie header is now dropped from the new Request object.

      The old behavior could be exploited by an attacker to gain access to your cookies. Please, see the cjvr-mfj7-j4j8 security advisory for more information.

      Note: It is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. example.com and any subdomain) by defining the shared domain suffix (e.g. example.com) as the cookie domain when defining your cookies. See the documentation of the Request class for more information.

    • When the domain of a cookie, either received in the Set-Cookie header of a response or defined in a Request object, is set to a public suffix <https://publicsuffix.org/>_, the cookie is now ignored unless the cookie domain is the same as the request domain.

      The old behavior could be exploited by an attacker to inject cookies from a controlled domain into your cookiejar that could be sent to other domains not controlled by the attacker. Please, see the mfjm-vh54-3f96 security advisory for more information.

    Changelog

    Sourced from scrapy's changelog.

    Scrapy 2.6.0 (2022-03-01)

    Highlights:

    • :ref:Security fixes for cookie handling <2.6-security-fixes>

    • Python 3.10 support

    • :ref:asyncio support <using-asyncio> is no longer considered experimental, and works out-of-the-box on Windows regardless of your Python version

    • Feed exports now support :class:pathlib.Path output paths and per-feed :ref:item filtering <item-filter> and :ref:post-processing <post-processing>

    .. _2.6-security-fixes:

    Security bug fixes

    
    -   When a :class:`~scrapy.http.Request` object with cookies defined gets a
        redirect response causing a new :class:`~scrapy.http.Request` object to be
        scheduled, the cookies defined in the original
        :class:`~scrapy.http.Request` object are no longer copied into the new
        :class:`~scrapy.http.Request` object.
    
    If you manually set the ``Cookie`` header on a
    :class:`~scrapy.http.Request` object and the domain name of the redirect
    URL is not an exact match for the domain of the URL of the original
    :class:`~scrapy.http.Request` object, your ``Cookie`` header is now dropped
    from the new :class:`~scrapy.http.Request` object.
    

    The old behavior could be exploited by an attacker to gain access to your cookies. Please, see the cjvr-mfj7-j4j8 security advisory_ for more information.

    .. _cjvr-mfj7-j4j8 security advisory: https://github.com/scrapy/scrapy/security/advisories/GHSA-cjvr-mfj7-j4j8

    .. note:: It is still possible to enable the sharing of cookies between different domains with a shared domain suffix (e.g. example.com and any subdomain) by defining the shared domain suffix (e.g. example.com) as the cookie domain when defining your cookies. See the documentation of the :class:~scrapy.http.Request class for more information.

    • When the domain of a cookie, either received in the Set-Cookie header of a response or defined in a :class:~scrapy.http.Request object, is set to a public suffix &lt;https://publicsuffix.org/&gt;_, the cookie is now </tr></table>

... (truncated)

Commits

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot merge will merge this PR after your CI passes on it
  • @dependabot squash and merge will squash and merge this PR after your CI passes on it
  • @dependabot cancel merge will cancel a previously requested merge and block automerging
  • @dependabot reopen will reopen this PR if it is closed
  • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
  • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
  • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
  • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
  • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies 
opened by dependabot[bot] 0
Releases(v0.0.3)
Owner
Albert Marrero
Albert Marrero
This was supposed to be a web scraping project, but somehow I've turned it into a spamming project

Introduction This was supposed to be a web scraping project, but somehow I've turned it into a spamming project.

Boss Perry (Pez) 1 Jan 23, 2022
Generate a repository with mirror links for DriveDroid app

DriveDroid Repository Generator Generate a repository for the app that allow boot a PC using ISO files stored on your Android phone Check also an offi

Evgeny 11 Nov 19, 2022
Comment Webpage Screenshot is a GitHub Action that captures screenshots of web pages and HTML files located in the repository

Comment Webpage Screenshot is a GitHub Action that helps maintainers visually review HTML file changes introduced on a Pull Request by adding comments with the screenshots of the latest HTML file cha

Maksudul Haque 21 Sep 29, 2022
Scrape puzzle scrambles from csTimer.net

Scroodle Selenium script to scrape scrambles from csTimer.net csTimer runs locally in your browser, so this doesn't strain the servers any more than i

Jason Nguyen 1 Oct 29, 2021
python+selenium实现的web端自动打卡 + 每日邮件发送 + 金山词霸 每日一句 + 毒鸡汤(从2月份稳定运行至今)

python+selenium实现的web端自动打卡 说明 本打卡脚本适用于郑州大学健康打卡,其他web端打卡也可借鉴学习。(自己用的,从2月分稳定运行至今) 仅供学习交流使用,请勿依赖。开发者对使用本脚本造成的问题不负任何责任,不对脚本执行效果做出任何担保,原则上不提供任何形式的技术支持。 为防止

Sunday 1 Aug 27, 2022
A command-line program to download media, like and unlike posts, and more from creators on OnlyFans.

onlyfans-scraper A command-line program to download media, like and unlike posts, and more from creators on OnlyFans. Installation You can install thi

185 Jul 23, 2022
A simple, configurable and expandable combined shop scraper to minimize the costs of ordering several items

combined-shop-scraper A simple, configurable and expandable combined shop scraper to minimize the costs of ordering several items. Features Define an

2 Dec 13, 2021
Dude is a very simple framework for writing web scrapers using Python decorators

Dude is a very simple framework for writing web scrapers using Python decorators. The design, inspired by Flask, was to easily build a web scraper in just a few lines of code. Dude has an easy-to-lea

Ronie Martinez 326 Dec 15, 2022
Unja is a fast & light tool for fetching known URLs from Wayback Machine

Unja Fetch Known Urls What's Unja? Unja is a fast & light tool for fetching known URLs from Wayback Machine, Common Crawl, Virus Total & AlienVault's

Sheryar 10 Aug 07, 2022
A distributed crawler for weibo, building with celery and requests.

A distributed crawler for weibo, building with celery and requests.

SpiderClub 4.8k Jan 03, 2023
feapder 是一款简单、快速、轻量级的爬虫框架。以开发快速、抓取快速、使用简单、功能强大为宗旨。支持分布式爬虫、批次爬虫、多模板爬虫,以及完善的爬虫报警机制。

feapder 是一款简单、快速、轻量级的爬虫框架。起名源于 fast、easy、air、pro、spider的缩写,以开发快速、抓取快速、使用简单、功能强大为宗旨,历时4年倾心打造。支持轻量爬虫、分布式爬虫、批次爬虫、爬虫集成,以及完善的爬虫报警机制。 之

boris 1.4k Dec 29, 2022
Pelican plugin that adds site search capability

Search: A Plugin for Pelican This plugin generates an index for searching content on a Pelican-powered site. Why would you want this? Static sites are

22 Nov 21, 2022
Async Python 3.6+ web scraping micro-framework based on asyncio

Ruia 🕸️ Async Python 3.6+ web scraping micro-framework based on asyncio. ⚡ Write less, run faster. Overview Ruia is an async web scraping micro-frame

howie.hu 1.6k Jan 01, 2023
Scraping and visualising India's real-time COVID-19 data from the MOHFW dataset.

COVID19-WEB-SCRAPER Open Source Tech Lab - Project [SEMESTER IV] OSTL Assignments OSTL Assignments - 1 OSTL Assignments - 2 Project COVID19 India Data

AMEY THAKUR 8 Apr 28, 2022
Python web scrapper

Website scrapper Web scrapping project in Python. Created for learning purposes. Start Install python Update configuration with websites Launch script

Nogueira Vitor 1 Dec 19, 2021
API which uses discord to scrape NameMC searches/droptime/dropping status of minecraft names

NameMC Scrape API This is an api to scrape NameMC using message previews generated by discord. NameMC makes it a pain to scrape their website, but som

Twilak 2 Dec 22, 2021
A python script to extract answers to any question on Quora (Quora+ included)

quora-plus-bypass A python script to extract answers to any question on Quora (Quora+ included) Requirements Python 3.x

Nitin Narayanan 10 Aug 18, 2022
Scraping web pages to get data

Scraping Data Get public data and save in database This is project use Python How to run a project 1 - Clone the repository 2 - Install beautifulsoup4

Soccer Project 2 Nov 01, 2021
CreamySoup - a helper script for automated SourceMod plugin updates management.

CreamySoup/"Creamy SourceMod Updater" (or just soup for short), a helper script for automated SourceMod plugin updates management.

3 Jan 03, 2022
A webdriver-based script for reserving Tsinghua badminton courts.

AutoReserve A webdriver-based script for reserving badminton courts. 使用说明 下载 chromedriver 选择当前Chrome对应版本 安装 selenium pip install selenium 更改场次、金额信息dat

Payne Zhang 4 Nov 09, 2021