Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

Overview
 __  __  _ __   ____  
/\ \/\ \/\`'__\/',__\ 
\ \ \_\ \ \ \//\__, `\
 \ \____/\ \_\\/\____/
  \/___/  \/_/ \/___/... Universal Reddit Scraper 


usage: $ Urs.py

    [-h]
    [-e]
    [-v]

    [-t [OPTIONAL_DATE]]
    [--check]

    [-r SUBREDDIT <(h|n|c|t|r|s)> N_RESULTS_OR_KEYWORDS [OPTIONAL_TIME_FILTER]]
        [-y]
        [--csv]
        [--rules]
    [-u REDDITOR N_RESULTS]
    [-c SUBMISSION_URL N_RESULTS]
        [--raw]
    [-b]
        [--csv]

    [-lr SUBREDDIT]
    [-lu REDDITOR]
        [--nosave]
        [--stream-submissions]

    [-f FILE_PATH]
        [--csv]
    [-wc FILE_PATH [OPTIONAL_EXPORT_FORMAT]]
        [--nosave]

Table of Contents

Contact

Whether you are using URS for enterprise or personal projects, I am very interested in hearing about your use case and how it has helped you achieve a goal.

Additionally, please send me an email if you would like to contribute, have questions, or want to share something you have built on top of it.

You can send me an email or leave a note by clicking on either of these badges. I look forward to hearing from you!

Email Say Thanks!

Introduction

This is a comprehensive Reddit scraping tool that integrates multiple features:

  • Scrape Reddit via PRAW (the official Python Reddit API Wrapper)
    • Scrape Subreddits
    • Scrape Redditors
    • Scrape submission comments
  • Livestream Reddit via PRAW
    • Livestream comments submitted within Subreddits or by Redditors
    • Livestream submissions submitted within Subreddits or by Redditors
  • Analytical tools for scraped data
    • Generate frequencies for words that are found in submission titles, bodies, and/or comments
    • Generate a wordcloud from scrape results

See the Getting Started section to get your API credentials.

Installation

NOTE: Requires Python 3.7+

git clone --depth=1 https://github.com/JosephLai241/URS.git
cd URS
pip3 install . -r requirements.txt
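
A quick sanity check after installing (the -h and -e flags print the usage menu and additional example usage; see the Troubleshooting section below if Python cannot find the urs module):

cd urs
./Urs.py -h    # display the usage menu
./Urs.py -e    # display additional example usage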

Troubleshooting

ModuleNotFoundError

You may run into an error that looks like this:

Traceback (most recent call last):
  File "/home/joseph/URS/urs/./Urs.py", line 30, in <module>
    from urs.utils.Logger import LogMain
ModuleNotFoundError: No module named 'urs'
This means you will need to add the URS directory to your PYTHONPATH. Here is a link that explains how to do so for each operating system.
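
For example, on Linux or macOS with a bash-like shell, a minimal fix (assuming you are in the root of the cloned repository) looks like this:

# add the repository root to PYTHONPATH, then run URS
export PYTHONPATH="$PWD:$PYTHONPATH"
python3 urs/Urs.py -h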

Exporting

Export File Format

All files except for those generated by the wordcloud tool are exported to JSON by default. Wordcloud files are exported to PNG by default.

URS supports exporting to CSV as well, but JSON is the more versatile option.

Exporting to CSV

You will have to include the --csv flag to export to CSV.

You can only export to CSV when using:

  • The Subreddit scrapers
  • The word frequencies generator

These tools produce data that also works well in CSV format, so the option is available if you prefer it.

The --csv flag is ignored if it is present while using any of the other scrapers.

Export Directory Structure

All exported files are saved within the scrapes directory, in a sub-directory labeled with the date. Additional sub-directories may be created within the date directory, but a sub-directory is only created when its respective tool is run. For example, if you only use the Subreddit scraper, only the subreddits directory is created.

PRAW Scrapers

The subreddits, redditors, or comments directories may be created.

PRAW Livestream Scrapers

The livestream directory is created when you run any of the livestream scrapers. Within it, the subreddits or redditors directories may be created.

Analytical Tools

The analytics directory is created when you run any of the analytical tools. Within it, the frequencies or wordclouds directories may be created. See the Analytical Tools section for more information.

Example Directory Structure

This is a sample directory structure generated by the tree command.

scrapes/
└── 06-02-2021
    ├── analytics
    │   ├── frequencies
    │   │   ├── comments
    │   │   │   └── What’s something from the 90s you miss_-all.json
    │   │   ├── livestream
    │   │   │   └── subreddits
    │   │   │       └── askreddit-comments-20_44_11-00_01_10.json
    │   │   └── subreddits
    │   │       └── cscareerquestions-search-'job'-past-year-rules.json
    │   └── wordcloud
    │       ├── comments
    │       │   └── What’s something from the 90s you miss_-all.png
    │       ├── livestream
    │       │   └── subreddits
    │       │       └── askreddit-comments-20_44_11-00_01_10.png
    │       └── subreddits
    │           └── cscareerquestions-search-'job'-past-year-rules.png
    ├── comments
    │   └── What’s something from the 90s you miss_-all.json
    ├── livestream
    │   └── subreddits
    │       ├── askreddit-comments-20_44_11-00_01_10.json
    │       └── askreddit-submissions-20_46_12-00_01_52.json
    ├── redditors
    │   └── spez-5-results.json
    ├── subreddits
    │   ├── askreddit-hot-10-results.json
    │   └── cscareerquestions-search-'job'-past-year-rules.json
    └── urs.log

URS Overview

Scrape Speeds

Your internet connection speed is the primary bottleneck that determines scrape duration; however, there are additional bottlenecks such as:

  • The number of results returned for Subreddit or Redditor scraping.
  • The submission's popularity (total number of comments) for submission comments scraping.

Scraping Reddit via PRAW

Getting Started

It is very quick and easy to get Reddit API credentials. Refer to my guide to get your credentials, then update the environment variables located in .env.

Rate Limits

Yes, PRAW has rate limits. These limits are proportional to how much karma you have accumulated - the higher the karma, the higher the rate limit. This has been implemented to mitigate spammers and bots that utilize PRAW.

Rate limit information for your account is displayed in a small table underneath the successful login message each time you run any of the PRAW scrapers. I have also added a --check flag if you want to quickly view this information.

URS will display an error message as well as the rate limit reset date if you have used all your available requests.

There are a few ways to work around rate limits:

  • Scrape intermittently
  • Use an account with high karma to get your PRAW credentials
  • Scrape fewer results per run

Available requests are refilled if you use the PRAW scrapers intermittently, which might be the best solution. This can be especially helpful if you have automated URS and are not looking at the output on each run.

A Table of All Subreddit, Redditor, and Submission Comments Attributes

These attributes are included in each scrape.

| Subreddits (submissions) | Redditors | Submission Comments |
|---|---|---|
| author | comment_karma | author |
| created_utc | created_utc | body |
| distinguished | fullname | body_html |
| edited | has_verified_email | created_utc |
| id | icon_img | distinguished |
| is_original_content | id | edited |
| is_self | is_employee | id |
| link_flair_text | is_friend | is_submitter |
| locked | is_mod | link_id |
| name | is_gold | parent_id |
| num_comments | link_karma | score |
| nsfw | name | stickied |
| permalink | subreddit | |
| score | *trophies | |
| selftext | *comments | |
| spoiler | *controversial | |
| stickied | *downvoted (may be forbidden) | |
| title | *gilded | |
| upvote_ratio | *gildings (may be forbidden) | |
| url | *hidden (may be forbidden) | |
| | *hot | |
| | *moderated | |
| | *multireddits | |
| | *new | |
| | *saved (may be forbidden) | |
| | *submissions | |
| | *top | |
| | *upvoted (may be forbidden) | |

*Includes additional attributes; see Redditors section for more information.

Available Flags

[-r SUBREDDIT <(h|n|c|t|r|s)> N_RESULTS_OR_KEYWORDS [OPTIONAL_TIME_FILTER]]
    [-y]
    [--csv]
    [--rules]
[-u REDDITOR N_RESULTS]
[-c SUBMISSION_URL N_RESULTS]
    [--raw]
[-b]
    [--csv]

Subreddits

Subreddit Demo GIF

*This GIF is uncut.

Usage: $ ./Urs.py -r SUBREDDIT (H|N|C|T|R|S) N_RESULTS_OR_KEYWORDS

Supported export formats: JSON and CSV. To export to CSV, include the --csv flag.

You can specify Subreddits, the submission category, and how many results are returned from each scrape. I have also added a search option where you can search for keywords within a Subreddit.

These are the submission categories:

  • Hot
  • New
  • Controversial
  • Top
  • Rising
  • Search

The file names for all categories except for Search will follow this format:

"[SUBREDDIT]-[POST_CATEGORY]-[N_RESULTS]-result(s).[FILE_FORMAT]"

If you searched for keywords, file names will follow this format:

"[SUBREDDIT]-Search-'[KEYWORDS]'.[FILE_FORMAT]"

Scrape data is exported to the subreddits directory.

NOTE: Up to 100 results are returned if you search for keywords within a Subreddit. You will not be able to specify how many results to keep.
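
Here are a couple of illustrative runs using the Subreddit names from the sample directory structure above:

# scrape the 10 hottest submissions in r/askreddit and export to CSV
./Urs.py -r askreddit h 10 --csv

# search r/cscareerquestions for the keyword "job"
./Urs.py -r cscareerquestions s job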

Time Filters

Time filters may be applied to some categories. Here is a table of the categories on which you can apply a time filter as well as the valid time filters.

| Categories | Time Filters |
|---|---|
| Controversial | All (default) |
| Top | Day |
| Search | Hour |
| | Month |
| | Week |
| | Year |

Specify the time filter after the number of results returned or keywords you want to search for.

Usage: $ ./Urs.py -r SUBREDDIT (C|T|S) N_RESULTS_OR_KEYWORDS OPTIONAL_TIME_FILTER

If no time filter is specified, the default time filter all is applied. The Subreddit settings table will display None for categories that do not offer the additional time filter option.

If you specified a time filter, -past-[TIME_FILTER] will be appended to the file name before the file format like so:

"[SUBREDDIT]-[POST_CATEGORY]-[N_RESULTS]-result(s)-past-[TIME_FILTER].[FILE_FORMAT]"

Or if you searched for keywords:

"[SUBREDDIT]-Search-'[KEYWORDS]'-past-[TIME_FILTER].[FILE_FORMAT]"

Subreddit Rules and Post Requirements

You can also include the Subreddit's rules and post requirements in your scrape data by including the --rules flag. This only works when exporting to JSON. This data will be included in the subreddit_rules field.

If rules are included in your file, -rules will be appended to the end of the file name.

Bypassing the Final Settings Check

After you submit the arguments and Reddit object validation completes, URS displays a table of Subreddit scraping settings as a final check before executing. You can include the -y flag to bypass this table and scrape immediately.
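
Putting the optional flags together, an illustrative run that searches r/cscareerquestions for "job" within the past year, includes the Subreddit's rules, and skips the final settings check looks like this:

./Urs.py -r cscareerquestions s job year --rules -y

Based on the naming rules above, this would export to a file such as cscareerquestions-search-'job'-past-year-rules.json.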


Redditors

Redditor Demo GIF

*This GIF has been cut for demonstration purposes.

Usage: $ ./Urs.py -u REDDITOR N_RESULTS

Supported export formats: JSON.

You can also scrape Redditor profiles and specify how many results are returned.

Redditor information will be included in the information field and includes the following attributes:

Redditor Information
comment_karma
created_utc
fullname
has_verified_email
icon_img
id
is_employee
is_friend
is_mod
is_gold
link_karma
name
subreddit
trophies

Redditor interactions will be included in the interactions field. Here is a table of all Redditor interaction attributes that are also included, how they are sorted, and what type of Reddit objects are included in each.

| Attribute Name | Sorted By/Time Filter | Reddit Objects |
|---|---|---|
| Comments | Sorted By: New | Comments |
| Controversial | Time Filter: All | Comments and submissions |
| Downvoted | Sorted By: New | Comments and submissions |
| Gilded | Sorted By: New | Comments and submissions |
| Gildings | Sorted By: New | Comments and submissions |
| Hidden | Sorted By: New | Comments and submissions |
| Hot | Determined by other Redditors' interactions | Comments and submissions |
| Moderated | N/A | Subreddits |
| Multireddits | N/A | Multireddits |
| New | Sorted By: New | Comments and submissions |
| Saved | Sorted By: New | Comments and submissions |
| Submissions | Sorted By: New | Submissions |
| Top | Time Filter: All | Comments and submissions |
| Upvoted | Sorted By: New | Comments and submissions |

These attributes contain comments or submissions. Subreddit attributes are also included within both.

This is a table of all attributes that are included for each Reddit object:

| Subreddits | Comments | Submissions | Multireddits | Trophies |
|---|---|---|---|---|
| can_assign_link_flair | body | author | can_edit | award_id |
| can_assign_user_flair | body_html | created_utc | copied_from | description |
| created_utc | created_utc | distinguished | created_utc | icon_40 |
| description | distinguished | edited | description_html | icon_70 |
| description_html | edited | id | description_md | name |
| display_name | id | is_original_content | display_name | url |
| id | is_submitter | is_self | name | |
| name | link_id | link_flair_text | nsfw | |
| nsfw | parent_id | locked | subreddits | |
| public_description | score | name | visibility | |
| spoilers_enabled | stickied | num_comments | | |
| subscribers | *submission | nsfw | | |
| user_is_banned | subreddit_id | permalink | | |
| user_is_moderator | | score | | |
| user_is_subscriber | | selftext | | |
| | | spoiler | | |
| | | stickied | | |
| | | *subreddit | | |
| | | title | | |
| | | upvote_ratio | | |
| | | url | | |

* Contains additional metadata.

The file names will follow this format:

"[USERNAME]-[N_RESULTS]-result(s).json"

Scrape data is exported to the redditors directory.

NOTE: If you are not allowed to access a Redditor's lists, PRAW will raise a 403 HTTP Forbidden exception and the program will just append "FORBIDDEN" underneath that section in the exported file.

NOTE: The number of results returned is applied to all attributes. I have not implemented a way to specify a different number of results for individual attributes.
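
For example, the spez-5-results.json file in the sample directory structure above would come from a run like this. The jq command is just one optional way to peek at the export; the .data.interactions path is an assumption about the JSON nesting, so adjust it to match your own output:

# scrape u/spez, returning 5 results per attribute
./Urs.py -u spez 5

# optionally list the interaction attributes with jq (if installed)
jq '.data.interactions | keys' scrapes/06-02-2021/redditors/spez-5-results.json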


Submission Comments

Submission Comments Demo GIF

*This GIF has been cut for demonstration purposes.

Usage: $ ./Urs.py -c SUBMISSION_URL N_RESULTS

Supported export formats: JSON.

You can also scrape comments from submissions and specify the number of results returned.

Submission metadata will be included in the submission_metadata field and includes the following attributes:

Submission Attributes
author
created_utc
distinguished
edited
is_original_content
is_self
link_flair_text
locked
nsfw
num_comments
permalink
score
selftext
spoiler
stickied
subreddit
title
upvote_ratio

If the submission contains a gallery, the attributes gallery_data and media_metadata will be included.

Comments are written to the comments field. They are sorted by "Best", which is the default sorting option when you visit a submission.

PRAW returns submission comments in level order, which means scrape speeds are proportional to the submission's popularity.

The file names will generally follow this format:

"[POST_TITLE]-[N_RESULTS]-result(s).json"

Scrape data is exported to the comments directory.

Number of Comments Returned

You can scrape all comments from a submission by passing in 0 for N_RESULTS. Subsequently, [N_RESULTS]-result(s) in the file name will be replaced with all.

Otherwise, specify the number of results you want returned. If you pass in a specific number of results, the structured export will return up to N_RESULTS top-level comments, each including all of its replies.

Structured Comments

This is the default export style. Structured scrapes resemble comment threads on Reddit. This style takes just a little longer to export compared to the raw format because URS uses depth-first search to create the comment Forest after retrieving all comments from a submission.

If you want to learn more about how it works, refer to this additional document where I describe how I implemented the Forest.

Raw Comments

Raw scrapes do not resemble comment threads; instead, all comments on a submission are returned in level order: all top-level comments are listed first, followed by all second-level comments, then third, and so on.

You can export to raw format by including the --raw flag. -raw will also be appended to the end of the file name.
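
For example, where SUBMISSION_URL is a placeholder for a full submission link:

# export all comments in the default structured format
./Urs.py -c SUBMISSION_URL 0

# export all comments in raw, level-order format instead
./Urs.py -c SUBMISSION_URL 0 --raw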

Livestreaming Reddit via PRAW

These tools may be used to livestream comments or submissions submitted within Subreddits or by Redditors.

Comments are streamed by default. To stream submissions instead, include the --stream-submissions flag.

New comments or submissions will continue to display within your terminal until you abort the stream using Ctrl + C.

The filenames will follow this format:

[SUBREDDIT_OR_REDDITOR]-[comments_OR_submissions]-[START_TIME_IN_HOURS_MINUTES_SECONDS]-[DURATION_IN_HOURS_MINUTES_SECONDS].json

This file is saved within the livestream directory, in either the subreddits or redditors sub-directory depending on which stream was run.

Reddit objects will be written to this JSON file in real time. After aborting the stream, the filename will be updated with the start time and duration.

Displayed vs. Saved Attributes

Displayed comment and submission attributes have been stripped down to essential fields to declutter the output. Here is a table of what is shown during the stream:

| Comment Attributes | Submission Attributes |
|---|---|
| author | author |
| body | created_utc |
| created_utc | is_self |
| is_submitter | link_flair_text |
| submission_author | nsfw |
| submission_created_utc | selftext |
| submission_link_flair_text | spoiler |
| submission_nsfw | stickied |
| submission_num_comments | title |
| submission_score | url |
| submission_title | |
| submission_upvote_ratio | |
| submission_url | |

Comment and submission attributes that are written to file will include the full list of attributes found in the Table of All Subreddit, Redditor, and Submission Comments Attributes.

Available Flags

[-lr SUBREDDIT]
[-lu REDDITOR]

    [--nosave]
    [--stream-submissions]

Livestreaming Subreddits

Livestream Subreddit Demo GIF

*This GIF has been cut for demonstration purposes.

Usage: $ ./Urs.py -lr SUBREDDIT

Supported export formats: JSON.

Default stream objects: Comments. To stream submissions instead, include the --stream-submissions flag.

You can livestream comments or submissions that are created within a Subreddit.

Reddit object information is displayed in a PrettyTable as new objects are submitted.

NOTE: PRAW may not be able to catch all new submissions or comments within a high-volume Subreddit, as mentioned in these disclaimers located in the "Note" boxes.


Livestreaming Redditors

Livestream demo was not recorded for Redditors because its functionality is identical to the Subreddit livestream.

Usage: $ ./Urs.py -lu REDDITOR

Supported export formats: JSON.

Default stream objects: Comments. To stream submissions instead, include the --stream-submissions flag.

You can livestream comments or submissions that are created by a Redditor.

Reddit object information is displayed in a PrettyTable as new objects are submitted.


Do Not Save Livestream to File

Include the --nosave flag if you do not want to save the livestream to file.
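
A few illustrative livestream runs:

# stream new comments submitted within r/askreddit
./Urs.py -lr askreddit

# stream new submissions instead of comments
./Urs.py -lr askreddit --stream-submissions

# stream a Redditor's new comments without saving to file
./Urs.py -lu spez --nosave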

Analytical Tools

This suite of tools can be used after scraping data from Reddit. Both of these tools analyze the frequencies of words found in submission titles and bodies, or comments within JSON scrape data.

There are a few ways you can quickly get the correct filepath to the scrape file:

  • Drag and drop the file into the terminal.
  • Partially type the path and rely on tab completion support to finish the full path for you.

Running either tool will create the analytics directory within the date directory. This directory is located in the same directory in which the scrape data resides. For example, if you run the frequencies generator on February 16th for scrape data that was captured on February 14th, analytics will be created in the February 14th directory. Command history will still be written in the February 16th urs.log.

The sub-directories frequencies or wordclouds are created in analytics depending on which tool is run. These directories mirror the directories in which the original scrape files reside. For example, if you run the frequencies generator on a Subreddit scrape, the directory structure will look like this:

analytics/
└── frequencies
    └── subreddits
        └── SUBREDDIT_SCRAPE.json

A shortened export path is displayed once URS has completed exporting the data, informing you where the file is saved within the scrapes directory. You can open urs.log to view the full path.

Target Fields

The data varies depending on the scraper, so these tools target different fields for each type of scrape data:

| Scrape Data | Targets |
|---|---|
| Subreddit | selftext, title |
| Redditor | selftext, title, body |
| Submission Comments | body |
| Livestream | selftext and title, or body |

For Subreddit scrapes, data is pulled from the selftext and title fields for each submission (submission title and body).

For Redditor scrapes, data is pulled from all three fields because both submission and comment data are returned. The selftext and title fields are targeted for submissions, and the body field is targeted for comments.

For submission comments scrapes, data is only pulled from the body field of each comment.

For livestream scrapes, comments or submissions may be included depending on user settings. The selftext and title fields are targeted for submissions, and the body field is targeted for comments.

File Names

File names are identical to the original scrape data so that it is easier to distinguish which analytical file corresponds to which scrape.

Available Flags

[-f FILE_PATH]
    [--csv]
[-wc FILE_PATH [OPTIONAL_EXPORT_FORMAT]]
    [--nosave]

Generating Word Frequencies

Frequencies Demo GIF

*This GIF is uncut.

Usage: $ ./Urs.py -f FILE_PATH

Supported export formats: JSON and CSV. To export to CSV, include the --csv flag.

You can generate a dictionary of word frequencies created from the words within the target fields. These frequencies are sorted from highest to lowest.

Frequencies are exported to JSON by default, but this tool also works well with CSV.

Exported files will be saved to the analytics/frequencies directory.
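
For example, generating word frequencies for the Subreddit scrape from the sample directory structure above and exporting to CSV:

./Urs.py -f scrapes/06-02-2021/subreddits/askreddit-hot-10-results.json --csv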


Generating Wordclouds

Wordcloud Demo GIF

*This GIF is uncut.

Usage: $ ./Urs.py -wc FILE_PATH

Supported export formats: eps, jpeg, jpg, pdf, png (default), ps, rgba, tif, tiff.

Taking word frequencies to the next level, you can generate wordclouds based on word frequencies. This tool is independent of the frequencies generator - you do not need to run the frequencies generator before creating a wordcloud.

PNG is the default format, but you can also export to any of the options listed above by including the format as the second flag argument.

Usage: $ ./Urs.py -wc FILE_PATH OPTIONAL_EXPORT_FORMAT

Exported files will be saved to the analytics/wordclouds directory.

Display Wordcloud Instead of Saving

Wordclouds are saved to file by default. If you do not want to keep a file, include the --nosave flag to only display the wordcloud.
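
For example, using the Subreddit scrape from the sample directory structure above:

# save a PNG wordcloud to analytics/wordclouds
./Urs.py -wc scrapes/06-02-2021/subreddits/askreddit-hot-10-results.json

# export to JPG instead, or only display the wordcloud without saving
./Urs.py -wc scrapes/06-02-2021/subreddits/askreddit-hot-10-results.json jpg
./Urs.py -wc scrapes/06-02-2021/subreddits/askreddit-hot-10-results.json --nosave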

Utilities

This section briefly outlines the utilities included with URS.

Available Flags

[-t [OPTIONAL_DATE]]
[--check]

Display Directory Tree

Display Directory Tree Demo GIF

Usage: $ ./Urs.py -t

If no date is provided, the directory structure for the current date is displayed. This is a quick alternative to the tree command.

You can also display a different day's scrapes by providing a date after the -t flag.

Usage: $ ./Urs.py -t OPTIONAL_DATE

The following date formats are supported:

  • MM-DD-YYYY
  • MM/DD/YYYY

An error is displayed if URS was not run on the entered date (if the date directory is not found within the scrapes directory).
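
For example, to display the scrapes from the sample directory structure above:

./Urs.py -t 06-02-2021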

Check PRAW Rate Limits

Check PRAW Rate Limits Demo GIF

Usage: $ ./Urs.py --check

You can quickly check the rate limits for your account by using this flag.

Sponsors

This is a shout-out section for my patrons - thank you so much for sponsoring this project!

Contributing

See the Contact section for ways to reach me.

Before Making Pull or Feature Requests

Consider the scope of this project before submitting a pull or feature request. URS stands for Universal Reddit Scraper. Two important aspects are listed in its name - universal and scraper.

I will not approve feature or pull requests that deviate from its sole purpose. This may include scraping a specific aspect of Reddit or adding functionality that allows you to post a comment with URS. Adding either of these requests will no longer allow URS to be universal or merely a scraper. However, I am more than happy to approve requests that enhance the current scraping capabilities of URS.

Building on Top of URS

Although I will not approve requests that deviate from the project scope, feel free to reach out if you have built something on top of URS or have made modifications to scrape something specific on Reddit. I will add your project to the Derivative Projects section!

Making Pull or Feature Requests

You can suggest new features or changes by going to the Issues tab and filling out the Feature Request template. If there is a good reason for a new feature, I will consider adding it.

You are also more than welcome to create a pull request - adding additional features, improving runtime, or refactoring existing code. If it is approved, I will merge the pull request into the master branch and credit you for contributing to this project.

Contributors

| Date | User | Contribution |
|---|---|---|
| March 11, 2020 | ThereGoesMySanity | Created a pull request adding 2FA information to the README |
| October 6, 2020 | LukeDSchenk | Created a pull request fixing the "[Errno 36] File name too long" issue, which made it impossible to save comment scrapes with long titles |
| October 10, 2020 | IceBerge421 | Created a pull request fixing a cloning error occurring on Windows machines due to an illegal file name character (") found in two scrape samples |

Derivative Projects

This is a showcase for projects that are built on top of URS!

skiwheelr/URS

Contains a bash script built on URS which counts ticker mentions in Subreddits, subsequently cURLs all the relevant links in parallel, and counts the mentions of those.

skiwheelr screenshot

Comments
  • Bump certifi from 2021.5.30 to 2022.12.7


    dependencies 
    opened by dependabot[bot] 0
  • Suggest to loosen the dependency on halo


    Dear developers,

    Your project URS requires "halo==0.0.31" in its dependencies. After analyzing the source code, we found that the following version of halo can also be suitable without affecting your project, i.e., halo 0.0.30. Therefore, we suggest loosening the dependency on halo from "halo==0.0.31" to "halo>=0.0.30,<=0.0.31" to avoid any possible conflicts when importing more packages or for downstream projects that may use ddos_script.

    May I submit a pull request to further loosen the dependency on halo?

    By the way, could you please tell us whether such dependency analysis may be potentially helpful for maintaining dependencies more easily during your development?



    Details:

    Your project (commit id: 9f8cf3a3adb9aa5079dfc7bfd7832b53358ee40f) directly uses 6 APIs from package halo.

    halo.halo.Halo.start, halo.halo.Halo.succeed, halo.halo.Halo.info, halo.halo.Halo.warn, halo.halo.Halo.fail, halo.halo.Halo.__init__
    

    Beginning from these, 28 functions are then indirectly called, including 17 of halo's internal APIs and 11 outside APIs, as follows:

    [/JosephLai241/URS]
    +--halo.halo.Halo.start
    |      +--halo.halo.Halo._check_stream
    |      +--halo.halo.Halo._hide_cursor
    |      |      +--halo.halo.Halo._check_stream
    |      |      +--halo.cursor.hide
    |      |      |      +--halo.cursor._CursorInfo.__init__
    |      |      |      +--ctypes.windll.kernel32.GetStdHandle
    |      |      |      +--ctypes.windll.kernel32.GetConsoleCursorInfo
    |      |      |      +--ctypes.windll.kernel32.SetConsoleCursorInfo
    |      |      |      +--ctypes.byref
    |      +--threading.Event
    |      +--threading.Thread
    |      +--halo.halo.Halo._render_frame
    |      |      +--halo.halo.Halo.clear
    |      |      |      +--halo.halo.Halo._write
    |      |      |      |      +--halo.halo.Halo._check_stream
    |      |      +--halo.halo.Halo.frame
    |      |      |      +--halo._utils.colored_frame
    |      |      |      |      +--termcolor.colored
    |      |      |      +--halo.halo.Halo.text_frame
    |      |      |      |      +--halo._utils.colored_frame
    |      |      +--halo.halo.Halo._write
    |      |      +--halo._utils.encode_utf_8_text
    |      |      |      +--codecs.encode
    +--halo.halo.Halo.succeed
    |      +--halo.halo.Halo.stop_and_persist
    |      |      +--halo._utils.decode_utf_8_text
    |      |      |      +--codecs.decode
    |      |      +--halo._utils.colored_frame
    |      |      +--halo.halo.Halo.stop
    |      |      |      +--halo.halo.Halo.clear
    |      |      |      +--halo.halo.Halo._show_cursor
    |      |      |      |      +--halo.halo.Halo._check_stream
    |      |      |      |      +--halo.cursor.show
    |      |      |      |      |      +--halo.cursor._CursorInfo.__init__
    |      |      |      |      |      +--ctypes.windll.kernel32.GetStdHandle
    |      |      |      |      |      +--ctypes.windll.kernel32.GetConsoleCursorInfo
    |      |      |      |      |      +--ctypes.windll.kernel32.SetConsoleCursorInfo
    |      |      |      |      |      +--ctypes.byref
    |      |      +--halo.halo.Halo._write
    |      |      +--halo._utils.encode_utf_8_text
    +--halo.halo.Halo.info
    |      +--halo.halo.Halo.stop_and_persist
    +--halo.halo.Halo.warn
    |      +--halo.halo.Halo.stop_and_persist
    +--halo.halo.Halo.fail
    |      +--halo.halo.Halo.stop_and_persist
    +--halo.halo.Halo.__init__
    |      +--halo._utils.get_environment
    |      |      +--IPython.get_ipython
    |      +--halo.halo.Halo.stop
    |      +--IPython.get_ipython
    |      +--atexit.register
    

    Since none of these functions have changed between halo 0.0.30 and 0.0.31, we believe it is safe to loosen the corresponding dependency.

    opened by Agnes-U 0
  • Bump pillow from 8.2.0 to 9.3.0


    Bumps pillow from 8.2.0 to 9.3.0.

    Release notes

    Sourced from pillow's releases.

    9.3.0

    https://pillow.readthedocs.io/en/stable/releasenotes/9.3.0.html

    Changes

    ... (truncated)

    Changelog

    Sourced from pillow's changelog.

    9.3.0 (2022-10-29)

    • Limit SAMPLESPERPIXEL to avoid runtime DOS #6700 [wiredfool]

    • Initialize libtiff buffer when saving #6699 [radarhere]

    • Inline fname2char to fix memory leak #6329 [nulano]

    • Fix memory leaks related to text features #6330 [nulano]

    • Use double quotes for version check on old CPython on Windows #6695 [hugovk]

    • Remove backup implementation of Round for Windows platforms #6693 [cgohlke]

    • Fixed set_variation_by_name offset #6445 [radarhere]

    • Fix malloc in _imagingft.c:font_setvaraxes #6690 [cgohlke]

    • Release Python GIL when converting images using matrix operations #6418 [hmaarrfk]

    • Added ExifTags enums #6630 [radarhere]

    • Do not modify previous frame when calculating delta in PNG #6683 [radarhere]

    • Added support for reading BMP images with RLE4 compression #6674 [npjg, radarhere]

    • Decode JPEG compressed BLP1 data in original mode #6678 [radarhere]

    • Added GPS TIFF tag info #6661 [radarhere]

    • Added conversion between RGB/RGBA/RGBX and LAB #6647 [radarhere]

    • Do not attempt normalization if mode is already normal #6644 [radarhere]

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 0
  • Update timestamp to use ISO8601 standard


    Overview

    Summary

    As much as I love this scraper, finding my results in /scrapes/09-17-2022/ makes my toes curl and causes me physical pain. Refer to https://xkcd.com/1179/ and https://www.reddit.com/r/ISO8601/ for explanations of why ISO8601 is superior in every aspect.

    Motivation/Context

    ISO8601 was created as an international standard for timestamps. This article provides some context for its inception.

    New Dependencies

    None
    

    Issue Fix or Enhancement Request

    N/A

    Type of Change

    • [x] Code Refactor
    • [x] This change requires a documentation update

    Breaking Change

    N/A

    List All Changes That Have Been Made

    Changed

    • Source code
      • In Global.py:
        • Updated timestamp format for the date variable.
    • README
      • Summary of change
        • Describing the change
    • Tests
      • Summary of change
        • Describing the change

    How Has This Been Tested?

    Put "N/A" in this block if this is not applicable.

    Please describe the tests that you ran to verify your changes. Provide instructions so I can reproduce. Please also list any relevant details for your test configuration. Section your tests by relevance if it is lengthy. An example outline is shown below:

    • Summary of a test here
      • Details here with relevant test commands underneath.
        • Ran test command here.
          • If applicable, more details about the command underneath.
        • Then ran another test command here.

    Test Configuration

    Put "N/A" in this block if this is not applicable.

    • Python version: 3.x.x

    If applicable, describe more configuration settings. An example outline is shown below:

    • Summary goes here.
      • Configuration 1.
      • Configuration 2.
        • If applicable, provide extra details underneath a configuration.
      • Configuration 3.

    Dependencies

    N/A
    

    Checklist

    Tip: You can check off items by writing an "x" in the brackets, e.g. [x].

    • [ ] My code follows the style guidelines of this project.
    • [ ] I have performed a self-review of my own code, including testing to ensure my fix is effective or that my feature works.
    • [ ] My changes generate no new warnings.
    • [ ] I have commented my code, providing a summary of the functionality of each method, particularly in areas that may be hard to understand.
    • [ ] I have made corresponding changes to the documentation.
    • [ ] I have performed a self-review of this Pull Request template, ensuring the Markdown file renders correctly.
    refactor 
    opened by fridde 1
  • Bump numpy from 1.21.0 to 1.22.0


    Bumps numpy from 1.21.0 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 0
Releases(v3.3.2)
  • v3.3.2(Jul 22, 2021)

    Summary

    This release fixes an open issue.

    PRAW v7.3.0 changed the Redditor object's subreddit attribute. This change breaks the Redditor scraper. It would be nice if all the tools worked as advertised.

    Full Changelog

    Added

    • Source code
      • In Redditor.py:
        • Added a new method GetInteractions._get_user_subreddit() - extracts subreddit data from the UserSubreddit object into a dictionary.
    • Tests
      • In test_Redditor.py:
        • Added TestGetUserSubredditMethod().test_get_user_subreddit() to test the new method.

    Changed

    • Source code
      • In Redditor.py:
        • GetInteractions._get_user_info() calls the new GetInteractions._get_user_subreddit() method to set the Redditor's subreddit data within the main Redditor information dictionary.
      • In Version.py:
        • Incremented version number.
    • README
      • Incremented PRAW badge version number.
    Source code(tar.gz)
    Source code(zip)
  • v3.3.1(Jul 3, 2021)

    Summary

    • Introduced a new utility, -t, which will display a visual tree of the current day's scrape directory by default. Optionally, include a different date to display that day's scrape directory.
    • Moved CI provider from Travis-CI to GitHub Actions.
      • Travis-CI is no longer free - there is now a cap on free builds.
    • Minor code refactoring and issue resolution.

    Full Changelog

    Added

    • User interface
      • Added a new utility:
        • -t/--tree - display the directory structure of the current date directory. Or optionally include a date to display that day's scrape directory.
    • Source code
      • Added a new file Utilities.py to the urs/utils module.
        • Added a class DateTree which contains methods to find and build a visual tree for the target date's directory.
          • Added logging when this utility is run.
      • Added an additional Halo to the wordcloud generator.
    • README
      • Added new "Utilities" section.
        • This section describes how to use the -t/--tree and --check utility flags.
      • Added new "Sponsors" section.
    • Tests
      • Added test_Utilities.py under the test_utils module.

    Changed

    • Source code
      • Refactored the following methods within the analytics module:
        • GetPath.get_scrape_type()
        • GetPath.name_file()
        • FinalizeWordcloud().save_wordcloud()
          • Implemented pathlib's Path() method to get the path.
      • Upgraded all string formatting from old-school Python formatting (using the % operator) to the superior f-string.
      • Updated GitHub Actions workflow pytest.yml.
        • This workflow was previously disabled. The workflow has been upgraded to test URS on all platforms (ubuntu-latest, macOS-latest, and windows-latest) and to send test coverage to Codecov after testing completes on ubuntu-latest.
    • README
      • Changed the Travis-CI badge to a GitHub Actions badge.
        • Updated badge link to route to the workflows page within the repository.
    • Tests
      • Upgraded all string formatting from old-school Python formatting (using the % operator) to the superior f-string in the following modules:
        • test_utils/test_Export.py
        • test_praw_scrapers/test_live_scrapers/test_Livestream.py
      • Refactored two tests within test_Export.py:
        • TestExportWriteCSVAndWriteJSON().test_write_csv()
        • TestExportExportMethod().test_export_write_csv()
    • Community documents
      • Updated PULL_REQUEST_TEMPLATE.md.
        • Removed Travis-CI configuration block.

    Deprecated

    • Source code
      • Removed .travis.yml - URS no longer uses Travis-CI as its CI provider.
    Source code(tar.gz)
    Source code(zip)
  • v3.3.0(Jun 15, 2021)

    Summary

    • Introduced livestreaming tools:
      • Livestream comments or submissions submitted within Subreddits.
      • Livestream comments or submissions submitted by a Redditor.

    Full Changelog

    Added

    • User interface
      • Added livestream scraper flags:
        • -lr - livestream a Subreddit
        • -lu - livestream a Redditor
        • Added livestream scrape control flags to limit stream exclusively to submissions (default is streaming comments):
          • --stream-submissions
      • Added a flag -v/--version to display the version number.
    • Source code
      • Added a new sub-module live_scrapers within praw_scrapers for livestream functionality:
        • Livestream.py
        • utils/DisplayStream.py
        • utils/StreamGenerator.py
      • Added a new file Version.py to single-source the package version.
      • Added a gallery_data and media_metadata check in Comments.py, which includes the above fields if the submission contains a gallery.
    • README
      • Added a new "Installation" section with updated installation procedures.
      • Added a new section "Livestreaming Subreddits and Redditors" with sub-sections containing details for each flag.
      • Updated the Table of Contents accordingly.
    • Tests
      • Added additional unit tests for the live_scrapers module. These tests are located in tests/test_praw_scrapers/test_live_scrapers:
        • tests/test_praw_scrapers/test_live_scrapers/test_Livestream.py
        • tests/test_praw_scrapers/test_live_scrapers/test_utils/test_DisplayStream.py
        • tests/test_praw_scrapers/test_live_scrapers/test_utils/test_StreamGenerator.py
    • Repository documents
      • Added a Table of Contents for The Forest.md

    Changed

    • User interface
      • Updated the usage menu to clarify which tools may use which optional flags.
    • Source code
      • Reindexed the praw_scrapers module:
        • Moved the following files into the new static_scrapers sub-module:
          • Basic.py
          • Comments.py
          • Redditor.py
          • Subreddit.py
        • Updated absolute imports throughout the source code.
      • Moved confirm_options(), previously located in Subreddit.py to Global.py.
      • Moved PrepRedditor.prep_redditor() algorithm to its own class method PrepMutts.prep_mutts().
        • Added additional error handling to the algorithm to fix the KeyError exception mentioned in the Issue Fix or Enhancement Request section.
      • Removed Colorama's init() method from many modules - it only needs to be called once and is now located in Urs.py.
      • Updated requirements.txt.
    • README
      • The "Exporting" section is now one large section and is now located on top of the "URS Overview" section.
    • Tests
      • Updated absolute imports for existing PRAW scrapers.
      • Removed a few tests for DirInit.py since the make_directory() and make_type_directory() methods have been deprecated.

    Deprecated

    • Source code
      • Removed many methods defined in the InitializeDirectory class in DirInit.py:
        • LogMissingDir.log()
        • create()
        • make_directory()
        • make_type_directory()
        • make_analytics_directory()
          • Replaced these methods with a more versatile create_dirs() method.
    Source code(tar.gz)
    Source code(zip)
  • v3.2.1(Mar 28, 2021)

    Release date: March 28, 2021

    Summary

    • Structured comments export has been upgraded to include comments of all levels.
      • Structured comments are now the default export format. Exporting to raw format requires including the --raw flag.
    • Tons of metadata has been added to all scrapers. See the Full Changelog section for a full list of attributes that have been added.
    • Credentials.py has been deprecated in favor of .env to avoid hard-coding API credentials.
    • Added more terminal eye candy - Halo has been implemented to spice up the output.

    Full Changelog

    Added

    • User interface
      • Added Halo to spice up the output while maintaining minimalism.
    • Source code
      • Created a comment Forest and accompanying CommentNode.
        • The Forest contains methods for inserting CommentNodes, including a depth-first search algorithm to do so.
      • Subreddit.py has been refactored and submission metadata has been added to scrape files:
        • "author"
        • "created_utc"
        • "distinguished"
        • "edited"
        • "id"
        • "is_original_content"
        • "is_self"
        • "link_flair_text"
        • "locked"
        • "name"
        • "num_comments"
        • "nsfw"
        • "permalink"
        • "score"
        • "selftext"
        • "spoiler"
        • "stickied"
        • "title"
        • "upvote_ratio"
        • "url"
      • Comments.py has been refactored and submission comments now include the following metadata:
        • "author"
        • "body"
        • "body_html"
        • "created_utc"
        • "distinguished"
        • "edited"
        • "id"
        • "is_submitter"
        • "link_id"
        • "parent_id"
        • "score"
        • "stickied"
      • Major refactor for Redditor.py on top of adding additional metadata.
        • Additional Redditor information has been added to scrape files:
          • "has_verified_email"
          • "icon_img"
          • "subreddit"
          • "trophies"
        • Additional Redditor comment, submission, and multireddit metadata has been added to scrape files:
          • subreddit objects are nested within comment and submission objects and contain the following metadata:
            • "can_assign_link_flair"
            • "can_assign_user_flair"
            • "created_utc"
            • "description"
            • "description_html"
            • "display_name"
            • "id"
            • "name"
            • "nsfw"
            • "public_description"
            • "spoilers_enabled"
            • "subscribers"
            • "user_is_banned"
            • "user_is_moderator"
            • "user_is_subscriber"
          • comment objects will contain the following metadata:
            • "type"
            • "body"
            • "body_html"
            • "created_utc"
            • "distinguished"
            • "edited"
            • "id"
            • "is_submitter"
            • "link_id"
            • "parent_id"
            • "score"
            • "stickied"
            • "submission" - contains additional metadata
            • "subreddit_id"
          • submission objects will contain the following metadata:
            • "type"
            • "author"
            • "created_utc"
            • "distinguished"
            • "edited"
            • "id"
            • "is_original_content"
            • "is_self"
            • "link_flair_text"
            • "locked"
            • "name"
            • "num_comments"
            • "nsfw"
            • "permalink"
            • "score"
            • "selftext"
            • "spoiler"
            • "stickied"
            • "subreddit" - contains additional metadata
            • "title"
            • "upvote_ratio"
            • "url"
          • multireddit objects will contain the following metadata:
            • "can_edit"
            • "copied_from"
            • "created_utc"
            • "description_html"
            • "description_md"
            • "display_name"
            • "name"
            • "nsfw"
            • "subreddits"
            • "visibility"
        • interactions are now sorted in alphabetical order.
      • CLI
        • Flags
          • --raw - Export comments in raw format instead (structure format is the default)
      • Created a new .env file to store API credentials.
    • README
      • Added new bullet point for The Forest Markdown file.
    • Tests
      • Added a new test for the Status class in Global.py.
    • Repository documents
      • Added "The Forest".
        • This Markdown file is just a place where I describe how I implemented the Forest.

    Changed

    • User interface
      • Submission comments scraping parameters have changed due to the improvements made in this pull request.
        • Structured comments is now the default format.
          • Users will have to include the new --raw flag to export to raw format.
        • Both structured and raw formats can now scrape all comments from a submission.
    • Source code
      • The submission comments JSON file's structure has been modified to fit the new submission_metadata dictionary. "data" is now a dictionary that contains the submission metadata dictionary and scraped comments list. Comments are now stored in the "comments" field within "data".
      • Exporting Redditor or submission comments to CSV is now forbidden.
        • URS will ignore the --csv flag if it is present while trying to use either scraper.
      • The created_utc field for each Subreddit rule is now converted to readable time.
      • requirements.txt has been updated.
        • As of v1.20.0, numpy has dropped support for Python 3.6, which means Python 3.7+ is required for URS.
          • .travis.yml has been modified to exclude Python 3.6. Added Python 3.9 to test configuration.
          • Note: Older versions of Python can still be used by downgrading to numpy<=1.19.5.
      • Reddit object validation block has been refactored.
        • A new reusable module has been defined at the bottom of Validation.py.
      • Urs.py no longer pulls API credentials from Credentials.py as it is now deprecated.
        • Credentials are now read from the .env file.
      • Minor refactoring within Validation.py to ensure an extra Halo line is not rendered on failed credential validation.
    • README
      • Updated the Comments section to reflect new changes to comments scraper UI.
    • Repository documents
      • Updated How to Get PRAW Credentials.md to reflect new changes.
    • Tests
      • Updated CLI usage and examples tests.
      • Updated c_fname() test because submission comments scrapes now follow a different naming convention.

    Deprecated

    • User interface
      • Specifying 0 comments no longer exports all comments to raw format only; it now defaults to the structured format.
    • Source code
      • Deprecated many global variables defined in Global.py:
        • eo
        • options
        • s_t
        • analytical_tools
      • Credentials.py has been replaced with the .env file.
      • The LogError.log_login decorator has been deprecated due to the refactor within Validation.py.
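
    Since Credentials.py is gone, API credentials now live in the .env file. Below is a minimal sketch of how they might be loaded, assuming python-dotenv-style loading and hypothetical variable names (URS may expect different names):

        # Hypothetical sketch only: the variable names and the use of python-dotenv are assumptions.
        # A .env file might contain lines such as CLIENT_ID=..., CLIENT_SECRET=..., and so on.
        import os

        import praw
        from dotenv import load_dotenv  # pip install python-dotenv

        load_dotenv()  # read key=value pairs from .env into the environment

        reddit = praw.Reddit(
            client_id=os.getenv("CLIENT_ID"),
            client_secret=os.getenv("CLIENT_SECRET"),
            user_agent=os.getenv("USER_AGENT"),
            username=os.getenv("USERNAME"),
            password=os.getenv("PASSWORD"),
        )
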
    Source code(tar.gz)
    Source code(zip)
  • v3.2.0(Feb 26, 2021)

    Release date: February 25, 2021

    Summary

    • Added analytical tools
      • Word frequencies generator
      • Wordcloud generator
    • Significantly improved JSON structure
    • JSON is now the default export option; the --json flag is deprecated
    • Added numerous extra flags
    • Improved logging
    • Bug fixes
    • Code refactor

    Full Changelog

    Added

    • User Interface
      • Analytical tools (a conceptual sketch follows this list)
        • Word frequencies generator.
        • Wordcloud generator.
    • Source code
      • CLI
        • Flags
          • -e - Display additional example usage.
          • --check - Runs a quick check for PRAW credentials and displays the rate limit table after validation.
          • --rules - Include the Subreddit's rules in the scrape data (for JSON only). This data is included in the subreddit_rules field.
          • -f - Word frequencies generator.
          • -wc - Wordcloud generator.
          • --nosave - Only display the wordcloud; do not save to file.
        • Added metavar for args help message.
        • Added additional verbose feedback if invalid arguments are given.
      • Log decorators
        • Added new decorator to log individual argument errors.
        • Added new decorator to log when no Reddit objects are left to scrape after failing validation check.
        • Added new decorator to log when an invalid file is passed into the analytical tools.
        • Added new decorator to log when the scrapes directory is missing, which would cause the new make_analytics_directory() method in DirInit.py to fail.
          • This decorator is also defined in the same file to avoid a circular import error.
      • ASCII art
        • Added new art for the word frequencies and wordcloud generators.
        • Added new error art displayed when a problem arises while exporting data.
        • Added new error art displayed when Reddit object validation is completed and there are no objects left to scrape.
        • Added new error art displayed when an invalid file is passed into the analytical tools.
    • README
      • Added new Contact section and moved contact badges into it.
        • Apparently it was not obvious enough in previous versions since users did not send emails to the address specifically created for URS-related inquiries.
      • Added new sections for the analytical tools.
      • Updated demo GIFs
        • Moved all GIFs to a separate branch to avoid unnecessary clones.
        • Hosting static images on Imgur.
    • Tests
      • Added additional tests for analytical tools.
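
    Conceptually, the analytical tools boil down to counting words in scraped text and rendering those counts as a wordcloud. The snippet below is a minimal sketch of both ideas using collections.Counter and the wordcloud package; it is illustrative only and does not mirror URS's internals (the file name and JSON layout are assumptions).

        # Minimal conceptual sketch of the word frequencies and wordcloud tools.
        import json
        from collections import Counter

        from wordcloud import WordCloud  # pip install wordcloud

        # Hypothetical Subreddit scrape file; "data" holds a list of submission objects.
        with open("scrape.json", "r", encoding="utf-8") as f:
            scrape = json.load(f)

        # Collect words from submission titles (URS can also use bodies and comments).
        words = []
        for submission in scrape.get("data", []):
            words.extend(submission.get("title", "").lower().split())

        frequencies = Counter(words)  # word -> frequency
        print(frequencies.most_common(10))

        # Render the frequencies as a wordcloud and save it as a PNG.
        cloud = WordCloud(width=800, height=600).generate_from_frequencies(frequencies)
        cloud.to_file("wordcloud.png")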

    Changed

    • User interface
      • JSON is now the default export option. The --csv flag is required to export to CSV instead.
      • Improved JSON structure (see the sketch after this list).
        • PRAW scraping export structure:
          • Scrape details are now included at the top of each exported file in the scrape_details field.
            • Subreddit scrapes - Includes subreddit, category, n_results_or_keywords, and time_filter.
            • Redditor scrapes - Includes redditor and n_results.
            • Submission comments scrapes - Includes submission_title, n_results, and submission_url.
          • Scrape data is now stored in the data field.
            • Subreddit scrapes - data is a list containing submission objects.
            • Redditor scrapes - data is an object containing additional nested dictionaries:
              • information - a dictionary denoting Redditor metadata,
              • interactions - a dictionary denoting Redditor interactions (submissions and/or comments). Each interaction follows the Subreddit scrapes structure.
            • Submission comments scrapes - data is a list containing additional nested dictionaries.
              • Raw comments contain dictionaries of comment_id: SUBMISSION_METADATA.
              • Structured comments follow the structure seen in raw comments, but include an extra replies field in the submission metadata, holding a list of additional nested dictionaries of comment_id: SUBMISSION_METADATA. This pattern repeats down to third-level replies.
        • Word frequencies export structure:
          • The original scrape data filepath is included in the raw_file field.
          • data is a dictionary containing word: frequency.
      • Log:
        • scrapes.log is now named urs.log.
        • Validation of Reddit objects is now included - invalid Reddit objects will be logged as a warning.
        • Rate limit information is now included in the log.
    • Source code
      • Moved the PRAW scrapers into their own package.
      • Subreddit scraper's "edited" field is now either a boolean (if the post was not edited) or a string (if it was).
        • Previous iterations did not distinguish the different types and would solely return a string.
      • Scrape settings for the basic Subreddit scraper are now cleaned within Basic.py, further streamlining conditionals in Subreddit.py and Export.py.
      • Returning the final scrape settings dictionary from all scrapers after execution for logging purposes, further streamlining the LogPRAWScraper class in Logger.py.
      • Passing the submission URL instead of the exception into the not_found list for submission comments scraping.
        • This is a part of a bug fix that is listed in the Fixed section.
      • ASCII art:
        • Modified the args error art to display specific feedback when invalid arguments are passed.
      • Upgraded from relative to absolute imports.
      • Replaced old header comments with docstring comment block.
      • Upgraded method comments to Numpy/Scipy docstring format.
    • README
      • Moved Releases section into its own document.
      • Deleted all media from master branch.
    • Tests
      • Updated absolute imports to match new directory structure.
      • Updated a few tests to match new changes made in the source code.
    • Community documents
      • Updated PULL_REQUEST_TEMPLATE:
        • Updated section for listing changes that have been made to match new Releases syntax.
        • Wrapped New Dependencies in a code block.
      • Updated STYLE_GUIDE:
        • Created new rules for method comments.
      • Added Releases:
        • Moved Releases section from main README to a separate document.
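
    Putting the structure changes above together, a Subreddit scrape and a word frequencies file now look roughly like the sketch below. The field names come from these notes; the values and the raw_file path are placeholders.

        # Rough shape of a Subreddit scrape export (v3.2.0+).
        subreddit_scrape = {
            "scrape_details": {
                "subreddit": "askreddit",
                "category": "top",
                "n_results_or_keywords": "10",
                "time_filter": "all",
            },
            "data": [
                # list of submission objects
            ],
        }

        # Rough shape of a word frequencies export.
        word_frequencies = {
            "raw_file": "scrapes/02-25-2021/subreddits/placeholder.json",  # hypothetical path
            "data": {
                "placeholder_word": 42,  # word -> frequency
            },
        }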

    Fixed

    • Source code
      • PRAW scraper settings
        • Bug: Invalid Reddit objects (Subreddits, Redditors, or submissions) and their respective scrape settings would be added to the scrape settings dictionary even after failing validation.
        • Behavior: URS would try to scrape invalid Reddit objects, then throw an error mid-scrape because it was unable to pull data via PRAW.
        • Fix: Returning the invalid objects list from each scraper into GetPRAWScrapeSettings.get_settings() to circumvent this issue.
      • Basic Subreddit scraper
        • Bug: The time filter all would be applied to categories that do not support time filter use, resulting in errors while scraping.
        • Behavior: URS would throw an error when trying to export the file, resulting in a failed run.
        • Fix: Added a conditional that checks whether the category supports a time filter and applies either the all time filter or None accordingly (sketched below).
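
    Here is a minimal sketch of the conditional described in the basic Subreddit scraper fix. The helper below is hypothetical and not URS's actual code; the set of categories that accept a time filter follows PRAW's API (controversial, top, and search).

        # Hypothetical helper illustrating the time filter fix described above.
        from typing import Optional

        TIME_FILTER_CATEGORIES = {"controversial", "top", "search"}

        def resolve_time_filter(category: str) -> Optional[str]:
            """Return the "all" time filter only for categories that support one."""
            return "all" if category.lower() in TIME_FILTER_CATEGORIES else None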

    Deprecated

    • User interface
      • Removed the --json flag since it is now the default export option.
    Source code(tar.gz)
    Source code(zip)
  • v3.1.2(Feb 6, 2021)

    Release date: February 05, 2021

    Scrapes will now be exported to scrape-defined directories within the date directory.

    New in 3.1.2

    • URS will create sub-directories within the date directory based on the scraper.
      • Exported files will now be stored in the subreddits, redditors, or comments directories.
        • These directories are only created if the corresponding scraper is run. For example, the redditors directory will not be created if you never run the Redditor scraper.
      • Removed the first character used in exported filenames to distinguish scrape type in previous iterations of URS.
        • This is no longer necessary due to the new sub-directory creation.
    • The forbidden access message that may appear when running the Redditor scraper was originally red. Changed the color from red to yellow to avoid confusion.
    • Fixed a file-naming bug that would omit the scrape type if the filename length exceeded 50 characters.
    • Updated README
      • Updated demo GIFs
      • Added new directory structure visual generated by the tree command.
      • Created new section headers to improve navigation.
    • Minor code reformatting/refactoring.
      • Updated STYLE_GUIDE to reflect new changes and made a minor change to the PRAW API walkthrough.
    Source code(tar.gz)
    Source code(zip)
  • v3.1.1(Jun 28, 2020)

    Release date: June 27, 2020

    Fulfilled a user enhancement request by adding a Subreddit time filter option.

    New in 3.1.1:

    • Users will now be able to specify a time filter for the Subreddit categories Controversial, Search, and Top (a PRAW usage sketch follows this list).
    • The valid time filters are:
      • all
      • day
      • hour
      • month
      • week
      • year
    • Updated CLI unit tests to match new changes to how Subreddit args are parsed.
    • Updated community documents located in the .github/ directory: STYLE_GUIDE and PULL_REQUEST_TEMPLATE.
    • Updated README to reflect new changes.
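
    For context, these time filters map directly onto PRAW's time_filter parameter. Below is a minimal PRAW sketch; the credential values are placeholders.

        # Minimal PRAW sketch showing a Subreddit time filter in use.
        import praw

        reddit = praw.Reddit(
            client_id="CLIENT_ID",          # placeholder
            client_secret="CLIENT_SECRET",  # placeholder
            user_agent="URS time filter example",
        )

        # Categories such as top, controversial, and search accept a time filter.
        for submission in reddit.subreddit("askreddit").top(time_filter="week", limit=5):
            print(submission.title)
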
    Source code(tar.gz)
    Source code(zip)
  • v3.1.0(Jun 22, 2020)

    Release date: June 22, 2020

    Major code refactor. Applied OOP concepts to the existing code and rewrote methods in an attempt to improve readability, maintainability, and scalability.

    New in 3.1.0:

    • Scrapes will now be exported to the scrapes/ directory within a subdirectory corresponding to the date of the scrape. These directories are automatically created for you when you run URS.
    • Added log decorators that record what is happening during each scrape, which scrapes were run, and any errors that might arise during runtime in the log file scrapes.log. The log is stored in the same subdirectory corresponding to the date of the scrape (a minimal decorator sketch follows this list).
    • Replaced bulky titles with minimalist titles for a cleaner look.
    • Added color to terminal output.
    • Improved naming convention for scripts.
    • Integrated Travis CI and Codecov.
    • Updated community documents located in the .github/ directory: BUG_REPORT, CONTRIBUTING, FEATURE_REQUEST, PULL_REQUEST_TEMPLATE, and STYLE_GUIDE.
    • Numerous changes to README. The most significant change was splitting and storing walkthroughs in docs/.
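
    The log decorators mentioned above wrap each scraper call and record what happened in the log file. Below is a simplified, generic sketch of that pattern; it is illustrative only and does not mirror URS's actual decorators.

        # Simplified sketch of a scraper-logging decorator.
        import functools
        import logging

        logging.basicConfig(filename="scrapes.log", level=logging.INFO)

        def log_scraper(scraper_name):
            """Log when a scraper runs and log any error it raises."""
            def decorator(func):
                @functools.wraps(func)
                def wrapper(*args, **kwargs):
                    logging.info("Running the %s scraper.", scraper_name)
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        logging.exception("The %s scraper failed.", scraper_name)
                        raise
                return wrapper
            return decorator

        @log_scraper("Subreddit")
        def run_subreddit_scraper():
            ...  # scraping logic would go here
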
    Source code(tar.gz)
    Source code(zip)
  • v3.0(Jan 22, 2020)

  • v2.0(Jan 22, 2020)

  • v1.0(Jan 22, 2020)
