Web-scraping - A bot using Python with BeautifulSoup that scrapes the IRS website by form number and returns the results as JSON

Last update: Jan 04, 2022

Related tags

Web Crawling, Web-scraping

Overview

Extract Data from the IRS website

A bot written in Python with BeautifulSoup that scrapes the IRS website (prior form publication) by form number and returns the results as JSON. It can also download the PDFs over a range of years.
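As a rough illustration of searching by form number, the sketch below builds a search URL for one form. The base URL, the query parameter names, and the function name build_search_url are assumptions made for this example and are not taken from Script.py; check them against the live IRS page before relying on them.

```python
from urllib.parse import urlencode

# Assumed location of the IRS prior-year forms search page (not confirmed by the script).
BASE_URL = "https://apps.irs.gov/app/picklist/list/priorFormPublication.html"


def build_search_url(form_number: str, first_row: int = 0, per_page: int = 200) -> str:
    """Return a search URL for one form number (hypothetical helper).

    The parameter names below are assumptions about the search form.
    """
    params = {
        "value": form_number,         # text typed into the search box
        "criteria": "formNumber",     # search by form number rather than title
        "indexOfFirstRow": first_row,  # offset used for paging through results
        "resultsPerPage": per_page,
    }
    # urlencode escapes spaces as '+', so "Form W-2" becomes "Form+W-2".
    return f"{BASE_URL}?{urlencode(params)}"
```

Printing the returned URL before each request is one simple way to show the user which page is currently being searched.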

How to run the script?

This script runs on Python 3.8. Install the libraries listed in requirements.txt into a new environment, then run 'Script.py'.

What should I expect?

The script asks for the form number(s), then scrapes the IRS website.
--> Please enter the complete tax form number separated by a comma followed by a space (not case sensitive): (ie. Form W-2, Form 1095-C, Form W-3, etc)
--> Form W-2, Form 1095-C
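A minimal sketch of how such an answer could be split into individual form numbers and compared without regard to case. The helper names parse_form_numbers and same_form are hypothetical; the actual script may handle the input differently.

```python
import re
from typing import List


def parse_form_numbers(raw: str) -> List[str]:
    """Split a comma-separated answer into cleaned, de-duplicated form numbers."""
    seen = set()
    forms = []
    for part in raw.split(","):
        # Collapse repeated whitespace and trim the ends.
        name = re.sub(r"\s+", " ", part).strip()
        if not name:
            continue  # ignore empty entries such as a trailing comma
        key = name.casefold()  # case-insensitive key for duplicate detection
        if key not in seen:
            seen.add(key)
            forms.append(name)
    return forms


def same_form(a: str, b: str) -> bool:
    """Compare two form numbers ignoring case, e.g. 'form w-2' and 'Form W-2'."""
    return a.casefold() == b.casefold()
```

With the example answer above, parse_form_numbers("Form W-2, Form 1095-C") yields the two entries "Form W-2" and "Form 1095-C".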

Then the bot asks whether the user would like to download the forms.
--> Would you like to download all related pdfs? (Y/N)

If the user answers yes, the bot follows up by asking for a year range.
--> Please provide the year range by using a dash in between the years (starting year must be smaller than ending year): (ie. 2018-2020)
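The year-range answer could be validated along these lines. This is a sketch only: parse_year_range is a hypothetical name, and the error messages are illustrative rather than the script's own.

```python
from typing import Tuple


def parse_year_range(raw: str) -> Tuple[int, int]:
    """Parse an answer such as '2018-2020' into (start_year, end_year)."""
    parts = raw.strip().split("-")
    if len(parts) != 2:
        raise ValueError("expected two years separated by a dash, e.g. 2018-2020")
    try:
        start, end = (int(p.strip()) for p in parts)
    except ValueError:
        raise ValueError("years must be whole numbers, e.g. 2018-2020") from None
    # The prompt requires the starting year to be smaller than the ending year.
    if start >= end:
        raise ValueError("starting year must be smaller than ending year")
    return start, end
```

Raising ValueError lets the caller repeat the prompt until a valid range is entered.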

Once executed, the bot automatically creates a folder and downloads the relevant PDFs into it.
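One way to organize this step is to select the rows that fall inside the year range, create the folder, and work out a target path for each PDF before fetching anything. The sketch below assumes each scraped row is a dict with the keys form_number, year, and pdf_url, that the range is inclusive, and that files are named "<form number> - <year>.pdf"; all of these are illustrative assumptions, not the script's documented behavior.

```python
from pathlib import Path
from typing import Dict, List, Tuple


def plan_downloads(rows: List[Dict[str, str]], start: int, end: int,
                   root: str) -> List[Tuple[str, Path]]:
    """Return (url, destination) pairs for rows whose year is in [start, end]."""
    planned = []
    for row in rows:
        year = int(row["year"])
        if not start <= year <= end:
            continue  # outside the requested range (assumed inclusive)
        # One folder per form number, created on demand.
        folder = Path(root) / row["form_number"]
        folder.mkdir(parents=True, exist_ok=True)
        destination = folder / f"{row['form_number']} - {year}.pdf"
        planned.append((row["pdf_url"], destination))
    return planned
```

Each pair can then be fetched with, for example, urllib.request.urlretrieve(url, destination) from the standard library.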

Finally, the results are returned as a JSON string. If there are no results, the user gets 'No results' instead.

Sample output:

[
  {'form_number': 'Form W-2', 'form_title': 'Wage and Tax Statement (Info Copy Only)', 'min_year': '1954', 'max_year': '2022'},
  {'form_number': 'Form 1095-C', 'form_title': 'Employer-Provided Health Insurance Offer and Coverage', 'min_year': '2014', 'max_year': '2022'},
  {'form_number': 'Form W-3', 'form_title': 'Transmittal of Wage and Tax Statements (Info Copy Only)', 'min_year': '1990', 'max_year': '2022'}
]
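Output of this shape could be produced by collapsing the scraped rows (one per form and year) into a single entry per form with its earliest and latest year. The sketch below assumes rows are dicts with the keys form_number, form_title, and year, and takes the title from the first row seen for each form; the function name summarize and these details are assumptions, not the script's actual code.

```python
import json
from typing import Dict, List


def summarize(rows: List[Dict[str, str]]) -> str:
    """Collapse per-year rows into one JSON entry per form, or 'No results'."""
    summary: Dict[str, dict] = {}
    for row in rows:
        key = row["form_number"]
        year = int(row["year"])
        # The first row for a form seeds its entry; later rows widen the range.
        entry = summary.setdefault(key, {
            "form_number": key,
            "form_title": row["form_title"],
            "min_year": year,
            "max_year": year,
        })
        entry["min_year"] = min(entry["min_year"], year)
        entry["max_year"] = max(entry["max_year"], year)
    if not summary:
        return "No results"
    # Years are emitted as strings to match the sample output.
    results = [
        {**entry, "min_year": str(entry["min_year"]), "max_year": str(entry["max_year"])}
        for entry in summary.values()
    ]
    return json.dumps(results, indent=2)
```

Note that json.dumps emits double-quoted strings, whereas the sample shows single quotes, which is how Python prints a list of dicts directly.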

Note: To keep users informed, the bot displays which task it is performing and which URL it is currently searching.


Universal Reddit Scraper - A comprehensive Reddit scraping command-line tool written in Python.

This is a script that scrapes the longitude and latitude on food.grab.com

A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

Command line program to download documents from web portals.

CreamySoup - a helper script for automated SourceMod plugin updates management.

JD.com Moutai flash-purchase script

12306 ticket-grabbing script

A simple code to fetch comments below an Instagram post and save them to a csv file

A LeetCode scraper that compiles all questions in the LeetCode free tier into a text file. A PDF is also available.

UsernameScraperTool - Username Scraper Tool With Python

Transistor, a Python web scraping framework for intelligent use cases.

CRI Scrape is a tool for getting general info about the Italian Red Cross on the GAIA Platform

A simple python script to fetch the latest covid info

A crawler that gets new products with the maximum discount on the Banimode website

An application that, given a URL, crawls a web page, gets all of its words, then sorts and counts them.

Grab the changelog from releases on GitHub

a small library for extracting rich content from urls

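Grabbing Moutai on JD.com, with many successful flash-sale purchases; discussion, Tmall flash purchasing, money-making exchange, etc.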

simple http & https proxy scraper and checker

script to scrape direct download links (ddls) from google drive index.