Web-Scraper-for-a-news-website

This is a web scraper for a specific website, the Economic Times. It is tuned to extract that site's headlines, but with minor adjustments it can extract any other part of the website.

Installation

Install the following:

Selenium: Follow the instructions at https://selenium-python.readthedocs.io/installation.html to install Selenium.
Chromedriver: Check your Chrome browser's version (Menu -> Help -> About Google Chrome) and download the matching Chromedriver from https://sites.google.com/chromium.org/driver/home
TQDM: https://pypi.org/project/tqdm/
BeautifulSoup4: https://pypi.org/project/beautifulsoup4/
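
Once the packages are installed, a quick check like the one below can confirm that they are importable. This is only a sketch: the helper name `missing_packages` is made up for illustration, and it relies on the fact that BeautifulSoup4 is imported as `bs4`. It does not check Chromedriver, which is a separate binary.

```python
from importlib.util import find_spec

# Import names of the three Python packages listed above.
# Note: the beautifulsoup4 distribution is imported as "bs4".
REQUIRED = ["selenium", "tqdm", "bs4"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported missing, `pip install selenium tqdm beautifulsoup4` installs all three from PyPI.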

Using the webscraper

The order in which these scripts are run matters. Run them in the following sequence:

ET_Archive_Links.py: Run this script first, as it is the source of everything that follows. It collects the initial links from the Archive page of the website.
ET_All_Links_Inside_Archive.py: This script takes the output (a CSV file) of the previous script and produces a new file containing the URLs of all the archived news on the website since 2002.
ET_Content.py: Finally, this script scrapes the headlines along with their dates. (If you want to scrape any other part of the website, this is the script to edit.)
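
The sequence above could be automated with a small runner like the following. This is a sketch under assumptions, not part of the project: it assumes the three scripts sit in the current directory and take no command-line arguments, and the helper name `run_in_order` is hypothetical.

```python
import subprocess
import sys

# The three scripts, in the order described above.
SCRIPTS = [
    "ET_Archive_Links.py",
    "ET_All_Links_Inside_Archive.py",
    "ET_Content.py",
]


def run_in_order(scripts, run=subprocess.run):
    """Run each script with the current interpreter, stopping at the first failure.

    Assumption: every script can be started as `python <script>` with no
    arguments. The `run` parameter exists only so the function can be
    exercised without launching real processes.
    """
    completed = []
    for script in scripts:
        result = run([sys.executable, script])
        if result.returncode != 0:
            # A later script depends on earlier output, so do not continue.
            raise RuntimeError(f"{script} exited with code {result.returncode}")
        completed.append(script)
    return completed
```

Calling `run_in_order(SCRIPTS)` from the project directory would then run the whole pipeline and stop as soon as one step fails, so a later script never runs without the earlier scripts having completed.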

Dataset

I also used the scraper on another news website, "Businessline". Its dataset is available on Kaggle (https://www.kaggle.com/rsiyanwal/20182019-businessline-headlines).

Owner

Rahul Siyanwal
