This is a webscraper for a specific website

Last update: Dec 13, 2021

Overview

Web-Scraper-for-a-news-website

This is a web scraper for a specific website, the Economic Times. It is tuned to extract that site's headlines, but with minor adjustments it can extract any other part of the site.
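To illustrate what "minor adjustments" could look like, here is a minimal sketch using BeautifulSoup, one of the libraries listed under Installation. It is not the code from this repository: the function name `extract_text`, the sample markup, and the CSS selectors are all made up for illustration, and the real pages will need selectors that match their actual structure.

```python
from bs4 import BeautifulSoup


def extract_text(html, selector):
    """Return the stripped text of every element matching a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select(selector)]


# Made-up markup for illustration; the real page structure will differ.
SAMPLE_HTML = """
<ul class="headlines">
  <li><a href="/a">Markets close higher</a></li>
  <li><a href="/b">Rupee steady against dollar</a></li>
</ul>
<p class="date">13 Dec 2021</p>
"""

# Changing only the selector changes which part of the page is extracted.
headlines = extract_text(SAMPLE_HTML, "ul.headlines a")
dates = extract_text(SAMPLE_HTML, "p.date")
```

Under this assumption, pointing the scraper at a different part of a page comes down to swapping the selector string.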

Installation

Install the following:

Selenium: Follow the instructions at https://selenium-python.readthedocs.io/installation.html to install Selenium.
Chromedriver: Check your Chrome browser's version (Menu -> Help -> About Google Chrome) and download the matching Chromedriver from https://sites.google.com/chromium.org/driver/home
TQDM: https://pypi.org/project/tqdm/
BeautifulSoup4: https://pypi.org/project/beautifulsoup4/
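A quick way to confirm that the Python packages above are installed is sketched below. The helper name `missing_modules` is hypothetical, and the check covers only the Python packages, not Chromedriver. Note that BeautifulSoup4 is imported under the name `bs4`.

```python
from importlib.util import find_spec

# Import names of the packages listed above (BeautifulSoup4 is imported as "bs4").
REQUIRED_MODULES = ["selenium", "tqdm", "bs4"]


def missing_modules(modules):
    """Return the module names from `modules` that cannot be found."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```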

Using the webscraper

The scripts must be run in a specific order. Please follow the sequence below:

ET_Archive_Links.py: Run this script first, as it is the source of everything that follows. It collects the initial links from the Archive page of the website.
ET_All_Links_Inside_Archive.py: This script takes the output (a CSV file) of the previous script and produces a new file which contains the URLs of all the archived news on the website since 2002.
ET_Content.py: Finally, this script scrapes the headlines along with their dates. (If you want to scrape any other part of the website, this is the script to edit.)
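The sequence above can also be enforced with a small runner. This is only a sketch and not part of the repository: it assumes the three scripts sit in the current directory and take no command-line arguments, and the helper names `build_commands` and `run_pipeline` are made up for illustration.

```python
import subprocess
import sys

# Script names from the list above, in the order they must run.
PIPELINE = [
    "ET_Archive_Links.py",
    "ET_All_Links_Inside_Archive.py",
    "ET_Content.py",
]


def build_commands(scripts, python=sys.executable):
    """Return one command (argument list) per script, preserving order."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE):
    """Run each script in turn; stop at the first one that fails."""
    for command in build_commands(scripts):
        print("Running", command[1])
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later script never runs on missing or partial output.
        subprocess.run(command, check=True)
```

Calling `run_pipeline()` from the repository directory would then run the three steps in order and stop at the first failure.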

Dataset

I also used the scraper on another news website, "Businessline". Its dataset is available on Kaggle (https://www.kaggle.com/rsiyanwal/20182019-businessline-headlines).

Owner

Rahul Siyanwal
