If Google News had a Python library

Overview

pygooglenews

Created by Artem from newscatcherapi.com. You do not need anything from us or from anyone else to get the software running; it works out of the box.

My blog post about how I did it

Demo

You might also like to check our Google News API or Financial Google News API

About

A Python wrapper of the Google News RSS feed.

It covers top stories, topic-related news feeds, geolocation news feeds, and an extensive full-text search feed.

This work is more of a collection of everything I could find out about how Google News works.

How is it different from other Pythonic Google News libraries?

  1. URL-escaping user input helper for the search function
  2. Extensive support for the search function that makes it simple to use:
    • exact match
    • in title match, in url match, etc
    • search by date range (from_ & to_), latest published (when)
  3. Parsing of the sub articles. In almost every feed except the search one, each article comes with a subset of similar news. This package takes care of extracting those sub articles. This feature can be very useful for ML tasks where you need to collect a dataset of similar article headlines

Examples of Use Cases

  1. Integrating a news feed into your platform/application/website
  2. Collecting data by topic to train your own ML model
  3. Searching for the latest mentions of your new product
  4. Media monitoring of people/organizations — PR

Working with Google News in Production

Before we start: if you want to integrate Google News data into production, I advise you to use one of the 3 methods described below. Why? Because you do not want your server's IP address to be blocked by Google. Every call to any function in this package sends an HTTPS request to Google's servers. Don't get me wrong, this Python package still works out of the box.

  1. NewsCatcher's Google News API: all the code is written for you, with clean & structured JSON output. Low price. You can test it yourself with no credit card. Plus, a financial version of the API is also available.
  2. ScrapingBee API, which handles proxy rotation for you. Each function in this package has a scraping_bee parameter where you paste your API key. You can also try it for free, no credit card required. See example
  3. Your own proxy: already have a pool of proxies? Each function in this package has a proxies parameter (a Python dictionary) where you paste your own proxies.

Motivation

I love working with news data. I love it so much that I created my own company that crawls hundreds of thousands of news articles and allows you to search them via a news API. But this time, I want to share with the community a Python package that makes it simple to get news data from the best search engine ever created: Google.

Most likely, you know already that Google has its own news service. It is different from the usual Google search that we use on a daily basis (sorry DuckDuckGo, maybe next time).

This package uses the RSS feeds of Google News, for example the feed behind the top stories page.

An RSS feed is an XML document that is already well structured. I rely heavily on the Feedparser package to parse it.

Google News used to have an API, but it was deprecated many years ago. (Unofficial) information about the RSS syntax is scattered across the web. There is no official documentation. So, I tried my best to collect all this information in one place.

Installation

$ pip install pygooglenews --upgrade

Quickstart

from pygooglenews import GoogleNews

gn = GoogleNews()

Top Stories

top = gn.top_news()

Stories by Topic

business = gn.topic_headlines('business')

Geolocation Specific Stories

headquarters = gn.geo_headlines('San Fran')

Stories by a Query Search

# search for the best matching articles that mention MSFT and
# do not mention AAPL (over the past 6 months)
search = gn.search('MSFT -AAPL', when = '6m')

Documentation - Functions & Classes

GoogleNews Class

from pygooglenews import GoogleNews
# default GoogleNews instance
gn = GoogleNews(lang = 'en', country = 'US')

To get access to all the functions, you first have to instantiate the GoogleNews class.

It takes 2 parameters: lang and country. As the default instance above shows, they default to 'en' and 'US'.

You can try any combination of the two, but not every combination exists. Only the combinations supported by Google News will work. Check the official Google News page to see what is covered:

At the bottom left of the Google News page there is a Language & region section that lists all of the supported combinations.

For example, for country=UA (Ukraine), there are 2 languages supported:

  • lang=uk Ukrainian
  • lang=ru Russian
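
As a rough sketch of how a language and country pair could end up in a feed URL: the COVID-19 topic URL quoted in this README carries hl, gl and ceid query parameters. The helper below, locale_params, is a hypothetical name and is not part of pygooglenews. It only shows how such a query string can be built with the standard library; the exact layout the package uses internally is an assumption.

```python
from urllib.parse import urlencode


def locale_params(lang="en", country="US"):
    """Build a Google News style query string for a language/country pair.

    Hypothetical helper for illustration only. The parameter names
    (hl, gl, ceid) are inferred from Google News URLs, not taken from
    the pygooglenews source.
    """
    return urlencode({"hl": lang, "gl": country, "ceid": f"{country}:{lang}"})


print(locale_params("uk", "UA"))  # hl=uk&gl=UA&ceid=UA%3Auk
```

Note that the colon in ceid is URL-escaped to %3A, which matches the ceid=US%3Aen form seen in Google News URLs.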

Top Stories

top = gn.top_news(proxies=None, scraping_bee = None)

top_news() returns the top stories for the country and language defined in the GoogleNews instance. The returned object contains feed (FeedParserDict) and entries, a list of the articles found with all data parsed.


Stories by Topic

business = gn.topic_headlines('BUSINESS', proxies=None, scraping_bee = None)

The returned object contains feed (FeedParserDict) and entries, a list of the articles found with all data parsed.

Accepted topics are:

  • WORLD
  • NATION
  • BUSINESS
  • TECHNOLOGY
  • ENTERTAINMENT
  • SCIENCE
  • SPORTS
  • HEALTH

However, Google News supports other topics as well.

For example, if you search for corona in the search tab of en + US, you will find COVID-19 as a topic.

The URL looks like this: https://news.google.com/topics/CAAqIggKIhxDQkFTRHdvSkwyMHZNREZqY0hsNUVnSmxiaWdBUAE?hl=en-US&gl=US&ceid=US%3Aen

Copy the text after topics/ and before ?, then use it as the input for the topic_headlines() function.

from pygooglenews import GoogleNews

gn = GoogleNews()
covid = gn.topic_headlines('CAAqIggKIhxDQkFTRHdvSkwyMHZNREZqY0hsNUVnSmxiaWdBUAE')

However, be aware that this topic ID is specific to each language/country combination.
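
The copy-and-paste step can also be done in code. This is a minimal sketch using only the standard library; topic_id_from_url is a hypothetical helper name, not a function of pygooglenews.

```python
from urllib.parse import urlparse


def topic_id_from_url(url):
    """Return the path segment that follows 'topics/' in a Google News URL."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    try:
        position = segments.index("topics")
        return segments[position + 1]
    except (ValueError, IndexError):
        # Either there is no 'topics' segment or nothing comes after it.
        raise ValueError(f"no topic ID found in {url!r}") from None


url = (
    "https://news.google.com/topics/"
    "CAAqIggKIhxDQkFTRHdvSkwyMHZNREZqY0hsNUVnSmxiaWdBUAE"
    "?hl=en-US&gl=US&ceid=US%3Aen"
)
print(topic_id_from_url(url))
# CAAqIggKIhxDQkFTRHdvSkwyMHZNREZqY0hsNUVnSmxiaWdBUAE
```

The returned string can then be passed to gn.topic_headlines().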


Stories by Geolocation

gn = GoogleNews('uk', 'UA')
kyiv = gn.geo_headlines('kyiv', proxies=None, scraping_bee = None)
# or 
kyiv = gn.geo_headlines('kiev', proxies=None, scraping_bee = None)
# or
kyiv = gn.geo_headlines('киев', proxies=None, scraping_bee = None)
# or
kyiv = gn.geo_headlines('Київ', proxies=None, scraping_bee = None)

The returned object contains feed (FeedParserDict) and entries, a list of the articles found with all data parsed.

All of the above variations will return the same feed of the latest news about Kyiv, Ukraine:

kyiv['feed'].title

# 'Київ - Останні - Google Новини'

The lookup is language agnostic, but there is no guarantee that a feed exists for any specific place. For example, to find the feed for LA or Los Angeles, use GoogleNews('en', 'US').

The main (en, US) Google News client will most likely find feeds for the largest number of places.


Stories by a Query

gn.search(query: str, helper = True, when = None, from_ = None, to_ = None, proxies=None, scraping_bee=None)

The returned object contains feed (FeedParserDict) and entries, a list of the articles found with all data parsed.

Google News search itself is a complex function that has inherited some features from the standard Google Search.

The official reference on what could be inserted

The biggest obstacle you might face is URL-escaping the input. To ease this process, helper = True is turned on by default.

helper uses urllib.parse.quote_plus to automatically convert the input.

For example:

  • 'New York metro opening' --> 'New+York+metro+opening'
  • 'AAPL -MSFT' --> 'AAPL+-MSFT'
  • '"Tokyo Olympics date changes"' --> '%22Tokyo+Olympics+date+changes%22'

If you need to write your own pre-escaped query, turn it off with helper = False.
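
To see what the escaping does without calling Google at all, you can run urllib.parse.quote_plus directly. The queries below are only sample inputs chosen for illustration.

```python
from urllib.parse import quote_plus

# Sample queries: plain words, an exclusion, an inclusion and a phrase.
queries = [
    "New York metro opening",
    "AAPL -MSFT",
    "boeing +airbus",
    '"boeing 737"',
]

escaped = {query: quote_plus(query) for query in queries}

for raw, encoded in escaped.items():
    print(f"{raw!r} --> {encoded!r}")
```

Spaces become +, a literal plus sign becomes %2B and a quotation mark becomes %22, while the minus sign is left as is.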

The when parameter (str) sets the time range for the published datetime. I could not find any documentation for this option, but here is what I deduced:

  • h for hours (for me, it worked for up to 101h). when=12h will return only the articles that match the search criteria and were published in the last 12 hours
  • d for days
  • m for months (for me, it worked for up to 48m)

I did not set any hard limit here. You may try anything, and it will probably work. However, be warned that invalid inputs will not raise an error. Instead, Google will ignore the when parameter.

from_ and to_ accept dates in the format %Y-%m-%d, for example 2020-07-01.
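
Because a malformed when value is ignored rather than rejected, it can help to check the inputs yourself before calling search(). The sketch below is not part of pygooglenews. The helper names are hypothetical, and the when pattern only covers the h, d and m suffixes described above.

```python
import re
from datetime import datetime

# A positive integer followed by h (hours), d (days) or m (months).
WHEN_PATTERN = re.compile(r"\d+[hdm]")


def is_valid_when(value):
    """Return True if value looks like '12h', '7d' or '6m'."""
    return WHEN_PATTERN.fullmatch(value) is not None


def is_valid_date(value):
    """Return True if value parses as a %Y-%m-%d date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(is_valid_when("12h"), is_valid_when("12 hours"))
print(is_valid_date("2020-07-01"), is_valid_date("07/01/2020"))
```

These checks only validate the shape of the input. They cannot tell you whether Google will honour a very large range such as 500h.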


Google's Special Query Terms Cheat Sheet

Many of Google's Special Query Terms have been tested one by one. Most of the core ones have been inherited by the Google News service. At first, I wanted to integrate all of them as search() function parameters. But I realised that it might be a bit confusing and difficult to make them all work correctly.

Instead, I decided to write a cheat sheet that should give you a decent understanding of what you can do.

  • Boolean OR Search [ OR ]

from pygooglenews import GoogleNews

gn = GoogleNews()

s = gn.search('boeing OR airbus')

print(s['feed'].title)
# "boeing OR airbus" - Google News

  • Exclude Query Term [-]

"The exclude (-) query term restricts results for a particular search request to documents that do not contain a particular word or phrase. To use the exclude query term, you would preface the word or phrase to be excluded from the matching documents with "-" (a minus sign)."

  • Include Query Term [+]

"The include (+) query term specifies that a word or phrase must occur in all documents included in the search results. To use the include query term, you would preface the word or phrase that must be included in all search results with "+" (a plus sign).

The URL-escaped version of + (a plus sign) is %2B."

  • Phrase Search

"The phrase search (") query term allows you to search for complete phrases by enclosing the phrases in quotation marks or by connecting them with hyphens.

The URL-escaped version of " (a quotation mark) is %22.

Phrase searches are particularly useful if you are searching for famous quotes or proper names."

  • allintext

"The allintext: query term requires each document in the search results to contain all of the words in the search query in the body of the document. The query should be formatted as allintext: followed by the words in your search query.

If your search query includes the allintext: query term, Google will only check the body text of documents for the words in your search query, ignoring links in those documents, document titles and document URLs."

  • intitle

"The intitle: query term restricts search results to documents that contain a particular word in the document title. The search query should be formatted as intitle:WORD with no space between the intitle: query term and the following word."

  • allintitle

"The allintitle: query term restricts search results to documents that contain all of the query words in the document title. To use the allintitle: query term, include "allintitle:" at the start of your search query.

Note: Putting allintitle: at the beginning of a search query is equivalent to putting intitle: in front of each word in the search query."

  • inurl

"The inurl: query term restricts search results to documents that contain a particular word in the document URL. The search query should be formatted as inurl:WORD with no space between the inurl: query term and the following word"

  • allinurl

"The allinurl: query term restricts search results to documents that contain all of the query words in the document URL. To use the allinurl: query term, include allinurl: at the start of your search query."
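
Since these terms are plain text inside the query string, they can be combined before the query is handed to search(). The sketch below uses a hypothetical build_query helper (not part of pygooglenews) to assemble a phrase, excluded words and an intitle: term. How well Google News honours a given combination is something you have to test yourself.

```python
def build_query(words=(), exclude=(), phrase=None, in_title=None):
    """Assemble a Google News query string from a few special query terms.

    Hypothetical helper for illustration; it only joins strings and
    does no URL-escaping, which helper=True handles in search().
    """
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')          # phrase search
    parts.extend(words)                       # plain terms
    parts.extend(f"-{w}" for w in exclude)    # exclude query term
    if in_title:
        parts.append(f"intitle:{in_title}")   # no space after the colon
    return " ".join(parts)


query = build_query(words=["boeing"], exclude=["airbus"], in_title="737")
print(query)  # boeing -airbus intitle:737
```

The resulting string can be passed as the first argument of gn.search().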

List of operators that do not work (for me, at least):

  1. Most (probably all) of the as_* terms do not work for Google News
  2. allinlinks:
  3. related:

Tip. If you want to build a near real-time feed for a specific topic, use when='1h'. If Google captured fewer than 100 articles over the past hour, you should be able to retrieve all of them.
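
When you poll with when='1h', consecutive responses overlap, so a near real-time feed needs to drop the articles it has already seen. Here is a minimal sketch of that step. new_entries is a hypothetical helper, and it assumes that each entry's link (a standard Feedparser entry field) is a good enough identifier for an article.

```python
def new_entries(entries, seen_links):
    """Return the entries whose link has not been seen yet.

    seen_links is a set that is updated in place, so it can be reused
    across polling rounds.
    """
    fresh = []
    for entry in entries:
        link = entry.get("link")
        if link and link not in seen_links:
            seen_links.add(link)
            fresh.append(entry)
    return fresh


# Two simulated polling rounds with one overlapping article.
seen = set()
first_round = new_entries([{"link": "a"}, {"link": "b"}], seen)
second_round = new_entries([{"link": "b"}, {"link": "c"}], seen)
print(len(first_round), len(second_round))  # 2 1
```

In a real loop you would pass gn.search(query, when='1h')['entries'] on each round and wait between rounds, keeping the request rate low for the reasons given in the production section.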

Check the Useful Links section if you want to dig into how Google Search works.

In particular, see the Special Query Terms section of the Google XML reference.

Plus, more examples are provided under the Advanced Querying Search Examples section.


Output Body

All 4 functions return a dictionary with 2 sub-objects:

  • feed - contains the feed metadata
  • entries - contains the parsed articles

Both are inherited from Feedparser. The only change is that each dictionary under entries also contains sub_articles, the similar articles found in the description. Usually, it is non-empty for top_news() and topic_headlines() feeds.

Tip: to check the name of the feed that was found, look at the title under the feed dictionary.
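
For the ML use case of collecting similar headlines, the returned dictionary can be walked like this. The sketch runs on a hand-made sample instead of a live response. headline_groups is a hypothetical helper, and the assumption that each sub article is a dictionary with a title key is mine, so check a real response before relying on it.

```python
def headline_groups(result):
    """Group each entry's title with the titles of its sub articles."""
    groups = []
    for entry in result.get("entries", []):
        titles = [entry.get("title")]
        # Assumption: sub_articles is a list of dicts with a 'title' key.
        for sub in entry.get("sub_articles") or []:
            titles.append(sub.get("title"))
        groups.append([t for t in titles if t])
    return groups


# Hand-made sample that mimics the documented structure.
sample = {
    "feed": {"title": "Top stories - Google News"},
    "entries": [
        {"title": "Main headline", "sub_articles": [{"title": "Similar headline"}]},
        {"title": "Lone headline", "sub_articles": []},
    ],
}
print(headline_groups(sample))
# [['Main headline', 'Similar headline'], ['Lone headline']]
```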


How to use pygooglenews with ScrapingBee

Every function has a scraping_bee parameter. It accepts your ScrapingBee API key, which will be used to get the response from Google's servers.

You can take a look at what exactly is happening in the source code: check for __scaping_bee_request() function under GoogleNews class

Pay attention to the concurrency limit of each ScrapingBee plan.

Usage example:

gn = GoogleNews()

# it's a fake API key, do not try to use it
gn.top_news(scraping_bee = 'I5SYNPRFZI41WHVQWWUT0GNXFMO104343E7CXFIISR01E2V8ETSMXMJFK1XNKM7FDEEPUPRM0FYAHFF5')

How to use pygooglenews with proxies

If you have your own HTTP/HTTPS proxies that you want to use for requests to Google, pass them like this:

gn = GoogleNews()

gn.top_news(proxies = {'https':'34.91.135.38:80'})
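
If you have a pool of proxies, one simple approach is to pick one at random for each call. This is only a sketch: pick_proxy is a hypothetical helper, the addresses are placeholders from a documentation-only IP range, and the dictionary shape follows the proxies example above.

```python
import random


def pick_proxy(pool, rng=random):
    """Pick one proxy address from the pool and wrap it in a proxies dict."""
    if not pool:
        raise ValueError("the proxy pool is empty")
    return {"https": rng.choice(pool)}


# Placeholder addresses, not real proxies.
pool = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]
proxies = pick_proxy(pool)
print(proxies)
```

The result can be passed as gn.top_news(proxies=proxies), and the same goes for the other three functions.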

Advanced Querying Search Examples

Example 1. Search for articles that mention boeing and do not mention airbus

from pygooglenews import GoogleNews

gn = GoogleNews()

s = gn.search('boeing -airbus')

print(s['feed'].title)
# "boeing -airbus" - Google News

Example 2. Search for articles that mention boeing in title

from pygooglenews import GoogleNews

gn = GoogleNews()

s = gn.search('intitle:boeing')

print(s['feed'].title)
# "intitle:boeing" - Google News

Example 3. Search for articles that mention boeing in title and were published over the past hour

from pygooglenews import GoogleNews

gn = GoogleNews()

s = gn.search('intitle:boeing', when = '1h')

print(s['feed'].title)
# "intitle:boeing when:1h" - Google News

Example 4. Search for articles that mention boeing or airbus and were published over the past hour

from pygooglenews import GoogleNews

gn = GoogleNews()

s = gn.search('boeing OR airbus', when = '1h')

print(s['feed'].title)
# "boeing OR airbus when:1h" - Google News

Useful Links

Stack Overflow thread from which it all began

Google XML reference for the search query

Google News Search parameters (The Missing Manual)


Built With

Feedparser

Beautifulsoup4


About me

My name is Artem. I ❤️ working with news data. I am a co-founder of NewsCatcherAPI - Ultra-fast API to find news articles by any topic, country, language, website, or keyword

If you are interested in hiring me, please, contact me by email - [email protected] or [email protected]

Follow me on 🖋  Twitter - I write about data engineering, python, entrepreneurship, and memes.

Want to read about how it all was done? Subscribe to CODARIUM

thx to Kizy


Change Log

v0.1.1 -- fixed language-country issues
