Accurately separate the TLD from the registered domain and subdomains of a URL, using the Public Suffix List.

Overview

tldextract


tldextract accurately separates the gTLD or ccTLD (generic or country code top-level domain) from the registered domain and subdomains of a URL. For example, say you want just the 'google' part of 'http://www.google.com'.

Everybody gets this wrong. Splitting on the '.' and taking the last 2 elements works only for simple domains such as .com. Consider parsing http://forums.bbc.co.uk: the naive splitting method above will give you 'co' as the domain and 'uk' as the TLD, instead of 'bbc' and 'co.uk' respectively.
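To make the failure concrete, here is a minimal sketch of the naive approach. The helper name naive_split is made up for this illustration and is not part of tldextract.

```python
from urllib.parse import urlsplit


def naive_split(url):
    """Naively treat the last two dot-separated labels as (domain, tld)."""
    # Fall back to the raw input when there is no scheme to parse.
    host = urlsplit(url).hostname or url
    labels = host.split(".")
    return labels[-2], labels[-1]


print(naive_split("http://www.google.com"))    # ('google', 'com')
print(naive_split("http://forums.bbc.co.uk"))  # ('co', 'uk'), not ('bbc', 'co.uk')
```

The sketch also assumes the host has at least two labels; a single-label host such as localhost would raise an IndexError.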

tldextract, on the other hand, knows what all gTLDs and ccTLDs look like by looking up the currently living ones according to the Public Suffix List (PSL). So, given a URL, it can tell the subdomain from the domain, and the domain from the suffix.

>>> import tldextract

>>> tldextract.extract('http://forums.news.cnn.com/')
ExtractResult(subdomain='forums.news', domain='cnn', suffix='com')

>>> tldextract.extract('http://forums.bbc.co.uk/') # United Kingdom
ExtractResult(subdomain='forums', domain='bbc', suffix='co.uk')

>>> tldextract.extract('http://www.worldbank.org.kg/') # Kyrgyzstan
ExtractResult(subdomain='www', domain='worldbank', suffix='org.kg')
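The idea behind a suffix-list lookup can be sketched as a longest-match search over the host's labels. This is a toy illustration, not tldextract's implementation: the names ToyResult, TOY_SUFFIXES and toy_extract are hypothetical, the suffix set is a tiny hand-picked sample, and the PSL's wildcard and exception rules, as well as IP addresses, are ignored.

```python
from collections import namedtuple
from urllib.parse import urlsplit

ToyResult = namedtuple("ToyResult", ["subdomain", "domain", "suffix"])

# Tiny hand-picked sample for illustration; the real PSL has thousands of rules.
TOY_SUFFIXES = {"com", "uk", "co.uk", "kg", "org.kg"}


def toy_extract(url):
    """Split a full URL or bare hostname using the longest matching suffix."""
    host = urlsplit(url).hostname or url.lower()
    labels = host.split(".")
    # Scan from the longest candidate suffix to the shortest. When nothing
    # matches, cut ends at len(labels) and the suffix is the empty string.
    for cut in range(len(labels) + 1):
        suffix = ".".join(labels[cut:])
        if suffix in TOY_SUFFIXES or cut == len(labels):
            break
    head = labels[:cut]
    domain = head[-1] if head else ""
    return ToyResult(".".join(head[:-1]), domain, suffix)


print(toy_extract("http://forums.bbc.co.uk/"))
# ToyResult(subdomain='forums', domain='bbc', suffix='co.uk')
print(toy_extract("google.notavalidsuffix"))
# ToyResult(subdomain='google', domain='notavalidsuffix', suffix='')
```

Trying the longest candidate first is what lets 'co.uk' win over 'uk' for the BBC example.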

ExtractResult is a namedtuple, so it's simple to access the parts you want.

>>> ext = tldextract.extract('http://forums.bbc.co.uk')
>>> (ext.subdomain, ext.domain, ext.suffix)
('forums', 'bbc', 'co.uk')
>>> # rejoin subdomain and domain
>>> '.'.join(ext[:2])
'forums.bbc'
>>> # a common alias
>>> ext.registered_domain
'bbc.co.uk'

Note subdomain and suffix are optional. Not all URL-like inputs have a subdomain or a valid suffix.

>>> tldextract.extract('google.com')
ExtractResult(subdomain='', domain='google', suffix='com')

>>> tldextract.extract('google.notavalidsuffix')
ExtractResult(subdomain='google', domain='notavalidsuffix', suffix='')

>>> tldextract.extract('http://127.0.0.1:8080/deployed/')
ExtractResult(subdomain='', domain='127.0.0.1', suffix='')

If you want to rejoin the whole namedtuple, regardless of whether a subdomain or suffix was found:

>>> ext = tldextract.extract('http://127.0.0.1:8080/deployed/')
>>> # this has unwanted dots
>>> '.'.join(ext)
'.127.0.0.1.'
>>> # join each part only if it's truthy
>>> '.'.join(part for part in ext if part)
'127.0.0.1'

By default, this package supports the public ICANN TLDs and their exceptions. You can optionally support the Public Suffix List's private domains as well.

This module started by implementing the chosen answer from this StackOverflow question on getting the "domain name" from a URL. However, the proposed regex solution doesn't address many country codes like com.au, or the exceptions to country codes like the registered domain parliament.uk. The Public Suffix List does, and so does this module.

Installation

Latest release on PyPI:

pip install tldextract

Or the latest dev version:

pip install -e 'git+https://github.com/john-kurkowski/tldextract.git#egg=tldextract'

Command-line usage splits the URL components by space:

tldextract http://forums.bbc.co.uk
# forums bbc co.uk

Note About Caching

Beware: when first run, the module updates its TLD list with a live HTTP request. This updated TLD set is usually cached indefinitely in $HOME/.cache/python-tldextract. To control the cache's location, set the TLDEXTRACT_CACHE environment variable or pass the cache_dir path when initializing TLDExtract.

(Arguably runtime bootstrapping like that shouldn't be the default behavior, especially for production systems. But I want you to have the latest TLDs, especially when I haven't kept this code up to date.)

# extract callable that falls back to the included TLD snapshot, no live HTTP fetching
no_fetch_extract = tldextract.TLDExtract(suffix_list_urls=None)
no_fetch_extract('http://www.google.com')

# extract callable that reads/writes the updated TLD set to a different path
custom_cache_extract = tldextract.TLDExtract(cache_dir='/path/to/your/cache/')
custom_cache_extract('http://www.google.com')

# extract callable that doesn't use caching
no_cache_extract = tldextract.TLDExtract(cache_dir=False)
no_cache_extract('http://www.google.com')

If you want to stay fresh with the TLD definitions--though they don't change often--delete the cache file occasionally, or run

tldextract --update

or:

env TLDEXTRACT_CACHE="~/tldextract.cache" tldextract --update

It is also recommended to delete the file after upgrading this lib.
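Deleting the cache by hand can also be scripted. The following is a rough sketch using only the standard library. It assumes the cache is a directory at the location described above (TLDEXTRACT_CACHE if set, otherwise $HOME/.cache/python-tldextract); the function name clear_tldextract_cache is hypothetical and not part of the package.

```python
import os
import shutil


def clear_tldextract_cache(environ=os.environ):
    """Remove the assumed cache directory and return the path that was used."""
    # Assumption: the default location is the one this README documents.
    default = os.path.join(os.path.expanduser("~"), ".cache", "python-tldextract")
    cache_dir = os.path.expanduser(environ.get("TLDEXTRACT_CACHE", default))
    # ignore_errors=True makes this a no-op when the directory does not exist.
    shutil.rmtree(cache_dir, ignore_errors=True)
    return cache_dir
```

Because this removes the whole directory tree, double-check that TLDEXTRACT_CACHE does not point at a directory shared with other data before running it.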

Advanced Usage

Public vs. Private Domains

The PSL maintains a concept of "private" domains.

PRIVATE domains are amendments submitted by the domain holder, as an expression of how they operate their domain security policy. … While some applications, such as browsers when considering cookie-setting, treat all entries the same, other applications may wish to treat ICANN domains and PRIVATE domains differently.

By default, tldextract treats public and private domains the same.

>>> extract = tldextract.TLDExtract()
>>> extract('waiterrant.blogspot.com')
ExtractResult(subdomain='waiterrant', domain='blogspot', suffix='com')

The following overrides this.

>>> extract = tldextract.TLDExtract()
>>> extract('waiterrant.blogspot.com', include_psl_private_domains=True)
ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com')

or to change the default for all extract calls,

>>> extract = tldextract.TLDExtract(include_psl_private_domains=True)
>>> extract('waiterrant.blogspot.com')
ExtractResult(subdomain='', domain='waiterrant', suffix='blogspot.com')

The thinking behind the default is that it's the more common case when people mentally parse a URL. It assumes neither familiarity with the PSL nor knowledge that the PSL makes such a distinction. Note that this may run counter to the default parsing behavior of other PSL-based libraries.
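Why the same host splits differently in the two modes can be shown with a toy model. This is only a sketch under stated assumptions, not the library's code: ICANN_SUFFIXES and PRIVATE_SUFFIXES are one-entry stand-ins for the two sections of the PSL, and split_host is a hypothetical helper.

```python
from collections import namedtuple

ToyResult = namedtuple("ToyResult", ["subdomain", "domain", "suffix"])

# One-entry stand-ins for the two PSL sections, following the example above.
ICANN_SUFFIXES = {"com"}
PRIVATE_SUFFIXES = {"blogspot.com"}


def split_host(host, include_private=False):
    """Split a bare hostname, optionally honoring the private suffixes too."""
    suffixes = (ICANN_SUFFIXES | PRIVATE_SUFFIXES) if include_private else ICANN_SUFFIXES
    labels = host.lower().split(".")
    # Longest candidate suffix first; an empty suffix if nothing matches.
    for cut in range(len(labels) + 1):
        suffix = ".".join(labels[cut:])
        if suffix in suffixes or cut == len(labels):
            break
    head = labels[:cut]
    return ToyResult(".".join(head[:-1]), head[-1] if head else "", suffix)


print(split_host("waiterrant.blogspot.com"))
# ToyResult(subdomain='waiterrant', domain='blogspot', suffix='com')
print(split_host("waiterrant.blogspot.com", include_private=True))
# ToyResult(subdomain='', domain='waiterrant', suffix='blogspot.com')
```

Once the private entry is part of the lookup set, 'blogspot.com' becomes the longest match, so the label to its left is promoted from subdomain to domain.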

Specifying your own URL or file for the Suffix List data

You can specify your own input data in place of the default Mozilla Public Suffix List:

extract = tldextract.TLDExtract(
    suffix_list_urls=["http://foo.bar.baz"],
    # Recommended: Specify your own cache file, to minimize ambiguities about where
    # tldextract is getting its data, or cached data, from.
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)

The above snippet fetches from the URL you specified the first time the suffix list is needed (i.e. if the cached version doesn't exist).

If you want to use input data from your local filesystem, just use the file:// protocol:

extract = tldextract.TLDExtract(
    suffix_list_urls=["file:///absolute/path/to/your/local/suffix/list/file"],
    cache_dir='/path/to/your/cache/',
    fallback_to_snapshot=False)

Use an absolute path when specifying the suffix_list_urls keyword argument. os.path is your friend.
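One way to build such a URL without typing the absolute path by hand is sketched below, using only the standard library. The helper name suffix_list_file_url and the file name in the usage line are placeholders for this example.

```python
import os.path
from pathlib import Path


def suffix_list_file_url(path):
    """Return a file:// URL for a local file, making the path absolute first."""
    absolute = os.path.abspath(os.path.expanduser(path))
    # Path.as_uri() requires an absolute path and percent-encodes as needed.
    return Path(absolute).as_uri()


# The file name here is only a placeholder.
print(suffix_list_file_url("public_suffix_list.dat"))
```

The returned string can then be passed as the single element of the suffix_list_urls list.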

FAQ

Can you add suffix ____? Can you make an exception for domain ____?

This project doesn't contain an actual list of public suffixes. That comes from the Public Suffix List (PSL). Submit amendments there.

(In the meantime, you can tell tldextract about your exception by either forking the PSL and using your fork in the suffix_list_urls param, or adding your suffix piecemeal with the extra_suffixes param.)

If I pass an invalid URL, I still get a result, no error. What gives?

To keep tldextract light in LoC & overhead, and because there are plenty of URL validators out there, this library is very lenient with input. If valid URLs are important to you, validate them before calling tldextract.

This lenient stance lowers the learning curve of using the library, at the cost of desensitizing users to the nuances of URLs. Who knows how much. But in the future, I would consider an overhaul. For example, users could opt into validation, either receiving exceptions or error metadata on results.
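If you only need a cheap sanity check before calling tldextract, something like the following sketch may be enough. It is not a full URL validator, the function name looks_like_http_url is hypothetical, and restricting the check to http and https is an assumption you may want to relax.

```python
from urllib.parse import urlsplit


def looks_like_http_url(url):
    """Cheap check: an http(s) scheme and a non-empty hostname."""
    try:
        parts = urlsplit(url)
        hostname = parts.hostname
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. unbalanced IPv6 brackets.
        return False
    return parts.scheme in ("http", "https") and bool(hostname)


print(looks_like_http_url("http://forums.bbc.co.uk/"))  # True
print(looks_like_http_url("not a url"))                 # False
```

Inputs that fail the check can be skipped or reported before they ever reach the extractor.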

Contribute

Setting up

  1. git clone this repository.
  2. Change into the new directory.
  3. pip install tox

Running the Test Suite

Run all tests against all supported Python versions:

tox --parallel

Run all tests against a specific Python environment configuration:

tox -l
tox -e py37
Owner
John Kurkowski