Fast and robust date extraction from web pages, with Python or on the command-line

Overview

htmldate: find the publication date of web pages


Code: https://github.com/adbar/htmldate
Documentation: https://htmldate.readthedocs.io
Issue tracker: https://github.com/adbar/htmldate/issues

Find original and updated publication dates of any web page. From the command-line or within Python, all the steps needed from web page download to HTML parsing, scraping, and text analysis are included.

In a nutshell


Demo as GIF image

With Python:

>>> from htmldate import find_date
>>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'2016-12-23'
>>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True)
'2016-06-23'

On the command-line:

$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'

Features

  • Compatible with all recent versions of Python
  • Multilingual, robust and efficient (used in production on millions of documents)
  • URLs, HTML files, or HTML trees are given as input (includes batch processing)
  • Output as string in any date format (defaults to ISO 8601 YMD)
  • Detection of both original and updated dates

htmldate finds original and updated publication dates of web pages using heuristics on HTML code and linguistic patterns. It provides the following ways to date an HTML document:

  1. Markup in header: Common patterns are used to identify relevant elements (e.g. link and meta elements) including Open Graph protocol attributes and a large number of CMS idiosyncrasies
  2. HTML code: The whole document is then searched for structural markers: abbr and time elements as well as a series of attributes (e.g. postmetadata)
  3. Bare HTML content: A series of heuristics is run on text and markup:
  • in fast mode the HTML page is cleaned and precise patterns are targeted
  • in extensive mode all potential dates are collected and a disambiguation algorithm determines the best one
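The extensive-mode idea above — collect all candidates, then disambiguate — can be sketched in plain Python. This is an illustrative simplification, not htmldate's actual algorithm; the function name and the frequency-based tie-breaking are assumptions:

```python
import re
from collections import Counter
from typing import Optional

# Illustrative sketch only, not htmldate's real disambiguation logic:
# collect every ISO-like YYYY-MM-DD candidate in the raw HTML and keep
# the most frequent one, on the assumption that the publication date
# tends to be repeated (header markup, byline, structured data).
def naive_extensive_search(html: str) -> Optional[str]:
    candidates = re.findall(r"\b(20\d{2}-[01]\d-[0-3]\d)\b", html)
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]

print(naive_extensive_search(
    '<meta content="2016-12-23"/> posted 2016-12-23, archived 2017-01-02'
))  # 2016-12-23
```

The real implementation additionally cleans the markup, validates candidates against plausibility bounds, and weighs where in the document each candidate was found.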

Performance

500 web pages containing identifiable dates (as of 2021-09-24)

Python Package                     Precision  Recall  Accuracy  F-Score  Time
articleDateExtractor 0.20              0.769   0.691     0.572    0.728  3.3x
date_guesser 2.1.4                     0.738   0.544     0.456    0.626   20x
goose3 3.1.9                           0.821   0.453     0.412    0.584  8.2x
htmldate[all] 0.9.1 (fast)             0.839   0.906     0.772    0.871    1x
htmldate[all] 0.9.1 (extensive)        0.825   0.990     0.818    0.900  1.7x
newspaper3k 0.2.8                      0.729   0.630     0.510    0.675  8.4x
news-please 1.5.21                     0.769   0.691     0.572    0.728   30x

For complete results and explanations see the evaluation page.

Installation

This Python package is tested on Linux, macOS and Windows systems and is compatible with Python 3.6 upwards. It is available on the package repository PyPI and can notably be installed with pip (pip3 where applicable): pip install htmldate, and optionally pip install htmldate[speed].

Documentation

For more details on installation, Python & CLI usage, please refer to the documentation: htmldate.readthedocs.io

License

htmldate is distributed under the GNU General Public License v3.0. If you wish to redistribute this library but feel bound by the license conditions, please try interacting at arm's length, multi-licensing with compatible licenses, or contacting me.

See also GPL and free software licensing: What's in it for business?

Author

This effort is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). Extracting and pre-processing web texts to the exacting standards of scientific research presents a substantial challenge for those who conduct such research. There are web pages for which neither the URL nor the server response provide a reliable way to find out when a document was published or modified. For more information:

JOSS article Zenodo archive
@article{barbaresi-2020-htmldate,
  title = {{htmldate: A Python package to extract publication dates from web pages}},
  author = "Barbaresi, Adrien",
  journal = "Journal of Open Source Software",
  volume = 5,
  number = 51,
  pages = 2439,
  url = {https://doi.org/10.21105/joss.02439},
  publisher = {The Open Journal},
  year = 2020,
}

You can contact me via my contact page or GitHub.

Contributing

Contributions are welcome!

Feel free to file issues on the dedicated page. Thanks to the contributors who submitted features and bugfixes!

Kudos to the following software libraries:

Comments
  • Memory leak

    See issue https://github.com/adbar/trafilatura/issues/216.

    Extracting the date from the same web page multiple times shows that the module is leaking memory; this doesn't appear to be related to extensive_search:

    import os
    import psutil
    from htmldate import find_date
    
    with open('test.html', 'rb') as inputf:
        html = inputf.read()
    
    for i in range(10):
        result = find_date(html, extensive_search=False)
        process = psutil.Process(os.getpid())
        print(i, ":", process.memory_info().rss / 1024 ** 2)
    

    tracemalloc doesn't give any clue.

    bug 
    opened by adbar 21
  • feature: supports delaying url date extraction

    Adds a feature to improve the precision of dates by delaying the extraction from the URL; see https://github.com/adbar/htmldate/issues/55.

    Adds the boolean parameter url_delayed to the find_date function.

    This is slightly hacky, but is a quick fix. A better long-term solution would be allowing the extractors to be defined in order.

    opened by getorca 9
  • Good test cases

    Hi Adrien

    here are a few test cases where the extraction gave a wrong answer:

    https://www.gardeners.com/how-to/vegetable-gardening/5069.html
    https://www.almanac.com/vegetable-gardening-for-beginners

    Somewhat related, this one 'hangs': https://www.homedepot.com/c/ah/how-to-start-a-vegetable-garden/9ba683603be9fa5395fab90d6de2854

    opened by vprelovac 9
  • Add new test cases including more global stories

    This adds a new set of test cases based on a global random sample of 200 articles from the Media Cloud dataset (related to #8). We currently use our own date_guesser library and are evaluating switching to htmldate.

    This new corpus includes 200 articles discovered via a search of stories from 2020 in the Media Cloud corpus. The set of countries of origin, and languages, is representative of the ~60k sources we ingest from every day.

    The htmldate code still performs well against this new test corpus:

    Name                    Precision    Recall    Accuracy    F-score
    --------------------  -----------  --------  ----------  ---------
    htmldate extensive       0.755102  0.973684       0.74    0.850575
    htmldate fast            0.769663  0.861635       0.685   0.813056
    newspaper                0.671141  0.662252       0.5     0.666667
    newsplease               0.736527  0.788462       0.615   0.76161
    articledateextractor     0.72973   0.675          0.54    0.701299
    date_guesser             0.686567  0.582278       0.46    0.630137
    goose                    0.75      0.508772       0.435   0.606272
    

    A few notes:

    • We changed comparison.py to load test data from .json files so the test data is isolated from the code itself.
    • The new set of stories and dates are in test/eval_mediacloud_2020.json, with HTML cached in tests/eval.
    • The evaluation results are now printed out via the tabulate module, and saved to the file system.
    • Perhaps the two evaluations sets should be merged into one larger one? Or the scores combined between them? We weren't sure how to approach this.
    • Interesting to note that overall the precision scores were lower against this corpus (more false positives), while recall was actually slightly better against this set (fewer false negatives).

    We hope this contribution helps document the performance of the various libraries against a more global dataset.

    opened by rahulbot 8
  • `find_date` doesn't extract `%d %b %Y` formatted dates in free text

    For the following MWE:

    from htmldate import find_date
    
    print(find_date("<html><body>Wed, 19 Oct 2022 14:24:05 +0000</body></html>"))
    

    htmldate outputs 2022-01-01 instead of the expected 2022-10-19.

    I've traced the execution of the above call and I believe the bug is in the search_page function. It doesn't seem to recognize the above pattern as a valid date and only latches on to the 2022 part of the string (which autocompletes the rest to 1 January).

    I haven't found time to understand why the bug happens in detail so I don't have a solution right now. I'll try and see if I can fix the bug and will make a PR if I can.
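    For reference, this RFC-2822-style timestamp parses cleanly with the standard library alone — a workaround sketch for this specific string, not a fix inside htmldate:

```python
from datetime import datetime

# The free-text date from the MWE, in RFC 2822 / email style:
raw = "Wed, 19 Oct 2022 14:24:05 +0000"

# %a = weekday name, %d %b %Y = day/month/year, %z = numeric UTC offset
dt = datetime.strptime(raw, "%a, %d %b %Y %H:%M:%S %z")
print(dt.date().isoformat())  # 2022-10-19
```

    The stdlib helper email.utils.parsedate_to_datetime handles the same format as well.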

    enhancement 
    opened by k-sareen 7
  • return datetime instead of date

    Is there a way to force htmldate to look for datetimes and not just dates, or to prioritise specific extractors over others, e.g. opengraph over url-extraction? Let me give you an example:

    from htmldate import find_date
    url = "https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/"
    find_date(url, outputformat='%Y-%m-%d %H:%M:%S', verbose = True)
    
    INFO:htmldate.utils:URL detected, downloading: https://www.ot.gr/2022/03/23/apopseis/daimonopoiisi/
    DEBUG:urllib3.connectionpool:Resetting dropped connection: www.ot.gr
    DEBUG:urllib3.connectionpool:https://www.ot.gr:443 "GET /2022/03/23/apopseis/daimonopoiisi/ HTTP/1.1" 200 266737
    DEBUG:htmldate.extractors:found date in URL: /2022/03/23/
    '2022-03-23 00:00:00'
    

    returns:

    '2022-03-23 00:00:00'
    

    But if you look at the article you can find: <meta property="article:published_time" content="2022-03-23T06:15:58+00:00">
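    When that meta tag is available, its ISO 8601 value can be converted to the desired output format with the standard library — a sketch of the post-processing this question asks about, not something htmldate currently does:

```python
from datetime import datetime

# Value of the article:published_time meta tag quoted above:
content = "2022-03-23T06:15:58+00:00"

dt = datetime.fromisoformat(content)     # timezone-aware datetime
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2022-03-23 06:15:58
```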

    question 
    opened by kvasilopoulos 7
  • "URL couldn't be processed: %s" during callinf of find_date()

    I have a problem with extracting the date from a website: date = find_date('https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731')

    I got such an error:

    ValueError                                Traceback (most recent call last)
    in ()
    ----> 1 date = find_date('https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731');

    1 frames
    /usr/local/lib/python3.7/dist-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
        598     if verbose is True:
        599         logging.basicConfig(level=logging.DEBUG)
    --> 600     tree = load_html(htmlobject)
        601     find_date.extensive_search = extensive_search
        602     min_date, max_date = get_min_date(min_date), get_max_date(max_date)

    /usr/local/lib/python3.7/dist-packages/htmldate/utils.py in load_html(htmlobject)
        165     # log the error and quit
        166     if htmltext is None:
    --> 167         raise ValueError("URL couldn't be processed: %s", htmlobject)
        168     # start processing
        169     tree = None

    ValueError: ("URL couldn't be processed: %s", 'https://uk.investing.com/news/astrazeneca-earnings-revenue-beat-in-q4-2582731')

    I will be grateful for any support and help with this.

    question 
    opened by HubLubas 6
  • Port improvements from go-htmldate

    Overview

    While porting this library to the Go language, I tried to make some improvements to render the extraction more accurate. After further testing, those improvements look good and stable enough to use, so I decided to contribute them back to the Python code here.

    Changes

    There are three main changes in this PR:

    1. Add French and Indonesian to the regular expressions used to parse long date strings.

      This fixes a case where htmldate failed to extract the date from paris-luttes.info.html, which is in French. Since I was adding a new language to the regular expressions anyway, I decided to add Indonesian as well.

    2. Improve custom_parse.

      Now it works by trying to parse the string using several formats, with the following priority:

      • YYYYMMDD pattern
      • YYYY-MM-DD (ISO-8601)
      • DD-MM-YYYY (the most commonly used date format according to Wikipedia)
      • YYYY-MM pattern
      • Regex patterns
    3. Merge xpath selectors from array of strings into a single string.

      This fixes a case where htmldate extracted the wrong date for wolfsrebellen-netz.forumieren.com.regeln.html. Consider an HTML document like this:

      	<div>
      		<h1>Some Title</h1>
      		<p class="author">By Joestar at 2020/12/12, 10:11 AM</p>
      		<p>Lorem ipsum dolor sit amet.</p>
      		<p>Dolorum explicabo quos voluptas voluptates?</p>
      		<p class="current-time">Current date and time: 2021/07/14, 09:00 PM</p>
      	</div>
      

      In the document above there are two dates: one in the element with class "author" and the other in the element with class "current-time".

      In the original code, htmldate picks the date from the "current-time" element even though it occurs later in the document. This is because DATE_EXPRESSIONS is currently created as an array of XPath selectors, and in that array elements whose classes contain "time" are given more priority than elements whose classes contain "author".

      To fix this, I've converted DATE_EXPRESSIONS and the other XPath selectors from arrays of strings into a single string. This way every rule inside the expression has the same priority, so now the <p class="author"> element is selected first.
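The document-order effect of merging selectors can be checked with lxml directly. A minimal sketch of the idea using the HTML above — the simplified class-based expressions are assumptions for illustration, not htmldate's real DATE_EXPRESSIONS:

```python
from lxml import html

doc = html.fromstring("""
<div>
  <p class="author">By Joestar at 2020/12/12, 10:11 AM</p>
  <p class="current-time">Current date and time: 2021/07/14, 09:00 PM</p>
</div>""")

# A single merged (union) expression returns matches in document order,
# so the earlier "author" element comes first even though the "time"
# selector is written first inside the expression.
merged = doc.xpath('//*[contains(@class, "time")] | //*[contains(@class, "author")]')
print(merged[0].get("class"))  # author
```

With two separate expressions evaluated one after the other, the order of the list — not the order in the document — would decide which element wins.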

    Result

    Here is the result of the comparison test for the original htmldate:

    | Package            | Precision | Recall | Accuracy | F-Score | Speed (s) |
    |:------------------:|:---------:|:------:|:--------:|:-------:|:---------:|
    | htmldate fast      | 0.899     | 0.917  | 0.831    | 0.908   | 1.038     |
    | htmldate extensive | 0.893     | 1.000  | 0.893    | 0.944   | 2.219     |

    And here is after this PR:

    | Package            | Precision | Recall | Accuracy | F-Score | Speed (s) |
    |:------------------:|:---------:|:------:|:--------:|:-------:|:---------:|
    | htmldate fast      | 0.920     | 0.938  | 0.867    | 0.929   | 1.579     |
    | htmldate extensive | 0.911     | 1.000  | 0.911    | 0.953   | 2.807     |

    So there is a slight increase in accuracy; however, extraction becomes slower (around 1.5x slower than the original).

    Additional Notes

    I've not added it to this PR; however, since custom_parse has been improved, from what I've tested we can safely remove external_date_parser without any performance loss. Here is the result of the comparison test after external_date_parser is removed:

    | Package            | Precision | Recall | Accuracy | F-Score | Speed (s) |
    |:------------------:|:---------:|:------:|:--------:|:-------:|:---------:|
    | htmldate fast      | 0.920     | 0.938  | 0.867    | 0.929   | 1.678     |
    | htmldate extensive | 0.911     | 1.000  | 0.911    | 0.953   | 1.816     |

    So the accuracy stays the same, but extraction in extensive mode becomes a lot faster (now only 1.08x slower than fast mode), so we might be able to make extensive mode the default. More tests might be needed, though.
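The priority cascade described in change 2 can be sketched with the standard library. An illustrative simplification — the function name and the exact format list are assumptions, not custom_parse's actual code, and the regex fallbacks are omitted:

```python
from datetime import datetime
from typing import Optional

# Try strict formats first, then looser ones, mirroring the
# priority list above.
FORMATS = ("%Y%m%d", "%Y-%m-%d", "%d-%m-%Y", "%Y-%m")

def cascading_parse(candidate: str) -> Optional[str]:
    for fmt in FORMATS:
        try:
            return datetime.strptime(candidate, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

print(cascading_parse("20201212"))    # 2020-12-12
print(cascading_parse("12-07-2021"))  # 2021-07-12
```

Ordering matters: ambiguous strings are claimed by the first format that accepts them, which is why the stricter patterns come first.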

    opened by RadhiFadlillah 6
  • Strange inferred date for target news article

    Hello @adbar,

    I just stumbled upon an issue when extracting contents from this html file (an article from LeMonde): https://gist.github.com/Yomguithereal/de4457a421729c92a976b506268631d7

    It returns 2021-01-31 (a date that was in the future when the HTML was downloaded, i.e. more than one year ago) because it latches on to an expiry date found in a JavaScript string literal.

    I don't really know how trafilatura tries to extract a date from HTML pages, but I guess it was found here by a regex scanning the whole text? In that case, a condition checking that found dates are not in the future could help (this could also be tedious, because one would need to pass the "present" date when extracting data collected in the past).

    bug 
    opened by Yomguithereal 6
  • error: redefinition of group name 'm' as group 5; was group 2 at position 116

    Hello there,

    Thanks for this great project! I encountered a problem while crawling different websites and trying to extract dates with this package, especially on this URL: https://osmh.dev

    Here is the error using iPython and Python 3.8.12:

    # works
    In [3]: from htmldate import find_date
    
    In [4]: find_date("https://osmh.dev")
    Out[4]: '2020-11-29'
    
    # doesn't work
    In [6]: find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')
    

    The last example throws an error:

    ---------------------------------------------------------------------------
    error                                     Traceback (most recent call last)
    <ipython-input-6-9988648ad55b> in <module>
    ----> 1 find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
        653
        654     # try time elements
    --> 655     time_result = examine_time_elements(
        656         search_tree, outputformat, extensive_search, original_date, min_date, max_date
        657     )
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in examine_time_elements(tree, outputformat, extensive_search, original_date, min_date, max_date)
        389                         return attempt
        390                 else:
    --> 391                     reference = compare_reference(reference, elem.get('datetime'), outputformat, extensive_search, original_date, min_date, max_date)
        392                     if reference > 0:
        393                         break
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in compare_reference(reference, expression, outputformat, extensive_search, original_date, min_date, max_date)
        300     attempt = try_expression(expression, outputformat, extensive_search, min_date, max_date)
        301     if attempt is not None:
    --> 302         return compare_values(reference, attempt, outputformat, original_date)
        303     return reference
        304
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/validators.py in compare_values(reference, attempt, outputformat, original_date)
        110 def compare_values(reference, attempt, outputformat, original_date):
        111     """Compare the date expression to a reference"""
    --> 112     timestamp = time.mktime(datetime.datetime.strptime(attempt, outputformat).timetuple())
        113     if original_date is True and (reference == 0 or timestamp < reference):
        114         reference = timestamp
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime_datetime(cls, data_string, format)
        566     """Return a class cls instance based on the input string and the
        567     format string."""
    --> 568     tt, fraction, gmtoff_fraction = _strptime(data_string, format)
        569     tzname, gmtoff = tt[-2:]
        570     args = tt[:6] + (fraction,)
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime(data_string, format)
        331         if not format_regex:
        332             try:
    --> 333                 format_regex = _TimeRE_cache.compile(format)
        334             # KeyError raised when a bad format is found; can be specified as
        335             # \\, in which case it was a stray % but with a space after it
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in compile(self, format)
        261     def compile(self, format):
        262         """Return a compiled re object for the format string."""
    --> 263         return re_compile(self.pattern(format), IGNORECASE)
        264
        265 _cache_lock = _thread_allocate_lock()
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in compile(pattern, flags)
        250 def compile(pattern, flags=0):
        251     "Compile a regular expression pattern, returning a Pattern object."
    --> 252     return _compile(pattern, flags)
        253
        254 def purge():
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in _compile(pattern, flags)
        302     if not sre_compile.isstring(pattern):
        303         raise TypeError("first argument must be string or compiled pattern")
    --> 304     p = sre_compile.compile(pattern, flags)
        305     if not (flags & DEBUG):
        306         if len(_cache) >= _MAXCACHE:
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_compile.py in compile(p, flags)
        762     if isstring(p):
        763         pattern = p
    --> 764         p = sre_parse.parse(p, flags)
        765     else:
        766         pattern = None
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in parse(str, flags, state)
        946
        947     try:
    --> 948         p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
        949     except Verbose:
        950         # the VERBOSE flag was switched on inside the pattern.  to be
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse_sub(source, state, verbose, nested)
        441     start = source.tell()
        442     while True:
    --> 443         itemsappend(_parse(source, state, verbose, nested + 1,
        444                            not nested and not items))
        445         if not sourcematch("|"):
    
    /opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse(source, state, verbose, nested, first)
        829                     group = state.opengroup(name)
        830                 except error as err:
    --> 831                     raise source.error(err.msg, len(name) + 1) from None
        832             sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
        833                            not (del_flags & SRE_FLAG_VERBOSE))
    
    error: redefinition of group name 'm' as group 5; was group 2 at position 116
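    The root cause is visible in the output format itself: '%Y-%m-%d %H:%m:%S' uses %m (month) twice where %M (minutes) was intended. strptime compiles the format into a regex with one named group per directive, so a repeated directive triggers exactly this re error. A minimal stdlib reproduction:

```python
import datetime
import re

# %m appears twice (month and, mistakenly, minutes) -> duplicate named group
try:
    datetime.datetime.strptime("2020-11-29 10:30:00", "%Y-%m-%d %H:%m:%S")
except re.error as err:
    print("re.error:", err)

# With %M for minutes the format compiles and parses fine:
dt = datetime.datetime.strptime("2020-11-29 10:30:00", "%Y-%m-%d %H:%M:%S")
print(dt.isoformat())  # 2020-11-29T10:30:00
```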
    
    bug 
    opened by kinoute 4
  • how are timezones handled when available?

    Some articles include the full publication time, with timezone, in HTML meta tags or JavaScript config. Does this library parse and handle those timezones? Relatedly, how does it internally store dates with regard to timezone: are they all returned in machine-local time, held in GMT, or something else?

    For instance, this Guardian article includes the article:published_time meta tag with a timezone included. Does this library recognize that timezone and return the date as it would be in GMT? Same for this article on CNN, which includes the datePublished meta tag.
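    As background on what such a conversion involves: a timezone-aware meta-tag value can be normalized to UTC with the standard library. A sketch with a hypothetical offset, illustrating the question rather than htmldate's actual behavior:

```python
from datetime import datetime, timezone

# Hypothetical article:published_time value with a non-UTC offset:
published = "2022-03-23T06:15:58+05:30"

dt = datetime.fromisoformat(published)
print(dt.astimezone(timezone.utc).isoformat())  # 2022-03-23T00:45:58+00:00
```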

    question 
    opened by rahulbot 3
  • ignore undateable domains more intentionally

    In our testing the current code produces unreliable results on Wikipedia articles. Sometimes it returns a date, sometimes it doesn't. Wikipedia articles are constantly updated, so @coreydockser and I would like to propose changing it so that it returns no date if the URL is a wikipedia.org one. In our broader experience with Media Cloud this produces more useful results (in our open-web news analysis context).

    In terms of implementation, we could just copy the filter_url_for_undateable function from date_guesser and use it as-is to include the other checks it performs for undateable domains. We'd call it early on in guess_date.

    question 
    opened by rahulbot 7
  • Test htmldate on further web pages and report bugs

    I have mostly tested htmldate on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.

    Please install the dateparser library beforehand as it significantly extends linguistic coverage: pip (or pip3) install -U dateparser, or pip install -U htmldate[all].

    Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).

    Thanks!

    good first issue up for grabs 
    opened by adbar 7
  • Check the language, clarity and consistency of documentation

    A short version of the documentation is available straight from GitHub (README.rst), while a more exhaustive one lives in the docs folder and online at htmldate.readthedocs.io

    Several problems could arise:

    • Non-idiomatic use of English (not quite fluent or natural)
    • Unclear or incomplete descriptions
    • Code examples that don't work
    • Typos in explanations or code sections
    • Outdated sections
    good first issue up for grabs 
    opened by adbar 2
Releases (v1.4.0)
  • v1.4.0(Nov 28, 2022)

    • additional search of free text in whole document (#67)
    • optional parameter for subdaily precision with @getorca (#66)
    • fix for HTML doctype parsing (#44)
    • cleaner code for multilingual month expressions
    • extended expressions for extraction in HTML meta fields
    • update of dependencies and evaluation
  • v1.3.2(Oct 14, 2022)

  • v1.3.1(Aug 26, 2022)

  • v1.3.0(Jul 20, 2022)

    • Entirely type-checked code base
    • New function clear_caches() (#57)
    • Slightly more efficient code (about 5% faster)

    Full Changelog: https://github.com/adbar/htmldate/compare/v1.2.3...v1.3.0

  • v1.2.3(Jun 16, 2022)

  • v1.2.2(Jun 13, 2022)

    • slightly higher accuracy & faster extensive extraction
    • maintenance: code base simplified, more tests
    • bugs addressed: #51, #54
    • docs: fix by @MSK1582

    Full Changelog: https://github.com/adbar/htmldate/compare/v1.2.1...v1.2.2

  • v1.2.1(Mar 23, 2022)

    • speed and accuracy gains
    • better extraction coverage, simpler code
    • bug fixed (typo in variable)

    Full Changelog: https://github.com/adbar/htmldate/compare/v1.2.0...v1.2.1

  • v1.2.0(Mar 16, 2022)

    • better performance
    • remove unnecessary ciso8601 dependency
    • temporary fix for scrapinghub/dateparser#1045 bug

    Full Changelog: https://github.com/adbar/htmldate/compare/v1.1.1...v1.2.0

  • v1.1.1(Mar 3, 2022)

    • bugfix: input encoding
    • improved extraction coverage (#47) by @liulinlin90

    Full Changelog: https://github.com/adbar/htmldate/compare/v1.1.0...v1.1.1

  • v1.1.0(Feb 18, 2022)

    • better handling of file encodings
    • slight increase in accuracy, more efficient code

    Full Changelog: https://github.com/adbar/htmldate/compare/v1.0.1...v1.1.0

  • v1.0.1(Feb 14, 2022)

    • maintenance release, code base cleaned
    • command-line interface: --version added
    • file parsing reviewed

    Full Changelog: https://github.com/adbar/htmldate/compare/v1.0.0...v1.0.1

  • v1.0.0(Nov 9, 2021)

  • v0.9.1(Sep 24, 2021)

    • improved generic date parsing (thanks @RadhiFadlillah)
    • specific support for French and Indonesian (thanks @RadhiFadlillah)
    • additional evaluation for English news sites (kudos to @coreydockser & @rahulbot)
    • bugs fixed
  • v0.9.0(Jul 28, 2021)

  • v0.8.1(Mar 9, 2021)

  • v0.8.0(Feb 11, 2021)

  • v0.7.3(Jan 4, 2021)

    • dependencies updated and reduced: switch from requests to bare urllib3, make chardet standard and cchardet optional
    • fixes: downloads, OverflowError in extraction
  • v0.7.2(Oct 20, 2020)

  • v0.7.1(Sep 14, 2020)

  • v0.7.0(Jul 29, 2020)

    • code base and performance improved
    • minimum date available as option
    • support for Turkish patterns and CMS idiosyncrasies (thanks @evolutionoftheuniverse)
  • v0.6.3(May 26, 2020)

  • v0.6.1(Jan 17, 2020)

    htmldate finds original and updated publication dates of any web page. All the steps needed from web page download to HTML parsing, scraping and text analysis are included.

    In a nutshell, with Python:

    >>> from htmldate import find_date
    >>> find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
    '2016-12-23'
    >>> find_date('https://netzpolitik.org/2016/die-cider-connection-abmahnungen-gegen-nutzer-von-creative-commons-bildern/', original_date=True)
    '2016-06-23'

    On the command-line:

    $ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
    '2016-12-23'

    Releases are used in production and are meant to be archived on Zenodo for reproducibility and citability.

    For more information see htmldate.readthedocs.io

  • v0.5.6(Sep 24, 2019)

Owner
Adrien Barbaresi
Research scientist – natural language processing, web scraping and text analytics. Mostly with Python.