Extract embedded metadata from HTML markup

Overview

extruct

Build Status Coverage report PyPI Version

extruct is a library for extracting embedded metadata from HTML markup.

Currently, extruct supports:

The microdata algorithm is a revisit of this Scrapinghub blog post showing how to use EXSLT extensions.

Installation

pip install extruct

Usage

All-in-one extraction

The simplest example how to use extruct is to call extruct.extract(htmlstring, base_url=base_url) with some HTML string and an optional base URL.

Let's try this on a webpage that uses all the syntaxes supported (RDFa with ogp).

First fetch the HTML using python-requests and then feed the response body to extruct:

>>> import extruct
>>> import requests
>>> import pprint
>>> from w3lib.html import get_base_url
>>>
>>> pp = pprint.PrettyPrinter(indent=2)
>>> r = requests.get('https://www.optimizesmart.com/how-to-use-open-graph-protocol/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url=base_url)
>>>
>>> pp.pprint(data)
{ 'dublincore': [ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/description',
                                      'content': 'What is Open Graph Protocol '
                                                 'and why you need it? Learn to '
                                                 'implement Open Graph Protocol '
                                                 'for Facebook on your website. '
                                                 'Open Graph Protocol Meta Tags.',
                                      'name': 'description'}],
                      'namespaces': {},
                      'terms': []}],

'json-ld': [ { '@context': 'https://schema.org',
                 '@id': '#organization',
                 '@type': 'Organization',
                 'logo': 'https://www.optimizesmart.com/wp-content/uploads/2016/03/optimize-smart-Twitter-logo.jpg',
                 'name': 'Optimize Smart',
                 'sameAs': [ 'https://www.facebook.com/optimizesmart/',
                             'https://uk.linkedin.com/in/analyticsnerd',
                             'https://www.youtube.com/user/optimizesmart',
                             'https://twitter.com/analyticsnerd'],
                 'url': 'https://www.optimizesmart.com/'}],
  'microdata': [ { 'properties': {'headline': ''},
                   'type': 'http://schema.org/WPHeader'}],
  'microformat': [ { 'children': [ { 'properties': { 'category': [ 'specialized-tracking'],
                                                     'name': [ 'Open Graph '
                                                               'Protocol for '
                                                               'Facebook '
                                                               'explained with '
                                                               'examples\n'
                                                               '\n'
                                                               'Specialized '
                                                               'Tracking\n'
                                                               '\n'
                                                               '\n'
                                                               (...)
                                                               'Follow '
                                                               '@analyticsnerd\n'
                                                               '!function(d,s,id){var '
                                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                                               "'script', "
                                                               "'twitter-wjs');"]},
                                     'type': ['h-entry']}],
                     'properties': { 'name': [ 'Open Graph Protocol for '
                                               'Facebook explained with '
                                               'examples\n'
                                               (...)
                                               'Follow @analyticsnerd\n'
                                               '!function(d,s,id){var '
                                               "js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, "
                                               "'script', 'twitter-wjs');"]},
                     'type': ['h-feed']}],
  'opengraph': [ { 'namespace': {'og': 'http://ogp.me/ns#'},
                   'properties': [ ('og:locale', 'en_US'),
                                   ('og:type', 'article'),
                                   ( 'og:title',
                                     'Open Graph Protocol for Facebook '
                                     'explained with examples'),
                                   ( 'og:description',
                                     'What is Open Graph Protocol and why you '
                                     'need it? Learn to implement Open Graph '
                                     'Protocol for Facebook on your website. '
                                     'Open Graph Protocol Meta Tags.'),
                                   ( 'og:url',
                                     'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'),
                                   ('og:site_name', 'Optimize Smart'),
                                   ( 'og:updated_time',
                                     '2018-03-09T16:26:35+00:00'),
                                   ( 'og:image',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'),
                                   ( 'og:image:secure_url',
                                     'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg')]}],
  'rdfa': [ { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/#header',
              'http://www.w3.org/1999/xhtml/vocab#role': [ { '@id': 'http://www.w3.org/1999/xhtml/vocab#banner'}]},
            { '@id': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/',
              'article:modified_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'article:published_time': [ { '@value': '2010-07-02T18:57:23+00:00'}],
              'article:publisher': [ { '@value': 'https://www.facebook.com/optimizesmart/'}],
              'article:section': [{'@value': 'Specialized Tracking'}],
              'http://ogp.me/ns#description': [ { '@value': 'What is Open '
                                                            'Graph Protocol '
                                                            'and why you need '
                                                            'it? Learn to '
                                                            'implement Open '
                                                            'Graph Protocol '
                                                            'for Facebook on '
                                                            'your website. '
                                                            'Open Graph '
                                                            'Protocol Meta '
                                                            'Tags.'}],
              'http://ogp.me/ns#image': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#image:secure_url': [ { '@value': 'https://www.optimizesmart.com/wp-content/uploads/2010/07/open-graph-protocol.jpg'}],
              'http://ogp.me/ns#locale': [{'@value': 'en_US'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Optimize Smart'}],
              'http://ogp.me/ns#title': [ { '@value': 'Open Graph Protocol for '
                                                      'Facebook explained with '
                                                      'examples'}],
              'http://ogp.me/ns#type': [{'@value': 'article'}],
              'http://ogp.me/ns#updated_time': [ { '@value': '2018-03-09T16:26:35+00:00'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.optimizesmart.com/how-to-use-open-graph-protocol/'}],
              'https://api.w.org/': [ { '@id': 'https://www.optimizesmart.com/wp-json/'}]}]}

Select syntaxes

It is possible to select which syntaxes to extract by passing a list with the desired ones to extract. Valid values: 'microdata', 'json-ld', 'opengraph', 'microformat', 'rdfa' and 'dublincore'. If no list is passed all syntaxes will be extracted and returned:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'])
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { 'namespace': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                                  'fb': 'http://www.facebook.com/2008/fbml',
                                  'og': 'http://ogp.me/ns#'},
                   'properties': [ ('fb:app_id', '308540029359'),
                                   ('og:site_name', 'Songkick'),
                                   ('og:type', 'songkick-concerts:artist'),
                                   ('og:title', 'Elysian Fields'),
                                   ( 'og:description',
                                     'Find out when Elysian Fields is next '
                                     'playing live near you. List of all '
                                     'Elysian Fields tour dates and concerts.'),
                                   ( 'og:url',
                                     'https://www.songkick.com/artists/236156-elysian-fields'),
                                   ( 'og:image',
                                     'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg')]}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

Uniform

Another option is to uniform the output of microformat, opengraph, microdata, dublincore and json-ld syntaxes to the following structure:

{'@context': 'http://example.com',
             '@type': 'example_type',
             /* All other the properties in keys here */
             }

To do so set uniform=True when calling extract, it's false by default for backward compatibility. Here the same example as before but with uniform set to True:

>>> r = requests.get('http://www.songkick.com/artists/236156-elysian-fields')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata', 'opengraph', 'rdfa'], uniform=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [],
  'opengraph': [ { '@context': { 'concerts': 'http://ogp.me/ns/fb/songkick-concerts#',
                               'fb': 'http://www.facebook.com/2008/fbml',
                               'og': 'http://ogp.me/ns#'},
                 '@type': 'songkick-concerts:artist',
                 'fb:app_id': '308540029359',
                 'og:description': 'Find out when Elysian Fields is next '
                                   'playing live near you. List of all '
                                   'Elysian Fields tour dates and concerts.',
                 'og:image': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg',
                 'og:site_name': 'Songkick',
                 'og:title': 'Elysian Fields',
                 'og:url': 'https://www.songkick.com/artists/236156-elysian-fields'}],
  'rdfa': [ { '@id': 'https://www.songkick.com/artists/236156-elysian-fields',
              'al:ios:app_name': [{'@value': 'Songkick Concerts'}],
              'al:ios:app_store_id': [{'@value': '438690886'}],
              'al:ios:url': [ { '@value': 'songkick://artists/236156-elysian-fields'}],
              'http://ogp.me/ns#description': [ { '@value': 'Find out when '
                                                            'Elysian Fields is '
                                                            'next playing live '
                                                            'near you. List of '
                                                            'all Elysian '
                                                            'Fields tour dates '
                                                            'and concerts.'}],
              'http://ogp.me/ns#image': [ { '@value': 'http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg'}],
              'http://ogp.me/ns#site_name': [{'@value': 'Songkick'}],
              'http://ogp.me/ns#title': [{'@value': 'Elysian Fields'}],
              'http://ogp.me/ns#type': [{'@value': 'songkick-concerts:artist'}],
              'http://ogp.me/ns#url': [ { '@value': 'https://www.songkick.com/artists/236156-elysian-fields'}],
              'http://www.facebook.com/2008/fbmlapp_id': [ { '@value': '308540029359'}]}]}

NB rdfa structure is not uniformed yet

Returning HTML node

It is also possible to get references to HTML node for every extracted metadata item. The feature is supported only by microdata syntax.

To use that, just set the return_html_node option of extract method to True. As the result, an additional key "nodeHtml" will be included in the result for every item. Each node is of lxml.etree.Element type:

>>> r = requests.get('http://www.rugpadcorner.com/shop/no-muv/')
>>> base_url = get_base_url(r.text, r.url)
>>> data = extruct.extract(r.text, base_url, syntaxes=['microdata'], return_html_node=True)
>>>
>>> pp.pprint(data)
{ 'microdata': [ { 'htmlNode': <Element div at 0x7f10f8e6d3b8>,
                   'properties': { 'description': 'KEEP RUGS FLAT ON CARPET!\n'
                                                  'Not your thin sticky pad, '
                                                  'No-Muv is truly the best!',
                                   'image': ['', ''],
                                   'name': ['No-Muv', 'No-Muv'],
                                   'offers': [ { 'htmlNode': <Element div at 0x7f10f8e6d138>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': 'Price:  '
                                                                          '$45'},
                                                 'type': 'http://schema.org/Offer'},
                                               { 'htmlNode': <Element div at 0x7f10f8e60f48>,
                                                 'properties': { 'availability': 'http://schema.org/InStock',
                                                                 'price': '(Select '
                                                                          'Size/Shape '
                                                                          'for '
                                                                          'Pricing)'},
                                                 'type': 'http://schema.org/Offer'}],
                                   'ratingValue': ['5.00', '5.00']},
                   'type': 'http://schema.org/Product'}]}

Single extractors

You can also use each extractor individually. See below.

Microdata extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.w3cmicrodata import MicrodataExtractor
>>>
>>> # example from http://www.w3.org/TR/microdata/#associating-names-with-items
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Photo gallery</title>
...  </head>
...  <body>
...   <h1>My photos</h1>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/house.jpeg" alt="A white house, boarded up, sits in a forest.">
...    <figcaption itemprop="title">The house I found.</figcaption>
...   </figure>
...   <figure itemscope itemtype="http://n.whatwg.org/work" itemref="licenses">
...    <img itemprop="work" src="images/mailbox.jpeg" alt="Outside the house is a mailbox. It has a leaflet inside.">
...    <figcaption itemprop="title">The mailbox.</figcaption>
...   </figure>
...   <footer>
...    <p id="licenses">All images licensed under the <a itemprop="license"
...    href="http://www.opensource.org/licenses/mit-license.php">MIT
...    license</a>.</p>
...   </footer>
...  </body>
... </html>"""
>>>
>>> mde = MicrodataExtractor()
>>> data = mde.extract(html)
>>> pp.pprint(data)
[{'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The house I found.',
                 'work': 'http://www.example.com/images/house.jpeg'},
  'type': 'http://n.whatwg.org/work'},
 {'properties': {'license': 'http://www.opensource.org/licenses/mit-license.php',
                 'title': 'The mailbox.',
                 'work': 'http://www.example.com/images/mailbox.jpeg'},
  'type': 'http://n.whatwg.org/work'}]

JSON-LD extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.jsonld import JsonLdExtractor
>>>
>>> html = """<!DOCTYPE HTML>
... <html>
...  <head>
...   <title>Some Person Page</title>
...  </head>
...  <body>
...   <h1>This guys</h1>
...     <script type="application/ld+json">
...     {
...       "@context": "http://schema.org",
...       "@type": "Person",
...       "name": "John Doe",
...       "jobTitle": "Graduate research assistant",
...       "affiliation": "University of Dreams",
...       "additionalName": "Johnny",
...       "url": "http://www.example.com",
...       "address": {
...         "@type": "PostalAddress",
...         "streetAddress": "1234 Peach Drive",
...         "addressLocality": "Wonderland",
...         "addressRegion": "Georgia"
...       }
...     }
...     </script>
...  </body>
... </html>"""
>>>
>>> jslde = JsonLdExtractor()
>>>
>>> data = jslde.extract(html)
>>> pp.pprint(data)
[{'@context': 'http://schema.org',
  '@type': 'Person',
  'additionalName': 'Johnny',
  'address': {'@type': 'PostalAddress',
              'addressLocality': 'Wonderland',
              'addressRegion': 'Georgia',
              'streetAddress': '1234 Peach Drive'},
  'affiliation': 'University of Dreams',
  'jobTitle': 'Graduate research assistant',
  'name': 'John Doe',
  'url': 'http://www.example.com'}]

RDFa extraction (experimental)

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.rdfa import RDFaExtractor  # you can ignore the warning about html5lib not being available
INFO:rdflib:RDFLib Version: 4.2.1
/home/paul/.virtualenvs/extruct.wheel.test/lib/python3.5/site-packages/rdflib/plugins/parsers/structureddata.py:30: UserWarning: html5lib not found! RDFa and Microdata parsers will not be available.
  'parsers will not be available.')
>>>
>>> html = """<html>
...  <head>
...    ...
...  </head>
...  <body prefix="dc: http://purl.org/dc/terms/ schema: http://schema.org/">
...    <div resource="/alice/posts/trouble_with_bob" typeof="schema:BlogPosting">
...       <h2 property="dc:title">The trouble with Bob</h2>
...       ...
...       <h3 property="dc:creator schema:creator" resource="#me">Alice</h3>
...       <div property="schema:articleBody">
...         <p>The trouble with Bob is that he takes much better photos than I do:</p>
...       </div>
...      ...
...    </div>
...  </body>
... </html>
... """
>>>
>>> rdfae = RDFaExtractor()
>>> pp.pprint(rdfae.extract(html, base_url='http://www.example.com/index.html'))
[{'@id': 'http://www.example.com/alice/posts/trouble_with_bob',
  '@type': ['http://schema.org/BlogPosting'],
  'http://purl.org/dc/terms/creator': [{'@id': 'http://www.example.com/index.html#me'}],
  'http://purl.org/dc/terms/title': [{'@value': 'The trouble with Bob'}],
  'http://schema.org/articleBody': [{'@value': '\n'
                                               '        The trouble with Bob '
                                               'is that he takes much better '
                                               'photos than I do:\n'
                                               '      '}],
  'http://schema.org/creator': [{'@id': 'http://www.example.com/index.html#me'}]}]

You'll get a list of expanded JSON-LD nodes.

Open Graph extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.opengraph import OpenGraphExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <meta property="og:type" content="article"/>
...   <meta property="og:url" content="https://www.eventeducation.com/test.php"/>
...   <meta property="og:image" content="https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"/>
...   <meta property="fb:admins" content="himanshu160"/>
...   <meta property="og:site_name" content="Event Education"/>
...   <meta property="og:description" content="Event Education provides free courses on event planning and management to event professionals worldwide."/>
...  </head>
...  <body>
...   <div id="fb-root"></div>
...   <script>(function(d, s, id) {
...               var js, fjs = d.getElementsByTagName(s)[0];
...               if (d.getElementById(id)) return;
...                  js = d.createElement(s); js.id = id;
...                  js.src = "//connect.facebook.net/en_US/all.js#xfbml=1&appId=501839739845103";
...                  fjs.parentNode.insertBefore(js, fjs);
...                  }(document, 'script', 'facebook-jssdk'));</script>
...  </body>
... </html>"""
>>>
>>> opengraphe = OpenGraphExtractor()
>>> pp.pprint(opengraphe.extract(html))
[{"namespace": {
      "og": "http://ogp.me/ns#"
  },
  "properties": [
      [
          "og:title",
          "Himanshu's Open Graph Protocol"
      ],
      [
          "og:type",
          "article"
      ],
      [
          "og:url",
          "https://www.eventeducation.com/test.php"
      ],
      [
          "og:image",
          "https://www.eventeducation.com/images/982336_wedding_dayandouan_th.jpg"
      ],
      [
          "og:site_name",
          "Event Education"
      ],
      [
          "og:description",
          "Event Education provides free courses on event planning and management to event professionals worldwide."
      ]
    ]
 }]

Microformat extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>>
>>> from extruct.microformat import MicroformatExtractor
>>>
>>> html = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
... <html xmlns="https://www.w3.org/1999/xhtml" xmlns:og="https://ogp.me/ns#" xmlns:fb="https://www.facebook.com/2008/fbml">
...  <head>
...   <title>Himanshu's Open Graph Protocol</title>
...   <meta http-equiv="Content-Type" content="text/html;charset=WINDOWS-1252" />
...   <meta http-equiv="Content-Language" content="en-us" />
...   <link rel="stylesheet" type="text/css" href="event-education.css" />
...   <meta name="verify-v1" content="so4y/3aLT7/7bUUB9f6iVXN0tv8upRwaccek7JKB1gs=" >
...   <meta property="og:title" content="Himanshu's Open Graph Protocol"/>
...   <article class="h-entry">
...    <h1 class="p-name">Microformats are amazing</h1>
...    <p>Published by <a class="p-author h-card" href="http://example.com">W. Developer</a>
...       on <time class="dt-published" datetime="2013-06-13 12:00:00">13<sup>th</sup> June 2013</time></p>
...    <p class="p-summary">In which I extoll the virtues of using microformats.</p>
...    <div class="e-content">
...     <p>Blah blah blah</p>
...    </div>
...   </article>
...  </head>
...  <body></body>
... </html>"""
>>>
>>> microformate = MicroformatExtractor()
>>> data = microformate.extract(html)
>>> pp.pprint(data)
[{"type": [
      "h-entry"
  ],
  "properties": {
      "name": [
          "Microformats are amazing"
      ],
      "author": [
          {
              "type": [
                  "h-card"
              ],
              "properties": {
                  "name": [
                      "W. Developer"
                  ],
                  "url": [
                      "http://example.com"
                  ]
              },
              "value": "W. Developer"
          }
      ],
      "published": [
          "2013-06-13 12:00:00"
      ],
      "summary": [
          "In which I extoll the virtues of using microformats."
      ],
      "content": [
          {
              "html": "\n<p>Blah blah blah</p>\n",
              "value": "\nBlah blah blah\n"
          }
      ]
    }
 }]

DublinCore extraction

>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=2)
>>> from extruct.dublincore import DublinCoreExtractor
>>> html = '''<head profile="http://dublincore.org/documents/dcq-html/">
... <title>Expressing Dublin Core in HTML/XHTML meta and link elements</title>
... <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
... <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
...
...
... <meta name="DC.title" lang="en" content="Expressing Dublin Core
... in HTML/XHTML meta and link elements" />
... <meta name="DC.creator" content="Andy Powell, UKOLN, University of Bath" />
... <meta name="DCTERMS.issued" scheme="DCTERMS.W3CDTF" content="2003-11-01" />
... <meta name="DC.identifier" scheme="DCTERMS.URI"
... content="http://dublincore.org/documents/dcq-html/" />
... <link rel="DCTERMS.replaces" hreflang="en"
... href="http://dublincore.org/documents/2000/08/15/dcq-html/" />
... <meta name="DCTERMS.abstract" content="This document describes how
... qualified Dublin Core metadata can be encoded
... in HTML/XHTML &lt;meta&gt; elements" />
... <meta name="DC.format" scheme="DCTERMS.IMT" content="text/html" />
... <meta name="DC.type" scheme="DCTERMS.DCMIType" content="Text" />
... <meta name="DC.Date.modified" content="2001-07-18" />
... <meta name="DCTERMS.modified" content="2001-07-18" />'''
>>> dublinlde = DublinCoreExtractor()
>>> data = dublinlde.extract(html)
>>> pp.pprint(data)
[ { 'elements': [ { 'URI': 'http://purl.org/dc/elements/1.1/title',
                    'content': 'Expressing Dublin Core\n'
                               'in HTML/XHTML meta and link elements',
                    'lang': 'en',
                    'name': 'DC.title'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/creator',
                    'content': 'Andy Powell, UKOLN, University of Bath',
                    'name': 'DC.creator'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/identifier',
                    'content': 'http://dublincore.org/documents/dcq-html/',
                    'name': 'DC.identifier',
                    'scheme': 'DCTERMS.URI'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/format',
                    'content': 'text/html',
                    'name': 'DC.format',
                    'scheme': 'DCTERMS.IMT'},
                  { 'URI': 'http://purl.org/dc/elements/1.1/type',
                    'content': 'Text',
                    'name': 'DC.type',
                    'scheme': 'DCTERMS.DCMIType'}],
    'namespaces': { 'DC': 'http://purl.org/dc/elements/1.1/',
                    'DCTERMS': 'http://purl.org/dc/terms/'},
    'terms': [ { 'URI': 'http://purl.org/dc/terms/issued',
                 'content': '2003-11-01',
                 'name': 'DCTERMS.issued',
                 'scheme': 'DCTERMS.W3CDTF'},
               { 'URI': 'http://purl.org/dc/terms/abstract',
                 'content': 'This document describes how\n'
                            'qualified Dublin Core metadata can be encoded\n'
                            'in HTML/XHTML <meta> elements',
                 'name': 'DCTERMS.abstract'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DC.Date.modified'},
               { 'URI': 'http://purl.org/dc/terms/modified',
                 'content': '2001-07-18',
                 'name': 'DCTERMS.modified'},
               { 'URI': 'http://purl.org/dc/terms/replaces',
                 'href': 'http://dublincore.org/documents/2000/08/15/dcq-html/',
                 'hreflang': 'en',
                 'rel': 'DCTERMS.replaces'}]}]

Command Line Tool

extruct provides a command line tool that allows you to fetch a page and extract the metadata from it directly from the command line.

Dependencies

The command line tool depends on requests, which is not installed by default when you install extruct. In order to use the command line tool, you can install extruct with the cli extra requirements:

pip install extruct[cli]

Usage

extruct "http://example.com"

Downloads "http://example.com" and outputs the Microdata, JSON-LD and RDFa, Open Graph and Microformat metadata to stdout.

Supported Parameters

By default, the command line tool will try to extract all the supported metadata formats from the page (currently Microdata, JSON-LD, RDFa, Open Graph and Microformat). If you want to restrict the output to just one or a subset of those, you can pass their individual names collected in a list through 'syntaxes' argument.

For example, this command extracts only Microdata and JSON-LD metadata from "http://example.com":

extruct "http://example.com" --syntaxes microdata json-ld

NB syntaxes names passed must correspond to these: microdata, json-ld, rdfa, opengraph, microformat

Development version

mkvirtualenv extruct
pip install -r requirements-dev.txt

Tests

Run tests in current environment:

py.test tests

Use tox to run tests with different Python versions:

tox
Owner
Scrapinghub
Turn web content into useful data
Scrapinghub
Crawler in Python 3.7, 3.8. 3.9. Pypy3

Description Python Crawler written Python 3. (Supports major Python releases Python3.6, Python3.7 and Python 3.8) Installation and Use Setup VirtualEn

Vinit Kumar 2 Mar 12, 2022
Kusonime scraper using python3

Features Scrap from url Scrap from recommendation Search by query Todo [+] Search by genre Example # Get download url from kusonime import Scrap

MhankBarBar 2 Jan 28, 2022
Google Scholar Web Scraping

Google Scholar Web Scraping This is a python script that asks for a user to input the url for a google scholar profile, and then it writes publication

Suzan M 1 Dec 12, 2021
Google Developer Profile Badge Scraper

Google Developer Profile Badge Scraper GDev Profile Badge Scraper is a Google Developer Profile Web Scraper which scrapes for specific badges in a use

Siddhant Lad 7 Jan 10, 2022
A Python package that scrapes Google News article data while remaining undetected by Google.

A Python package that scrapes Google News article data while remaining undetected by Google. Our scraper can scrape page data up until the last page and never trigger a CAPTCHA (download stats: https

Geminid Systems, Inc 6 Aug 10, 2022
A tool can scrape product in aliexpress: Title, Price, and URL Product.

Scrape-Product-Aliexpress A tool can scrape product in aliexpress: Title, Price, and URL Product. Usage: 1. Install Python 3.8 3.9 padahal halaman ins

Rahul Joshua Damanik 1 Dec 30, 2021
Find thumbnails and original images from URL or HTML file.

Haul Find thumbnails and original images from URL or HTML file. Demo Hauler on Heroku Installation on Ubuntu $ sudo apt-get install build-essential py

Vinta Chen 150 Oct 15, 2022
Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine .

TwitterScraper Script for scrape user data like "id,username,fullname,followers,tweets .. etc" by Twitter's search engine . Screenshot Data Users Only

Remax Alghamdi 19 Nov 17, 2022
Searching info from Google using Python Scrapy

Python-Search-Engine-Scrapy || Python-爬虫-索引/利用爬虫获取谷歌信息**/ Searching info from Google using Python Scrapy /* 利用 PYTHON 爬虫获取天气信息,以及城市信息和资料**/ translatio

HONGVVENG 1 Jan 06, 2022
PaperRobot: a paper crawler that can quickly download numerous papers, facilitating paper studying and management

PaperRobot PaperRobot 是一个论文抓取工具,可以快速批量下载大量论文,方便后期进行持续的论文管理与学习。 PaperRobot通过多个接口抓取论文,目前抓取成功率维持在90%以上。通过配置Config文件,可以抓取任意计算机领域相关会议的论文。 Installation Down

moxiaoxi 47 Nov 23, 2022
Audio media crawler for lbry.

Audio media crawler for lbry. Requirements Python 3.8 Poetry 1.1.7 Elasticsearch 7.14.0 Lbry-sdk 0.99.0 Development This project uses poetry as a depe

Hound.fm 4 Dec 03, 2022
Python Web Scrapper Project

Web Scrapper Projeto desenvolvido em python, sobre tudo com Selenium, BeautifulSoup e Pandas é um web scrapper que puxa uma tabela com as principais e

Jordan Ítalo Amaral 2 Jan 04, 2022
An arxiv spider

An Arxiv Spider 做为一个cser,杰出男孩深知内核对连接到计算机上的硬件设备进行管理的高效方式是中断而不是轮询。每当小伙伴发来一篇刚挂在arxiv上的”热乎“好文章时,杰出男孩都会感叹道:”师兄这是每天都挂在arxiv上呀,跑的好快~“。于是杰出男孩找了找 github,借鉴了一下其

Jie Liu 11 Sep 09, 2022
Poolbooru gelscraper - a simple python script for scraping images off gelbooru pools.

poolbooru_gelscraper a simple python script for scraping images off gelbooru pools. modules required:requests_html, and os by default saves files with

savantshuia 1 Jan 02, 2022
A training task for web scraping using python multithreading and a real-time-updated list of available proxy servers.

Parallel web scraping The project is a training task for web scraping using python multithreading and a real-time-updated list of available proxy serv

Kushal Shingote 1 Feb 10, 2022
京东茅台抢购 2021年4月最新版

Jd_Seckill 特别声明: 本仓库发布的jd_seckill项目中涉及的任何脚本,仅用于测试和学习研究,禁止用于商业用途,不能保证其合法性,准确性,完整性和有效性,请根据情况自行判断。 本项目内所有资源文件,禁止任何公众号、自媒体进行任何形式的转载、发布。 huanghyw 对任何脚本问题概不

45 Dec 14, 2022
Automatically scrapes all menu items from the Taco Bell website

Automatically scrapes all menu items from the Taco Bell website. Returns as PANDAS dataframe.

Sasha 2 Jan 15, 2022
Scrapes proxies and saves them to a text file

Proxy Scraper Scrapes proxies from https://proxyscrape.com and saves them to a file. Also has a customizable theme system Made by nell and Lamp

nell 2 Dec 22, 2021
Instagram_scrapper - This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or excel file easily.

Instagram_scrapper This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or exce

Lakhdar Belkharroubi 5 Oct 17, 2022
Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage import lassie lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 570 Dec 19, 2022