Web Content Retrieval for Humans™

Last update: Dec 19, 2022

Overview

Lassie

https://img.shields.io/coveralls/michaelhelmick/lassie/master.svg?style=flat-square

https://img.shields.io/badge/Say%20Thanks!-:)-1EAEDB.svg?style=flat-square

Lassie is a Python library for retrieving basic content from websites.

Usage

>>> import lassie
>>> lassie.fetch('http://www.youtube.com/watch?v=dQw4w9WgXcQ')
{
    'description': u'Music video by Rick Astley performing Never Gonna Give You Up. YouTube view counts pre-VEVO: 2,573,462 (C) 1987 PWL',
    'videos': [{
        'src': u'http://www.youtube.com/v/dQw4w9WgXcQ?autohide=1&version=3',
        'height': 480,
        'type': u'application/x-shockwave-flash',
        'width': 640
    }, {
        'src': u'https://www.youtube.com/embed/dQw4w9WgXcQ',
        'height': 480,
        'width': 640
    }],
    'title': u'Rick Astley - Never Gonna Give You Up',
    'url': u'http://www.youtube.com/watch?v=dQw4w9WgXcQ',
    'keywords': [u'Rick', u'Astley', u'Sony', u'BMG', u'Music', u'UK', u'Pop'],
    'images': [{
        'src': u'http://i1.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg?feature=og',
        'type': u'og:image'
    }, {
        'src': u'http://i1.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg',
        'type': u'twitter:image'
    }, {
        'src': u'http://s.ytimg.com/yts/img/favicon-vfldLzJxy.ico',
        'type': u'favicon'
    }, {
        'src': u'http://s.ytimg.com/yts/img/favicon_32-vflWoMFGx.png',
        'type': u'favicon'
    }],
    'locale': u'en_US'
}

Install

Install Lassie via pip

$ pip install lassie

or, with easy_install

$ easy_install lassie

But, hey... that's up to you.

Documentation

Documentation can be found here: https://lassie.readthedocs.org/

Comments

Fix possible ValueError in convert_to_int caused by values like 1px

When trying to parse http://www.wired.com/wiredscience/2013/09/rim-fire-map-color-scale/ a ValueError was raised in convert_to_img, because the page has image width and height values ending in px.

I changed the function to be more liberal regarding dimension values, by extracting the digits before casting to int. I added a test for this.

Not sure though if the value should be converted to int at all or kept as a string.

opened by yaph 14

Import fails on Python3.5

It appears something is seriously broken when trying to install lassie with Python 3.5. Install goes fine but when importing I get here:

Python 3.5.0 (default, Sep 23 2015, 04:41:38)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lassie
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ben/dev/beavy/venv/src/lassie/lassie/__init__.py", line 19, in <module>
    from .api import fetch
  File "/Users/ben/dev/beavy/venv/src/lassie/lassie/api.py", line 11, in <module>
    from .core import Lassie
  File "/Users/ben/dev/beavy/venv/src/lassie/lassie/core.py", line 13, in <module>
    from bs4 import BeautifulSoup
  File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/__init__.py", line 30, in <module>
    from .builder import builder_registry, ParserRejectedMarkup
  File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/builder/__init__.py", line 308, in <module>
    from . import _htmlparser
  File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/builder/_htmlparser.py", line 7, in <module>
    from html.parser import (
ImportError: cannot import name 'HTMLParseError'

opened by gnunicorn 6

Add optional structured properties for og:image and og:video

From http://ogp.me/#structured.

The og:video tag has the identical tags as og:image.

og:image:url - Identical to og:image. og:image:secure_url - An alternate url to use if the webpage requires HTTPS. og:image:type - A MIME type for this image. og:image:width - The number of pixels wide. og:image:height - The number of pixels high.

opened by jpadilla 6
Optional support for canonical URL meta tag.

This is very roughed in, but it adds support for returning the URL as provided by the canonical link element.

There isn't anything to determine precedence with og:url.

Has passing tests, and is disabled by default.

Needed this for a project, not sure if it would be useful upstream.
enhancement

opened by jmhobbs 5
Possible relative URL in og:image

I just came accros a page with a relative path value for the og:image. Adding a call to urljoin on the src attribute in line 186 of core.py would be a possibility, but maybe it's better to check for the src prop (possibly href prop too) in _filter_meta_data and do it there. What do you think about that?

opened by yaph 5
Can't get the full article.

Hi, I want to extract the article from the source url. I got only the title of the article and small parts of it under the "description" parameter.

opened by yaseenox 4
Update requests==2.8 in setup.py, too

The changelog for the last release states, that request is now pinned at version 2.8, yet when installing the latest version of lassie, it requires (and install) version 2.6 – the setup.py hasn't been updated to reflect that change and breaks the installation. This PR corrects that.

opened by gnunicorn 4
Please allow to configure the requests session

It would be useful to be able to configure the requests session used to retrieve the requested URL.

You could perhaps initialize a default session object in the Lassie constructor, which the user could then configure, and/or add a parameter to Lassie.fetch() to override the default session.

opened by tawmas 4
Bump requests from 2.18.4 to 2.20.0
⚠️ Dependabot is rebasing this PR ⚠️

If you make any changes to it yourself then they will take precedence over the rebase.

Bumps requests from 2.18.4 to 2.20.0.

Changelog

Sourced from requests's changelog.

2.20.0 (2018-10-18)

Bugfixes

Content-Type header parsing is now case-insensitive (e.g. charset=utf8 v Charset=utf8).

Fixed exception leak where certain redirect urls would raise uncaught urllib3 exceptions.

Requests removes Authorization header from requests redirected from https to http on the same hostname. (CVE-2018-18074)

should_bypass_proxies now handles URIs without hostnames (e.g. files).

Dependencies

Requests now supports urllib3 v1.24.

Deprecations

Requests has officially stopped support for Python 2.6.

2.19.1 (2018-06-14)

Bugfixes

Fixed issue where status_codes.py's init function failed trying to append to a __doc__ value of None.

2.19.0 (2018-06-12)

Improvements

Warn user about possible slowdown when using cryptography version < 1.3.4

Check for invalid host in proxy URL, before forwarding request to adapter.

Fragments are now properly maintained across redirects. (RFC7231 7.1.2)

Removed use of cgi module to expedite library load time.

Added support for SHA-256 and SHA-512 digest auth algorithms.

Minor performance improvement to Request.content.

Migrate to using collections.abc for 3.7 compatibility.

Bugfixes

Parsing empty Link headers with parse_header_links() no longer return one bogus entry.

... (truncated)

Commits

bd84045 v2.20.0

7fd9267 remove final remnants from 2.6

6ae8a21 Add myself to AUTHORS

89ab030 Use comprehensions whenever possible

2c6a842 Merge pull request #4827 from webmaven/patch-1

30be889 CVE URLs update: www sub-subdomain no longer valid

a6cd380 Merge pull request #4765 from requests/encapsulate_urllib3_exc

bbdbcc8 wrap url parsing exceptions from urllib3's PoolManager

ff0c325 Merge pull request #4805 from jdufresne/https

b0ad249 Prefer https:// for URLs throughout project

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot ignore this [patch|minor|major] version will close this PR and stop Dependabot creating any more for this minor/major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 3
Added support for open graph optional property `site_name`.

Hi, I added supported for the open graph site_name property.

This parse the following tag... <meta property="og:site_name" content="IMDb" /> into {"site_name": "IMDb"}

opened by cameronmaske 3
make image urls absolute and added mock to test_requirements

I made a change so that when lassie.fetch is called with all_images=True the images src attributes contain absolute URLs. Since lassie already comes with a function that makes relative URLs absolute, I think it's better done inside lassie than in the application which imports it.

When trying to run the tests after the changes the mock package was missing, so I added it to the test_requirements.txt file.

opened by yaph 2
docs: Fix a few typos
There are small typos in:

docs/usage/advanced_usage.rst

Fixes:

Should read attributes rather than attibutes.

Should read actual rather than acutal.

Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
opened by timgates42 0
Any reason to pindown upper version in requirements.txt
Hi,

Since lassie is a library, limiting upper versions for dependencies as in

requests>=2.18.4,<3.0.0 beautifulsoup4>=4.9.0,<4.10.0

can lead to conflicts for software using it, e.g. on pip install:

The conflict is caused by: The user requested beautifulsoup4==4.10.0 lassie 0.11.11 depends on beautifulsoup4<4.10.0 and >=4.9.0

Is there any reason for the pindown?
opened by idlesign 1
Encoding issues with german umlauts

Hi,

when getting the description from a German website the "ü" "ä" etc. end up being "Ã¤", "Ã¼" etc. Example: https://finanzguru.de/ Result:

Finanzguru - Finanzen magisch einfach Finanzen magisch einfach. Verwalte deine VertrÃ¤ge, kÃ¼ndige per Fingertipp und spare Geld mit meinen Spartipps. Alles an einem Ort und komplett kostenfrei. Einfacher war es noch nie.

I am using lassie within Django.

opened by leugh 0
Add new filters for embeddable items

The idea is to return as much data as we can in the API so users can possibly embed media. (i.e. Spotify tracks)

We'll probably add a new embed.py and return a new embed key in the lassie API response.
enhancement

opened by michaelhelmick 0

Releases(0.11.11)

0.11.11(Aug 20, 2021)
Fix PyPI release

Source code(tar.gz)
Source code(zip)
0.11.10(Aug 20, 2021)
Add html to response dict when available.

Upgrade to GitHub Actions

Source code(tar.gz)
Source code(zip)
0.11.8(Dec 16, 2020)
Update requests dependency

Source code(tar.gz)
Source code(zip)
0.11.2(Nov 1, 2017)
Add support for OEmbed providers (YouTube)

Source code(tar.gz)
Source code(zip)
0.10.0(Feb 3, 2017)
Fix issue where a website may have malformed HTML and no tag causing soup.html to be None (#60)

Updated beautifulsoup4 to 4.5.3

Update html5lib to 1.0b10

Source code(tar.gz)
Source code(zip)
0.9.0(Jan 29, 2017)
Added a default fake user agent to use instead of using python-requests/version (some websites will mark certain user agents as bot attempts)

Updated requests to 2.13.0

Source code(tar.gz)
Source code(zip)
0.8.7(Dec 21, 2016)
Fix Python 3 support

Handle empty AMP image lists

Source code(tar.gz)
Source code(zip)
0.8.6(Nov 17, 2016)
Handle AMP image list of strings vs list of objects

Source code(tar.gz)
Source code(zip)
0.8.5(Nov 3, 2016)
Handle AMP data that is contained in a list

Retrieve videos and thumbnails (as images) from AMP VideoObjects

Source code(tar.gz)
Source code(zip)
0.8.4(Nov 1, 2016)
Fix issue where AMP images could be lists inside an object

Source code(tar.gz)
Source code(zip)
0.8.3(Nov 1, 2016)
Fix issue where some keys returned (i.e. description) would not be retrieved if the key existed with an empty value already

Source code(tar.gz)
Source code(zip)
0.8.2(Nov 1, 2016)
Fix issue where AMP images could be images and not objects

Source code(tar.gz)
Source code(zip)
0.8.1(Nov 1, 2016)
Add support for AMP "description" attribute

Fix issue where an error would be thrown if width/height of an image weren't strings

Fix duplicate AMP title request, should have been url

Source code(tar.gz)
Source code(zip)
0.8.0(Nov 1, 2016)
Add support for links that use AMP

Source code(tar.gz)
Source code(zip)
0.7.1(Jul 27, 2016)
Add support for open graph site_name

Source code(tar.gz)
Source code(zip)
0.7.0(Jul 27, 2016)
Add status_code to response dictionary

Source code(tar.gz)
Source code(zip)
0.6.2(Nov 11, 2015)
Pinned requests library to version 2.8.1

Pinned beautifulsoup4 library to version 4.4.1

Add Python 3.5 to Travis CI build matrix (officially support 3.5)

Source code(tar.gz)
Source code(zip)
0.5.4(Aug 19, 2015)
Support for secure url image and videos from Open Graph

Simplified merge_settings and data updating internally

Source code(tar.gz)
Source code(zip)
0.5.3(Jul 13, 2015)

Handle when a website doesn't set a value on the "keywords" meta tag
Source code(tar.gz)
Source code(zip)
0.3.0(Aug 17, 2013)
Added support for locale to be returned. If lang is specified in the html tag and it normalizes to an actual locale, it will be added to the returned data.

Fixed bug where height was not being returned for body images

Added test coverage, we're 100% covered! :D

Source code(tar.gz)
Source code(zip)
0.2.0(Aug 6, 2013)

Fix package error when importing
Source code(tar.gz)
Source code(zip)
0.1.0(Aug 6, 2013)

Initial Release
Source code(tar.gz)
Source code(zip)

Owner

Mike Helmick

GitHub Repository https://lassie.readthedocs.org

Scrapping the data from each page of biocides listed on the BAUA website into a csv file

1 Nov 30, 2021

WebScrapping Project - G1 Latest News

Web Scrapping com Python Esse projeto consiste em um código para o usuário buscar as últimas nóticias sobre um termo qualquer, no site G1. Para esse p

2 Feb 13, 2022

Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

1 Nov 03, 2021

Instagram_scrapper - This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or excel file easily.

Instagram_scrapper This project allow you to scrape the list of followers, following or both from a public Instagram account, and create a csv or exce

5 Oct 17, 2022

a small library for extracting rich content from urls

A small library for extracting rich content from urls. what does it do? micawber supplies a few methods for retrieving rich metadata about a variety o

588 Dec 27, 2022

A spider for Universal Online Judge(UOJ) system, converting problem pages to PDFs.

Universal Online Judge Spider Introduction This is a spider for Universal Online Judge (UOJ) system (https://uoj.ac/). It also works for all other Onl

1 Dec 07, 2021

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo.

Crawler do site Fundamentus.com com o uso do framework scrapy, tanto da aba detalhada como a de resumo. (Todas as infomações)

3 Oct 04, 2022

✂️🕷️ Spider-Cut is a Network Mapper Framework (NMAP Framework)

Spider-Cut is a Network Mapper Framework (NMAP Framework) Installation | Usage | Creators | Donate Installation # Kali Linux | WSL

3 Mar 07, 2022

让中国用户使用git从github下载的速度提高1000倍!

序言 github上有很多好项目,但是国内用户连github却非常的慢.每次都要用插件或者其他工具来解决. 这次自己做一个小工具,输入github原地址后,就可以自动替换为代理地址,方便大家更快速的下载. 安装 pip install cit 主要功能与用法主要功能 change 将目标地址转换为

35 Aug 29, 2022

Audio media crawler for lbry.

Audio media crawler for lbry. Requirements Python 3.8 Poetry 1.1.7 Elasticsearch 7.14.0 Lbry-sdk 0.99.0 Development This project uses poetry as a depe

4 Dec 03, 2022

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors

Parsel Parsel is a BSD-licensed Python library to extract and remove data from HTML and XML using XPath and CSS selectors, optionally combined with re

859 Dec 29, 2022

This is a web crawler that works on employ email data by gmane.org and visualizes it in different ways.

crawler_to_visual_gmane Analyzing an EMAIL Archive from gmane and vizualizing the data using the D3 JavaScript library. This is a set of tools that al

1 Dec 20, 2021

Binance Smart Chain Contract Scraper + Contract Evaluator

Pulls Binance Smart Chain feed of newly-verified contracts every 30 seconds, then checks their contract code for links to socials.Returns only those with socials information included, and then submit

14 Dec 09, 2022

A Scrapper with python

Scrapper-en-python Scrapper des données signifie récuperer des données pour les traiter ou les analyser. En python, il y'a 2 grands moyens de scrapper

1 Dec 05, 2021

A web service for scanning media hosted by a Matrix media repository

Matrix Content Scanner A web service for scanning media hosted by a Matrix media repository Installation TODO Development In a virtual environment wit

5 Dec 01, 2022

Parse feeds in Python

Web-scraping - Program that scrapes a website for a collection of quotes, picks one at random and displays it

web-scraping Program that scrapes a website for a collection of quotes, picks on

1 Jan 07, 2022

Quick Project made to help scrape Lexile and Atos(AR) levels from ISBN

Lexile-Atos-Scraper Quick Project made to help scrape Lexile and Atos(AR) levels from ISBN You will need to install the chrome webdriver if you have n

1 Feb 11, 2022

This project was created using Python technology and flask tools to scrape a music site

python-scrapping This project was created using Python technology and flask tools to scrape a music site You need to install the following packages to

1 Dec 07, 2021

原神爬虫抓取原神界面圣遗物信息

原神圣遗物半自动爬虫说明直接抓取原神界面中的圣遗物数据目前只适配了背包页面的抓取准确率：97.5%(普通通用接口，对 40 件随机圣遗物识别，统计完全正确的数量为 39) 准确率：100%(4k 屏幕，普通通用接口，对 110 件圣遗物识别，统计完全正确的数量为 110) 不排除还有小错误的

28 Oct 10, 2022

Web Content Retrieval for Humans™

Related tags

Overview

Lassie

Usage

Install

Documentation

Comments

2.20.0 (2018-10-18)

2.19.1 (2018-06-14)

2.19.0 (2018-06-12)

Releases(0.11.11)

0.11.11(Aug 20, 2021)

0.11.10(Aug 20, 2021)

0.11.8(Dec 16, 2020)

0.11.2(Nov 1, 2017)

0.10.0(Feb 3, 2017)

0.9.0(Jan 29, 2017)

0.8.7(Dec 21, 2016)

0.8.6(Nov 17, 2016)

0.8.5(Nov 3, 2016)

0.8.4(Nov 1, 2016)

0.8.3(Nov 1, 2016)

0.8.2(Nov 1, 2016)

0.8.1(Nov 1, 2016)

0.8.0(Nov 1, 2016)

0.7.1(Jul 27, 2016)

0.7.0(Jul 27, 2016)

0.6.2(Nov 11, 2015)

0.5.4(Aug 19, 2015)

0.5.3(Jul 13, 2015)

0.3.0(Aug 17, 2013)

0.2.0(Aug 6, 2013)

0.1.0(Aug 6, 2013)