Web Content Retrieval for Humans™

Overview

Lassie

Lassie is a Python library for retrieving basic content from websites.

Usage

>>> import lassie
>>> lassie.fetch('http://www.youtube.com/watch?v=dQw4w9WgXcQ')
{
    'description': u'Music video by Rick Astley performing Never Gonna Give You Up. YouTube view counts pre-VEVO: 2,573,462 (C) 1987 PWL',
    'videos': [{
        'src': u'http://www.youtube.com/v/dQw4w9WgXcQ?autohide=1&version=3',
        'height': 480,
        'type': u'application/x-shockwave-flash',
        'width': 640
    }, {
        'src': u'https://www.youtube.com/embed/dQw4w9WgXcQ',
        'height': 480,
        'width': 640
    }],
    'title': u'Rick Astley - Never Gonna Give You Up',
    'url': u'http://www.youtube.com/watch?v=dQw4w9WgXcQ',
    'keywords': [u'Rick', u'Astley', u'Sony', u'BMG', u'Music', u'UK', u'Pop'],
    'images': [{
        'src': u'http://i1.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg?feature=og',
        'type': u'og:image'
    }, {
        'src': u'http://i1.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg',
        'type': u'twitter:image'
    }, {
        'src': u'http://s.ytimg.com/yts/img/favicon-vfldLzJxy.ico',
        'type': u'favicon'
    }, {
        'src': u'http://s.ytimg.com/yts/img/favicon_32-vflWoMFGx.png',
        'type': u'favicon'
    }],
    'locale': u'en_US'
}
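
The returned value is a plain dictionary, so individual fields can be read with ordinary dict operations. The following is a minimal sketch, assuming a result shaped like the one above. The `result` literal is a trimmed copy of that output (so the example runs without network access), and `first_image_of_type` is a hypothetical helper, not part of Lassie.

```python
# Trimmed copy of the result shown above. A real call would be
# result = lassie.fetch(url), which needs network access.
result = {
    'title': 'Rick Astley - Never Gonna Give You Up',
    'images': [
        {'src': 'http://i1.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg?feature=og',
         'type': 'og:image'},
        {'src': 'http://i1.ytimg.com/vi/dQw4w9WgXcQ/hqdefault.jpg',
         'type': 'twitter:image'},
        {'src': 'http://s.ytimg.com/yts/img/favicon-vfldLzJxy.ico',
         'type': 'favicon'},
    ],
}


def first_image_of_type(data, image_type):
    """Return the src of the first image with the given type, or None.

    Hypothetical helper for illustration; not part of Lassie.
    """
    for image in data.get('images', []):
        if image.get('type') == image_type:
            return image.get('src')
    return None


print(result['title'])
print(first_image_of_type(result, 'og:image'))
```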

Install

Install Lassie via pip

$ pip install lassie

or, with easy_install

$ easy_install lassie

But, hey... that's up to you.

Documentation

Documentation can be found here: https://lassie.readthedocs.org/

Comments
  • Fix possible ValueError in convert_to_int caused by values like 1px

    When trying to parse http://www.wired.com/wiredscience/2013/09/rim-fire-map-color-scale/ a ValueError was raised in convert_to_int, because the page has image width and height values ending in px.

    I changed the function to be more liberal regarding dimension values, by extracting the digits before casting to int. I added a test for this.

    Not sure though if the value should be converted to int at all or kept as a string.

    opened by yaph 14
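
The approach described in this issue (extract the digits, then cast to int) could look roughly like the sketch below. It is not the actual patch: `convert_to_int_lenient` is a hypothetical name, and returning `None` when no digits are found is an assumption made here for illustration.

```python
import re


def convert_to_int_lenient(value):
    """Extract the leading run of digits from value and cast it to int.

    Hypothetical sketch, not Lassie's implementation. Returns None when
    the value does not start with digits, instead of raising ValueError.
    """
    match = re.match(r'\s*(\d+)', str(value))
    return int(match.group(1)) if match else None


print(convert_to_int_lenient('1px'))   # 1
print(convert_to_int_lenient(640))     # 640
print(convert_to_int_lenient('auto'))  # None
```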
  • Import fails on Python3.5

    It appears something is seriously broken when trying to install lassie with Python 3.5. Installation goes fine, but importing fails with:

    Python 3.5.0 (default, Sep 23 2015, 04:41:38)
    [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.72)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lassie
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/ben/dev/beavy/venv/src/lassie/lassie/__init__.py", line 19, in <module>
        from .api import fetch
      File "/Users/ben/dev/beavy/venv/src/lassie/lassie/api.py", line 11, in <module>
        from .core import Lassie
      File "/Users/ben/dev/beavy/venv/src/lassie/lassie/core.py", line 13, in <module>
        from bs4 import BeautifulSoup
      File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/__init__.py", line 30, in <module>
        from .builder import builder_registry, ParserRejectedMarkup
      File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/builder/__init__.py", line 308, in <module>
        from . import _htmlparser
      File "/Users/ben/dev/beavy/venv/lib/python3.5/site-packages/bs4/builder/_htmlparser.py", line 7, in <module>
        from html.parser import (
    ImportError: cannot import name 'HTMLParseError'
    
    opened by gnunicorn 6
  • Add optional structured properties for og:image and og:video

    From http://ogp.me/#structured.

    The og:video tag has the identical tags as og:image.

    • og:image:url - Identical to og:image.
    • og:image:secure_url - An alternate url to use if the webpage requires HTTPS.
    • og:image:type - A MIME type for this image.
    • og:image:width - The number of pixels wide.
    • og:image:height - The number of pixels high.

    opened by jpadilla 6
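
In the Open Graph protocol, a structured property such as og:image:width belongs to the root tag that precedes it. The sketch below shows one way that grouping could be done on already-extracted (property, content) pairs. `group_structured` and the example URLs are hypothetical, and this is not how Lassie itself implements the feature.

```python
def group_structured(pairs, root='og:image'):
    """Group structured properties under the root tag that precedes them.

    pairs is a list of (property, content) tuples in document order.
    Hypothetical helper for illustration only.
    """
    prefix = root + ':'
    items = []
    for prop, content in pairs:
        if prop == root:
            # A new root tag starts a new item.
            items.append({'url': content})
        elif prop.startswith(prefix):
            if not items:
                # Structured property with no preceding root tag.
                items.append({})
            items[-1][prop[len(prefix):]] = content
    return items


meta = [
    ('og:image', 'http://example.com/a.jpg'),
    ('og:image:width', '400'),
    ('og:image:height', '300'),
    ('og:title', 'Example'),
    ('og:image', 'http://example.com/b.jpg'),
]
print(group_structured(meta))
```

Passing `root='og:video'` applies the same grouping to video tags.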
  • Optional support for canonical URL meta tag.

    This is very roughed in, but it adds support for returning the URL as provided by the canonical link element.

    There isn't anything to determine precedence with og:url.

    Has passing tests, and is disabled by default.

    Needed this for a project, not sure if it would be useful upstream.

    enhancement 
    opened by jmhobbs 5
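
For illustration, reading the canonical link element from a page could be sketched with the standard library alone. `CanonicalLinkParser` and `extract_canonical` are hypothetical names. Lassie itself parses pages with BeautifulSoup, so this is not the code from the pull request.

```python
from html.parser import HTMLParser


class CanonicalLinkParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != 'link' or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attributes.get('rel') or '').lower().split()
        if 'canonical' in rel_tokens:
            self.canonical = attributes.get('href')


def extract_canonical(html):
    """Return the canonical URL declared in html, or None if absent."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    return parser.canonical


page = '<head><link rel="canonical" href="http://example.com/page"></head>'
print(extract_canonical(page))
```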
  • Possible relative URL in og:image

    I just came across a page with a relative path value for the og:image. Adding a call to urljoin on the src attribute in line 186 of core.py would be a possibility, but maybe it's better to check for the src prop (possibly href prop too) in _filter_meta_data and do it there. What do you think about that?

    opened by yaph 5
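
The urljoin call mentioned in this issue resolves a relative src against the page URL and leaves absolute URLs untouched. A minimal sketch, where `absolutize` and the example.com URLs are hypothetical:

```python
from urllib.parse import urljoin


def absolutize(page_url, src):
    """Resolve src against page_url; absolute URLs pass through unchanged."""
    return urljoin(page_url, src)


page_url = 'http://example.com/articles/post.html'
print(absolutize(page_url, '/static/cover.png'))
print(absolutize(page_url, 'images/cover.png'))
print(absolutize(page_url, 'https://cdn.example.com/cover.png'))
```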
  • Can't get the full article.

    Hi, I want to extract the article from the source url. I got only the title of the article and small parts of it under the "description" parameter.

    opened by yaseenox 4
  • Update requests==2.8 in setup.py, too

    The changelog for the last release states that requests is now pinned at version 2.8, yet installing the latest version of lassie requires (and installs) version 2.6. The setup.py hasn't been updated to reflect that change, which breaks the installation. This PR corrects that.

    opened by gnunicorn 4
  • Please allow to configure the requests session

    It would be useful to be able to configure the requests session used to retrieve the requested URL.

    You could perhaps initialize a default session object in the Lassie constructor, which the user could then configure, and/or add a parameter to Lassie.fetch() to override the default session.

    opened by tawmas 4
  • Bump requests from 2.18.4 to 2.20.0

    Bumps requests from 2.18.4 to 2.20.0.

    Changelog

    Sourced from requests's changelog.

    2.20.0 (2018-10-18)

    Bugfixes

    • Content-Type header parsing is now case-insensitive (e.g. charset=utf8 v Charset=utf8).
    • Fixed exception leak where certain redirect urls would raise uncaught urllib3 exceptions.
    • Requests removes Authorization header from requests redirected from https to http on the same hostname. (CVE-2018-18074)
    • should_bypass_proxies now handles URIs without hostnames (e.g. files).

    Dependencies

    • Requests now supports urllib3 v1.24.

    Deprecations

    • Requests has officially stopped support for Python 2.6.

    2.19.1 (2018-06-14)

    Bugfixes

    • Fixed issue where status_codes.py's init function failed trying to append to a __doc__ value of None.

    2.19.0 (2018-06-12)

    Improvements

    • Warn user about possible slowdown when using cryptography version < 1.3.4
    • Check for invalid host in proxy URL, before forwarding request to adapter.
    • Fragments are now properly maintained across redirects. (RFC7231 7.1.2)
    • Removed use of cgi module to expedite library load time.
    • Added support for SHA-256 and SHA-512 digest auth algorithms.
    • Minor performance improvement to Request.content.
    • Migrate to using collections.abc for 3.7 compatibility.

    Bugfixes

    • Parsing empty Link headers with parse_header_links() no longer return one bogus entry.
    ... (truncated)
    Commits
    • bd84045 v2.20.0
    • 7fd9267 remove final remnants from 2.6
    • 6ae8a21 Add myself to AUTHORS
    • 89ab030 Use comprehensions whenever possible
    • 2c6a842 Merge pull request #4827 from webmaven/patch-1
    • 30be889 CVE URLs update: www sub-subdomain no longer valid
    • a6cd380 Merge pull request #4765 from requests/encapsulate_urllib3_exc
    • bbdbcc8 wrap url parsing exceptions from urllib3's PoolManager
    • ff0c325 Merge pull request #4805 from jdufresne/https
    • b0ad249 Prefer https:// for URLs throughout project
    • Additional commits viewable in compare view

    dependencies 
    opened by dependabot[bot] 3
  • Added support for open graph optional property `site_name`.

    Hi, I added support for the Open Graph site_name property.

    This parses the following tag: <meta property="og:site_name" content="IMDb" /> into {"site_name": "IMDb"}

    opened by cameronmaske 3
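
The mapping described in this pull request, from the meta tag to a `site_name` key, can be sketched with the standard library. `SiteNameParser` and `extract_site_name` are hypothetical names. Lassie uses BeautifulSoup internally, so this only illustrates the idea.

```python
from html.parser import HTMLParser


class SiteNameParser(HTMLParser):
    """Capture the content of <meta property="og:site_name">."""

    def __init__(self):
        super().__init__()
        self.site_name = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also reach this method.
        if tag != 'meta':
            return
        attributes = dict(attrs)
        if attributes.get('property') == 'og:site_name':
            self.site_name = attributes.get('content')


def extract_site_name(html):
    """Return {'site_name': ...} if the tag is present, else {}."""
    parser = SiteNameParser()
    parser.feed(html)
    return {'site_name': parser.site_name} if parser.site_name else {}


print(extract_site_name('<meta property="og:site_name" content="IMDb" />'))
```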
  • make image urls absolute and added mock to test_requirements

    I made a change so that when lassie.fetch is called with all_images=True the images src attributes contain absolute URLs. Since lassie already comes with a function that makes relative URLs absolute, I think it's better done inside lassie than in the application which imports it.

    When trying to run the tests after the changes the mock package was missing, so I added it to the test_requirements.txt file.

    opened by yaph 2
  • docs: Fix a few typos

    There are small typos in:

    • docs/usage/advanced_usage.rst

    Fixes:

    • Should read attributes rather than attibutes.
    • Should read actual rather than acutal.

    Semi-automated pull request generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    opened by timgates42 0
  • Any reason to pindown upper version in requirements.txt

    Hi,

    Since lassie is a library, limiting upper versions for dependencies as in

    requests>=2.18.4,<3.0.0
    beautifulsoup4>=4.9.0,<4.10.0
    

    can lead to conflicts for software using it, e.g. on pip install:

    The conflict is caused by:
        The user requested beautifulsoup4==4.10.0
        lassie 0.11.11 depends on beautifulsoup4<4.10.0 and >=4.9.0
    

    Is there any reason for the pindown?

    opened by idlesign 1
  • Encoding issues with german umlauts

    Hi,

    When getting the description from a German website, umlauts such as "ü" and "ä" come back incorrectly encoded. Example: https://finanzguru.de/ Result:

    Finanzguru - Finanzen magisch einfach Finanzen magisch einfach. Verwalte deine Verträge, kündige per Fingertipp und spare Geld mit meinen Spartipps. Alles an einem Ort und komplett kostenfrei. Einfacher war es noch nie.

    I am using lassie within Django.

    opened by leugh 0
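
One common cause of garbled umlauts is UTF-8 bytes being decoded as Latin-1. Whether that is what happens in this report is not confirmed here. The sketch below only reproduces that failure mode on a fragment of the description above and shows that it can be reversed.

```python
# Fragment of the description quoted in this issue.
text = 'Verträge, kündige'

# Encode correctly as UTF-8, then decode with the wrong codec.
raw = text.encode('utf-8')
garbled = raw.decode('latin-1')
print(garbled)  # VertrÃ¤ge, kÃ¼ndige

# Reversing the wrong decode recovers the original text.
repaired = garbled.encode('latin-1').decode('utf-8')
print(repaired == text)  # True
```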
  • Add new filters for embeddable items

    The idea is to return as much data as we can in the API so users can possibly embed media. (i.e. Spotify tracks)

    We'll probably add a new embed.py and return a new embed key in the lassie API response.

    enhancement 
    opened by michaelhelmick 0
Releases(0.11.11)