Libextract: extract data from websites

Libextract is a statistics-enabled data extraction library, written in Python, that works on HTML and XML documents. Originating from eatiht, the extraction algorithm makes one simple assumption: data appear as collections of repetitive elements. You can read about the reasoning here.
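
To make the "repetitive elements" assumption concrete, here is a minimal sketch of the idea. It is not libextract's actual implementation: the function name `rank_repetitive_parents` and the sample markup are hypothetical, and it uses the standard library's `xml.etree.ElementTree` on well-formed markup rather than lxml. It scores every element by how often its most common child tag repeats and returns the top candidates.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def rank_repetitive_parents(markup, count=5):
    """Rank elements by how often their most common child tag repeats.

    Illustrative only: a hypothetical helper, not part of libextract.
    """
    root = ET.fromstring(markup)
    scored = []
    for parent in root.iter():
        # Count the tags of the direct children of this element
        child_tags = Counter(child.tag for child in parent)
        if not child_tags:
            continue
        tag, freq = child_tags.most_common(1)[0]
        scored.append((freq, parent.tag, tag))
    # Highest repetition first; the sort is stable, so ties keep document order
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:count]


# Made-up sample document: the table repeats <tr> three times
sample = """
<html><body>
  <div><p>intro</p></div>
  <table>
    <tr><td>a</td></tr>
    <tr><td>b</td></tr>
    <tr><td>c</td></tr>
  </table>
</body></html>
"""

ranked = rank_repetitive_parents(sample)
print(ranked[0])
```

In this sample the table ranks first because its `tr` children repeat more than any other element's children, which is the kind of signal the assumption relies on.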
  

  
   Overview 

  
 
  
   
  libextract.api.extract(document, encoding='utf-8', count=5)
 
   
 
  
   
  Given an HTML document and, optionally, its encoding, return a list of the nodes most likely to contain data (5 by default).
 
   

 
  

  
   Installation 
pip install libextract
  

  
   Usage 
Due to our simple definition of "data", we expose a single method as the interface. Post-processing is up to you.

 
 
from requests import get
from libextract.api import extract

# Fetch the page and pass the raw response bytes to libextract
r = get('http://en.wikipedia.org/wiki/Information_extraction')
textnodes = list(extract(r.content))

  
Using lxml's built-in methods for post-processing: 

 
 
>>> print(textnodes[0].text_content())
Information extraction (IE) is the task of automatically extracting structured information...

  
The extraction algorithm is agnostic to the kind of content: it handles tabular data the same way it handles article text:

 
 
height_data = get("http://en.wikipedia.org/wiki/Human_height")
tabs = list(extract(height_data.content))

  

 
 
>>> [elem.text_content() for elem in tabs[0].iter('th')]
['Country/Region',
 'Average male height',
 'Average female height',
 ...]
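
Since post-processing is left to the caller, a small helper can turn an extracted table node into rows of cell text. The sketch below is only an illustration: `table_to_rows` is a hypothetical name, and the sample table is made-up data built with the standard library's `xml.etree.ElementTree`. It relies only on `iter()` and `itertext()`, which lxml elements also provide, so it should accept a node returned by `extract` as well.

```python
import xml.etree.ElementTree as ET


def table_to_rows(table):
    """Return each <tr> of a table element as a list of stripped cell strings.

    Hypothetical helper for illustration; not part of libextract.
    """
    rows = []
    for tr in table.iter('tr'):
        # Join all text inside each header or data cell, including nested tags
        cells = [''.join(cell.itertext()).strip()
                 for cell in tr if cell.tag in ('th', 'td')]
        if cells:
            rows.append(cells)
    return rows


# Made-up sample table standing in for an extracted node
sample_table = ET.fromstring(
    "<table>"
    "<tr><th>Country/Region</th><th>Average male height</th></tr>"
    "<tr><td>Exampleland</td><td><b>1.75</b> m</td></tr>"
    "</table>"
)

rows = table_to_rows(sample_table)
for row in rows:
    print(row)
```

With a real result, the call would look like `table_to_rows(tabs[0])`, assuming the first node returned is the table of interest.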

 
  

  
   Dependencies 
lxml
statscounter
  

  
   Disclaimer 
This project is still in its infancy; any advice and suggestions as to what this library could and should be would be greatly appreciated :)
