Command line program to download documents from web portals.

Overview

Command line document download made easy.



Highlights

  • list available documents in json format or download them
  • filter documents using
    • string matching
    • regular expressions or
    • jq queries
  • display captcha or QR codes for interactive input
  • writing new plugins is easy
  • existing plugins (some of them even work):
    • amazon
    • ing.de
    • dkb.de
    • o2.de
    • kabel.vodafone.de
    • conrad.de
    • elster.de



Dependencies



Installation

$ git clone --recursive https://github.com/heeplr/document-dl
$ cd document-dl
$ pip install .



Usage

Display Help:

$ document-dl -h
Usage: document-dl [OPTIONS] COMMAND [ARGS]...

  download documents from web portals

Options:
  -u, --username TEXT             login id  [env var: DOCDL_USERNAME]
  -p, --password TEXT             secret password  [env var: DOCDL_PASSWORD]
  -m, --match <ATTRIBUTE PATTERN>...
                                  only output documents where attribute
                                  contains pattern string  [env var:
                                  DOCDL_MATCH]

  -r, --regex <ATTRIBUTE REGEX>...
                                  only output documents where attribute value
                                  matches regex  [env var: DOCDL_REGEX]

  -j, --jq JQ_EXPRESSION          only output documents if json query matches
                                  document's attributes (see
                                  https://stedolan.github.io/jq/manual/ )
                                  [env var: DOCDL_JQ]

  -H, --headless BOOLEAN          show browser window if false  [env var:
                                  DOCDL_HEADLESS; default: True]

  -b, --browser [chrome|edge|firefox|ie|opera|safari|webkitgtk]
                                  webdriver to use for selenium based plugins
                                  [env var: DOCDL_BROWSER; default: chrome]

  -t, --timeout INTEGER           seconds to wait for data before terminating
                                  connection  [env var: DOCDL_TIMEOUT;
                                  default: 15]

  -i, --image-loading BOOLEAN     Turn off image loading when False  [env var:
                                  DOCDL_IMAGE_LOADING; default: False]

  -a, --action [download|list]    download or just list documents  [env var:
                                  DOCDL_ACTION; default: list]

  -h, --help                      Show this message and exit.

Commands:
  amazon    amazon.de (invoices)
  conrad    conrad.de (invoices)
  dkb       dkb.de with photoTAN (postbox)
  elster    elster.de with path to .pfx certfile as username (postbox)
  ing       banking.ing.de with photoTAN (postbox)
  o2        o2online.de (invoices/postbox)
  vodafone  kabel.vodafone.de (postbox, invoices)

Display plugin-specific help (note: there is currently a bug in click that prompts for username and password before the help is displayed):

$ document-dl ing --help
Usage: document-dl ing [OPTIONS]

  banking.ing.de with photoTAN (postbox)

Options:
  -k, --diba-key TEXT  DiBa Key  [env var: DOCDL_DIBA_KEY]
  -h, --help           Show this message and exit.



Examples

List all documents from vodafone.de, prompt for username/password:

$ document-dl vodafone

Same, but show browser window this time:

$ document-dl --headless=false vodafone

Download all documents from conrad.de, pass credentials as command-line arguments:

$ document-dl --username mylogin --password mypass --action download conrad

Download all documents from conrad.de, pass credentials as env vars:

$ DOCDL_USERNAME='mylogin' DOCDL_PASSWORD='mypass' document-dl --action download conrad

Download all documents from o2online.de where "doctype" attribute contains "BILL":

$ document-dl --match doctype BILL --action download o2

You can also use regular expressions to filter documents:

$ document-dl --regex date '^(2021-04|2021-05).*$' o2

List all documents from o2online.de where year >= 2019:

$ document-dl --jq 'select(.year >= 2019)' o2

Download document from elster.de with id == 15:

$ document-dl --jq 'contains({id: 15})' --action download elster



Writing a plugin

Plugins are click-plugins, which in turn are normal @click.command functions registered in setup.py

  • put your plugin into "docdl/plugins"

  • write your plugin class:

    • if you just need requests, inherit from docdl.WebPortal and use self.session, which is initialized for you
    • if you need selenium, inherit from docdl.SeleniumWebPortal and use self.webdriver, which is initialized for you (see the selenium sketch at the end of this section)
    • add click glue code
    • add your plugin to setup.py docdl_plugins registry
import click

import docdl
import docdl.util

class MyPlugin(docdl.WebPortal):

    URL_LOGIN = "https://myservice.com/login"

    def login(self):
        request = self.session.get(self.URL_LOGIN)
        # ... authenticate ...
        if not_logged_in:
            return False
        return True

    def logout(self):
        # ... logout ...
        pass

    def documents(self):
        # iterate over all available documents
        for count, document in enumerate(all_documents):

            # scrape:
            #  * document attributes
            #    * it's recommended to assign an incremental "id"
            #      attribute to every document
            #    * if you set a "filename" attribute, it will be used to
            #      rename the downloaded file
            #    * dates should be parsed to datetime.datetime objects
            #      docdl.util.parse_date() should parse the most common strings
            #
            # you must also scrape either:
            #  * the download URL
            #
            # or (for SeleniumWebPortal plugins):
            #  * the DOM element that triggers download. It is expected
            #    that the download starts immediately after click() on
            #    the DOM element
            # or implement a custom download() method

            yield docdl.Document(
                url = this_documents_url,
                # download_element = <some selenium element to click>
                attributes = {
                    "id": count,
                    "category": "invoices",
                    "title": this_documents_title,
                    "filename": this_documents_target_filename,
                    "date": docdl.util.parse_date(some_date_string)
                }
            )


    def download(self, document):
        """you shouldn't need this for most web portals"""
        # ... save file to os.getcwd() ...
        return self.rename_after_download(document, filename)


@click.command()
@click.pass_context
def myplugin(ctx):
    """plugin description (what, documents, are, scraped)"""
    docdl.cli.run(ctx, MyPlugin)

and in setup.py:

# ...
setup(
    # ...
    packages=find_packages(),
    # ...
    entry_points={
        'docdl_plugins': [
            # ...
            'myplugin=docdl.plugins.myplugin:myplugin',
            # ...
        ],
        # ...
    }
)
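
For selenium-based plugins the structure is the same, except that you inherit from docdl.SeleniumWebPortal and drive self.webdriver instead of self.session. The following is only a hypothetical sketch (portal URL, CSS selectors and attribute names are made up) showing how a DOM element can be passed to docdl.Document as download_element instead of a url:

import click
from selenium.webdriver.common.by import By

import docdl
import docdl.util


class MySeleniumPlugin(docdl.SeleniumWebPortal):

    URL_LOGIN = "https://myservice.com/login"   # hypothetical

    def login(self):
        self.webdriver.get(self.URL_LOGIN)
        # ... fill the login form, submit, maybe solve a captcha ...
        # return False if authentication failed
        return True

    def logout(self):
        # ... click the logout button ...
        pass

    def documents(self):
        # hypothetical selector for the rows of a document table
        rows = self.webdriver.find_elements(By.CSS_SELECTOR, "table.documents tr")
        for count, row in enumerate(rows):
            # element whose click() immediately starts the download
            link = row.find_element(By.CSS_SELECTOR, "a.download")
            yield docdl.Document(
                download_element=link,
                attributes={
                    "id": count,
                    "title": row.text,
                    "date": docdl.util.parse_date(
                        row.find_element(By.CSS_SELECTOR, "td.date").text
                    )
                }
            )


@click.command()
@click.pass_context
def myseleniumplugin(ctx):
    """hypothetical selenium example (nothing real is scraped)"""
    docdl.cli.run(ctx, MySeleniumPlugin)

Register it in setup.py's docdl_plugins entry points the same way as above.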



Security

Beware that your login credentials will most likely be saved in your shell history if you pass them as command-line arguments. You can use the input prompt to avoid that, or set environment variables safely.
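
For example, in bash you can read the password into an environment variable without echoing it and without it ending up in the shell history (a sketch; exact behaviour depends on your shell):

$ read -rs DOCDL_PASSWORD
$ export DOCDL_PASSWORD
$ document-dl --username mylogin --action download conrad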



Bugs

document-dl is still in a very early stage of development and a lot of things don't work yet. In particular, a lot of edge cases still need to be covered. If you find a bug, please open an issue or send a pull request.

  • --browser settings other than chrome probably don't work, unless you help test them
  • some services offer more documents/data than are currently scraped



TODO

  • logging
  • better documentation
  • properly parse RFC 6266 (Content-Disposition) headers
  • delete action