A webmining CLI tool & library for python.

Overview

Minet

minet is a webmining command line tool & library for python (>= 3.6) that can be used to collect and extract data from a large variety of web sources such as raw webpages, Facebook, CrowdTangle, YouTube, Twitter, Media Cloud, etc.

It adopts a very simple approach to various webmining problems by letting you perform a variety of actions from the comfort of the command line. No database needed: raw CSV files should be sufficient to do most of the work.

In addition, minet also exposes its high-level programmatic interface as a python library so you can tweak its behavior at will.

Shortcuts: Command line documentation, Python library documentation.

What it does

Minet can single-handedly:

  • Extract URLs from a text file (or a table)
  • Parse URLs (get useful information, with Facebook- and YouTube-specific details)
  • Join two CSV files by matching the columns containing URLs
  • From a list of URLs, resolve their redirections
    • ...and check their HTTP status
    • ...and download the HTML
    • ...and extract hyperlinks
    • ...and extract the text content and other metadata (title...)
    • ...and scrape structured data (using a declarative language to define your heuristics)
  • Crawl (using a declarative language to define a browsing behavior, and what to harvest)
  • Mine or search data from web platforms
  • Scrape data (without requiring special access)
  • Grab & dump cookies from your browser
  • Dump Hyphe data
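
To give an idea of what joining two CSV files on their URL columns involves, here is a minimal sketch using only the Python standard library. It is not minet's implementation: `normalize_url` and `join_by_url` are hypothetical helpers, the normalization is deliberately naive (minet's own heuristics are more thorough), and URLs are assumed to carry a scheme.

```python
import csv
import io
from urllib.parse import urlsplit


def normalize_url(url):
    """Naive, hypothetical normalization: drop scheme, 'www.', fragment and trailing slash."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def join_by_url(left_rows, right_rows, left_col, right_col):
    """Yield merged rows for every pair whose URL columns match once normalized."""
    index = {}
    for row in right_rows:
        index.setdefault(normalize_url(row[right_col]), []).append(row)
    for row in left_rows:
        for match in index.get(normalize_url(row[left_col]), []):
            yield {**row, **match}


# Two tiny in-memory CSV files standing in for real ones (made-up sample data).
left = csv.DictReader(io.StringIO("url,title\nhttps://www.lemonde.fr/,Le Monde\n"))
right = csv.DictReader(io.StringIO("link,category\nhttp://lemonde.fr,press\n"))

joined = list(join_by_url(left, right, "url", "link"))
print(joined)
```

The right-hand file is indexed in memory first, so the larger file is best passed on the left, where it is streamed row by row.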

Documented use cases

Features (from a technical standpoint)

  • Multithreaded, memory-efficient fetching from the web.
  • Multithreaded, scalable crawling using a comfy DSL.
  • Multiprocessed raw text content extraction from HTML pages.
  • Multiprocessed scraping from HTML pages using a comfy DSL.
  • URL-related heuristics utilities such as extraction, normalization and matching.
  • Data collection from various APIs such as CrowdTangle.
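
As an illustration of the multithreaded fetching idea, the sketch below uses the standard library's thread pool to request several URLs concurrently and yields results as they complete. It is only a sketch under stated assumptions, not how minet does it: `fetch_status`, `fetch_all` and `fake_fetch` are hypothetical names, and the fetch function is injectable so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status of `url`; the response body is not kept."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


def fetch_all(urls, fetch=fetch_status, threads=4):
    """Yield (url, result, error) tuples as soon as each request completes."""
    with ThreadPoolExecutor(max_workers=threads) as executor:
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                yield url, future.result(), None
            except Exception as error:  # failures are reported, not raised
                yield url, None, error


def fake_fetch(url):
    """Hypothetical offline stand-in for a real HTTP call."""
    if "broken" in url:
        raise ValueError("unreachable")
    return 200


urls = ["https://example.com", "https://broken.example"]
results = {url: (status, error) for url, status, error in fetch_all(urls, fetch=fake_fetch)}
print(results)
```

Reporting errors alongside results, instead of raising, lets one failing URL leave the rest of the batch unaffected.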

Installation

minet can be installed as a standalone CLI tool (currently only on macOS >= 10.14, Ubuntu & similar) by running the following command in your terminal:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

Don't trust us enough to pipe the result of an HTTP request into bash? We wouldn't either, so feel free to read the installation script here and run it on your end if you prefer.

On Ubuntu & similar, you might need to install curl and unzip before running the installation script if you don't already have them:

sudo apt-get install curl unzip

Otherwise, minet can be installed directly as a python CLI tool and library using pip:

pip install minet
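
Once the pip installation has finished, one way to confirm it from Python is to look up the installed version. This is a minimal sketch, assuming Python >= 3.8 (where `importlib.metadata` is part of the standard library); `installed_version` is a hypothetical helper, not part of minet.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version of minet if pip installed it in this environment, else None.
print(installed_version("minet"))
```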

If you need more help to install and use minet from scratch, you can check these installation documents.

Finally, if you want to install the standalone binaries yourself (even for Windows), you can find them in each release here.

Upgrading

To upgrade the standalone version, simply run the install script once again:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/install.sh | bash

To upgrade the python version you can use pip thusly:

pip install -U minet

Uninstallation

To uninstall the standalone version:

curl -sSL https://raw.githubusercontent.com/medialab/minet/master/scripts/uninstall.sh | bash

To uninstall the python version:

pip uninstall minet

Documentation

Contributing

To contribute to minet you can check out this documentation.

How to cite

minet is published on Zenodo as DOI 10.5281/zenodo.4564399

You can cite it thusly:

Guillaume Plique, Pauline Breteau, Jules Farjas, Héloïse Théro, Jean Descamps, & Amélie Pellé. (2019, October 14). Minet, a webmining CLI tool & library for python. Zenodo. http://doi.org/10.5281/zenodo.4564399

Comments
  • casanova.exceptions.EmptyFileError

    I am trying to run minet in a github action. It fails with the following message:

      minet tw scrape tweets -o tweets.csv "from:@taniki #tutotal2022"
      shell: /usr/bin/bash -e {0}
      env:
        pythonLocation: /opt/hostedtoolcache/Python/3.9.5/x64
        LD_LIBRARY_PATH: /opt/hostedtoolcache/Python/3.9.5/x64/lib
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/reader.py", line 151, in __init__
        fieldnames = next(self.reader)
    StopIteration
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.9.5/x64/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/__main__.py", line 218, in main
        fn(cli_args)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/twitter/__init__.py", line 33, in twitter_action
        twitter_scrape_action(cli_args)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/minet/cli/twitter/scrape.py", line 45, in twitter_scrape_action
        enricher = casanova.enricher(
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/enricher.py", line 31, in __init__
        super().__init__(input_file, no_headers=no_headers, **kwargs)
      File "/opt/hostedtoolcache/Python/3.9.5/x64/lib/python3.9/site-packages/casanova/reader.py", line 157, in __init__
        raise EmptyFileError
    casanova.exceptions.EmptyFileError
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]
    Error: Process completed with exit code 1.
    
    opened by taniki 16
  • Get Retweeters

    Hi, thanks for the last release. I'm glad to see there is a Retweeters tool, but I have been running into issues with it for a few days. I may not have understood how it should be used. When I run it, I get an error (see image). Could someone who got it working help me?

    Thank you

    opened by jlbreeeez 15
  • Twitter API scraper: acquire guest_token by API

    New method to acquire the guest_token through the activate API. Relates to #384 and #382.

    Method taken from @JustAnotherArchivist in snscrape, see: https://github.com/JustAnotherArchivist/snscrape/commit/0336ce13edbd195b3e91487061a0e7a2857f0c68. Thanks for sharing the solution.

    For now this edit is simply a new method to acquire the token. The token is used as a cookie as before but it's not preserved on disk in case of multiple calls.

    opened by paulgirard 11
  • tw scrape fails on some queries due to Over capacity error

    minet tw scrape tweets '#5gcovid' > tweets.csv

    <class 'minet.twitter.exceptions.TwitterPublicAPIInvalidResponseError'>

    {'errors': [{'message': 'Over capacity', 'code': 130}]} 503

    bug 
    opened by Yomguithereal 10
  • [retweeters] KeyError: 'url'

    Hi, when I try to retrieve the list of retweeters from a file containing tweets previously extracted from Twitter with the minet scraper, I get this error after a few tweets from my list have been scanned (after 7, 10, or 30 tweets, depending on the database). Has anyone encountered this error before? Thanks for helping :-)

    opened by tloops329384 8
  • Impossible to extract all the tweets of a query

    When I run a query using a keyword plus a user as criteria, the result is very erratic: sometimes 0 tweets, sometimes 1 tweet, sometimes 20 tweets, sometimes 80 tweets, etc., without ever reaching a complete extraction (which should only be about 200 tweets). I have rerun this query many times without ever extracting all of the tweets in question.

    What should I do to achieve this? Thank you.

    opened by parisGH 8
  • [twitter] unable to get user tweets

    Hello,

    Thanks for sharing the lib with the community. I am not able to get user tweets; I get this error:

    Traceback (most recent call last):
      File "/home/bafou/.local/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/__main__.py", line 198, in main
        to_close = resolve_arg_dependencies(cli_args, config)
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/argparse.py", line 290, in resolve_arg_dependencies
        setattr(cli_args, name, value.resolve(config))
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/minet/cli/argparse.py", line 253, in resolve
        return getpath(config, self.key, self.default)
      File "/home/bafou/.local/pipx/venvs/minet/lib/python3.8/site-packages/ebbe/utils.py", line 72, in getpath
        target = target[step]
    TypeError: string indices must be integers
    

    when executing minet tw user-tweets screen_name users.csv > tweets.csv with users.csv

    Regards.

    bug 
    opened by billmetangmo 6
  • GH actions + Minet Scrap Twitter fail.

    Hi,

    I have this GH action to generate a CSV of scraped tweets (written by @taniki):

    name: scrape bfm
    
    on:
      workflow_dispatch:
      schedule:
        - cron:  '0 9 * * *'
    
    jobs:
      scrape_bfm:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/[email protected]
          - uses: actions/[email protected]
            with:
              python-version: '3.x'
          - name: install minet
            run: |
              python -m pip install --upgrade pip
              pip install minet==0.56.2
          - name: scrape @BFMTV tweets
            shell: bash
            run: |
              minet tw scrape tweets "from:@BFMTV since:2021-09-01" > bfmtv-tweets.csv
          - name: commit
            uses: ./.github/actions/commit
            with:
              message: lol @bfmtv
    

    Sometimes it works without any problem. Sometimes GH returns this error log:

    Run minet tw scrape tweets "from:@CNEWS since:2021-09-01" > cnews-tweets.csv
    Collecting tweets: 0 tweets [00:00, ? tweets/s]                            
    Collecting tweets: 0 tweets [00:00, ? tweets/s]                   
    Searching for "from:@CNEWS since:2021-09-01"
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s]
    Collecting tweets: 0 tweets [00:00, ? tweets/s, queries=1, tokens=1]Traceback (most recent call last):
      File "/opt/hostedtoolcache/Python/3.10.1/x64/bin/minet", line 8, in <module>
        sys.exit(main())
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/__main__.py", line 218, in main
        fn(cli_args)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/twitter/__init__.py", line 31, in twitter_action
        twitter_scrape_action(cli_args)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/cli/twitter/scrape.py", line 69, in twitter_scrape_action
        for tweet, meta in iterator:
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 370, in search
        new_cursor, tweets = retryer(self.request_search, query, cursor, refs=refs)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 404, in __call__
        do = self.iter(retry_state=retry_state)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 349, in iter
        return fut.result()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/concurrent/futures/_base.py", line 438, in result
        return self.__get_result()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/concurrent/futures/_base.py", line 390, in __get_result
        raise self._exception
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/tenacity/__init__.py", line 407, in __call__
        result = fn(*args, **kwargs)
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 72, in wrapped
        self.acquire_guest_token()
      File "/opt/hostedtoolcache/Python/3.10.1/x64/lib/python3.10/site-packages/minet/twitter/api_scraper.py", line 261, in acquire_guest_token
        raise TwitterGuestTokenError
    minet.twitter.exceptions.TwitterGuestTokenError
    
    Collecting tweets: 0 tweets [00:00, ? tweets/s, queries=1, tokens=1]
    Error: Process completed with exit code 1.
    

    I don't understand. Has anyone had the same problem? Does Twitter sometimes ban GH?

    Thanks for Minet, great tool!

    opened by stefw 6
  • Access denied

    Foreword: sorry, I'm new on GitHub and I'm not sure this is the appropriate place to post my question. Is it?

    Hi, first, thank you for the tool, which will help me a lot in my research! I have a problem which I think is not that complicated: when I run Minet to get the "friends" of the twitter_users contained in the data_users.csv file, I can't get access to the file: "Permission Denied". I tried opening the CMD as an Administrator, but it didn't solve the problem. Can you help me?

    opened by jlbreeeez 6
  • error in installing pip install mineit

    Installing mineit via pip does not work. It says:

    Collecting mineit Could not install packages due to an EnvironmentError: 404 Client Error: Not Found for url: https://pypi.org/simple/mineit/

    Is this issue already solved?

    opened by moonisali 6
  • Twitter scrape: systematic TwitterGuestTokenError with v0.56.2 or v0.56.1

    As in #382, I experience systematic TwitterGuestTokenError exceptions. This was not the case a few weeks ago. I didn't test versions other than 0.56.1 and 0.56.2.

    Looks like we need to review the twitter scrape heuristic. I will try to have a look later today or tomorrow.

    bug 
    opened by paulgirard 5
  • instagram

    • [ ] get comments from a post id: https://www.instagram.com/api/v1/media/POST_ID/comments/?can_support_threading=true&permalink_enabled=false
    • [x] get user info from username: https://i.instagram.com/api/v1/users/web_profile_info/?username=USERNAME
    • [ ] other route for posts associated with hashtag (more info but don't know how to change page): https://www.instagram.com/api/v1/tags/web_info/?tag_name=HASHTAG
    • [ ] get post info from post id: https://www.instagram.com/api/v1/media/POST_ID/info/
    • [ ] get post likers from post id (it seems that we can only have access to a limited number of them): https://www.instagram.com/api/v1/media/POST_ID/likers/

    Need 'cookie' and 'x-ig-app-id'

    enhancement 
    opened by MiguelLaura 0
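
Several of the reports above involve failures that come and go (the "Over capacity" 503 response, TwitterGuestTokenError in scheduled jobs). The tracebacks show that minet already retries some requests through tenacity; a caller running it in automation may still want an outer retry of their own. The following is a minimal sketch of retrying with exponential backoff using only the standard library. `retry` and `flaky_scrape` are hypothetical names, and the failing function is a stand-in, not a real minet call.

```python
import time


def retry(fn, attempts=5, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call `fn` until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except exceptions:
            if attempt == attempts - 1:
                raise  # out of attempts: let the last error propagate
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for a call that fails twice before succeeding.
calls = {"count": 0}


def flaky_scrape():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("Over capacity")
    return ["tweet"]


# Record the delays instead of actually sleeping, to keep the example instant.
delays = []
result = retry(flaky_scrape, sleep=delays.append)
print(result, delays)
```

Passing the sleep function as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies.
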
Releases(0.66.1)
Owner
médialab Sciences Po
SciencesPo's médialab is an interdisciplinary research laboratory gathering engineers, designers & social science researchers.