PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline from a configuration file.

Overview

PdpCLI

Actions Status Python version PyPI version License

Quick Links

Introduction

PdpCLI is a pandas DataFrame processing CLI tool which enables you to build a pandas pipeline powered by pdpipe from a configuration file. You can also extend pipeline stages and data readers / writers by using your own python scripts.

Features

  • Process pandas DataFrame from CLI without wrting Python scripts
  • Support multiple configuration file formats: YAML, JSON, Jsonnet
  • Read / write data files in the following formats: CSV, TSV, JSON, JSONL, pickled DataFrame
  • Import / export data with multiple protocols: S3 / Databse (MySQL, Postgres, SQLite, ...) / HTTP(S)
  • Extensible pipeline and data readers / writers

Installation

Installing the library is simple using pip.

$ pip install "pdpcli[all]"

Tutorial

Basic Usage

  1. Write a pipeline config file config.yml like below. The type fields under pipeline correspond to the snake-cased class names of the PdpipelineStages. Other fields such as stage and columns are the parameters of the __init__ methods of the corresponging classes. Internally, this configuration file is converted to Python objects by colt.
pipeline:
  type: pipeline
  stages:
    drop_columns:
      type: col_drop
      columns:
        - name
        - job

    encode:
      type: one_hot_encode
      columns: sex

    tokenize:
      type: tokenize_text
      columns: content

    vectorize:
      type: tfidf_vectorize_token_lists
      column: content
      max_features: 10
  1. Build a pipeline by training on train.csv. The following command generages a pickled pipeline file pipeline.pkl after training. If you specify a URL of file path, it will be automatically downloaded and cached.
$ pdp build config.yml pipeline.pkl --input-file https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/train.csv
  1. Apply the fitted pipeline to test.csv and get output of a processed file processed_test.jsonl by the following command. PdpCLI automatically detects the output file format based on the file name. In this example, the processed DataFrame will be exported as the JSON-Lines format.
$ pdp apply pipeline.pkl https://github.com/altescy/pdpcli/raw/main/tests/fixture/data/test.csv --output-file processed_test.jsonl
  1. You can also directly run the pipeline from a config file without fitting pipeline.
$ pdp apply config.yml test.csv --output-file processed_test.jsonl
  1. It is possible to override or add parameters by adding command line arguments:
pdp apply config.yml test.csv pipeline.stages.drop_columns.column=name

Data Reader / Writer

PdpCLI automatically detects a suitable data reader / writer based on a given file name. If you need to use the other data reader / writer, add a reader or writer config to config.yml. The following config is an exmaple to use SQL data reader. SQL reader fetches records from the specified database and converts them into a pandas DataFrame.

reader:
    type: sql
    dsn: postgres://${env:POSTGRES_USER}:${env:POSTGRES_PASSWORD}@your.posgres.server/your_database

Config files are interpreted by OmegaConf, so ${env:...} is interpolated by environment variables.

Prepare yuor SQL file query.sql to fetch data from the database:

select * from your_table limit 1000

You can execute the pipeline with SQL data reader via:

$ POSTGRES_USER=user POSTGRES_PASSWORD=password pdp apply config.yml query.sql

Plugins

By using plugins, you can extend PdpCLI. This plugin feature enables you to use your own pipeline stages, data readers / writers and commands.

Add a new stage

  1. Write your plugin script mypdp.py like below. Stage.register(" ") registers your pipeline stages, and you can specify these stages by writing type: in your config file.
import pdpcli

@pdpcli.Stage.register("print")
class PrintStage(pdpcli.Stage):
    def _prec(self, df):
        return True

    def _transform(self, df, verbose):
        print(df.to_string(index=False))
        return df
  1. Update config.yml to use your plugin.
pipeline:
    type: pipeline
    stages:
        drop_columns:
        ...

        print:
            type: print

        encode:
        ...
  1. Execute command with --module mypdp and you can see the processed DataFrame after running drop_columns.
$ pdp apply config.yml test.csv --module mypdp

Add a new command

You can also add new commands not only stages.

  1. Add the following script to mypdp.py. This greet command prints out a greeting message with your name.
@pdpcli.Subcommand.register(
    name="greet",
    description="say hello",
    help="say hello",
)
class GreetCommand(pdpcli.Subcommand):
    requires_plugins = False

    def set_arguments(self):
        self.parser.add_argument("--name", default="world")

    def run(self, args):
        print(f"Hello, {args.name}!")
  1. To register this command, you need to create the .pdpcli_plugins file in which module names are listed for each line. Due to module importing order, the --module option is unavailable for command registration.
.pdpcli_plugins ">
$ echo "mypdp" > .pdpcli_plugins
  1. Run the following command and get a message like below. By using the .pdpcli_plugins file, it is is not needed to add the --module option to a command line for each execution.
$ pdp greet --name altescy
Hello, altescy!
You might also like...
gget is a free and open-source command-line tool and Python package that enables efficient querying of genomic databases.
gget is a free and open-source command-line tool and Python package that enables efficient querying of genomic databases.

gget is a free and open-source command-line tool and Python package that enables efficient querying of genomic databases. gget consists of a collection of separate but interoperable modules, each designed to facilitate one type of database querying in a single line of code.

CLI program that allows you to change your Alacritty config with one command without editing the config file.
CLI program that allows you to change your Alacritty config with one command without editing the config file.

Pycritty Change your alacritty config on the fly! Installation: pip install pycritty By default, only the program itself will be installed, but you ca

This is a CLI program which can help you generate your own QR Code.

Python-QR-code-generator This is a CLI program which can help you generate your own QR Code. Single.py This will allow you only to input a single mess

A CLI tool that scans through a directory and organizes all loose files into folders by file type.
A CLI tool that scans through a directory and organizes all loose files into folders by file type.

Organizer CLI Organizer CLI is a python command line tool that goes through a given directory and organizes all un-folder bound files into folders by

tox-server is a command line tool which runs tox in a loop and calls it with commands from a remote CLI.

Tox Server tox-server is a command line tool which runs tox in a loop and calls it with commands from a remote CLI. It responds to commands via ZeroMQ

Pyrdle - Play Wordle in the CLI. Write an algorithm to play Wordle for you. Ruin all of the fun you've been having
Pyrdle - Play Wordle in the CLI. Write an algorithm to play Wordle for you. Ruin all of the fun you've been having

Pyrdle - Play Wordle in the CLI. Write an algorithm to play Wordle for you. Ruin all of the fun you've been having

flora-dev-cli (fd-cli) is command line interface software to interact with flora blockchain.

Install git clone https://github.com/Flora-Network/fd-cli.git cd fd-cli python3 -m venv venv source venv/bin/activate pip install -e . --extra-index-u

Python-Stock-Info-CLI: Get stock info through CLI by passing stock ticker.
Python-Stock-Info-CLI: Get stock info through CLI by passing stock ticker.

Python-Stock-Info-CLI Get stock info through CLI by passing stock ticker. Installation Use the following command to install the required modules at on

Releases(v0.4.1)
Owner
Yasuhiro Yamaguchi
Research Engineer / NLP / ML
Yasuhiro Yamaguchi
3DigitDev 29 Jan 17, 2022
A command-line based, minimal torrent streaming client made using Python and Webtorrent-cli. Stream your favorite shows straight from the command line.

A command-line based, minimal torrent streaming client made using Python and Webtorrent-cli. Installation pip install -r requirements.txt It use

Jonardon Hazarika 17 Dec 11, 2022
An awesome Python wrapper for an awesome Docker CLI!

An awesome Python wrapper for an awesome Docker CLI!

Gabriel de Marmiesse 303 Jan 03, 2023
git-partial-submodule is a command-line script for setting up and working with submodules while enabling them to use git's partial clone and sparse checkout features.

Partial Submodules for Git git-partial-submodule is a command-line script for setting up and working with submodules while enabling them to use git's

Nathan Reed 15 Sep 22, 2022
Convert markdown to HTML using the GitHub API and some additional tweaks with Python.

Convert markdown to HTML using the GitHub API and some additional tweaks with Python. Comes with full formula support and image compression.

phseiff 70 Dec 23, 2022
Commandline Python app to Autodownload mediafire folders and files.

Commandline Python app to Autodownload mediafire folders and files.

Tharuk Renuja 3 May 12, 2022
Open a file in your locally running Visual Studio Code instance from arbitrary terminal connections.

code-connect Open a file in your locally running Visual Studio Code instance from arbitrary terminal connections. Motivation VS Code supports opening

Christian Volkmann 56 Nov 19, 2022
A CLI based task manager tool which helps you track your daily task and activity.

CLI based task manager tool This is the simple CLI tool can be helpful in increasing your productivity. More like your todolist. It uses Postgresql as

ritik 1 Jan 19, 2022
Command line tool for interacting and testing warehouse components

Warehouse debug CLI Example usage for Zumo debugging See all messages queued and handled. Enable by compiling the zumo-controller with -DDEBUG_MODE_EN

1 Jan 03, 2022
dsub is a command-line tool that makes it easy to submit and run batch scripts in the cloud.

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.

Data Biosphere 233 Jan 01, 2023
A CLI Spigot plugin manager that adheres to Unix conventions and Python best practices.

Spud A cross-platform, Spigot plugin manager that adheres to the Unix philosophy and Python best practices. Some focuses of the project are: Easy and

Tommy Dougiamas 9 Dec 02, 2022
Gamestonk Terminal is an awesome stock and crypto market terminal

Gamestonk Terminal is an awesome stock and crypto market terminal. A FOSS alternative to Bloomberg Terminal.

Gamestonk Terminal 18.6k Jan 03, 2023
sync-my-tasks is a CLI tool that copies tasks between apps.

sync-my-tasks Copy tasks between apps Report a Bug · Request a Feature . Ask a Question Table of Contents Table of Contents Getting Started Developmen

William Hutson 2 Dec 14, 2021
A simple command-line tracert implementation in Python 3 using ICMP packets

Traceroute A simple command-line tracert implementation in Python 3 using ICMP packets Details Traceroute is a networking tool designed for tracing th

James 3 Jul 16, 2022
Output Analyzer for you terminal commands

Output analyzer (OZER) You can specify a few words inside config.yaml file and specify the color you want to be used. installing: Install command usin

Ehsan Shirzadi 1 Oct 21, 2021
Enriches Click with option groups, constraints, command aliases, help sections for subcommands, themes for --help and other stuff.

Enriches Click with option groups, constraints, command aliases, help sections for subcommands, themes for --help and other stuff.

Gianluca Gippetto 62 Dec 22, 2022
A ZSH plugin that enables you to use OpenAI's powerful Codex AI in the command line.

A ZSH plugin that enables you to use OpenAI's powerful Codex AI in the command line.

Tom Dörr 976 Jan 03, 2023
Fun project to generate The Matrix Code effect on you terminal.

Fun project to generate The Matrix Code effect on you terminal.

Henrique Bastos 11 Jul 13, 2022
A handy command-line utility for generating and sending iCalendar events

A handy command-line utility for generating and sending iCalendar events This simple command-line utility is designed to generate an iCalendar event,

Baochun Li 17 Nov 21, 2022
CLI tool that helps manage shell libraries.

shmgr CLI tool that helps manage shell libraries. Badges 📛 project status badges: version badges: tools / frameworks used by test suite (i.e. used by

Bryan Bugyi 0 Dec 15, 2021