Python command line tool and python engine to label table fields and fields in data files.

Overview

Metacrafter

A Python command-line tool and Python engine for labeling table fields and fields in data files. It can help you find meaningful data in your tables and data files, or locate personally identifiable information (PII).

Installation

To install the Python library, run pip install metacrafter, or run python setup.py install from a source checkout.

Features

Metacrafter is a rule-based tool that helps label the fields of database tables. It scans a table and finds person names, surnames, middle names, PII data, and basic identifiers such as UUID/GUID. The rules are written as .yaml files and can be easily extended.

File formats supported:

  • CSV
  • JSON lines
  • JSON (array of dicts)
  • BSON
  • Parquet

Databases support:

  • any database supported by SQLAlchemy (scan-db command)
  • MongoDB (scan-mongodb command)

Metacrafter key features:

  • 25 basic and PII rules
  • all label metadata is collected in the public Metacrafter registry repository
  • 312 date detection rules/patterns; dates are detected with qddate, a "quick and dirty" date detection library
  • extendable set of rules using PyParsing, exact text match and validation functions
  • support for any database supported by SQLAlchemy
  • advanced context and language management: you can apply only the rules relevant to a certain context or to data in a chosen language
  • built-in API server
  • commercial support and additional rules available

Command line examples

File analysis examples

# Scan CSV file
$ metacrafter scan-file --format short somefile.csv

# Scan CSV file with delimiter ';' and windows-1251 encoding
$ metacrafter scan-file --format short --encoding windows-1251 --delimiter ';' somefile.csv

# Scan JSON lines file, write results as a stats table to a file
$ metacrafter scan-file --format stats -o somefile_result.json somefile.jsonl
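
The same commands can be assembled from Python when scans are scripted. The sketch below is only an illustration: build_scan_file_cmd is a hypothetical helper, not part of Metacrafter, and it uses only the flags that appear in the examples above (--format, --encoding, --delimiter, -o).

```python
import shlex


def build_scan_file_cmd(path, fmt="short", encoding=None, delimiter=None, output=None):
    """Build the argument list for `metacrafter scan-file` (hypothetical helper)."""
    cmd = ["metacrafter", "scan-file", "--format", fmt]
    if encoding:
        cmd += ["--encoding", encoding]
    if delimiter:
        cmd += ["--delimiter", delimiter]
    if output:
        cmd += ["-o", output]
    cmd.append(path)
    return cmd


cmd = build_scan_file_cmd("somefile.csv", encoding="windows-1251", delimiter=";")
# Print a shell-safe version of the command for logging.
print(shlex.join(cmd))
# To actually run it (requires metacrafter to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with delimiters such as ';'.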

Example result with the 'full' format

key                   ftype    tags    matches
--------------------  -------  ------  -------------------------------
name                  str      uniq
addressresidence      str      uniq    address 59.80
addressactivities     str              address 50.98
addressobjects        str              address 28.00
bin                   int              ogrn 99.02
inn                   str              inn 100.00,inn 99.02
purposeaudit          str              runpa 8.82
dateregistration      str              datetime 94.12 (dt:date:date_2)
expirydate            str              datetime 18.63 (dt:date:date_2)
startdateactivity     str              datetime 28.43 (dt:date:date_2)
othergrounds          str      dict
startdateaudit        str              datetime 65.69 (dt:date:date_2)
workdays              int      dict
workhours             str      dict
formaudit             str      dict
namestatecontrol      str
assignment decree     str      dict
effectivedate         str      dict
Inspectionenddate     str      empty
riskcategory          str      dict
expirationdate        str      empty
startupnotifications  str      empty
daylastcheck          str      empty
otherreasonsrefusal   str      empty
numbersystem          str      empty
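
Each cell of the matches column above holds one or more comma-separated entries made of a rule key, a number and an optional detail in parentheses. A minimal sketch of splitting such a cell in Python follows. The helper parse_matches is hypothetical and not part of Metacrafter, and reading the number as a match percentage is an assumption based on the sample output.

```python
import re

# rule key, a number, and an optional "(detail)" suffix
MATCH_RE = re.compile(
    r"(?P<rule>\S+)\s+(?P<score>\d+(?:\.\d+)?)(?:\s+\((?P<detail>[^)]*)\))?"
)


def parse_matches(cell):
    """Split a 'matches' cell into (rule, score, detail) tuples."""
    results = []
    for part in cell.split(","):
        m = MATCH_RE.fullmatch(part.strip())
        if m:
            results.append((m["rule"], float(m["score"]), m["detail"]))
    return results


print(parse_matches("inn 100.00,inn 99.02"))
print(parse_matches("datetime 94.12 (dt:date:date_2)"))
```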

Database analysis examples

# Scan MongoDB database 'fns', save results as result.json and format output as 'full'
$ metacrafter scan-mongodb --dbname fns -o result.json -f full

# Scan Postgres database 'dbname', with schema 'public'.
$ metacrafter scan-db --schema public --connstr postgresql+psycopg2://username:[email protected]:15432/dbname

Rules

All rules are described as YAML files. By default, rules are loaded from the 'rules' directory or from the list of directories given in the .metacrafter file, which is in YAML format.

Each rule applies either to fields or to data, as set by the rule's type parameter (field or data).

Compare engines are set by the match parameter in the rule description:

  • text - checks the text for an exact match against one of the listed text values. Text values are delimited by commas (',')
  • ppr - parses the text with PyParsing. The rule is written as Python code using PyParsing objects such as Word(nums, exact=4)
  • func - checks the text with a given Python function. The function should accept one string parameter and return True or False

How to write rules

Function (func)

Example: a Russian administrative legal act/law matched by a custom function

  runpabyfunc:
    key: runpa
    name: Russian legal act / law
    maxlen: 500
    minlen: 3
    priority: 1
    match: func
    type: data
    rule: metacrafter.rules.ru.gov.is_ru_law
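
The function named in rule must take one string and return True or False. As an illustration of that contract only, here is a hypothetical validator for UUID/GUID values. is_uuid is not taken from the Metacrafter code base, and the dotted path you would put in rule (for example mypackage.rules.is_uuid) is likewise an assumption.

```python
import uuid


def is_uuid(value: str) -> bool:
    """Return True if the string parses as a UUID/GUID, False otherwise."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


print(is_uuid("123e4567-e89b-12d3-a456-426614174000"))
print(is_uuid("hello"))
```

Keeping the validator free of side effects and exceptions makes it safe to call on every scanned value.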

Exact text match (text)

Example: matching a middle name by exact field name

  midname:
    key: person_midname
    name: Person midname by known
    rule: midname,secondname,middlename,mid_name,middle_name
    type: field
    match: text
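
To make the text engine concrete, the sketch below checks a field name against the comma-delimited values of the rule above. It is not Metacrafter's implementation: make_text_matcher is a hypothetical helper, and the case-insensitive comparison is an assumption.

```python
def make_text_matcher(rule: str):
    """Build a predicate from a comma-delimited list of exact values."""
    names = {v.strip().lower() for v in rule.split(",") if v.strip()}

    def matches(field_name: str) -> bool:
        # Assumption: comparison ignores case and surrounding whitespace.
        return field_name.strip().lower() in names

    return matches


is_midname = make_text_matcher("midname,secondname,middlename,mid_name,middle_name")
print(is_midname("middle_name"))
print(is_midname("name"))
```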

PyParsing rule (ppr)

Example: a Russian cadastral number

  rukadastr:
    key: rukadastr
    name: Russian land territory cadastral identifier
    rule: Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=1, max=2) + Literal(':').suppress() + Word(nums, min=6, max=7) + Literal(':').suppress() + Word(nums, min=1, max=6)
    maxlen: 20
    minlen: 12
    priority: 1
    match: ppr
    type: data
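
To try sample values against this pattern without PyParsing installed, a regular expression can approximate the expression above. This is a swapped-in technique, not the ppr engine itself: it does not reproduce PyParsing's default whitespace skipping, and looks_like_rukadastr is a hypothetical helper. The minlen and maxlen checks mirror the values in the rule.

```python
import re

# Four colon-separated digit groups: 1-2, 1-2, 6-7 and 1-6 digits.
RUKADASTR_RE = re.compile(r"[0-9]{1,2}:[0-9]{1,2}:[0-9]{6,7}:[0-9]{1,6}")


def looks_like_rukadastr(value: str, minlen: int = 12, maxlen: int = 20) -> bool:
    """Apply the length limits first, then require a full pattern match."""
    if not (minlen <= len(value) <= maxlen):
        return False
    return RUKADASTR_RE.fullmatch(value) is not None


print(looks_like_rukadastr("77:01:0001001:1234"))
print(looks_like_rukadastr("77-01-0001001-1234"))
```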

Commercial support

Please write to [email protected] or [email protected] to request beta access to the commercial API. The commercial API supports 195 field and data rules and comes with dedicated support.

Comments
  • (sqlite3.OperationalError) no such module: VirtualSpatialIndex

    Error processing 008564_pal_features_v3.sqlite - (sqlite3.OperationalError) no such module: VirtualSpatialIndex [SQL: SELECT * FROM 'SpatialIndex' LIMIT 10000] (Background on this error at: http://sqlalche.me/e/13/e3q8)

    File 008564_pal_features_v3.zip

    bug 
    opened by ivbeg 2
  • Can I apply rules (eg pii) during scan-db

    I have successfully run scan-db against my database.

    I want to run scan-db with the pii rule but cannot see how this is possible from the examples. Is there an option to do this?

    Many thanks

    opened by ian-lewis-d 1
  • Consider adding boolean rules with prefixes

    Consider adding boolean rules with the prefixes "is_", "has_" and "was_", postfixes such as "_flag", etc.

    From GitSchemas datasets analysis consider adding rules:

    • [ ] prefix based with prefixes: "is_" and "has_", "show_", 'was_'
    • [ ] name based with names: "deleted", "enabled", "approved", "active"
    • [ ] postfix based with postfixes: "_flag"

    Additional verification should include that field has no more than 2 values (yes or no) or 3 values including NULL (yes, no, None).

    enhancement 
    opened by ivbeg 0
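
The proposal above can be sketched as a two-step check: a name test using the listed prefixes, names and postfixes, then a value test that allows at most two distinct non-null values. This is only an illustration of the idea in the issue. The function names are hypothetical and nothing here is existing Metacrafter code.

```python
BOOL_PREFIXES = ("is_", "has_", "show_", "was_")
BOOL_NAMES = {"deleted", "enabled", "approved", "active"}
BOOL_POSTFIXES = ("_flag",)


def name_suggests_boolean(field: str) -> bool:
    """Check the field name against the prefixes, names and postfixes."""
    name = field.lower()
    return (
        name.startswith(BOOL_PREFIXES)
        or name in BOOL_NAMES
        or name.endswith(BOOL_POSTFIXES)
    )


def values_look_boolean(values) -> bool:
    """Allow at most two distinct values, not counting None (NULL)."""
    non_null = set(values) - {None}
    return len(non_null) <= 2


def is_boolean_field(field: str, values) -> bool:
    return name_suggests_boolean(field) and values_look_boolean(values)


print(is_boolean_field("is_active", [True, False, None]))
print(is_boolean_field("has_email", ["yes", "no", "maybe"]))
```
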
  • Is there an integration for Datahub?

    Hi,

    I'm in the process of setting up Datahub (https://datahubproject.io) at our organisation and I wanted to know if there is a way to load the Metacrafter PII labels onto entities in Datahub?

    Many thanks, Ian

    enhancement 
    opened by ian-lewis-d 1
  • Add schema for report JSON and improve reporting

    At the moment the JSON file of the metadata scanning report is not structured well enough. Improvements should include:

    • [ ] Add Cerberus schema (more info https://docs.python-cerberus.org)
    • [ ] Add scanning datetime
    • [ ] Add source info: source type, filename, connection string, etc. Make sure there are no secrets in the connection string
    • [ ] Move 'table' to 'source' subtag
    • [ ] Add tests to validate reports with Cerberus validator
    enhancement 
    opened by ivbeg 0
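
One way to keep secrets out of a reported connection string is to mask the password before writing it. The sketch below uses only the Python standard library. redact_connstr is a hypothetical helper, not part of Metacrafter, and the connection string in the example is made up. Note that urlsplit lowercases the host name and drops IPv6 brackets, so this is a starting point rather than a complete solution.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_connstr(connstr: str, mask: str = "***") -> str:
    """Return the connection string with any password replaced by a mask."""
    parts = urlsplit(connstr)
    if parts.password is None:
        return connstr
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    netloc = f"{parts.username or ''}:{mask}@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


safe = redact_connstr("postgresql+psycopg2://user:secret@localhost:5432/dbname")
print(safe)
```
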
  • Add support of NoSQL databases

    Add support for the following NoSQL databases and search engines: MongoDB, ArangoDB, Milvus, ArcadeDB, ElasticSearch, OpenSearch, MeiliSearch, Apache Cassandra, StarGate (MongoDB-like API over NoSQL databases)

    The current state of database support:

    • [x] MongoDB
    • [ ] ArangoDB
    • [ ] ElasticSearch
    • [ ] Meilisearch
    • [ ] Milvus
    • [ ] OpenSearch
    • [ ] ArcadeDB

    Other tasks:

    • [ ] Write universal class for NoSQL document based databases
    • [ ] Replace command-line command 'scan-mongodb' with 'scan-nosql' or update command 'scan-db' with NoSQL databases connection strings
    • [ ] Write documentation with connection strings examples and limitations
    • [ ] Write tests for each database type
    enhancement 
    opened by ivbeg 0
  • Object of type bytes is not JSON serializable - Error processing some SQLite files

    The error "Object of type bytes is not JSON serializable" is caused by table fields of type bytes. Better type detection is needed, along with serialization of bytes values in the JSON report. The error is raised by the reporting function, not by the processing step.

    Example 000001_run-546.zip

    bug 
    opened by ivbeg 1
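
A common way to avoid this error in Python is to pass a default handler to json.dumps that converts bytes to text. The sketch below encodes bytes as Base64. That choice and the name encode_bytes are assumptions for illustration, not the fix adopted in Metacrafter.

```python
import base64
import json


def encode_bytes(obj):
    """Fallback for json.dumps: turn bytes into a Base64 string."""
    if isinstance(obj, (bytes, bytearray)):
        return base64.b64encode(bytes(obj)).decode("ascii")
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


row = {"id": 1, "blob": b"\x00\xff"}
report = json.dumps(row, default=encode_bytes)
print(report)
```
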
  • sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) Could not decode to UTF-8 column

    Error processing SQLite database with non-unicode names for fields. Example 000012_world.zip

    Traceback (most recent call last):
      File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1284, in fetchall
        l = self.process_rows(self._fetchall_impl())
      File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1230, in _fetchall_impl
        return self.cursor.fetchall()
    sqlite3.OperationalError: Could not decode to UTF-8 column 'name' with text '\ufffdland Islands'

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "C:\Program Files\Python310\Scripts\metacrafter-script.py", line 33, in <module>
        sys.exit(load_entry_point('metacrafter==0.0.2', 'console_scripts', 'metacrafter')())
      File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\__main__.py", line 12, in main
        exit_status = cli()
      File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "C:\Program Files\Python310\lib\site-packages\click\core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\core.py", line 464, in scan_db
        acmd.scan_db(
      File "C:\Program Files\Python310\lib\site-packages\metacrafter-0.0.2-py3.10.egg\metacrafter\core.py", line 359, in scan_db
        items = [dict(u) for u in queryres.fetchall()]
      File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1288, in fetchall
        self.connection.handle_dbapi_exception(
      File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\base.py", line 1510, in handle_dbapi_exception
        util.raise_(
      File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\util\compat.py", line 182, in raise_
        raise exception
      File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1284, in fetchall
        l = self.process_rows(self._fetchall_impl())
      File "C:\Users\ibegt\AppData\Roaming\Python\Python310\site-packages\sqlalchemy\engine\result.py", line 1230, in _fetchall_impl
        return self.cursor.fetchall()
    sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) Could not decode to UTF-8 column 'name' with text '\ufffdland Islands' (Background on this error at: http://sqlalche.me/e/13/e3q8)

    bug 
    opened by ivbeg 1
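
At the level of Python's sqlite3 module, this decoding error can be avoided by setting a text_factory that replaces invalid bytes instead of raising. The sketch below reproduces the situation with an in-memory database and a made-up byte sequence. How such a factory would be wired into Metacrafter's SQLAlchemy connection is not shown and is left open here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Decode TEXT values leniently: invalid UTF-8 bytes become U+FFFD.
conn.text_factory = lambda raw: raw.decode("utf-8", errors="replace")

# X'C56C616E64' is the byte 0xC5 followed by "land": not valid UTF-8.
row = conn.execute("SELECT CAST(X'C56C616E64' AS TEXT)").fetchone()
value = row[0]
print(repr(value))
conn.close()
```

With the default text_factory (str), the same query raises sqlite3.OperationalError, as in the traceback above.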
Owner
APICrafter
APICrafter Data API project