SpyQL - SQL with Python in the middle

Overview

PyPI: https://pypi.python.org/pypi/spyql · CI: https://travis-ci.com/dcmoura/spyql · Docs: https://spyql.readthedocs.io/en/latest/ · codecov · code style: black · license: MIT

Concept

SpyQL is a query language that combines:

  • the simplicity and structure of SQL
  • with the power and readability of Python
SELECT
    date.fromtimestamp(purchase_ts) AS purchase_date,
    price * quantity AS total
FROM csv
WHERE department.upper() == 'IT'
TO json

SQL provides the structure of the query, while Python is used to define expressions, bringing along a vast ecosystem of packages.

SpyQL command-line tool

With the SpyQL command-line tool you can run SQL-like SELECTs powered by Python on top of text data (e.g. CSV and JSON). Data can come from files but also from data streams, such as Kafka, or from databases such as PostgreSQL. Basically, data can come from any command that outputs text :-). Moreover, data can be generated by a Python iterator! Take a look at the examples section to see how to query Parquet, process API calls, traverse directories of zipped JSONs, among many other things.
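
For instance, the query from the Concept section could be run over a hypothetical sales.csv file (the file and column names are assumptions for illustration):

spyql "
    SELECT
        date.fromtimestamp(purchase_ts) AS purchase_date,
        price * quantity AS total
    FROM csv
    WHERE department.upper() == 'IT'
    TO json
" < sales.csv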

SpyQL also allows you to easily convert between text data formats:

  • FROM: CSV, JSON, TEXT and Python iterators (YES, you can use a list comprehension as the data source)

  • TO: CSV, JSON, SQL (INSERT statements), pretty terminal printing, and terminal plotting.

The JSON format is JSON lines, where each line has a valid JSON object or array. Piping through jq allows SpyQL to handle any JSON input (more in the examples section).
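
For example, a JSON array of objects can be flattened into JSON lines with jq before querying (a sketch; a_file.json and a_field are placeholder names):

cat a_file.json | jq -c '.[]' | spyql "SELECT json->a_field FROM json TO csv"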

You can leverage command-line tools to process other file types like Parquet and XML (more in the examples section).

Installation

To install SpyQL, run this command in your terminal:

pip install spyql

Hello world

To test your installation run in the terminal:

spyql "SELECT 'Hello world' as Message TO pretty"

Output:

Message
-----------
Hello world

Try replacing the output format with json and csv, and try adding more columns, e.g. run in the terminal:

spyql "SELECT 'Hello world' as message, 1+2 as three TO json"

Output:

{"message": "Hello world", "three": 3}

Principles

Right now, the focus is on building a command-line tool that follows these core principles:

  • Simple: simple to use with a straightforward implementation
  • Familiar: you should feel at home if you are acquainted with SQL and Python
  • Light: small memory footprint that allows you to process large amounts of data
  • Useful: it should make your life easier, filling a gap in the ecosystem

Syntax

[ IMPORT python_module [ AS identifier ] [, ...] ]
SELECT [ DISTINCT | PARTIALS ] 
    [ * | python_expression [ AS output_column_name ] [, ...] ]
    [ FROM csv | spy | text | python_expression | json [ EXPLODE path ] ]
    [ WHERE python_expression ]
    [ GROUP BY output_column_number | python_expression  [, ...] ]
    [ ORDER BY output_column_number | python_expression
        [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] ]
    [ LIMIT row_count ]
    [ OFFSET num_rows_to_skip ]
    [ TO csv | json | spy | sql | pretty | plot ]
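
A sketch combining several of these clauses (the input file and column semantics are hypothetical):

spyql "
    SELECT col1 AS product, sum_agg(col2 * col3) AS revenue
    FROM csv
    WHERE col2 > 0
    GROUP BY 1
    ORDER BY 2 DESC NULLS LAST
    LIMIT 10
    TO pretty
" < sales.csv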

Notable differences to SQL

In SpyQL:

  • there is a guarantee that the order of the output rows is the same as in the input (if no reordering is done)
  • the AS keyword must precede a column alias definition (it is not optional as in SQL)
  • you can always access the nth input column by using the default column names colN (e.g. col1 for the first column)
  • currently only a small subset of SQL is supported, namely SELECT statements without sub-queries, joins, set operations, etc. (check the Syntax section)
  • sub-queries are achieved by piping (see the [Command line examples](#command-line-examples) section)
  • aggregation functions have the suffix _agg to avoid conflicts with Python's built-in functions:
Operation                   PostgreSQL                                                   SpyQL
--------------------------  -----------------------------------------------------------  ------------------------
Sum all values of a column  SELECT sum(col_name)                                         SELECT sum_agg(col_name)
Sum an array                SELECT sum(a) FROM (SELECT unnest(array[1,2,3]) AS a) AS t   SELECT sum([1,2,3])
  • expressions are pure Python:
SQL                                 SpyQL
----------------------------------  ------------------
x = y                               x == y
x BETWEEN a AND b                   a <= x <= b
CAST(x AS INTEGER)                  int(x)
CASE WHEN x > 0 THEN 1 ELSE -1 END  1 if x > 0 else -1
upper('hello')                      'hello'.upper()
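
To illustrate the default colN column names together with pure-Python expressions, a minimal sketch (assuming SpyQL's CSV reader detects the header row and infers the numeric types):

echo -e "a,b\n1,2\n30,40" | spyql "SELECT col1 + col2 AS total FROM csv TO json"

Expected output:

{"total": 3}
{"total": 70}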

Notable differences to Python

Additional syntax

We added additional syntax for making querying easier:

Python                         SpyQL shortcut               Purpose
-----------------------------  ---------------------------  ---------------------------------------------
json['hello']['planet earth']  json->hello->'planet earth'  Easy access of elements in dicts (e.g. JSONs)
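
For example, since JSON input is loaded into dict-like objects, the two queries below should be equivalent (field names are hypothetical):

spyql "SELECT json->client->name FROM json TO csv" < a_file.json
spyql "SELECT json['client']['name'] FROM json TO csv" < a_file.json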

NULL datatype

Python's None raises exceptions when operating on missing data, breaking query execution (e.g. None + 1 throws a TypeError). To overcome this, we created a NULL type that behaves as in SQL (e.g. NULL + 1 returns NULL), allowing queries to keep processing data.

Operation                 Native Python throws  SpyQL returns  SpyQL warning
------------------------  --------------------  -------------  -------------
NULL + 1                  NameError             NULL
a_dict['inexistent_key']  KeyError              NULL           yes
int('')                   ValueError            NULL           yes
int('abc')                ValueError            NULL           yes

The above dictionary key access only returns NULL if the dict is an instance of NullSafeDict. SpyQL adds NullSafeDict, which extends Python's native dict. JSONs are automatically loaded as NullSafeDict. Unless you are creating dictionaries on the fly, you do not need to worry about this.
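
A quick sketch of this behavior on a missing JSON field (hypothetical input; accessing the inexistent key yields NULL, which propagates through the addition and is written as null):

echo '{"a": 1}' | spyql "SELECT json->a + 1 AS b, json->missing + 1 AS c FROM json TO json"

Expected output (a warning about the missing key may be printed to stderr):

{"b": 2, "c": null}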

Importing python modules and user-defined functions

By default, spyql does some commonly used imports:

  • everything from the math module
  • datetime, date and timezone from the datetime module
  • the re module

SpyQL queries support a single import statement at the beginning of the query where several modules can be imported (e.g. IMPORT numpy AS np, sys SELECT ...). Note that the python syntax from module import identifier is not supported in queries.

In addition, you can create a Python file that is loaded before executing queries. Here you can define imports, functions, variables, etc. using regular Python code. Everything defined in this file is available to all your spyql queries. The file should be located at XDG_CONFIG_HOME/spyql/init.py. If the environment variable XDG_CONFIG_HOME is not defined, it defaults to HOME/.config (e.g. /Users/janedoe/.config/spyql/init.py).
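
A minimal sketch of such an init.py (the contents below are hypothetical; any regular Python code works here):

import hashlib

# helper function that becomes available in every spyql query
def md5(s):
    return hashlib.md5(s.encode("utf-8")).hexdigest()

# constant available in every spyql query
VAT_RATE = 0.23

After defining this, a query could use it directly, e.g. spyql "SELECT md5(col1) FROM text".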

Example queries

You can run the following example queries in the terminal: spyql "the_query" < a_data_file

Example data files are not provided in most cases.

Query a CSV (and print a pretty table)

SELECT a_col_name, 'positive' if col2 >= 0 else 'negative' AS sign
FROM csv
TO pretty

Convert CSV to a flat JSON

SELECT * FROM csv TO json

Convert from CSV to a hierarchical JSON

SELECT {'client': {'id': col1, 'name': col2}, 'price': 120.40}
FROM csv TO json

or

SELECT {'id': col1, 'name': col2} AS client, 120.40 AS price
FROM csv TO json

JSON to CSV, filtering out NULLs

SELECT json->client->id AS id, json->client->name AS name, json->price AS price
FROM json
WHERE json->client->name is not NULL
TO csv
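
A hypothetical sample input:

{"client": {"id": 1, "name": "Joe"}, "price": 10.5}
{"client": {"id": 2}, "price": 6.0}

Output (the second row is dropped because the client name is NULL):

id,name,price
1,Joe,10.5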

Explode JSON to CSV

SELECT json->invoice_num AS id, json->items->name AS name, json->items->price AS price
FROM json
EXPLODE json->items
TO csv

Sample input:

{"invoice_num" : 1028, "items": [{"name": "tomatoes", "price": 1.5}, {"name": "bananas", "price": 2.0}]}
{"invoice_num" : 1029, "items": [{"name": "peaches", "price": 3.12}]}

Output:

id,name,price
1028,tomatoes,1.5
1028,bananas,2.0
1029,peaches,3.12

Python iterator/list/comprehension to JSON

SELECT 10 * cos(col1 * ((pi * 4) / 90))
FROM range(80)
TO json

or

SELECT col1
FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
TO json

Importing python modules

Here we import hashlib to calculate an md5 hash for each input line. hashlib is part of Python's standard library, so there is nothing to install before running this example.

IMPORT hashlib as hl
SELECT hl.md5(col1.encode('utf-8')).hexdigest()
FROM text

Getting the top 5 records

SELECT int(score) AS score, player_name
FROM csv
ORDER BY 1 DESC NULLS LAST, score_date
LIMIT 5

Aggregations

Totals by player, alphabetically ordered.

SELECT json->player_name, sum_agg(json->score) AS total_score
FROM json
GROUP BY 1
ORDER BY 1
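
A hypothetical sample input:

{"player_name": "Ana", "score": 7}
{"player_name": "Ben", "score": 5}
{"player_name": "Ana", "score": 10}

Output (CSV is assumed as the default output format; the derived column name is illustrative):

player_name,total_score
Ana,17
Ben,5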

Partial aggregations

Calculating the cumulative sum of a variable using the PARTIALS modifier. Also demoing the lag aggregator.

SELECT PARTIALS 
    json->new_entries, 
    sum_agg(json->new_entries) AS cum_new_entries,
    lag(json->new_entries) AS prev_entries
FROM json
TO json

Sample input:

{"new_entries" : 10}
{"new_entries" : 5}
{"new_entries" : 25}
{"new_entries" : null}
{}
{"new_entries" : 100}

Output:

{"new_entries" : 10,   "cum_new_entries" : 10,  "prev_entries": null}
{"new_entries" : 5,    "cum_new_entries" : 15,  "prev_entries": 10}
{"new_entries" : 25,   "cum_new_entries" : 40,  "prev_entries": 5}
{"new_entries" : null, "cum_new_entries" : 40,  "prev_entries": 25}
{"new_entries" : null, "cum_new_entries" : 40,  "prev_entries": null}
{"new_entries" : 100,  "cum_new_entries" : 140, "prev_entries": null}

If PARTIALS was omitted, the result would be equivalent to the last output row.

Distinct rows

SELECT DISTINCT *
FROM csv

Command line examples

To run the following examples, type Ctrl-x Ctrl-e in your terminal. This will open your default editor (e.g. emacs or vim). Paste the code of one of the examples, save and exit.

Queries on Parquet with directories

Here, find traverses a directory and executes parquet-tools for each parquet file, dumping each file to JSON format. jq -c makes sure that the output has one JSON per line before handing it over to spyql. This is far from an efficient way to query parquet files, but it can be a handy option if you need to do a quick inspection.

find /the/directory -name "*.parquet" -exec parquet-tools cat --json {} \; |
jq -c |
spyql "
	SELECT json->a_field, json->a_num_field * 2 + 1
	FROM json
"

Querying multiple json.gz files

gzcat *.json.gz |
jq -c |
spyql "
	SELECT json->a_field, json->a_num_field * 2 + 1
	FROM json
"

Querying YAML / XML / TOML files

yq converts YAML, XML and TOML files to JSON, allowing you to easily query any of these with spyql.

cat file.yaml | yq -c | spyql "SELECT json->a_field FROM json"
cat file.xml | xq -c | spyql "SELECT json->a_field FROM json"
cat file.toml | tomlq -c | spyql "SELECT json->a_field FROM json"

Kafka to PostgreSQL pipeline

Read data from a Kafka topic and write to the PostgreSQL table named customer.

kafkacat -b the.broker.com -t the.topic |
spyql -Otable=customer -Ochunk_size=1 --unbuffered "
	SELECT
		json->customer->id AS id,
		json->customer->name AS name
	FROM json
	TO sql
" |
psql -U an_user_name -h a.host.com a_database_name

Monitoring statistics in Kafka

Read data from a Kafka topic, continuously calculating statistics.

kafkacat -b the.broker.com -t the.topic |
spyql --unbuffered "
	SELECT PARTIALS
        count_agg(*) AS running_count,
		sum_agg(value) AS running_sum,
		min_agg(value) AS min_so_far, 
        value AS current_value
	FROM json
	TO csv
" 

Sub-queries (piping)

A special file format (spy) is used to efficiently pipe data between queries.

cat a_file.json |
spyql "
	SELECT ' '.join([json->first_name, json->middle_name, json->last_name]) AS full_name
	FROM json
	TO spy" |
spyql "SELECT full_name, full_name.upper() FROM spy"

Queries over APIs

curl https://reqres.in/api/users?page=2 |
spyql "
	SELECT
		json->data->email AS email,
		'Dear {}, thank you for being a great customer!'.format(json->data->first_name) AS msg
	FROM json
	EXPLODE json->data
	TO json
"

Plotting to the terminal

spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO plot
"

Plotting with gnuplot

To the terminal:

spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO csv
" |
sed 1d |
feedgnuplot --terminal 'dumb 80,30' --exit --lines

To GUI:

spyql "
    SELECT col1
    FROM [10 * cos(i * ((pi * 4) / 90)) for i in range(80)]
    TO csv
" |
sed 1d |
feedgnuplot --lines --points --exit

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Comments
  • Interactive SpyQL

    This is v0 for interactive spyql. It implements the following new things:

    1. Q class and q function in interactive.py: This is the class/function that the user can add in their script
    2. InteractiveProcessor in processor.py: yields the rows loaded by the Processor
    3. InteractiveWriter in processor.py: writer that adds the output in a list

    Other Important changes:

    1. init_vars, go and _go now have user_query_vars as a valid input
    2. Processor can now be explicitly defined as interactive

    Misc:

    1. ifs changed to elifs where needed

    Issue: #64

    To test this, run python tests/interactive_test.py.

    opened by yashbonde 12
  • Refactor `parse_structure` and allow comment in query

    The purposes of this PR are:

    • Refactor parse_structure function (cf. https://github.com/dcmoura/spyql/issues/85#issue-1459257843)
    • Support for line comments (cf. https://github.com/dcmoura/spyql/issues/84)
    enhancement 
    opened by Hayashi-Yudai 9
  • Key-value aggregations and lookups

    This PR closes #59.

    First, it introduces dict_agg for key-value aggregations. Example:

    $ cat codes.csv
    MCC,MCC (int),MNC,MNC (int),ISO,Country,Country Code,Network
    289,649,88,2191,ge,Abkhazia,7,A-Mobile
    289,649,68,1679,ge,Abkhazia,7,A-Mobile
    289,649,67,1663,ge,Abkhazia,7,Aquafon
    412,1042,01,31,af,Afghanistan,93,AWCC
    412,1042,50,1295,af,Afghanistan,93,Etisalat
    412,1042,30,783,af,Afghanistan,93,Etisalat
    452,1106,04,79,vn,Vietnam,84,Viettel
    452,1106,02,47,vn,Vietnam,84,VinaPhone
    543,1347,299,665,wf,Wallis and Futuna,,Failed Calls
    543,1347,01,31,wf,Wallis and Futuna,,Manuia
    421,1057,999,2457,ye,Yemen,967,Fix Line
    421,1057,04,79,ye,Yemen,967,HITS/Y Unitel
    421,1057,01,31,ye,Yemen,967,Sabaphone
    421,1057,03,63,ye,Yemen,967,Yemen Mob. CDMA
    645,1605,01,31,zm,Zambia,260,Airtel
    645,1605,299,665,zm,Zambia,260,Failed Calls
    
    $ spyql "
        SELECT dict_agg(MCC, Country) AS json 
        FROM csv TO json
      " < codes.csv > code_country.json
    
    $ jq . code_country.json                                        
    {
      "289": "Abkhazia",
      "412": "Afghanistan",
      "452": "Vietnam",
      "543": "Wallis and Futuna",
      "421": "Yemen",
      "645": "Zambia"
    }
    

    Second, it introduces the ~~kv function~~ -J option to allow loading JSON objects for key-value lookups, emulating left equi joins. Example:

    $ cat towers_mini.csv 
    radio,mcc,net,area,cell,unit,lon,lat,range,samples,changeable,created,updated,averageSignal
    UMTS,262,2,776,165186,0,13.329163,52.46521,1000,2,1,1286645042,1295769656,0
    GSM,262,3,1075,5651,0,13.329163,52.46521,1000,2,1,1286645042,1295769656,0
    UMTS,262,2,801,165123,0,13.295516967773,52.494735717773,1000,1,1,1286864565,1286864565,0
    UMTS,262,2,801,165121,0,13.301697,52.499886,1000,4,1,1286884439,1297735807,0
    GSM,624,2,6,13883,0,9.704361,4.063911,1588,8,1,1352031870,1374212876,0
    GSM,624,2,6,34381,0,9.709009,4.066368,2484,13,1,1352031870,1380389330,0
    GSM,452,1,10068,5293,0,105.8031463623,20.959854125977,1000,2,1,1459692342,1488347104,0
    GSM,645,3,130,3282,0,28.070755004883,-14.902267456055,1000,1,1,1459743588,1459743588,0
    GSM,645,3,1,32221,0,28.252716,-15.439224,1000,2,1,1369353852,1369353852,0
    GSM,645,3,1,33932,0,28.255325,-15.436752,1000,5,1,1369353888,1369353888,0
    GSM,645,3,1,31662,0,28.258072,-15.434555,1000,5,1,1369353960,1369353960,0
    
    $ spyql -Jcountries=code_country.json "SELECT mcc, countries[mcc] AS country, count_agg(1) AS n_recs FROM csv GROUP BY 1, 2 ORDER BY 1 TO pretty" < towers_mini.csv
      mcc  country      n_recs
    -----  ---------  --------
      262                    4
      452  Vietnam           1
      624                    2
      645  Zambia            4
    
    opened by dcmoura 7
  • Code improvements

    Some cases where the code could be more pythonic:

    • some for loops where range(len(l)) might be replaced by enumerate
    • some functions where yield might be used to return a generator

    Changes should be assessed for performance.

    enhancement good first issue 
    opened by dcmoura 6
  • code improvement for issue #60

    Improve the code.

    TODOs

    • [x] Fix code to more pythonic style
      • Fix import style
      • Remove some unused imports
    • [x] Fix error in lint by flake8

    Issue: https://github.com/dcmoura/spyql/issues/60

    opened by Hayashi-Yudai 5
  • LIKE clause

    Adds a LIKE function to the WHERE clause for easier matching. Closes #17

    Some notes

    • Right now it can match not only strings but also everything else, as the values are stringified before matching. Let me know if this is not intended and if it should be restricted to strings.
    • It only supports the % wildcard operator.
    • Let me know if the commit messages should follow a specific format

    Future work

    Either in the scope of this PR or in a future one, other basic matching features are still missing, like _ for single-character matching or [] to specify a range.

    enhancement 
    opened by ricardocchaves 4
  • Fixes git-actions tests on py3.6 and adds py3.11

    The Python 3.6 environment stopped launching. This seems to be related to an upgrade of the Ubuntu version, and support for py3.6 was not included in the new version.

    Solution based on https://github.com/actions/setup-python/issues/543#issuecomment-1335614916

    opened by dcmoura 3
  • [FR] COUNT(DISTINCT col1) ... GROUP BY

    Thank you for the great tool!

    I'd like to request a feature that works like SQL's COUNT(DISTINCT col1) ... GROUP BY. I tried count_agg(distinct col1), but got a syntax error.

    opened by Minyus 3
  • Full-featured documentation

    This PR proposes brand new documentation and closes #34.

    Please ignore the commits... the history of this PR is quite disturbing 😅 Everything will be squashed into a single commit when merging to master.

    Please focus on 1) the structure and then 2) the content of:

    • https://github.com/dcmoura/spyql/blob/docs/README.rst
    • https://spyql.readthedocs.io/en/docs/

    I am proposing deprecating the -> operator. It will still be supported but is not mentioned in the documentation. From now on, the . (dot) operator should be the standard way of accessing fields in dicts, followed by standard Python syntax (a_dict["key"]). The major drawback is that we will need to record a new video demoing the CLI and update some posts (e.g. Stack Overflow).

    There is still some work to do on the lib side. Namely, we should add a section under Examples/Recipes. However, there is still some dev work to do, namely on accessing Pandas dataframes and series, as well as numpy arrays. Eventually we can also visit polars and other libs for tabular data. When we do that, we can extend the documentation.

    In this first review let's try to focus on the big fish, and not so much on the details. My goal is to publish the documentation soon, and then we can iterate on it over time. Thanks!

    documentation 
    opened by dcmoura 3
  • Column names with TEXT format

    When using TEXT as an input format, I couldn't find a way of configuring the column name. The column alias only seems to be applied when we don't specify * in the SELECT statement, which we have to do for TEXT files. Am I missing something? Here's an example:

    cat /tmp/test.txt | spyql -Otable=domains 'Select * as colname from text To SQL'
    INSERT INTO "domains"("col1") VALUES ('aaaaaaaaaa'),('bbbbbbbbb'),('cccccccccc'),('dddddddddd'),('eeeeeeeeee');
    
    cat /tmp/test.txt | spyql -Otable=domains 'Select 1 as colname from text To SQL' 
    INSERT INTO "domains"("colname") VALUES (1),(1),(1),(1),(1);
    

    Expected result would be:

    cat /tmp/test.txt | spyql -Otable=domains 'Select * as colname from text To SQL'
    INSERT INTO "domains"("colname") VALUES ('aaaaaaaaaa'),('bbbbbbbbb'),('cccccccccc'),('dddddddddd'),('eeeeeeeeee');
    
    question 
    opened by henshin 3
  • Cannot dump JSON object with Null values to string

    Use Case

    I'm using the CLI to import JSON Line log files into a Postgres database. One of the columns I'm trying to import is jsonb, and needs to be inserted as a JSON string.

    Approach

    spyql 'IMPORT json SELECT json.dumps(.a) FROM {"a": {"foo": 1, "bar": 2}} TO sql'
    INSERT INTO "table_name"("dumps_a") VALUES ('{"foo": 1, "bar": 2}');
    

    This works fine until it hits a null value:

    spyql 'IMPORT json SELECT json.dumps(.a) FROM {"a": {"foo": 1, "bar": null}} TO sql'
    ERROR	could not evaluate SELECT expression #1: json.dumps(row.a)
    	at data row #1: [{'a': {'foo': 1, 'bar': NULL}}]
    TypeError: Object of type NullType is not JSON serializable
    

    I also tried passing the default kwarg to json.dumps and hit a different error:

    spyql 'IMPORT json SELECT json.dumps(.a, default=str) FROM {"a": {"foo": 1, "bar": null}} TO sql'
    ERROR	could not evaluate SELECT expression #1: json.dumps(row.a, default=str)
    	at data row #1: [{'a': {'foo': 1, 'bar': NULL}}]
    ValueError: Circular reference detected
    

    Very open to alternative approaches; thank you so much for making this great tool available. :)

    opened by mgburns 2
  • [FR] join 2 CSV files

    Hi @dcmoura, I see examples of joining with a JSON file in the documentation; would joining two CSV files be supported? https://spyql.readthedocs.io/en/latest/recipes.html?highlight=join#equi-joins

    opened by Minyus 6
  • general text file support

    I'm using spyql to handle general text files (single-column, not comma-separated), but ran into a problem.

    A single-column text file like the following is misinterpreted as multiple columns by spyql.

    test.csv:

    user_id
    a11e11
    b22e22
    
    $ spyql "SELECT * FROM csv('test.csv') TO pretty"
    us      r_id
    ----  ------
    a11       11
    b22       22
    

    If there are no common characters among rows, the following error is returned.

    ERROR   Could not detect CSV dialect from input
    _csv.Error: Could not determine delimiter
    

    I found that explicitly setting a delimiter like csv('test.csv', delimiter=',') works as a workaround in the current version, but I would like to suggest the following options.

    [Option 1] Modify default behavior of csv()

    1. As a delimiter character, search only for "," (and especially not alphanumeric characters)
    2. If no delimiter is detected (and each row is not too long), read it as a single-column text file rather than returning an error

    [Option 2] Add an alternative to csv() which can read a single-column text file (something like txt())

    opened by Minyus 6
  • Cannot run CI/CD pipe on Python 3.11

    Tried to add Python 3.11 but broke the build:

    Installing build dependencies: started
    Installing build dependencies: finished with status 'done'
    Getting requirements to build wheel: started
    Getting requirements to build wheel: finished with status 'done'
    Preparing metadata (pyproject.toml): started
    Preparing metadata (pyproject.toml): finished with status 'error'
    error: subprocess-exited-with-error

    × Preparing metadata (pyproject.toml) did not run successfully.
    │ exit code: 1
    ╰─> [11 lines of output]
        💥 maturin failed
          Caused by: Cargo metadata failed. Does your crate compile with `cargo build`?
          Caused by: `cargo metadata` exited with an error: error: failed to run `rustc` to learn about target-specific information

        Caused by:
          process didn't exit successfully: `rustc - --crate-name ___ --print=file-names -Z mutable-noalias=yes -C target-feature=+sse2 --crate-type bin --crate-type rlib --crate-type dylib --crate-type cdylib --crate-type staticlib --crate-type proc-macro --print=sysroot --print=cfg` (exit status: 1)
          --- stderr
          error: the option `Z` is only accepted on the nightly compiler
        Error running maturin: Command '['maturin', 'pep517', 'write-dist-info', '--metadata-directory', '/tmp/pip-modern-metadata-2ryt4i58', '--interpreter', '/opt/hostedtoolcache/Python/3.11.0/x64/bin/python', '--manylinux=off', '--strip=on']' returned non-zero exit status 1.
        Checking for Rust toolchain....
        Running `maturin pep517 write-dist-info --metadata-directory /tmp/pip-modern-metadata-2ryt4i58 --interpreter /opt/hostedtoolcache/Python/3.11.0/x64/bin/python --manylinux=off --strip=on`
        [end of output]

    note: This error originates from a subprocess, and is likely not a problem with pip.
    error: metadata-generation-failed

    × Encountered error while generating package metadata.
    ╰─> See above for output.

    note: This is an issue with the package mentioned above, not pip.
    hint: See above for details.
    Error: Process completed with exit code 1.
    
    bug good first issue 
    opened by dcmoura 0
  • Support reading and writing of JSON objects (not JSON lines)

    Currently SpyQL only allows reading and writing JSON lines. Writing JSON arrays can be done using dict_agg, aggregating everything into an array and writing a JSON with a single line.

    The idea is to add an argument lines=True to the json and orjson writers and processors. The processor should be able to handle single-object files as well as arrays of objects or arrays of scalars. When lines is False, the processor should load the full input into memory and then parse it. While this is not ideal, it is the most straightforward implementation. In addition, arrays of JSON shouldn't be used for large data; in those cases JSON lines should be used instead.

    The writer should write an array of objects when lines is False.

    core 
    opened by dcmoura 1