An extension package of 🤗 Datasets that provides support for executing arbitrary SQL queries on HF datasets

Last update: Dec 15, 2022

datasets_sql

A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on HF datasets. It uses DuckDB as a SQL engine and follows its query syntax.

Installation

pip install datasets_sql

Quick Start

from datasets import load_dataset, Dataset
from datasets_sql import query

imdb_dset = load_dataset("imdb", split="train")

# Keep only the rows where the `text` field has more than 1000 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 1000")

# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")

# Remove duplicates (keep one row per distinct `text` value)
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")

# Get the average length of the `text` field
imdb_query_dset4 = query("SELECT AVG(length(text)) as avg_text_length FROM imdb_dset")

order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})

customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})

# Join two tables
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)
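
To check what the join above should return without downloading anything, the same SQL can be run against the standard-library sqlite3 module. This is only a sketch with a stand-in engine: datasets_sql itself uses DuckDB, the tables here are created by hand to mirror the Quick Start variables, and the ORDER BY clause is an addition made here so the row order is deterministic.

```python
import sqlite3

# Stand-in engine: sqlite3 from the standard library, NOT the DuckDB engine
# that datasets_sql uses. Table names mirror the Quick Start variables.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE order_customer_dset (order_id INTEGER, customer_id INTEGER)")
con.execute("CREATE TABLE customer_dset (customer_id INTEGER, name TEXT)")
con.executemany(
    "INSERT INTO order_customer_dset VALUES (?, ?)",
    [(10001, 3), (10002, 1), (10003, 2)],
)
con.executemany(
    "INSERT INTO customer_dset VALUES (?, ?)",
    [(1, "John"), (2, "Jane"), (3, "Mary")],
)

join_sql = (
    "SELECT order_id, name FROM order_customer_dset "
    "INNER JOIN customer_dset "
    "ON order_customer_dset.customer_id = customer_dset.customer_id "
    "ORDER BY order_id"  # added in this sketch for a deterministic row order
)
join_rows = con.execute(join_sql).fetchall()
print(join_rows)  # [(10001, 'Mary'), (10002, 'John'), (10003, 'Jane')]
```

Each order is matched to the customer whose customer_id it carries, so the result has one row per order.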


Comments

How to use query function if dataset is a class attribute
Awesome library!

This is probably a generic duckdb question but figured I'd ask here first. If I store a reference to a dataset in a class attribute, how do I get query to find my dataset?

Repro:

import datasets
from datasets_sql import query


class DatasetQuery:
    def __init__(self, dataset_name, split="train"):
        ds = datasets.load_dataset(dataset_name, split=split)
        self.dataset = ds

    def query(self, query_str):
        return query(query_str)


dq = DatasetQuery("huggingnft/boredapeyachtclub")
dq.query("select * from ?? limit 10;")

What do I put in the from clause? I tried ds and self.dataset but neither works. I get ValueError: The dataset `ds` not found in the namespace.
opened by freddyaboulton 4
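
The error message suggests that table names are resolved as plain variable names in some calling namespace. The sketch below illustrates that general mechanism with the standard-library inspect module; it is not the library's implementation, and resolve_name and Holder are hypothetical names made up for this illustration. Under that assumption, ds fails because it was local to __init__, and self.dataset fails because it is an attribute expression rather than a variable name.

```python
import inspect


def resolve_name(name):
    """Hypothetical helper: find a plain variable name in the caller's scope."""
    caller = inspect.currentframe().f_back
    if name in caller.f_locals:
        return caller.f_locals[name]
    if name in caller.f_globals:
        return caller.f_globals[name]
    raise ValueError(f"The dataset `{name}` not found in the namespace")


class Holder:
    def __init__(self, rows):
        self.dataset = rows

    def lookup_attribute(self):
        # "self.dataset" is an attribute expression, not a variable name,
        # so a name-based lookup cannot find it.
        return resolve_name("self.dataset")

    def lookup_alias(self):
        # Binding the attribute to a local name makes it visible to the lookup.
        dataset = self.dataset
        return resolve_name("dataset")


holder = Holder([1, 2, 3])
alias_result = holder.lookup_alias()

try:
    holder.lookup_attribute()
    attribute_found = True
except ValueError:
    attribute_found = False

print(alias_result, attribute_found)
```

Whether a local alias is enough for the real query function depends on which frame it inspects, which the issue text does not say.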
The readme demos are broken
I tried running an example from the repo but the code is broken:

imdb_dset = load_dataset("imdb", split="train")
dataset = query("SELECT text FROM imdb_dset")

results in AttributeError: 'duckdb.DuckDBPyConnection' object has no attribute 'fetch_arrow_chunk'

I am using datasets_sql version 0.1.1 and datasets version 2.5.2
opened by mo6zes 1
Be able to stream the results of query
I'd like to query a large remote dataset (on the hub or elsewhere) and then stream the results of the query so that I don't have to download the entire dataset to my machine.

For example, you could query diffusiondb for images generated with prompts containing the word "ceo" to visualize biases:

SELECT * FROM poloclub/diffusiondb WHERE contains(prompt, 'ceo')

This combined with https://github.com/huggingface/datasets-server/issues/398 would open the door for a lot of cool applications of gradio + datasets where users could interactively explore datasets that don't fit on their machines and create spaces without having to download/store large datasets.

I see that data can be streamed from duckdb with pyarrow: https://duckdb.org/2021/12/03/duck-arrow.html . I wonder if this can be leveraged for this use case.
opened by freddyaboulton 5
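
The request above is about consuming query results incrementally instead of materializing them all at once. As a rough illustration of that consumption pattern only, the sketch below yields fixed-size batches from a cursor using the standard-library sqlite3 module. It is a stand-in: it does not use DuckDB or Arrow, it does not stream a remote dataset, and stream_query, the prompts table and its rows are hypothetical.

```python
import sqlite3


def stream_query(con, sql, batch_size=2):
    """Yield result rows in batches instead of fetching them all at once."""
    cursor = con.execute(sql)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch


# Hypothetical toy table standing in for a large dataset of prompts.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prompts (id INTEGER, prompt TEXT)")
con.executemany(
    "INSERT INTO prompts VALUES (?, ?)",
    [
        (1, "portrait of a ceo"),
        (2, "a cat on a sofa"),
        (3, "ceo at a desk"),
        (4, "mountain landscape"),
        (5, "young ceo smiling"),
    ],
)

sql = "SELECT id, prompt FROM prompts WHERE instr(prompt, 'ceo') > 0 ORDER BY id"
batches = list(stream_query(con, sql, batch_size=2))
for batch in batches:
    print(batch)
```

Three rows match, so with a batch size of 2 the generator yields one batch of two rows and one batch of one row.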

Releases(0.3.0)

0.3.0(Nov 28, 2022)

Full Changelog: https://github.com/mariosasko/datasets_sql/compare/0.2.0...0.3.0
0.2.0(Nov 6, 2022)

Full Changelog: https://github.com/mariosasko/datasets_sql/compare/0.1.1...0.2.0
0.1.1(Mar 15, 2022)

Full Changelog: https://github.com/mariosasko/datasets_sql/commits/0.1.1

Owner

Mario Šaško

SWE at Hugging Face

