Overview

datasets_sql

A 🤗 Datasets extension package that provides support for executing arbitrary SQL queries on HF datasets. It uses DuckDB as a SQL engine and follows its query syntax.

Installation

pip install datasets_sql

Quick Start

from datasets import load_dataset, Dataset
from datasets_sql import query

imdb_dset = load_dataset("imdb", split="train")

# Keep only the rows where the `text` field has more than 1000 characters
imdb_query_dset1 = query("SELECT text FROM imdb_dset WHERE length(text) > 1000")

# Count the number of rows per label
imdb_query_dset2 = query("SELECT label, COUNT(*) as num_rows FROM imdb_dset GROUP BY label")

# Remove duplicated `text` values
imdb_query_dset3 = query("SELECT DISTINCT text FROM imdb_dset")

# Get the average length of the `text` field
imdb_query_dset4 = query("SELECT AVG(length(text)) as avg_text_length FROM imdb_dset")

order_customer_dset = Dataset.from_dict({
    "order_id": [10001, 10002, 10003],
    "customer_id": [3, 1, 2],
})

customer_dset = Dataset.from_dict({
    "customer_id": [1, 2, 3],
    "name": ["John", "Jane", "Mary"],
})

# Join two tables
join_query_dset = query(
    "SELECT order_id, name FROM order_customer_dset INNER JOIN customer_dset ON order_customer_dset.customer_id = customer_dset.customer_id"
)
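
To make the expected result of the join concrete, here is a minimal sketch that runs the same statement against Python's built-in sqlite3 module instead of datasets_sql and DuckDB. That substitution is an assumption made so the snippet runs without extra dependencies, and the ORDER BY clause is added only to make the row order deterministic. The real query call presumably returns a Dataset rather than a list of tuples.

```python
import sqlite3

# Stand-in for the two Dataset objects above: plain SQLite tables with the same rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE order_customer_dset (order_id INTEGER, customer_id INTEGER)")
conn.execute("CREATE TABLE customer_dset (customer_id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO order_customer_dset VALUES (?, ?)",
    [(10001, 3), (10002, 1), (10003, 2)],
)
conn.executemany(
    "INSERT INTO customer_dset VALUES (?, ?)",
    [(1, "John"), (2, "Jane"), (3, "Mary")],
)

# Same join as in the Quick Start, plus ORDER BY for a stable row order.
join_rows = conn.execute(
    "SELECT order_id, name "
    "FROM order_customer_dset "
    "INNER JOIN customer_dset "
    "ON order_customer_dset.customer_id = customer_dset.customer_id "
    "ORDER BY order_id"
).fetchall()

print(join_rows)  # [(10001, 'Mary'), (10002, 'John'), (10003, 'Jane')]
```

Each order is matched to the customer with the same customer_id, so every order row appears exactly once with the corresponding name.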
Comments
  • How to use query function if dataset is a class attribute

    Awesome library!

    This is probably a generic duckdb question but figured I'd ask here first. If I store a reference to a dataset in a class attribute, how do I get query to find my dataset?

    Repro:

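    import datasets
    from datasets_sql import query

    class DatasetQuery:

        def __init__(self, dataset_name, split="train"):
            # Load the requested split instead of always loading "train"
            ds = datasets.load_dataset(dataset_name, split=split)
            self.dataset = ds

        def query(self, query_str):
            # Delegates to the module-level datasets_sql.query
            return query(query_str)

    dq = DatasetQuery("huggingnft/boredapeyachtclub")
    dq.query("select * from ?? limit 10;")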
    

    What do I put in the FROM clause? I tried ds and self.dataset, but neither works. I get ValueError: The dataset ds not found in the namespace.

    opened by freddyaboulton 4
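
The error message in the issue above suggests that table names are resolved by looking them up as variables in some namespace. How datasets_sql actually does this is not shown here; the following is a standalone sketch of that general technique using only the standard library. The names resolve_name and Holder are hypothetical and exist only for illustration.

```python
import inspect


def resolve_name(name):
    """Hypothetical helper: find `name` among the caller's local, then global, variables."""
    caller = inspect.currentframe().f_back
    try:
        if name in caller.f_locals:
            return caller.f_locals[name]
        if name in caller.f_globals:
            return caller.f_globals[name]
        raise ValueError(f"{name} not found in the namespace")
    finally:
        del caller  # drop the frame reference to avoid reference cycles


class Holder:
    def __init__(self, value):
        self.dataset = value

    def lookup_attribute(self):
        # "self.dataset" is an attribute access, not a variable name,
        # so a plain name lookup cannot find it.
        return resolve_name("self.dataset")

    def lookup_local(self):
        # Binding the attribute to a local name makes it visible to the lookup.
        dataset = self.dataset
        return resolve_name("dataset")


holder = Holder([1, 2, 3])
found = holder.lookup_local()

try:
    holder.lookup_attribute()
    attribute_found = True
except ValueError:
    attribute_found = False
```

Under this assumption, binding the attribute to a local variable inside the method and using that name in the FROM clause would be the thing to try. Whether it works with the real query function depends on which frame the library inspects.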
  • The readme demos are broken

    I tried running an example from the repo but the code is broken:

    imdb_dset = load_dataset("imdb", split="train")
    dataset = query(
        "SELECT text FROM imdb_dset"
    )
    

    results in AttributeError: 'duckdb.DuckDBPyConnection' object has no attribute 'fetch_arrow_chunk'

    I am using datasets_sql version 0.1.1 and datasets version 2.5.2

    opened by mo6zes 1
  • Be able to stream the results of query

    I'd like to query a large remote dataset (on the hub or elsewhere) and then stream the results of the query so that I don't have to download the entire dataset to my machine.

    For example, you could query diffusiondb for images generated with prompts containing the word "ceo" to visualize biases:

    SELECT * FROM poloclub/diffusiondb
    WHERE contains(prompt, 'ceo')
    

    This combined with https://github.com/huggingface/datasets-server/issues/398 would open the door for a lot of cool applications of gradio + datasets where users could interactively explore datasets that don't fit on their machines and create spaces without having to download/store large datasets.

    I see that data can be streamed from DuckDB with pyarrow (https://duckdb.org/2021/12/03/duck-arrow.html). I wonder if this can be leveraged for this use case.

    opened by freddyaboulton 5
Releases(0.3.0)
Owner
Mario Šaško
SWE at Hugging Face