dask-sql is a distributed SQL query engine in Python using Dask

Overview


SQL + Python

dask-sql is a distributed SQL query engine in Python. It allows you to query and transform your data using a mixture of common SQL operations and Python code, and to easily scale up the computation if you need it.

  • Combine the power of Python and SQL: load your data with Python, transform it with SQL, enhance it with Python and query it with SQL - or the other way round. With dask-sql you can mix the well-known Python dataframe API of pandas and Dask with common SQL operations, to process your data in exactly the way that is easiest for you.
  • Infinite Scaling: using the power of the great Dask ecosystem, your computations can scale as you need them - from your laptop to your super cluster - without changing a single line of SQL code. From k8s to cloud deployments, from batch systems to YARN - if Dask supports it, so does dask-sql.
  • Your data - your queries: use Python user-defined functions (UDFs) in SQL without any performance drawback and extend your SQL queries with the large number of Python libraries, e.g. for machine learning, complicated input formats or complex statistics (see the short sketch after this list).
  • Easy to install and maintain: dask-sql is just a pip/conda install away (or a docker run if you prefer). No need for complicated cluster setups - dask-sql will run out of the box on your machine and can easily be connected to your computing cluster.
  • Use SQL from wherever you like: dask-sql integrates with your Jupyter notebook, your normal Python module or can be used as a standalone SQL server from any BI tool. It even integrates natively with Apache Hue.
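
As a small illustration of the UDF support mentioned in the list above, the sketch below registers a plain Python function and calls it from SQL. It uses the dask-sql Context API (create_table, register_function, sql); the exact register_function signature may differ slightly between versions, so treat this as a hedged example rather than canonical documentation.

import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
c.create_table("df", dd.from_pandas(pd.DataFrame({"x": [1.0, 2.0, 3.0]}), npartitions=1))

# A plain Python function ...
def my_square(x):
    return x ** 2

# ... registered as a scalar SQL function with one float64 parameter and a float64 return type
c.register_function(my_square, "my_square", [("x", np.float64)], np.float64)

# ... and used directly inside SQL
result = c.sql("SELECT my_square(x) AS x_squared FROM df")
print(result.compute())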

Read more in the documentation.


Example

For this example, we load some data from disk and query it with a SQL command from our Python code. Any pandas or Dask dataframe can be used as input, and dask-sql understands a large number of formats (csv, parquet, json, ...) and locations (s3, hdfs, gcs, ...).

import dask.dataframe as dd
from dask_sql import Context

# Create a context to hold the registered tables
c = Context()

# Load the data and register it in the context
# This will give the table a name that we can use in queries
df = dd.read_csv("...")
c.create_table("my_data", df)

# Now execute a SQL query. The result is again a Dask dataframe.
result = c.sql("""
    SELECT
        my_data.name,
        SUM(my_data.x)
    FROM
        my_data
    GROUP BY
        my_data.name
""", return_futures=False)

# Show the result
print(result)
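
Note that return_futures=False asks dask-sql to compute the result right away. By default, c.sql returns a lazy Dask dataframe, so you can keep chaining Dask operations and trigger the computation yourself - a minimal sketch reusing the my_data table registered above:

# Without return_futures=False the result stays lazy ...
lazy_result = c.sql("""
    SELECT my_data.name, SUM(my_data.x) AS total
    FROM my_data
    GROUP BY my_data.name
""")

# ... so nothing is computed until you explicitly ask for it
print(lazy_result.compute())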

Quickstart

Have a look at the documentation or start the example notebook on Binder.

dask-sql is currently under development and does not yet understand all SQL commands (but already a large fraction). We are actively looking for feedback, improvements and contributors!

If you would like to utilize GPUs for your SQL queries, have a look at the blazingSQL project.

Installation

dask-sql can be installed via conda (preferred) or pip - or in a development environment.

With conda

Create a new conda environment or use an existing environment:

conda create -n dask-sql
conda activate dask-sql

Install the package from the conda-forge channel:

conda install dask-sql -c conda-forge

With pip

dask-sql needs Java to parse the SQL queries. Make sure you have a working Java installation with version >= 8.

To test if you have Java properly installed and set up, run

$ java -version
openjdk version "1.8.0_152-release"
OpenJDK Runtime Environment (build 1.8.0_152-release-1056-b12)
OpenJDK 64-Bit Server VM (build 25.152-b12, mixed mode)

After installing Java, you can install the package with

pip install dask-sql
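
To quickly check that the installation (including the Java parts needed for SQL parsing) works, you can run a tiny query end-to-end - a minimal smoke test, not an official snippet from the docs:

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
c.create_table("t", dd.from_pandas(pd.DataFrame({"x": [1, 2, 3]}), npartitions=1))

# If this prints a one-row result, SQL parsing and execution are set up correctly
print(c.sql("SELECT SUM(x) AS total FROM t", return_futures=False))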

For development

If you want the newest (unreleased) dask-sql version or if you plan to do development on dask-sql, you can also install the package from source.

git clone https://github.com/nils-braun/dask-sql.git

Create a new conda environment and install the development environment:

conda create -n dask-sql --file conda.txt -c conda-forge

It is not recommended to use pip instead of conda for the environment setup. If you need to, however, make sure you have Java (JDK >= 8) and Maven installed and correctly set up before continuing. Have a look at conda.txt for the rest of the development environment.

After that, you can install the package in development mode

pip install -e ".[dev]"

To compile the Java classes (at the beginning or after changes), run

python setup.py java

This repository uses pre-commit hooks. To install them, call

pre-commit install

Testing

You can run the tests (after installation) with

pytest tests

SQL Server

dask-sql comes with a small test implementation of a SQL server. Instead of rebuilding a full ODBC driver, we reuse the Presto wire protocol. It is - so far - only a start of the development and is missing important features, such as authentication.

You can test the Presto SQL server by running (after installation)

dask-sql-server

or by using the created docker image

docker run --rm -it -p 8080:8080 nbraun/dask-sql

in one terminal. This will spin up a server on port 8080 (by default) that looks like a normal Presto database to any Presto client.

You can test this, for example, with the default Presto client:

presto --server localhost:8080

Now you can fire simple SQL queries (as no data is loaded by default):

=> SELECT 1 + 1;
 EXPR$0
--------
    2
(1 row)
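
If you prefer to connect from Python instead of the presto CLI, the sketch below uses the presto-python-client package (prestodb) - an assumption on our side; any Presto-compatible client should work similarly:

import prestodb  # pip install presto-python-client

# Connect to the dask-sql server started above
conn = prestodb.dbapi.connect(host="localhost", port=8080, user="test")
cur = conn.cursor()

cur.execute("SELECT 1 + 1")
print(cur.fetchall())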

You can find more information in the documentation.

CLI

You can also run the CLI dask-sql for testing out SQL commands quickly:

dask-sql --load-test-data --startup

(dask-sql) > SELECT * FROM timeseries LIMIT 10;

How does it work?

At the core, dask-sql does two things:

  • translate the SQL query using Apache Calcite into a relational algebra, which is represented as a tree of Java objects - similar to many other SQL engines (Hive, Flink, ...)
  • convert this description of the query from Java objects into Dask API calls (and execute them) - returning a Dask dataframe.

For the first step, Apache Calcite needs to know about the columns and types of the Dask dataframes; therefore, some Java classes that store this information for Dask dataframes are defined in planner. After the translation into a relational algebra is done (using RelationalAlgebraGenerator.getRelationalAlgebra), the Python methods defined in dask_sql.physical turn this into a physical Dask execution plan by converting each piece of the relational algebra one by one.
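
If you want to look at the intermediate relational algebra yourself, you can use the Context's explain method - a small sketch (assuming Context.explain, which returns the plan as a string; the exact output format depends on the dask-sql version):

import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.from_pandas(pd.DataFrame({"name": ["a", "b", "a"], "x": [1, 2, 3]}), npartitions=1)
c.create_table("my_data", df)

# Print the relational algebra (the query plan) without executing the query
print(c.explain("SELECT SUM(my_data.x) FROM my_data GROUP BY my_data.name"))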

Comments
  • TypeError: sequence item 0: expected str instance, NoneType found on running python setup.py java on source

    $ git clone https://github.com/nils-braun/dask-sql.git
    
    $ cd dask-sql
    
    $ pytest tests
    ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
    pytest: error: unrecognized arguments: --cov --cov-config=.coveragerc tests
      inifile: /mnt/d/Programs/dask/dask-sql/pytest.ini
      rootdir: /mnt/d/Programs/dask/dask-sql
    
    
    $ python setup.py java
    running java
    Traceback (most recent call last):
      File "setup.py", line 93, in <module>
        command_options={"build_sphinx": {"source_dir": ("setup.py", "docs"),}},
      File "/home/saulo/anaconda3/lib/python3.7/site-packages/setuptools/__init__.py", line 165, in setup
        return distutils.core.setup(**attrs)
      File "/home/saulo/anaconda3/lib/python3.7/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/home/saulo/anaconda3/lib/python3.7/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/home/saulo/anaconda3/lib/python3.7/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "setup.py", line 30, in run
        self.announce(f"Running command: {' '.join(command)}", level=distutils.log.INFO)
    TypeError: sequence item 0: expected str instance, NoneType found
    
    $ python dask-sql-test.py
    Traceback (most recent call last):
      File "dask-sql-test.py", line 1, in <module>
        from dask_sql import Context
      File "/mnt/d/Programs/dask/dask-sql/dask_sql/__init__.py", line 1, in <module>
        from .context import Context
      File "/mnt/d/Programs/dask/dask-sql/dask_sql/context.py", line 9, in <module>
        from dask_sql.java import (
      File "/mnt/d/Programs/dask/dask-sql/dask_sql/java.py", line 88, in <module>
        DaskTable = com.dask.sql.schema.DaskTable
    AttributeError: Java package 'com' has no attribute 'dask'
    
    $ python -V
    Python 3.7.6
    
    $ lsb_release -a
    No LSB modules are available.
    Distributor ID: Ubuntu
    Description:    Ubuntu 20.04.1 LTS
    Release:        20.04
    Codename:       focal
    
    $ java -version
    openjdk version "14.0.2" 2020-07-14
    OpenJDK Runtime Environment (build 14.0.2+12-Ubuntu-120.04)
    OpenJDK 64-Bit Server VM (build 14.0.2+12-Ubuntu-120.04, mixed mode, sharing)
    
    opened by sauloal 14
  • Add a packaged version of dask-sql

    Currently, dask-sql can only be installed via the source. We should find out, if uploading the packaged jar (contained in a wheel) together with the python code makes sense and if and how we can create a conda package (probably via conda-forge).

    opened by nils-braun 13
  • [BUG] CVEs in conda release

    What happened:

    Running Grype on DaskSQL.jar from the latest conda release (dask-sql=2022.1) returned 6 fixable CVEs

    grype graphistry/graphistry-nvidia:v2.39.7-11.4 \
        --only-fixed \
        -o template \
        -t grype.friendly.tmpl
    

    with template grype.friendly.tmpl

    "Package","Version Installed","Vulnerability ID","Severity","Location",
    {{- range .Matches}}
    "{{.Artifact.Name}}","{{.Artifact.Version}}","{{.Vulnerability.ID}}","{{.Vulnerability.Severity}}","{{.Artifact.Locations}}"
    {{- end}}
    

    =>

    ...
    jackson-databind","2.10.0","GHSA-57j2-w4cx-62h2","High","[Location<RealPath="/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_sql/jar/DaskSQL.jar" Layer="sha256:5c80fa32eb12dd95d387ae9121c3a8ba9713207626bbc7b849613b4bb0eb3586">]"
    "httpclient","4.5.9","GHSA-7r82-7xv7-xcpj","Medium","[Location<RealPath="/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_sql/jar/DaskSQL.jar" Layer="sha256:5c80fa32eb12dd95d387ae9121c3a8ba9713207626bbc7b849613b4bb0eb3586">]"
    "json-smart","2.3","GHSA-fg2v-w576-w4v3","High","[Location<RealPath="/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_sql/jar/DaskSQL.jar" Layer="sha256:5c80fa32eb12dd95d387ae9121c3a8ba9713207626bbc7b849613b4bb0eb3586">]"
    "commons-io","2.4","GHSA-gwrp-pvrq-jmwv","Medium","[Location<RealPath="/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_sql/jar/DaskSQL.jar" Layer="sha256:5c80fa32eb12dd95d387ae9121c3a8ba9713207626bbc7b849613b4bb0eb3586">]"
    "snakeyaml","1.24","GHSA-rvwf-54qp-4r6v","High","[Location<RealPath="/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_sql/jar/DaskSQL.jar" Layer="sha256:5c80fa32eb12dd95d387ae9121c3a8ba9713207626bbc7b849613b4bb0eb3586">]"
    "json-smart","2.3","GHSA-v528-7hrm-frqp","Critical","[Location<RealPath="/opt/conda/envs/rapids/lib/python3.8/site-packages/dask_sql/jar/DaskSQL.jar" Layer="sha256:5c80fa32eb12dd95d387ae9121c3a8ba9713207626bbc7b849613b4bb0eb3586">]
    

    What you expected to happen:

    The latest stable release should ideally have no fixable CVEs

    Minimal Complete Verifiable Example:

    See above

    Anything else we need to know?:

    Environment:

    • dask-sql version: 2022.01
    • Python version: Any
    • Operating System: Any (Ubuntu container)
    • Install method (conda, pip, source): Conda
    bug 
    opened by lmeyerov 10
  • Complex join fails with memory error

    From @timhdesilva

    So I have a large dataset (50GB) that needs to be merged with a small dataset that is a Pandas dataframe. Prior to the merge, I need to perform a groupby operation on the large dataset. Using Dask, I have been able to perform the groupby operation on the large dataset (which is a Dask dataframe). When I then merge the two datasets using X.merge(Y), I have no issues. The problem is that I need to perform a merge that is not exact (i.e. one column between two others), which is why I'm turning to dask-sql. When I try to do the merge with dask-sql though, I get a memory error (the number of observations should only be ~10x that of the exact merge, so memory shouldn't be a problem).

    Any ideas here? I'm thinking somehow the issue might be that I am performing a groupby operation on the Dask dataframe prior to the dask-sql merge. Is this allowed - i.e. can one do a groupby and not execute it prior to using the dask-sql create_table() command and then performing a dask-sql merge with c.sql?
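
    For illustration, the kind of non-exact ("one column between two others") join described above could be expressed in dask-sql roughly like this - a sketch with hypothetical table and column names, assuming both tables are registered in a dask-sql Context c:

    result = c.sql("""
        SELECT l.id, s.label
        FROM large_table AS l
        JOIN small_table AS s
            ON l.value BETWEEN s.lower_bound AND s.upper_bound
    """)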

    opened by nils-braun 10
  • Upgrade to DataFusion 14.0.0

    Changes in this PR:

    • Use DataFusion 14.0.0
    • Added copy of filter_push_down rule from DataFusion 13.0.0 because there are changes in the DataFusion 14.0.0 version that cause regressions for us. We should revert back to using DataFusion's version at some point. I filed https://github.com/dask-contrib/dask-sql/issues/908 for this.
    opened by andygrove 9
  • [ENH] substr() not supported in dask-sql

    Is your feature request related to a problem? Please describe. I'm working on porting a large set of queries from another engine to dask-sql. I see that I can update the queries to use "substring" instead, but it would be nice if users didn't have to.

    Describe the solution you'd like Can we have substr() supported in dask-sql in the same way that substring() is?

    Describe alternatives you've considered substring() works in dask-sql, not substr(). However, we do not want to alter the SQL files by changing substr() to substring()

    Additional context Here's an example query I'd like to be able to run:

      import cudf
      from dask_sql import Context

      dc = Context()

      df = cudf.DataFrame({'s_c': ['ATX', 'LAX', 'SFO'], 's_d': ['38714', '37206', '38714'],
                           'd_d': ['1900-01-01', '1900-01-04', '2199-12-28']})
      dc.create_table('my_table', df)

      query = """
                select substr(s_c,1,30)
                from
                 (select s_c
                  from my_table
                  where s_d = d_d
                  group by s_c)
              """

      print(dc.sql(query).compute())
    
    enhancement SQL grammar java 
    opened by DaceT 9
  • Error: Unable to instantiate java compiler

    Hi! @nils-braun,

    As you already know I mistakenly opened this issue on Dask-Docker repo and you were kindly alerted by @jrbourbeau

    I will copy/paste my original post here as well as your initial answer (Thank you for your quick reply)

    Here is my original post:

    ####################################################################

    What happened:

    After installing Java and dask-sql using pip, whenever I try to run a SQL query from my python code I get the following error:

    ...
    File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 378, in sql
        rel, select_names, _ = self._get_ral(sql)
      File "/home/vquery/.local/lib/python3.8/site-packages/dask_sql/context.py", line 515, in _get_ral
        nonOptimizedRelNode = generator.getRelationalAlgebra(validatedSqlNode)
    java.lang.java.lang.IllegalStateException: java.lang.IllegalStateException: Unable to instantiate java compiler
    ...
    ...
    File "JaninoRelMetadataProvider.java", line 426, in org.apache.calcite.rel.metadata.JaninoRelMetadataProvider.compile
      File "CompilerFactoryFactory.java", line 61, in org.codehaus.commons.compiler.CompilerFactoryFactory.getDefaultCompilerFactory
    java.lang.java.lang.NullPointerException: java.lang.NullPointerException
    

    What you expected to happen:

    I should get a dataframe as a result.

    Minimal Complete Verifiable Example:

    
    # The cluster/client setup is done first, in another module not the one executing the SQL query
    # Also tried other cluster/scheduler types with the same error
    from dask.distributed import Client, LocalCluster
    cluster = LocalCluster(
        n_workers=4,
        threads_per_worker=1,
        processes=False,
        dashboard_address=':8787',
        asynchronous=False,
        memory_limit='1GB'
        )
    client = Client(cluster)
    
    # The SQL code is executed in its own module
    import dask.dataframe as dd
    from dask_sql import Context
    
    c = Context()
    df = dd.read_parquet('/vQuery/files/results/US_Accidents_June20.parquet') 
    c.register_dask_table(df, 'df')
    df = c.sql("""select ID, Source from df""") # This line fails with the error reported
    
    

    Anything else we need to know?:

    As mentioned in the code snippet above, due to the way my application is designed, the Dask client/cluster setup is done before dask-sql context is created.

    Environment:

    • Dask version:
      • dask: 2020.12.0
      • dask-sql: 0.3.1
    • Python version:
      • Python 3.8.5
    • Operating System:
      • Ubuntu 20.04.1 LTS
    • Install method (conda, pip, source):
      • pip
    • Application Framework
      • Jupyter Notebook/Ipywidgets & Voila Server

    Install steps

    $ sudo apt install default-jre
    
    $ sudo apt install default-jdk
    
    $ java -version
    openjdk version "11.0.10" 2021-01-19
    OpenJDK Runtime Environment (build 11.0.10+9-Ubuntu-0ubuntu1.20.04)
    OpenJDK 64-Bit Server VM (build 11.0.10+9-Ubuntu-0ubuntu1.20.04, mixed mode, sharing)
    
    $ javac -version
    javac 11.0.10
    
    $ echo $JAVA_HOME
    /usr/lib/jvm/java-11-openjdk-amd64
    
    $ pip install dask-sql
    
    $ pip list | grep dask-sql
    dask-sql               0.3.1
    
    opened by LaurentEsingle 9
  • Add max version constraint for `fugue`

    It looks like the recent release of Fugue 0.7.0 has bumped its qpd dependency to a version that only has python support up to 3.8. I'm not sure if this is the cause for the recent Fugue-related failures, but it does mean that at least for now, we should constrain to fugue<0.7.0, where 3.9+ support is guaranteed.

    In the long run, we should probably see what the blockers are to allowing 3.9+ support on qpd again, cc @goodwanghan in case you have any additional context to provide here.

    opened by charlesbluca 8
  • Add STDDEV, STDDEV_SAMP, and STDDEV_POP

    Closes #608

    Blocked by: https://github.com/rapidsai/cudf/issues/11515

    Note: currently, performing multiple aggregations at once seems to result in incorrect values. Ex: SELECT STDDEV(a) AS s1, STDDEV_POP(a) AS s2 FROM df returns the same result for both s1 and s2 but running two separate queries (one for each aggregation) returns the correct results (#655)

    datafusion 
    opened by ChrisJar 8
  • [BUG] Segfaults on "select count(*) from test" with tables on top of cuDF DataFrames

    test.py:

    if __name__ == "__main__":
        from dask.distributed import Client
        from dask_cuda import LocalCUDACluster
    
        cluster = LocalCUDACluster(protocol="tcp")
        client = Client(cluster)
        print(client)
    
        from dask_sql import Context
        import cudf
    
        c = Context()
    
        test_df = cudf.DataFrame({'id': [0, 1, 2]})
        c.create_table("test", test_df)
    
        # segfault
        print(c.sql("select count(*) from test").compute())
    

    EDIT: Leaving the below UCX snippet and trace for historical purposes, but the issue seems entirely unrelated to UCX.

    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster
    from dask_sql import Context
    import pandas as pd
    
    cluster = LocalCUDACluster(protocol="ucx")
    client = Client(cluster)
    
    c = Context()
    
    test_df = pd.DataFrame({'id': [0, 1, 2]})
    c.create_table("test", test_df)
    
    # segfault
    c.sql("select count(*) from test")
    

    trace:

    /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/distributed-2022.2.1+8.g39c5e885-py3.9.egg/distributed/comm/ucx.py:83: UserWarning: A CUDA context for device 0 already exists on process ID 1251168. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
      warnings.warn(
    distributed.preloading - INFO - Import preload module: dask_cuda.initialize
    ...
    [rl-dgx2-r13-u7-rapids-dgx201:1232380:0:1232380] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
    ==== backtrace (tid:1232380) ====
     0  /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7f921c5883f5]
     1  /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7f921c588791]
     2  /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d962) [0x7f921c588962]
     3  /lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7f976d27b0c0]
     4  [0x7f93a78e6b58]
    =================================
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00007f93a78e6b58, pid=1232380, tid=1232380
    #
    # JRE version: OpenJDK Runtime Environment (11.0.1+13) (build 11.0.1+13-LTS)
    # Java VM: OpenJDK 64-Bit Server VM (11.0.1+13-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
    # Problematic frame:
    # J 1791 c2 java.util.Arrays.hashCode([Ljava/lang/Object;)I [email protected] (56 bytes) @ 0x00007f93a78e6b58 [0x00007f93a78e6b20+0x0000000000000038]
    #
    # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/nfs/rgelhausen/notebooks/core.1232380)
    #
    # An error report file with more information is saved as:
    # /home/nfs/rgelhausen/notebooks/hs_err_pid1232380.log
    Compiled method (c2)   17616 1791       4       java.util.Arrays::hashCode (56 bytes)
     total in heap  [0x00007f93a78e6990,0x00007f93a78e6d80] = 1008
     relocation     [0x00007f93a78e6b08,0x00007f93a78e6b20] = 24
     main code      [0x00007f93a78e6b20,0x00007f93a78e6c60] = 320
     stub code      [0x00007f93a78e6c60,0x00007f93a78e6c78] = 24
     metadata       [0x00007f93a78e6c78,0x00007f93a78e6c80] = 8
     scopes data    [0x00007f93a78e6c80,0x00007f93a78e6ce8] = 104
     scopes pcs     [0x00007f93a78e6ce8,0x00007f93a78e6d48] = 96
     dependencies   [0x00007f93a78e6d48,0x00007f93a78e6d50] = 8
     handler table  [0x00007f93a78e6d50,0x00007f93a78e6d68] = 24
     nul chk table  [0x00007f93a78e6d68,0x00007f93a78e6d80] = 24
    Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
    #
    # If you would like to submit a bug report, please visit:
    
    bug needs triage 
    opened by randerzander 8
  • Update docs theme, use sphinx-tabs for CPU/GPU examples

    This PR bumps the dask-sphinx-theme to be more in line with Dask / Distributed's docs, and adds the sphinx-tabs extension so that code-blocks can be tabbed to show their GPU equivalent (when possible)

    opened by charlesbluca 8
  • Bump pypa/cibuildwheel from 2.11.3 to 2.11.4

    Bumps pypa/cibuildwheel from 2.11.3 to 2.11.4.

    Release notes

    Sourced from pypa/cibuildwheel's releases.

    v2.11.4

    • 🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)
    • 🛠 Updates CPython 3.11 to 3.11.1 (#1371)
    • 🛠 Updates PyPy 3.7 to 3.7.10, except on macOS which remains on 7.3.9 due to a bug. (#1371)
    • 📚 Added a reference to abi3audit to the docs (#1347)
    Changelog

    Sourced from pypa/cibuildwheel's changelog.

    v2.11.4

    24 Dec 2022

    • 🐛 Fix a bug that caused missing wheels on Windows when a test was skipped using CIBW_TEST_SKIP (#1377)
    • 🛠 Updates CPython 3.11 to 3.11.1 (#1371)
    • 🛠 Updates PyPy to 7.3.10, except on macOS which remains on 7.3.9 due to a bug on that platform. (#1371)
    • 📚 Added a reference to abi3audit to the docs (#1347)
    Commits
    • 27fc88e Bump version: v2.11.4
    • a7e9ece Merge pull request #1371 from pypa/update-dependencies-pr
    • b9a3ed8 Update cibuildwheel/resources/build-platforms.toml
    • 3dcc2ff fix: not skipping the tests stops the copy (Windows ARM) (#1377)
    • 1c9ec76 Merge pull request #1378 from pypa/henryiii-patch-3
    • 22b433d Merge pull request #1379 from pypa/pre-commit-ci-update-config
    • 98fdf8c [pre-commit.ci] pre-commit autoupdate
    • cefc5a5 Update dependencies
    • e53253d ci: move to ubuntu 20
    • e9ecc65 [pre-commit.ci] pre-commit autoupdate (#1374)
    • Additional commits viewable in compare view

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies github_actions 
    opened by dependabot[bot] 1
  • [BUG] Fix `test_random` on Dask cluster

    Right now, test_random fails on our Dask cluster integration test with a TypeError: __randomstate_ctor() takes from 0 to 1 positional arguments but 2 were given.

    Like #977, I think this may have to do with the newest NumPy version release.

    See also the definition of __randomstate_ctor in the NumPy source code.

    bug needs triage 
    opened by sarahyurick 0
  • [BUG] Schema: <schema> not found in DaskSQLContext

    this worked in dask 2022.8, but after the switch to DataFusion, I get this error when running queries. We believe this is because DataFusion doesn't support schemas - is it possible to add support for this again?

    bug needs triage 
    opened by hungcs 1
  • Bump async-trait from 0.1.59 to 0.1.60 in /dask_planner

    Bumps async-trait from 0.1.59 to 0.1.60.

    Release notes

    Sourced from async-trait's releases.

    0.1.60

    • Documentation improvements
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    dependencies rust 
    opened by dependabot[bot] 1
  • [ENH] Add ~isna() support for predicate pushdown

    Is your feature request related to a problem? Please describe. A common filter applied to many SQL queries is filtering out nulls for certain tables, which usually gets pushed down to the TableScan step. We implement is not null as a combination of df.isna() chained with a not operation. It would be good to support identifying these patterns in the HLG (high-level graph) for predicate pushdown.
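
    For reference, the "is not null" pattern described above corresponds roughly to the following Dask dataframe operation - a minimal sketch with a hypothetical column name:

    import pandas as pd
    import dask.dataframe as dd

    df = dd.from_pandas(pd.DataFrame({"a": [1.0, None, 3.0]}), npartitions=1)

    # "a IS NOT NULL" expressed as a negated isna() - the pattern that predicate
    # pushdown would need to recognize in the high-level graph
    print(df[~df["a"].isna()].compute())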

    Describe the solution you'd like

    Describe alternatives you've considered

    Additional context

    enhancement needs triage 
    opened by ayushdg 0
Releases
  • 2022.12.0(Dec 2, 2022)

    What's Changed

    • Unpin dask/distributed for development by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/892
    • Add replace operator by @ChrisJar in https://github.com/dask-contrib/dask-sql/pull/897
    • Replace variadic with exact where appropriate by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/885
    • Bump pyo3 from 0.17.2 to 0.17.3 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/900
    • Sort + limit topk optimization (initial) by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/893
    • [bug][docs] my_ds -> my_df by @nickvazz in https://github.com/dask-contrib/dask-sql/pull/905
    • Bump env_logger from 0.9.1 to 0.9.3 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/906
    • Bump mimalloc from 0.1.30 to 0.1.31 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/910
    • Replace dask_ml.wrappers.Incremental with custom Incremental class by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/855
    • Update flake8 link to use github by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/915
    • Use conda-incubator/[email protected] & enable automatic GH Action updates by @jakirkham in https://github.com/dask-contrib/dask-sql/pull/917
    • Bump uuid from 1.2.1 to 1.2.2 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/916
    • Upgrade to DataFusion 14.0.0 by @andygrove in https://github.com/dask-contrib/dask-sql/pull/903
    • Bump actions/checkout from 2 to 3 by @dependabot in https://github.com/dask-contrib/dask-sql/pull/920
    • Support to_timestamp by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/838
    • Bump actions/setup-python from 2 to 4 by @dependabot in https://github.com/dask-contrib/dask-sql/pull/921
    • Bump Docker workflow actions by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/930
    • Bump mimalloc from 0.1.31 to 0.1.32 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/923
    • Bump tokio from 1.21.2 to 1.22.0 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/927
    • Bump peter-evans/create-pull-request from 3 to 4 by @dependabot in https://github.com/dask-contrib/dask-sql/pull/929
    • Temporarily fix gpuci by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/942
    • Remove all Dask-ML uses by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/886
    • Dependabot updates by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/944
    • Bump async-trait from 0.1.58 to 0.1.59 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/946
    • Add TIMESTAMPDIFF support by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/876
    • Implement basic COALESCE functionality by @ChrisJar in https://github.com/dask-contrib/dask-sql/pull/823
    • Add support for filter pushdown rule by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/924
    • Resolve test_date_functions() by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/813
    • Set dask/distributed pinning for release by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/947
    • Set dask/distributed max version in Dockerfile by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/952

    New Contributors

    • @nickvazz made their first contribution in https://github.com/dask-contrib/dask-sql/pull/905
    • @jakirkham made their first contribution in https://github.com/dask-contrib/dask-sql/pull/917

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.10.1...2022.12.0

  • 2022.10.1(Oct 25, 2022)

    What's Changed

    • Unpin dask/distributed for development by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/848
    • Switch docs/CI away from conda-installed Rust by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/817
    • Add /opt/cargo/bin to gpuCI PATH by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/856
    • Enable crate sorting with rustfmt by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/819
    • Update datafusion dependency during upstream testing by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/814
    • Bump mimalloc from 0.1.29 to 0.1.30 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/862
    • Update gpuCI RAPIDS_VER to 22.12 by @github-actions in https://github.com/dask-contrib/dask-sql/pull/863
    • Refactor which_upstream logic in upstream scheduled workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/864
    • Add testing for OSX by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/859
    • Wrap which_upstream logic in expression syntax by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/866
    • Check for np.timedelta64 in as_timelike by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/860
    • Update test-upstream.yml typo by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/869
    • Use latest DataFusion rev by @andygrove in https://github.com/dask-contrib/dask-sql/pull/865
    • Only use upstream Dask in scheduled cluster testing if which_upstream == 'Dask' by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/872
    • Bump async-trait from 0.1.57 to 0.1.58 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/870
    • Add pypi release workflow by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/858
    • Ignore index for union all test by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/875
    • Bump versioneer-vendored files to 0.27 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/881
    • Bump uvicorn minimum version to 0.13.4 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/873
    • Install twine in cibuildwheel environment by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/874
    • Replace dask_ml.wrappers.ParallelPostFit with custom ParallelPostFit class by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/832
    • Add py to testing environments to resolve pytest 7.2.0 issues by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/890
    • Use latest DataFusion rev by @andygrove in https://github.com/dask-contrib/dask-sql/pull/889
    • Pin dask/distributed for release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/891

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.10.0...2022.10.1

  • 2022.10.1rc1(Oct 24, 2022)

    What's Changed

    • Ignore index for union all test by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/875
    • Bump versioneer-vendored files to 0.27 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/881
    • Bump uvicorn minimum version to 0.13.4 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/873
    • Install twine in cibuildwheel environment by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/874

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.10.1rc0...2022.10.1rc1

  • 2022.10.1rc0(Oct 19, 2022)

    What's Changed

    • Unpin dask/distributed for development by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/848
    • Switch docs/CI away from conda-installed Rust by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/817
    • Add /opt/cargo/bin to gpuCI PATH by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/856
    • Enable crate sorting with rustfmt by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/819
    • Update datafusion dependency during upstream testing by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/814
    • Bump mimalloc from 0.1.29 to 0.1.30 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/862
    • Update gpuCI RAPIDS_VER to 22.12 by @github-actions in https://github.com/dask-contrib/dask-sql/pull/863
    • Refactor which_upstream logic in upstream scheduled workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/864
    • Add testing for OSX by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/859
    • Wrap which_upstream logic in expression syntax by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/866
    • Check for np.timedelta64 in as_timelike by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/860
    • Update test-upstream.yml typo by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/869
    • Use latest DataFusion rev by @andygrove in https://github.com/dask-contrib/dask-sql/pull/865
    • Only use upstream Dask in scheduled cluster testing if which_upstream == 'Dask' by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/872
    • Bump async-trait from 0.1.57 to 0.1.58 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/870
    • Add pypi release workflow by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/858

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.10.0...2022.10.1c0

  • 2022.10.0(Oct 10, 2022)

    What's Changed

    • Update README to link to DataFusion rather than Calcite by @andygrove in https://github.com/dask-contrib/dask-sql/pull/790
    • Unpin dask/distributed for development by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/794
    • Remove datafusion syncing workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/793
    • Resolve syntax errors in upstream testing workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/797
    • README update- remove 'experimental' from GPU support section by @randerzander in https://github.com/dask-contrib/dask-sql/pull/798
    • Fix new clippy warnings by @andygrove in https://github.com/dask-contrib/dask-sql/pull/801
    • Check split_out to decide on sorted groupby in aggregate.py by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/802
    • Resolve Docker build failures, update core dependency constraints by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/804
    • Fix docker build errors by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/805
    • Fix if condition for gpuCI updating workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/808
    • pip install awscli in cloud images by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/809
    • Resolve bare requirement failures in upstream workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/800
    • Refactor getValue<T> code to reduce duplication by @andygrove in https://github.com/dask-contrib/dask-sql/pull/803
    • Improve SqlTypeName to support more types and also improve error handling by @andygrove in https://github.com/dask-contrib/dask-sql/pull/824
    • Add dependabot config to update Rust deps by @andygrove in https://github.com/dask-contrib/dask-sql/pull/820
    • Bump uuid from 0.8.2 to 1.1.2 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/828
    • Bump rand from 0.7.3 to 0.8.5 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/827
    • Remove rust-toolchain.toml by @andygrove in https://github.com/dask-contrib/dask-sql/pull/826
    • Add quoting around partition keys for Hive table inputs by @randerzander in https://github.com/dask-contrib/dask-sql/pull/834
    • Configure dependabot to ignore arrow and datafusion by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/840
    • Bump pyo3 from 0.17.1 to 0.17.2 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/836
    • Add support for CREATE EXPERIMENT, expand support for WITH kwargs by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/796
    • Bump uuid from 1.1.2 to 1.2.1 in /dask_planner by @dependabot in https://github.com/dask-contrib/dask-sql/pull/845
    • Add Andy and Charles to the rust codeowners group by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/846
    • Update DataFusion and change order of optimization rules by @andygrove in https://github.com/dask-contrib/dask-sql/pull/825
    • Update doc pages after DataFusion merge by @randerzander in https://github.com/dask-contrib/dask-sql/pull/842
    • Resolve test_literals() by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/812
    • Faster limit computation on persisted dataframes by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/837
    • Pin dask/distributed for release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/847

    New Contributors

    • @randerzander made their first contribution in https://github.com/dask-contrib/dask-sql/pull/798
    • @dependabot made their first contribution in https://github.com/dask-contrib/dask-sql/pull/828

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.9.0...2022.10.0

  • 2022.9.0(Sep 21, 2022)

    What's Changed

    • Unpin dask/distibuted post-release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/694
    • Don't check order for filtered groupby test by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/702
    • Relax test_groupby_split_every key check by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/710
    • Update gpuCI environment file, updating workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/731
    • Bump gpuCI test environment to use python 3.9 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/736
    • Refactor LIMIT computation to always use head when possible by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/696
    • Set pytest to fail on xpassing tests by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/756
    • Fix upstream failures in test_groupby_split_out by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/763
    • Add step argument to get_window_bounds for pandas>=1.5 by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/774
    • Remove PyPI release workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/777
    • Switch to Arrow DataFusion SQL parser by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/788
    • Pin dask/distributed for release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/789

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.8.0...2022.9.0

  • 2022.9.0.rc0(Sep 20, 2022)

    What's Changed

    • Datafusion aggregate by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/471
    • Bump DataFusion version by @andygrove in https://github.com/dask-contrib/dask-sql/pull/494
    • Basic DataFusion Select Functionality by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/489
    • Allow for Cast parsing and logicalplan by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/498
    • Minor code cleanup in row_type() by @andygrove in https://github.com/dask-contrib/dask-sql/pull/504
    • Bump rust version by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/508
    • Improve code for getting column name from expression by @andygrove in https://github.com/dask-contrib/dask-sql/pull/509
    • Update exceptions that are thrown by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/507
    • Add support for Expr::Sort in expr_to_field by @andygrove in https://github.com/dask-contrib/dask-sql/pull/515
    • Reduce crate dependencies by @andygrove in https://github.com/dask-contrib/dask-sql/pull/516
    • Datafusion dsql explain by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/511
    • Port sort logic to the datafusion planner by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/505
    • Add helper method to convert LogicalPlan to Python type by @andygrove in https://github.com/dask-contrib/dask-sql/pull/522
    • Support CASE WHEN and BETWEEN by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/502
    • Upgrade to DataFusion 8.0.0 by @andygrove in https://github.com/dask-contrib/dask-sql/pull/533
    • Enable passing tests by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/539
    • Datafusion crossjoin by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/521
    • Implement TryFrom for plans by @andygrove in https://github.com/dask-contrib/dask-sql/pull/543
    • Support for LIMIT clause with DataFusion by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/529
    • Support Joins using DataFusion planner/parser by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/512
    • Datafusion is not by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/557
    • [REVIEW] Add support for UNION by @galipremsagar in https://github.com/dask-contrib/dask-sql/pull/542
    • [REVIEW] Fix issue with duplicates in column renaming by @galipremsagar in https://github.com/dask-contrib/dask-sql/pull/559
    • [REVIEW] Enable LIMIT tests by @galipremsagar in https://github.com/dask-contrib/dask-sql/pull/560
    • Add CODEOWNERS file by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/562
    • Upgrade DataFusion version & support non-equijoin join conditions by @andygrove in https://github.com/dask-contrib/dask-sql/pull/566
    • [DF] Add @ayushdg and @galipremsagar to rust codeowners by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/572
    • Enable DataFusion CBO and introduce DaskSqlOptimizer by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/558
    • Only use the specific DataFusion crates that we need by @andygrove in https://github.com/dask-contrib/dask-sql/pull/568
    • Fix some clippy warnings by @andygrove in https://github.com/dask-contrib/dask-sql/pull/574
    • Datafusion invalid projection by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/571
    • Datafusion upstream merge by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/576
    • Datafusion filter by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/581
    • Table_scan column projection by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/578
    • Expose groupby agg configs to drop_duplicates (distinct) egg by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/575
    • Datafusion year & support for DaskSqlDialect by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/585
    • Optimization rule to optimize out nulls for inner joins by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/588
    • Push down null filters into TableScan by @andygrove in https://github.com/dask-contrib/dask-sql/pull/595
    • Datafusion IndexError - Return fields from the lhs and rhs of a join by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/599
    • Datafusion uncomment working filter tests by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/601
    • Search all schemas when attempting to locate index by field name by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/602
    • Fix join condition eval when joining on 3 or more columns by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/603
    • Add inList support by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/604
    • Enable Datafusion user defined functions UDFs by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/605
    • Datafusion empty relation by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/611
    • Datafusion NOT LIKE Clause support by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/615
    • Uncomment passing pytests by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/616
    • Fix bug when filtering on specific scalars. by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/609
    • Datafusion NULL & NOT NULL literals by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/618
    • Fix the results from a subquery alias operation with optimizations enabled by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/613
    • Initial version of contributing guide by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/600
    • Add helper function for converting expression lists to Python by @andygrove in https://github.com/dask-contrib/dask-sql/pull/631
    • Plugins support multiply types by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/636
    • Consolidate limit/offset logic in partition func by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/598
    • Datafusion version bump by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/628
    • Expand getOperands support to cover all currently available Expr type… by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/642
    • Introduce Inverse Rex Operation by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/643
    • Remove code segment that was causing double the amount of columns to … by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/644
    • Include Columns in Empty DataFrame by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/645
    • Bump setuptools-rust from 1.1.1 -> 1.4.1 by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/646
    • Merge main into datafusion-sql-planner by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/654
    • Port window logic to datafusion by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/545
    • COT function by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/657
    • Math functions by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/660
    • Use PyErrs for all Python-facing methods in dask_planner by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/662
    • Invalid crossjoin in plan by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/653
    • [DF] Add support for CREATE TABLE | VIEW AS statements by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/656
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/669
    • Datafusion expand scalarvalue catchall by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/638
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/670
    • [DF] Add support for DROP TABLE statements by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/658
    • Remove un-necessary sqlparser dependency and duplicate Dialect defini… by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/671
    • [DF] Resolve UDF test failures by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/672
    • Uncomment skipped rex pytests by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/661
    • Merge "Bump arrow version to 6.0.0 (#674)" by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/677
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/676
    • [DF] Fix most of the clippy warnings by @andygrove in https://github.com/dask-contrib/dask-sql/pull/679
    • [DF] use datafusion 9956f80f197550051db7debae15d5c706afc22a3 by @andygrove in https://github.com/dask-contrib/dask-sql/pull/667
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/685
    • Configure clippy to error on warnings by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/692
    • Unpin dask/distibuted post-release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/694
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/691
    • [DF] Add optimizer rules to translate subqueries to joins by @andygrove in https://github.com/dask-contrib/dask-sql/pull/680
    • [DF] Upgrade DataFusion to rev c0b4ba by @andygrove in https://github.com/dask-contrib/dask-sql/pull/689
    • Add STDDEV, STDDEV_SAMP, and STDDEV_POP by @ChrisJar in https://github.com/dask-contrib/dask-sql/pull/629
    • Rust parsing support for CREATE MODEL statements by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/693
    • Support for DROP MODEL parsing in Rust by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/695
    • Support for parsing [or replace] with create [or replace] model by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/700
    • Parsing logic for SHOW SCHEMAS by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/697
    • Support for parsing SHOW TABLES FROM grammar by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/699
    • Don't check order for filtered groupby test by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/702
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/708
    • Enable passing pytests by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/709
    • Relax test_groupby_split_every key check by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/710
    • Introduce 'schema' to the DaskTable instance and modify context.fqn t… by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/713
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/711
    • Use compiler function in nightly recipe, pin to Rust 1.62.1 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/687
    • Add test queries to gpuCI checks by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/650
    • Support for DISTRIBUTE BY by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/715
    • Datafusion create table with by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/714
    • [DF] Bump DataFusion to rev 076b42 by @andygrove in https://github.com/dask-contrib/dask-sql/pull/720
    • [DF] Add support for CREATE [OR REPLACE] TABLE [IF NOT EXISTS] WITH by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/718
    • Stop overwriting aggregations on same column by @ChrisJar in https://github.com/dask-contrib/dask-sql/pull/675
    • [DF] Add TypeCoercion optimizer rule by @andygrove in https://github.com/dask-contrib/dask-sql/pull/723
    • Support for SHOW COLUMNS syntax by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/721
    • Implement PREDICT parsing and Python wiring by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/722
    • Support all boolean operations by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/719
    • Resolve issue that crept in during code merge and caused build issues by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/724
    • [DF] Add handling for overloaded UDFs by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/682
    • [DF] Minor quality of life updates to test_queries.py by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/730
    • [DF] Fix PyExpr.index bug where it returns Ok(0) instead of an Err if no match is found by @andygrove in https://github.com/dask-contrib/dask-sql/pull/732
    • [DF] Add Cargo.lock and bump DataFusion rev by @andygrove in https://github.com/dask-contrib/dask-sql/pull/734
    • Update gpuCI environment file, updating workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/731
    • Bump gpuCI test environment to use python 3.9 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/736
    • [DF] Implement ANALYZE TABLE by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/733
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/735
    • [DF] Switch out gpuCI Java dependencies for Rust by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/737
    • {CREATE | USE | DROP} Schema support by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/727
    • Test function test_aggregate_function by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/738
    • Uncomment more test_model pytests by @ChrisJar in https://github.com/dask-contrib/dask-sql/pull/728
    • Unskip passing postgres test by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/739
    • [DF] Publish nightlies under dev_datafusion label by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/729
    • [DF] DataFusion upgrade by @andygrove in https://github.com/dask-contrib/dask-sql/pull/742
    • [DF] Resolve test_aggregations and test_group_by_all by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/743
    • Refactor LIMIT computation to always use head when possible by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/696
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/745
    • Upgrade to latest DataFusion by @andygrove in https://github.com/dask-contrib/dask-sql/pull/744
    • Uncomment passing pytests by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/750
    • [DF] Update DataFusion to pick up SQL support for LIKE, ILIKE, SIMILAR TO with escape char by @andygrove in https://github.com/dask-contrib/dask-sql/pull/751
    • Set pytest to fail on xpassing tests by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/756
    • Upgrade to Datafusion 12.0.0 RC1 by @andygrove in https://github.com/dask-contrib/dask-sql/pull/755
    • [DF] Optimize away COUNT DISTINCT aggregate operations - eliminate_agg_distinct by @andygrove in https://github.com/dask-contrib/dask-sql/pull/748
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/757
    • [DF] Upgrade pyo and change some signatures to use &str instead of String by @andygrove in https://github.com/dask-contrib/dask-sql/pull/762
    • Fix upstream failures in test_groupby_split_out by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/763
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/764
    • [DF] Switch back to architectured builds by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/765
    • [DF] Remove python constraint from nightly recipe by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/766
    • [DF] Generalize CREATE | PREDICT MODEL to accept non-native SELECT statements by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/747
    • [DF] Use Datafusion 12.0.0 by @andygrove in https://github.com/dask-contrib/dask-sql/pull/767
    • [DF] Use correct schema in TableProvider by @andygrove in https://github.com/dask-contrib/dask-sql/pull/769
    • Update docs by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/768
    • [DF] Add support for switching schema in DaskSqlContext by @andygrove in https://github.com/dask-contrib/dask-sql/pull/770
    • Add step argument to get_window_bounds for pandas>=1.5 by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/774
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/775
    • c.ipython_magic fix for Jupyter Lab by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/772
    • [DF] Remove PyPI release workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/776
    • Remove PyPI release workflow by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/777
    • sync: main to datafusion-sql-planner by @github-actions in https://github.com/dask-contrib/dask-sql/pull/778

    New Contributors

    • @andygrove made their first contribution in https://github.com/dask-contrib/dask-sql/pull/494
    • @galipremsagar made their first contribution in https://github.com/dask-contrib/dask-sql/pull/542
    • @ChrisJar made their first contribution in https://github.com/dask-contrib/dask-sql/pull/629

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.8.0...2022.9.0.rc0

    Source code(tar.gz)
    Source code(zip)
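    The STDDEV aggregates and the SHOW SCHEMAS / SHOW TABLES FROM parsing mentioned in the notes above can be exercised from Python roughly as follows. This is a minimal sketch: the table, its contents and the schema name root are illustrative assumptions, not taken from the release notes.

    import pandas as pd
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()
    df = pd.DataFrame({"name": ["a", "a", "b"], "x": [1.0, 2.0, 4.0]})
    c.create_table("my_data", dd.from_pandas(df, npartitions=1))

    # STDDEV / STDDEV_SAMP / STDDEV_POP aggregates (#629)
    print(c.sql("""
        SELECT name, STDDEV(x) AS sd, STDDEV_SAMP(x) AS sd_samp, STDDEV_POP(x) AS sd_pop
        FROM my_data
        GROUP BY name
    """, return_futures=False))

    # Introspection statements whose parsing moved to Rust in this cycle (#697, #699)
    print(c.sql("SHOW SCHEMAS", return_futures=False))
    print(c.sql("SHOW TABLES FROM root", return_futures=False))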
  • 2022.8.0(Aug 16, 2022)

    What's Changed

    • Unpin dask/distributed for development by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/564
    • Update docs theme by @scharlottej13 in https://github.com/dask-contrib/dask-sql/pull/567
    • Make sure scheduler has Dask nightlies in upstream cluster testing by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/573
    • Update gpuCI RAPIDS_VER to 22.08 by @github-actions in https://github.com/dask-contrib/dask-sql/pull/565
    • Modify test environment pinnings to cover minimum versions by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/555
    • Don't move jar to local mvn repo by @ksonj in https://github.com/dask-contrib/dask-sql/pull/579
    • Add max version constraint for fugue by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/639
    • Add environment file & documentation for GPU tests by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/633
    • Validate UDF metadata by @brandon-b-miller in https://github.com/dask-contrib/dask-sql/pull/641
    • Set Dask-sql as the default Fugue Dask engine when installed by @goodwanghan in https://github.com/dask-contrib/dask-sql/pull/640
    • Generalize analyze/sample tests to resolve CI failures by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/668
    • Update CodeCov upload step in CI by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/666
    • Bump arrow version to 6.0.0 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/674
    • Update gpuCI RAPIDS_VER to 22.10 by @github-actions in https://github.com/dask-contrib/dask-sql/pull/665
    • Constrain dask pinnings for release by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/690

    New Contributors

    • @scharlottej13 made their first contribution in https://github.com/dask-contrib/dask-sql/pull/567
    • @ksonj made their first contribution in https://github.com/dask-contrib/dask-sql/pull/579

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.6.0...2022.8.0

    Source code(tar.gz)
    Source code(zip)
  • 2022.6.0(Jun 3, 2022)

    What's Changed

    • Unpin Dask/distributed versions by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/452
    • Add jsonschema to ci testing by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/454
    • Switch tests from pd.testing.assert_frame_equal to dd.assert_eq by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/365
    • Set max pin on antlr4-python-runtime by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/456
    • Move / minimize number of cudf / dask-cudf imports by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/480
    • Use map_partitions to compute LIMIT / OFFSET by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/517
    • Use dev images for independent cluster testing by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/518
    • Add documentation for FugueSQL integrations by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/523
    • TIMESTAMPDIFF support by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/495 (a usage sketch follows this release's notes)
    • Relax jsonschema testing dependency by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/546
    • Update upstream testing workflows by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/536
    • Fix pyarrow / cloudpickle failures in cluster testing by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/553
    • Use bash -l as default entrypoint for all upstream testing jobs by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/552
    • Constrain dask/distributed for release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/563

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.4.1...2022.6.0

    Source code(tar.gz)
    Source code(zip)
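    The TIMESTAMPDIFF support noted above can be used like this; a minimal sketch, with table and column names chosen purely for illustration:

    import pandas as pd
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()
    events = pd.DataFrame({
        "start_ts": pd.to_datetime(["2022-01-01", "2022-01-05"]),
        "end_ts": pd.to_datetime(["2022-01-03", "2022-01-10"]),
    })
    c.create_table("events", dd.from_pandas(events, npartitions=1))

    # Difference between two timestamp columns, measured in days
    print(c.sql(
        "SELECT TIMESTAMPDIFF(DAY, start_ts, end_ts) AS days_between FROM events",
        return_futures=False,
    ))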
  • 2022.4.1(Apr 8, 2022)

    What's Changed

    • Add Java source code to source distribution by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/451
    • Bump httpclient dependency by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/453

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.4.0...2022.4.1

    Source code(tar.gz)
    Source code(zip)
  • 2022.4.0(Apr 7, 2022)

    What's Changed

    • Switch github-script action to v3 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/379
    • Unpin dask/distributed following release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/381
    • Fix typo by @wence- in https://github.com/dask-contrib/dask-sql/pull/382
    • Remove defaults channel from conda envs by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/384
    • Don't persist dataframes before applying offset / limit by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/387
    • Update gpuCI RAPIDS_VER to 22.04 by @github-actions in https://github.com/dask-contrib/dask-sql/pull/374
    • Feature/jdbc by @PeterLappo in https://github.com/dask-contrib/dask-sql/pull/351
    • Bump gpuCI PYTHON_VER to 3.9 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/388
    • Stop using defaults channel in dev environments by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/393
    • Use versioneer to compute __version__ by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/396
    • [REVIEW] Modified show.ftl to conditionally expect FROM in parsing logic by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/371
    • Fix TIMESTAMP / DATE scalars, add support for DATE column casting by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/343 (a usage sketch follows this release's notes)
    • Enable ability for user to pass in a list of CBO rules that should be… by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/389
    • Drop support for python 3.7, add testing for python 3.10 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/383
    • Bump pre-release package versions to be greater than stable releases by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/405
    • Update pytest to generate a client fixture by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/398
    • Use build_ext/install_lib subclasses to build external java by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/406
    • Fix use of row UDFs at intermediate query stages by @brandon-b-miller in https://github.com/dask-contrib/dask-sql/pull/409
    • [Review] Refactor ConfigContainer to use dask config by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/392
    • Provide meta to result of complex _apply_offset by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/420
    • Fix logic for unary join operands like IS NOT NULL by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/428
    • Update docs theme, use sphinx-tabs for CPU/GPU examples by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/394
    • Resolve independent cluster test failures by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/437
    • Only use session-wide client fixture for independent cluster testing by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/439
    • Drop common column from result of cross join, remove from corresponding meta by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/408
    • Add basic predicate-pushdown optimization by @rjzamora in https://github.com/dask-contrib/dask-sql/pull/433
    • Add workflow to keep datafusion-sql-planner branch up to date by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/440
    • Update gpuCI RAPIDS_VER to 22.06 by @github-actions in https://github.com/dask-contrib/dask-sql/pull/434
    • Bump black style checks to 22.3.0 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/443
    • Check for ucx-py nightlies when updating gpuCI by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/441
    • Add handling for newer prompt_toolkit versions in cmd tests by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/447
    • Resolve gpuCI workflow failures by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/446
    • Update versions of Java dependencies by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/445
    • Update jackson databind version by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/449
    • Disable SQL server functionality by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/448
    • Update dask pinnings for release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/450

    New Contributors

    • @wence- made their first contribution in https://github.com/dask-contrib/dask-sql/pull/382
    • @PeterLappo made their first contribution in https://github.com/dask-contrib/dask-sql/pull/351
    • @rjzamora made their first contribution in https://github.com/dask-contrib/dask-sql/pull/433

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2022.1.0...2022.4.0

    Source code(tar.gz)
    Source code(zip)
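    The DATE casting support noted above allows queries like the following; a minimal sketch with an illustrative table:

    import pandas as pd
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()
    df = pd.DataFrame({"ts": pd.to_datetime(["2022-04-07 10:00", "2022-04-07 18:30", "2022-04-08 09:15"])})
    c.create_table("events", dd.from_pandas(df, npartitions=1))

    # Cast a timestamp column to DATE and aggregate per day
    print(c.sql("""
        SELECT CAST(ts AS DATE) AS day, COUNT(*) AS n
        FROM events
        GROUP BY CAST(ts AS DATE)
    """, return_futures=False))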
  • 2022.1.0(Jan 24, 2022)

    What's Changed

    • Disable CodeCov upload in tests on forks by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/349
    • Cost based optimization by @nils-braun in https://github.com/dask-contrib/dask-sql/pull/226
    • Add latest dask-ml to upstream testing by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/354
    • Bump gpuCI CUDA_VER to 11.5 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/348
    • Update Calcite to 1.29.0 and log4j to 2.17.0 to address CVE-2021-44228 by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/347
    • Removed unneeded log4j instance that was causing version conflicts and generating slf4j warning messages by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/358
    • Added getContext() method to DaskPlanner to ensure that CalciteConfigC… by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/362
    • Add os environment option to enable remote jvm debugging by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/363
    • Fix issue reporting in scheduled upstream testing by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/355
    • Remove Join Condition Push CBO Rule since it was causing infinite cos… by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/359
    • Parse ROWS as tuples in SQL kwargs by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/338
    • Add support for gpu kwarg in Context.sql and explain by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/368
    • Remove max version restriction for Dask/Distributed by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/369
    • Use upstream Dask for complex sorting operations by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/336
    • xfail failing model tests by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/373
    • Add substr tests by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/372
    • Fix pandas BaseIndexer import by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/377
    • Bump dask-ml dependency by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/378
    • [REVIEW] Fix unary conditional join operations by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/366
    • Pin dask/distributed versions for release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/380

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2021.12.0...2022.1.0

    Source code(tar.gz)
    Source code(zip)
  • 2021.12.0(Dec 13, 2021)

    What's Changed

    • Update nightly recipe / setup for 2021.11.0 release by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/308
    • Add test build using latest Dask/Distributed by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/306
    • General GHA workflow clean up by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/313
    • Add testing for Python 3.9 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/314
    • Use Boa for nightly builds by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/318
    • Add handling for cuDF-backed tables in dask-sql-server by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/312
    • Row UDF scalar arguments by @brandon-b-miller in https://github.com/dask-contrib/dask-sql/pull/311
    • Update register_func() in context.py by @DaceT in https://github.com/dask-contrib/dask-sql/pull/282
    • Bump dask-ml dependency to 2021.11.16 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/322
    • Add groupby split_out config options to dask-sql by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/286
    • Remove null-splitting from _perform_aggregation by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/273
    • Revert "Remove null-splitting from _perform_aggregation" by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/325
    • Resolve failures in nightly package builds by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/328
    • Add workflow to automate gpuCI updates by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/327
    • Update gpuCI RAPIDS_VER to 22.02 by @github-actions in https://github.com/dask-contrib/dask-sql/pull/329
    • Installing Dask-SQL w/ RAPIDS by @DaceT in https://github.com/dask-contrib/dask-sql/pull/324
    • Remove null-splitting from _perform_aggregation by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/326
    • Generalize table check in _get_tables_from_stack by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/333
    • Add support for GPU table creation in dask / location plugins by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/251 (a usage sketch follows this release's notes)
    • Circumvent deep copy of context in PredictModelPlugin by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/334
    • Unrestrict conda-build version used for nightly builds by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/335
    • Update conditions for apply_sort fast codepath by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/337
    • [REVIEW] Add support and tests for cuML and XGBoost by @VibhuJawa in https://github.com/dask-contrib/dask-sql/pull/330
    • Ignore case for queries in the parser configuration by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/316
    • Ignore .swp files by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/341
    • Added Alter schema and Alter Table by @rajagurunath in https://github.com/dask-contrib/dask-sql/pull/285
    • Bump dask dependency to >=2021.11.1,<=2021.11.2 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/345

    New Contributors

    • @DaceT made their first contribution in https://github.com/dask-contrib/dask-sql/pull/282
    • @github-actions made their first contribution in https://github.com/dask-contrib/dask-sql/pull/329

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/2021.11.0...2021.12.0

    Source code(tar.gz)
    Source code(zip)
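    The GPU table support noted above makes registered tables dask-cudf backed, so that subsequent queries run on the GPU. A minimal sketch, assuming cudf and dask-cudf are installed and a GPU is available; the data itself is illustrative:

    import pandas as pd
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()
    df = dd.from_pandas(pd.DataFrame({"x": range(10)}), npartitions=2)

    # gpu=True converts the input to a dask-cudf dataframe on registration
    c.create_table("my_data", df, gpu=True)

    # The query then runs against the GPU-backed table
    print(c.sql("SELECT SUM(x) AS total FROM my_data", return_futures=False))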
  • 2021.11.0(Nov 10, 2021)

    What's Changed

    • Use unique names for null/non-null groupby columns by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/289
    • Use string separator in nightly version string by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/295
    • [Review] Update readme and docstrings to indicate GPU support by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/292
    • Add DISTRIBUTE BY to dask-sql grammar by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/228
    • Use Dask's sort_values for first column sorting in apply_sort by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/255
    • xfail broken dask-ml tests by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/304
    • Bump dask pinning to 2021.10.0 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/303
    • Prevent JVM Segfault by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/294
    • Make meta consistent with results of cross join by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/300

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/0.4.0...2021.11.0

    Source code(tar.gz)
    Source code(zip)
  • 0.4.0(Nov 2, 2021)

    What's Changed

    • More efficient window implementation by @nils-braun in https://github.com/dask-contrib/dask-sql/pull/217
    • Support creating tables from cudf dataframes by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/220
    • Re-enable the hive tests by @nils-braun in https://github.com/dask-contrib/dask-sql/pull/221
    • Reading tables with a dask-cudf DataFrame by @sarahyurick in https://github.com/dask-contrib/dask-sql/pull/224
    • Introduces parallel tests to speed up the processing by @nils-braun in https://github.com/dask-contrib/dask-sql/pull/230
    • Explicitly install sasl in CI by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/244
    • Add gpuCI support by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/240
    • Add issue templates by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/247
    • Fix test_deprecation_warning in gpuCI by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/248
    • [Review] Add fast path for multi-column sorting by @quasiben in https://github.com/dask-contrib/dask-sql/pull/229
    • Add conda dev environments for Python 3.7/3.8, JDK 8/11 by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/238
    • Add support for CONCAT by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/253
    • [REVIEW] Fast path when possible for non numeric aggregation by @VibhuJawa in https://github.com/dask-contrib/dask-sql/pull/236
    • Restrict docker/deploy jobs to upstream repo, cancel concurrent test runs by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/254
    • Do not persist data to memory by default when creating tables by @jdye64 in https://github.com/dask-contrib/dask-sql/pull/245
    • Add flake8 pre-commit hook by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/235
    • Automatically label bugs / feature requests for triage by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/261
    • Support pandas-style row UDFs by @brandon-b-miller in https://github.com/dask-contrib/dask-sql/pull/246 (a usage sketch follows this release's notes)
    • Publish nightly builds to dask conda channel by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/263
    • Revert conda build tweaks by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/266
    • Try anaconda upload again for conda package upload by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/267
    • Feature/improve cli by @rajagurunath in https://github.com/dask-contrib/dask-sql/pull/231
    • Simplify DataContainer.assign operation by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/271
    • Added bug fix for window func by @rajagurunath in https://github.com/dask-contrib/dask-sql/pull/277
    • Pass return_type through to meta in apply by @brandon-b-miller in https://github.com/dask-contrib/dask-sql/pull/275
    • [Review] Add gpu tests for string functions by @ayushdg in https://github.com/dask-contrib/dask-sql/pull/256
    • Simplify single-partition sorting logic by @charlesbluca in https://github.com/dask-contrib/dask-sql/pull/262
    • Require UDF return type and update docs by @brandon-b-miller in https://github.com/dask-contrib/dask-sql/pull/283

    New Contributors

    • @ayushdg made their first contribution in https://github.com/dask-contrib/dask-sql/pull/220
    • @charlesbluca made their first contribution in https://github.com/dask-contrib/dask-sql/pull/244
    • @quasiben made their first contribution in https://github.com/dask-contrib/dask-sql/pull/229
    • @VibhuJawa made their first contribution in https://github.com/dask-contrib/dask-sql/pull/236
    • @jdye64 made their first contribution in https://github.com/dask-contrib/dask-sql/pull/245
    • @brandon-b-miller made their first contribution in https://github.com/dask-contrib/dask-sql/pull/246

    Full Changelog: https://github.com/dask-contrib/dask-sql/compare/0.3.9...0.4.0

    Source code(tar.gz)
    Source code(zip)
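    The row UDF and UDF return type changes above relate to the function registration API; a minimal sketch of registering and using a scalar UDF, with illustrative names and data:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()
    df = pd.DataFrame({"x": [1.0, 2.0, 3.0]})
    c.create_table("my_data", dd.from_pandas(df, npartitions=1))

    def my_func(x):
        # operates column-wise on the underlying series
        return x ** 2

    # name, list of (parameter name, type) pairs, and the now-required return type (#283)
    c.register_function(my_func, "my_func", [("x", np.float64)], np.float64)

    print(c.sql("SELECT my_func(x) AS x_squared FROM my_data", return_futures=False))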
  • 0.3.9(Aug 18, 2021)

    Bugfixes

    • Do not depend on packages not specified in setup.py (#214)
    • Use the mambaforge installer to speed up the build process (#216)
    • Update all links from nils-braun to dask-contrib. (Fixes #212)
    • Make JOINs also work for non-pandas dask dataframes (e.g. dask-cudf) (#211)
    Source code(tar.gz)
    Source code(zip)
  • 0.3.8(Aug 17, 2021)

  • 0.3.7(Aug 10, 2021)

    Features

    • Allow for multiple schemas (#205)
    • AutoML capabilities (#199)
    • Implement the REGR_COUNT SQL operator (#193)
    • ML model improvements: added SHOW MODELS, EXPORT MODEL and DESCRIBE MODEL (#185, #191) (a usage sketch follows this release's notes)
    • Implement the search and sargs operator (#184)

    Bugfixes

    • Fixes for pandas 1.3.0 (#202)
    • Fix test fixture order (#194)
    • Fix a failing build, as ciso8601 is currently not pip-installable (#192)
    Source code(tar.gz)
    Source code(zip)
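    The SHOW MODELS, DESCRIBE MODEL and EXPORT MODEL statements above operate on models registered via CREATE MODEL. A minimal sketch; the estimator, the wrap_predict option and the export format/location are illustrative assumptions:

    import pandas as pd
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()
    df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "target": [0, 0, 1, 1]})
    c.create_table("training_data", dd.from_pandas(df, npartitions=1))

    # Train and register a model through SQL
    c.sql("""
        CREATE MODEL my_model WITH (
            model_class = 'sklearn.linear_model.LogisticRegression',
            wrap_predict = True,
            target_column = 'target'
        ) AS (SELECT x, target FROM training_data)
    """)

    # New in this release: list, describe and export registered models
    print(c.sql("SHOW MODELS", return_futures=False))
    print(c.sql("DESCRIBE MODEL my_model", return_futures=False))
    c.sql("EXPORT MODEL my_model WITH (format = 'pickle', location = 'my_model.pkl')")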
  • 0.3.6(May 16, 2021)

  • 0.3.5(May 15, 2021)

    Bugfixes

    • Speed up aggregations when there are no aggregates (#174)
    • Register both the lower- and upper-case versions of a function (#177)
    • Revert a bug in the casting logic so that values are only cast if really needed (#176)
    Source code(tar.gz)
    Source code(zip)
  • 0.3.4(May 13, 2021)

    Small feature addons

    • Added correct casting and mod operation (#172)
    • Implement OVER for arbitrary windows (#164) (a usage sketch follows this release's notes)
    • Allow starting a SQL server from a Jupyter notebook (#162)

    Bugfixes and Improvements

    • Sort optimizations (#167, #173)
    • Fix the scikit-learn version in the Dockerfile
    • Add test with independent dask cluster (#165)
    • Speed up builds with mamba (#171)
    • Remove version constraints for pandas and dask as the errors were fixed upstream (#170)
    • Fixed the replacement of functions/aggregations and added a test (#169)
    • Added missing version in pom
    Source code(tar.gz)
    Source code(zip)
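    The OVER implementation above enables running window calculations with explicit frames; a minimal sketch over an illustrative table:

    import pandas as pd
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()
    df = pd.DataFrame({"name": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})
    c.create_table("my_data", dd.from_pandas(df, npartitions=1))

    # Running sum per group using an explicit window frame
    print(c.sql("""
        SELECT
            name, x,
            SUM(x) OVER (
                PARTITION BY name
                ORDER BY x
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS running_sum
        FROM my_data
    """, return_futures=False))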
  • 0.3.3(Apr 30, 2021)

    Small feature addons

    • Allow function reregistration (#161)
    • Upgrade the fugue dependency (#160)
    • Implement a wrapper for the prompt_toolkit session (#159)
    Source code(tar.gz)
    Source code(zip)
  • 0.3.2(Apr 13, 2021)

    Small feature addons

    • First working (but slow) implementation of OVER (#157)
    • Add a visualize function (#153)
    • IPython/Jupyter magic (#146) (a usage sketch follows this release's notes)
    • Hive/Databricks from SQL (#145)

    Bugfixes and Improvements

    • Improve documentation
    • Better cross joins (#150)
    • Fix a bug which occurs when only filters are present in groupbys (#154)
    • Make testing a bit easier to type
    • Fix a warning on regexes
    • Split out the jupyter notebook integration (#152)
    • Add pre commit hook (#149)
    • Limit the dask version until the dask-ml problem is fixed (#147)
    • Turn off docker image building of PRs
    • Fix integration with dbfs using the newest fsspec version (#140)
    • Show a reasonable traceback on exceptions (#142)
    • Docker image improvements (#137)
    • Support for Float (pandas extension type) and filter with NaNs (#136)
    Source code(tar.gz)
    Source code(zip)
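    The Jupyter magic above binds a %%sql cell magic to a context; a minimal sketch of how it is activated inside a notebook (the query is illustrative):

    # In a Jupyter/IPython session
    from dask_sql import Context

    c = Context()
    c.ipython_magic()   # registers the %%sql cell magic for this context

    # A following notebook cell can then contain plain SQL:
    # %%sql
    # SELECT 1 + 1 AS two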
  • 0.3.1(Feb 7, 2021)

    Small feature addons

    • Aggregate improvements and SQL compatibility (#134)
    • New call operations (#122)
    • Added notebook with a 'Tour de dask-sql' (#119)

    Bugfixes and Improvements

    • Docs improvements (#132)
    • Fix the fugue dependency (#133)
    • Pandas dependency fix (#129)
    • Added missing iris.csv data set
    • Pip installation docs improvement (#128)
    • Correctly sort NULLs (#126)
    • Importlib import (#125)
    • Do not touch the already installed dask and pandas versions, as this may lead to incompatibilities (#123)
    • Average decimal type (#121)
    • Fixing a bug in column container copies (#120)
    Source code(tar.gz)
    Source code(zip)
  • 0.3.0(Jan 21, 2021)

    Features

    • Allow for an sqlalchemy and a hive cursor input (#90)
    • Allow registering the same function with multiple parameter combinations (#93)
    • Additional datetime functions (#91)
    • Server and CMD CLI script (#94)
    • Split the SQL documentation in subpages and add a lot more documentation (#107)
    • DROP TABLE and IF NOT EXISTS/REPLACE (#98) (a usage sketch follows this release's notes)
    • SQL Machine Learning Syntax (#108)
    • ANALYZE TABLE (#105)
    • Random sample operators (#115)
    • Read from Intake Catalogs (#113)
    • Adding fugue integration and tests (#116) and fsql (#118)

    Bugfixes

    • Keep casing also with unquoted identifiers. (#88)
    • Scalar where clauses (#89)
    • Check for the correct java path on Windows (#86)
    • Remove # pragma once where it is not needed anymore (#92)
    • Refactor the hive input handling (#95)
    • Limit pandas version (#100)
    • Correctly handle the case where the Java version is undefined (#101)
    • Add datetime[ns, UTC] as understood type (#103)
    • Make sure to treat integers as integers (#109)
    • On ORDER BY queries, show the column names of the SELECT query (#110)
    • Always refer to a function with the name given by the user (#111)
    • Do not fail on empty SQL commands (#114)
    • Fix the random sample test (#117)
    Source code(tar.gz)
    Source code(zip)
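    The DROP TABLE and IF NOT EXISTS/REPLACE support above complements the CREATE TABLE ... WITH syntax for registering data purely from SQL. A minimal sketch; the file path and format are illustrative and the file is assumed to exist:

    from dask_sql import Context

    c = Context()

    # Register a table from a file location without touching Python dataframes
    c.sql("""
        CREATE TABLE IF NOT EXISTS my_data WITH (
            location = '/path/to/data.csv',
            format = 'csv'
        )
    """)

    # Re-create it in place, then remove it again
    c.sql("""
        CREATE OR REPLACE TABLE my_data WITH (
            location = '/path/to/data.csv',
            format = 'csv'
        )
    """)
    c.sql("DROP TABLE my_data")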
  • 0.2.2(Nov 28, 2020)

  • 0.2.1(Nov 19, 2020)

    Bugfixes and Improvements

    • Increase speed and parallelism of the limit algorithm and implement descending sorting (#75)
    • Improved the ability to create (materialized) views of queries (#77) (a usage sketch follows this release's notes)
    • Added missing __version__ variable (#79)
    • Improved Docker image (#78)
    • Allow arbitrary return types in SQL server (#76)
    • Bugfix: Added tzlocal dependencies
    Source code(tar.gz)
    Source code(zip)
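    The (materialized) view handling above distinguishes views, which are re-evaluated whenever they are queried, from tables created out of a query, which are registered once. A minimal sketch over an illustrative table:

    import pandas as pd
    import dask.dataframe as dd
    from dask_sql import Context

    c = Context()
    df = pd.DataFrame({"name": ["a", "b"], "x": [1, -2]})
    c.create_table("my_data", dd.from_pandas(df, npartitions=1))

    # A view is recalculated on every use ...
    c.sql("CREATE VIEW positive AS (SELECT * FROM my_data WHERE x > 0)")

    # ... while CREATE TABLE AS registers the query result as a new table
    c.sql("CREATE TABLE positive_table AS (SELECT * FROM my_data WHERE x > 0)")

    print(c.sql("SELECT * FROM positive", return_futures=False))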
  • 0.2.0(Nov 5, 2020)

    Additional Features

    • Unify the dask-sql API with blazingSQL (#63). This also brings an experimental hive input binding.
    • Added binder repository and example notebooks (forked from @raybellwaves) (#72)
    • Better/more correct Presto server (#69), now working with many BI tools and ready to serve multiple clients in parallel (a usage sketch follows this release's notes)
    • Enable input from published datasets (#68)
    • Use pytest for all the tests instead of unittest (#67)
    • SHOW SCHEMA now includes FROM and LIKE - and the information_schema is added (#62)
    • Some remaining simple operations (#54)

    Bugfixes

    • Allow None in LIKE calls and add tests for regression (#71)
    • Bugfix: correct the isinf check so that it also works distributed
    • Use the default conformance level, which e.g. allows reusing aliases in the query (#66)
    • Set JAVA_HOME in conda environments and warn the user if it is not set correctly (#65)

    Additional Fixes and Documentation

    • Some ignore file fixes
    • Fixes to typos, docs and formatting
    • Docker images with latest tag (#73)
    Source code(tar.gz)
    Source code(zip)
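    The Presto-compatible server above can be started from Python and then queried from BI tools or a presto client. A minimal sketch, assuming the run_server helper is importable from the dask_sql package as in current versions and that the default host/port are acceptable; the exact entry point may have differed in this early release:

    from dask_sql import Context, run_server

    c = Context()
    # ... register tables on c as usual ...

    # Blocks and serves the presto wire protocol (defaults assumed here)
    run_server(context=c)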
  • 0.1.2(Oct 14, 2020)

  • 0.1.1(Oct 13, 2020)
