Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Last update: Jan 04, 2023

Tuplex: Blazing Fast Python Data Science

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set. Under the hood, Tuplex is based on data-driven compilation and dual-mode processing, two key techniques that make it possible for Tuplex to provide speed comparable to a pipeline written in hand-optimized C++.
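The idea behind data-driven compilation and dual-mode processing can be illustrated with a small plain-Python sketch. This is a conceptual toy, not Tuplex's implementation: the helper names (`infer_common_type`, `run_dual_mode`) are hypothetical, and both paths simply call the same Python function, whereas a real system would run compiled code specialized to the common-case type on the fast path.

```python
from collections import Counter

def infer_common_type(sample):
    """Return the most frequent Python type in a sample (the 'normal case')."""
    return Counter(type(x) for x in sample).most_common(1)[0][0]

def run_dual_mode(udf, rows, sample_size=3):
    """Route rows matching the sampled common type to a 'fast' path,
    and all other rows to a 'general' fallback path."""
    common = infer_common_type(rows[:sample_size])
    results, paths = [], Counter()
    for x in rows:
        if type(x) is common:
            # Stand-in for code specialized to `common` (hypothetical).
            results.append(udf(x))
            paths["fast"] += 1
        else:
            # Stand-in for the slower, fully general path.
            results.append(udf(x))
            paths["general"] += 1
    return results, dict(paths)

results, paths = run_dual_mode(lambda x: x * 2, [1, 2, 3, "ab", 5])
print(results)  # [2, 4, 6, 'abab', 10]
print(paths)    # {'fast': 4, 'general': 1}
```

The point of the split is that the common case can be optimized aggressively because rare rows that do not fit it are still handled correctly elsewhere.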

You can join the discussion on Tuplex on our Gitter community or read up more on the background of Tuplex in our SIGMOD'21 paper.

Contributions welcome!

Installation
- Docker image
- PyPI
Building
Example
License

Installation

To install Tuplex on Linux, use the PyPI package. On MacOS, use the Docker container, which launches a Jupyter notebook with Tuplex preinstalled.

Docker

docker run -p 8888:8888 tuplex/tuplex

PyPI

pip install tuplex

Building

Tuplex is available for MacOS and Linux. The current version has been tested under MacOS 10.13-10.15 and Ubuntu 18.04 and 20.04 LTS. To install Tuplex, simply install the dependencies first and then build the package.

MacOS build from source

To build Tuplex, you need several other packages first which can be easily installed via brew.

brew install llvm@9 boost boost-python3 aws-sdk-cpp pcre2 antlr4-cpp-runtime googletest gflags yaml-cpp celero
python3 -m pip install cloudpickle numpy
python3 setup.py install

Ubuntu build from source

To facilitate installing the dependencies on Ubuntu, we provide two scripts (scripts/ubuntu1804/install_reqs.sh for Ubuntu 18.04, and scripts/ubuntu2004/install_reqs.sh for Ubuntu 20.04). To build an up-to-date version of Tuplex on Ubuntu 18.04, for example, run

./scripts/ubuntu1804/install_reqs.sh
python3 -m pip install cloudpickle numpy
python3 setup.py install

Customizing the build

Besides building a pip package, cmake can also be invoked directly. To compile the package via cmake, run:

mkdir build
cd build
cmake ..
make -j$(nproc)

The Python package for Tuplex can then be found in build/dist/python, and the C++ test executables (based on googletest) in build/dist/bin.

To customize the cmake build, the following options are available to be passed via -D:

option	values	description
`CMAKE_BUILD_TYPE`	`Release` (default), `Debug`, `RelWithDebInfo`, `tsan`, `asan`, `ubsan`	select compile mode. Tsan/Asan/Ubsan correspond to Google Sanitizers.
`BUILD_WITH_AWS`	`ON` (default), `OFF`	build with AWS SDK or not. On Ubuntu this will build the Lambda executor.
`GENERATE_PDFS`	`ON`, `OFF` (default)	in Debug mode, output PDF files for ASTs of UDFs, query plans, ... if graphviz is installed (e.g., `brew install graphviz`)
`PYTHON3_VERSION`	`3.6`, ...	select the python3 version to build against, specified as `major.minor`. To specify the python executable, use the options provided by cmake.
`LLVM_ROOT_DIR`	e.g. `/usr/lib/llvm-9`	specify which LLVM version to use
`BOOST_DIR`	e.g. `/opt/boost`	specify which Boost version to use. Note that the python component of boost has to be built against the python version used to build Tuplex

For example, to create a debug build which outputs PDFs use the following snippet:

cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
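When builds are scripted (for example in CI), the -D options from the table can be assembled programmatically. The following is a minimal sketch; `cmake_command` is a hypothetical helper, not part of Tuplex, and it only builds the argument list rather than running cmake.

```python
def cmake_command(options, source_dir=".."):
    """Build a cmake argument list from a dict of -D options.

    Boolean values are rendered as ON/OFF; everything else is passed as-is.
    """
    args = ["cmake"]
    for name, value in options.items():
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        args.append(f"-D{name}={value}")
    args.append(source_dir)
    return args

cmd = cmake_command({"CMAKE_BUILD_TYPE": "Debug", "GENERATE_PDFS": True})
print(" ".join(cmd))  # cmake -DCMAKE_BUILD_TYPE=Debug -DGENERATE_PDFS=ON ..
```

To run the command from the build directory, the list can be passed to `subprocess.run(cmd, check=True)`.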

Example

Tuplex can be used in Python interactive mode, in a Jupyter notebook, or by copying the code below into a file. To try it out, run the following example:

from tuplex import *
c = Context()
res = c.parallelize([1, 2, None, 4]).map(lambda x: (x, x * x)).collect()
# this prints [(1, 1), (2, 4), (4, 16)]
print(res)
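Note that the row containing None is missing from the printed result. One way to read this, as an assumption based only on the output shown above, is that the lambda raises a TypeError for None and the failing row is left out of the result instead of aborting the pipeline. The plain-Python sketch below mimics that behavior without Tuplex; `map_dropping_failures` is a hypothetical helper, not a Tuplex API.

```python
def map_dropping_failures(udf, rows):
    """Apply udf to each row; keep successful results and record failures."""
    kept, failed = [], []
    for row in rows:
        try:
            kept.append(udf(row))
        except Exception as exc:
            # Record the offending row and the exception type, then continue.
            failed.append((row, type(exc).__name__))
    return kept, failed

kept, failed = map_dropping_failures(lambda x: (x, x * x), [1, 2, None, 4])
print(kept)    # [(1, 1), (2, 4), (4, 16)]
print(failed)  # [(None, 'TypeError')]
```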

More examples can be found here.

License

Tuplex is available under the Apache 2.0 License. To cite the paper, use:

@inproceedings{10.1145/3448016.3457244,
author = {Spiegelberg, Leonhard and Yesantharao, Rahul and Schwarzkopf, Malte and Kraska, Tim},
title = {Tuplex: Data Science in Python at Native Code Speed},
year = {2021},
isbn = {9781450383431},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3448016.3457244},
doi = {10.1145/3448016.3457244},
booktitle = {Proceedings of the 2021 International Conference on Management of Data},
pages = {1718–1731},
numpages = {14},
location = {Virtual Event, China},
series = {SIGMOD/PODS '21}
}
