A suite of benchmarks for CPU and GPU performance of the most popular high-performance libraries for Python :rocket:

HPC benchmarks for Python

This is a suite of benchmarks to test the sequential CPU and GPU performance of various computational backends with Python frontends.

Specifically, we want to test which high-performance backend is best for geophysical (finite-difference based) simulations.

Contents

  • FAQ
  • Environment setup
  • Usage
  • Example results
  • Conclusion
  • Contributing

FAQ

Why?

The scientific Python ecosystem is thriving, but high-performance computing in Python isn't really a thing yet. We try to change this with our pure Python ocean simulator Veros, but which backend should we use for computations?

Tremendous amounts of time and resources go into the development of Python frontends to high-performance backends, but those are usually tailored towards deep learning. We wanted to see whether we can profit from those advances by (ab-)using these libraries for geophysical modelling.

Why do the benchmarks look so weird?

These are more or less verbatim copies from Veros (i.e., actual parts of a physical model). Most earth system and climate model components are based on finite-difference schemes to compute derivatives. This can be represented in vectorized form by index shifts of arrays (such as arr[1:] - arr[:-1], a first-order difference of arr at every point; see the example below). The most common index range is [2:-2], which represents the full domain (the two outermost grid cells are overlap / "ghost cells" that allow us to shift the array across the boundary).
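
To make this concrete, here is a minimal NumPy illustration of the index-shift pattern (array size and grid spacing are made up for the example):

    import numpy as np

    arr = np.random.rand(100)  # some quantity on a 1D grid
    dx = 0.1                   # grid spacing (assumed)

    # first-order difference between neighboring cells via an index shift
    darr = (arr[1:] - arr[:-1]) / dx

    # a vectorized update of the interior only; the two outermost cells on
    # each side are ghost cells providing overlap across the boundary
    arr[2:-2] = 0.5 * (arr[3:-1] + arr[1:-3])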

Now, maths is difficult, and numerics are weird. When many different physical quantities (defined on different grids) interact, things get messy very fast.

Why only test sequential CPU performance?

Two reasons:

  • I was curious to see how good the compilers are without being able to fall back to thread parallelism.
  • In many physical models, it is pretty straightforward to parallelize the model "by hand" via MPI (see the sketch below). Therefore, we are not really dependent on good parallel performance out of the box.
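
For illustration, here is a minimal sketch of such "by hand" domain decomposition with mpi4py (this is not Veros' actual parallelization; sizes and layout are made up):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    # each rank owns a chunk of the domain, plus one ghost cell per side
    arr = np.zeros(1000 + 2)
    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    # exchange ghost cells with the neighboring ranks
    comm.Sendrecv(arr[1:2], dest=left, recvbuf=arr[-1:], source=right)
    comm.Sendrecv(arr[-2:-1], dest=right, recvbuf=arr[0:1], source=left)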

Which backends are currently supported?

  • NumPy
  • CuPy
  • JAX
  • Aesara
  • Numba
  • Pytorch
  • Tensorflow

(not every backend is available for every benchmark)

What is included in the measurements?

Pure time spent number crunching. Preparing the inputs, copying stuff to and from the GPU, compilation time, the time it takes to check results, etc. are excluded. This is based on the assumption that these things are only done a few times per simulation (i.e., that their cost is amortized during long-running simulations).
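
In pseudo-code, the measurement loop looks roughly like this (a sketch only; run.py's actual implementation may differ):

    import time

    def measure(benchmark, inputs, repetitions, burnin=1):
        timings = []
        for i in range(repetitions + burnin):
            start = time.perf_counter()
            benchmark(*inputs)  # only the number crunching is timed
            elapsed = time.perf_counter() - start
            if i >= burnin:     # burn-in runs (e.g. JIT compilation) are discarded
                timings.append(elapsed)
        return timings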

How does this compare to a low-level implementation?

As a rule of thumb (from our experience with Veros), the performance of a Fortran implementation is very close to that of the Numba backend, or ~3 times faster than NumPy.

Environment setup

For CPU:

$ conda env create -f environment-cpu.yml
$ conda activate pyhpc-bench-cpu

For GPU:

$ conda env create -f environment-gpu.yml
$ conda activate pyhpc-bench-gpu

If you prefer to install things by hand, just have a look at the environment files to see what you need. You don't need to install all backends; if a module is unavailable, it is skipped automatically.
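
Under the hood, skipping an unavailable backend boils down to an import guard like this (a sketch, not the repo's exact code):

    import importlib

    def backend_available(module_name):
        # a backend counts as available if its module imports cleanly
        try:
            importlib.import_module(module_name)
        except ImportError:
            return False
        return True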

Usage

Your entrypoint is the script run.py:

$ python run.py --help
Usage: run.py [OPTIONS] BENCHMARK

  HPC benchmarks for Python

  Usage:

      $ python run.py benchmarks/<BENCHMARK_FOLDER>

  Examples:

      $ taskset -c 0 python run.py benchmarks/equation_of_state

      $ python run.py benchmarks/equation_of_state -b numpy -b jax --device
      gpu

  More information:

      https://github.com/dionhaefner/pyhpc-benchmarks

Options:
  -s, --size INTEGER              Run benchmark for this array size
                                  (repeatable)  [default: 4096, 16384, 65536,
                                  262144, 1048576, 4194304]
  -b, --backend [numpy|cupy|jax|aesara|numba|pytorch|tensorflow]
                                  Run benchmark with this backend (repeatable)
                                  [default: run all backends]
  -r, --repetitions INTEGER       Fixed number of iterations to run for each
                                  size and backend [default: auto-detect]
  --burnin INTEGER                Number of initial iterations that are
                                  disregarded for final statistics  [default:
                                  1]
  --device [cpu|gpu|tpu]          Run benchmarks on given device where
                                  supported by the backend  [default: cpu]
  --help                          Show this message and exit.

Benchmarks are run for all combinations of the chosen sizes (-s) and backends (-b), in random order.
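
Conceptually, the run order is determined like this (a sketch; run.py may differ in detail):

    import itertools
    import random

    sizes = [4096, 16384, 65536]           # from -s
    backends = ["numpy", "numba", "jax"]   # from -b

    runs = list(itertools.product(sizes, backends))
    random.shuffle(runs)  # randomized order guards against systematic bias, e.g. thermal throttling
    for size, backend in runs:
        print(f"benchmarking {backend} at size {size:,}")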

CPU

Some backends refuse to be confined to a single thread, so I recommend you wrap your benchmarks in taskset to set processor affinity to a single core (only works on Linux):

$ conda activate pyhpc-bench-cpu
$ taskset -c 0 python run.py benchmarks/<benchmark_name>

GPU

Some backends use all available GPUs by default, while others don't. If you have multiple GPUs, you can select the one to use through CUDA_VISIBLE_DEVICES to keep things fair.

Some backends are greedy with allocating memory. On GPU, you can only run one backend at a time (add NumPy for reference):

$ conda activate pyhpc-bench-gpu
$ export CUDA_VISIBLE_DEVICES="0"
$ for backend in jax cupy pytorch tensorflow; do
...    python run.py benchmarks/<benchmark_name> --device gpu -b $backend -b numpy -s 10_000_000
...    done

Example results

Summary

Equation of state

Isoneutral mixing

Turbulent kinetic energy

Full reports

Conclusion

Lessons I learned by assembling these benchmarks (your mileage may vary):

  • The performance of JAX is very competitive, both on GPU and CPU. It is consistently among the top implementations on both platforms.
  • Pytorch performs very well on GPU for large problems (slightly better than JAX), but its CPU performance is not great for tasks with many slicing operations.
  • Numba is a great choice on CPU if you don't mind writing explicit for loops (which can be more readable than a vectorized implementation), being slightly faster than JAX with little effort.
  • JAX performance on GPU seems to be quite hardware-dependent: relatively speaking, it performs significantly better on a Tesla P100 than on a Tesla K80.
  • If you have embarrassingly parallel workloads, speedups of > 1000x are easy to achieve on high-end GPUs.
  • TPUs are catching up to GPUs. We can now get similar performance to a high-end GPU on these workloads.
  • Tensorflow is not great for applications like ours, since it lacks tools to apply partial updates to tensors (such as tensor[2:-2] = 0.; see the sketch after this list).
  • If you use Tensorflow on CPU, make sure to use XLA (experimental_compile) for tremendous speedups.
  • CuPy is nice! Often you don't need to change anything in your NumPy code to have it run on GPU (with decent, but not outstanding performance).
  • Reaching Fortran performance on CPU for non-trivial tasks is hard :)
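
To illustrate the two Tensorflow points above: a partial update that is a one-liner in NumPy requires a scatter operation in Tensorflow, while XLA compilation is a one-line toggle (a sketch against the TF 2.x API; newer versions rename experimental_compile to jit_compile):

    import numpy as np
    import tensorflow as tf

    # NumPy: partial update via slice assignment
    arr = np.ones(10)
    arr[2:-2] = 0.

    # Tensorflow: tensors are immutable, so the same update needs a scatter op
    t = tf.ones(10)
    indices = tf.reshape(tf.range(2, 8), (-1, 1))  # interior indices 2..7
    t = tf.tensor_scatter_nd_update(t, indices, tf.zeros(6))

    # XLA compilation for large CPU speedups
    @tf.function(experimental_compile=True)
    def kernel(x):
        return 0.5 * (x[1:] + x[:-1])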

Contributing

Community contributions are encouraged! Whether you want to donate another benchmark, share your experience, optimize an implementation, or suggest another backend - feel free to ask or open a PR.

Adding a new backend

Adding a new backend is easy!

Let's assume that you want to add support for a library called speedygonzales. All you need to do is this:

  • Implement a benchmark to use your library, e.g. benchmarks/equation_of_state/eos_speedygonzales.py.

  • Register the benchmark in the respective __init__.py file (benchmarks/equation_of_state/__init__.py), by adding "speedygonzales" to its __implementations__ tuple.

  • Register the backend, by adding its setup function to the __backends__ dict in backends.py.

    A setup function is called before every call to your benchmark and can be used for custom setup and teardown. In the simplest case, it is just:

    def setup_speedygonzales(device='cpu'):
        # code to run before benchmark
        yield
        # code to run after benchmark
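
    For context, a generator-based setup function like this can be driven as follows (a hypothetical driver; the actual harness in run.py may differ, and run_benchmark is a made-up stand-in for the timed call):

    def run_benchmark():
        pass  # stand-in for the timed benchmark call

    gen = setup_speedygonzales(device='cpu')
    next(gen)        # runs the setup code up to the yield

    run_benchmark()

    next(gen, None)  # resumes past the yield to run the teardown code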

Then, you can run the benchmark with your new backend:

$ python run.py benchmarks/equation_of_state -b speedygonzales

Comments
  • fastmath

    Hi @dionhaefner, great comparisons, thanks for that! Out of interest: did you ever try to run numba with fastmath=True? Does it make any difference, and if so, how much?

    opened by prisae 9
  • turbulent_kinetic_energy returns inconsistent results

    I am working on https://github.com/dionhaefner/pyhpc-benchmarks/pull/14. The command has inconsistent result output:

    $ python run.py -r 2 -s 1048576 --device cpu -b pytorch benchmarks/turbulent_kinetic_energy/
    
    Using pytorch version 1.13.0.dev20220617+cu113
    Running 3 benchmarks...  [------------------------------------]    0%Error: inconsistent results for size 1048576
    Error: inconsistent results for size 1048576
    Error: inconsistent results for size 1048576
    Running 3 benchmarks...  [####################################]  100%
    
    benchmarks.turbulent_kinetic_energy
    ===================================
    Running on CPU
    
    size          backend     calls     mean      stdev     min       25%       median    75%       max       Δ
    ------------------------------------------------------------------------------------------------------------------
       1,048,576  pytorch            2     0.573     0.028     0.544     0.559     0.573     0.587     0.601     1.000
    
    (time in wall seconds, less is better)
    

    Looks like two consecutive runs will generate inconsistent results for turbulent_kinetic_energy. I guess the root cause is this line: https://github.com/dionhaefner/pyhpc-benchmarks/blob/master/benchmarks/turbulent_kinetic_energy/tke_pytorch.py#L264

    There could be non-deterministic numeric results when running mask = tke[2:-2, 2:-2, -1, taup1] < 0.0

    opened by xuzhao9 5
  • deprecated jax ops: index_update

    The call

        for backend in jax; do python run.py benchmarks/isoneutral_mixing/ --device gpu -b $backend -b numpy; done

    yields

    dTdz = jax.ops.index_update(
    AttributeError: module 'jax.ops' has no attribute 'index_update'
    

    and indeed index_update is no longer a thing: https://jax.readthedocs.io/en/latest/jax.ops.html

    opened by ilemhadri 1
  • DRAFT: Add Transonic + {Pythran, Cython}

    Fixes #9

    Notes:

    1. Calling the setup function multiple times in a benchmark should be avoided
    2. Equation of state benchmark was easy to implement
    3. Isoneutral benchmark has some issues -- does not compile yet, despite workaround in ba03d48
    4. TODO: Turbulent kinetic energy benchmark
    opened by ashwinvis 2
  • Compare with TACO Python binding

    The Tensor Algebra Compiler (https://github.com/tensor-compiler/taco) seems to be good at sparse/dense linear algebra and has a Python frontend: http://tensor-compiler.org/docs/pycomputations/index.html

    contributions-welcome 
    opened by learning-chip 1
  • Compare with an MLIR-based stencil DSL

    This project (https://github.com/spcl/open-earth-compiler/) provides a DSL frontend for stencil/PDE programs and relies on MLIR & LLVM to run on NVIDIA and AMD GPUs. It is not a Python frontend, but I think it can be called from Python (see https://arxiv.org/abs/2005.13014).

    contributions-welcome 
    opened by learning-chip 1
  • Compare with DaCe framework?

    DaCe (https://github.com/spcl/dace) is a parallel computing framework that also supports a NumPy frontend, similar to JAX and Numba. It runs on CPU/GPU/FPGA. It would be interesting to add it for comparison!

    contributions-welcome 
    opened by learning-chip 4
Releases (v3.0)
  • v3.0 (Oct 28, 2021)

    • Theano and Bohrium are dead 💀🦴
    • Aesara replaces Theano on CPU
    • New Pytorch implementation for TKE benchmark
    • Updates of all library versions and a complete re-run of reference results 📈
  • v2.1 (Oct 5, 2021)

  • v2.0 (Jul 22, 2020)

Owner
Dion Häfner
I do science with Python.