Fast 1D and 2D histogram functions in Python

Overview

Azure Status asv

About

Sometimes you just want to compute simple 1D or 2D histograms with regular bins. Fast. No nonsense. Numpy's histogram functions are versatile, and can handle for example non-regular binning, but this versatility comes at the expense of performance.

The fast-histogram mini-package aims to provide simple and fast histogram functions for regular bins that don't compromise on performance. It doesn't do anything complicated - it just implements a simple histogram algorithm in C and keeps it simple. The aim is to have functions that are fast but also robust and reliable. The result is a 1D histogram function here that is 7-15x faster than numpy.histogram, and a 2D histogram function that is 20-25x faster than numpy.histogram2d.

To install:

pip install fast-histogram

or if you use conda you can instead do:

conda install -c conda-forge fast-histogram

The fast_histogram module then provides two functions: histogram1d and histogram2d:

from fast_histogram import histogram1d, histogram2d

Example

Here's an example of binning 10 million points into a regular 2D histogram:

In [1]: import numpy as np

In [2]: x = np.random.random(10_000_000)

In [3]: y = np.random.random(10_000_000)

In [4]: %timeit _ = np.histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
935 ms ± 58.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: from fast_histogram import histogram2d

In [6]: %timeit _ = histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
40.2 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(note that 10_000_000 is possible in Python 3.6 syntax, use 10000000 instead in previous versions)

The version here is over 20 times faster! The following plot shows the speedup as a function of array size for the bin parameters shown above:

Comparison of performance between Numpy and fast-histogram

as well as results for the 1D case, also with 30 bins. The speedup for the 2D case is consistently between 20-25x, and for the 1D case goes from 15x for small arrays to around 7x for large arrays.

Q&A

Why don't the histogram functions return the edges?

Computing and returning the edges may seem trivial but it can slow things down by a factor of a few when computing histograms of 10^5 or fewer elements, so not returning the edges is a deliberate decision related to performance. You can easily compute the edges yourself if needed though, using numpy.linspace.

Doesn't package X already do this, but better?

This may very well be the case! If this duplicates another package, or if it is possible to use Numpy in a smarter way to get the same performance gains, please open an issue and I'll consider deprecating this package :)

One package that does include fast histogram functions (including in n-dimensions) and can compute other statistics is vaex, so take a look there if you need more advanced functionality!

Are the 2D histograms not transposed compared to what they should be?

There is technically no 'right' and 'wrong' orientation - here we adopt the convention which gives results consistent with Numpy, so:

numpy.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

should give the same result as:

fast_histogram.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

Why not contribute this to Numpy directly?

As mentioned above, the Numpy functions are much more versatile, so they could not be replaced by the ones here. One option would be to check in Numpy's functions for cases that are simple and dispatch to functions such as the ones here, or add dedicated functions for regular binning. I hope we can get this in Numpy in some form or another eventually, but for now, the aim is to have this available to packages that need to support a range of Numpy versions.

Why not use Cython?

I originally implemented this in Cython, but found that I could get a 50% performance improvement by going straight to a C extension.

What about using Numba?

I specifically want to keep this package as easy as possible to install, and while Numba is a great package, it is not trivial to install outside of Anaconda.

Could this be parallelized?

This may benefit from parallelization under certain circumstances. The easiest solution might be to use OpenMP, but this won't work on all platforms, so it would need to be made optional.

Couldn't you make it faster by using the GPU?

Almost certainly, though the aim here is to have an easily installable and portable package, and introducing GPUs is going to affect both of these.

Why make a package specifically for this? This is a tiny amount of functionality

Packages that need this could simply bundle their own C extension or Cython code to do this, but the main motivation for releasing this as a mini-package is to avoid making pure-Python packages into packages that require compilation just because of the need to compute fast histograms.

Can I contribute?

Yes please! This is not meant to be a finished package, and I welcome pull request to improve things.

Owner
Thomas Robitaille
Thomas Robitaille
Productivity Tools for Plotly + Pandas

Cufflinks This library binds the power of plotly with the flexibility of pandas for easy plotting. This library is available on https://github.com/san

Jorge Santos 2.7k Dec 30, 2022
A dashboard built using Plotly-Dash for interactive visualization of Dex-connected individuals across the country.

Dashboard For The DexConnect Platform of Dexterity Global Working prototype submission for internship at Dexterity Global Group. Dashboard for real ti

Yashasvi Misra 2 Jun 15, 2021
📊 Extensions for Matplotlib

📊 Extensions for Matplotlib

Nico Schlömer 519 Dec 30, 2022
3D Vision functions with end-to-end support for deep learning developers, written in Ivy.

Ivy vision focuses predominantly on 3D vision, with functions for camera geometry, image projections, co-ordinate frame transformations, forward warping, inverse warping, optical flow, depth triangul

Ivy 61 Dec 29, 2022
Blender addon that creates a temporary window of any type from the 3D View.

CreateTempWindow2.8 Blender addon that creates a temporary window of any type from the 3D View. Features Can the following window types: 3D View Graph

3 Nov 27, 2022
Some useful extensions for Matplotlib.

mplx Some useful extensions for Matplotlib. Contour plots for functions with discontinuities plt.contour mplx.contour(max_jump=1.0) Matplotlib has pro

Nico Schlömer 519 Dec 30, 2022
A simple python tool for explore your object detection dataset

A simple tool for explore your object detection dataset. The goal of this library is to provide simple and intuitive visualizations from your dataset and automatically find the best parameters for ge

GRADIANT - Centro Tecnolóxico de Telecomunicacións de Galicia 142 Dec 25, 2022
matplotlib: plotting with Python

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Check out our home page for more inform

Matplotlib Developers 16.7k Jan 08, 2023
Python code for solving 3D structural problems using the finite element method

3DFEM Python 3D finite element code This python code allows for solving 3D structural problems using the finite element method. New features will be a

Rémi Capillon 6 Sep 29, 2022
Create a visualization for Trump's Tweeted Words Using Python

Data Trump's Tweeted Words This plot illustrates twitter word occurences. We already did the coding I needed for this plot, so I was very inspired to

7 Mar 27, 2022
Interactive Dashboard for Visualizing OSM Data Change

Dashboard and intuitive data downloader for more interactive experience with interpreting osm change data.

1 Feb 20, 2022
Quickly and accurately render even the largest data.

Turn even the largest data into images, accurately Build Status Coverage Latest dev release Latest release Docs Support What is it? Datashader is a da

HoloViz 2.9k Dec 28, 2022
A GUI for Pandas DataFrames

About Demo Installation Usage Features More Info About PandasGUI is a GUI for viewing, plotting and analyzing Pandas DataFrames. Demo Installation Ins

Adam Rose 2.8k Dec 24, 2022
Set of matplotlib operations that are not trivial

Matplotlib Snippets This repository contains a set of matplotlib operations that are not trivial. Histograms Histogram with bins adapted to log scale

Raphael Meudec 1 Nov 15, 2021
CONTRIBUTIONS ONLY: Voluptuous, despite the name, is a Python data validation library.

CONTRIBUTIONS ONLY What does this mean? I do not have time to fix issues myself. The only way fixes or new features will be added is by people submitt

Alec Thomas 1.8k Dec 31, 2022
HW_02 Data visualisation task

HW_02 Data visualisation and Matplotlib practice Instructions for HW_02 Idea for data analysis As I was brainstorming ideas and running through databa

9 Dec 13, 2022
Python histogram library - histograms as updateable, fully semantic objects with visualization tools. [P]ython [HYST]ograms.

physt P(i/y)thon h(i/y)stograms. Inspired (and based on) numpy.histogram, but designed for humans(TM) on steroids(TM). The goal is to unify different

Jan Pipek 120 Dec 08, 2022
termplotlib is a Python library for all your terminal plotting needs.

termplotlib termplotlib is a Python library for all your terminal plotting needs. It aims to work like matplotlib. Line plots For line plots, termplot

Nico Schlömer 553 Dec 30, 2022
Personal IMDB Graphs with Bokeh

Personal IMDB Graphs with Bokeh Do you like watching movies and also rate all of them in IMDB? Would you like to look at your IMDB stats based on your

2 Dec 15, 2021
649 Pokémon palettes as CSVs, with a Python lib to turn names/IDs into palettes, or MatPlotLib compatible ListedColormaps.

PokePalette 649 Pokémon, broken down into CSVs of their RGB colour palettes. Complete with a Python library to convert names or Pokédex IDs into eithe

11 Dec 05, 2022