Stats, linear algebra and einops for xarray

Last update: Dec 28, 2022

Overview

xarray-einstats

⚠️ Caution: This project is still in a very early development stage

Installation

To install, run

(.venv) $ pip install xarray-einstats

Overview

As stated on their website:

xarray makes working with multi-dimensional labeled arrays simple, efficient and fun!

The code is often more verbose, but generally because it is clearer and thus less error prone and more intuitive. Here are some examples of this trade-off:

numpy	xarray
`a[2, 5]`	`da.sel(drug="paracetamol", subject=5)`
`a.mean(axis=(0, 1))`	`da.mean(dim=("chain", "draw"))`
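
To make the two rows above concrete, here is a minimal sketch on synthetic data. The array values, the shapes and the drug names other than "paracetamol" are assumptions made up for illustration.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Hypothetical measurements: 3 drugs x 6 subjects (random values).
a = rng.normal(size=(3, 6))
da = xr.DataArray(
    a,
    dims=("drug", "subject"),
    coords={
        "drug": ["aspirin", "ibuprofen", "paracetamol"],
        "subject": np.arange(6),
    },
)

# Positional indexing and label-based selection pick the same element.
by_position = a[2, 5]
by_label = da.sel(drug="paracetamol", subject=5).item()

# Hypothetical posterior samples: 4 chains x 100 draws x 6 teams.
b = rng.normal(size=(4, 100, 6))
db = xr.DataArray(b, dims=("chain", "draw", "team"))

# Reducing by axis number and by dimension name gives the same values.
mean_by_axis = b.mean(axis=(0, 1))
mean_by_name = db.mean(dim=("chain", "draw"))
```

The xarray calls stay correct if the dimensions are reordered, whereas the numpy versions depend on remembering which axis is which.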

In some other cases, however, using xarray can result in overly verbose code that often also becomes less clear. xarray-einstats provides wrappers around some numpy and scipy functions (mostly numpy.linalg and scipy.stats) and around einops, with an API and features adapted to xarray.

⚠️ Attention: A nicer rendering of the content below is available in our documentation

Data for examples

The examples in this overview page use the DataArrays from the Dataset below (stored as ds variable) to illustrate xarray-einstats features:


   
    
Dimensions:  (dim_plot: 50, chain: 4, draw: 500, team: 6)
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
  * team     (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_plot
Data variables:
    x_plot   (dim_plot) float64 0.0 0.2041 0.4082 0.6122 ... 9.592 9.796 10.0
    atts     (chain, draw, team) float64 0.1063 -0.01913 ... -0.2911 0.2029
    sd_att   (draw) float64 0.272 0.2685 0.2593 0.2612 ... 0.4112 0.2117 0.3401
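
The actual values of this Dataset are not reproduced here. To experiment with the snippets that follow, a stand-in with the same dimensions and variable names can be built from random numbers. This is a sketch: the random distributions and the placeholder team labels are assumptions, not the data used to produce the outputs shown on this page.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

ds = xr.Dataset(
    data_vars={
        # 50 evenly spaced points between 0 and 10, as in the repr above.
        "x_plot": ("dim_plot", np.linspace(0, 10, 50)),
        # Random stand-ins for the posterior samples (assumed distributions).
        "atts": (("chain", "draw", "team"), rng.normal(0, 0.3, size=(4, 500, 6))),
        "sd_att": ("draw", rng.uniform(0.2, 0.45, size=500)),
    },
    coords={
        "chain": np.arange(4),
        "draw": np.arange(500),
        # Placeholder labels; the real dataset uses team names.
        "team": [f"team_{i}" for i in range(6)],
    },
)
```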

Stats

xarray-einstats provides two wrapper classes, xarray_einstats.stats.XrContinuousRV and xarray_einstats.stats.XrDiscreteRV, that can be used to wrap any distribution in scipy.stats so that it accepts xarray.DataArray objects as inputs.

We can evaluate the logpdf using inputs that wouldn't align if using numpy in a couple of lines:

import scipy.stats
from xarray_einstats.stats import XrContinuousRV

norm_dist = XrContinuousRV(scipy.stats.norm)
norm_dist.logpdf(ds["x_plot"], ds["atts"], ds["sd_att"])

which returns:


   
    
array([[[[ 3.06470249e-01,  3.80373065e-01,  2.56575936e-01,
...
          -4.41658154e+02, -4.57599982e+02, -4.14709280e+02]]]])
Coordinates:
  * chain    (chain) int64 0 1 2 3
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
  * team     (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: dim_plot
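
To see why these inputs would not align in numpy, the sketch below uses random stand-ins with the same shapes as x_plot, atts and sd_att and writes out the normal log density by hand with plain xarray arithmetic. It only illustrates broadcasting by dimension name; it is not how xarray-einstats implements logpdf, and the random inputs are assumptions.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
x = xr.DataArray(np.linspace(0, 10, 50), dims="dim_plot")
mu = xr.DataArray(rng.normal(0, 0.3, size=(4, 500, 6)), dims=("chain", "draw", "team"))
sd = xr.DataArray(rng.uniform(0.2, 0.45, size=500), dims="draw")

# The raw arrays have shapes (50,), (4, 500, 6) and (500,):
# numpy cannot broadcast them against each other.
try:
    x.values - mu.values
    numpy_broadcasts = True
except ValueError:
    numpy_broadcasts = False

# xarray broadcasts by dimension name, so the same expression works.
z = (x - mu) / sd
logpdf = -0.5 * z**2 - np.log(sd) - 0.5 * np.log(2 * np.pi)
```

The result has dimensions (dim_plot, chain, draw, team), matching the four-dimensional output shown above.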

einops

Only rearrange is wrapped for now.

einops uses a convenient notation inspired by Einstein notation to specify operations on multidimensional arrays. It uses spaces as delimiters between dimensions, parentheses to indicate splitting or stacking of dimensions, and -> to separate the input and output dimension specifications. xarray-einstats uses an adapted notation, translates it to einops and calls xarray.apply_ufunc under the hood.

Why change the notation? There are three main reasons, each concerning one of the elements respectively: ->, space as delimiter and parentheses:

1. In xarray, dimensions are already labeled. In many cases, the left side of the einops pattern is only used to label the dimensions. In fact, 5 of the 7 examples in https://einops.rocks/api/rearrange/ fall into this category. This is not necessary when working with xarray objects.
2. In xarray, dimension names can be any hashable. xarray-einstats only supports strings as dimension names, but those strings can themselves contain spaces, so the space can't be used as a delimiter.
3. In xarray, dimensions are labeled and their order doesn't matter. This might seem the same as the first reason, but it is not. When splitting or stacking dimensions you need (and want) the names of both parent and children dimensions. In some cases, for example stacking, we can autogenerate a default name, but in general you'll want to give a name to the new dimension. After all, dimension order in xarray doesn't matter and there isn't much to be done without knowing the dimension names.

xarray-einstats uses two separate arguments, one for the input pattern (optional) and another for the output pattern. Each is a list of dimensions (strings) or dimension operations (lists or dictionaries). Some examples:

We can combine the chain and draw dimensions and name the resulting dimension sample using a list with a single dictionary. The team dimension is not present in the pattern and is not modified.

from xarray_einstats.einops import rearrange

rearrange(ds.atts, [{"sample": ("chain", "draw")}])

Out:


   
    
array([[ 0.10632395,  0.1538294 ,  0.17806237, ...,  0.16744257,
         0.14927569,  0.21803568],
         ...,
       [ 0.30447644,  0.22650416,  0.25523419, ...,  0.28405435,
         0.29232681,  0.20286656]])
Coordinates:
  * team     (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: sample
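
For comparison, plain xarray can combine dimensions with DataArray.stack. The sketch below uses random data with a smaller, assumed shape. Unlike the rearrange output above, stack attaches a MultiIndex coordinate to the new dimension.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Assumed small stand-in for ds.atts: 4 chains x 5 draws x 6 teams.
da = xr.DataArray(
    rng.normal(size=(4, 5, 6)),
    dims=("chain", "draw", "team"),
    coords={"chain": np.arange(4), "draw": np.arange(5)},
)

# Combine chain and draw into one dimension named "sample".
stacked = da.stack(sample=("chain", "draw"))

# The new dimension is placed last, and samples are ordered chain by chain.
first_team = stacked.isel(team=0).values
```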

Note that, following xarray convention, new dimensions and dimensions on which we operated are moved to the end. This only matters when you access the underlying array with .values or .data, and you can always transpose using xarray.Dataset.transpose, but it can matter. You can change the pattern to enforce the output dimension order:

rearrange(ds.atts, [{"sample": ("chain", "draw")}, "team"])

Out:


   
    
array([[ 0.10632395, -0.01912607,  0.13671159, -0.06754783, -0.46083807,
         0.30447644],
       ...,
       [ 0.21803568, -0.11394285,  0.09447937, -0.11032643, -0.29111234,
         0.20286656]])
Coordinates:
  * team     (team) object 'Wales' 'France' 'Ireland' ... 'Italy' 'England'
Dimensions without coordinates: sample
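
The reordering can also be done after the fact with transpose. The following sketch, on random data with an assumed shape, shows that dimension order changes the layout of .values but not the result of operations that go through dimension names.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Assumed stand-in for a rearranged array: 6 teams x 20 samples.
da = xr.DataArray(rng.normal(size=(6, 20)), dims=("team", "sample"))
reordered = da.transpose("sample", "team")

# The raw arrays differ in layout...
raw_shapes = (da.values.shape, reordered.values.shape)

# ...but arithmetic aligns by name, so the two objects hold the same data.
max_abs_diff = float(abs(da - reordered).max())
```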

Now to a more complicated pattern. We will split the chain and team dimensions, then combine the resulting split dimensions with each other.

rearrange(
    ds.atts,
    # combine split chain and team dims between them
    # here we don't use a dict so the new dimensions get a default name
    out_dims=[("chain1", "team1"), ("team2", "chain2")],
    # use dicts to specify which dimensions to split, here we *need* to use a dict
    in_dims=[{"chain": ("chain1", "chain2")}, {"team": ("team1", "team2")}],
    # set the lengths of split dimensions as kwargs
    chain1=2, chain2=2, team1=2, team2=3
)

Out:


   
    
array([[[ 1.06323952e-01,  2.47005252e-01, -1.91260714e-02,
         -2.55769582e-02,  1.36711590e-01,  1.23165119e-01],
...
        [-2.76616968e-02, -1.10326428e-01, -3.99582340e-01,
         -2.91112341e-01,  1.90714405e-01,  2.02866563e-01]]])
Coordinates:
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 492 493 494 495 496 497 498 499
Dimensions without coordinates: chain1,team1, team2,chain2
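
To unpack what this pattern does to the underlying array, here is a sketch of the same split, reorder and merge in plain numpy. It is an illustration of the operation, not the library's implementation; the number of draws is an assumed small value and the data is a simple range so that elements are easy to track.

```python
import numpy as np

n_draw = 5  # assumed small number of draws, for illustration only
# Stand-in array with dimensions (chain=4, draw, team=6).
a = np.arange(4 * n_draw * 6, dtype=float).reshape(4, n_draw, 6)

# Split chain -> (chain1=2, chain2=2) and team -> (team1=2, team2=3).
split = a.reshape(2, 2, n_draw, 2, 3)  # (chain1, chain2, draw, team1, team2)

# Untouched dimension first, then the members of each output group in order.
moved = split.transpose(2, 0, 3, 4, 1)  # (draw, chain1, team1, team2, chain2)

# Merge each group into a single axis.
out = moved.reshape(n_draw, 2 * 2, 3 * 2)  # (draw, "chain1,team1", "team2,chain2")
```

With this layout, the first two entries along the last axis come from the same team but from chains 0 and 1, which is consistent with the first row of the output above.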

More einops examples are available in the einops section of the documentation.

Linear Algebra

Still missing in the package

There is no one-size-fits-all solution, but knowing the function we are wrapping we can easily make the code more concise and clear. Without xarray-einstats, to invert a batch of matrices stored in a 4d array you have to do:

inv = xarray.apply_ufunc(   # output is a 4d labeled array
    numpy.linalg.inv,
    batch_of_matrices,      # input is a 4d labeled array
    input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
    output_core_dims=[["matrix_dim", "matrix_dim_bis"]]
)

To calculate their norm instead, it becomes:

norm = xarray.apply_ufunc(  # output is a 2d labeled array
    numpy.linalg.norm,
    batch_of_matrices,      # input is a 4d labeled array
    input_core_dims=[["matrix_dim", "matrix_dim_bis"]],
    # core dims are moved to the end; reduce over them only
    kwargs={"axis": (-2, -1)},
)

With xarray-einstats, those operations become:

inv = xarray_einstats.inv(batch_of_matrices, dim=("matrix_dim", "matrix_dim_bis"))
norm = xarray_einstats.norm(batch_of_matrices, dim=("matrix_dim", "matrix_dim_bis"))
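
The apply_ufunc versions can be tried on a synthetic batch. The sketch below assumes a hypothetical 3 x 2 batch of 4 x 4 matrices; the batch dimension names and the values are made up. Note that numpy.linalg.norm needs the axis keyword to reduce only over the two matrix dimensions, which apply_ufunc moves to the end.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)

# Random matrices shifted by a multiple of the identity to keep them
# comfortably invertible (an assumption of this example).
values = rng.normal(size=(3, 2, 4, 4)) + 10 * np.eye(4)
batch_of_matrices = xr.DataArray(
    values, dims=("batch", "experiment", "matrix_dim", "matrix_dim_bis")
)

matrix_dims = ["matrix_dim", "matrix_dim_bis"]

# Batched inverse: the two core dims are consumed and produced again.
inv = xr.apply_ufunc(
    np.linalg.inv,
    batch_of_matrices,
    input_core_dims=[matrix_dims],
    output_core_dims=[matrix_dims],
)

# Batched Frobenius norm: the two core dims are reduced away.
norm = xr.apply_ufunc(
    np.linalg.norm,
    batch_of_matrices,
    input_core_dims=[matrix_dims],
    kwargs={"axis": (-2, -1)},
)
```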

Similar projects

Here we list some similar projects we know of. Note that all of them are complementary and don't overlap:

Comments

distutils.errors.DistutilsOptionError: No configuration found for dynamic 'description'.

Build fails on FreeBSD:

/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py:102: _ExperimentalProjectMetadata: Support for project metadata in `pyproject.toml` is still experimental and may be removed (or change) in future releases.
  warnings.warn(msg, _ExperimentalProjectMetadata)
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "setup.py", line 1, in <module>
    import setuptools; setuptools.setup()
  File "/usr/local/lib/python3.8/site-packages/setuptools/__init__.py", line 87, in setup
    return distutils.core.setup(**attrs)
  File "/usr/local/lib/python3.8/site-packages/setuptools/_distutils/core.py", line 122, in setup
    dist.parse_config_files()
  File "/usr/local/lib/python3.8/site-packages/setuptools/dist.py", line 854, in parse_config_files
    pyprojecttoml.apply_configuration(self, filename, ignore_option_errors)
  File "/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py", line 54, in apply_configuration
    config = read_configuration(filepath, True, ignore_option_errors, dist)
  File "/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py", line 134, in read_configuration
    return expand_configuration(asdict, root_dir, ignore_option_errors, dist)
  File "/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py", line 189, in expand_configuration
    return _ConfigExpander(config, root_dir, ignore_option_errors, dist).expand()
  File "/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py", line 236, in expand
    self._expand_all_dynamic(dist, package_dir)
  File "/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py", line 271, in _expand_all_dynamic
    obtained_dynamic = {
  File "/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py", line 272, in <dictcomp>
    field: self._obtain(dist, field, package_dir)
  File "/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py", line 309, in _obtain
    self._ensure_previously_set(dist, field)
  File "/usr/local/lib/python3.8/site-packages/setuptools/config/pyprojecttoml.py", line 295, in _ensure_previously_set
    raise OptionError(msg)
distutils.errors.DistutilsOptionError: No configuration found for dynamic 'description'.
Some dynamic fields need to be specified via `tool.setuptools.dynamic`
others must be specified via the equivalent attribute in `setup.py`.
*** Error code 1

Version: 0.2.2 Python-3.8 FreeBSD 13.1

opened by yurivict 6

test stats on datasets

If the right subset of dimensions is used, the summary stats already work on datasets. This PR adds tests to make sure this behaviour keeps working. This is very convenient for MCMC output, for example to take the mad over the chain and draw dimensions of all the variables in a dataset at once.

opened by OriolAbril 2
update readme, index and install pages

Update the installation page to add conda, and separate the readme and the index page to have slightly different content (expecting different audiences in each page)

:books: Documentation preview :books:: https://xarray-einstats--33.org.readthedocs.build/en/33/

opened by OriolAbril 1
Use Read the Docs action v1

Read the Docs repository was renamed from readthedocs/readthedocs-preview to readthedocs/actions/. Now, the preview action is under readthedocs/actions/preview and is tagged as v1

:books: Documentation preview :books:: https://xarray-einstats--31.org.readthedocs.build/en/31/

opened by humitos 1
Catch positive definite error

Even though the docstring from https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.multivariate_normal.html says

Symmetric positive (semi)definite covariance matrix of the distribution.

It also accepts matrices whose determinant is close to 0 but negative, generally due to numerical issues. It doesn't expect users to solve those numerical issues themselves before passing the covariance matrix to the multivariate_normal class or methods. This adds a try/except to catch the error related to this issue and checks whether adding an identity matrix scaled by 1e-10 to the provided covariance matrix solves it.
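
The general "retry with a small diagonal shift" pattern can be sketched as follows. This is not the code from the PR: it uses numpy's Cholesky factorization as a stand-in for the scipy check, and cholesky_with_jitter is a hypothetical helper name. The test matrix is exactly singular, a case where a plain Cholesky factorization is expected to fail.

```python
import numpy as np


def cholesky_with_jitter(cov, jitter=1e-10):
    """Hypothetical helper: retry the factorization after adding jitter * I."""
    try:
        return np.linalg.cholesky(cov)
    except np.linalg.LinAlgError:
        # Shift the diagonal slightly and try once more; a second
        # failure is allowed to propagate to the caller.
        return np.linalg.cholesky(cov + jitter * np.eye(cov.shape[-1]))


# Singular 2 x 2 covariance matrix (perfectly correlated variables).
singular_cov = np.array([[1.0, 1.0], [1.0, 1.0]])
chol = cholesky_with_jitter(singular_cov)
```

The recovered factor reproduces the original matrix up to the size of the jitter, so the shift is negligible for most practical purposes.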

cc @symeneses

opened by OriolAbril 1
Check behaviour on groupby objects

I think that most functions in the stats module can be used on xarray groupby objects, like az.hdi as shown here. If that is the case we should document that and add tests to prevent future changes from removing that feature.

opened by OriolAbril 1
try preferred citation to add the doi in generated citation
The "cite this repository" button generated by GitHub currently copies the following text to the clipboard:

@software{Abril_Pla_xarray-einstats,
  author = {Abril Pla, Oriol},
  license = {Apache-2.0},
  title = {{xarray-einstats}},
  url = {https://github.com/arviz-devs/xarray-einstats}
}

which completely ignores the doi even though it is provided as identifier. There is also an option for "preferred citation" that can be used to point to an article instead. This PR tries to use that to generate still a software citation but with the general Zenodo doi for this repository.

Note: Zenodo and releases are a cyclic dependency. The doi is only generated once the release is crafted, so the released code can't include the right doi.

This branch currently generates this:

@software{Abril-Pla_xarray-einstats_2022,
  author = {Abril-Pla, Oriol},
  doi = {10.5281/zenodo.5895451},
  license = {Apache-2.0},
  title = {{xarray-einstats}},
  url = {https://github.com/arviz-devs/xarray-einstats},
  year = {2022}
}

Will think about what should be present and update accordingly. Some info like the publisher is ignored (I guess for software type citations) even if provided inside the preferred-citation section.
opened by OriolAbril 1
Change the monkeypatch thing to using with contexts?
It would be nice to be able to do things like

with matrix_dims(["dim1", "dim3"]):
    chol = xe.linalg.cholesky(da)
    eig = xe.linalg.eig(da)

instead of having to use the monkeypatch trick (currently documented) or needing to pass the dimensions every time.
opened by OriolAbril 0

Tests fail: AttributeError: partially initialized module 'einops' has no attribute '_backends'

collected 140 items / 2 errors                                                                                                                                                               

=========================================================================================== ERRORS ===========================================================================================
_________________________________________________________________ ERROR collecting src/xarray_einstats/tests/test_einops.py __________________________________________________________________
tests/test_einops.py:6: in <module>
    from xarray_einstats.einops import raw_rearrange, raw_reduce, rearrange, reduce, translate_pattern
einops.py:9: in <module>
    import einops
einops.py:407: in <module>
    class DaskBackend(einops._backends.AbstractBackend):  # pylint: disable=protected-access
E   AttributeError: partially initialized module 'einops' has no attribute '_backends' (most likely due to a circular import)
__________________________________________________________________ ERROR collecting src/xarray_einstats/tests/test_numba.py __________________________________________________________________
tests/test_numba.py:6: in <module>
    from xarray_einstats.numba import histogram
numba.py:2: in <module>
    import numba
numba.py:9: in <module>
    @numba.guvectorize(
E   AttributeError: partially initialized module 'numba' has no attribute 'guvectorize' (most likely due to a circular import)
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 2 errors during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
===================================================================================== 2 errors in 1.83s ======================================================================================
*** Error code 2

OS: FreeBSD 13.1

opened by yurivict 3

Add logsumexp wrapper

I think it is the only function in scipy.special worth wrapping, so it might be worth adding a misc module for this and other possible "loose ends" functions to wrap

opened by OriolAbril 0

Releases(v0.4.0)

v0.4.0(Dec 9, 2022)

The changelog for the 0.4.0 release is available at the xarray-einstats documentation.

The highlights of the 0.4.0 release are the addition of the multivariate_normal distribution and the new getting started page in the docs.
v0.3.0(Jun 19, 2022)

The changelog for the 0.3.0 release is available at the xarray-einstats documentation. The main change is the update of the requirements to follow NEP 29
v0.2.2(Apr 2, 2022)

Patch release to include the license and changelog files in the pypi package, now using the PEP 621 metadata in pyproject.toml. Packaging the license is needed to add an xarray-einstats feedstock to conda forge.
v0.2.1(Apr 2, 2022)

Patch release to use a manifest file to include the license and changelog files in the pypi package. Packaging the license is needed to add an xarray-einstats feedstock to conda forge.
v0.2.0(Apr 1, 2022)
The changelog for the 0.2.0 release is available at the xarray-einstats documentation

New Contributors

@aloctavodia made their first contribution in https://github.com/arviz-devs/xarray-einstats/pull/2

v0.1.1(Jan 24, 2022)
Initial release of xarray_einstats.

xarray_einstats extends array manipulation libraries to use with xarray. It starts with 4 modules:

linalg -> extends functionality from numpy.linalg module

stats -> extends functionality from scipy.stats module

einops -> extends einops library, which needs to be installed

numba -> miscellaneous extensions (numpy.histogram for now only) that need numba to accelerate and/or vectorize the functions. numba needs to be installed to use it

v0.1.1 indicates the second try at uploading to pypi

Owner

ArviZ

GitHub Repository https://xarray-einstats.readthedocs.io
