Handle, manipulate, and convert data with units in Python

Overview

unyt


A package for handling numpy arrays with units.

Writing code that deals with data that has units can often be confusing. A function might return an array, but with plain NumPy arrays there is no way to easily tell what the units of the data are without somehow knowing them a priori.

The unyt package (pronounced like "unit") provides a subclass of NumPy's ndarray class that knows about units. For example, one could do:

>>> import unyt as u
>>> distance_traveled = [3.4, 5.8, 7.2] * u.mile
>>> print(distance_traveled.to('km'))
[ 5.4717696  9.3341952 11.5872768] km

And a whole lot more! See the documentation for installation instructions, more examples, and full API reference.

This package only depends on numpy and sympy. Notably, it does not depend on yt, and it is written in pure Python.

Code of Conduct

The unyt package is part of The yt Project. Participating in unyt development therefore happens under the auspices of the yt community code of conduct. If for any reason you feel that the code of conduct has been violated, please send an e-mail to [email protected] with details describing the incident. All emails sent to this address will be treated with the strictest confidence by an individual who does not normally participate in yt development.

License

The unyt package is licensed under the BSD 3-clause license.

Citation

If you make use of unyt in work that leads to a publication, we would appreciate a mention in the text of the paper or in the acknowledgements, along with a citation to our paper in the Journal of Open Source Software. You can use the following BibTeX:

@article{Goldbaum2018,
  doi = {10.21105/joss.00809},
  url = {https://doi.org/10.21105/joss.00809},
  year  = {2018},
  month = {aug},
  publisher = {The Open Journal},
  volume = {3},
  number = {28},
  pages = {809},
  author = {Nathan J. Goldbaum and John A. ZuHone and Matthew J. Turk and Kacper Kowalik and Anna L. Rosen},
  title = {unyt: Handle, manipulate, and convert data with units in Python},
  journal = {Journal of Open Source Software}
}

Or the following citation format:

Goldbaum et al., (2018). unyt: Handle, manipulate, and convert data with units in Python. Journal of Open Source Software, 3(28), 809, https://doi.org/10.21105/joss.00809
Comments
  • daskified unyt arrays

    daskified unyt arrays

    This PR introduces the unyt_dask_array class, which implements a subclass of standard dask arrays with units attached. Still a work in progress, but it is generally usable now!

    Basic usage (also shown here in a notebook) begins by using the unyt_from_dask function to create a new unyt_dask_array instance from a dask array:

    from unyt.dask_array import unyt_from_dask
    from dask import array as dask_array
    x = unyt_from_dask(dask_array.random.random((10000,10000), chunks=(1000,1000)), 'm')
    x
    Out[2]:  unyt_dask_array<random_sample, shape=(10000, 10000), dtype=float64, chunksize=(1000, 1000), chunktype=numpy.ndarray, units=m>
    

    The array can be manipulated as any other dask array:

    result = (x * 2).mean()
    result
    Out[3]: unyt_dask_array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray, units=m>
    result.compute()
    Out[4]:  unyt_quantity(1.00009275, 'm')
    

    If the return is an array, we get a unyt_array instead:

    (x * 2 + x.to('cm')).mean(1).compute()
    Out[8]: unyt_array([1.50646938, 1.48487083, 1.49774744, ..., 1.49939197,
                1.49462512, 1.48263323], 'm')
    

    Unit conversions:

    x = unyt_from_dask(dask_array.random.random((10000,10000), chunks=(1000,1000)), 'lb')
    x.mean().compute()
    Out[9]:
        unyt_quantity(0.50002619, 'lb')
    x.in_mks().mean().compute()
    Out[10]: unyt_quantity(0.22680806, 'kg')
    x.to('mg').mean().compute()
    Out[11]: unyt_quantity(226808.06379903, 'mg')
    from unyt import g
    x.to(g).mean().compute()
    Out[12]: unyt_quantity(226.8080638, 'g')
    

    The implementation relies primarily on decorators and a hidden unyt_dask_array._unyt_array attribute to track unit conversions, and requires very minimal modifications to the existing unyt codebase. If a user is running a dask client, then all of the above calculations will be executed by that client (see the notebook), but the implementation here only needs the dask array subset (i.e., pip install dask[array]).
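    The decorator-plus-hidden-attribute pattern described above can be illustrated with a minimal, self-contained sketch. This is plain Python/NumPy, not the actual unyt_dask_array code; the names `_track_units` and `UnitArray` are invented for illustration, and a bare unit string stands in for the hidden unyt quantity:

    ```python
    import numpy as np

    def _track_units(method):
        """Wrap an array-returning method so the result keeps the caller's units.

        Illustrative only: the real unyt_dask_array tracks units (and unit
        conversions) through a hidden unyt quantity, not a bare string.
        """
        def wrapper(self, *args, **kwargs):
            result = method(self, *args, **kwargs)
            return UnitArray(result, self.units)
        return wrapper

    class UnitArray:
        """A toy array-with-units wrapper standing in for unyt_dask_array."""

        def __init__(self, data, units):
            self.data = np.asarray(data)
            self.units = units

        @_track_units
        def mean(self):
            # The wrapped method only deals with the bare array; the
            # decorator re-attaches the units on the way out.
            return self.data.mean()

    x = UnitArray([1.0, 2.0, 3.0], "m")
    m = x.mean()
    print(m.data, m.units)  # 2.0 m
    ```

    The appeal of this design is that the decorated methods never need to know about units at all, which is what keeps the changes to the rest of the codebase minimal.
    
    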

    Some remaining known issues:

    • [x] reductions return standard dask arrays when using external functions (see note below)
    • [x] dask is added to _on_demand_imports but haven't added it to the testing environment yet, so new tests will fail
    • [x] haven't yet done flake8/etc checks
    • [x] no new docs yet (update: added to the usage page)
    • [x] new tests could use a bit more tweaking
    • [x] squash commits? I have a lot... but would be easy to squash. let me know. (update: chose not to squash)

    Note on the issue with dask reductions:

    If you do:

    from unyt.dask_array import unyt_from_dask
    from dask import array as dask_array
    
    x = unyt_from_dask(dask_array.random.random((10000,10000), chunks=(1000,1000)), 'm')
    x.min().compute()
    

    You get a unyt_quantity as expected: unyt_quantity(0.50002407, 'm')

    But if you use the daskified equivalent of np.min(ndarray):

    dask_array.min(x).compute()
    

    You get a plain float: 0.50002407. This isn't much of an issue for simple functions like min, but many more complex functions are not implemented as attributes. Not yet sure what the best approach is here...

    Update (8/24) to the dask reductions: I've played around with many approaches focused on manually wrapping all of the dask reductions, but have decided that the added complexity is not worth it. Instead, I added a user-facing function, unyt.dask_array.reduce_with_units, that accepts a dask function handle, the unyt array, and any args and kwargs for the dask function, and internally wraps the dask function handle to track units.
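    The idea behind that wrapper, stripped of the dask machinery, is simple: apply the external reduction to the bare array, then reattach the units to the result. A hedged sketch in plain Python/NumPy (the function name mirrors, but is not, the actual unyt.dask_array.reduce_with_units; np.min stands in for a dask function handle, and a (value, units) tuple stands in for a unyt_quantity):

    ```python
    import numpy as np

    def reduce_with_units_sketch(func, values, units, *args, **kwargs):
        """Apply an external reduction to the bare array, then reattach units.

        Sketch only: the real reduce_with_units wraps a dask function handle
        and a unyt_dask_array, and returns unit-aware dask objects.
        """
        result = func(np.asarray(values), *args, **kwargs)
        return result, units  # stand-in for a unyt_quantity

    value, unit = reduce_with_units_sketch(np.min, [3.0, 1.0, 2.0], "m")
    print(value, unit)  # 1.0 m
    ```
    
    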

    standalone package?

    One final note: while I've been developing this to be incorporated into unyt, the fact that there are very minimal changes to the rest of the codebase means that this could be a standalone package. Happy to go that route if it seems more appropriate!

    enhancement 
    opened by chrishavlin 24
  • Unyt2.5.0 breaks matplotlib's errorbar function

    Unyt2.5.0 breaks matplotlib's errorbar function

    • unyt version: 2.5.0
    • Python version: 3.7.4
    • Operating System: MacOS Catalina, RHEL 7(?)

    Description

    unyt v2.5.0 is unable to create matplotlib plots that have a unyt_quantity as an axis limit when using the errorbar function, when the scatter is provided in the required 2xN format as a list of two unyt arrays.

    What I Did

    Example script (matplotlib3.1.2 and unyt2.5.0):

    import matplotlib.pyplot as plt
    import unyt
    
    x = unyt.unyt_array([8, 9, 10], "cm")
    y = unyt.unyt_array([8, 9, 10], "kg")
    # It is convenient often to supply the 2XN required array
    # in this format
    y_scatter = [
        unyt.unyt_array([0.1, 0.2, 0.3], "kg"),
        unyt.unyt_array([0.1, 0.2, 0.3], "kg"),
    ]
    
    x_lims = (unyt.unyt_quantity(5, "cm"), unyt.unyt_quantity(12, "cm"))
    y_lims = (unyt.unyt_quantity(5, "kg"), unyt.unyt_quantity(12, "kg"))
    
    plt.errorbar(x, y, yerr=y_scatter)
    plt.xlim(*x_lims)
    plt.ylim(*y_lims)
    
    plt.show()
    

    Output:

    python3 test.py
    Traceback (most recent call last):
      File "/private/tmp/env/lib/python3.7/site-packages/matplotlib/axis.py", line 1550, in convert_units
        ret = self.converter.convert(x, self.units, self)
      File "/private/tmp/env/lib/python3.7/site-packages/unyt/mpl_interface.py", line 105, in convert
        return value.to(*unit)
    AttributeError: 'list' object has no attribute 'to'
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        plt.errorbar(x, y, yerr=y_scatter)
      File "/private/tmp/env/lib/python3.7/site-packages/matplotlib/pyplot.py", line 2554, in errorbar
        **({"data": data} if data is not None else {}), **kwargs)
      File "/private/tmp/env/lib/python3.7/site-packages/matplotlib/__init__.py", line 1599, in inner
        return func(ax, *map(sanitize_sequence, args), **kwargs)
      File "/private/tmp/env/lib/python3.7/site-packages/matplotlib/axes/_axes.py", line 3430, in errorbar
        barcols.append(self.vlines(xo, lo, uo, **eb_lines_style))
      File "/private/tmp/env/lib/python3.7/site-packages/matplotlib/__init__.py", line 1599, in inner
        return func(ax, *map(sanitize_sequence, args), **kwargs)
      File "/private/tmp/env/lib/python3.7/site-packages/matplotlib/axes/_axes.py", line 1176, in vlines
        x = self.convert_xunits(x)
      File "/private/tmp/env/lib/python3.7/site-packages/matplotlib/artist.py", line 180, in convert_xunits
        return ax.xaxis.convert_units(x)
      File "/private/tmp/env/lib/python3.7/site-packages/matplotlib/axis.py", line 1553, in convert_units
        f'units: {x!r}') from e
    matplotlib.units.ConversionError: Failed to convert value(s) to axis units: [unyt_quantity(8, 'cm'), unyt_quantity(9, 'cm'), unyt_quantity(10, 'cm')]
    

    Even wrapping the list in a call to unyt.unyt_array doesn't save the day.

    opened by JBorrow 22
  • ENH: Provisional support for NEP 18 (__array_function__ protocol)

    ENH: Provisional support for NEP 18 (__array_function__ protocol)

    My initial motivation here was to add some unit representation to error messages when comparing two unyt_array instances via functions from numpy.testing, like so:

    import numpy as np
    import unyt as un
    
    a = [1, 2, 3] * un.cm
    b = [1, 2, 3] * un.km
    np.testing.assert_array_equal(a, b)
    

    which yields, on master:

    ...
    AssertionError:
    Arrays are not equal
    
    Mismatched elements: 3 / 3 (100%)
    Max absolute difference: 299997.
    Max relative difference: 0.99999
     x: unyt_array([1, 2, 3])
     y: unyt_array([1, 2, 3])
    

    and on this branch

    previous version
    ...
    AssertionError:
    Arrays are not equal
    
    Mismatched elements: 3 / 3 (100%)
    Max absolute difference: 299997. cm
    Max relative difference: 0.99999 dimensionless
     x: unyt_array([1, 2, 3] cm)
     y: unyt_array([1, 2, 3] km)
    

    edit:

    AssertionError:
    Arrays are not equal
    
    Mismatched elements: 3 / 3 (100%)
    Max absolute difference: 299997., units='cm'
    Max relative difference: 0.99999, units='dimensionless'
     x: unyt_array([1, 2, 3], units='cm')
     y: unyt_array([1, 2, 3], units='km')
    

    Incidentally, it turns out that fixing this necessitated a kick-off implementation of NEP 18, so this work laid the foundation to solve:

    • [x] #69
    • [x] #130
    • [ ] #50 (most likely out of scope)

    More broadly, implementing NEP 18 for unyt was the topic of #139. Granted, I need to take more time to check that I'm not going against the original intentions here. My current approach is that, since covering the whole numpy public API in one go seems like a gigantic task, I'm implementing unyt_array.__array_function__ with a fallthrough condition: if a special case isn't implemented yet, just fall back to the raw numpy implementation (which is currently the behaviour for all functions subject to NEP 18). This way we can add support for more and more functions progressively. I'm going to set the bar low(ish) for now, and try to fix the already-reported cases, as listed above, as a first step.
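    The fallthrough strategy can be sketched with a minimal `__array_function__` implementation: a registry of specialised handlers, and a default branch that strips the subclass and defers to plain numpy. This is an illustration of the NEP 18 mechanism, not unyt's actual code; `UnitArray`, `_HANDLED`, and `implements` are invented names:

    ```python
    import numpy as np

    _HANDLED = {}

    def implements(numpy_func):
        """Register a specialised implementation for one numpy function."""
        def decorator(func):
            _HANDLED[numpy_func] = func
            return func
        return decorator

    class UnitArray(np.ndarray):
        """Toy ndarray subclass carrying a units string."""

        def __new__(cls, data, units="dimensionless"):
            obj = np.asarray(data).view(cls)
            obj.units = units
            return obj

        def __array_finalize__(self, obj):
            self.units = getattr(obj, "units", "dimensionless")

        def __array_function__(self, func, types, args, kwargs):
            if func in _HANDLED:
                return _HANDLED[func](*args, **kwargs)
            # Fallthrough: strip the subclass and defer to raw numpy.
            stripped = tuple(
                np.asarray(a) if isinstance(a, UnitArray) else a for a in args
            )
            return func(*stripped, **kwargs)

    @implements(np.concatenate)
    def _concatenate(arrays, *args, **kwargs):
        # Specialised case: propagate the first array's units.
        units = arrays[0].units
        result = np.concatenate([np.asarray(arr) for arr in arrays], *args, **kwargs)
        return UnitArray(result, units)

    a = UnitArray([1.0, 2.0], "cm")
    b = UnitArray([3.0], "cm")
    print(np.concatenate([a, b]).units)  # 'cm' (handled case)
    print(np.min(a))                     # plain numpy scalar (fallthrough case)
    ```

    The trade-off discussed below is exactly what to do in that fallthrough branch: stay silent, warn, or error out.
    
    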

    An important question to address is: what should be done in the general case where we don't have a custom implementation for an array function?

    1. Transparently default to the raw numpy implementation without a warning (this is de facto what is done as of unyt 2.8.0, and will remain the case until NEP 18 is at least partially implemented).
    2. Same as 1, but emit a warning (possibly with a whitelist of functions that are known to be perfectly fine without a specialised implementation, for which no warning would be emitted), along the lines of:
    UserWarning: calling `numpy.FUNC` on a unyt_array. Results may hold incorrect units. A future version of unyt will remove this warning, and possibly change the behaviour of this function to be dimensionally correct.
    
    3. Error out.

    Option 1 is the current implementation of this PR because I think it is the least disruptive and noisy one. My personal opinion is that it's probably okay to have incomplete support for NEP 18 for a couple of releases, as long as it is clearly stated in the release notes.

    bug 
    opened by neutrinoceros 19
  • ENH: optimize import time

    ENH: optimize import time

    This is an answer to #27. I shave off about 33% of unyt's import time by making copies of Unit objects shallow by default; the one difference from a deep copy is that the attached UnitRegistry is shallow-copied.

    Using the benchmark I described in #27, the import time goes from 1.6 s to 1.0 s on my machine. I hope this doesn't have undesirable side effects. Another aspect worth considering is that the sheer number of copies performed at import time is probably a sign that something else isn't optimized.
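    The mechanism being described, customising how a unit object deep-copies itself so that only the attached registry is (shallowly) copied, can be sketched with toy classes via the `__deepcopy__` protocol. These are stand-ins, not unyt's actual Unit and UnitRegistry:

    ```python
    import copy

    class UnitRegistry:
        """Toy registry holding a potentially large lookup table."""
        def __init__(self, lut=None):
            self.lut = lut if lut is not None else {}

    class Unit:
        """Toy unit object attached to a registry."""
        def __init__(self, expr, registry):
            self.expr = expr
            self.registry = registry

        def __deepcopy__(self, memo):
            # Shallow-copy everything except the registry, which itself gets
            # only a shallow copy: the expensive lookup table is shared
            # instead of being walked on every copy.
            return Unit(self.expr, copy.copy(self.registry))

    reg = UnitRegistry({"m": 1.0})
    u = Unit("m", reg)
    u2 = copy.deepcopy(u)
    print(u2.registry is reg)          # False: the registry object is new
    print(u2.registry.lut is reg.lut)  # True: its table is shared
    ```

    Skipping the recursive walk of the registry's table on every copy is where the import-time saving would come from under this scheme.
    
    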

    opened by neutrinoceros 16
  • Equality test of equivalent quantities, but with different prefixes, returns False.

    Equality test of equivalent quantities, but with different prefixes, returns False.

    • unyt version: 2.4.1
    • Python version: 3.8.1
    • Operating System: Win10

    Description

    The quantities 1 s and 1000 ms are equal, but unyt says they're not equal.

    What I Did

    >>> from unyt import s, ms
    >>> 1*s == 1000*ms
    array(False)
    

    I also find the rather surprising result:

    >>> 1*s >= 999*ms
    array(True)
    >>> 1*s >= 1000*ms
    array(False)
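    A dimensionally correct comparison would convert both operands to a common unit before comparing raw values. The expected behaviour can be sketched with a toy quantity class (illustrative only; the conversion table and class are invented, not unyt's implementation):

    ```python
    class Quantity:
        """Toy quantity comparing values in a common base unit (seconds)."""

        _TO_BASE = {"s": 1.0, "ms": 1e-3}  # toy conversion table

        def __init__(self, value, unit):
            self.value = value
            self.unit = unit

        def _base(self):
            return self.value * self._TO_BASE[self.unit]

        def __eq__(self, other):
            # Compare in a common base unit instead of comparing raw values.
            return self._base() == other._base()

    print(Quantity(1, "s") == Quantity(1000, "ms"))  # True
    print(Quantity(1, "s") == Quantity(999, "ms"))   # False
    ```
    
    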
    opened by l-johnston 15
  • bugfix: fix commutativity in unyt_array operators

    bugfix: fix commutativity in unyt_array operators

    fix https://github.com/yt-project/yt/issues/874

    Here's an updated version of the script provided by @jzuhone at the time, with updated reference outputs.

    
    >>> import yt
    >>> ds = yt.testing.fake_amr_ds()
    >>> a = yt.YTArray([1,2,3], "cm")
    >>> b = ds.arr([1,2,3], "code_length")
    
    >>> a*b
    old > SymbolNotFoundError: The symbol 'code_length' does not exist in this registry.
    new > unyt_array([1, 4, 9], 'cm*code_length')
    
    >>> b*a
    old > unyt_array([1, 4, 9], 'cm*code_length')
    new > unyt_array([1, 4, 9], 'cm*code_length')
    
    >>> (a*b).in_units("code_length**2")
    old > SymbolNotFoundError: The symbol 'code_length' does not exist in this registry.
    new > unyt_array([1., 4., 9.], 'code_length**2')
    
    >>> (b*a).in_units("code_length**2")
    old > unyt_array([1., 4., 9.], 'code_length**2')
    new > unyt_array([1., 4., 9.], 'code_length**2')
    

    For reference, this issue was referenced in https://github.com/yt-project/yt/issues/2797, hence the fix.

    bug 
    opened by neutrinoceros 14
  • MNT: add explicit support for Python 3.10

    MNT: add explicit support for Python 3.10

    Follow-up to #194. This will likely fail CI at first; it may be a little early to expect that support is already provided by unyt's dependencies, so I'll open this as a draft for now and see what happens.

    enhancement 
    opened by neutrinoceros 13
  • ci: Migrate CI to GitHub Actions workflows

    ci: Migrate CI to GitHub Actions workflows

    • Closes PR #187
    • Requires PR #189

    This PR migrates CI from Travis CI and Appveyor to use GitHub Actions workflows. The GHA CI will run across Ubuntu, MacOS, and Windows environments across CPython runtimes spanning 3.6 to 3.9. To reduce the number of runs (especially on slower runners like MacOS) the test matrix only runs on MacOS and Windows for the edge CPython versions: Python 3.6 and Python 3.9. The CI runs on a variety of event triggers:

    • Pushes to the master branch (PR merges trigger "push" events)
    • Pushes to pull requests
    • As a nightly CRON job (useful for being alerted to dependencies breaking APIs)
    • On demand manual triggers

    Travis CI and Appveyor are dropped in this PR and coverage reporting is switched over to use the Codecov GHA (this will require some follow up from the maintainers as you'll want to get an optional CODECOV_TOKEN to greatly speed up reporting).

    opened by matthewfeickert 13
  • fix: Apply Black and update usage docs code

    fix: Apply Black and update usage docs code

    This PR simply gets the CI passing (in local runs of tox and in GitHub Actions) so that PR #187 can proceed smoothly. It applies Black to the code base to take care of the spacing differences that Black v21.4b0+ now enforces, then in the docs adds a missing import of unyt_array and adds a write option to an h5py.File call (perhaps a somewhat recent change in h5py?).

    I don't think that yt squashes and merges commits like I usually do, but in case that happens:

    Suggested squash and merge commit message

    * Apply Black to codebase to revise docstring whitespace
       - Black v21.4b0 release notes: Black now processes one-line docstrings by stripping leading and trailing spaces, and adding a padding space when needed to break up """"
    * Add missing import of unyt.unyt_array to usage docs
    * Add missing write option to h5py.File call in usage docs
    
    opened by matthewfeickert 12
  • TST: migrate from tox-pyenv to tox-gh-actions

    TST: migrate from tox-pyenv to tox-gh-actions

    Since tox-pyenv looks unmaintained (no response from the maintainer in 3 weeks now), let's experiment with a candidate replacement. Because this makes tox 4 usable in CI for the first time, this may require some tweaks. I'd also need to make changes in the dev guide if this works.

    opened by neutrinoceros 10
  • FEAT: implement unyt.unyt_quantity.from_string

    FEAT: implement unyt.unyt_quantity.from_string

    This adds a from_string method to the unyt_quantity class. I originally wrote it in a separate project where I need to parse quantities from configuration (text) files, then realized it would be useful to have it as part of the library.

    I consider this a draft for now. The implementation works as intended in every case I could think of (valid as well as invalid ones), but I would like to add docstrings (with doctests) to the actual function.
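    Parsing a quantity from text boils down to splitting a leading number from a trailing unit expression. A hedged sketch of the idea (the regex and the `parse_quantity` name are invented for illustration; this is not the actual unyt_quantity.from_string implementation):

    ```python
    import re

    # Leading float (with optional sign and exponent), then an optional unit.
    _QUANTITY_RE = re.compile(
        r"^\s*([+-]?\d+\.?\d*(?:[eE][+-]?\d+)?)\s*(\S*)\s*$"
    )

    def parse_quantity(text):
        """Split '1.5 km' into (1.5, 'km'); raise ValueError otherwise."""
        match = _QUANTITY_RE.match(text)
        if match is None:
            raise ValueError(f"Could not parse {text!r} as a quantity")
        value, unit = match.groups()
        return float(value), unit or "dimensionless"

    print(parse_quantity("1.5 km"))  # (1.5, 'km')
    print(parse_quantity("1e-3"))    # (0.001, 'dimensionless')
    ```

    A real implementation would additionally validate the unit expression against a registry, which is where most of the tricky invalid cases live.
    
    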

    opened by neutrinoceros 10
  • MNT: out of date copyright headers

    MNT: out of date copyright headers

    A bunch of files have a copyright header. Most of them use # Copyright (c) 2018, yt Development Team., but some are on # Copyright (c) 2019 ... and even one on # Copyright (c) 2013 .... The LICENSE file itself uses a different header, Copyright (c) 2018, Nathan Goldbaum. It would be easy to uniformise those and keep them up to date with a pre-commit hook such as insert-license from https://github.com/Lucas-C/pre-commit-hooks (I've been using it for a couple of years on another project and never had any issues with it). I'm happy to do it; I would just like to know if that's desired. If not, should we simply take these headers out?

    opened by neutrinoceros 1
  • Refining exceptions

    Refining exceptions

    To keep track of this important comment from @ngoldbaum:

    I'm not really a fan of UnytError but I also don't think that should block getting the __array_function__ stuff working. I wish this was raising UnitOperationError, or we somehow made UnitOperationError more general since I would guess that's the most common type of exception people would be catching for this sort of thing and it irks me a bit that they'd need to catch more than one kind of exception for different corner cases.

    We probably need to more carefully look at how exceptions work in unyt in general since right now the situation is kind of a hodgepodge, although that might need a deprecation cycle since we'd be doing an API break.

    For now I'm just going to merge this, but I'd like to have a discussion about how to handle exceptions, whether we need to do some sort of deprecation cycle, and how we can make it simpler to deal with exceptions raised by unyt before we do the final release.

    Originally posted by @ngoldbaum in https://github.com/yt-project/unyt/issues/338#issuecomment-1369188611

    opened by neutrinoceros 0
  • Additional metallicity mass fraction conversions

    Additional metallicity mass fraction conversions

    This PR introduces several other common values for the solar metallicity found in the literature as new metallicity units, e.g. "Zsun_angr", etc.

    The default mass fraction in "Zsun" is still the one from Cloudy and has not been touched.

    Explanatory documentation has been added.

    opened by jzuhone 4
  • Type checking unyt ?

    Type checking unyt ?

    This is mostly a question for @ngoldbaum and @jzuhone: how would you guys feel about progressively adding type hints and a type-checking stage to CI? To be clear, I'm thinking about doing it at least partially myself, because numpy is almost 100% "typed" now and IMO it would make sense to follow their lead. This is a long-term goal, as this could be quite an undertaking (though maybe not!), so I wanted to get your sentiment on it first.

    opened by neutrinoceros 4
Releases: v2.9.3

Owner: The yt project (a toolkit for analysis and visualization of volumetric data)