An HDF5-based Python pickle replacement

Overview

Hickle

Hickle is an HDF5-based clone of pickle, with a twist: instead of serializing to a pickle file, Hickle dumps to an HDF5 file (Hierarchical Data Format). It is designed to be a "drop-in" replacement for pickle (for common data objects), but is really an amalgam of h5py and dill/pickle with extended functionality.

That is: hickle is a neat little way of dumping python variables to HDF5 files that can be read in most programming languages, not just Python. Hickle is fast, and allows for transparent compression of your data (LZF / GZIP).
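
Because a hickle file is just an HDF5 file, it can be opened by any HDF5 reader. As a minimal sketch in Python itself, using h5py directly (the internal group and dataset names depend on the hickle version, so explore the layout with visit() rather than relying on any particular path):

import h5py

with h5py.File('test.hkl', 'r') as f:
    f.visit(print)  # print the names of all groups/datasets hickle created
    # then read a dataset back as a plain array, e.g. data = f['data'][()]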

Why use Hickle?

While hickle is designed to be a drop-in replacement for pickle (or something like json), it works very differently. Instead of serializing / json-izing your data, it stores it using the excellent h5py module.

The main reasons to use hickle are:

  1. It's faster than pickle and cPickle.
  2. It stores data in HDF5.
  3. You can easily compress your data.

The main reasons not to use hickle are:

  1. You don't want to store your data in HDF5. While hickle can serialize arbitrary python objects, this functionality is provided only for convenience, and you're probably better off just using the pickle module.
  2. You want to convert your data to human-readable JSON/YAML, in which case you should do that instead.

So, if you want your data in HDF5, or if your pickling is taking too long, give hickle a try. Hickle is particularly good at storing large numpy arrays, thanks to h5py running under the hood.

Documentation

Documentation for hickle can be found at telegraphic.github.io/hickle/.

Usage example

Hickle is nice and easy to use, and should look very familiar to those of you who have pickled before.

In short, hickle provides two methods: a hickle.load method, for loading hickle files, and a hickle.dump method, for dumping data into HDF5. Here's a complete example:

import os
import hickle as hkl
import numpy as np

# Create a numpy array of data
array_obj = np.ones(32768, dtype='float32')

# Dump to file
hkl.dump(array_obj, 'test.hkl', mode='w')

# Dump data, with compression
hkl.dump(array_obj, 'test_gzip.hkl', mode='w', compression='gzip')

# Compare filesizes
print('uncompressed: %i bytes' % os.path.getsize('test.hkl'))
print('compressed:   %i bytes' % os.path.getsize('test_gzip.hkl'))

# Load data
array_hkl = hkl.load('test_gzip.hkl')

# Check the two arrays are identical
assert array_hkl.dtype == array_obj.dtype
assert np.allclose(array_hkl, array_obj)
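
Common Python containers (dicts, lists, tuples) can be dumped the same way. A short sketch continuing the example above (the keys and values here are illustrative):

# Dump and reload a dict of mixed types
data_dict = {'name': 'test', 'values': np.arange(10), 'flag': True}
hkl.dump(data_dict, 'test_dict.hkl', mode='w')
data_loaded = hkl.load('test_dict.hkl')
assert np.all(data_loaded['values'] == data_dict['values'])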

HDF5 compression options

A major benefit of hickle over pickle is that it lets you use fancy HDF5 features, by passing keyword arguments on to h5py. So, you can do things like:

# Note: the chunk shape must match the rank of the array (array_obj is 1-D
# here), and scaleoffset for float data must be a positive integer
hkl.dump(array_obj, 'test_lzf.hkl', mode='w', compression='lzf',
         chunks=(4096,), shuffle=True, fletcher32=True)

A detailed explanation of these keywords is given at http://docs.h5py.org/en/latest/high/dataset.html, but we give a quick rundown below.

In HDF5, datasets are split into chunks that are indexed by a B-tree, a tree data structure that has speed benefits over a single contiguous block of data. Chunking is what allows datasets to be resized and compressed via filter pipelines. Filters such as shuffle and scaleoffset rearrange your data to improve compression ratios, and fletcher32 computes a checksum to detect corruption. These file-level options are abstracted away from the data model.
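
As a rough rule of thumb, choose a chunk shape that matches how the data will be read back, since a chunk is the unit of compression and I/O. A hedged sketch with illustrative values (all keywords below are standard h5py dataset-creation keywords, which hickle passes straight through):

import numpy as np
import hickle as hkl

big_array = np.random.random((4000, 4000))
hkl.dump(big_array, 'big_gzip.hkl', mode='w',
         compression='gzip',   # DEFLATE filter
         compression_opts=4,   # gzip level, 0-9
         chunks=(500, 4000),   # compress and read in row blocks
         shuffle=True)         # byte-shuffle to improve compression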

Recent changes

  • June 2020: Major refactor to version 4, and removal of support for Python 2.
  • December 2018: Accepted to Journal of Open-Source Software (JOSS).
  • June 2018: Major refactor and support for Python 3.
  • August 2016: Added support for scipy sparse matrices bsr_matrix, csr_matrix and csc_matrix.

Performance comparison

Hickle runs a lot faster than pickle with its default settings, and a little faster than pickle with protocol=2 set. (Note that the session below dates from Python 2; under Python 3, pickle files must be opened in binary mode, e.g. open('foo.pkl', 'wb').)

In [1]: import numpy as np

In [2]: x = np.random.random((2000, 2000))

In [3]: import pickle

In [4]: f = open('foo.pkl', 'w')

In [5]: %time pickle.dump(x, f)  # slow by default
CPU times: user 2 s, sys: 274 ms, total: 2.27 s
Wall time: 2.74 s

In [6]: f = open('foo.pkl', 'w')

In [7]: %time pickle.dump(x, f, protocol=2)  # actually very fast
CPU times: user 18.8 ms, sys: 36 ms, total: 54.8 ms
Wall time: 55.6 ms

In [8]: import hickle

In [9]: f = open('foo.hkl', 'w')

In [10]: %time hickle.dump(x, f)  # a bit faster
dumping <type 'numpy.ndarray'> to file <HDF5 file "foo.hkl" (mode r+)>
CPU times: user 764 us, sys: 35.6 ms, total: 36.4 ms
Wall time: 36.2 ms

So if you do continue to use pickle, add the protocol=2 keyword (thanks @mrocklin for pointing this out).

For storing Python dictionaries of lists, hickle beats the Python json encoder, but is slower than ujson. For a dictionary with 64 entries, each containing a 4096-length list of random numbers, the times are:

json took 2633.263 ms
uJson took 138.482 ms
hickle took 232.181 ms
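
For reference, a sketch of roughly how such a benchmark can be reproduced (timings will vary with machine and library versions; the filenames are illustrative):

import json, time
import numpy as np
import hickle as hkl

data = {str(i): np.random.random(4096).tolist() for i in range(64)}

t0 = time.time()
with open('test.json', 'w') as f:
    json.dump(data, f)
print('json took %.3f ms' % ((time.time() - t0) * 1000))

t0 = time.time()
hkl.dump(data, 'test.hkl', mode='w')
print('hickle took %.3f ms' % ((time.time() - t0) * 1000))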

It should be noted that these comparisons are of course not fair: storing in HDF5 will not help you convert something into JSON, nor will it help you serialize a string. But for quick storage of the contents of a python variable, it's a pretty good option.

Installation guidelines

Easy method

Install with pip by running pip install hickle from the command line.

Manual install

  1. You should have Python 3.5 or above installed

  2. Install h5py (Official page: http://docs.h5py.org/en/latest/build.html)

  3. Install hdf5 (Official page: http://www.hdfgroup.org/ftp/HDF5/current/src/unpacked/release_docs/INSTALL)

  4. Download hickle: via terminal (git clone https://github.com/telegraphic/hickle.git), or via manual download (go to https://github.com/telegraphic/hickle, where on the right-hand side you will find the Download ZIP option)

  5. cd to your downloaded hickle directory

  6. Then run the following command in the hickle directory: python setup.py install

Testing

Once installed from source, run python setup.py test to check it's all working.

Bugs & contributing

Contributions and bugfixes are very welcome. Please check out our contribution guidelines for more details on how to contribute to development.

Referencing hickle

If you use hickle in academic research, we would be grateful if you could reference our paper in the Journal of Open-Source Software (JOSS).

Price et al., (2018). Hickle: A HDF5-based python pickle replacement. Journal of Open Source Software, 3(32), 1115, https://doi.org/10.21105/joss.01115
Comments
  • Hickle 5 rc

    Ok, here we go: as agreed in the discussion of PR #138, here is the fully assembled hickle-5-RC branch. In case you prefer to review directly within my repo, feel free to close this PR again; I'm fine with whatever you prefer. I'm not sure whether all the commits reported below are really visible from within this branch, or whether the log below is an excerpt of the git reflog.

    I also expect AppVeyor and Travis to complain a bit due to the tox-related changes; I would be surprised if things worked on the first try. Anyway, fingers crossed.

    The reason for the Python 3.5 failure is known (I can't test it here any more, lacking a Python 3.5 installation). The fix is easy, just one change in one line, which I will make if Python 3.5 is to remain supported beyond hickle 4.

    With the astropy warnings I would need your help, as we do not use astropy here, so I have no clue how to fix them.

    opened by hernot 54
  • support for python copy protocol __setstate__ __getstate__ if present in object

    [Suggestion]

    I have several complex classes which, in order to be pickled by different pickle replacements like jsonpickle and others, implement __getstate__ and __setstate__ methods. Besides being copyable for free using copy.copy and copy.deepcopy, pickling is quite straightforward.

    import numpy as np

    class with_state():
        def __init__(self):
            self.a = 12
            self.b = {'love': np.ones([12, 7]), 'hatred': np.zeros([4, 9])}

        def __getstate__(self):
            # return a plain dict capturing the object's state
            return dict(a=self.a, b=self.b)

        def __setstate__(self, state):
            # restore attributes from the state dict
            self.a = state['a']
            self.b = state['b']

        def __getitem__(self, index):
            if index == 0:
                return self.a
            if index < 2:
                return self.b['hatred']
            if index > 2:
                raise ValueError("index unknown")
            return self.b['love']
    
    

    The above example is very simplified, removing anything unnecessary. Currently such classes are hickled with a warning that the object is not understood, because a test for whether __setstate__/__getstate__ are implemented is missing. Admittedly both are handled by the pickle fallback, but the class ends up as a pickle string instead of a dataset, making it quite tedious to extract from the HDF5 file on the non-Python end, in C# or other languages.

    Therefore I suggest adding a test for both methods being defined, and storing the class as its state dictionary instead of a pickled string. This would need some flag or other means of indicating that the dict represents the result of <class>.__getstate__ and not a plain Python dictionary. The test should run after the test for numpy data and before the test for Python iterables, as the above class appears to be iterable but isn't.
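
    [For illustration: a minimal sketch of what the copy protocol already provides, using the class above. state is a plain dict that a loader could map to an HDF5 group rather than a pickle string.]

    obj = with_state()
    state = obj.__getstate__()                 # plain dict: {'a': 12, 'b': {...}}
    restored = with_state.__new__(with_state)  # allocate without calling __init__
    restored.__setstate__(state)               # rebuild from the state dict
    assert restored.a == obj.a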

    ADDENDUM: If somebody guides me through it, I would attempt to add the appropriate test and conversion function. But I would need at least some guidance on which existing methods would be the best templates and inspiration, and which parts and sections of the h5py manual, the HDF5 spec, and the hickle contributor documentation to read carefully.

    bug 
    opened by hernot 37
  • Implementation of Container and mixed loaders (H4EP001)

    At first:

    @1313e, with this pull request I want to express how much I appreciate the really great work you did for hickle 4.0.0, implementing the first step towards dedicated loaders.

    Second: the reason why I'm pushing so hard for the implementation of H4EP001.

    The research conducted by the research group I'm establishing and leading is split into two tracks: a methodological one, dealing with improvement and development of new algorithms and methods for clinical procedures in diagnostics and treatment, and a clinical one, concerned with clinical research utilizing the tools based upon the methods and algorithms provided by the first track.

    In the first track, python, numpy, scipy etc. are the primary tools for working on the algorithms and investigating new procedures and algorithmic approaches. The work in the second track is primarily conducted by clinicians, so the tools provided for their research and studies have to be thoroughly tested and validated. This validation, at least the part that can be automated through unit tests, utilizes test data, including intermediate data and results obtained from the python programs and scripts developed alongside the underlying algorithms.

    As the clinical tools are implemented in compiled languages which support true multi-threading, the data passed on has to be stored in a file format readable outside python, ruling out pickle strings. Therefore jsonpickle was used to dump the data. Meanwhile the amount of data has grown so large that JSON files, even compressed with zip, gzip or other compression schemes, are not feasible any more. NPY and NPZ files, the next choice, mandate a dependency upon the numpy library. Just for conducting unit tests, a self-contained file format, for which only the corresponding library and nothing else has to be included, would be the better choice.

    And this is the point where the HDF5 libraries and hickle come into play. I consider them the best and most suitable option I have found so far. The current limitation, that objects without a dedicated loader are stored as pickle strings, can be solved by supporting the Python copy protocol, which I hereby offer to contribute to hickle.

    Third: the content of this pull request.

    Implementation of container-based and mixed loaders as proposed by hickle extension proposal H4EP001 (#135). For details see the commit message and the proposal.

    Finally, I recommend:

    Not putting this into an official release yet. Some extended tests, using a real dataset compiled for testing and validating software tools and components developed for use in the clinical track, showed that an important part is still missing to keep file sizes at a reasonable level. In particular, type strings and pickle strings for class and function objects currently take up most of the file space, letting dumped files quickly grow into the GB range even with HDF5 file compression activated, where the plain pickle stream requires just 400 MB of space. Therefore I recommend implementing memoization (H4EP002, #139) first, before considering the resulting code base ready for release.

    PS: loading hickle 4.0.0 files should still be possible out of the box. Due to the lack of an appropriate test file, no test is included to verify this.

    opened by hernot 34
  • Hickle subclasses

    This PR adds support for hickling objects that are instances of subclasses of classes that hickle supports. So, for example, anything that subclasses a dict can now be hickled properly as well. This PR also includes the changes made in #109. Finally, I have removed all instances where the type of a hickled dataset was saved as a list instead of just a normal string.

    As it is now required to save a bit more data in the HDF5 file, previously made hickle files are not supported.

    opened by 1313e 29
  • Several improvements concerning dicts, passing open HDF5-files and more

    This PR contains the following changes:

    • Improved and simplified the imports;
    • Made sure that all required future imports are done in hickle.py;
    • Merged many checks for Python2/Python3 into the same check, decreasing the need to duplicate certain statements;
    • Empty dicts can now be properly hickled and unhickled ( #91 );
    • Dicts using tuple keys can now be properly hickled and unhickled ( #91 );
    • Dicts using both integers/floats and integer/float strings as dict keys (e.g., 1 and '1') can now be properly hickled and unhickled;
    • Passing an open h5py.File object to hickle.dump and hickle.load will no longer automatically close the file ( #92 );
    • If an Exception is raised in hickle.dump or hickle.load, an opened HDF5-file is always closed, unless it was provided as open to these functions (that is the user's task obviously) ( #90 );
    • The working directory for doing tests is now automatically set to the system's temporary directory;
    • Removed all os.remove uses in the tests, as pytest or your own machine automatically performs clean-ups in the temporary directory;
    • Fixed a problem with the dtype of a dict not always being properly saved ( #91 ).

    I have tried to remain as true as possible to the original coding style (except that I use parentheses for return, since I hate using it as a statement). All changes should be backwards compatible, except for hickled dicts that had their dtype saved incorrectly.

    opened by 1313e 18
  • Huge dict() object loads failed

    Hi,

    I have a huge dict() to save, about 430 MB, which I saved with hickle.

    Loading it fails:

    File "D:\Users\Cidge\Anaconda3\envs\research_env_final\lib\site-packages\hickle\hickle.py", line 531, in load
        py_container = _load(py_container, h_root_group)
    File "D:\Users\Cidge\Anaconda3\envs\research_env_final\lib\site-packages\hickle\hickle.py", line 601, in _load
        py_subcontainer = _load(py_subcontainer, h_node)
    File "D:\Users\Cidge\Anaconda3\envs\research_env_final\lib\site-packages\hickle\hickle.py", line 608, in _load
        subdata = load_dataset(h_group)
    File "D:\Users\Cidge\Anaconda3\envs\research_env_final\lib\site-packages\hickle\hickle.py", line 561, in load_dataset
        return load_fn(h_node)
    File "D:\Users\Cidge\Anaconda3\envs\research_env_final\lib\site-packages\hickle\loaders\load_python3.py", line 157, in load_pickled_data
        return pickle.loads(data[0])
    File "D:\Users\Cidge\Anaconda3\envs\research_env_final\lib\site-packages\dill\_dill.py", line 275, in loads
        return load(file, ignore, **kwds)
    File "D:\Users\Cidge\Anaconda3\envs\research_env_final\lib\site-packages\dill\_dill.py", line 270, in load
        return Unpickler(file, ignore=ignore, **kwds).load()
    File "D:\Users\Cidge\Anaconda3\envs\research_env_final\lib\site-packages\dill\_dill.py", line 472, in load
        obj = StockUnpickler.load(self)
    _pickle.UnpicklingError: pickle data was truncated

    Does hickle have any memory limitations like pickle?

    opened by Larryliu912 15
  • HEP005: Add support for hdf5plugin compression filters (e.g. bitshuffle+lz4)

    Comparison of byte size and elapsed time for hickle operating on this object: np.random.rand(64, 2, 1048576)

    uncompressed ......... 1073.8 MiB, E.T. = 0.5 s
    gzip-compressed ...... 1013.1 MiB, E.T. = 32.7 s
    bitshuffle-compressed  937.4 MiB, E.T. = 1.3 s
    

    For smaller objects, it won't matter much. The value of this proposed feature depends on how large an object hickle operates on.

    Source code: try_deflation.py.txt

    (same bitshuffle+lz4 that's used in blimpy and rawspec)

    enhancement pull-request-welcome 
    opened by texadactyl 12
  • Hickle not working with h5py 3.0

    Hickle seems to have stopped working with h5py 3.0.0. It works fine with 2.10.

    Example code:

    import hickle as hkl
    
    hkl.dump([1,2,3], "/tmp/foo.hkl")
    hkl.load("/tmp/foo.hkl")
    

    Fails with

    ValueError: Provided argument 'file_obj' does not appear to be a valid hickle file! (Cannot load <HDF5 dataset "data": shape (3,), type "<i8"> data type)
    
    opened by cyfra 12
  • Release of v4.0.0?

    Also creating this PR in advance, so that we can keep track of which issues have already been dealt with. This includes changes made in any (non-merged) PR to the dev branch.

    Issues solved:

    • OrderedDict is now supported (fixes #65);
    • data_0 is no longer used if there is only a single data group/set (fixes #44);
    • HDF5 groups can now be dumped to and loaded from (fixes #54);
    • Integers using Python's arbitrary-precision (integers larger than 64-bit) can now be dumped and loaded properly (fixes #113);
    • Replaced broken link to pickle documentation with proper one (fixes #122);
    • Objects that appear to be iterable are no longer considered as such unless hickle knows for sure they are iterables (fixes #70 and fixes #125);
    • Dict keys with slashes are now supported (fixes #124);
    • Loaders are only loaded when they are required for dumping or loading specific objects (fixes #114);

    NOTE: Before this gets merged, a v3 branch must be made of master.

    opened by 1313e 12
  • [FEATURE] Providing an open HDF5-file to dump/load does not close the file

    Something that may be useful: when the user passes an open HDF5-file to the hickle.dump or hickle.load functions, they should not automatically close the HDF5-file afterwards. Given that the user provided the file as open, I think it is pretty safe to assume that they will close it themselves.

    The reason why I think this is useful is that I am hickling many objects to the exact same file, while I am controlling the paths within the file. Currently, this means that the file is being opened and closed for every dump or load that I perform, even though they are all executed one after another. This is a bit stupid, as it creates both overhead and a lot of strain on the file system (obviously depending on how many times the file is opened and closed).

    Doing this, however, would change the comment I made in #90 about using with-statements, as h5f.close should only be called if hickle opened the file itself (this can be done quite easily by using finally-clauses). If you want, I can make the changes myself (including addressing the closing issue of #90) and simply create a PR.
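
    [A hedged sketch of the intended usage, assuming support for open h5py.File objects and hickle's path keyword; the filename and paths are illustrative:]

    import h5py
    import hickle as hkl

    with h5py.File('many_dumps.hdf5', 'w') as h5f:
        hkl.dump([1, 2, 3], h5f, path='/run1')  # file stays open between dumps
        hkl.dump({'a': 1}, h5f, path='/run2')
    # the with-block, not hickle, closes the file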

    enhancement pull-request-welcome 
    opened by 1313e 12
  • Hickling an empty dict does not return one when unhickled

    Not sure if this is intended behavior, but when hickling an empty dict and unhickling it, an empty list is returned instead of an empty dict. Hickling and unhickling a non-empty dict works perfectly fine. Is this intended behavior or simply something that was overlooked?

    bug 
    opened by 1313e 12
  • Fix for failing tests with numpy 1.24.1

    numpy has removed numpy.float(); the solution is to use the regular built-in float() instead.

    Debian bug report here: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027194

    opened by EdwardBetts 1
  • hickleable integration

    @steven-murray has written a neat package called hickleable. The idea is to provide a simple decorator to apply to classes that will make them hickle-able without pickling.

    Is this better as a standalone package, or could we consider merging it? At the least, we should mention it in the hickle documentation / README (IMO).

    enhancement question 
    opened by telegraphic 1
  • Failing test with Python 3.11: AttributeError: property 'dtype' of 'Dataset' object has no setter

    $ python3.11 -mpytest --no-cov
    ============================= test session starts ==============================
    platform linux -- Python 3.11.0+, pytest-7.1.2, pluggy-1.0.0+repack
    benchmark: 3.2.2 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
    rootdir: /home/edward/src/2022/vendor/hickle-5.0.2, configfile: tox.ini
    plugins: benchmark-3.2.2, astropy-header-0.2.2, forked-1.4.0, flaky-3.7.0, anyio-3.6.2, sugar-0.9.6, openfiles-0.5.0, hypothesis-6.36.0, arraydiff-0.5.0, doctestplus-0.12.1, kgb-7.1.1, repeat-0.9.1, django-4.5.2, timeout-2.1.0, astropy-0.10.0, pylama-7.4.3, cov-4.0.0, tornasync-0.6.0.post2, remotedata-0.3.3, filter-subpackage-0.1.1, mock-3.8.2, requests-mock-1.9.3, xdist-2.5.0, asyncio-0.19.0
    asyncio: mode=Mode.STRICT
    collected 102 items
    
    hickle/tests/test_01_hickle_helpers.py ..F...                            [  5%]
    hickle/tests/test_02_hickle_lookup.py .......................            [ 28%]
    hickle/tests/test_03_load_builtins.py ......                             [ 34%]
    hickle/tests/test_04_load_numpy.py ....                                  [ 38%]
    hickle/tests/test_05_load_scipy.py ..                                    [ 40%]
    hickle/tests/test_06_load_astropy.py .........                           [ 49%]
    hickle/tests/test_07_load_pandas.py .                                    [ 50%]
    hickle/tests/test_99_hickle_core.py ..........                           [ 59%]
    hickle/tests/test_hickle.py .......................................      [ 98%]
    hickle/tests/test_legacy_load.py ..                                      [100%]
    
    =================================== FAILURES ===================================
    ____________________________ test_H5NodeFilterProxy ____________________________
    
    h5_data = <HDF5 file "hickle_helpers_test_H5NodeFilterProxy.hdf5" (mode r)>
    
        def test_H5NodeFilterProxy(h5_data):
            """
            tests H5NodeFilterProxy class. This class allows to temporarily rewrite
            attributes of h5py.Group and h5py.Dataset nodes before being loaded by
            hickle._load method.
            """
        
            # load data and try to directly modify 'type' and 'base_type' Attributes
            # which will fail cause hdf5 file is opened for read only
            h5_node = h5_data['somedata']
            with pytest.raises(OSError):
                try:
                    h5_node.attrs['type'] = pickle.dumps(list)
                except RuntimeError as re:
                    raise OSError(re).with_traceback(re.__traceback__)
            with pytest.raises(OSError):
                try:
                    h5_node.attrs['base_type'] = b'list'
                except RuntimeError as re:
                    raise OSError(re).with_traceback(re.__traceback__)
        
            # verify that 'type' expands to tuple before running
            # the remaining tests
            object_type = pickle.loads(h5_node.attrs['type'])
            assert object_type is tuple
            assert object_type(h5_node[()].tolist()) == dummy_data
        
            # Wrap node by H5NodeFilterProxy and rerun the above tests
            # again. This time modifying Attributes shall be possible.
            h5_node = H5NodeFilterProxy(h5_node)
            h5_node.attrs['type'] = pickle.dumps(list)
            h5_node.attrs['base_type'] = b'list'
            object_type = pickle.loads(h5_node.attrs['type'])
            assert object_type is list
        
            # test proper pass through of item and attribute access
            # to wrapped h5py.Group or h5py.Dataset object respective
            assert object_type(h5_node[()].tolist()) == list(dummy_data)
            assert h5_node.shape == np.array(dummy_data).shape
            with pytest.raises(AttributeError,match = r"can't\s+set\s+attribute"):
    >           h5_node.dtype = np.float32
    
    /home/edward/src/2022/vendor/hickle-5.0.2/hickle/tests/test_01_hickle_helpers.py:154: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
    
    self = <hickle.helpers.H5NodeFilterProxy object at 0x7f4c808b2090>
    name = 'dtype', value = <class 'numpy.float32'>
    
        def __setattr__(self, name, value):
            # if wrapped _h5_node and attrs shall be set store value on local attributes
            # otherwise pass on to wrapped _h5_node
            if name in {'_h5_node'}:
                super().__setattr__(name, value)
                return
            if name in {'attrs'}: # pragma: no cover
                raise AttributeError('attribute is read-only')
            _h5_node = super().__getattribute__('_h5_node')
    >       setattr(_h5_node, name, value)
    E       AttributeError: property 'dtype' of 'Dataset' object has no setter
    
    /home/edward/src/2022/vendor/hickle-5.0.2/hickle/helpers.py:180: AttributeError
    
    During handling of the above exception, another exception occurred:
    
    h5_data = <HDF5 file "hickle_helpers_test_H5NodeFilterProxy.hdf5" (mode r)>
    
        def test_H5NodeFilterProxy(h5_data):
            """
            tests H5NodeFilterProxy class. This class allows to temporarily rewrite
            attributes of h5py.Group and h5py.Dataset nodes before being loaded by
            hickle._load method.
            """
        
            # load data and try to directly modify 'type' and 'base_type' Attributes
            # which will fail cause hdf5 file is opened for read only
            h5_node = h5_data['somedata']
            with pytest.raises(OSError):
                try:
                    h5_node.attrs['type'] = pickle.dumps(list)
                except RuntimeError as re:
                    raise OSError(re).with_traceback(re.__traceback__)
            with pytest.raises(OSError):
                try:
                    h5_node.attrs['base_type'] = b'list'
                except RuntimeError as re:
                    raise OSError(re).with_traceback(re.__traceback__)
        
            # verify that 'type' expands to tuple before running
            # the remaining tests
            object_type = pickle.loads(h5_node.attrs['type'])
            assert object_type is tuple
            assert object_type(h5_node[()].tolist()) == dummy_data
        
            # Wrap node by H5NodeFilterProxy and rerun the above tests
            # again. This time modifying Attributes shall be possible.
            h5_node = H5NodeFilterProxy(h5_node)
            h5_node.attrs['type'] = pickle.dumps(list)
            h5_node.attrs['base_type'] = b'list'
            object_type = pickle.loads(h5_node.attrs['type'])
            assert object_type is list
        
            # test proper pass through of item and attribute access
            # to wrapped h5py.Group or h5py.Dataset object respective
            assert object_type(h5_node[()].tolist()) == list(dummy_data)
            assert h5_node.shape == np.array(dummy_data).shape
    >       with pytest.raises(AttributeError,match = r"can't\s+set\s+attribute"):
    E       AssertionError: Regex pattern "can't\\s+set\\s+attribute" does not match "property 'dtype' of 'Dataset' object has no setter".
    
    /home/edward/src/2022/vendor/hickle-5.0.2/hickle/tests/test_01_hickle_helpers.py:153: AssertionError
    =============================== warnings summary ===============================
    hickle/tests/test_06_load_astropy.py::test_create_astropy_constant
      /usr/lib/python3/dist-packages/astropy/constants/constant.py:99: AstropyUserWarning: Constant 'Gravitational constant' already has a definition in the None system from 'CODATA 2018' reference
        warnings.warn('Constant {!r} already has a definition in the '
    
    hickle/tests/test_06_load_astropy.py::test_create_astropy_constant
      /usr/lib/python3/dist-packages/astropy/constants/constant.py:99: AstropyUserWarning: Constant 'Electron charge' already has a definition in the 'emu' system from 'CODATA 2018' reference
        warnings.warn('Constant {!r} already has a definition in the '
    
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
      /home/edward/src/2022/vendor/hickle-5.0.2/hickle/tests/test_06_load_astropy.py:169: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
        assert reloaded.value[index].tostring() == t1.value[index].tostring()
    
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array
      /home/edward/src/2022/vendor/hickle-5.0.2/hickle/tests/test_06_load_astropy.py:177: DeprecationWarning: tostring() is deprecated. Use tobytes() instead.
        assert reloaded.value[index].tostring() == t1.value[index].tostring()
    
    hickle/tests/test_hickle.py::test_scalar_compression
      /home/edward/src/2022/vendor/hickle-5.0.2/hickle/tests/test_hickle.py:745: DeprecationWarning: `np.float` is a deprecated alias for the builtin `float`. To silence this warning, use `float` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.float64` here.
      Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
        data = {'a': 0, 'b': np.float(2), 'c': True}
    
    hickle/tests/test_hickle.py::test_slash_dict_keys
      /usr/lib/python3/dist-packages/pytest_tornasync/plugin.py:45: PytestRemovedIn8Warning: Passing None has been deprecated.
      See https://docs.pytest.org/en/latest/how-to/capture-warnings.html#additional-use-cases-of-warnings-in-tests for alternatives in common use cases.
        pyfuncitem.obj(**testargs)
    
    hickle/tests/test_legacy_load.py::test_4_0_0_load
      /home/edward/src/2022/vendor/hickle-5.0.2/hickle/loaders/load_scipy.py:91: DeprecationWarning: Please use `csr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csr` namespace is deprecated.
        self.object_type = pickle.loads(item.attrs['type'])
    
    hickle/tests/test_legacy_load.py::test_4_0_0_load
      /home/edward/src/2022/vendor/hickle-5.0.2/hickle/loaders/load_scipy.py:91: DeprecationWarning: Please use `csc_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.csc` namespace is deprecated.
        self.object_type = pickle.loads(item.attrs['type'])
    
    hickle/tests/test_legacy_load.py::test_4_0_0_load
      /home/edward/src/2022/vendor/hickle-5.0.2/hickle/loaders/load_scipy.py:91: DeprecationWarning: Please use `bsr_matrix` from the `scipy.sparse` namespace, the `scipy.sparse.bsr` namespace is deprecated.
        self.object_type = pickle.loads(item.attrs['type'])
    
    hickle/tests/test_legacy_load.py::test_4_0_0_load
      /home/edward/src/2022/vendor/hickle-5.0.2/hickle/lookup.py:1611: MockedLambdaWarning: presenting '<function _moc_numpy_array_object_lambda at 0x7f4c796220c0>' instead of stored lambda 'type'
        warnings.warn(
    
    -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
    =========================== short test summary info ============================
    FAILED hickle/tests/test_01_hickle_helpers.py::test_H5NodeFilterProxy - Asser...
    ================== 1 failed, 101 passed, 24 warnings in 3.88s ==================
    
    pull-request-welcome help-wanted 
    opened by EdwardBetts 1
  • Failing test on big-endian: TypeError: No conversion path for dtype: dtype('>U23')

    When I run the tests on a machine with the s390x architecture, the test_astropy_time_array test fails with this exception:

    TypeError: No conversion path for dtype: dtype('>U23')
    

    This looks like an error caused by the s390x architecture being big-endian.

    Here is the full output of the failing test.

    $ python3 -mpytest --verbose -k test_astropy_time_array --no-cov
    ================================================================================================ test session starts ================================================================================================
    platform linux -- Python 3.10.6, pytest-7.1.2, pluggy-1.0.0+repack -- /usr/bin/python3
    cachedir: .pytest_cache
    hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase('/home/edward/hickle/hickle/.hypothesis/examples')
    rootdir: /home/edward/hickle/hickle, configfile: tox.ini
    plugins: doctestplus-0.12.0, arraydiff-0.5.0, openfiles-0.5.0, cov-3.0.0, mock-3.8.2, hypothesis-6.36.0, filter-subpackage-0.1.1, remotedata-0.3.3, astropy-header-0.2.1, astropy-0.10.0
    collected 102 items / 101 deselected / 1 selected                                                                                                                                                                   
    
    hickle/tests/test_06_load_astropy.py::test_astropy_time_array FAILED                                                                                                                                          [100%]
    
    ===================================================================================================== FAILURES ======================================================================================================
    ______________________________________________________________________________________________ test_astropy_time_array ______________________________________________________________________________________________
    
    h5_data = <HDF5 group "/root_group" (2 members)>, compression_kwargs = {}
    
        def test_astropy_time_array(h5_data,compression_kwargs):
            """
            test proper storage and loading of astropy time representations
            """
        
            loop_counter = 0
        
        
            for times in ([58264, 58265, 58266], [[58264, 58265, 58266], [58264, 58265, 58266]]):
                t1 = Time(times, format='mjd', scale='utc')
        
                h_dataset, subitems = load_astropy.create_astropy_time(t1,h5_data, f'time_{loop_counter}',**compression_kwargs)
                assert isinstance(h_dataset,h5.Dataset) and not subitems and iter(subitems)
                assert h_dataset.attrs['format'] in( str(t1.format).encode('ascii'),str(t1.format))
                assert h_dataset.attrs['scale'] in ( str(t1.scale).encode('ascii'),str(t1.scale))
                assert h_dataset.attrs['np_dtype'] in( t1.value.dtype.str.encode('ascii'),t1.value.dtype.str)
                reloaded = load_astropy.load_astropy_time_dataset(h_dataset,b'astropy_time',t1.__class__)
                assert reloaded.value.shape == t1.value.shape
                assert reloaded.format == t1.format
                assert reloaded.scale == t1.scale
                for index in range(len(t1)):
                    assert np.allclose(reloaded.value[index], t1.value[index])
                loop_counter += 1
        
            t_strings = ['1999-01-01T00:00:00.123456789', '2010-01-01T00:00:00']
        
            # Check that 2D time arrays work as well (github issue #162)
            for times in (t_strings, [t_strings, t_strings]):
                t1 = Time(times, format='isot', scale='utc')
        
    >           h_dataset,subitems = load_astropy.create_astropy_time(t1,h5_data,f'time_{loop_counter}',**compression_kwargs)
    
    /home/edward/hickle/hickle/hickle/tests/test_06_load_astropy.py:159: 
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    /home/edward/hickle/hickle/hickle/loaders/load_astropy.py:134: in create_astropy_time
        d = h_group.create_dataset(
    /usr/lib/python3/dist-packages/h5py/_debian_h5py_serial/_hl/group.py:161: in create_dataset
        dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    /usr/lib/python3/dist-packages/h5py/_debian_h5py_serial/_hl/dataset.py:88: in make_new_dset
        tid = h5t.py_create(dtype, logical=1)
    h5py/_debian_h5py_serial/h5t.pyx:1663: in h5py._debian_h5py_serial.h5t.py_create
        ???
    h5py/_debian_h5py_serial/h5t.pyx:1687: in h5py._debian_h5py_serial.h5t.py_create
        ???
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    
    >   ???
    E   TypeError: No conversion path for dtype: dtype('>U23')
    
    h5py/_debian_h5py_serial/h5t.pyx:1753: TypeError
    ============================================================================================== short test summary info ==============================================================================================
    FAILED hickle/tests/test_06_load_astropy.py::test_astropy_time_array - TypeError: No conversion path for dtype: dtype('>U23')
    ========================================================================================= 1 failed, 101 deselected in 0.86s =========================================================================================
    $
    
    pull-request-welcome help-wanted 
    opened by EdwardBetts 2
  • Hickle now in Debian

    Thanks to @EdwardBetts, hickle is now included in Debian (unstable) 🙏.

    Mainly opening this issue to let @hernot, @1313e, and other devs know 🚀. I have just fixed a few outstanding issues and will bump to v5.0.2.

    opened by telegraphic 1
Releases (v5.0.2)
  • v5.0.2(Aug 31, 2022)

  • v5.0.1(Aug 30, 2022)

  • v5.0.0(Dec 17, 2021)

    • Support for newer versions of numpy >= 1.21 and h5py >= 3.0
    • Improved internal HDF5 structure for python dictionaries (no longer trailing /data)
    • Deprecated use of dill in favor of in-built pickle (given the updates to pickle functionality in Py3, and issues with numpy dtypes)
    • Switched to github actions for CI
    • Objects referred to multiple times are now only dumped once within the HDF5 file (HEP002)
  • v4.0.3(Dec 17, 2020)

  • v4.0.1(Jul 28, 2020)

  • v3.4.8(Jul 28, 2020)

  • v4.0.0(Jun 25, 2020)

    This is the major v4.0.0 release of the hickle package.

    Changes:

    • Dropped support for Python 2.7;
    • Dropped legacy support for hickle files made with v1 and v2;
    • OrderedDict is now supported (#65);
    • Subclasses of supported classes can now be properly dumped;
    • data_0 is no longer used if there is only a single data group/set (#44);
    • HDF5 groups can now be dumped to and loaded from (#54);
    • Integers using Python's arbitrary-precision (integers larger than 64-bit) can now be dumped and loaded properly (#113);
    • Replaced broken link to pickle documentation with proper one (#122);
    • Objects that appear to be iterable are no longer considered as such unless hickle knows for sure they are iterables (#70 and #125);
    • Dict keys with slashes are now supported (#124);
    • Loaders are only loaded when they are required for dumping or loading specific objects (#114);
    • hickle now has 100% test coverage;
    • NumPy arrays containing unicode strings can be properly dumped and loaded;
    • NumPy arrays containing non-NumPy objects can be dealt with as well (#90);
    • Removed the use of 'track_times' (#130);
    • If an object fails to be hickled using normal means, hickle will now fall back to pickling the object;
    • Massively simplified the way in which builtin Python scalars are stored, making it easier for the user to view.
  • 3.4.6(Mar 12, 2020)
