An implementation of chunked, compressed, N-dimensional arrays for Python.

Overview

Zarr

Latest Release · Package Status · License · Build Status · Coverage · Downloads · Gitter · Citation (DOI)

What is it?

Zarr is a Python package providing an implementation of compressed, chunked, N-dimensional arrays, designed for use in parallel computing. See the documentation for more information.

Main Features

  • Create N-dimensional arrays with any NumPy dtype.
  • Chunk arrays along any dimension.
  • Compress and/or filter chunks using any NumCodecs codec.
  • Store arrays in memory, on disk, inside a zip file, on S3, etc.
  • Read an array concurrently from multiple threads or processes.
  • Write to an array concurrently from multiple threads or processes.
  • Organize arrays into hierarchies via groups (see the short example after this list).
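
Here is a minimal sketch of these features in action (assumes zarr 2.x with NumPy and numcodecs installed; the names and shapes are illustrative):

import numpy as np
import zarr
from numcodecs import Blosc

# A chunked, compressed, 2-D array with a NumPy dtype.
z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='f4',
               compressor=Blosc(cname='zstd', clevel=3))
z[0, :] = np.arange(10000)

# Arrays organized into a hierarchy via groups.
root = zarr.group()
grp = root.create_group('measurements')
temp = grp.zeros('temperature', shape=(100, 100), chunks=(10, 10), dtype='f8')

Swapping the in-memory store for a DirectoryStore, ZipStore, or an S3 mapping changes where the chunks live without changing the array code.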

Where to get it

Zarr can be installed from PyPI using pip:

pip install zarr

or via conda:

conda install -c conda-forge zarr

For more details, including how to install from source, see the installation documentation.

Comments
  • Refactoring WIP

    This PR contains work on refactoring to achieve separation of concerns between compression/decompression, storage, synchronization and array/chunk data management. Primarily motivated by #21.

    Still TODO:

    • [x] Improve test coverage
    • [x] Rework persistence/storage documentation to generalise beyond directory store
    • [x] Improve docstrings, add examples where appropriate
    • [x] Do some benchmarking/profiling
    • [x] Fall back to pure python install
    opened by alimanfoo 103
  • async in zarr

    I think there are some places where zarr would benefit immensely from some async capabilities when reading and writing data. I will try to illustrate this with the simplest example I can.

    Let's consider a zarr array stored in a public S3 bucket, which we can read with fsspec's HTTPFileSystem interface (no special S3 API needed, just regular http calls).

    import zarr
    import fsspec
    url_base = 'https://mur-sst.s3.us-west-2.amazonaws.com/zarr/time'
    mapper = fsspec.get_mapper(url_base)
    za = zarr.open(mapper)
    za.info
    

    [screenshot of za.info output]

    Note that this is a highly sub-optimal choice of chunks: the 1D array of shape (6443,) is stored in chunks of only (5,) items, resulting in over 1000 tiny chunks. Reading this data is painfully slow, taking over 5 minutes:

    %prun tdata = za[:]
    
             20312192 function calls (20310903 primitive calls) in 342.624 seconds
    
       Ordered by: internal time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         1289  139.941    0.109  140.077    0.109 {built-in method _openssl.SSL_do_handshake}
         2578   99.914    0.039   99.914    0.039 {built-in method _openssl.SSL_read}
         1289   68.375    0.053   68.375    0.053 {method 'connect' of '_socket.socket' objects}
         1289    9.252    0.007    9.252    0.007 {built-in method _openssl.SSL_CTX_load_verify_locations}
         1289    7.857    0.006    7.868    0.006 {built-in method _socket.getaddrinfo}
         1289    1.619    0.001    1.828    0.001 connectionpool.py:455(close)
       930658    0.980    0.000    2.103    0.000 os.py:674(__getitem__)
    ...
    

    I believe fsspec is introducing some major overhead by not reusing a connection pool. But regardless, zarr is iterating synchronously over each chunk to load the data:

    https://github.com/zarr-developers/zarr-python/blob/994f2449b84be544c9dfac3e23a15be3f5478b71/zarr/core.py#L1023-L1028

    As a lower bound on how fast this approach could be, we bypass zarr and fsspec and just fetch all the chunks with requests:

    import requests
    s = requests.Session()
    
    def get_chunk_http(n):
        r = s.get(url_base + f'/{n}')
        r.raise_for_status()
        return r.content
    
    %prun all_data = [get_chunk_http(n) for n in range(za.shape[0] // za.chunks[0])] 
    
             12550435 function calls (12549147 primitive calls) in 98.508 seconds
    
       Ordered by: internal time
    
       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
         2576   87.798    0.034   87.798    0.034 {built-in method _openssl.SSL_read}
           13    1.436    0.110    1.437    0.111 {built-in method _openssl.SSL_do_handshake}
       929936    1.042    0.000    2.224    0.000 os.py:674(__getitem__)
    

    As expected, reusing a connection pool sped things up, but it still takes about 100 s to read the array.

    Finally, we can try the same thing with asyncio:

    import asyncio
    import aiohttp
    import time
    
    async def get_chunk_http_async(n, session):
        url = url_base + f'/{n}'
        async with session.get(url) as r:
            r.raise_for_status()
            data = await r.read()
        return data
    
    # (run in IPython / Jupyter, which supports top-level await)
    async with aiohttp.ClientSession() as session:
        tic = time.time()
        all_data = await asyncio.gather(*[get_chunk_http_async(n, session)
                                        for n in range(za.shape[0] // za.chunks[0])])
        print(f"{time.time() - tic} seconds")
    
    # > 1.7969944477081299 seconds
    

    This is a MAJOR speedup!

    I am aware that using dask could possibly help me here. But I don't have big data here, and I don't want to use dask. I want zarr to support asyncio natively.

    I am quite new to async programming and have no idea how hard / complicated it would be to do this. But based on this experiment, I am quite sure there are major performance benefits to be had, particularly when using zarr with remote storage protocols.
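
    To make the request concrete, here is a rough, hypothetical sketch (not zarr's actual code and not a proposed API) of how a synchronous per-chunk loop could be replaced by concurrent fetches; fetch_chunk and chunk_keys are placeholders:

    import asyncio

    async def load_all_chunks(fetch_chunk, chunk_keys, max_concurrency=32):
        # Fetch many chunk keys concurrently instead of one at a time,
        # bounding the number of in-flight requests with a semaphore.
        sem = asyncio.Semaphore(max_concurrency)

        async def bounded_fetch(key):
            async with sem:
                return key, await fetch_chunk(key)

        results = await asyncio.gather(*(bounded_fetch(k) for k in chunk_keys))
        return dict(results)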

    Thoughts?

    cc @cgentemann

    opened by rabernat 93
  • Sharding Prototype I: implementation as translating Store

    This PR is for an early prototype of sharding support, as described in the corresponding issue #877. It serves mainly to discuss the overall implementation approach for sharding. This PR is not (yet) meant to be merged.

    This prototype

    • allows shards to be specified as the number of chunks that should be contained in a shard (e.g. using zarr.zeros((20, 3), chunks=(3, 3), shards=(2, 2), …)). One shard corresponds to one storage key, but can contain multiple chunks (see the sharding diagram in the PR).
    • ensures that this setting is persisted in the .zarray config and loaded when opening an array again, adding two entries:
      • "shard_format": "indexed" specifies the binary format of the shards and allows to extend sharding with other formats later
      • "shards": [2, 2] specifies how many chunks are contained in a shard,
    • adds a IndexedShardedStore class that is used to wrap the chunk-store when sharding is enabled. This store handles the grouping of multiple chunks to one shard and transparently reads and writes them via the inner store in a binary format which is specified below. The original store API does not need to be adapted, it just stores shards instead of chunks, which are translated back to chunks by the IndexedShardedStore.
    • adds a small script chunking_test.py for demonstration purposes, this is not meant to be merged but servers to illustrate the changes.

    The currently implemented file format is still up for discussion. It implements the "Format 2" that @jbms describes in https://github.com/zarr-developers/zarr-python/pull/876#issuecomment-973462279.

    Chunks are written successively in a shard (unused space between them is allowed), followed by an index referencing them. The index holds an (offset, length) pair of little-endian uint64 values per chunk, and the chunk order in the index is row-major (C) order. For example, for (2, 2) chunks per shard the index looks like:

    | chunk (0, 0)    | chunk (0, 1)    | chunk (1, 0)    | chunk (1, 1)    |
    | offset | length | offset | length | offset | length | offset | length |
    | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 | uint64 |
    

    Empty chunks are denoted by setting both offset and length to 2^64 - 1. The index always has the full shape of all possible chunks per shard, even if some of them fall outside the array size.

    For the default order of the actual chunk-content in a shard I'd propose to use Morton order, but this can easily be changed and customized, since any order can be read.
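
    To make the index layout concrete, here is a hypothetical sketch (not part of this PR) of parsing the trailing index of a shard and slicing out the chunk payloads, following the format described above:

    import numpy as np

    def read_shard_chunks(shard_bytes, chunks_per_shard=(2, 2)):
        # The index sits at the end of the shard: two little-endian uint64
        # values (offset, length) per chunk, in row-major (C) order.
        n_chunks = int(np.prod(chunks_per_shard))
        index_nbytes = n_chunks * 2 * 8
        index = np.frombuffer(shard_bytes[-index_nbytes:], dtype='<u8').reshape(n_chunks, 2)

        MISSING = 2**64 - 1  # offset and length both 2^64 - 1 marks an empty chunk
        chunks = []
        for offset, length in index:
            if offset == MISSING and length == MISSING:
                chunks.append(None)
            else:
                chunks.append(shard_bytes[int(offset):int(offset) + int(length)])
        return chunks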


    If the overall direction of this PR is pursued, the following steps (and possibly more) are missing:

    • Functionality
      • [ ] Use a default write-order (Morton) of chunks in a shard and allow customization
      • [ ] Support deletion in the ShardedStore
      • [ ] Group chunk-wise operations in Array where possible (e.g. in digest & _resize_nosync)
      • [ ] Consider locking mechanisms to guard against concurrency issues within a shard
      • [ ] Allow partial reads and writes when the wrapped store supports them
      • [ ] Add support for prefixes before the chunk-dimensions in the storage key, e.g. for arrays that are contained in a group
      • [ ] Add warnings for inefficient reads/writes (might be configured)
      • [ ] Maybe also use the new partial read method on the Store for the current PartialReadBuffer usage (detecting whether this is possible and reading via it)
    • Tests
      • [ ] Add unit tests and/or doctests in docstrings
      • [ ] Test coverage is 100% (Codecov passes)
    • Documentation
      • [ ] also document optional optimization possibilities on the Store or BaseStore class, such as getitems or partial reads
      • [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
      • [ ] New/modified features documented in docs/tutorial.rst
      • [ ] Changes documented in docs/release.rst

    changed 2021-12-07: added file format description and updated TODOs

    opened by jstriebel 69
  • Consolidate zarr metadata into single key

    This is a simple possible way of scanning all the metadata keys ('.zgroup', ...) in a dataset and copying them into a single key, so that on systems where reading many small files carries substantial overhead, everything can be grabbed in a single read. This is important in the context of xarray, which traverses all groups when opening a dataset in order to find the various sub-groups and arrays.

    The test shows how you could use the generated key. We could contemplate automatically looking for the metadata key when opening.

    REF: https://github.com/pangeo-data/pangeo/issues/309
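
    For orientation, a minimal sketch of how consolidated metadata is used through the functions that eventually landed in zarr (zarr.consolidate_metadata / zarr.open_consolidated; the exact interface in this PR may differ):

    import zarr

    store = zarr.DirectoryStore('example.zarr')
    root = zarr.group(store=store)
    root.create_group('sub').zeros('a', shape=(10,), chunks=(5,))

    # Copy all the '.zgroup' / '.zarray' / '.zattrs' keys into a single key ...
    zarr.consolidate_metadata(store)

    # ... so that opening the hierarchy later needs only one metadata read.
    grp = zarr.open_consolidated(store)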

    TODO:

    • [x] Add unit tests and/or doctests in docstrings
    • [x] Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or pytest -v --doctest-modules zarr)
    • [x] Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or pytest -v zarr)
    • [x] PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
    • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
    • [x] New/modified features documented in docs/tutorial.rst
    • [x] Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
    • [x] Changes documented in docs/release.rst
    • [x] Docs build locally (e.g., run tox -e docs)
    • [x] AppVeyor and Travis CI passes
    • [ ] Test coverage to 100% (Coveralls passes)
    opened by martindurant 66
  • WIP: google cloud storage class

    First, apologies for submitting an unsolicited pull request. I know that is against the contributor guidelines. I thought this idea would be easier to discuss with a concrete implementation to look at.

    In my highly opinionated view, the killer feature of zarr is its ability to efficiently store array data in cloud storage. Currently, the recommended way to do this is via outside packages (e.g. s3fs, gcsfs), which provide a MutableMapping that zarr can store things in.
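
    For reference, the currently recommended pattern looks roughly like this (the bucket name is hypothetical and gcsfs is assumed to be installed):

    import gcsfs
    import zarr

    # gcsfs exposes a MutableMapping over a GCS bucket that zarr can read from and write to.
    fs = gcsfs.GCSFileSystem(token='anon')  # anonymous access to a public bucket
    store = fs.get_mapper('some-public-bucket/path/to/data.zarr')
    z = zarr.open(store, mode='r')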

    In this PR, I have implemented an experimental google cloud storage class directly within zarr.

    Why did I do this? Because, in the pangeo project, we are now making heavy use of the xarray -> zarr -> gcsfs -> cloud storage stack. I have come to the conclusion that a tighter coupling between zarr and gcs, via the google.cloud.storage API, may prove advantageous.

    In addition to performance benefits and easier debugging, I think there are social advantages to having cloud storage as a first-class part of zarr. Lots of people want to store arrays in the cloud, and if zarr can provide this capability more natively, it could increase adoption.

    Thoughts?

    These tests require GCP credentials and the google cloud storage package. It is possible to add encrypted credentials to Travis, but I haven't done that yet. Tests are (mostly) working locally for me.

    TODO:

    • [x] Add unit tests and/or doctests in docstrings
    • [ ] Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or pytest -v --doctest-modules zarr)
    • [ ] Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or pytest -v zarr)
    • [ ] PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
    • [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
    • [ ] New/modified features documented in docs/tutorial.rst
    • [ ] Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
    • [ ] Changes documented in docs/release.rst
    • [ ] Docs build locally (e.g., run tox -e docs)
    • [ ] AppVeyor and Travis CI passes
    • [ ] Test coverage to 100% (Coveralls passes)
    opened by rabernat 52
  • Add N5 Support

    This adds support for reading from and writing to N5 containers. The N5Store, which handles the conversion between the zarr and N5 formats, is selected automatically whenever the path for a container ends in .n5 (similar to how the ZipStore is used for files ending in .zip).

    The conversion is done mostly transparently, with one exception being the N5ChunkWrapper: this is a Codec with id n5_wrapper that is automatically wrapped around the requested compressor. For example, if you create an array with a zlib compressor, the n5_wrapper codec is actually used, delegating to the zlib codec internally. The additional codec is necessary to introduce N5's chunk headers and to ensure big-endian storage.
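
    A minimal sketch of what this enables once merged (the path is hypothetical):

    import zarr
    from numcodecs import Zlib

    # A path ending in '.n5' selects the N5Store automatically.
    z = zarr.open('data.n5', mode='w', shape=(100, 100), chunks=(10, 10),
                  dtype='uint16', compressor=Zlib(level=5))
    z[:] = 42
    # Behind the scenes the requested compressor is wrapped by the 'n5_wrapper'
    # codec, which adds N5 chunk headers and big-endian storage.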

    On a related note, gzip-compressed N5 arrays cannot currently be read, since numcodecs treats zlib and gzip as synonyms, which they are not (their compression headers differ). PR https://github.com/zarr-developers/numcodecs/pull/87 solves this issue.

    See also https://github.com/zarr-developers/zarr/issues/231

    TODO:

    • [x] Add unit tests and/or doctests in docstrings
    • [x] Unit tests and doctests pass locally under Python 3.6 (e.g., run tox -e py36 or pytest -v --doctest-modules zarr)
    • [x] Unit tests pass locally under Python 2.7 (e.g., run tox -e py27 or pytest -v zarr)
    • [x] PEP8 checks pass (e.g., run tox -e py36 or flake8 --max-line-length=100 zarr)
    • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
    • [x] New/modified features documented in docs/tutorial.rst
    • [x] Doctests in tutorial pass (e.g., run tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
    • [x] Changes documented in docs/release.rst
    • [x] Docs build locally (e.g., run tox -e docs)
    • [x] AppVeyor and Travis CI passes
    • [x] Test coverage to 100% (Coveralls passes)
    opened by funkey 45
  • Confusion about the dimension_separator keyword

    I don't really understand how to use the new dimension_separator keyword, in particular:

    1. Creating a DirectoryStore(dimension_separator="/") does not have the effect I would expect (see code and problem description below).
    2. Why does zarr still have the NestedDirectoryStore? Shouldn't it be the same as DirectoryStore(dimension_separator="/")? Hence I would assume that NestedDirectoryStore could either be removed or (if kept for legacy purposes) should simply map to DirectoryStore(dimension_separator="/").

    Minimal, reproducible code sample, a copy-pastable example if possible

    import zarr
    store = zarr.DirectoryStore("test.zarr", dimension_separator="/")
    g = zarr.open(store, mode="a")
    ds = g.create_dataset("test", shape=(10, 10, 10))
    ds[:] = 1
    

    Problem description

    Now, I would assume that the chunks are nested, but I get:

    $ ls test.zarr/test
    0.0.0
    

    but, to my confusion, also this:

    $ cat test.zarr/test/.zarray
    ...
    "dimension_separator": "/",
    ...
    

    If I use NestedDirectoryStore instead, the chunks are nested as expected.

    Version and installation information

    Please provide the following:

    • Value of zarr.__version__: 2.8.3
    • Value of numcodecs.__version__: 0.7.3
    • Version of Python interpreter: 3.8.6
    • Operating system: Linux
    • How Zarr was installed: using conda
    opened by constantinpape 39
  • Add support for fancy indexing on get/setitem

    Addresses #657

    This matches NumPy behaviour in that basic, boolean, and vectorized integer (fancy) indexing are all accessible from __{get,set}item__. Users still have access to all the indexing methods if they want to be sure to use only basic indexing (integer + slices).
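
    A small sketch of the behaviour this adds (assuming the PR is merged; the data is illustrative):

    import numpy as np
    import zarr

    z = zarr.zeros(10, dtype='i4')

    # Vectorized integer (fancy) indexing via __setitem__ / __getitem__,
    # matching NumPy behaviour:
    z[np.array([1, 3, 5])] = 7
    print(z[np.array([1, 3, 5])])   # [7 7 7]

    # Boolean indexing also works through __getitem__:
    mask = np.zeros(10, dtype=bool)
    mask[1] = True
    print(z[mask])                  # [7]

    # The explicit indexing methods remain available:
    print(z.vindex[[1, 3, 5]])      # [7 7 7]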

    I'm not 100% sure about the approach, but it seemed much easier to use a try/except than to try to detect all the cases when fancy indexing should be used. Happy to hear some guidance about how best to arrange that.

    I still need to update docstrings + docs, will do that now — thanks for the checklist below. 😂

    TODO:

    • [x] Add unit tests and/or doctests in docstrings
    • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
    • [x] New/modified features documented in docs/tutorial.rst
    • [x] Changes documented in docs/release.rst
    • [x] GitHub Actions have all passed
    • [x] Test coverage is 100% (Codecov passes)
    opened by jni 39
  • Migrate to pyproject.toml + cleanup

    • Zarr has a lot of redundant files in the root directory. The information contained in these files can be easily moved to pyproject.toml.
    • Latest Python PEPs encourage users to get rid of setup.py and switch to pyproject.toml
    • We should not be using setup.cfg until it is a necessity - https://github.com/pypa/setuptools/issues/3214
    • We should not be using setuptools as a frontend (python setup.py install) - this is not maintained (confirmed by setuptools developers, but I cannot find the exact issue number at this moment)
    • Zarr should switch to pre-commit.ci and remove the pre-commit workflow

    I have tried to perform a 1:1 port. No extra information was added and all the existing metadata has been moved to pyproject.toml.

    TODO:

    • [ ] Add unit tests and/or doctests in docstrings
    • [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
    • [ ] New/modified features documented in docs/tutorial.rst
    • [ ] Changes documented in docs/release.rst
    • [ ] GitHub Actions have all passed
    • [ ] Test coverage is 100% (Codecov passes)
    opened by Saransh-cpp 38
  • Add FSStore

    Fixes #540 Ref https://github.com/zarr-developers/zarr-python/pull/373#issuecomment-592722584 (@rabernat )

    Introduces a short Store implementation for generic fsspec url+options. Allows both lowercasing of keys and choice between '.' and '/'-based ("nested") keys.
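
    A rough sketch of the intended usage (the URL is hypothetical, and keyword names may still change in this PR):

    import zarr
    from zarr.storage import FSStore

    # FSStore builds an fsspec filesystem from the URL; extra keyword
    # arguments are passed through to fsspec as storage options.
    store = FSStore('s3://some-public-bucket/data.zarr', anon=True)
    z = zarr.open(store, mode='r')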

    For testing, I have hijacked the TestNestedDirectoryStore tests just as an example; this is not how things will remain.

    opened by martindurant 38
  • Mutable mapping for Azure Blob

    [Description of PR]

    TODO:

    • [ ] Add unit tests and/or doctests in docstrings
    • [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
    • [ ] New/modified features documented in docs/tutorial.rst
    • [ ] Changes documented in docs/release.rst
    • [ ] Docs build locally (e.g., run tox -e docs)
    • [ ] AppVeyor and Travis CI passes
    • [ ] Test coverage is 100% (Coveralls passes)
    opened by shikharsg 34
  • http:// → https://

    [Description of PR]

    TODO:

    • [ ] Add unit tests and/or doctests in docstrings
    • [ ] Add docstrings and API docs for any new/modified user-facing classes and functions
    • [ ] New/modified features documented in docs/tutorial.rst
    • [x] Changes documented in docs/release.rst
    • [x] GitHub Actions have all passed
    • [x] Test coverage is 100% (Codecov passes)
    opened by DimitriPapadopoulos 1
  • Allow reading utf-8 encoded json files

    Fixes #1308

    Currently, zarr-python raises an error when reading a JSON file with non-ASCII characters encoded as UTF-8; however, Zarr.jl writes JSON files that include non-ASCII characters using UTF-8 encoding. This PR enables zarr attributes written by Zarr.jl to be read by zarr-python.

    TODO:

    • [x] Add unit tests and/or doctests in docstrings
    • [x] Add docstrings and API docs for any new/modified user-facing classes and functions
    • [x] New/modified features documented in docs/tutorial.rst
    • [x] Changes documented in docs/release.rst
    • [ ] GitHub Actions have all passed
    • [ ] Test coverage is 100% (Codecov passes)
    opened by nhz2 0
  • Bump numpy from 1.24.0 to 1.24.1

    Bumps numpy from 1.24.0 to 1.24.1.

    Release notes

    Sourced from numpy's releases.

    v1.24.1

    NumPy 1.24.1 Release Notes

    NumPy 1.24.1 is a maintenance release that fixes bugs and regressions discovered after the 1.24.0 release. The Python versions supported by this release are 3.8-3.11.

    Contributors

    A total of 12 people contributed to this release. People with a "+" by their names contributed a patch for the first time.

    • Andrew Nelson
    • Ben Greiner +
    • Charles Harris
    • Clément Robert
    • Matteo Raso
    • Matti Picus
    • Melissa Weber Mendonça
    • Miles Cranmer
    • Ralf Gommers
    • Rohit Goswami
    • Sayed Adel
    • Sebastian Berg

    Pull requests merged

    A total of 18 pull requests were merged for this release.

    • #22820: BLD: add workaround in setup.py for newer setuptools
    • #22830: BLD: CIRRUS_TAG redux
    • #22831: DOC: fix a couple typos in 1.23 notes
    • #22832: BUG: Fix refcounting errors found using pytest-leaks
    • #22834: BUG, SIMD: Fix invalid value encountered in several ufuncs
    • #22837: TST: ignore more np.distutils.log imports
    • #22839: BUG: Do not use getdata() in np.ma.masked_invalid
    • #22847: BUG: Ensure correct behavior for rows ending in delimiter in...
    • #22848: BUG, SIMD: Fix the bitmask of the boolean comparison
    • #22857: BLD: Help raspian arm + clang 13 about __builtin_mul_overflow
    • #22858: API: Ensure a full mask is returned for masked_invalid
    • #22866: BUG: Polynomials now copy properly (#22669)
    • #22867: BUG, SIMD: Fix memory overlap in ufunc comparison loops
    • #22868: BUG: Fortify string casts against floating point warnings
    • #22875: TST: Ignore nan-warnings in randomized out tests
    • #22883: MAINT: restore npymath implementations needed for freebsd
    • #22884: BUG: Fix integer overflow in in1d for mixed integer dtypes #22877
    • #22887: BUG: Use whole file for encoding checks with charset_normalizer.

    Checksums

    ... (truncated)

    Commits
    • a28f4f2 Merge pull request #22888 from charris/prepare-1.24.1-release
    • f8fea39 REL: Prepare for the NumPY 1.24.1 release.
    • 6f491e0 Merge pull request #22887 from charris/backport-22872
    • 48f5fe4 BUG: Use whole file for encoding checks with charset_normalizer [f2py] (#22...
    • 0f3484a Merge pull request #22883 from charris/backport-22882
    • 002c60d Merge pull request #22884 from charris/backport-22878
    • 38ef9ce BUG: Fix integer overflow in in1d for mixed integer dtypes #22877 (#22878)
    • bb00c68 MAINT: restore npymath implementations needed for freebsd
    • 64e09c3 Merge pull request #22875 from charris/backport-22869
    • dc7bac6 TST: Ignore nan-warnings in randomized out tests
    • Additional commits viewable in compare view

    dependencies python needs release notes 
    opened by dependabot[bot] 0
  • Bump actions/setup-python from 4.3.0 to 4.4.0

    Bumps actions/setup-python from 4.3.0 to 4.4.0.

    Release notes

    Sourced from actions/setup-python's releases.

    Add support to install multiple python versions

    In scope of this release we added support to install multiple python versions. For this you can try to use this snippet:

        - uses: actions/setup-python@v4
          with:
            python-version: |
                3.8
                3.9
                3.10
    

    Besides, we changed logic with throwing the error for GHES if cache is unavailable to warn (actions/setup-python#566).

    Improve error handling and messages

    In scope of this release we added an improved error message that puts the operating system and its version in the logs (actions/setup-python#559). Besides, the release ... (truncated)

    Commits

    dependencies github_actions needs release notes 
    opened by dependabot[bot] 0
  • Ensure `zarr.create` uses writeable mode

    Closes https://github.com/zarr-developers/zarr-python/issues/1306

    cc @ravwojdyla @djhoese

    TODO:

    • [x] Add unit tests and/or doctests in docstrings
    • [ ] ~Add docstrings and API docs for any new/modified user-facing classes and functions~
    • [ ] ~New/modified features documented in docs/tutorial.rst~
    • [ ] Changes documented in docs/release.rst
    • [x] GitHub Actions have all passed
    • [x] Test coverage is 100% (Codecov passes)
    needs release notes 
    opened by jrbourbeau 4
  • Cannot read attributes that contain non-ASCII characters

    Zarr version

    2.13.4.dev68

    Numcodecs version

    0.11.0

    Python Version

    3.10.6

    Operating System

    Linux

    Installation

    With pip, using the instructions here https://zarr.readthedocs.io/en/stable/contributing.html

    Description

    I expect zarr to be able to read attribute files that contain non-ASCII characters, because JSON files use UTF-8 encoding. However, there is currently a check that raises an error if the JSON file contains any non-ASCII characters.

    Steps to reproduce

    import zarr
    import tempfile
    tempdir = tempfile.mkdtemp()
    f = open(tempdir + '/.zgroup','w', encoding='utf-8')
    f.write('{"zarr_format":2}')
    f.close()
    f = open(tempdir + '/.zattrs','w', encoding='utf-8')
    f.write('{"foo": "た"}')
    f.close()
    z = zarr.open(tempdir, mode='r')
    z.attrs['foo']
    
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/root/github/zarr-python/zarr/attrs.py", line 74, in __getitem__
        return self.asdict()[item]
      File "/root/github/zarr-python/zarr/attrs.py", line 55, in asdict
        d = self._get_nosync()
      File "/root/github/zarr-python/zarr/attrs.py", line 48, in _get_nosync
        d = self.store._metadata_class.parse_metadata(data)
      File "/root/github/zarr-python/zarr/meta.py", line 104, in parse_metadata
        meta = json_loads(s)
      File "/root/github/zarr-python/zarr/util.py", line 56, in json_loads
        return json.loads(ensure_text(s, 'ascii'))
      File "/root/pyenv/zarr-dev/lib/python3.10/site-packages/numcodecs/compat.py", line 181, in ensure_text
        s = codecs.decode(s, encoding)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe3 in position 9: ordinal not in range(128)
    

    Additional output

    No response

    bug 
    opened by nhz2 0
Releases: v2.13.3
Owner: Zarr Developers (contributors to the Zarr open source project)