Zarr protocol v3 design update

2019-06-19T00:00:00+00:00

Today I put together some slides summarising the current state of exploratory work on the Zarr v3 protocol spec. The purpose of this blog post is to share those slides more widely, and to provide some context explaining why work has started on a v3 spec.

Why work on a v3 spec?

The current (v2) Zarr spec is implemented in a number of software libraries, and is a stable and robust protocol that is used in production in a number of different scientific communities. If you need to store and compute in parallel against large array-like data, it’s a good solution. So why start thinking about a new protocol version?

Language-agnostic

One reason is that the v2 protocol is somewhat Python-centric, and includes some features which are not straightforward to implement in other languages. This has meant that implementations do not all support the same feature set. It would be good to have a minimal v3 protocol spec that could be fully implemented in any language, so all implementations have parity around a core feature set.

Unifying Zarr and N5

Another reason is that we would like to merge development efforts between the Zarr and N5 communities, and so a goal for the v3 spec is to unify the two approaches and provide a common implementation target.

Extensibility

A third reason is that a number of different groups have started experimenting and extending the Zarr protocol in interesting ways, but it’s not always clear how to extend the v2 protocol to support new features. It would be good if the v3 spec provided a variety of clear extension points and extension mechanisms.

Cloud storage

Finally, while the v2 spec can be used very effectively with distributed storage systems like Amazon S3 or Google Cloud Storage, there is room for improvement, particularly regarding how metadata is stored and organised.

Zarr v3 design update

I you are interested in knowing more about the current status of work on the v3 spec, please take a look at the v3 design update slides. The slides use reveal.js and have both horizontal and vertical navigation - if you haven’t seen that before, then navigate downwards first wherever you can, before navigating to the right.

As I mention in the slides, the current v3 spec is just a straw man, meant to illustrate some ideas and potential solutions, but everything is up for discussion. So if you have any comments or ideas, please do get in touch, anyone is welcome to participate.

Blog post written by Alistair Miles.

Zarr Python 2.3 release

2019-05-23T00:00:00+00:00

Recently we released version 2.3 of the Python Zarr package, which implements the Zarr protocol for storing N-dimensional typed arrays, and is designed for use in distributed and parallel computing. This post provides an overview of new features in this release, and some information about future directions for Zarr.

New storage options for distributed and cloud computing

A key feature of the Zarr protocol is that the underlying storage system is decoupled from other components via a simple key/value interface. In Python, this interface corresponds to the MutableMapping interface, which is the interface that Python dict implements. I.e., anything dict-like can be used to store Zarr data. The simplicity of this interface means it is relatively straightforward to add support for a range of different storage systems. The 2.3 release adds support for storage using SQLite, Redis, MongoDB and Azure Blob Storage.

For example, here’s code that creates an array using MongoDB:

import zarr
store = zarr.MongoDBStore('localhost')
root = zarr.group(store=store, overwrite=True)
foo = bar.create_group('foo')
bar = foo.create_dataset('bar', shape=(10000, 1000), chunks=(1000, 100))
bar[:] = 42
store.close()

To do the same thing but storing the data in the cloud via Azure Blob Storage, replace the instantiation of the store object with:

store = zarr.ABSStore(container='test', account_name='foo', account_key='bar')

Support for other cloud object storage storage services was already available via other packages, with Amazon S3 supported via the s3fs package, and Google Cloud Storage supported via the gcsfs package. Further notes on using cloud storage are available from the Zarr tutorial.

The attraction of cloud storage is that total I/O bandwidth scales linearly with the size of a computing cluster, so there are no technical limits to the size of the data or computation you can scale up to. Here’s a slide from a recent presentation by Ryan Abernathey showing how I/O scales when using Zarr over Google Cloud Storage:

Optimisations for cloud storage: consolidated metadata

One issue with using cloud object storage is that, although total I/O throughput can be high, the latency involved in each request to read the contents of an object can be >100 ms, even when reading from compute nodes within the same data centre. This latency can add up when reading metadata from many arrays, because in Zarr each array has its own metadata stored in a separate object.

To work around this, the 2.3 release adds an experimental feature to consolidate metadata for all arrays and groups within a hierarchy into a single object, which can be read once via a single request. Although this is not suitable for rapidly changing datasets, it can be good for large datasets which are relatively static.

To use this feature, two new convenience functions have been added. The consolidate_metadata() function performs the initial consolidation, reading all metadata and combining them into a single object. Once you have done that and deployed the data to a cloud object store, the open_consolidated() function can be used to read data, making use of the consolidated metadata.

Support for the new consolidated metadata feature is also now available via xarray and intake-xarray (see this blog post for an introduction to intake), and many of the datasets in Pangeo’s cloud data catalog use Zarr with consolidated metadata.

Here’s an example of how to open a Zarr dataset from Pangeo’s data catalog via intake:

import intake
cat_url = 'https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml'
cat = intake.Catalog(cat_url)
ds = cat.atmosphere.gmet_v1.to_dask()

…and here’s the underlying catalog entry.

Compatibility with N5

Around the same time that development on Zarr was getting started, a separate team led by Stephan Saafeld at the Janelia research campus was experiencing similar challenges storing and computing with large amounts of neural imaging data, and developed a software library called N5. N5 is implemented in Java but is very similar to Zarr in the approach it takes to storing both metadata and data chunks, and to decoupling the storage backend to enable efficient use of cloud storage.

There is a lot of commonality between Zarr and N5 and we are working jointly to bring the two approaches together. As a first experimental step towards that goal, the Zarr 2.3 release includes an N5 storage adapter which allows reading and writing of data on disk in the N5 format.

Support for the buffer protocol

Zarr is intended to work efficiently across a range of different storage systems with different latencies and bandwidth, from cloud object stores to local disk and memory. In many of these settings, making efficient use of local memory, and avoiding memory copies wherever possible, can make a substantial difference to performance. This is particularly true within the Numcodecs package, which is a companion to Zarr and provides implementations of compression and filter codecs such as Blosc and Zstandard. A key aspect of achieving fewer memory copies has been to leverage the Python buffer protocol.

The Python buffer protocol is a specification for how to share large blocks of memory between different libraries without copying. This protocol has evolved over time from its original introduction in Python 2 and later revamped implementation added in Python 3 (with backports to Python 2.6 and 2.7). Due to the changes in its behavior from Python 2 to Python 3 and what objects supported which implementation of the buffer protocol, it was a bit challenging to leverage effectively in Zarr.

Thanks to some under-the-hood changes in Zarr 2.3 and Numcodecs 0.6, the buffer protocol is now cleanly supported for Python 2/3 in both libraries when working with data. In addition to improved memory handling and performance, this should make it easier for users developing their own stores, compressors, and filters to use with Zarr. Also it has cutdown on the amount of code specialized for handling different Python versions.

Future developments

There is a growing community of interest around new approaches to storage of array-like data, particularly in the cloud. For example, Theo McCaie from the UK Met Office Informatics Lab recently wrote a series of blog posts about the challenges involved in storing 200TB of “high momentum” weather model data every day. This is an exciting space to be working in and we’d like to do what we can to build connections and share knowledge and ideas between communities. We’ve started a regular teleconference which is open to anyone to join, and there is a new gitter channel for general discussion.

The main focus of our conversations so far has been setting up work towards development of a new set of specifications that support the features of both Zarr and N5, and provide a platform for exploration and development of new features, while also identifying a minimal core protocol that can be implemented in a range of different programming languages. It is still relatively early days and there are lots of open questions to work through, both on the technical side and in terms of how we organise and coordinate efforts. However, the community is very friendly and supportive, and anyone is welcome to participate, so if you have an interest please do consider getting involved.

If you would like to stay in touch with or contribute to new developments, keep an eye on the zarr and zarr-specs GitHub repositories, and please feel free to raise issues or add comments if you have any questions or ideas.

And finally… SciPy!

If you’re coming to SciPy this year, we’re very pleased to be giving a talk on Zarr on day 1 of the conference (Wednesday 10 July). Several members of the Zarr community will be at the conference, and there are sprints going on after the conference in a number of related areas, including an Xarray sprint on the Saturday. Please do say hi or drop us a comment on this issue if you’d like to connect and discuss anything.

Blog post written by Alistair Miles and John Kirkham.