
How Can Object Storage Help Manage Research Data?



Today’s “left-over” data can be the basis of tomorrow’s breakthrough. As we keep data for longer and try harder to share and re-use it, it becomes critical that data is accurately catalogued and easily retrievable. Although long touted as the answer to the scalability crisis facing file systems, object storage remains niche outside of web-scale deployments.


What is object storage?

Object storage is a way of storing huge numbers of files without having to manage a folder-based file system or worry about where each file physically lives. Object stores can be distributed over many sites, even globally. This keeps data available during major site failures and can allow faster access to data from remote locations.
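
To make this concrete, most object stores expose an S3-compatible API, so storing and retrieving a file is a matter of writing and reading a key rather than navigating directories. The sketch below is illustrative only: the endpoint, bucket and key names are placeholders, and credentials are assumed to come from the environment.

# Minimal sketch: store and retrieve a file by key on an S3-compatible object store.
# The endpoint, bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.ac.uk")

# Write: the "path" is just a flat key, not a directory that has to exist first.
with open("scan_0001.tif", "rb") as f:
    s3.put_object(Bucket="research-data", Key="project-x/scan_0001.tif", Body=f)

# Read it back from anywhere that can reach the endpoint.
obj = s3.get_object(Bucket="research-data", Key="project-x/scan_0001.tif")
data = obj["Body"].read()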


Parallel file systems are great at delivering multiple petabytes as a single file system, but for some users the volume of data is becoming a human problem rather than a technological one: the directory structure becomes too cumbersome to navigate and a new workflow is needed. This is the space in which object storage could truly thrive.

All around the world

At OCF we work with many object storage vendors. Many use erasure coding to reduce the capacity overhead of replication, and when you need site-level resiliency, object stores start to become unbeatable. Usually, when we need files to remain available even when a whole building goes offline, we end up storing the data at least three times: mirrored copies on two different sites, with a third copy periodically saved to a separate disaster-recovery site. With a three-site object store, erasure coding can reduce this from three copies to approximately 1.7 copies. It does this by splitting the file and generating parity data, so that each site stores part of the file plus enough extra information to fill in the blanks if one site’s data is lost.
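
The arithmetic behind that figure is easy to sketch. In the toy calculation below, an object is split into k data fragments plus m parity fragments spread evenly across three sites; any k fragments are enough to rebuild the object, so a whole site can be lost while storing far less than three full copies. The fragment counts are illustrative and not any particular vendor's scheme.

# Back-of-envelope sketch of the erasure-coding arithmetic described above.
# k data fragments plus m parity fragments are spread evenly across 3 sites;
# any k fragments are enough to rebuild an object. Numbers are illustrative.
def geo_ec(k, m, sites=3):
    total = k + m
    per_site = total / sites                      # fragments held by each site
    survives_site_loss = (total - per_site) >= k  # remaining fragments can still rebuild
    expansion = total / k                         # raw capacity used per byte stored
    return per_site, survives_site_loss, expansion

per_site, ok, expansion = geo_ec(k=7, m=5)
print(per_site, ok, round(expansion, 2))  # 4.0 True 1.71 -- versus 3.0 for full three-way mirroring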

The greatest disadvantage is that, with erasure coding, no single site holds a full copy of the object. When an erasure-coded object is retrieved, an inter-site network transfer and some compute overhead are needed to re-assemble it. The result is that space-efficient, dispersed object stores are inherently slower.

Teamwork

An object store should be a repository or means of sharing data. We shouldn’t try to replace file storage with objects, but rather use the strengths of each in combination to achieve more.

As a specialist in high-performance computing (HPC) and research data storage, I view object storage with my HPC hat on. In HPC land, many of the largest data sets are generated by scientific instruments like high-resolution microscopes, spectrometers and sequencers. In an object-based workflow, files generated by these instruments can immediately be turned into objects and tagged with appropriate metadata such as the researcher, project, instrument settings, what was sampled and the conditions under which the data may be shared (for example, as required by some funding bodies). The resulting objects are then ingested into an object store for preservation.
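
As a rough sketch of what that ingest step might look like against an S3-compatible store, the example below attaches descriptive metadata when the object is written. The bucket name and metadata keys (researcher, project, instrument and so on) are invented for illustration rather than any standard schema.

# Hypothetical ingest step: push an instrument output file into the object store
# with descriptive metadata attached. Keys and values are illustrative only.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.ac.uk")

with open("run_042/image_stack.tif", "rb") as f:
    s3.put_object(
        Bucket="instrument-archive",
        Key="microscopy/run_042/image_stack.tif",
        Body=f,
        Metadata={
            "researcher": "j.smith",
            "project": "PRJ-2021-017",
            "instrument": "confocal-03",
            "sample": "HeLa, fixed",
            "sharing": "open-after-embargo",  # e.g. a funder-mandated condition
        },
    )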

An overview of object storage. Credit: Martin Ellis


If a researcher needs to revisit the output, they can do so, and they can easily cache whole projects on their local systems.

With data catalogued in an object storage solution’s metadata management system, it can be published and shared, giving researchers who want more data without extra funding the ability to query past projects for similar instruments and samples.
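
A toy illustration of the kind of query this enables is shown below: it simply walks a bucket and inspects each object's metadata, which is fine for a sketch but is exactly the brute-force search a real metadata management system would replace with an indexed catalogue. The bucket, instrument and sample values are placeholders.

# Toy query: find published objects, from any project, that used a given instrument
# and sample type, by walking the bucket and inspecting each object's metadata.
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.ac.uk")

matches = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="instrument-archive"):
    for entry in page.get("Contents", []):
        meta = s3.head_object(Bucket="instrument-archive", Key=entry["Key"])["Metadata"]
        if meta.get("instrument") == "confocal-03" and "HeLa" in meta.get("sample", ""):
            matches.append(entry["Key"])

print(f"{len(matches)} candidate datasets found")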

Integrate with HPC

Although object stores tend to be too slow to efficiently support HPC resources, the programmatic nature of object interfaces allows them to integrate well into HPC workflows.

An HPC system usually consists of a cluster of compute servers with a shared high-speed file system known as scratch storage. Schedulers such as SLURM are used to queue and start compute jobs on the compute servers. Often the scheduler can be set up to pre-stage data, copying the data a job needs onto the cluster’s scratch storage so that it is ready before the job runs. Object storage solutions offer an API (application programming interface) for data access, which makes it simple to copy the required files from the object store as part of the job submission script.
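
As a minimal sketch, a job submission script might call a stage-in step like the one below before the main computation starts. The endpoint, bucket, prefix and scratch path are placeholders, and a real deployment would also handle credentials and retries.

# Hypothetical stage-in step, run at the start of a job submission script:
# copy a project's input objects onto the cluster's scratch file system.
import os
import boto3

s3 = boto3.client("s3", endpoint_url="https://objects.example.ac.uk")
scratch = os.environ.get("SCRATCH", "/scratch/project-x")

for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket="research-data", Prefix="project-x/inputs/"):
    for entry in page.get("Contents", []):
        if entry["Key"].endswith("/"):  # skip "directory" placeholder objects
            continue
        dest = os.path.join(scratch, os.path.basename(entry["Key"]))
        s3.download_file("research-data", entry["Key"], dest)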

Similarly, any output written to the HPC’s scratch storage can be assembled into an object and published to the object store after the job finishes. Objects can be tagged not just with the date and time, but also with the input parameters and the submission script itself, allowing researchers to manage and locate outputs from many similar but distinct iterations more easily.
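
A matching stage-out step might look something like this: bundle the run's results, then publish them with the run's parameters and submission script recorded as metadata. The names and values are hypothetical.

# Hypothetical stage-out step, run after the job finishes: bundle the results,
# then publish them with the run's parameters and submission script attached.
import shutil
import boto3

archive = shutil.make_archive("/scratch/project-x/run_042_results", "gztar",
                              "/scratch/project-x/run_042")

s3 = boto3.client("s3", endpoint_url="https://objects.example.ac.uk")
s3.upload_file(
    archive, "research-data", "project-x/outputs/run_042_results.tar.gz",
    ExtraArgs={"Metadata": {
        "completed": "2021-06-01T14:32:00Z",
        "parameters": "solver=gmres tol=1e-6",
        "submit-script": "run_042.sbatch",
    }},
)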

Fueling AI

A growing number of our life science customers are adopting artificial intelligence (AI), and an object-based workflow is a natural fit here. For AI you typically need a lot of data, often more than one practitioner can generate.

The ability to pull many thousands of output files from potentially hundreds of projects, each tagged with what was sampled and observed, would be a treasure trove for AI researchers wanting to expand their datasets.

In conclusion, although I do not foresee object storage replacing file storage for active research data, it does offer an excellent means to curate and preserve data efficiently in a geographically dispersed solution, with a programmatic interface to support research computing systems.

Martin Ellis is a pre-sales engineer at OCF