Managing the Genomics Data Deluge
Article May 04, 2017
Credit: Jer Thorp on Flickr
Genomic data is one of the fastest-growing datasets in the world. A recent Intel analysis stated that it would take 7.3 zettabytes, meaning 7,300,000,000,000 GB, of data to store the genomes of our global population. This is equal to 50 percent of all data on the internet in 2016 and does not factor in the data created when analysing and using this information.
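The unit conversion behind that figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal (SI) units, where 1 ZB is 10^21 bytes and 1 GB is 10^9 bytes; the variable names are illustrative only.

```python
# Assumption: decimal (SI) units, not binary (ZiB/GiB) units.
ZB = 10**21  # bytes in one zettabyte
GB = 10**9   # bytes in one gigabyte

# 7.3 ZB written with integers to avoid floating-point rounding.
total_bytes = 73 * ZB // 10
total_gb = total_bytes // GB

print(f"{total_gb:,} GB")  # prints "7,300,000,000,000 GB"
```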
Over time, genomics will play an increasingly important role in our healthcare, particularly in realising the promise of precision healthcare. With such tremendous amounts of data involved, adequate storage capacity and methodology will be crucial for the advancement of genomic research and medicine.
To learn about the real-world challenges of managing all this data, and how you can reduce your genomics data footprint, we spoke to Dr. Warren Kaplan, Chief of Informatics at the Garvan Institute of Medical Research, and Rafael Feitelberg, CEO of Geneformics. The Garvan Institute of Medical Research has recently integrated Geneformics technology into its workflow.
How is NGS utilised at Garvan, and how much data does your institution produce?
The Garvan Institute uses whole genome sequencing (WGS) technology to improve our understanding of genome biology and its impact on disease, as well as to advance the use of genomic information in patient care.
The analysis of a single human genome requires at least 200 gigabytes of raw data, meaning that researchers at the Garvan produce and work with significant amounts of data every day. Running at our theoretical maximum, our WGS for research purposes alone could generate more than 1.5 petabytes of data per year.
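As a rough sanity check, the two figures quoted above imply a throughput on the order of several thousand genomes per year. The sketch below assumes decimal units and treats "at least 200 gigabytes" and "more than 1.5 petabytes" as exact values, so the result is only an order-of-magnitude estimate, not a figure reported by Garvan.

```python
# Assumption: decimal (SI) units and the quoted bounds taken as exact values.
GB = 10**9
PB = 10**15

bytes_per_genome = 200 * GB    # "at least 200 gigabytes" per genome
bytes_per_year = 15 * PB // 10  # "more than 1.5 petabytes" per year

genomes_per_year = bytes_per_year // bytes_per_genome
print(f"~{genomes_per_year:,} genomes per year")  # prints "~7,500 genomes per year"
```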
Why was it important for you to reduce your genomics data footprint?
Reducing the size of the data we work with helps to lower the cost of our work, as well as make it more efficient.
Our goal is to be able to run large genomic data sets through our complex quality control and analysis pipeline quickly, so that we can continue to expand our research, without reducing the quality of the outputs or compromising on security.
Reducing the size of our data footprint also helps us collaborate with others in the field, which multiplies the benefits of our work.
What points did you consider when selecting an approach for tackling your data storage problem?
We had a number of considerations. The first was compression ratio. We aimed to achieve a dramatic reduction in our footprint to manage costs. Second, we insisted on a solution that preserved full data integrity: the process had to be completely lossless, so that the data we recovered after decompression would be identical to what we had before compression. Finally, speed and ease of use were important for adoption and use by our team and partners.
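The first two criteria, compression ratio and losslessness, can both be measured with a simple round-trip test. The sketch below is not the Geneformics method: it uses Python's general-purpose zlib as a stand-in compressor, and the function name `roundtrip_report` and the synthetic FASTQ-like sample are hypothetical, chosen only for illustration.

```python
import hashlib
import zlib


def roundtrip_report(raw: bytes, level: int = 9) -> tuple[float, bool]:
    """Compress and decompress `raw`; return (compression ratio, lossless?)."""
    compressed = zlib.compress(raw, level)
    restored = zlib.decompress(compressed)
    # Lossless means the restored bytes have the same digest as the input.
    lossless = hashlib.sha256(restored).digest() == hashlib.sha256(raw).digest()
    ratio = len(raw) / len(compressed)
    return ratio, lossless


# Synthetic, highly repetitive FASTQ-like records (not real sequencing reads),
# so the ratio here says nothing about what real genomic data would achieve.
record = b"@read\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n"
sample = record * 1000

ratio, lossless = roundtrip_report(sample)
print(f"ratio={ratio:.1f}x lossless={lossless}")
```

Comparing checksums rather than the raw bytes mirrors how integrity is usually verified on files too large to hold in memory twice.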
What other data management and analysis challenges do you currently face at Garvan?
Data storage without appropriate metadata is limiting. We’re heading towards the concept of a Data Commons, building on fantastic work by the NIH and others on making data Findable, Accessible, Interoperable, and Reusable (FAIR).
A key challenge is that you pay to store data whether you are using it or not. Therefore, finding sustainable models that allow us to grow our datasets without costs blowing out is vital.
As our datasets grow, it is also becoming increasingly important for users to be able to bring their analytics to the data, rather than downloading the data. We’re making quite a bit of headway in this space, for example by supporting in situ analytics for people accessing our cohorts.
A key requirement for doing the most efficient analytics at the best price is the ability to compute in any environment. We therefore use very diverse environments, and being able to move highly compressed data into and out of them eases the entire process.