Managing the Genomics Data Deluge
Article May 04, 2017
Credit: Jer Thorp on flickr
Genomic data is one of the fastest-growing datasets in the world. A recent Intel analysis stated that it would take 7.3 zettabytes, or 7,300,000,000,000 GB, of data to store the genomes of our global population. This is equal to 50 percent of all data on the internet in 2016 and does not factor in the data created when analysing and using this information.
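As a quick check of the unit conversion quoted above, the short sketch below converts zettabytes to gigabytes. It assumes decimal (SI) units, where 1 ZB is 10^21 bytes and 1 GB is 10^9 bytes; the function name is illustrative and does not come from the Intel analysis.

```python
from fractions import Fraction

# Assumption: decimal (SI) units, not binary (ZiB/GiB) units.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21


def zettabytes_to_gigabytes(zettabytes):
    """Convert a size in zettabytes to gigabytes using exact arithmetic."""
    return Fraction(zettabytes) * BYTES_PER_ZB / BYTES_PER_GB


# The 7.3 ZB figure cited in the article.
total_gb = zettabytes_to_gigabytes("7.3")
print(f"{int(total_gb):,} GB")
```

Using `Fraction` rather than floating point keeps the result exact, which matters when the point of the exercise is to check a count of zeros.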
Over time, genomics will play an increasingly important role in our healthcare, particularly in realising the promise of precision healthcare. With such tremendous amounts of data involved, adequate storage capacity and methodology will be crucial for the advancement of genomic research and medicine.
To learn about the real-world challenges of managing all this data, and how you can reduce your genomics data footprint, we spoke to Dr. Warren Kaplan, Chief of Informatics at the Garvan Institute of Medical Research, and Rafael Feitelberg, Geneformics CEO. The Garvan Institute of Medical Research has recently integrated Geneformics technology into its workflow.
How is next-generation sequencing (NGS) utilised at Garvan, and how much data does your institution produce?
The Garvan Institute uses whole genome sequencing (WGS) technology to improve our understanding of genome biology and its impact on disease, as well as to advance the use of genomic information in patient care.
The analysis of a single human genome requires at least 200 gigabytes of raw data, meaning that researchers at the Garvan produce and work with significant amounts of data every day. Running at our theoretical maximum, our WGS for research purposes alone could generate more than 1.5 petabytes of data per year.
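To put those two figures side by side, here is a rough back-of-the-envelope sketch. It assumes decimal units (1 PB = 1,000,000 GB) and treats the 200 GB per genome and 1.5 PB per year figures as exact, although the interview gives them as "at least" and "more than" respectively, so the result is only an order-of-magnitude estimate and not a figure stated by Garvan.

```python
# Assumption: decimal (SI) units, so 1 PB = 1,000,000 GB.
GB_PER_PB = 10**6


def genomes_per_year(petabytes_per_year, gb_per_genome):
    """Estimate how many genomes fit in a yearly data volume."""
    return petabytes_per_year * GB_PER_PB / gb_per_genome


# Figures quoted in the interview, treated here as exact for illustration.
estimate = genomes_per_year(petabytes_per_year=1.5, gb_per_genome=200)
print(f"Roughly {estimate:,.0f} genomes per year")
```

Because the per-genome figure is a lower bound, the true number of genomes behind a given yearly volume would be lower than this estimate.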
Why was it important for you to reduce your genomics data footprint?
Reducing the size of the data we work with helps to lower the cost of our work, as well as make it more efficient.
Our goal is to be able to run large genomic data sets through our complex quality control and analysis pipeline quickly, so that we can continue to expand our research, without reducing the quality of the outputs or compromising on security.
Reducing the size of our data footprint also helps us collaborate with others in the field, which multiplies the benefits of our work.
What points did you consider when selecting an approach for tackling your data storage problem?
We had a number of considerations. The first was compression ratio. We aimed to achieve a dramatic reduction in our footprint to manage costs. Second, we insisted on a solution that preserved full data integrity: the process had to be completely lossless, so that the data we got back after decompression would be identical to what we had before compression. Finally, speed and ease of use were important for adoption and use by our team and partners.
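The first two criteria, compression ratio and lossless round-tripping, can be checked mechanically. The sketch below is a minimal illustration that uses Python's general-purpose `gzip` module as a stand-in; it is not the Geneformics technology, and the sample FASTQ-like record is made up for the example.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 checksum of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def check_lossless(original: bytes):
    """Compress, decompress, and compare checksums.

    Returns (is_lossless, compression_ratio), where the ratio is
    original size divided by compressed size.
    """
    compressed = gzip.compress(original)
    restored = gzip.decompress(compressed)
    ratio = len(original) / len(compressed)
    return sha256_hex(original) == sha256_hex(restored), ratio


# Hypothetical, highly repetitive FASTQ-like sample (not real sequencing data).
sample = b"@read1\nGATTACAGATTACAGATTACA\n+\nIIIIIIIIIIIIIIIIIIIII\n" * 1000
is_lossless, ratio = check_lossless(sample)
print(f"lossless: {is_lossless}, compression ratio: {ratio:.1f}x")
```

Comparing checksums of the original and restored data is a common way to confirm integrity without keeping both copies around for a byte-by-byte comparison. Real sequencing data is far less repetitive than this sample, so the ratio printed here says nothing about what any tool achieves in practice.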
What other data management and analysis challenges do you currently face at Garvan?
Storing data without appropriate metadata is limiting. The fantastic work by the NIH and others on the concept of a Data Commons, and on making data Findable, Accessible, Interoperable, and Reusable (FAIR), points to where we’re heading.
A key challenge is that you pay to store data whether you are using it or not. Therefore, finding sustainable models that allow us to grow our datasets without costs blowing out is vital.
As our datasets grow, it is also becoming increasingly important for users to be able to bring their analytics to the data, rather than downloading the data. We’re making quite a bit of headway in this space, for example by supporting analytics in situ for people accessing our cohorts.
A key requirement for doing the most efficient analytics at the best price is the ability to compute in any environment. We therefore use very diverse environments, and being able to move highly compressed data into and out of them eases the entire process.