What's the Big Deal about Big Data?
Blog Oct 15, 2012
Big Data and the Data-Intensive Lab
The data-intensive nature of scientific research is currently driving the emergence of big data solutions that can gather, analyze, and transport extremely large volumes of data among multiple locations worldwide.
Wikipedia defines big data as "a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools. The challenges include capture, storage, search, sharing, analysis, and visualization".
Laboratories have been dealing with large amounts of data for decades, and the volume grows dramatically every year. The problem has long been how to manage and mine that data for relevant information. In the current data-intensive environment, those data management tasks have become far more difficult.
What's interesting is how big data is changing the nature of data management in the lab. The relational databases and desktop statistics and visualization packages that were so effective in the past are no longer up to the task. Instead, big data relies on massively parallel software running on a large number of servers, often more than a single organization can afford to run on its own.
One such solution is an open-source NoSQL database designed to deliver massive amounts of data over web and cloud applications. NoSQL databases generally do not store data in relational tables and thus generally do not use SQL as the query language. What they do use is a distributed, fault-tolerant architecture that manages the data redundantly on multiple servers.
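To make "manages the data redundantly on multiple servers" concrete, here is a minimal sketch in Python. It is a toy, not how any particular NoSQL product works: the `ReplicatedStore` class, its method names, the sample key, and the record fields are all hypothetical, and each "server" is just an in-memory dictionary. The idea it illustrates is that every record is written to more than one node, so a read can still succeed when one node is lost.

```python
import hashlib


class ReplicatedStore:
    """Toy key-value store that keeps each record on several 'servers'."""

    def __init__(self, node_count=4, replicas=2):
        if not 1 <= replicas <= node_count:
            raise ValueError("replicas must be between 1 and node_count")
        self.replicas = replicas
        # Each node is an in-memory dict standing in for a real server.
        self.nodes = [{} for _ in range(node_count)]
        self.down = set()  # indices of nodes we pretend have failed

    def _owners(self, key):
        # Hash the key to pick a starting node, then take the next
        # `replicas` nodes in order, wrapping around the node list.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Write the record to every owning node that is currently up.
        written = 0
        for idx in self._owners(key):
            if idx not in self.down:
                self.nodes[idx][key] = value
                written += 1
        if written == 0:
            raise RuntimeError("no replica available for write")

    def get(self, key):
        # Read from the first owning node that is up and holds the key.
        for idx in self._owners(key):
            if idx not in self.down and key in self.nodes[idx]:
                return self.nodes[idx][key]
        raise KeyError(key)

    def fail(self, idx):
        # Simulate a server outage.
        self.down.add(idx)


# Hypothetical sample record, for illustration only.
store = ReplicatedStore(node_count=4, replicas=2)
store.put("sample-001", {"instrument": "sequencer-A", "reads": 1200000})

primary = store._owners("sample-001")[0]
store.fail(primary)                # lose the first server holding the record
record = store.get("sample-001")   # the second copy still answers the read
```

Real systems add much more on top of this, such as repairing stale copies and rebalancing data when servers join or leave, but the basic trade is the same: extra storage in exchange for tolerance of server failures.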
NoSQL databases don't replace databases such as Oracle RDBMS. Instead, they provide an entirely new way to manage data, because they allow applications to collect and analyze massive amounts of information from numerous sources.
Life sciences laboratories are particularly affected by the big data trend. In genomics, for instance, petabyte-scale networks are emerging that better support genomic research and new clinical requirements.
There is also growing demand to manage big data using cloud computing platforms, and to move large volumes of next-gen DNA sequencing and research data at high speed over vast distances. The challenges of moving this data into and out of the cloud are being addressed. This area has been led by Genentech, one of the early adopters of big data and cloud computing solutions to support its research.
Perhaps laboratories should have seen this coming. It is the inevitable result of better instrumentation that generates more data faster, which in turn demands better analytical solutions. But hindsight is always 20/20.