Genomics has an obsession, and it’s called Big Data. However, unlike other obsessions, this one will probably not ruin anyone’s life—maybe only a few late nights or weekend plans for the researcher on a tight deadline.
This preoccupation was born out of necessity. It began as an innate need to understand how our genetic makeup controls every facet of human life, from our greatest mental and physical achievements to the debilitating illnesses that render us helpless to our own body systems.
The arrival of DNA microarrays marked the first time scientists were introduced to truly large sets of genomic data that required quantitative analysis and training. Assigning values to tiny fluorescent grids on glass slides and then sifting through piles of information about which genes were upregulated or downregulated became a fixation for many research groups. At the time, scientific presentations were riddled with heat map displays and descriptions of dye vs. probe ratios, clustering, and normalization values. Yet this was to be just the beginning of the field of genomics’ fascination with mass quantities of data.
The completion of the Human Genome Project in 2003 not only had scientists feverishly searching for cheaper, more expedient ways to sequence genetic information, it also further whetted their appetites for analyzing large datasets. In short order, burgeoning next-generation sequencing (NGS) platforms started producing exponentially more data, faster and less expensively than most thought possible. Now scientists are inundated with sequencing information faster than they can analyze it, leading to an information logjam that has researchers scrambling for solutions.
“Since 2005, the cost of sequencing has fallen by four orders of magnitude, and new technologies are allowing us to produce more data than ever at much higher rates,” says Daniel Meyer, COO at GenoSpace. “As data generation approaches commoditization, the greatest challenge has shifted to effective analysis and interpretation.”
In any discussion about NGS data management, it’s important to have some understanding of the steps leading to the analysis phase. Even a cursory overview should impress upon the uninitiated the overwhelming amount of information generated from each sequencing run.
The human genome has over three billion base pairs (bp), with chromosomes containing between 50 million and 250 million nucleic acid residues. To obtain usable information from these large expanses of DNA, the DNA must first be broken into much smaller overlapping stretches (known as reads). These pieces are typically in a highly disordered state and are sampled randomly during a sequencing run. To ensure with sufficient probability that the entire sequence is covered, sequencing assays rely on heavy oversampling, i.e., sequencing the overlapping stretches multiple times.
Historically, read lengths were between 500 and 1,000 bp, which meant that even for the shortest of chromosomes, a sequencing project would need to generate close to 800,000 reads to be statistically accurate. Furthermore, prior to the advent of NGS technology, sequencers could average only about 50 reads in parallel. Using some quick math, it’s not difficult to understand why the Human Genome Project (HGP) took as long as it did, especially in the early days.
When NGS exploded onto the scene about 10 years ago, it revolutionized the field by radically increasing overall speeds and driving costs down. However, NGS technology didn’t solve any of the issues associated with the size of the acquired data; on the contrary, it swelled total file sizes dramatically. NGS uses much shorter reads, currently averaging around 50 to 100 bp for whole genome sequencing (WGS), but compensates by generating vastly more reads in total and running staggering numbers of them (millions) in parallel, far more than the traditional Sanger sequencing methods that drove the HGP for the majority of its tenure.
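The quick math behind those figures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not give the coverage depth behind the 800,000-read figure, so the 8X value below is simply what falls out of dividing total sequenced bases by the chromosome length, and the "uncovered fraction" uses the textbook Poisson approximation for randomly placed reads rather than anything specific to the HGP. The function names are hypothetical.

```python
import math


def implied_coverage(region_bp, read_len_bp, n_reads):
    """Average depth C = N * L / G for N reads of length L over a region of G bp."""
    return n_reads * read_len_bp / region_bp


def reads_needed(region_bp, read_len_bp, coverage):
    """Number of reads required to reach a target average depth."""
    return math.ceil(coverage * region_bp / read_len_bp)


def expected_uncovered_fraction(coverage):
    """Poisson approximation: chance that a given base is hit by zero reads."""
    return math.exp(-coverage)


shortest_chromosome_bp = 50_000_000  # low end of the chromosome sizes quoted above
read_length_bp = 500                 # low end of the historical read lengths
parallel_reads = 50                  # approximate pre-NGS parallelism quoted above

depth = implied_coverage(shortest_chromosome_bp, read_length_bp, 800_000)
print(depth)                                                         # 8.0
print(reads_needed(shortest_chromosome_bp, read_length_bp, depth))   # 800000
print(math.ceil(800_000 / parallel_reads))                           # 16000 batches of reads
print(round(expected_uncovered_fraction(depth), 6))                  # 0.000335
```

Under these assumptions, a single small chromosome already requires about 16,000 batches of 50 reads, which gives a feel for why oversampling made early sequencing projects so slow.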
To further underscore the enormity of NGS data generation, we need only look at some of the more recent undertakings of various international consortia.
For example, in 2008 the 1000 Genomes Project set out to sequence large genomic regions from 1,000 individuals among various population groups around the globe. In 2012 the project team published their findings from 1,092 individuals across 14 distinct population groups. By way of reference, using mainly Sanger sequencing, it took the HGP 13 years to sequence one genome. Additionally, by the time they are complete, the UK10K project and the International Cancer Genome Consortium will have sequenced 10,000 and 50,000 genomes, respectively, generating petabytes (1 petabyte = 1,000 terabytes) of raw sequencing data.
Life sciences’ big data obsession is only a small portion of a larger issue that has been bubbling beneath the surface ever since NGS made its debut. In many respects, it’s a problem of infrastructure. Most investigators have little to no capacity for analyzing datasets from modern NGS platforms in ways that are useful to, and reproducible by, others in the field.
For example, the raw data from a whole exome sequencing (WES) run that has a 100 bp average read length and 50X coverage is roughly 1–1.5 terabytes, and with multiple replicates for improved statistics, one individual’s exome could require 3–5 terabytes of storage space. While the price of computer storage is always falling, having enough hard drive space to store data from multiple runs and multiple exomes could get expensive very quickly.
Once the raw sequence data is obtained from the sequencer, the heavy lifting for computational devices begins with the read mapping phase. The computer either aligns the short reads back to a reference sequence or uses overlapping sequences at the ends of the reads to assemble a de novo sequence. “Our ability to measure outweighs our ability to interpret and apply,” says Mike Lelivelt, Ph.D., senior director of bioinformatic products at Thermo Fisher Scientific. “What is truly rate limiting is clinical interpretation of a genetic marker.”
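To see how quickly those numbers add up, here is a rough storage estimate in Python. The per-run range comes from the figures above; the replicate count of three and the cohort of 1,000 exomes are hypothetical values chosen purely for illustration, as is the function name.

```python
def storage_range_tb(per_run_tb, replicates, exomes=1):
    """Scale a (low, high) per-run size in terabytes by replicates and exomes."""
    low, high = per_run_tb
    return (low * replicates * exomes, high * replicates * exomes)


per_run_tb = (1.0, 1.5)  # raw data per WES run, as quoted above
replicates = 3           # assumption: "multiple replicates" taken as three

one_exome = storage_range_tb(per_run_tb, replicates)
cohort = storage_range_tb(per_run_tb, replicates, exomes=1000)  # hypothetical cohort

print(one_exome)                          # (3.0, 4.5) terabytes
print(tuple(tb / 1000 for tb in cohort))  # (3.0, 4.5) petabytes
```

With these assumed inputs, a 1,000-exome study lands in the petabyte range before any analysis output is written.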
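Returning to the read mapping phase, the two strategies mentioned above can be illustrated with a toy Python sketch. It is deliberately naive: it assumes error-free reads, exact matches only, and a tiny made-up sequence, whereas production tools rely on indexed, mismatch-tolerant algorithms. All function names and the example sequence are hypothetical.

```python
def map_read(reference, read):
    """Reference-based mapping: return every 0-based exact-match position."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """De novo assembly: repeatedly merge the pair with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reference = "ACGTTGCATGGAC"               # made-up reference sequence
reads = ["ACGTTGC", "TGCATG", "ATGGAC"]   # overlapping error-free reads

print(map_read(reference, "TGCATG"))  # [4]
print(greedy_assemble(reads))         # ['ACGTTGCATGGAC']
```

Even in this toy form, the assembly step compares every read against every other read on each pass, which hints at why mapping millions of short reads is so computationally demanding.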
While data acquisition and management are concerns for many institutions, not everyone agrees that they are the all-encompassing stumbling block preventing NGS from becoming an integral part of precision medicine. “I actually think the thing that’s holding back NGS from the clinic is reimbursement, not data analysis,” says Shawn Baker, Ph.D., Chief Science Officer at AllSeq. “It is true that trying to analyze all the variants from a whole genome analysis is quite challenging, but if no one is paying, it doesn’t matter how simple and straightforward the data analysis process is.”
Sean Scott, VP of Clinical Genomics at Qiagen, takes a similar stance, stating that the “demonstration of actionability and clinical utility of NGS tests is still a growing issue for clinical labs, needing payer reimbursement for the economics to become favorable for the lab and the greater community.”
What can be agreed upon is that there are plenty of areas of NGS that require standardization for it to live up to its potential as an extremely powerful tool for use in clinical medicine. “I believe we are going to need the standardization and validation of algorithms on real world data to advance the use of NGS in the clinic and hopefully lead to a more phenotypic rather than disease term approach to treatment,” says Mark Hughes, Ph.D., Senior Product Manager for Thomson Reuters’ MetaCore™.
It’s often said that the only cure for an obsession is to get a new one. At the moment, that would seem easier said than done for the field of genomics, but because scientists are always looking for easier, faster, and more efficient analysis methods, it will come as little shock when researchers move on to the next groundbreaking application. For now, however, the future of NGS and Big Data should provide investigators with improved methods that will simplify the data output stream and the management of that information for use with various bioinformatics analysis programs.
“I believe that Cloud computing, and the ability to co-locate big data with highly scalable computing resources, is already making a positive impact on streamlining data analysis, and that this trend will continue as more data and a wider variety of analysis applications move into the cloud,” says Scott Kahn, Ph.D., VP of Commercial Enterprise Informatics at Illumina.
Dr. Lelivelt agrees, adding that “while data security must be the first priority, the benefit of centralized data storage and scalable computing resources will drive the clinical analysis into the Cloud.”
Additionally, advances on the sequencing side could indirectly aid the data analysis stream, allowing for more accurate read alignments and opening up opportunities for new investigational methodology.
“The most interesting thing that’s on the visible horizon is long reads,” explains Dr. Baker. “Getting truly long reads (>10kb) will dramatically improve the alignment process and allow for new analyses like haplotyping that just aren’t possible with short reads. Long read technology is available, but it’s still an order of magnitude or two more expensive than the best short read platforms.”
Lastly, another interesting technology waiting for its time in the spotlight is third-generation sequencing. This method skips the DNA amplification step, avoiding PCR bias, and allows genetic material to be sequenced directly at the single-molecule level through the use of engineered polymerases that can sustain longer average read lengths. Currently, only a few companies provide platforms for this sequencing method, and it comes at a premium cost.
Regardless of the analysis method, scientists will always try to improve the ways they pore over mountains of data for that small nugget of information that may lead them to a career-defining discovery.