We Can Sequence the World…but What Then?
With roughly 8.7 million species on Earth, you might assume that sequencing the DNA of all of them would take decades. According to Professor Knut Reinert, PhD, of the Free University of Berlin, that massive undertaking could be completed in just 10 days using roughly 10,000 of today’s fastest DNA sequencers. That’s because sequencing has become about 4,000 times faster in the last 10 to 15 years.
In that same time frame, sequencing has become about 10 million times cheaper. A human genome can now be sequenced for as little as $1,000 in high-volume laboratories. According to the McKinsey Global Institute, that price will drop by another factor of 10 within the next decade. Collecting genomic data is becoming fast, cheap, and relatively easy.
Handling the Torrent of Genomic Data
The bigger challenge will be analyzing all the data that is generated. One study estimates that 100 million to 2 billion individual genomes will be sequenced by 2025, generating 2 to 40 exabytes of data. Buried in all that data will be insights that could fundamentally transform our understanding of life, from cellular biology and disease mechanisms to drug discovery and agriculture. Uncovering those insights will require massive amounts of analysis.
Professor Reinert and his team are working with Intel to provide researchers and laboratories with new tools to help them tackle this immense analytic challenge. Because of their groundbreaking work in bioinformatics, the Free University of Berlin was selected as a new Intel Parallel Computing Center (IPCC) in 2015, with Professor Reinert at the helm. Since then, his team has been working to improve performance for important genomic algorithms by optimizing the codes to take better advantage of modern multicore, multi-threaded processors.
A Flexible Software Library for High-Speed Genomics
Once Professor Reinert’s team optimizes an algorithm, they package it in SeqAn*, an open source library of genomics-focused applications. The algorithms in SeqAn are optimized not only for performance, but also for ease of use, maintainability, and portability in standards-based hardware environments. They can be combined to create complex, flexible analytic pipelines, and they can be used freely by both academic and commercial organizations.
The purpose of SeqAn is to accelerate genomics research and expand its use by simplifying pipeline development, driving down costs, and delivering order-of-magnitude or greater performance gains. Good progress has been made toward these goals, and SeqAn is growing in popularity. According to Professor Reinert, it has already been cited in more than 300 research studies.
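To give a sense of what working with the library looks like, the short sketch below performs a global alignment of two DNA sequences through SeqAn’s C++ interface. It follows the pattern shown in SeqAn’s public tutorials; the exact headers, type names, and scoring parameters used here are assumptions and may differ between library versions.

```cpp
// Illustrative sketch of a basic SeqAn-style global alignment.
// The headers, types, and scoring scheme follow the pattern from
// SeqAn's public tutorials and may vary between library versions.
#include <iostream>
#include <seqan/align.h>

using namespace seqan;

int main()
{
    // Two short DNA sequences to align.
    Dna5String seqH = "CGATTTCACTAGA";
    Dna5String seqV = "CGAAATTCATAGA";

    // An Align object holds one gapped row per sequence.
    Align<Dna5String> align;
    resize(rows(align), 2);
    assignSource(row(align, 0), seqH);
    assignSource(row(align, 1), seqV);

    // Simple scoring scheme: match = 1, mismatch = -1, gap = -1.
    int score = globalAlignment(align, Score<int, Simple>(1, -1, -1));

    std::cout << "Alignment score: " << score << "\n" << align;
    return 0;
}
```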
Modernizing Code for High Performance
Complex software algorithms have been developed over many years to analyze genomic data. These algorithms offer amazing functionality, but they have been written primarily to enable accurate science. Most of the developers were experts in genetics and bioinformatics, but not in the complex hardware and software issues that are pivotal to maximizing computing efficiency and application throughput. As a result, performance and scalability issues often arise in high-volume and time-critical environments. Ongoing increases in the size and complexity of genomic data add to these challenges.
One reason it has taken so long for algorithm optimization to occur is that it requires expertise across multiple disciplines to fully understand the science, math, software, and hardware. Every base pair matters when looking for genetic markers, so software developers must understand how their optimization efforts impact not only the speed of analysis, but also the accuracy and reliability of the results. In some cases, Professor Reinert and his team have had to create new algorithmic approaches so that the code could be scaled efficiently across large numbers of threads and cores.
Their code optimization efforts focus on two main strategies: vectorization and multi-threading.
More Performance Per Core
Most processors today include integrated support for single instruction multiple data (SIMD) execution strategies. With SIMD, a single instruction can be applied simultaneously to multiple data points, a process known as vector processing.
Depending on the algorithm, vector processing can dramatically increase the number of calculations that can be performed per clock cycle. Hardware support for vector processing continues to advance. For example, the latest Intel Xeon Scalable processors support 512-bit vectors, versus 256-bit vectors in the previous generation. This effectively doubles the maximum number of calculations that can be performed per clock cycle.
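To make the idea concrete, the minimal C++ sketch below (illustrative only, not taken from SeqAn) shows the kind of position-independent loop that a compiler can map onto SIMD instructions. Because each base comparison fits in a single byte, a 512-bit vector register can hold 64 comparisons at once, versus 32 for a 256-bit register.

```cpp
// Illustrative sketch of a loop that compilers can auto-vectorize
// (for example, with g++ -O3 -march=native). Not SeqAn code.
#include <cstddef>
#include <iostream>
#include <vector>

// Count positions where two equal-length DNA sequences agree. Each
// iteration is independent of the others, so a single SIMD instruction
// can compare many bases per clock cycle instead of one.
std::size_t count_matches(const std::vector<char>& a,
                          const std::vector<char>& b)
{
    std::size_t matches = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        matches += (a[i] == b[i]);
    return matches;
}

int main()
{
    std::vector<char> read     {'A', 'C', 'G', 'T', 'A', 'C', 'G', 'T'};
    std::vector<char> reference{'A', 'C', 'G', 'A', 'A', 'C', 'G', 'T'};
    std::cout << count_matches(read, reference) << " of "
              << read.size() << " bases match\n";
    return 0;
}
```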
More Cores Per Workload
Twenty years ago, mainstream computer processors were designed to handle a single stream of software instructions. Each processor would process one software instruction after the other in a linear fashion. Today, a single processor may include dozens of “cores,” each of which functions as an independent execution engine that can process its own software stream, or “thread.” With this approach, a single processor can execute dozens or even hundreds of instruction streams simultaneously.
However, most software code is designed to run as a single thread, which means that the application can only take advantage of a single processor core. To improve parallel throughput, software developers can often break the sequential stream of code into multiple threads that can be run concurrently across multiple processor cores. For software that has sufficient intrinsic parallelism, this process can be used to generate code that can take advantage of all available cores in modern processors, servers, and clusters.
Mainstream multicore server processors, such as Intel Xeon Scalable processors, include as many as 28 cores and support two simultaneous threads per core. Many-core processors, such as Intel® Xeon Phi™ processors, provide up to 72 cores and support four threads per core. Taking full advantage of these parallel resources can have a substantial impact on performance.
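The sketch below, which assumes OpenMP and uses purely illustrative names rather than SeqAn’s own code, shows the basic pattern: because each read in a batch can be processed independently, a single directive lets the runtime spread the loop iterations across all available cores and hardware threads.

```cpp
// Illustrative multi-threading sketch using OpenMP (compile with -fopenmp).
// Not SeqAn code; score_read() stands in for any per-read analysis step.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>
#include <omp.h>

// Toy per-read calculation (GC count) standing in for real work such as
// alignment or filtering.
int score_read(const std::string& read)
{
    int gc = 0;
    for (char base : read)
        gc += (base == 'G' || base == 'C');
    return gc;
}

int main()
{
    std::vector<std::string> reads(100000, "ACGTACGTACGTACGT");
    std::vector<int> scores(reads.size());

    // Each iteration is independent, so OpenMP hands chunks of the read
    // batch to different threads, keeping every available core busy.
    #pragma omp parallel for
    for (std::size_t i = 0; i < reads.size(); ++i)
        scores[i] = score_read(reads[i]);

    std::cout << "Processed " << reads.size() << " reads using up to "
              << omp_get_max_threads() << " threads\n";
    return 0;
}
```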
Order of Magnitude and Higher Performance Gains
Vectorization and multi-threading are well suited to the computational demands of genome analysis, which relies on a series of relatively simple calculations that are performed iteratively across large data sets. Increasing performance-per-core while also utilizing more cores has a multiplicative impact and can deliver dramatic overall performance gains. Performance tests to date show:
- Substantial per-core performance improvements through vectorization. Higher per-core performance is clearly indicated in benchmarks using the latest Intel Xeon Gold 6148 processor (Figure 1). Although these newer processors offer incremental core increases (up to 20 cores versus up to 18 on previous generation processors), the additional cores cannot account for the 1.6X to 2.7X higher performance provided by the new processors. A large portion of these gains can be attributed to the enhanced vector support.
Figure 1. The optimized SeqAn code takes full advantage of the advanced vector support in the latest Intel® Xeon® Gold 6148 processor, which helps to enable performance gains as high as 1.6X to 2.7X across a range of SeqAn workloads versus the previous-generation Intel® Xeon® processor E5-2697 v4. 8
- Near-linear scalability across large numbers of cores. Before the code was optimized, SeqAn could be run efficiently only on a single thread. Benchmarks using the new optimized code demonstrate excellent scalability across the large number of cores and threads provided by the Intel® Xeon Phi™ processor 7250. Runtimes were decreased by as much as 55X when run on all 68 cores versus the same workload running on a single core (Figure 2).
Performance and scalability improvements of this magnitude are transformative, potentially reducing the time required for complex genome analyses from days to just minutes. Professor Reinert’s software optimization efforts focus on extracting as much parallelism as possible from the code, so the performance benefits will continue to increase as core densities rise in future processor generations.
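The multiplicative effect of combining the two techniques can be seen in a single loop nest: threads divide the reads among cores, while SIMD lanes handle many bases per instruction within each read. The sketch below assumes OpenMP and uses illustrative data; it is not SeqAn’s implementation.

```cpp
// Illustrative sketch combining multi-threading and vectorization with
// OpenMP (compile with -fopenmp). Not SeqAn code.
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main()
{
    const std::string reference = "ACGTACGTACGTACGTACGTACGTACGTACGT";
    std::vector<std::string> reads(50000, reference);
    std::vector<int> matches(reads.size(), 0);

    // Outer loop: reads are distributed across cores (multi-threading).
    #pragma omp parallel for
    for (std::size_t r = 0; r < reads.size(); ++r) {
        int m = 0;
        // Inner loop: per-base comparisons are packed into SIMD lanes.
        #pragma omp simd reduction(+ : m)
        for (std::size_t i = 0; i < reference.size(); ++i)
            m += (reads[r][i] == reference[i]);
        matches[r] = m;
    }

    std::cout << "Read 0 matches the reference at " << matches[0]
              << " of " << reference.size() << " positions\n";
    return 0;
}
```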
Unlocking the Secrets of Life
Optimizing algorithms for high performance is a critical step toward handling the coming torrent of genomic data. The stakes are high. The McKinsey Global Institute estimates that next-generation genomics has the potential to impact the global economy by as much as $1.6 trillion per year by 2025, and the impact will ultimately go far beyond economics.
We are standing at a tipping point in our ability to understand the genetic code and its detailed impact on plants, animals, humans, populations, and even whole ecosystems. Today’s advances in DNA sequencing are happening alongside complementary advances in life sciences research tools, such as molecular imaging and molecular dynamics (computer simulations of atomic and molecular interactions).
In combination with these technologies, fast genome analysis provides the foundation for understanding, in unprecedented detail, the role that individual genes and combinations of genes play in cellular processes. These advances will deepen our understanding of life in its many forms, improve our ability to heal and shape ecosystems, and pave the way for precision medicine that can adapt to the unique physiology of each individual.
Information on performance and benchmarks can be found at http://www.intel.com/performance/datacenter.
References
- Source: Number of species on Earth tagged at 8.7 million, by Lee Sweetlove, Nature, August 23, 2011. https://www.nature.com/news/2011/110823/full/news.2011.498.html
- Source: Professor Knut Reinert’s presentation at the Intel HPC Developer Conference at Supercomputing 2015, published December 2, 2015. https://www.youtube.com/watch?v=YVDaQFTeBlw Generally speaking, there are “10⁷ species on earth with 10⁸ average number of base pairs per genome; therefore, the earth’s genome has 10¹⁵ base pairs. 10⁴ Illumina HiSeq sequencers could sequence 10¹¹ base pairs each per day, so they could sequence the earth’s genome at 10x coverage in approximately 10 days.”
- Source: Professor Knut Reinert’s presentation at the Intel HPC Developer Conference at Supercomputing 2015, published December 2, 2015. https://www.youtube.com/watch?v=YVDaQFTeBlw
- Source: Professor Knut Reinert’s presentation at the Intel HPC Developer Conference at Supercomputing 2015, published December 2, 2015. https://www.youtube.com/watch?v=YVDaQFTeBlw
- Source: Disruptive technologies: Advances that will transform life, business, and the global economy. McKinsey Global Institute, May 2013.
- Source: Genome researchers raise alarm over big data, by Erika Check Hayden, Nature, July 7, 2015. https://www.nature.com/news/genome-researchers-raise-alarm-over-big-data-1.17912
- Source: Freie Universität Berlin press release announcing that Professor Reinert will head the Intel Parallel Computing Center, 2015. http://www.fu-berlin.de/en/presse/informationen/fup/2015/fup_15_285-professor-reinert-leitet-intel-parallel-computer-center/index.html
- Results based on local and global alignments of banded and unbanded cells for 150 bp Illumina reads (2.85 x 10¹¹ unbanded, 3.21 x 10¹⁰ banded). Baseline configuration: 2 x Intel® Xeon® processor E5-2697 v4 (2.30 GHz, 18 cores). System under test: 2 x Intel® Xeon® Gold 6148 processor (2.4 GHz, 20 cores). All tests run using Linux 3.10.0-514.21.1.el7.x86_64 and the GNU compiler 7.2.0.
- Results based on PacBio reads for a global alignment of 2.66 x 10¹³ cells. Test configuration: Intel® Xeon Phi™ processor 7250 (1.4 GHz, 68 cores, 16 GiB MCDRAM). All tests run using seqan_global, Linux 3.10.0-514.21.1.el7.x86_64, and the GNU compiler 7.2.0.
- Source: Disruptive technologies: Advances that will transform life, business, and the global economy. McKinsey Global Institute, May 2013.
This article was produced as part of Intel’s HPC editorial program, with the goal of highlighting cutting-edge science, research and innovation driven by the HPC community through advanced technology. The publisher of the content has final editing rights and determines what articles are published.