Building Terra, the Broad Institute’s Platform for a Collaborative, Scalable Genomics Research Ecosystem for All
Genomics has changed how biological science is done. Organizations like the Broad Institute of MIT and Harvard (Broad Institute) give life sciences researchers the capabilities they need for genome sequencing and analysis to accelerate scientific understanding, insight and breakthroughs in human genomics, disease and treatments. With on-premises infrastructure, the Broad Institute has provided sequencing, developed and made public genome analysis pipelines and tools—Broad Institute is the author of the Genome Analysis Toolkit (GATK)—and provided supercomputing capacity and data storage since 2004.
“Our genome sequencing facility generates data on the order of a whole human genome about every three to five minutes, 24 hours a day,” Geraldine Van der Auwera, Director of Outreach and Communications at the Broad Institute's Data Sciences Platform, explained. “Each genome corresponds to approximately 350 gigabytes of data, resulting in some 30 petabytes so far of genomic data being managed by the Broad Institute.”
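To put those figures in perspective, the short Python sketch below works through the arithmetic. It uses only the numbers quoted above (one genome every three to five minutes, roughly 350 gigabytes per genome, some 30 petabytes managed) and assumes decimal units (1 TB = 1,000 GB). The function names are illustrative, and the archive estimate is only a rough equivalent, since not all stored data is necessarily whole-genome data.

```python
# Back-of-the-envelope arithmetic based only on the figures quoted in the article.
GB_PER_GENOME = 350          # "approximately 350 gigabytes" per genome
MINUTES_PER_DAY = 24 * 60    # the facility runs 24 hours a day


def genomes_per_day(minutes_per_genome: float) -> float:
    """Number of genomes produced in one day at the given pace."""
    return MINUTES_PER_DAY / minutes_per_genome


def terabytes_per_day(minutes_per_genome: float) -> float:
    """Daily data volume in decimal terabytes (1 TB = 1,000 GB)."""
    return genomes_per_day(minutes_per_genome) * GB_PER_GENOME / 1000


def genome_equivalents(petabytes: float) -> float:
    """How many 350 GB genomes would fill the given number of petabytes."""
    return petabytes * 1_000_000 / GB_PER_GENOME


if __name__ == "__main__":
    for minutes in (5, 3):
        print(f"one genome every {minutes} min: "
              f"{genomes_per_day(minutes):.0f} genomes/day, "
              f"{terabytes_per_day(minutes):.1f} TB/day")
    print(f"30 PB is roughly {genome_equivalents(30):,.0f} genome equivalents")
```

At the quoted pace, that works out to between 288 and 480 genomes, or roughly 100 to 170 terabytes of new data, every day.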
Historically, sequencing has been expensive: the seminal Human Genome Project cost USD 2.7 billion. But over the last decade, as technology has advanced, whole genome sequencing (WGS) has become easier, faster, more available, and less expensive. Today a whole genome can be sequenced for less than USD 1,000, making this valuable tool more accessible to researchers and resulting in a flood of genomic data. By 2014, the Broad Institute needed to address its ever-expanding demand for storage, network, and computational capacity. Besides the Institute’s growing genomic repositories, other researchers and institutions were making more data available, creating new opportunities for research and analytics and new challenges for storage and computing demand. By 2017, the Broad Institute reached an inflection point.
“We realized that the on-premises infrastructure was going to quickly run out of capacity for both storage and computing,” Van der Auwera stated.
The Broad Institute begins a cloud journey
This realization led to a shift in thinking at the Broad Institute and the vision of Terra, a new platform built in the cloud that would enable a scalable, collaborative genomics research ecosystem designed for a variety of life sciences researchers.
“We decided to go to the cloud for several reasons,” Van der Auwera added. “One was logistics and economics of operating our processing pipelines and data storage. In the cloud, we could scale as needed for both compute and storage, paying only for the capacity we used. Also, the cloud would allow a whole new level of data federation and collaboration. We could work with others to create a cloud-based data ecosystem, where researchers could combine the data they generated with other datasets into richer, more powerful computational experiments. This would help them achieve greater statistical confidence, integrate additional sources of information, and generate critical insight into the areas of research they were focused on.”
Building on a cloud infrastructure would also support new projects, like the National Institutes of Health’s “All of Us” research initiative. All of Us is gathering and processing genomic, healthcare and real-time lifestyle data from 1,000,000 Americans to “learn how our biology, lifestyle, and environment affect health.”
But moving to the cloud presented new challenges. The Broad Institute’s pipelines were designed for its on-premises infrastructure. Life scientists, computational scientists, and data scientists approach their research and tools differently. And while the Broad Institute develops pipelines and makes them publicly available, it was not a software development organization. The Broad Institute would need the expertise of those offering the cloud services.
“We couldn’t just copy our existing pipelines over to the cloud,” Van der Auwera commented. “The infrastructures are different. We needed to re-implement our pipelines in a cloud-native way. Plus, to realize our vision of a federated data ecosystem would require building a whole new platform to handle the complexities of the cloud infrastructure, and provide applications and interfaces tailored to the needs of life scientists, in order to enable them to work effectively in the cloud.”
While the Broad Institute was exploring deployment and development options, Google, which had its own genomics analysis pipelines, offered to contribute to the development process.
“The early collaboration was a key piece of how we began the migration of the institute's production pipelines and setting the foundation for building what would ultimately become the Terra platform,” Van der Auwera stated.
Modularizing workflows and optimizing pipelines
The variety and choice of cloud computing and storage options opened doors for the Broad Institute to migrate its pipelines as cloud-native applications. The pipelines comprise many codes that perform various operations, from data reformatting to data curation to analytics. For example, the GATK comprises 24 tasks, of which 6 are multi-threaded and 18 are single-threaded.
“With on-premises infrastructure, you don’t have access to a variety of machine types like you can get on the cloud,” Van der Auwera explained. “On-premises clusters are typically all one type of system. With different types of cloud instances, however, we could modularize our workflows and right-size the instances allocated for each task based on its needs. Thus, we could cut processing costs considerably.”
“Many customers who deploy genomics workflows on cloud reserve large instances, because some parts of the workflows are compute-intensive,” Marissa Powers, an Intel Solution Architect who works with the Broad Institute's data engineering team, explained. “The Broad Institute does have processes in their pipelines that need massive amounts of computation. But most of the tools that are part of the genome analysis pipeline are actually single-threaded. They just need to run as long as they take, and they could use a smaller, less costly instance. So, the Broad Institute team built a sophisticated workflow automation mechanism where individual VMs are right-sized for the job and orchestrated across the entire pipeline of tasks.”
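The right-sizing idea can be illustrated with a small Python sketch. This is not the Broad Institute's workflow automation mechanism. The machine names, prices, and one-hour task durations are hypothetical, and the sketch assumes a task takes the same time on any machine with enough vCPUs. Only the task mix (6 multi-threaded and 18 single-threaded tasks) comes from the article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class MachineType:
    name: str
    vcpus: int
    hourly_usd: float  # hypothetical price, for illustration only


@dataclass(frozen=True)
class Task:
    name: str
    threads: int   # how many threads the task can actually use
    hours: float   # hypothetical run time


# Hypothetical catalog; real cloud machine types and prices differ.
CATALOG = [
    MachineType("small-1", 1, 0.05),
    MachineType("medium-4", 4, 0.20),
    MachineType("large-16", 16, 0.80),
]


def right_size(task: Task) -> MachineType:
    """Pick the cheapest machine with at least as many vCPUs as the task can use."""
    candidates = [m for m in CATALOG if m.vcpus >= task.threads]
    if not candidates:
        raise ValueError(f"no machine type is large enough for {task.name}")
    return min(candidates, key=lambda m: m.hourly_usd)


def pipeline_cost(tasks: List[Task], choose: Callable[[Task], MachineType]) -> float:
    """Total cost when each task runs on the machine returned by `choose`."""
    return sum(choose(t).hourly_usd * t.hours for t in tasks)


def example_tasks() -> List[Task]:
    """18 single-threaded and 6 multi-threaded tasks, one hour each (durations assumed)."""
    single = [Task(f"single-{i}", threads=1, hours=1.0) for i in range(18)]
    multi = [Task(f"multi-{i}", threads=16, hours=1.0) for i in range(6)]
    return single + multi


if __name__ == "__main__":
    tasks = example_tasks()
    always_large = pipeline_cost(tasks, lambda t: CATALOG[-1])
    sized = pipeline_cost(tasks, right_size)
    print(f"every task on large-16: USD {always_large:.2f}")
    print(f"right-sized per task:   USD {sized:.2f}")
```

With these made-up numbers, running every task on the largest machine costs USD 19.20, while right-sizing costs USD 5.70. The actual savings depend on real prices and run times.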
Another key innovation was how they avoided moving data whenever possible. Most analysis tools normally require localizing an entire input file by moving it from object storage to VM memory. But the Broad Institute's GATK can stream just a subset of the genomic data from the original input files. For many stages in the pipeline, execution is parallelized over subsets of the genome, with each subset being sent for processing to a different VM. This streaming approach reduces the amount of storage and memory needed, reduces time spent copying large amounts of data to the VM, and ultimately reduces costs. Compared to the Broad Institute’s initial deployment on cloud, these optimizations, along with the use of preemptible instances, reduced the cost of their main genome analysis pipeline by about 85 percent.
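The pattern of parallelizing over subsets of the genome can be sketched in Python as a scatter-gather loop. This is a toy illustration, not GATK's streaming mechanism: the "genome" is an in-memory string, worker threads stand in for separate VMs, and counting G and C bases stands in for real analysis. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open range [start, end)


def split_intervals(length: int, shard_size: int) -> List[Interval]:
    """Cut [0, length) into consecutive intervals of at most shard_size positions."""
    if shard_size <= 0:
        raise ValueError("shard_size must be positive")
    return [(start, min(start + shard_size, length))
            for start in range(0, length, shard_size)]


def count_gc(sequence: str, interval: Interval) -> int:
    """Process one shard: only the slice for this interval is read."""
    start, end = interval
    shard = sequence[start:end]
    return sum(1 for base in shard if base in "GC")


def scatter_gather(sequence: str, shard_size: int, workers: int = 4) -> int:
    """Scatter intervals across workers, then gather and combine the partial results."""
    intervals = split_intervals(len(sequence), shard_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(lambda iv: count_gc(sequence, iv), intervals))
    return sum(partial_counts)


if __name__ == "__main__":
    toy_genome = "ACGT" * 250  # 1,000 bases of toy data
    print(split_intervals(10, 4))
    print(scatter_gather(toy_genome, shard_size=64))
```

Because each worker touches only its own interval, no worker needs the whole input, which is the property the streaming approach described above relies on.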
The Broad Institute chose Google N1 and N2 instances, running on several families of Intel Xeon Scalable processors, to run its pipelines in the cloud. Intel has partnered with the Broad Institute since 2017, helping optimize the organization’s pipelines and GATK with Intel libraries, including the Intel® Genomics Kernel Library. Intel and the Broad Institute have also collaborated on powerful and flexible data center solutions for genome analysis for several years. Together they manage the Intel-Broad Center for Genomic Data Engineering. The Center helps researchers and software engineers build, optimize, and widely share new tools and infrastructure that will help scientists integrate and process genomic data. The project optimizes best practices in hardware and software for genome analytics.
Intel worked with the Broad Institute to help optimize their pipelines on Google Cloud. Intel developers benchmarked workloads targeted for Google Cloud and prescribed instances using the Workflow Description Language (WDL), an open-source, community-based standard for pipeline development stewarded by the OpenWDL organization. For example, specific kernels in the GATK are optimized for vector operations with Intel Advanced Vector Extensions 512 (Intel AVX-512). Some optimized storage functions use the Intel Intelligent Storage Acceleration Library (Intel ISA-L).
“One kernel of the pipeline, called PairHMM, is a hidden Markov model,” Powers explained. “Intel AVX-512 is a good fit for it based on the length of the vectors being processed. With the optimized version, we’ve seen continuous improvement from the original Java implementation to Intel AVX2 and Intel AVX-512. Anyone who runs the GATK pipeline on 1st Gen Intel Xeon Scalable processors, or later generations, gets by default the optimized version, whether they run on-premises or in the cloud.”
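For readers unfamiliar with hidden Markov models, the Python sketch below shows the forward algorithm for a generic HMM, which computes the likelihood of an observed sequence. It is not the GATK PairHMM kernel, which has its own state structure and optimized native implementations. The toy model and all names here are assumptions for illustration. The point is that the inner loop is a run of multiply-and-add operations over arrays, the kind of work that wide vector instructions are designed to speed up.

```python
from typing import Dict, List, Sequence


def forward_likelihood(
    observations: Sequence[str],
    initial: List[float],
    transition: List[List[float]],
    emission: List[Dict[str, float]],
) -> float:
    """Return P(observations) under an HMM using the forward algorithm.

    initial[i]        probability of starting in state i
    transition[i][j]  probability of moving from state i to state j
    emission[i][s]    probability that state i emits symbol s
    """
    if not observations:
        raise ValueError("observations must not be empty")
    n_states = len(initial)
    # alpha[i] = P(observations seen so far, current state is i)
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        # Multiply-accumulate over all predecessor states for each target state.
        alpha = [
            emission[j][symbol]
            * sum(alpha[i] * transition[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)


if __name__ == "__main__":
    # Toy two-state model with made-up probabilities.
    initial = [0.6, 0.4]
    transition = [[0.7, 0.3], [0.4, 0.6]]
    emission = [{"A": 0.5, "B": 0.5}, {"A": 0.1, "B": 0.9}]
    print(forward_likelihood(["A", "B", "B"], initial, transition, emission))
```

A production kernel would typically work in log space or with scaling to avoid numerical underflow on long sequences; the sketch omits that for brevity.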
On the Terra platform, Broad Institute pipelines run on Google Cloud N1 instances by default. But the pipelines are freely available for anyone to download through GitHub and run on their own infrastructure or the cloud infrastructure of their choice, including Google N2 instances, which are built on 2nd Gen Intel Xeon Scalable processors. As part of its benchmarking efforts, Intel studied the benefits of N2 instances for the Broad Institute's pipelines. According to Intel, the Broad Institute’s GATK runs about 25 percent faster and costs 34 percent less on N2 instances. For the All of Us program, pipelines deploy on N2 instances by default.
Genome sequencing, high-resolution medical imaging, and the digital transformation of clinical data have created a sea change in biomedical research. The Broad Institute, in collaboration with other organizations and academic institutions, envisioned a federated data ecosystem that would leverage connections between interoperable data repositories, tool repositories, and analysis engines, with user portals tailored to the needs of specific research communities. This vision became the Terra platform, a collaboration between the Broad Institute, Verily and Microsoft for an open data ecosystem available to life sciences researchers around the world. The aim of Terra is to enable next-generation biomedical research and put powerful tools in the hands of the wider life sciences research community.
“Terra provides a user-friendly environment that enables researchers to access the datasets they need, and apply the tools they want, securely and at scale,” Van der Auwera commented. “The platform also makes it easy to share their work at any stage, either privately with their collaborators or publicly with the world, in a form that makes their analysis completely reproducible and extensible.”
Using WDL, the platform simplifies running Broad Institute optimized pipelines, integrates external tools, performs interactive analyses, and enables access to a variety of data hosted in the cloud. It allows importing tools and workflow descriptions from other organizations, such as University of California at Santa Cruz Genomics Institute’s bioinformatics tools on Dockstore. Bioinformatics scientists who maintain their tools and workflows in GitHub can register them in Dockstore for use by other researchers, who can then run them on a range of connected analysis platforms, including Terra. Terra also provides secure workspaces to build projects for wide collaboration by scientists around the world. Terra currently supports nearly 20,000 users around the globe on Google Cloud, with support for Microsoft Azure in the works.
Terra enables high levels of collaboration in the cloud, allowing researchers to tackle human health challenges that are larger than a single organization can solve. Able to draw on a large variety of data, scientists can use existing bioinformatics and emerging artificial intelligence (AI) techniques and tools to gain new insights. Terra allows researchers to focus on their science instead of the infrastructure and provides shareable workspaces for wide collaboration.
The Broad Institute Data Sciences Platform
As a research organization, the Broad Institute has developed several life sciences processing pipelines and software tools. But migrating its services to the cloud, optimizing its workloads, and building Terra required the Broad Institute to expand its development expertise. Today, the Broad Institute Data Sciences Platform (DSP) includes engineers, analysts and designers. These professionals develop software products and operate services to support life sciences research using the many types of datasets available to scientists. The DSP also supports many national and international scientific initiatives in which the Broad Institute is involved.
Today, genomics is at the center of how researchers solve human health challenges, such as understanding how the SARS-CoV-2 virus works in order to create effective vaccines. In life sciences and human health breakthroughs, the world owes much to genomics research and the innovations achieved by its researchers and organizations supporting and enabling advancements in the field. The Broad Institute and its collaborators are at the forefront of innovation, enabling and helping accelerate genomics research.
This article was produced as part of Intel’s editorial program, with the goal of highlighting cutting-edge science, research and innovation driven by the HPC and AI communities through advanced technology. The publisher of the content has final editing rights and determines what articles are published.