Scream If You Wanna Go FASTA: Breaking the Bottleneck in Genomic Data Analysis

Article

Published: December 11, 2019

By Sophie Laurenson

Read time: 5 minutes

The rising popularity and accessibility of Next Generation Sequencing (NGS) technologies have generated vast quantities of data, requiring concomitant advances in data processing and analytics. NGS describes a collection of platforms that enable the rapid profiling of nucleic acid sequences. Over the last decade, several groundbreaking technologies competing on efficiency and cost have facilitated large genome and transcriptome sequencing projects across diverse geographic regions and populations.

The direct cost of performing sequencing reactions has fallen, whereas the costs of data processing, storage, management, and interpretation have risen steeply. The uptake of NGS technologies in research and clinical settings has produced petabytes of genomic data, creating a bottleneck in data processing. Furthermore, the expansion of precision medicine has created the challenging task of combining genomic data with clinical data, such as electronic health records (EHR), to gain insights. In this article, we explore how new tools are simplifying the process of genomic data analysis.

What is genomic data analysis?

Genomic data analysis is a series of processes organized into a pipeline that converts raw nucleic acid sequence data into useful insights. The raw data from sequencing experiments are sequence reads, typically ranging from roughly 50 to 1,000 nucleotides in length depending on the underlying NGS technology. A single NGS experiment can produce millions to billions of individual reads, totaling gigabytes of data, which must be placed in the correct order. Raw reads are most commonly stored in FASTQ format, which pairs each base call with a Phred quality score that estimates the probability that the call is wrong. Raw sequence data are deposited in public repositories such as the Sequence Read Archive (SRA), which is part of the International Nucleotide Sequence Database Collaboration (INSDC).
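To make the FASTQ and Phred concepts concrete, here is a minimal Python sketch that reads four-line FASTQ records and decodes the quality string. It assumes the common Phred+33 encoding (older files may use a different offset), and the function names and example read are made up for illustration.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        # Assumption: Phred+33 encoding, where quality Q is stored as chr(Q + 33).
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


def error_probability(q: int) -> float:
    """Convert a Phred score Q to the estimated error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


# Hypothetical single-read example.
example = ["@read1", "GATTACA", "+", "IIII5+!"]
for read_id, seq, scores in parse_fastq(example):
    print(read_id, seq, scores)
```

A Phred score of 20 corresponds to an estimated 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1,000.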

To place individual sequence reads into order, researchers align them against a reference genome, which is typically distributed in FASTA format. During the alignment process, each read is matched to a genomic position. The choice of reference genome is a highly controversial topic in genomic data analysis. Despite advances in NGS technology and data analytics, an ideal reference genome remains elusive for most species.
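As a toy illustration of matching a read to a genomic position, the Python sketch below parses a FASTA-formatted reference and slides a read along it, keeping the offset with the fewest mismatches. This brute-force, ungapped search is an assumption made for clarity; production aligners use indexed, gap-aware algorithms, and the sequences shown are invented.

```python
from typing import Dict, List, Optional, Tuple


def parse_fasta(lines: List[str]) -> Dict[str, str]:
    """Return {sequence_name: sequence} from FASTA-formatted lines."""
    chunks: Dict[str, List[str]] = {}
    name: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first word after '>'.
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line.upper())
    return {n: "".join(parts) for n, parts in chunks.items()}


def best_position(read: str, reference: str) -> Tuple[int, int]:
    """Return (0-based offset, mismatch count) of the best ungapped placement."""
    if len(read) > len(reference):
        raise ValueError("read is longer than the reference")
    best = (0, len(read) + 1)
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


# Hypothetical reference split across two FASTA lines.
reference = parse_fasta([">toy_chr demo", "ACGTTGCA", "AGGCTTAC"])["toy_chr"]
print(best_position("GCAAGG", reference))
```

The search cost grows with the product of read length and reference length, which is one reason real pipelines rely on indexed references rather than exhaustive scanning.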

From a computational perspective, sequence alignment algorithms were developed and refined in the 1970s and 1980s, giving rise to the more efficient FASTA and BLAST algorithms that remain in use today. The alignment algorithm used in a given study depends heavily on experimental parameters such as the length of sequence reads, artifacts and biases in the sequencing reactions, the quality of the reference genome assembly, the computational resources available, and the subsequent analyses to be performed. Aligned sequences are commonly stored as SAM (Sequence Alignment Map) or BAM (Binary Alignment Map) files. SAMtools is a popular toolkit for managing SAM/BAM files.
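To show what an aligned read looks like on disk, the following Python sketch parses a single SAM alignment line into its eleven mandatory tab-separated fields and decodes two of the FLAG bits. The example record is invented, and header lines (which begin with "@") are assumed to have been skipped by the caller; in practice, a library or SAMtools itself would handle this.

```python
from typing import Any, Dict

# The eleven mandatory SAM columns, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line: str) -> Dict[str, Any]:
    """Parse one SAM alignment line (not a header line) into a dictionary."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated fields")
    record: Dict[str, Any] = {}
    for name, value in zip(SAM_FIELDS, parts):
        record[name] = int(value) if name in INT_FIELDS else value
    # FLAG bit 0x4 marks an unmapped read; bit 0x10 marks a reverse-strand alignment.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    # Anything after the mandatory columns is an optional TAG:TYPE:VALUE field.
    record["tags"] = parts[len(SAM_FIELDS):]
    return record


# Hypothetical alignment record.
example = "read1\t16\tchr1\t101\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec["RNAME"], rec["POS"], rec["CIGAR"], rec["is_reverse"])
```

POS is 1-based in SAM, so this hypothetical read starts at the 101st base of chr1, and the CIGAR string 7M indicates seven aligned bases with no insertions or deletions.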

Following alignment, genomic sequences are annotated to emphasize regions of interest such as genes, exons and regulatory regions. Annotated genomic data exists in specialized formats containing the genomic sequence region accompanied by the appropriate annotations. The intention of these files is to prioritize sequence regions that may have biological or clinical significance in subsequent analyses.
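The idea of annotation can be sketched as an interval lookup: given a genomic position, find the annotated features that overlap it. The Python example below is a simplified illustration that assumes features are held in memory as half-open intervals; the Feature type, the labels, and the coordinates are all hypothetical and do not correspond to any particular annotation format.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    label: str


def overlapping_features(chrom: str, start: int, end: int,
                         features: List[Feature]) -> List[str]:
    """Return labels of features that overlap the half-open interval [start, end)."""
    return [
        f.label
        for f in features
        if f.chrom == chrom and f.start < end and start < f.end
    ]


# Hypothetical annotations for illustration only.
annotations = [
    Feature("chr1", 900, 1000, "promoter_A"),
    Feature("chr1", 1000, 5000, "gene_A"),
    Feature("chr1", 1200, 1500, "gene_A_exon_1"),
    Feature("chr2", 1000, 2000, "gene_B"),
]

print(overlapping_features("chr1", 1250, 1251, annotations))
```

A linear scan like this is adequate for a handful of features; for genome-scale annotation sets, an interval tree or a sorted index would be the more usual choice.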

Making a good call

The final stage of most NGS data pipelines is to generate data that can be used to gain valuable insights. The methods and tools used vary depending on the aims of the experiment. The most common aim of NGS experiments is to identify and characterize genomic variants. These are positions at which the sampled sequence differs from the reference genome, and they are usually recorded in Variant Call Format (VCF) files. To interpret variants, researchers typically need access to large sets of variant data. Sources include the 1000 Genomes Project, the Exome Aggregation Consortium (ExAC), and The Cancer Genome Atlas (TCGA). Specialized tools exist for variant detection and interpretation, such as pGENMI, a computational method for analyzing the role of molecular variants in drug response; CODEX, a copy-number variant detection tool; LUMPY, an algorithm for detecting structural rearrangements; and BreakDancer, a genomic structural variation detection tool. Further downstream analysis could involve tools that visualize the geographic prevalence of variants, such as GGV, or kinship analysis tools such as SEEKIN.
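For readers unfamiliar with VCF, the Python sketch below parses the eight fixed columns of a single VCF data line and labels each alternate allele by comparing its length with the reference allele. The classification rule is a deliberate simplification, symbolic alleles such as <DEL> are only flagged rather than interpreted, and the example record is invented.

```python
from typing import Any, Dict


def classify(ref: str, alt: str) -> str:
    """Roughly label an alternate allele by comparing allele lengths."""
    if alt.startswith("<") or alt == "*" or "[" in alt or "]" in alt:
        return "symbolic"  # structural or placeholder alleles are not interpreted here
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) != len(alt):
        return "indel"
    return "MNV"


def parse_vcf_line(line: str) -> Dict[str, Any]:
    """Parse the fixed columns of one VCF data line (not a '#' header line)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    chrom, pos, var_id, ref, alt, qual, filt, info = parts[:8]
    info_dict: Dict[str, Any] = {}
    if info != ".":
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # INFO keys without '=' are boolean flags.
            info_dict[key] = value if sep else True
    alts = alt.split(",")
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": None if var_id == "." else var_id,
        "ref": ref,
        "alts": alts,
        "types": [classify(ref, a) for a in alts],
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
    }


# Hypothetical variant record with two alternate alleles.
variant = parse_vcf_line("chr1\t12345\t.\tA\tG,AT\t50\tPASS\tDP=30;DB")
print(variant["pos"], variant["alts"], variant["types"], variant["info"])
```

Real VCF files also carry header lines and optional per-sample genotype columns, which this sketch ignores.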

Clinical potential

NGS technologies have evolved to play an important role in life sciences research. However, many believe that the true value of genomic analysis will be in clinical applications. The Precision Medicine program of the World Economic Forum (WEF) suggests that as genomics experiments move into clinical settings, data analytics will need to integrate sequencing data with other data types. The WEF is establishing a global center to coordinate a framework on which the NGS community can develop tools and policies to bring genomics into the mainstream of the health sector. Separately, researchers are developing solutions to integrate genomic data with clinical data. OntoFusion, an ontology-based system for integrating genomic and clinical databases, exemplifies the union between NGS and clinical data.

Given the complexity and modularity of genomic data analysis, recent emphasis has been placed on integrating tools to construct NGS data pipelines. Companies active in the NGS technologies space have developed integrated solutions for end-use customers in both research and clinical settings. These platforms aim to be easily implemented turnkey solutions, although most are tied to the NGS platform technologies offered by the service provider.

Can open-source tools break the bottleneck?

Researchers operating on a budget, or who prefer flexibility, can build custom systems using open-source tools. Ronaldo da Silva Francisco Junior, a freelance bioinformatician based in Brazil, prefers open-source solutions. Although budget is a factor, the key reason is the ability to fine-tune parameters. He asserts that, “With proper knowledge in programming language skills, it is also possible to adjust such tools to specific problems.” His preferred platform is GATK, developed at the Broad Institute. The plethora of open-source options available for genomic analysis indicates that other researchers agree. In particular, Bioconductor, which is built on the R programming language, is a popular project used by researchers in both academic and industry settings. SRAdb is a Bioconductor package that can be used to query SRA data from within R. Workflow management systems (WMS), such as Galaxy, are another open solution aiming to automate and streamline data processing and analytics. Some researchers, as da Silva Francisco Junior says, opt for combining aspects of proprietary turnkey solutions “…using the freely available part of such tools, for example, GeneCards, MalaCards, and VarSelect.”

For researchers with less advanced data and coding skills, many public organizations also host online tools for genomic data analysis. For example, the SRA Toolkit enables data processing and conversion from public sources. The European Bioinformatics Institute (EBI) also offers a comprehensive selection of databases and tools for data analytics. As an example of a private initiative, the Broad Institute hosts the Integrative Genomics Viewer (IGV) for open use by researchers.

The tools and technologies described above have been developed in a modular fashion, often growing organically out of an unmet need. As genomic data analysis matures as a discipline, these modular tools are gradually being integrated into a cohesive toolbox. With increased computational capacity and automation, the potential to unravel the complex genetic factors that underpin biology and disease is slowly becoming reality. This potential is what ultimately excites researchers like da Silva Francisco Junior, who concludes, “The prospective use of recent advances in sequencing technologies and data generation towards the understanding of biological phenomena in human genetic diseases is an area that deeply grips my interest. I see with great excitement the possibility to use computational methods to better comprehend the biological basis of human genetic diseases. Therefore, as far as I am concerned, the use of machine and deep learning approaches in the genomic data science fields has been an exciting advance in the last years, making it a powerful way to gain insights about the molecular mechanism of mutations and translational applications in the medical field.”
