Josh P. Roberts
Technology and computing power have allowed a wholesale examination of what was, not too long ago, a necessarily piecemeal study. ’omes – and with them ’omics – are now pervasive in biological research. Genes have ceded the spotlight (if not the footlights) to genomes, proteins to proteomes, metabolites to metabolomes, and microbiota to microbiomes. And in the process, genetics has ceded the spotlight to genomics, and so on down the line.
Not content with looking at just a single ’ome at a time, a small but growing contingent has been taking their studies to the next logical step: multi-omics. Here, a single ’ome is seen as a piece of the greater puzzle that is the system or organism (let’s call it M). Not only do M’s genes play a role in determining who and what he is, including his current and future mental and physical health, but so do his mRNA, proteins, and metabolites. And so too do his gut flora. Affecting all these as well are methylation and phosphorylation patterns, miRNA, diet and heart rate, as well as signaling molecules and other things that all feed back into the web-like network that is M.
Studying multi-omics promises to give a more holistic picture of the organism and its place in its ecosystem. It can help to tease out pathways that were not clear from a single ’omic. It allows for the identification of biomarkers unseen using a single ’omic, and for the corroboration of others that were found by a single ’omic. It can lead to insights into the workings of the organism, how those workings may be perturbed by disease or other forces, and the development of remedies (such as therapeutics) to set things right.
Yet taking a multi-omics approach is not as simple as adding another data point to a single-omics model. Combining ’omes brings further challenges such as standardization and compatibility of the data, a lack of appropriate statistical and computational tools, and data architecture issues. Still, results have been generated, progress continues to be made, and the field is optimistic.
’Omics and Big Data
One of the principal issues in doing even single-omics work stems from the sheer volume of data that is generated, and thus the computing horsepower needed to crunch it. For example, Illumina’s HiSeq 2500 can generate up to a terabyte of sequencing data in a single run.1 That information needs to be in a form such that it can be stored, secured, retrieved, shared, and analyzed.
Genomics research has the fortune (or curse) to be largely dominated by a single platform that has been around for a while, so there is considerable standardization, and many good strategies are in place to store and access the data effectively and efficiently, points out Tiffany Timbers, Ph.D., a teaching fellow in the Masters of Data Science program at the University of British Columbia.
“But from the phenome perspective this isn’t a solved problem,” laments Timbers, whose research on C. elegans involves integrating genomics and phenomics. “We record videos – do you store the raw videos? Do you store intermediate file types? What’s the metadata that you associate with that? Not all researchers across the field agree, so everyone is doing something slightly different.”
Similar issues come to the fore when dealing with other ’omics. To take a simple example: how should post-translational modifications be handled in proteomics? Is it a single protein that has multiple phosphorylations, or is each phosphorylated state considered a different entity? How are they to be related? The situation becomes even murkier when, as with metabolites, “there are multiple standard ways to describe the same thing – and this is a problem,” says Robert Tonge, Ph.D., Principal Product Manager, Informatics at Waters Corporation. “You need to describe your differences in a way that’s compatible with the database that you’re going to search.”
While ’omics can be done using in-house resources, “it’s a big investment for labs to set up all that server infrastructure and keep it up to date, have somebody manage it, keep it secure, and so on,” says Christie Hunter, Ph.D., Director of Global Technical Marketing at SCIEX.
The solution most often advocated is cloud computing, in which the data is uploaded to and lives on large remote server farms. Illumina, for example, operates the cloud-based BaseSpace Sequence Hub platform, to which data can be uploaded or streamed directly from Illumina sequencers. The site hosts a collection of applications, some from third parties, that allow users to organize, manipulate, and analyze the data, and to query databases, all in a single environment.
“The cloud brings to the party the potential for huge amounts of processing horsepower that can be turned on as needed – you don’t have to have your own resources that the vast majority of the time aren’t doing any work,” notes Tonge.
Multi-omics and Big Data
Once different ’omes are asked to talk with each other, data management issues can increase exponentially. It’s not so much the amount of data – this may well be merely additive – as it is the disparities among the data, and how to correlate one ’omic with another. This may seem straightforward in that the genome encodes the transcriptome – these can be aligned by mapping each A, G, C, and T to U, C, G, and A, respectively. Similarly, every three bases of the transcriptome (degenerately) encode a given amino acid, allowing the proteome in turn to be aligned to the transcriptome. (This gross oversimplification is merely to illustrate the next point.)
“So if you have a change in genomics you’d expect to see a change in the equivalent protein. But what’s not known is how proteins affect metabolites or lipids – we don’t fully understand all the connections in that data matrix,” Tonge points out.
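The deliberately oversimplified genome-to-transcriptome-to-proteome alignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DNA is taken to be the template strand, strand orientation is ignored, the codon table is a small excerpt of the standard genetic code, and the function names `transcribe` and `translate` are made up for this example.

```python
# Template-strand DNA base -> mRNA base (the A, G, C, T -> U, C, G, A mapping).
DNA_TO_MRNA = {"A": "U", "G": "C", "C": "G", "T": "A"}

# Excerpt of the standard genetic code; the full table has 64 codons.
# Several codons map to the same amino acid (the code is degenerate).
CODON_TABLE = {
    "AUG": "M",
    "UUU": "F", "UUC": "F",
    "GGA": "G", "GGC": "G",
    "GCU": "A", "GCC": "A",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(template_dna: str) -> str:
    """Map each template-strand base to its complementary mRNA base."""
    return "".join(DNA_TO_MRNA[base] for base in template_dna.upper())


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon is reached.

    Codons missing from the excerpted table are reported as 'X'.
    """
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("TACAAACCGATT")
print(mrna)             # AUGUUUGGCUAA
print(translate(mrna))  # MFG
```

No comparable lookup table exists for the step from proteins to metabolites or lipids, which is the gap Tonge describes.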
“One of the biggest challenges is how to integrate across data sets – there aren’t many examples,” says Janet Jansson, Ph.D., Division Director of Biological Sciences at the Pacific Northwest National Laboratory. “The way we approach it is to look at relative abundance data across data types. Think of it like a huge Excel table. And then you can build networks, with nodes and lines connecting networks of data.” For example, the nodes may be bacterial 16S genes taken from different niches, and for each node there may be ten or twenty different metabolites correlated to the various nodes, and so on for other ’omes. “So it’s possible to graphically represent the data – that’s one way of doing it.” Another way, “if you’re looking at genes and transcripts and proteins that all have the same code” is to essentially pile the data on top of each other to align them to specific genes in a pathway.
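One way to picture the "huge Excel table" and the networks built from it is a small correlation network. The sketch below is not Jansson's method; it assumes a table of relative abundances (features from several ’omes as rows, samples as columns), uses Pearson correlation as a stand-in measure of association, and draws an edge wherever the absolute correlation passes an arbitrary threshold. All feature names and values are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


def build_network(table, threshold=0.9):
    """Return (feature_1, feature_2, r) edges for strongly correlated pairs."""
    edges = []
    for (f1, v1), (f2, v2) in combinations(table.items(), 2):
        r = pearson(v1, v2)
        if abs(r) >= threshold:
            edges.append((f1, f2, round(r, 3)))
    return edges


# Hypothetical relative abundances across four samples.
table = {
    "16S:taxon_A":  [0.10, 0.20, 0.30, 0.40],
    "metabolite:X": [0.05, 0.10, 0.15, 0.20],
    "metabolite:Y": [0.40, 0.30, 0.20, 0.10],
    "protein:P":    [0.25, 0.24, 0.26, 0.25],
}

for edge in build_network(table):
    print(edge)
```

With real data the number of samples is usually small relative to the number of features, so a threshold like this would need to be backed by proper statistics, such as correction for multiple testing.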
This, of course, is not to trivialize the task of normalizing the data – to distill it down to the point at which it becomes just “relative abundance” (what Hunter calls “the same currency”). That is where most of the work happens. Once it is accomplished, though, Hunter says, the “comparison between different techniques, different ’omes, starts to become more on the currency. … Once you distill down to that primary heat map of sample vs. quant[ity], then comparison becomes generic and easy to do across platforms.” SCIEX, for example, offers applications (which it calls OneOmics) on BaseSpace to integrate proteomics and genomics (transcriptomics) data.
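As a rough illustration of getting different measurements into "the same currency", the sketch below rescales each sample so that its features sum to 1. This total-sum scaling is an assumption chosen for simplicity, not the method used by OneOmics or any other product named here, and real pipelines involve far more work (batch effects, missing values, platform-specific corrections). The sample and feature names are invented.

```python
def to_relative_abundance(raw):
    """Convert {sample: {feature: raw_value}} to per-sample fractions summing to 1."""
    relative = {}
    for sample, features in raw.items():
        total = sum(features.values())
        if total == 0:
            raise ValueError(f"sample {sample!r} has no signal to normalize")
        relative[sample] = {name: value / total for name, value in features.items()}
    return relative


# Hypothetical raw values on very different scales:
# proteomics peak intensities and transcript read counts.
proteome = {"sample_1": {"P1": 300.0, "P2": 100.0}}
transcriptome = {"sample_1": {"T1": 1500, "T2": 500}}

print(to_relative_abundance(proteome))       # P1 -> 0.75, P2 -> 0.25
print(to_relative_abundance(transcriptome))  # T1 -> 0.75, T2 -> 0.25
```

After this step the two tables share a scale, which is what makes a sample-versus-quantity heat map comparable across platforms.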
“These days you distill big data into biological pathways and things like this in order to get meaning out of it,” says Michael Snyder, Ph.D., Director, Center for Genomics and Personalized Medicine at Stanford University.
Scripts and Databases
Many available web-based tools – whether open source or commercial products – aim to allow researchers to use their ’omics data to query public and proprietary databases. Software suites such as Clarivate Analytics’ MetaCore and Key Pathway Advisor, Qiagen’s Ingenuity Pathway Analysis, and Advaita’s iPathwayGuide use curated collections to link a customer’s ’omic data to known or predicted biological pathways and other information that may be valuable for biomarker or drug discovery, for example. Publicly accessible web interfaces and databases such as IMPaLA, KEGG, Reactome, Pathway Commons, and WikiPathways can serve similar functions.
Feeding in a list of genes, proteins, and metabolites that are up- or down-regulated in cancerous tissue, for example, may yield a list of pathways that have been affected – quite often several in related areas such as fat regulation, or protein turnover, “or some other high-level biological descriptor,” explains Tonge. While this may not yield an immediate causal connection, “all this data then starts pointing you in a particular direction.”
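How a list of changed features turns into a list of affected pathways can be illustrated with a simple over-representation test. The sketch below uses a hypergeometric test, a common approach to this kind of question; it is not a description of how any tool named above works internally. The pathway names, feature identifiers, and the `enrich` function are hypothetical.

```python
from math import comb


def hypergeom_pvalue(population, pathway_size, n_hits, overlap):
    """P(X >= overlap) when n_hits features are drawn from the population."""
    upper = min(pathway_size, n_hits)
    favorable = sum(
        comb(pathway_size, i) * comb(population - pathway_size, n_hits - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(population, n_hits)


def enrich(hits, pathways, background):
    """Rank pathways by how over-represented the hit list is in each."""
    hits = set(hits) & set(background)
    results = []
    for name, members in pathways.items():
        members = set(members) & set(background)
        overlap = len(hits & members)
        p = hypergeom_pvalue(len(background), len(members), len(hits), overlap)
        results.append((name, overlap, p))
    return sorted(results, key=lambda item: item[2])


# Hypothetical background of 20 measured features and two toy pathways.
background = {f"f{i}" for i in range(20)}
pathways = {
    "lipid_metabolism": {"f0", "f1", "f2", "f3", "f4"},
    "protein_turnover": {"f5", "f6", "f7", "f8", "f9"},
}
hits = {"f0", "f1", "f2", "f10"}  # features up- or down-regulated

for name, overlap, p in enrich(hits, pathways, background):
    print(f"{name}: overlap={overlap}, p={p:.4f}")
```

A low p-value points toward a pathway worth a closer look rather than proving a causal connection, and testing many pathways at once calls for multiple-testing correction.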
Historically, researchers have used in-house, DIY tools to pull their datasets together, says Brady Davis, Senior Director of Market Development at Illumina, but many DIY tools don’t scale. “We’re seeing a shift where there are organizations like Illumina that are building platforms to help accelerate that … so building the data models that allow that data to become normalized and brought into an ecosystem that you can build analytics and do search on datasets with accuracy.”
The view of those like Timbers in the DIY space is that the commercial platforms “let you do certain things, but [don’t] let you do everything you’d like to be able to do, or that you could do if you could computer program.” She advocates more statistics and computational training across biology-related programs. “The idea is maybe we can create the ‘bioinformatics middle class’, so that it’s easier for the biologist to at least collaborate with, talk to, and speak the same language as the computational people, or even implement some of the things themselves.”
“The technologies are robust enough now that anybody can collect these kinds of data. You still need to be pretty expert to figure out how to analyze it and combine them,” says Snyder. So for now “you’d probably want to talk to an expert so you do it right – at least so you know what the issues are.”
Combining different types of data is not new, even if the term “multi-omics” is. Just as police may use footprint and hair follicle evidence along with eyewitness testimony to help solve a crime, genetic, proteomic, and lifestyle data have collectively been used to give a fuller picture of organisms in health and disease. What is only now being realized is the computing power, and the tools, to combine these on an ’omic scale.
Josh P. Roberts is a freelance writer living in Minneapolis, USA