Multiomics and Metaproteomics: Why More Is Better In Proteomic Data Analysis
Developments in next-gen multi-omics have made it possible to generate and process massive amounts of data that cover anything from genomic mutations to metabolomic and microbiological processes. However, handling and analyzing all this data continues to present challenges to even the most experienced bioinformatics experts. In this article we explore some of the solutions and advancements in multi-omics and metaproteomics data analysis approaches.
Multi-Omics: More is Better
The multi-omics approach has become a hot topic in the biomedical field, with researchers drawn to globally analyzing data integrated from multiple “omes” such as the genome, transcriptome or proteome. Collecting information from these multiple “omes” allows better understanding of complex diseases like cancer.
“Many labs have so far focused on understanding disease mechanism, progression and treatment strategy using single ‘omics’ (like genomics, epigenomics, transcriptomics, proteomics, or metabolomics) analysis,” said Dr. Suhas Vasaikar, research associate fellow at the Baylor College of Medicine, Houston, Texas. “Although single omics analysis gives some understanding about the cellular condition, it does not provide a global picture. The beauty about multi-omics is that it offers a comprehensive assessment of multiple ‘omic’ profiles to leverage information from individual ‘omics’.”
Considering the incredible complexity of the human genome and its regulation at multiple levels, the “More is Better” approach using multi-omics has become quite popular in this era of precision medicine [1]. The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC) are two such national efforts to understand the molecular basis of cancer, through genomic and proteogenomic analysis, respectively.
Currently, a number of tools approach multi-omics as a resource, an analysis module, or a visualization tool. Existing databases and web portals allow users to exploit publicly available cancer data, but they often focus on particular datasets or cohorts, or on the specific questions under study. For instance, Oncomine is a cancer microarray database and web-based data-mining platform. Similarly, MethyCancer is a database that helps to elucidate the relationship between DNA methylation, gene expression and cancer. While the PrognoScan database is focused on meta-analysis, the cBioPortal explores cancer genomics with ample multi-omics datasets.
“Overall, omic-generalized tools usually limit the application to a specific question within and across known cancer types,” explained Vasaikar. “Hence, there is a need for tools that integrate available ‘big data’ under a common platform and assist in interpretation of ‘big data’ in relation to one another.”
The LinkedOmics Portal
To this end, Vasaikar and his colleagues in the Bing Zhang Lab at Baylor developed a database called LinkedOmics for disseminating data from large-scale cancer omics projects [2]. Currently, it uses preprocessed and normalized data from the Broad TCGA Firehose and CPTAC data portals to reduce redundant efforts. The platform focuses on the discovery and interpretation of attribute associations, complementing existing cancer data portals.
LinkedOmics integrates not only genomics data from the TCGA portal for 32 cancer types, but also proteomics data for available cancers from the CPTAC portal, with a clear description of the application, the pipelines used, and the normalization methods [2]. At the moment, LinkedOmics contains multi-omics data for primary tumors from a total of 11,158 patients, including:
- mutation, copy number alteration (CNA), methylation, mRNA expression, miRNA expression, and reverse phase protein array (RPPA) data at the gene level
- mutation data at the site level
- CNA data at the region level
- RPPA data at the analyte level
- clinical data
“LinkedOmics is the first data portal that integrates mass spectrometry-based global proteomics data generated by CPTAC on selected TCGA tumor samples,” added Vasaikar. “The portal is user-friendly, and particularly beneficial for researchers in the field because it uses the ‘guilt-by-association’ method and performs functional enrichment analysis – some of the most widely-used and well-understood approaches in biomedical research. The visualization tools within this platform are very effective in helping users understand the results easily.”
A major drawback of applying association analysis to high-dimensional data is the difficulty of telling genuine functional relationships apart from superficial, non-functional ones. Vasaikar explained that this limitation is directly addressed by the multi-omics, pan-cancer, and pathway/network analysis features in LinkedOmics.
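As a rough illustration of the univariate, guilt-by-association style of analysis described above, the sketch below correlates one query attribute with every other attribute across samples and ranks the results. It is a toy example under stated assumptions, not the LinkedOmics implementation: the gene names, the values, and the helper functions `pearson` and `rank_associations` are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def rank_associations(query, attributes):
    """Correlate the query profile with every attribute and sort by |r|."""
    scores = {name: pearson(query, values) for name, values in attributes.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy data: one value per tumor sample (five samples).
query_protein = [1.0, 2.0, 3.0, 4.0, 5.0]
mrna_profiles = {
    "GENE_A": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with the query
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as the query rises
    "GENE_C": [1.0, 3.0, 2.0, 3.0, 1.0],   # unrelated to the query
}

ranking = rank_associations(query_protein, mrna_profiles)
for gene, r in ranking:
    print(f"{gene}\t{r:+.2f}")
```

With thousands of attributes, some high correlations will arise by chance alone, which is why a ranking like this needs multiple-testing correction and functional follow-up before it is interpreted.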
What Does the Future Hold for LinkedOmics?
Vasaikar and his team envision incorporating multivariate analysis into the LinkedOmics platform so that confounding variables can be controlled.
“Our current model allows univariate analysis results to be obtained in less than a minute on-the-fly, but for multi-variate analysis we would like to use cloud computing to provide invaluable results to users without much waiting time,” Vasaikar said.
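To make the point about confounding concrete, here is a minimal sketch of one simple way to control for a single confounder: a first-order partial correlation. This is an assumption chosen for illustration, since the article does not say which multivariate method the team plans to use. The variable names and values (including the idea of a "purity" covariate) are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical toy data: both features track a shared covariate (here
# called "purity") plus independent noise, so they look associated.
purity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
feature_x = [1.5, 1.0, 3.5, 4.5, 4.0, 6.5]
feature_y = [1.5, 2.0, 2.5, 3.5, 5.0, 6.5]

raw_r = pearson(feature_x, feature_y)
adjusted_r = partial_correlation(feature_x, feature_y, purity)
print(f"univariate r = {raw_r:.2f}, adjusted for purity = {adjusted_r:.2f}")
```

In this constructed example the strong univariate correlation disappears once the shared covariate is accounted for, which is the kind of spurious association a multivariate model is meant to expose.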
Other planned improvements include allowing users to customize query features (e.g. only loss-of-function mutations instead of all mutations), merge query features (e.g. all mutations in a pathway or all aberration types in a gene), select multiple target datasets at the same time, explore hypothesis-driven relationships, and create correlation networks for top-ranking genes.
What’s the Deal with Metaproteomics Data?
Metaproteomics refers to the large-scale characterization of the entire protein complement of environmental microbiota at a given time point [3]. One of the biggest differences between classical proteomics and metaproteomics is that the community samples handled in the latter contain proteins from many different species, up to hundreds or even thousands. According to Dr. Thilo Muth, bioinformatics expert and post-doctoral fellow at the Robert Koch Institute in Berlin, Germany, the field of metaproteomics is evolving rapidly as an important sub-discipline of proteome research used to assess the functional repertoire of microbiomes (e.g. within the human gut) and of environmental samples.
“Metaproteomics has become increasingly popular in human health studies that investigate the role of the microbiome in disease states,” said Muth. “For example, intervention studies that investigated the impact of diet on the gut microbiome have shown very interesting patterns: although the microbial community structure (i.e. the taxonomic composition) remained relatively stable for given perturbations, significant qualitative and quantitative changes of protein expression could be observed in these samples.”
The Problems with Metaproteomics Data Analysis
Although metaproteomics studies can turn up valuable information on protein expression patterns, the actual process of analyzing the data can be particularly arduous. In contrast to genomic approaches, the analysis of proteins from microbial community samples comes with added challenges related to experimental setup and computational factors. Some of the most severe problems, as described by Muth, are listed below:
- The complexity and heterogeneity of community samples lead to low protein identification yield and reduced protein coverage [4].
- The enormous number of proteome references that need to be considered when identifying proteins via database searches is problematic for statistical validation and correct assessment of false discovery rates [5].
- Despite the large number of microbial proteome references, the databases are far from complete, since many species/strains have not yet been sequenced or annotated [6].
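The effect of search space on false discovery rate (FDR) assessment can be sketched with the target-decoy strategy, a common way of estimating FDR in database searching. The article itself does not name a specific method, so treat this as an assumed illustration. The scores below are invented toy values, and `fdr_at_threshold` and `accepted_targets` are hypothetical helper names.

```python
def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits among PSMs scoring >= threshold."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def accepted_targets(psms, max_fdr):
    """Return target PSMs above the lowest score cut-off that keeps FDR <= max_fdr."""
    best = []
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if fdr_at_threshold(psms, threshold) <= max_fdr:
            best = [(s, d) for s, d in psms if s >= threshold and not d]
    return best

# Toy peptide-spectrum matches as (score, is_decoy); all values are made up.
target_hits = [(95, False), (90, False), (85, False), (80, False), (75, False), (70, False)]

# Small, sample-specific database: random (decoy) matches score poorly.
small_db_psms = target_hits + [(40, True), (35, True)]

# Large public database: more candidates per spectrum, so random matches
# reach higher scores and crowd the region where true hits sit.
large_db_psms = target_hits + [(88, True), (78, True), (72, True)]

n_small = len(accepted_targets(small_db_psms, max_fdr=0.01))
n_large = len(accepted_targets(large_db_psms, max_fdr=0.01))
print(f"accepted at 1% FDR: small database = {n_small}, large database = {n_large}")
```

The same spectra yield fewer accepted identifications against the larger database, because the score cut-off has to rise to keep the estimated FDR at the chosen level.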
Muth went on to explain that “proteins are typically identified via peptides, or short protein sequences, in a mass spectrometry experiment, and these peptide sequences need to be puzzled back to the correct ‘original’ protein. This can be a difficult task, when proteins with many similar peptide sequences are in the sample.”
In proteomics, this problem is known as the ‘protein inference issue’. It is compounded in metaproteomics because many proteins are similar or even identical in sequence across the different species or strains in a microbial community sample [7]. Resolving which exact organism an identified protein came from therefore becomes even more complicated.
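The inference step can be illustrated with a small sketch that maps identified peptides back to candidate proteins and then picks a minimal explanatory set with a greedy, parsimony-style heuristic. This is one simple approach chosen for illustration, not the method used by any particular tool. The peptide sequences, the protein identifiers, and the function `infer_proteins` are hypothetical.

```python
def infer_proteins(peptide_to_proteins):
    """Greedily choose proteins until every identified peptide is explained."""
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    # Only peptides with at least one candidate protein can be explained.
    unexplained = {pep for pep, prots in peptide_to_proteins.items() if prots}
    selected = []
    while unexplained:
        # Pick the protein covering the most unexplained peptides;
        # iterating in sorted order makes ties deterministic.
        best = max(sorted(protein_to_peptides),
                   key=lambda prot: len(protein_to_peptides[prot] & unexplained))
        selected.append(best)
        unexplained -= protein_to_peptides[best]
    return selected

# Hypothetical identifications: two peptides are shared by homologous
# proteins from species A and species B; one is unique to species A.
peptide_to_proteins = {
    "LVNELTEFAK": {"speciesA|P1", "speciesB|P1"},
    "YLYEIAR": {"speciesA|P1", "speciesB|P1"},
    "AEFVEVTK": {"speciesA|P1"},
    "GITWGEETLMEYLENPK": {"speciesC|P2"},
}

shared_peptides = sorted(p for p, prots in peptide_to_proteins.items() if len(prots) > 1)
inferred = infer_proteins(peptide_to_proteins)
print("shared peptides:", shared_peptides)
print("inferred proteins:", inferred)
```

Here the species B homolog is dropped because every peptide that supports it is also explained by the species A protein. That is parsimonious, but it does not prove species B is absent, which is exactly the ambiguity the text describes.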
Furthermore, most of the currently available software for the processing, evaluation, and interpretation of metaproteomic data comes with its own set of limitations.
“Peptide and protein identification algorithms are rather limited for metaproteomics when large public databases are used,” said Muth. “The increased search space presented by microbial databases affects the number of identified proteins due to problems of the scoring functions and statistical validation.”
Developments on the Horizon Can Provide Solutions
To overcome these problems, many research groups have created their own tailored databases, provided that they can derive a more specific metagenome from the samples under investigation. If this is not possible, so-called ‘pseudo-metagenomes’ may be assembled from single microbial genomes. “Pseudo-metagenomes must be developed with caution, because only those species/strains can be identified that were included in the database by the researcher in advance, and other organisms may be missed out, leading to selection bias,” explained Muth. However, he remains optimistic that in the near future, the decreasing cost of metagenomics experiments will allow their routine application in combination with metaproteome analyses. This will lead to customized databases for each investigated sample that can provide a more targeted method for specifically identifying the proteomes of microbial communities.
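The selection bias Muth describes can be shown with a toy pseudo-metagenome: a peptide index built by in silico tryptic digestion of a few hand-picked proteomes. This is a simplified sketch under stated assumptions. The species names and protein sequences are invented, the cleavage rule (after K or R, not before P) is the usual simplification of trypsin specificity, and `tryptic_digest` and `build_peptide_index` are hypothetical helpers.

```python
import re

def tryptic_digest(sequence, min_length=5):
    """Cleave after K or R unless followed by P; keep peptides of min_length or more."""
    peptides = re.split(r"(?<=[KR])(?!P)", sequence)
    return [pep for pep in peptides if len(pep) >= min_length]

def build_peptide_index(proteomes):
    """Map each tryptic peptide to the set of species whose proteins contain it."""
    index = {}
    for species, sequences in proteomes.items():
        for sequence in sequences:
            for peptide in tryptic_digest(sequence):
                index.setdefault(peptide, set()).add(species)
    return index

# Hypothetical pseudo-metagenome: only the species the researcher chose.
pseudo_metagenome = {
    "species_x": ["MKTAYIAKQRQISFVK"],
    "species_y": ["MSLNDPEVRGGHTLLK"],
}
index = build_peptide_index(pseudo_metagenome)

# Peptides observed in the sample; "SSPLQK" stands in for a peptide from
# an organism that was never added to the database.
observed = ["TAYIAK", "GGHTLLK", "SSPLQK"]
matched = {pep: sorted(index[pep]) for pep in observed if pep in index}
unmatched = [pep for pep in observed if pep not in index]
print("matched:", matched)
print("unmatched:", unmatched)
```

The unmatched peptide is real signal that the search cannot assign, simply because its source organism was left out when the database was assembled.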
Muth also believes that the resolution and throughput of the analytical tools/instruments will continue to improve, increasing the analysis depth and the protein coverage in microbial samples. In parallel, database search engines will need to be improved further with respect to accuracy and statistical validation.
“It’s clear that the constant increase of high-functioning databases will continue to challenge conventional identification workflows and hardware,” he added. “De novo sequencing, or database-free sequence identification, may also soon become a real alternative in proteomics, with algorithmic improvements (e.g. using latest machine learning techniques) exploiting the potential of high-resolution data.”
With many such advancements on the horizon, Muth is confident that the research community will soon have access to better tools for investigating the taxonomic and functional profiles of microbial community samples.
1. Huang, S.; Chaudhary, K.; Garmire, L. X. More Is Better: Recent Progress in Multi-Omics Data Integration Methods. Frontiers in Genetics 2017, 8, 84.
2. Vasaikar, S. V.; Straub, P.; Wang, J.; Zhang, B. LinkedOmics: analyzing multi-omics data within and across 32 cancer types. Nucleic Acids Research 2018, 46 (D1), D956-D963.
3. Wilmes, P.; Bond, P. L. Metaproteomics: studying functional gene expression in microbial ecosystems. Trends in Microbiology 2006, 14 (2), 92-97.
4. Haange, S.-B.; Jehmlich, N. Proteomic interrogation of the gut microbiota: potential clinical impact. Expert Review of Proteomics 2016, 13 (6), 535-537.
5. Muth, T.; Kolmeder, C. A.; Salojärvi, J. et al. Navigating through metaproteomics data: A logbook of database searching. Proteomics 2015, 15, 3439-3453.
6. Locey, K. J.; Lennon, J. T. Scaling laws predict global microbial diversity. Proceedings of the National Academy of Sciences USA 2016, 113 (21), 5970-5975.
7. Heyer, R.; Schallert, K.; Zoun, R.; Becher, B.; Saake, G.; Benndorf, D. Challenges and perspectives of metaproteomic data analysis. Journal of Biotechnology 2017, 261, 24-36.