Major advances in sequencing technologies over recent years mean that scientists no longer need to rely on proteomic or genomic information in isolation. Proteogenomics couples mass spectrometry (MS) techniques with high-throughput next-generation sequencing (NGS) technologies to study the role of protein variants in biological mechanisms and disease pathologies. Part of the "systems biology" trend, proteogenomics is expanding at an accelerating rate. In this article, we look at the approaches adopted in proteogenomics and explore how the field is changing the dynamics of precision medicine in oncology.
In the study of proteins, MS data are typically matched against existing mapped peptides in a reference protein database. Several key issues arise here, notably that the protein in question may be novel and therefore absent from the database, and that the peptide may contain mutations or represent an alternative splice form.
By combining proteomics and genomics, proteogenomics integrates genome, transcriptome and proteome data sets to surmount these issues. “Proteogenomics allows for the analysis of the correlation of mRNA and protein pairs across samples, of mutations, post-translational modifications, and signaling pathways, and of correlation of the regulatory effects on RNA and protein expression levels caused by genetic variants (eQTL), microRNAs (miRNAs) and copy number aberrations (CNAs),” says Alexzander Asea, Professor and Director at the Precision Therapeutics Proteogenomics Diagnostics Center, University of Toledo. "Common techniques used in proteogenomics research include RNA sequencing and data analysis, LC-MS/MS and MALDI mass spectrometry."
Proteogenomics represents an equal partnership in which each component contributes and each component benefits. NGS allows researchers to characterize variants in the genome, such as single nucleotide polymorphisms (SNPs) and translocations. Using in silico methods, these variants can then be translated into proteoforms and added to the existing protein databases used to interpret MS data, making those databases more comprehensive.
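The in silico step described above can be illustrated with a minimal Python sketch. It is not taken from any of the tools or studies named in this article: the coding sequence, the variant position and the function names are all hypothetical, and real pipelines handle strands, splicing, indels and missed cleavages that are ignored here. The sketch applies a single nucleotide variant to a toy coding sequence, translates both versions, and reports the tryptic peptides that exist only in the variant proteoform, which are the entries a customized database would add.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def apply_snv(cds, position, ref, alt):
    """Return the sequence with one base substituted (0-based position)."""
    if cds[position] != ref:
        raise ValueError("reference base does not match the sequence")
    return cds[:position] + alt + cds[position + 1:]


def tryptic_peptides(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


# Hypothetical toy coding sequence and variant, for illustration only.
REFERENCE_CDS = "ATGGCTAAAGGTGTTGAACGTTAA"
variant_cds = apply_snv(REFERENCE_CDS, position=10, ref="G", alt="A")

reference_protein = translate(REFERENCE_CDS)
variant_protein = translate(variant_cds)

# Peptides present only in the variant proteoform.
novel_peptides = set(tryptic_peptides(variant_protein)) - set(
    tryptic_peptides(reference_protein)
)
print(reference_protein, variant_protein, sorted(novel_peptides))
```

Only the variant-specific peptides need to be appended to the search database, since the unchanged peptides are already covered by the reference entry.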
The continuous feedback loop between genomics, proteomics and transcriptomics data in proteogenomics relies heavily on streamlined data integration and bioinformatic software systems, of which an array exists: "To achieve reproducible and reliable data, it is extremely important to use and/or combine quantitation bioinformatics software, including Ingenuity Pathway analysis (IPA), Parallel Reaction Monitoring (PRM), Progenesis, Library of Integrated Network-Based Cellular Signatures (LINCS) and Skyline, DESeq, Limma, EdgeR, R and MStats," notes Asea.
Challenges in a novel field
Novel research fields encounter challenges in their establishment and in the refinement of techniques used – proteogenomics is no exception.
"Whole genome sequencing (genomics) is still very cutting edge and while there are a lot of perks to using it, there are also a few drawbacks – lots of data, powerful computers, reference genomes etc.," notes Henry Rodriguez, Director of the Office of Cancer Clinical Proteomics Research at the National Cancer Institute.
Combining the three disciplines, each of which generates very large data sets, presents significant challenges with regard to analysis. Nesvizhskii highlights several key issues in a 2014 Nature Methods review of the field, including the need to overcome "data hoarding" and false discoveries in proteogenomics.1 Nesvizhskii discusses several sources of false discoveries in proteogenomics, including the "application of the same filtering thresholds to both known and novel peptides, incorrect identification of novel peptides highly homologous to known sequences, and making unsupported conclusions based on shared peptides", and encourages a focus on establishing thorough data analysis guidelines to overcome such issues. Furthermore, speaking of his experiences at the US HUPO 2019 conference, Asea says: "Due to the huge amount of data that are generated during mass spectrometry-based experiments, Dr Birgit Schilling from the Buck Institute for Aging, Novato, CA made it clear that improvements in bioinformatics algorithms is an essential strategy to the future of clinical proteomics."
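The first source of false discoveries quoted above, applying the same filtering threshold to known and novel peptides, can be made concrete with a small Python sketch. It assumes a target-decoy strategy for estimating the false discovery rate (FDR), which this article does not itself describe, and the scores, class labels and the 10% cutoff are invented for illustration. It is not a reproduction of the procedure in the cited review.

```python
def score_threshold(matches, max_fdr):
    """Lowest score cutoff whose estimated FDR (decoys / targets) stays within max_fdr.

    `matches` is an iterable of (score, is_decoy) pairs. Returns None if no
    cutoff qualifies. Tied scores are not treated specially in this sketch.
    """
    targets = decoys = 0
    best = None
    for score, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best = score
    return best


# Hypothetical peptide-spectrum matches: (score, is_decoy, peptide_class).
MATCHES = (
    [(s, False, "known") for s in (90, 85, 80, 75, 70, 65, 60, 55, 50, 45)]
    + [(40, True, "known")]
    + [(72, False, "novel"), (58, False, "novel")]
    + [(62, True, "novel"), (52, True, "novel")]
)
MAX_FDR = 0.1  # assumed cutoff for the example


def threshold_for(peptide_class=None):
    """Threshold for one peptide class, or for the pooled data if class is None."""
    subset = [
        (score, is_decoy)
        for score, is_decoy, cls in MATCHES
        if peptide_class is None or cls == peptide_class
    ]
    return score_threshold(subset, MAX_FDR)


pooled = threshold_for()
novel_only = threshold_for("novel")
print("pooled threshold:", pooled)
print("novel-only threshold:", novel_only)
```

In this toy data the pooled cutoff is dominated by the large, well-behaved set of known peptides, so a weak novel match passes; estimating the FDR within the novel class alone gives a stricter cutoff for novel identifications.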
Whilst MS has advanced immensely over recent years, issues relating to sensitivity, the size of proteins, sample solubility, separation and data analysis remain. "MS-based proteome approaches still have a lot of optimization ahead," Rodriguez remarks. "However, the flexibility and potential of mass spectrometry remains to be fully exploited. In the coming years, I’m excited it will provide insights into previously inaccessible corners of cell biology," he adds.
Making precision oncology more “precise”
From an “omics” perspective, the clinical oncology field was historically dominated by genomics research alone. However, the NCI's Office of Cancer Clinical Proteomics Research changed the dynamics of the field through their Clinical Proteomic Tumor Analysis Consortium (CPTAC). Rodriguez says: “Integrating genomic and proteomic data through proteogenomics approaches can illuminate the biology that is either difficult to obtain or not possible through genomics alone, to help make precision oncology more precise.” Asea also adds that these projects “coupled with the Human Proteome Project (HPP) are all significant breakthroughs that have greatly pushed the field of proteogenomics and precision medicine into the future.”
Researchers from CPTAC have used proteogenomic methods to make novel findings in colorectal, breast and ovarian cancer. Across each study, they wanted to know which protein-coding alterations are expressed at the protein level, information that cannot be deduced from MS data alone.
In the above-mentioned colorectal cancer (CRC) study, database searches were performed with customized sequence databases derived from matched RNA sequencing data for individual tumor samples.2 The searches identified 796 single amino acid variants (SAAVs) across 86 tumor samples. Notably, 20q amplification was associated with the largest global changes in both mRNA and protein levels, highlighting the importance of 20q amplification in colorectal cancer, a concept that was previously disputed. Among the 79 genes within the 20q region, only 40 showed significant CNA–protein correlation, a measure of whether copy number changes carry through to protein abundance. By combining proteomic and genomic resources, the work identified a subset of genes that can be prioritized in future studies.
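As a rough illustration of what a per-gene CNA–protein correlation involves, the Python sketch below computes a Spearman rank correlation between copy number and protein abundance across samples for each gene. The gene names, the values and the 0.5 cutoff are hypothetical, and the choice of Spearman correlation with a fixed cutoff is an assumption made for the example; the cited study assessed statistical significance, which is not reproduced here.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-sample measurements for two genes in an amplified region.
copy_number = {
    "GENE_A": [2, 3, 4, 5, 6],
    "GENE_B": [2, 3, 4, 5, 6],
}
protein_abundance = {
    "GENE_A": [1.0, 1.4, 2.1, 2.9, 3.5],  # rises with copy number
    "GENE_B": [1.2, 0.9, 1.3, 0.8, 1.1],  # does not track copy number
}

MIN_RHO = 0.5  # assumed cutoff, for illustration only
correlations = {
    gene: spearman(copy_number[gene], protein_abundance[gene])
    for gene in copy_number
}
concordant = [gene for gene, rho in correlations.items() if rho >= MIN_RHO]
print(correlations, concordant)
```

Genes such as the hypothetical GENE_A, where amplification is reflected at the protein level, are the kind of candidates that this sort of analysis would prioritize.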
More recently, a study published in Cancer Cell involved proteomic and genomic analyses of 80 tumor tissues from young diffuse gastric cancer (GC) patients.3 Initial genome analysis found 7,079 nonsynonymous somatic single nucleotide variants (SNVs) in 4,982 genes; these variants were detected in tumors but not in peripheral blood mononuclear cells from the same patient. Of these genes, six were found to be significantly mutated: CDH1, TP53, BANP, MUC5B, RHOA and ARID1A. Mutations in MUC5B and BANP had not previously been reported in GCs. Proteomic analysis identified proteins whose phosphorylation levels were significantly increased in the samples with mutations of these genes. CDH1, ARID1A and RHOA showed mutation-phosphorylation correlations in 80 proteins. By clustering mRNA, protein, phosphorylation and N-glycosylation data, four subtypes of diffuse GCs and their associated cellular pathways were subsequently distinguished, information that would not have been attainable through mRNA analysis alone.
Now, young patients with diffuse GCs can be categorized into a positive or negative prognosis based on whether they show increased or decreased mRNA and protein expression levels of select oncogenes and tumor suppressor genes. The authors also note that drug sensitivity can be predicted based on the mutation-phosphorylation associations of ARID1A, CDH1 and RHOA; however, further validation is required.
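The tumor-versus-blood comparison described above amounts to a set difference followed by a filter on variant effect. The Python sketch below shows that logic under simplifying assumptions: the variant calls, gene names and effect labels are hypothetical, and nonsense variants are counted as nonsynonymous here. Real somatic calling also weighs read depth, allele fraction and sequencing error, none of which is modeled.

```python
# Each call: (chromosome, position, ref, alt, gene, effect). Hypothetical data.
tumor_calls = [
    ("chr1", 1000, "G", "A", "GENE_A", "missense"),
    ("chr1", 2000, "C", "T", "GENE_A", "synonymous"),
    ("chr2", 500, "T", "C", "GENE_B", "missense"),
    ("chr3", 750, "A", "G", "GENE_C", "nonsense"),
]
# Calls from the same patient's peripheral blood (matched normal).
normal_calls = [
    ("chr2", 500, "T", "C", "GENE_B", "missense"),
]

# Assumed set of effect labels treated as nonsynonymous in this sketch.
NONSYNONYMOUS = {"missense", "nonsense"}


def variant_key(call):
    """Identify a variant by locus and alleles, ignoring annotation."""
    chromosome, position, ref, alt, _gene, _effect = call
    return (chromosome, position, ref, alt)


germline_keys = {variant_key(call) for call in normal_calls}

# Somatic: seen in the tumor but absent from the matched normal sample.
somatic_nonsynonymous = [
    call
    for call in tumor_calls
    if variant_key(call) not in germline_keys and call[5] in NONSYNONYMOUS
]
affected_genes = sorted({call[4] for call in somatic_nonsynonymous})

print(len(somatic_nonsynonymous), "variants in", len(affected_genes), "genes")
```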
Personalized medicine in cancer
Patients and physicians in the field of oncology face a growing problem: resistance to cancer therapies. Ninety percent of chemotherapy failures occur during the invasion and metastasis of cancers and are related to drug resistance. Proteogenomics can help researchers understand this resistance by exploring the significance of certain gene and protein variants in dictating treatment success.
In CRC, for example, patients often receive treatment with the monoclonal antibodies cetuximab and panitumumab (anti-EGFR drugs). A study used customized mining of RNA sequencing data from the International Genome Consortium and the Cancer Genome Atlas databases to investigate the role of variant peptides, in addition to immunoglobulin gene variations, in anti-EGFR therapy.4 The authors found that wild-type KRAS is required for anti-EGFR drug efficacy in this form of cancer, so variations in this gene may result in poor therapy response.
A paper published in Advances in Experimental Medicine and Biology used MS-based proteogenomic analysis to explore gatekeeper mutations in lung cancer.5 The researchers showed that the efficacy of tyrosine kinase inhibitors (a type of anti-cancer drug) can vary across racial groups. They emphasized the value of MS-based proteogenomic approaches because it “enables the direct analysis of mutated and fusion proteins expressed in a clinical sample”, giving drug discovery and development scientists the ability to stratify patients.
“At the Precision Therapeutics Proteogenomics Diagnostics Center, we are using a proteogenomics platform to understand why triple-negative breast cancer (TNBC) is such an aggressive disease,” says Asea. "Although slightly responsive to chemotherapy, TNBC is more difficult to treat, generally insensitive to most available hormonal or targeted therapeutic agents and depending on its stage of diagnosis, TNBC can be extremely aggressive, recurring and metastasizing more often than other subtypes of breast cancer."
Considering the advancements made in the field over recent years, researchers are excited to see where proteogenomics will take precision medicine in the future. "I would like to see advances in single cell proteogenomics in terms of both mass spectrometry imaging of single cells with clear resolution of cytoplasmic and nuclear elements, and the ability of LC-MS/MS to obtain accurate, reproducible data from a single cell", adds Asea.
Rodriguez concludes: "We are in an exciting era in which we are learning a tremendous amount about the molecular origins of cancers due to rapid advances in molecular measurement technologies...knowledge that is being translated into tangible advances in our understanding of cancer biology, resulting in more reasons than ever to be hopeful. My vision in the next 10 years is to see proteogenomics well-entrenched into the fabric of precision medicine."
1. Nesvizhskii, A. (2014). Proteogenomics: concepts, applications and computational strategies. Nature Methods, 11(11), pp.1114-1125.
2. Zhang, B., Wang, J., Wang, X., Zhu, J., Liu, Q., Shi, Z., Chambers, M., Zimmerman, L., Shaddox, K., Kim, S., Davies, S., Wang, S., Wang, P., Kinsinger, C., Rivers, R., Rodriguez, H., Townsend, R., Ellis, M., Carr, S., Tabb, D., Coffey, R., Slebos, R. and Liebler, D. (2014). Proteogenomic characterization of human colon and rectal cancer. Nature, 513(7518), pp.382-387.
3. Mun, D., Bhin, J., Kim, S., Kim, H., Jung, J., Jung, Y., Jang, Y., Park, J., Kim, H., Jung, Y., Lee, H., Bae, J., Back, S., Kim, S., Kim, J., Park, H., Li, H., Hwang, K., Park, Y., Yook, J., Kim, B., Kwon, S., Ryu, S., Park, D., Jeon, T., Kim, D., Lee, J., Han, S., Song, K., Park, D., Park, J., Rodriguez, H., Kim, J., Lee, H., Kim, K., Yang, E., Kim, H., Paek, E., Lee, S., Lee, S. and Hwang, D. (2019). Proteogenomic Characterization of Human Early-Onset Gastric Cancer. Cancer Cell, 35(1), pp.111-124.e10.
4. Woo, S., Cha, S., Bonissone, S., Na, S., Tabb, D., Pevzner, P. and Bafna, V. (2015). Advanced Proteogenomic Analysis Reveals Multiple Peptide Mutations and Complex Immunoglobulin Peptides in Colon Cancer. Journal of Proteome Research, 14(9), pp.3555-3567.
5. Nishimura, T. and Nakamura, H. (2016). Developments for Personalized Medicine of Lung Cancer Subtypes: Mass Spectrometry-Based Clinical Proteogenomic Analysis of Oncogenic Mutations. Advances in Experimental Medicine and Biology, pp.115-137.