4 Challenges in Proteome Analysis
Proteomics is a rapidly expanding field, aided by improvements in the accuracy, sensitivity, size and affordability of instrumentation. This opens up many possibilities for the growing field of personalized and precision medicine, which could see improved diagnosis, treatment and disease management in the coming years.
Proteins are the major effectors of cell functions through post-translational modifications and changes in abundance. It is well known that changes in gene expression do not always reflect changes at the protein level, so it is important to consider proteomics data to understand disease, cell and system dynamics. However, proteomics data have not been extensively used in precision medicine due to a number of confounding factors.
In this list, we will discuss some of the challenges that researchers come up against in proteome analysis.
Data Storage
As proteome analysis becomes faster, cheaper and easier, the increased accessibility of the technique inevitably leads to an increasing volume of generated data. Whilst we can store many times more data than we could even a year ago, storage capacity is struggling to keep pace with data output1,2.
Cloud storage offers one solution. However, for sensitive and confidential data there are concerns around the security of remote storage1. Companies can offer guaranteed security but, inevitably, this comes at a price. Therefore, whilst the generation of data becomes cheaper, the long-term cost of maintaining the data must also be factored in.
For other omics data, such as RNA-seq, it has been suggested that advances in technology and analysis methodologies will mean that raw sequencing data no longer need to be stored3. Discarding raw data in favour of compressed summaries could free up vast amounts of storage capacity, but it remains to be seen whether this could become a reality in proteomics any time soon.
Data Integration
One of the major challenges that must be addressed is the combination of new and existing proteome data with other valuable omics data and metadata. Only with successful integration can the data be used to its full potential in the study of systems and disease, and be translated into beneficial outcomes.
Currently, there is no optimal or standardized method for data integration, but existing methods generally take one of two approaches. Data may be combined either all at once or sequentially in a stepwise fashion, with the latter allowing prior knowledge to guide integration at a later stage.
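The two integration strategies can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions rather than an established method: the sample names, feature names, values, the `min_range` threshold and both function names are hypothetical, and the stepwise rule (shortlisting features by their spread in the first layer) simply stands in for whatever prior knowledge would guide the later step.

```python
# Hypothetical toy data: abundance values per sample, keyed by feature name.
transcripts = {
    "S1": {"gene_A": 8.1, "gene_B": 5.2, "gene_C": 9.0},
    "S2": {"gene_A": 7.9, "gene_B": 9.4, "gene_C": 8.8},
}
proteins = {
    "S1": {"gene_A": 3.0, "gene_B": 2.1},
    "S2": {"gene_A": 3.2, "gene_B": 6.5},
}


def integrate_at_once(layers):
    """Combine every layer into a single feature table per sample."""
    # Only samples measured in all layers can be combined.
    samples = set.intersection(*(set(layer) for layer in layers.values()))
    combined = {}
    for sample in sorted(samples):
        row = {}
        for layer_name, layer in layers.items():
            for feature, value in layer[sample].items():
                # Prefix with the layer name so features do not collide.
                row[f"{layer_name}:{feature}"] = value
        combined[sample] = row
    return combined


def integrate_stepwise(first, second, min_range):
    """Shortlist features using the first layer, then read only those
    features from the second layer."""
    features = set.union(*(set(values) for values in first.values()))
    shortlisted = []
    for feature in sorted(features):
        observed = [first[s][feature] for s in first if feature in first[s]]
        # Assumed rule: keep features that vary across samples in layer one.
        if max(observed) - min(observed) >= min_range:
            shortlisted.append(feature)
    return {
        sample: {f: second[sample][f] for f in shortlisted if f in second[sample]}
        for sample in second
    }


combined = integrate_at_once({"rna": transcripts, "protein": proteins})
stepwise = integrate_stepwise(transcripts, proteins, min_range=2.0)
```

In the all-at-once version every feature from every layer reaches the downstream analysis, whereas in the stepwise version only `gene_B` is carried forward to the protein layer, because it is the only feature whose transcript values differ by at least `min_range` between the two toy samples.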
Inconsistencies in annotation, in the reporting of datasets and in the outputs of analysis pipelines also represent major headaches when combining datasets. Only with the introduction of consistent, standardized processes for data collection and recording is this issue likely to ease.
The instrumentation used to analyze a proteome, the technology used to reconstruct the data and the range of abundances of the proteins within a sample can all impact the final proteome coverage. In general, however, proteome coverage tends to be poorer than that of other omics data types. The broad application of computational approaches designed for other omics types may therefore be inappropriate, and more tailored approaches may be necessary to fairly represent proteome data within the omics framework.
Mathematical models, many utilizing an array of network analysis techniques, are being applied to this problem, and Bayesian models are being tested to identify more efficient algorithms that provide a better fit across data types4.
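As a rough illustration of what uneven coverage means for integration, the short Python sketch below compares the fraction of a reference set observed in two data layers. The identifiers, the reference list and the `coverage` helper are all hypothetical; real coverage figures depend on the instrument, the sample and the processing pipeline.

```python
def coverage(detected, reference):
    """Fraction of the reference set that was observed in a dataset."""
    reference = set(reference)
    return len(set(detected) & reference) / len(reference)


# Hypothetical identifiers standing in for genes and their protein products.
reference = ["A", "B", "C", "D", "E", "F", "G", "H"]
rna_detected = ["A", "B", "C", "D", "E", "F", "G"]
protein_detected = ["A", "C", "F"]

rna_coverage = coverage(rna_detected, reference)
protein_coverage = coverage(protein_detected, reference)

# Only features seen in both layers can be compared directly.
shared = sorted(set(rna_detected) & set(protein_detected))
```

A method that assumes near-complete measurements would, in this toy case, be working with only the three shared features, which is one reason approaches tailored to sparser proteome data may be needed.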
Data Robustness and Standardization for Biomarker Identification
The extraction of relevant and reliable protein targets from high-throughput proteomic data is one of the main challenges for biomarker identification.
At a basic level, proteins that are differentially expressed between two different sample types, for example disease versus non-disease, can be identified. However, more sophisticated methods employing machine learning and network analysis are becoming more popular and have been used successfully to identify biomarkers for heart failure5 and some cancer types6-12.
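The basic comparison of two sample types can be sketched as follows. This is a minimal example under stated assumptions, not a method taken from the cited studies: the protein names and log2 intensity values are invented, an exact permutation test is used as one possible choice of test, and the Benjamini-Hochberg adjustment and the 0.05 cut-off are illustrative defaults.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    count = 0
    # Enumerate every way of relabelling the pooled samples.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log2 intensities, five samples per group.
disease = {
    "protein_A": [20.1, 20.3, 19.9, 20.2, 20.0],
    "protein_B": [15.8, 16.1, 15.9, 16.3, 16.0],
    "protein_C": [17.2, 17.0, 17.4, 17.1, 17.3],
}
control = {
    "protein_A": [20.0, 20.2, 20.1, 19.9, 20.3],
    "protein_B": [13.9, 14.2, 14.0, 14.1, 13.8],
    "protein_C": [17.1, 17.3, 17.0, 17.2, 17.5],
}

names = sorted(disease)
# On log2 data, the difference in means is the log2 fold change.
log2_fc = {n: mean(disease[n]) - mean(control[n]) for n in names}
p_values = [permutation_p_value(disease[n], control[n]) for n in names]
q_values = benjamini_hochberg(p_values)
significant = [n for n, q in zip(names, q_values) if q < 0.05]
```

With these toy values only `protein_B` is flagged. Exhaustive enumeration is practical only for small groups, since the number of relabellings grows quickly with sample size; larger studies would typically sample permutations or use a parametric test instead.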
High-throughput proteomic data, however, can contain large numbers of noisy, irrelevant features that mask true indicators. When coupled with the inherent heterogeneity of biological samples, this can make it very challenging to isolate robust, relevant biomarkers. It is hoped that the incorporation of other data types and known information13 may help researchers to make more informed biomarker selections that will withstand interrogation across larger sample sets.
One such integrative analysis approach, which bridges the gap between discovery proteomics and targeted proteomics to generate hypothesis-driven candidate biomarkers for melanoma, is showing promising results14.
Unifying Data Repositories
Whilst the amount of proteomic data being generated is always growing, the amount held in catalogued, publicly available repositories does not reflect this influx. The lack of a unified central point for data sharing adds further complications for researchers hoping to mine existing studies and place their own data in a broader context. A plethora of data repositories exist15, some with restricted access whilst others are freely available. Given the financial burden of drug development and the route to market, more coherent coordination of data that could guide clinical drug trials will play an important role in improving future success rates.
In response to these issues, the ProteomeXchange consortium has been developed to unify proteome data from a host of public repositories in a coordinated manner. ProteomeXchange also aims to provide backup for the contributing resources in case of financial difficulties, so that valuable data are not lost.
1 Sousa JS, Lefebvre C, Huang Z, Raisaro JL, Aguilar-Melchor C, Killijian MO, Hubaux JP. Efficient and secure outsourcing of genomic data storage. BMC Med Genomics. 2017 Jul 26;10(Suppl 2):46.
2 Check Hayden E. Genome researchers raise alarm over big data. Nature News. doi:10.1038/nature.2015.17912.
3 Stephens ZD, Lee SY, Faghri F, Campbell RH, Zhai C, Efron MJ, Iyer R, Schatz MC, Sinha S, Robinson GE. Big Data: Astronomical or Genomical? PLoS Biol. 2015 Jul 7;13(7):e1002195.
4 Bersanelli M, Mosca E, Remondini D, Giampieri E, Sala C, Castellani G, Milanesi L. Methods for the integration of multi-omics data: mathematical aspects. BMC Bioinformatics. 2016 Jan 20;17(Suppl 2):15.
5 Willingale R, Jones DJL, Lamb JH, et al. Searching for biomarkers of heart failure in the mass spectra of blood plasma. Proteomics 2006;6(22):5903–5914.
6 ZhangChen F, Wang JM, et al. A neural network approach to multi-biomarker panel discovery by high-throughput plasma proteomics profiling of breast cancer. BMC Proc 2013;7:S10.
7 Rogers M, Clarke A, Noble PJ, et al. Proteomic profiling of urinary proteins in renal cancer by surface enhanced laser desorption ionization and neural-network analysis. Cancer Res 2003;63(20):6971–83.
8 Chen Y, Zheng S, Yu J, et al. Artificial neural networks analysis of surface-enhanced laser desorption/ionization mass spectra of serum protein pattern distinguishes colorectal cancer from healthy population. Clin Cancer Res 2004;10(24):8380–85.
9 Luk JM, Lam BY, Lee NPY, et al. Artificial neural networks and decision tree model analysis of liver cancer proteomes. Biochem Biophys Res Commun 2007;361(1):68–73.
10 Ward DG, Suggett N, Cheng Y, et al. Identification of serum biomarkers for colon cancer by proteomic analysis. Br J Cancer 2006;94(12):1898–905.
11 Ostroff RM, Mehan M, Stewart RA, et al. Early detection of malignant pleural mesothelioma in asbestos-exposed individuals with a noninvasive proteomics-based surveillance tool. PLoS One 2012;7:e46091.
12 Petricoin EF, Ardekani AM, Hitt BA, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet 2002;359(9306):572–7.
13 Giudice G, Petsalaki E. Proteomics and phosphoproteomics in precision medicine: applications and challenges. Brief Bioinform. 2017 Oct 25 [Epub ahead of print].
14 Kawahara R, Meirelles GV, Heberle H, Domingues RR, Granato DC, Yokoo S, Canevarolo RR, Winck FV, Ribeiro AC, Brandão TB, Filgueiras PR, Cruz KS, Barbuto JA, Poppi RJ, Minghim R, Telles GP, Fonseca FP, Fox JW, Santos-Silva AR, Coletta RD, Sherman NE, Paes Leme AF. Integrative analysis to select cancer candidate biomarkers to targeted validation. Oncotarget. 2015 Dec 22;6(41):43635-52.
15 Perez-Riverol Y, Alpi E, Wang R, Hermjakob H, Vizcaíno JA. Making proteomics data accessible and reusable: current state of proteomics databases and repositories. Proteomics. 2015 Mar;15(5-6):930-49.