Using Library-Based Approaches To Increase Depth and Accuracy of Proteome Profiling
The field of proteomics aims to advance the techniques and strategies used to identify and quantify the proteins within a proteome. It plays a crucial role in both economic and scientific fields because proteins matter in three main ways. In the pharmaceutical industry, most biopharmaceutical products are made from proteins; in medicine, in-depth characterization of protein anomalies through molecular diagnosis could lead to novel therapeutic interventions; and, more broadly, proteins are the products of cellular machinery, making them molecules of interest in many other industries.1
Analyzing proteins or proteomes is challenging, however, as widely available techniques do not provide enough data to identify a proteome in its entirety. Even though techniques such as mass spectrometry (MS) and liquid chromatography (LC) have made the most substantial contributions to the field, the data they produce are still limited. This is partly because analytical challenges, such as sample loss and differences in biological activity (protein expression) between samples, make it harder to detect and quantify proteins and peptides. To circumvent this problem, researchers complement these techniques with other methods, such as bioinformatics analysis, chemometric analysis, and mathematical modeling, to identify and quantify these proteins.
This article discusses how library-based approaches in quantitative proteomics can increase the sensitivity and accuracy of such detection systems.
Challenges in proteome analysis
Typically, proteomic analysis is done using proteins that have already been broken down by enzymatic digestion (bottom-up, shotgun, and middle-down proteomics).1 In this scenario, it can be difficult to convert the datasets generated by these techniques into tangible peptide-spectrum matches (PSMs), which are used to identify the peptides, and then the proteins, present in the proteome.
Even the available datasets tend to be incomplete, as peptides are lost during enzymatic digestion and purification or cannot be recognized by the detection system, leaving gaps in the dataset. This in turn leads to inadequate sequence coverage, which limits the structural and functional analysis of these peptides.2 The complexity of the proteome also affects data generation: peptide detection is stochastic, which reduces the sampling depth.3 Methods like multi-step fractionation and shotgun proteomics can help overcome these issues, but they might increase variability between samples and have trouble differentiating between proteoforms.4
There are also several other challenges, including the inability to measure low-abundance proteins due to a lack of highly sensitive instruments, long data transfer and processing timelines, and the need for robust database search algorithms. As peptide loss is a common issue, there is a pressing need for instruments that can identify peptides with confidence even at very low concentrations, to prevent significant waste of time and resources. All these factors can also increase the false discovery rate (FDR) of these techniques, underscoring the need for a more robust and accurate process.
Moreover, the need for high throughput and commercialization also requires the standardization of analytical workflows for peptide analysis. For example, standardized workflows now make it possible to analyze thousands of genomes simultaneously in a shorter time span, and proteomics needs a comparable approach.2,5
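To make the idea of sequence coverage concrete, the following minimal Python sketch computes the fraction of a protein's residues covered by at least one identified peptide. The function name, the protein sequence and the peptide list are all hypothetical and chosen only for illustration; real pipelines also have to handle modifications and peptides shared between proteins.

```python
def sequence_coverage(protein: str, peptides: list[str]) -> float:
    """Return the fraction of protein residues covered by at least one peptide."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        # Mark every occurrence of the peptide, including overlapping ones.
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# Hypothetical 28-residue sequence and two identified peptides.
protein = "MKWVTFISLLLLFSSAYSRGVFRRDTHK"
identified = ["WVTFISLLLLFSSAYSR", "DTHK"]

coverage = sequence_coverage(protein, identified)
print(f"sequence coverage: {coverage:.0%}")
```

In this toy case, the peptides that were lost or not detected (for example "GVFR") leave a quarter of the sequence uncovered, which is the kind of gap described above.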
Solving the data analysis bottleneck
One way to solve the data analysis bottleneck would be to connect the detection system with real-time analysis software that handles the entire workflow, including quantification. Parallel search engine in real-time (PaSER) is a GPU-powered database search platform that can be integrated with detection systems like MS to allow the simultaneous detection of peptides as the samples are processed (Figure 1).
The main intention is to identify peptides in the samples using established algorithms6 complemented by machine learning models that compare each detected peptide's collision cross-section (CCS) value with the values in the database. The CCS value reflects the shape, size and charge of the ion in the gas phase. As each peptide has a specific CCS value at a given charge state, the model compares the predicted value with the experimental data to determine the peptide's identity. Trapped ion mobility spectrometry (TIMS) generates a CCS value for each analyte as it analyzes the samples, and because CCS is an intrinsic property of the analyte, this value can be measured consistently. This feature makes the technique highly reproducible, adding a layer of standardization to proteomics.
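The comparison between a measured CCS value and predicted values can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how PaSER implements it: the function names, the 3% tolerance and the CCS values are all hypothetical.

```python
def relative_ccs_deviation(measured: float, predicted: float) -> float:
    """Signed deviation of the measured CCS from the prediction, as a fraction."""
    return (measured - predicted) / predicted


def closest_ccs_match(measured, predicted_by_peptide, tolerance=0.03):
    """Return (peptide, |deviation|) for the closest prediction.

    The peptide is None when even the closest prediction lies outside
    the tolerance (an assumed cut-off of 3% by default).
    """
    best_peptide, best_dev = None, float("inf")
    for peptide, predicted in predicted_by_peptide.items():
        dev = abs(relative_ccs_deviation(measured, predicted))
        if dev < best_dev:
            best_peptide, best_dev = peptide, dev
    if best_dev <= tolerance:
        return best_peptide, best_dev
    return None, best_dev


# Hypothetical predicted CCS values (square angstroms) for three candidates
# at the same charge state; none of these numbers come from real data.
candidates = {"PEPTIDEK": 365.0, "PEPTLDEK": 372.5, "EDITPEPK": 401.2}

peptide, deviation = closest_ccs_match(371.8, candidates)
print(peptide, f"{deviation:.2%}")
```

Using a relative rather than absolute deviation is a design choice made here so that one tolerance can apply across small and large ions.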
Figure 1: A CCS-enabled database search including TIMScore as an additional dimension. Credit: Bruker Daltonics.
Traditional search algorithms usually rely on precursor and fragment ion spectra to determine the best fit and assign a probability score on that basis. The output reports only one result per spectrum, even when other candidate PSMs fit almost as well or marginally better, so a single reported PSM can hide several plausible alternatives. The lack of a robust way to discriminate between these candidates increases the FDR over time and decreases the reliability of database search results.
With PaSER, that issue can be avoided, as the model is trained heavily on tryptic and phosphorylated peptides, including their doubly, triply and quadruply charged states. Phosphorylated peptides are included because phosphorylation is among the most prevalent post-translational modifications (PTMs) and has strong biological significance. The model can accurately identify a peptide by predicting the CCS value from the primary amino acid sequence and measuring the deviation between the predicted and experimental CCS values. This approach has a 95% accuracy level for tryptic peptides and a 92% confidence level for phosphorylated tryptic peptides (Figure 2).
Figure 2: Scatter plots of the predicted ion mobility (CCS) values from the machine-learned model and the experimentally derived values for tryptic (A) and phosphorylated peptides (B). Credit: Bruker Daltonics.
As analysts complete the peptide run, the scoring algorithm can be deployed along with machine learning to generate a predicted CCS value. A correlation score is generated for the five best-fit predictions for each spectrum, based on the comparison between the predicted and measured CCS values. Because each peptide can be scored in three dimensions, as opposed to two in non-CCS-enabled algorithms, the approach achieves a 1% FDR. This capability increases confidence in the results, as a greater profiling depth can be achieved and a greater number of peptides identified (Figure 3).
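As an illustration of how a CCS dimension can change the ranking of candidate matches, the Python sketch below rescores the five best candidates for one spectrum. It is not the TIMScore algorithm: the Gaussian CCS score, its width (`sigma`), the additive weighting and every value in the example are assumptions made for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    peptide: str
    search_score: float   # hypothetical spectrum-match score, higher is better
    predicted_ccs: float  # hypothetical model prediction, square angstroms


def ccs_score(measured: float, predicted: float, sigma: float = 0.02) -> float:
    """Score in (0, 1] that decays as the relative CCS deviation grows."""
    rel_dev = (measured - predicted) / predicted
    return math.exp(-0.5 * (rel_dev / sigma) ** 2)


def rescore(candidates, measured_ccs, top_n=5, weight=1.0):
    """Keep the top_n candidates by search score, then rank them by a
    combined score that adds the weighted CCS agreement."""
    shortlist = sorted(candidates, key=lambda c: c.search_score, reverse=True)[:top_n]
    scored = [
        (c.search_score + weight * ccs_score(measured_ccs, c.predicted_ccs), c)
        for c in shortlist
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Invented candidates for one spectrum; the first has the best spectrum
# score but a predicted CCS far from the measured value.
candidates = [
    Candidate("EDITPEPK", 3.10, 410.0),
    Candidate("PEPTLDEK", 3.05, 372.0),
    Candidate("PEPTIDEK", 2.40, 371.0),
]

ranked = rescore(candidates, measured_ccs=371.8)
for score, cand in ranked:
    print(f"{cand.peptide}: {score:.3f}")
```

In this toy example, the candidate with the highest spectrum score drops to last place once CCS agreement is taken into account, which is the kind of ambiguity a third scoring dimension is meant to resolve.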
Figure 3: Sequence coverage of tryptic and phosphorylated peptides is doubled when TIMScore is deployed, indicating a higher profiling depth than standard techniques available.7 Credit: Bruker Daltonics.
To improve the entire peptide analytical workflow, there is a need for an integrated solution that combines data generation with data processing capabilities, reducing the time for analysis and increasing the accuracy of the results. PaSER can be combined with acquisition techniques like data-independent acquisition (DIA) to increase depth and quantitative accuracy through additional separation of the convoluted precursor and fragment ion space.8
A 2019 study introduced a software tool, DIA-NN, that leverages deep neural networks and interference-correction strategies to differentiate between real peptide signals and noise. In typical DIA-MS analysis, each precursor gives rise to multiple chromatograms because of the number of fragment ions generated. As co-fragmenting precursors tend to interfere with the peptide signal, the resulting chromatogram can be inaccurate or too noisy to analyze. The DIA-NN software uses a peptide-centric approach that matches annotated precursors and their fragment ions to those in the chromatogram. The software first generates negative controls based on the input provided (through a spectral library or in silico analysis of a protein sequence) and identifies putative elution peaks for the target precursors and these controls. It calculates 73 peak scores and determines the best candidate peak for each precursor, generating a single score for this peak and allowing accurate identification of these precursors and peptides.3
The DIA approach was further adapted to include parallel accumulation-serial fragmentation (PASEF), resulting in the dia-PASEF method. It uses data from the TIMS device, where the ion mobility dimension allows the differentiation of peptide signals that are usually co-fragmented.9 This yields a two- to five-fold improvement in sensitivity by stacking precursor ion isolation windows in the ion mobility dimension, which increases the duty cycle. Studies have found that it increases proteomic depth by 69%: one study quantified 5,200 proteins from 10 ng of HeLa peptides separated with a 95-minute nanoflow gradient, and another quantified 5,000 proteins from 200 ng using a 4.8-minute separation on a standardized proteomics platform. The method could also detect 11,700 proteins in single runs of complex mixtures acquired with a 100-minute nanoflow gradient.7
Knowledge in the field of proteomics is expanding thanks to recent technological advancements. However, methods considered the gold standard a decade ago do not necessarily provide the entire picture. For example, in most proteomic analyses, it is possible to detect proteins, gain insight into the kinds of peptides they are composed of, and understand the structural and functional aspects of those proteins. Even so, it has been challenging to map the true biology of a protein because the profiling depth has been relatively low.
With new technologies that combine detection and analysis using MS and library-based approaches, greater profiling depth can be achieved. These technologies also remove the need for manual data analysis, as the instruments use a run-and-done method that analyzes the data as it is generated. In turn, this allows scientists to gain a more comprehensive insight into the constitution of their samples in a shorter period of time and with greater accuracy. Future deployment of this method for protein analysis could have significant implications for medicine, biotechnology and proteomics at large.
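The peptide-centric logic described above (score each candidate peak, keep the best one per precursor, and use negative controls to estimate error) can be sketched in Python as follows. This is a simplified illustration, not DIA-NN's implementation. A plain weighted sum stands in for its neural-network scoring, three scores stand in for the 73, and the FDR estimate is the common decoy-to-target ratio. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Precursor:
    name: str
    is_decoy: bool                      # True for a negative control
    candidate_peaks: list[list[float]]  # one score vector per putative peak


def combined_score(scores, weights):
    """Collapse a vector of peak scores into a single number."""
    return sum(s * w for s, w in zip(scores, weights))


def best_peak_score(precursor, weights):
    """Score of the best candidate elution peak for one precursor."""
    return max(combined_score(peak, weights) for peak in precursor.candidate_peaks)


def accepted_targets(precursors, weights, max_fdr=0.01):
    """Names of target precursors above the deepest score cut-off at which
    the estimated FDR (decoys / targets) stays within max_fdr."""
    ranked = sorted(
        ((best_peak_score(p, weights), p) for p in precursors),
        key=lambda pair: pair[0],
        reverse=True,
    )
    targets = decoys = cutoff = 0
    names = []
    for _, precursor in ranked:
        if precursor.is_decoy:
            decoys += 1
        else:
            targets += 1
            names.append(precursor.name)
        if targets and decoys / targets <= max_fdr:
            cutoff = len(names)
    return names[:cutoff]


# Invented example with three scores per peak instead of 73.
weights = [1.0, 1.0, 1.0]
precursors = [
    Precursor("T1", False, [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3]]),
    Precursor("T2", False, [[0.6, 0.7, 0.5]]),
    Precursor("D1", True, [[0.3, 0.2, 0.4]]),
    Precursor("T3", False, [[0.1, 0.2, 0.1]]),
]

accepted = accepted_targets(precursors, weights)
print(accepted)
```

Here the decoy outscores the weakest target, so that target is rejected at a 1% threshold; with only a handful of precursors the estimate is crude, and real analyses rely on many thousands of targets and decoys.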
- Batiston WP, Carrilho E. The importance and challenges for analytical chemistry in proteomics analysis. Braz J Anal Chem. 2021;8(31):51-73. doi: 10.30744/brjac.2179-3425.RV-64-2020
- Snapkov I, Chernigovskaya M, Sinitcyn P, Lê Quý K, Nyman TA, Greiff V. Progress and challenges in mass spectrometry-based analysis of antibody repertoires. Trends Biotechnol. 2022;40(4):463-481. doi: 10.1016/j.tibtech.2021.08.006
- Demichev V, Messner CB, Vernardis SI, Lilley KS, Ralser M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat Methods. 2020;17(1):41-44. doi: 10.1038/s41592-019-0638-x
- Pauwels J, Gevaert K. Mass spectrometry-based clinical proteomics – a revival. Expert Rev Proteomics. 2021;18(6):411-414. doi: 10.1080/14789450.2021.1950536
- Campbell M. 5 key challenges in proteomics, as told by the experts. Technology Networks. https://www.technologynetworks.com/proteomics/lists/5-key-challenges-in-proteomics-as-told-by-the-experts-321774. Published July 16, 2019. Accessed November 3, 2022.
- Xu T, Park SK, Venable JD, et al. ProLuCID: An improved SEQUEST-like algorithm with enhanced sensitivity and specificity. J Proteomics. 2015;129(3):16-24. doi: 10.1016/j.jprot.2015.07.001
- Ogata K, Chang CH, Ishihama Y. Effect of phosphorylation on the collision cross sections of peptide ions in ion mobility spectrometry. Mass Spectrom (Tokyo). 2021;10(1):A0093-A0093. doi: 10.5702/massspectrometry.A0093
- Demichev V, Szyrwiel L, Yu F, et al. dia-PASEF data analysis using FragPipe and DIA-NN for deep proteomics of low sample amounts. Nat Commun. 2022;13(1):3944. doi: 10.1038/s41467-022-31492-0
- Meier F, Brunner AD, Frank M, et al. diaPASEF: Parallel accumulation–serial fragmentation combined with data-independent acquisition. Nat Methods. 2020;17(12):1229-1236. doi: 10.1038/s41592-020-00998-0