Cinderella and Mass Spectrometry: Insightful Multi-Dimensional Data Analysis
Blog May 02, 2018 | by David Chiang, Chairman and Co-Founder of Sage-N-Research
Credit: Sage-N Research
Proteomics is a powerful technology for analyzing low-abundance proteins for disease research. But it’s stalled by imprecise and often irreproducible data analysis. Few researchers can identify these proteins with confidence. Those who can will achieve breakthroughs.
Here we explain our powerfully simple idea: Use multidimensional separation — already applied chemically in chromatography — to numerically filter correct peptide IDs from a search engine’s guesses, particularly with data-independent acquisition (DIA) data.
Tandem mass spectrometers (MS/MS), like particle colliders and space telescopes, produce big datasets with dynamic range that spans orders of magnitude. MS/MS biomolecule analysis is really closer to physics than traditional chemistry. While physics uses powerful servers to mine deep data for needle-in-a-haystack discoveries, proteomics is trapped in the shallows by simple PC programs that calculate subjective probabilistic scores. Expecting physics-level precision without physics-level IT is simply wishful thinking.
Prevalent data analysis uses a binomial probability (i.e., drawing colored balls from a bag) to model fragment ion signals that are not independent and identically distributed (probability’s “IID” requirement), which injects random uncertainty into physical mass/charge (m/z) data. Different software uses different probabilities (6% vs. 10%) for a matched fragment; all models qualitatively agree on easy ‘yes’ and ‘no’ answers but differ in between, where it matters most. Many labs treat analysis software as a black box and choose a loose one, like a player seeking the loosest slot machine, that reports the most IDs. Popular PC programs can be coaxed to identify 15% more IDs than is realistic, an impossible high-water mark for any rigorous software. They contribute to irreproducibility.
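To make the 6% vs. 10% point concrete, here is a minimal Python sketch of the binomial "balls from a bag" calculation. It is not taken from any particular search engine: the fragment count of 30 and the 0.01 cutoff are hypothetical values chosen only for illustration.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more random matches."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


N_FRAGMENTS = 30  # hypothetical number of predicted fragments for one peptide
CUTOFF = 0.01     # hypothetical acceptance cutoff on the tail probability


def verdict(k: int, p: float) -> str:
    """'yes' if k matches look non-random under per-fragment match probability p."""
    return "yes" if binomial_tail(N_FRAGMENTS, k, p) < CUTOFF else "no"


for k in (2, 7, 20):
    print(f"{k:2d} matches: 6% model -> {verdict(k, 0.06)}, 10% model -> {verdict(k, 0.10)}")
```

Under these assumed settings, both models say 'no' at 2 matches and 'yes' at 20 matches, but they split at 7 matches. The answer in the middle depends on a modeling choice, not on the measured m/z values.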
Here we illustrate how to produce precise and reproducible results by comparing 2D vs. 1D data analysis of matched fragment ions — the foundation of MS/MS DIA molecular identification — starting with first principles.
Peptides and proteins are physical objects with true identities that can’t be discerned with MS/MS alone. We discovered this remarkably simple abstraction: A high-sensitivity search engine guesses many peptide ID hypotheses from a mass spectrum. A high-specificity multidimensional filter uses physical parameters to accept a small number of hypotheses as high-likelihood peptide IDs. For example, intuition suggests a peptide with >20 fragment ion matches at <0.01 average m/z error is likely correct; a scatter plot proves and extends this intuition. Note the search engine’s inherent subjectivity is irrelevant as long as it is sensitive enough to include the correct peptide among its guesses.
Mass Spec Identification: A Cinderella Story
To appreciate mass spec’s informational asymmetry, consider its parallel to the Cinderella story. If the shoe doesn’t fit, it’s surely not her. But if it fits, we don’t know whether it’s her or a random girl.
So MS/MS identification is akin to identifying Cinderella in a sizable city using one shoe (precursor mass) plus a full wardrobe (many fragment m/z’s). The concept is nothing more than this: a girl is likely our quarry if she is an outlier in terms of both the number and the tightness of garments that fit.
An MS/MS peptide ID hypothesis is likely correct to the extent it is an outlier in both the number and closeness of matching m/z’s, period.
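The two quantities in that statement can be computed directly from the peak lists. The Python sketch below is one simple way to do it, not the SORCERER implementation: the function name, the nearest-peak matching rule, and the 0.02 m/z tolerance are all assumptions made for illustration, and the m/z values are made-up numbers.

```python
def match_stats(observed_mz, predicted_mz, tolerance=0.02):
    """Return (peak_count, avg_abs_delta) for one peptide hypothesis.

    Each predicted fragment is compared to its nearest observed peak and
    counted as a match if the two are within `tolerance` (in m/z units).
    Simplification: one observed peak may match several predicted fragments.
    """
    deltas = []
    for predicted in predicted_mz:
        if not observed_mz:
            break
        nearest = min(observed_mz, key=lambda obs: abs(obs - predicted))
        delta = abs(nearest - predicted)
        if delta <= tolerance:
            deltas.append(delta)
    peak_count = len(deltas)
    avg_delta = sum(deltas) / peak_count if peak_count else None
    return peak_count, avg_delta


# Made-up peak lists, for illustration only.
observed = [175.119, 262.151, 375.235, 500.400]
predicted = [175.118, 262.150, 375.238, 446.275]
print(match_stats(observed, predicted))
```

The first number is how many garments fit; the second is how tightly they fit. A hypothesis is interesting only when it stands out on both.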
Fundamentally, confidence can never reach 100 percent due to possible random matches, but it increases asymptotically with each closely matched fragment m/z.
Longer peptides (with more matchable fragments) allow higher confidence identification. Longer peptides are also part of fewer proteins; a long-enough one is unique to its protein. Finally, with many matched fragments, a precise precursor mass becomes less critical — very important for DIA analysis.
A natural strategy emerges to analyze any low-abundance protein: Try to capture at least one protein-unique peptide using DIA, which would be designated the surrogate for its “one-hit wonder” protein for both identification and relative quantitation. This eliminates statistical imprecision from inferring a protein from multiple peptides. Besides, it may be next to impossible to capture more than one peptide from very low abundance proteins.
A Sensitive Search Engine is Not Enough
MS/MS by nature does not identify a molecule per se, but rather reports fragments to be compared to a hypothesis. We can view peptide identification as a crossword puzzle (peptide) with numerical clues (fragment m/z’s). Most people solve a crossword by gross-guessing words and then seeing if any one fits exceptionally well.
The same abstraction applies. A high-sensitivity search engine gross-guesses many peptide hypotheses, the more the better, using a subjective criterion (search score). A high-specificity filter accepts at most one as the correct peptide ID for its spectrum. For informational integrity, the filtering criteria should both use physical parameters and be different from the search score.
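The guess-then-filter split can be sketched in a few lines of Python. This is an illustrative sketch only: the `Hypothesis` fields, the tie-breaking rule, and the peptide labels are hypothetical, and the thresholds simply reuse the ">20 matches at <0.01 average m/z error" intuition mentioned earlier rather than any tuned values.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    spectrum_id: str
    peptide: str
    search_score: float  # produced by the search engine; ignored by the filter
    peak_count: int      # number of matched fragment ions
    avg_delta: float     # average fragment m/z error


def filter_ids(hypotheses, min_peaks=20, max_delta=0.01):
    """Accept at most one hypothesis per spectrum, using physical parameters only."""
    accepted = {}
    for h in hypotheses:
        if not (h.peak_count > min_peaks and h.avg_delta < max_delta):
            continue
        best = accepted.get(h.spectrum_id)
        # Assumed tie-break: more matched peaks first, then smaller error.
        if best is None or (h.peak_count, -h.avg_delta) > (best.peak_count, -best.avg_delta):
            accepted[h.spectrum_id] = h
    return accepted


guesses = [
    Hypothesis("s1", "PEPTIDE_A", 3.1, 25, 0.004),
    Hypothesis("s1", "PEPTIDE_B", 4.0, 9, 0.030),   # higher search score, poor physical fit
    Hypothesis("s2", "PEPTIDE_C", 2.0, 12, 0.020),  # nothing acceptable for this spectrum
]
for spectrum_id, h in filter_ids(guesses).items():
    print(spectrum_id, h.peptide)
```

Note that the filter never reads `search_score`. In this toy input, the hypothesis with the highest search score for spectrum s1 is rejected, and spectrum s2 gets no ID at all.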
We can see why, for simple benchmarks with clean data, almost any search engine would identify almost all the peptides. But for noisy spectra, it requires a compute-intensive, cross-correlation search engine to include the true peptide among its guesses. Unfortunately, current workflows use imprecise filters that unwittingly suppress low-abundance peptides. That’s why they are rarely identified even in workflows using a sensitive search engine.
Data-Driven Means All Data and No Models
To illustrate physical multidimensional data mining, we used one DIA data file from an infected sample run on a Thermo Scientific Q Exactive HF (courtesy of Dr. Nicole Kruh-Garcia, Colorado State University). The 3 GB file was searched overnight (mass-tolerant, target-decoy, no modifications) on a SORCERER™ iDA, keeping the top 100 results for each search. Four million peptide ID hypotheses for 29K unique spectra were produced.
Peptide identification means accepting perhaps a few thousand peptide IDs among 4M hypotheses. How? We look for visual outliers in a 2D scatter plot.
Figure 1: The “Peakcount vs. Average Fragment Delta-Mass” for part of the 4M hypotheses. Both target (green) and decoy (black) hypotheses are shown. (Jitter was added to Peakcount to spread out integer values for visualization.)
In Figure 1, we can clearly see a region of high-confidence IDs (mostly green) and a region of mostly random hypotheses (mixed green/black), separated by a transition band. These correspond to ‘yes’, ‘no’, and ‘maybe’, respectively.
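One way to put a number on "mostly green" is to count targets and decoys inside a candidate region of the scatter plot. The Python sketch below uses a rectangular region and the common decoys-per-target ratio as a rough error estimate. The function name, the region shape, and the data points are assumptions for illustration; the points are made up and are not values from Figure 1.

```python
def region_summary(hypotheses, min_peaks, max_delta):
    """Count targets and decoys with peak_count >= min_peaks and avg_delta <= max_delta.

    Each hypothesis is a (peak_count, avg_delta, is_decoy) tuple.
    Returns (targets, decoys, decoys_per_target); the ratio is None with no targets.
    """
    targets = decoys = 0
    for peak_count, avg_delta, is_decoy in hypotheses:
        if peak_count >= min_peaks and avg_delta <= max_delta:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    ratio = decoys / targets if targets else None
    return targets, decoys, ratio


# Made-up points: (peak_count, avg_delta, is_decoy)
toy = [
    (25, 0.004, False), (22, 0.006, False), (30, 0.008, False), (21, 0.005, False),
    (24, 0.007, True),
    (9, 0.030, False), (10, 0.035, False), (8, 0.025, True), (11, 0.040, True),
]
print(region_summary(toy, min_peaks=20, max_delta=0.01))
```

Moving the region's edges and watching the decoy count change is a numerical counterpart to judging the green and black regions by eye.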
The basis of SorcererScore™ is model-free numerical filtering of search engine results. Its first generation (Chiang 2016), for data-dependent acquisition data, uses four dimensions. The second generation will be optimized for DIA data using the same principles.
Figure 2 shows these two parameters as 1D distributions, separately for targets and decoys, for the top-score hypotheses. It’s easy to see that one-dimensional scores lose both information and precision important for low-abundance peptides.
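A toy calculation shows why separate 1D cutoffs lose IDs that a 2D rule keeps. The points below are made up for illustration (they are not from the data file above), and the cutoffs are hand-picked so that each rule admits zero decoys.

```python
# Made-up hypotheses: (peak_count, avg_delta, is_decoy)
POINTS = [
    (25, 0.004, False), (22, 0.006, False), (18, 0.003, False), (30, 0.008, False),
    (24, 0.030, True), (8, 0.004, True), (12, 0.020, True), (26, 0.025, True),
]


def accepted(rule):
    """Return (targets, decoys) passing rule(peak_count, avg_delta)."""
    targets = sum(1 for pc, d, decoy in POINTS if rule(pc, d) and not decoy)
    decoys = sum(1 for pc, d, decoy in POINTS if rule(pc, d) and decoy)
    return targets, decoys


# Strictest 1D cutoffs needed to exclude every decoy in this toy set.
count_only = accepted(lambda pc, d: pc > 26)
delta_only = accepted(lambda pc, d: d < 0.004)
# A 2D rule that requires both many matches and tight matches.
both = accepted(lambda pc, d: pc >= 15 and d <= 0.01)

print("peak count only:", count_only)  # (1, 0)
print("delta-mass only:", delta_only)  # (1, 0)
print("2D rule:        ", both)        # (4, 0)
```

In this toy set, each 1D cutoff must be tightened until only one target survives, because a single decoy happens to score well on that one axis. The 2D rule keeps all four targets with no decoys, since no decoy scores well on both axes at once.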
In conclusion, we illustrate SorcererScore’s simple abstraction for deep proteomics: A sensitive search engine guesses ID hypotheses; a high-specificity filter accepts at most one ID per spectrum. A long peptide ID identifies the protein. Akin to medical X-ray interpretation, SorcererScore uses visual cues, not probability models, for semi-interactive analysis. Precise data-driven analysis means no complex statistical modeling. The same principle can be applied to identify other biomolecules whose fragments can be readily predicted.
Reference: Chiang D (2016) How to Identify Low-Abundance Modified Peptides with Proteomics Mass Spectrometry. MOJ Proteomics Bioinform 4(5): 00133. DOI: 10.15406/mojpb.2016.04.00133