Machine Learning Helps Achieve Greater Depth in Proteome Analysis
Over recent years, we have seen an increase in the application of artificial intelligence (AI)-based methods – such as machine learning – across a variety of biological disciplines.
Proteomics is a field of research that offers unparalleled insights into cellular biology, with potential applications spanning modern medicine, food science, agriculture and systems biology more generally. Over the last decade, the proteomics research field has advanced rapidly.
We can now study more proteins than ever before using increasingly small sample sizes, at a higher speed and with increased sensitivity. Such sophistication is attributed to innovations in analytical technologies, such as mass spectrometry (MS). But how can scientists go deeper still in their proteome analysis?
Earlier this year, Technology Networks spoke with Rohan Thakur, executive vice president of Life Science Mass Spectrometry at Bruker Daltonics, on how Bruker is helping researchers to “raise the bar” and achieve new heights in proteomics.
Since that conversation, Bruker launched its novel CCS-enabled TIMScore™ algorithm, which can be utilized on the timsTOF Pro 2, timsTOF HT, timsTOF SCP and timsTOF fleX systems, in addition to its TIMS DIA-NN 4D-Proteomics™ Software.
To understand how machine learning methods and new software capabilities are helping proteomics researchers gain greater depth in their analyses, Technology Networks recently interviewed Tharan Srikumar, product manager in bioinformatics at Bruker Daltonics. In this interview, Srikumar explains how the novel TIMScore algorithm works to overcome challenges in analyzing tryptic and phosphorylated peptides, discusses the capabilities of TIMS DIA-NN 4D-Proteomics software and talks about improving the efficiency of proteomics workflows.
Molly Campbell (MC): Can you explain for our readers how the TIMScore algorithm was developed with Bruker’s customers in mind?
Tharan Srikumar (TS): Over the last few years, we've proven that the timsTOF technology works well on the hardware side of things, and we've been able to exploit the hardware with several different acquisition methods. For example, for shotgun proteomics approaches, the standard data-dependent acquisition (DDA) parallel accumulation serial fragmentation (PASEF®) method was developed. For data-independent acquisition (DIA), we developed dia-PASEF®, and more recently, we've added parallel reaction monitoring (PRM) prm-PASEF® as an acquisition method for targeted applications. What we didn't have was a complementary software solution that fully leverages TIMS technology for performing data analysis. We saw this as an opportunity, as we weren't exploiting all the information that was being provided to us from the instrument.
This line of thought led to our first attempt at making more use of the collisional cross section (CCS) information that's present in the data. This gave birth to TIMScore, which, conceptually, is very simple. The idea is that if you knew the true value – the true CCS of a given peptide at a given charge state – then you could compare that to what you're measuring. You should then be able to create a relative score, or judge how good that measurement is compared to your reference value or your true value. Unfortunately, we don't have that information for all of the potential peptides that could be measured, and so we figured that the next best thing was to build a prediction model that can give us the expected value, or the true measurement value. This is what forms the basis for TIMScore – we've built a machine learning prediction model for the CCS values. It includes all the tryptic peptides that we could feed into the model as well as post-translational modifications (PTMs), like phosphorylation. It was particularly critical to train our model on phosphorylation, given the biological implications of this PTM. Thus, the model can predict the CCS values of unphosphorylated as well as phosphorylated peptides with very high accuracy and reproducibility. We use those predicted CCS values to evaluate how good the measured value is and, based on that, we can better address ambiguity in the identifications.
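The comparison Srikumar describes, a measured CCS judged against a predicted reference value, can be illustrated with a minimal Python sketch. It is not Bruker's implementation: the function names, the Gaussian scoring form and the 2% tolerance are all assumptions chosen for illustration, and the predicted value is simply passed in rather than produced by a trained model.

```python
import math


def ccs_relative_error(measured_ccs: float, predicted_ccs: float) -> float:
    """Signed relative deviation of the measured CCS from the predicted CCS."""
    return (measured_ccs - predicted_ccs) / predicted_ccs


def ccs_agreement_score(measured_ccs: float, predicted_ccs: float,
                        tolerance: float = 0.02) -> float:
    """Map the deviation onto (0, 1], where 1.0 is a perfect match.

    `tolerance` is a hypothetical width (2% relative error), not a
    published instrument or software parameter.
    """
    err = ccs_relative_error(measured_ccs, predicted_ccs)
    return math.exp(-0.5 * (err / tolerance) ** 2)


if __name__ == "__main__":
    predicted = 400.0  # illustrative predicted CCS, in square angstroms
    for measured in (400.0, 404.0, 420.0):
        score = ccs_agreement_score(measured, predicted)
        print(f"measured={measured:.1f} score={score:.3f}")
```

A candidate whose measured CCS sits close to the prediction scores near 1, while a candidate that is several percent away scores near 0, which is the kind of relative judgment the answer above refers to.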
For a spectrum that is identified very, very clearly by its fragmentation pattern – and nothing else – TIMScore doesn't add much value. For the more ambiguous identifications, either because the fragmentation pattern is not clear enough, or because the peptide mass error between the measured peptide and the potential identifications is larger, we can use how well the predicted CCS matches the measured CCS. We can use this to either say, “this is a decoy peptide, a false positive, and we shouldn't count it”, or to say, “no, this is not a false positive, this is a true match, and we want to use that identification in the dataset”. Furthermore, TIMScore allows for enhanced discriminant analysis, reducing the ambiguity in peptide identification by adding another key dimension to that analysis. Essentially, the TIMScore dimension allows the standard 1% false discovery threshold determination to be based on a 2-dimensional plane, allowing access to some true identifications that may be lower scoring but are still valid peptides.
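To make the idea of adding a second dimension to the 1% false discovery threshold concrete, here is a small Python sketch of target-decoy filtering. It is a simplified illustration under stated assumptions, not the TIMScore algorithm: the `PSM` class, the linear `combined_score` with its `ccs_weight`, and the toy values are all hypothetical, and a real discriminant analysis would learn the decision boundary rather than use a fixed sum.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PSM:
    """A peptide-spectrum match with two (hypothetical) score dimensions."""
    fragment_score: float  # how well the fragmentation pattern matches
    ccs_score: float       # how well measured CCS agrees with predicted CCS
    is_decoy: bool


def combined_score(psm: PSM, ccs_weight: float = 1.0) -> float:
    """Assumed linear combination of the two dimensions."""
    return psm.fragment_score + ccs_weight * psm.ccs_score


def accepted_targets(psms: List[PSM], score_fn: Callable[[PSM], float],
                     max_fdr: float = 0.01) -> int:
    """Count targets at the deepest cutoff where decoys/targets <= max_fdr."""
    ranked = sorted(psms, key=score_fn, reverse=True)
    targets = decoys = best = 0
    for psm in ranked:
        if psm.is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best


if __name__ == "__main__":
    psms = [
        PSM(10.0, 0.9, False),
        PSM(8.0, 0.8, False),
        PSM(5.0, 0.1, True),   # decoy: CCS disagrees with the prediction
        PSM(4.8, 0.9, False),  # weak spectrum, but CCS agrees well
    ]
    print(accepted_targets(psms, lambda p: p.fragment_score))  # one dimension
    print(accepted_targets(psms, combined_score))              # two dimensions
```

In this toy data the low-scoring target ranks below the decoy on fragmentation alone, but above it once CCS agreement is included, so it can be accepted at the same false discovery threshold.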
Ash Board (AB): What difficulties have been associated with analyzing tryptic and phosphorylated peptides previously, and how does the software overcome such challenges? In addition, why do we need to capture PTM data in the context of studying the proteome and how can the 4D-Proteomics approach help to generate this information?
TS: Let’s think of an analogy. Say the sample that we are looking at is a room, and we're standing very near to the door of the room. If we had a peephole into the room, we get a very limited view of the room itself, but we may gain some insights into what's inside the room – what furniture is in there and who is in the room, for example.
Compare this to being able to open the door into the room. You have a much wider view of the room and can create a better description of what's actually present and happening. Then of course, if you can step in, you get a fully immersive experience of being in that room.
If we started with a standard shotgun proteomics approach on other platforms, perhaps, or even on our older platforms, you would be staring in through a peephole. There is a limited view. You can see maybe 1,000–3,000 proteins and several tens of thousands of peptides. It offered you a description, or an idea, of what was in the room – or the sample.
With TIMScore and PASEF, we’re letting you have either a bigger peephole, or the ability to open the door fully and step inside, creating a much broader view of what is there. Post-translational modifications (PTMs) play such a critical role in biology. Identifying which PTMs are present, in what quantity and where within a cell, is crucial for the understanding of biology. There is now a deeper, or wider, view, which should translate to our customers gaining a much deeper understanding of what’s in the samples they are studying.
MC: What capabilities does the TIMS DIA-NN software have compared to previous software systems?
TS: TIMS DIA-NN is our first software to analyze dia-PASEF data. It is based on the open-source DIA-NN software from the labs of Professor Markus Ralser and Dr. Vadim Demichev. We have forked that project and put a larger emphasis on the CCS measurement itself. We've also integrated it into the PaSER platform, so you have a workflow that's automatically triggered at the end of your acquisition. From a user perspective, you set up your experiment and your measurement on the timsTOF acquisition PC, which includes setting up your processing method. At the end of the acquisition, TIMS DIA-NN is triggered, and you have the results waiting for you a few minutes later.
You no longer have to acquire all of your data, then copy all of the files to your processing computer, start the analysis and then come back a few hours later to review the data quality, or to see if the column clogged or something along those lines. You now have one workflow, which you set up and walk away from. If you need to check on the data, you can come back a few minutes after the acquisition and have a result file waiting for you. When you want to compare the data across your entire project, however many tens or hundreds of samples you're interested in, you can group them all for analysis to fill in any missing data using a concept called “match between runs”. The efficiency of this concept is also increased by the use of CCS. With that, you have a full project view of all the proteins and peptides that were identified and quantified across the whole project.
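As a rough illustration of why CCS can help “match between runs”, the Python sketch below transfers an identification from one run to candidate features in another run only when they agree within set tolerances. This is a generic sketch, not the TIMS DIA-NN implementation: the `Feature` class, the function names and every tolerance value are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Feature:
    """A detected peptide feature (all fields illustrative)."""
    mz: float      # mass-to-charge ratio
    rt_min: float  # retention time in minutes
    ccs: float     # collisional cross section


def matches(ref: Feature, cand: Feature, ppm_tol: float = 10.0,
            rt_tol_min: float = 0.5, ccs_tol: Optional[float] = None) -> bool:
    """True if `cand` agrees with `ref` within the assumed tolerances."""
    if abs(cand.mz - ref.mz) / ref.mz * 1e6 > ppm_tol:
        return False
    if abs(cand.rt_min - ref.rt_min) > rt_tol_min:
        return False
    # The CCS check is optional so both cases can be compared.
    if ccs_tol is not None and abs(cand.ccs - ref.ccs) / ref.ccs > ccs_tol:
        return False
    return True


def candidates(ref: Feature, features: List[Feature],
               ccs_tol: Optional[float] = None) -> List[Feature]:
    """All features in another run that could carry the reference identity."""
    return [f for f in features if matches(ref, f, ccs_tol=ccs_tol)]


if __name__ == "__main__":
    identified = Feature(mz=500.2500, rt_min=30.0, ccs=400.0)
    other_run = [
        Feature(500.2510, 30.1, 401.0),  # same peptide, small drift
        Feature(500.2490, 29.8, 430.0),  # co-eluting species, different CCS
        Feature(600.0000, 30.0, 400.0),  # different mass
    ]
    print(len(candidates(identified, other_run)))                # m/z and RT
    print(len(candidates(identified, other_run, ccs_tol=0.02)))  # plus CCS
```

With m/z and retention time alone, two features remain plausible in this toy example; requiring CCS agreement as well leaves a single candidate, so the missing value can be filled in with less ambiguity.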
AB: For you, which Bruker customer case study really demonstrates the impact the novel software can have in proteomics analysis?
TS: We started off with a retrospective analysis, reaching out to Professor Yasushi Ishihama’s lab at Kyoto University, where they published a paper that was exploring phosphorylation and how it is affected by CCS values. We worked together with the team’s existing data set to see if there were any gains to be made with TIMScore. We came to the realization that yes, there were huge gains! I think we were seeing anywhere in the range of 30–40%, depending on whether you were looking at the peptide or the protein level.
We could not only identify more phosphorylated peptides in this specific case, but we were also able to identify more phosphorylation sites at the same confidence level. That is, not only could we identify the peptide sequence in the protein that was modified by phosphorylation, but we could also identify the exact amino acid at which this phosphorylation event was occurring. This meant that we were not seeing ambiguous identifications; we could localize it to a very specific residue, and that means a much better understanding of the signaling biology.
We built PaSER as a platform, and as part of PaSER we've now integrated TIMScore and TIMS DIA-NN. One of the more common scenarios that we're seeing is to undertake a small pilot study to build a spectral library, perhaps from a fractionated data set or from pooled samples. Then, using TIMScore, you build as in-depth a peptide spectral library as possible, before studying a much larger cohort of samples using TIMS DIA-NN.
Then you do this study in DIA mode across hundreds to thousands of samples. We've seen pilot projects with several thousand samples, and even larger projects planning to use tens of thousands of samples. What the integrated workflow and PaSER let you do is keep an eye on the entire project as it's progressing, but also get feedback on blocks of data as you're moving forward. In general, we're seeing our customers migrating towards DIA and we're facilitating that with the PaSER platform.
MC: Can you discuss the impact that novel software systems are having on the proteomics field? To what extent are they helping to overcome the data bottleneck?
TS: Our approach has been very different from what has typically been tried. One of the largest bottlenecks in proteomics has been that you can generate data from hundreds of samples per day – but of course, you need to process that data.
One of the simpler routes that was taken was to move to “the Cloud”, where you can scale your requirements computationally. But, again, you are still waiting to acquire your entire project before moving that into a Cloud environment to process it, and then you're still waiting quite some time or you're spending a good chunk of money to process that rapidly in the Cloud.
One of the initial questions we had with PaSER was, why do we wait? We have all this time while we're acquiring the data, that we could be using to process it. This is one of the key differentiators between the PaSER platform and some of the other software solutions.
That doesn't answer your question about AI and machine learning. We see this as a very large and growing front, not only for Bruker, but for the entire field in general. I think we will start to see predictions being applied more broadly, not just for CCS values. They are already being applied for predicting MS/MS spectra, retention time and many other aspects. I think we're going to start to see all of those being integrated together to create even more robust and more complete models that can describe all of those aspects, or many of those aspects, together. Then you reach a stage where you can very confidently predict and identify the characteristics of a peptide well before you've made the measurements. You can use that knowledge to modify how you're going to acquire the data so it might better suit your experimental design.
The other aspect is that we're shifting the bottleneck from acquiring the data to analyzing the data, and we're going to create a new bottleneck at post-analysis. I think this will be an interesting place to keep your eyes on.
MC: Are you able to discuss any future plans in terms of further enhancing software capabilities?
TS: I think there are a few areas that we haven't covered. For instance, de novo sequencing is not a capability that's currently a part of our portfolio, and we're really looking forward to integrating that option. We've offered solutions for different workflows but want to continue developing that. CCS prediction, along with all of the other predictions that we want to do, is also an area where we have a lot of focus, particularly with regard to PTMs. We currently support phosphorylation, but we want to grow that to cover, ideally, all of the PTMs, whether the model has seen them or not, and to be able to predict those accurately so we can make use of that in TIMScore and other applications. There are also the broader aspects of quality control, statistical analysis and data visualization that we plan to make an impact in.
Tharan Srikumar, Product Manager of Bioinformatics at Bruker Daltonics, was speaking to Technology Networks.