Scale and Speed in Proteomics Data Processing
Proteomics enables quantitative analysis of thousands of proteins in biological samples, providing deep insight into healthy physiological function and disease states. Whilst proteomics is not yet a mainstream analytical approach adopted in medicine, it's poised to facilitate personalized and precision medicine in the future, particularly through biomarker discovery. However, a key challenge in this space is the sheer size of data produced. Proteomics research, particularly mass spectrometry (MS)-based proteomics analysis, generates enormous data sets which require subsequent analysis.
Targeted proteomics start-up ProteiQ Biosciences recently announced the launch of InfineQ, its cloud-based software solution for near real-time processing of proteomics data. Technology Networks spoke with Arnoud Groen, CSO and co-founder of ProteiQ, to learn more about the solution and how it provides the scale and speed required for biomarker discovery.
Molly Campbell (MC): Can you explain why the analysis of MS-DIA proteomics data can be particularly complex, and provide background information on ProteiQ Biosciences' development?
Arnoud Groen (AG): To answer your first question, there are essentially two forms of proteomics research: 1) an antibody-based proteomics approach and 2) a mass spectrometry (MS)-based proteomics approach. MS-based proteomics is more complex than antibody-based methods because of the increased number of parameters one needs to control. However, it provides more confident identifications and more information on the target (protein), such as post-translational modifications, isoforms, etc.
Within MS-proteomics, data independent acquisition (DIA) is particularly complex compared to the other two main MS data acquisition methods, data dependent acquisition (DDA) and selected reaction monitoring (SRM).
The most fundamental reason for this is the loss of direct connection between the intact peptide and its fragments which are used for identification. To put it metaphorically, the puzzle to connect the data to actual protein information has become many times more complex, and the emergence of DIA proteomics has pushed the field even more towards computer science.
ProteiQ started as a biomarker discovery company focusing on sports and well-being. Our first main goal was to create a protein panel to objectively diagnose non-functional overreaching (NFOR) in athletes. Although this work is still ongoing and ProteiQ retains a strong interest in this area, it is no longer our primary business model.
This is because we realized that the technology we had already been using for our biomarker discovery, DIA, was something that we could apply more broadly to many other challenges in life science research. DIA methods and MS-based proteomics have great and very versatile potential, which is currently underutilized, especially in biomarker discovery work, due to their perceived complexity.
It's a classic chicken-and-egg problem. We believe it is one of the most important directions in which proteomics-based biomarker discovery needs to develop in order to contribute towards medical diagnostics. Hence, we developed our own cloud-based processing pipeline for DIA MS-based proteomics and are now working both with MS laboratories and with research groups and pharmaceutical companies on its applications to a range of medical challenges.
MC: InfineQ is built on top of DIA-NN. Please can you tell us about this and how InfineQ is different to DIA-NN?
AG: To use a metaphor, DIA-NN is our internal engine. Just like in a car, the engine is one of the most important features. And just like in car design and manufacturing, InfineQ adds features on top of the engine that improve the car's performance. For example, InfineQ adds the scalability of the cloud and native cross-run alignment around the samples, and it improves post-processing. These are all features that are very important for high-throughput, large cohort studies. With some features we are a bit ahead, like identification of post-translational modifications, which is available in InfineQ, but not yet in DIA-NN.
DIA-NN and InfineQ are also not developed in isolation from each other. It is a continuous collaboration by which single run-based improvements are extended to the scale of large cohort studies.
MC: How does InfineQ achieve near real-time processing of proteomics data? Why is this important?
AG: A straightforward answer to this question is that researchers don’t want to wait for days or weeks to process their data. Acquiring MS-based proteomics data already takes a lot of time, depending on the size of the study of course, but it can largely be run in parallel, e.g. by buying more MS instruments. However, when it comes to signal processing, it's not that straightforward. If algorithms are not optimized for high throughput, simply having more computers will not help.
InfineQ solves this problem in three ways. Firstly, because of the cloud solution, the data processing can be parallelized, leading to much shorter data processing times and removing the limit on the size of the cohort. We use a serverless Kubernetes (k8s) approach to split each run into multiple pieces which are processed in parallel. The code is also optimized for efficiency and speed of processing, which is already part of DIA-NN. Secondly, multiple users who work in one lab or institute can all do their work on the cloud in parallel: there is no need to wait for each other’s work to be complete because of single licenses or lack of computer power. Thirdly, because the scalability bottleneck is removed, additional algorithms can be added to improve the quality of the outcome without any observable negative time impact for users.
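As a rough illustration of the split-and-parallelize idea described above, the following Python sketch divides one run into pieces and processes them concurrently. It is not InfineQ's or DIA-NN's code: the function names (`split_run`, `process_piece`, `process_run`) are hypothetical, the "scans" are plain numbers, and a local thread pool merely stands in for cloud workers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_run(scans, n_pieces):
    """Split one run's scans into at most n_pieces contiguous chunks."""
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    # Ceiling division, with a floor of 1 so an empty run yields no chunks.
    size = max(1, -(-len(scans) // n_pieces))
    return [scans[i:i + size] for i in range(0, len(scans), size)]


def process_piece(piece):
    """Hypothetical placeholder for per-piece signal processing.

    Here it only sums the values; a real pipeline would do far more.
    """
    return sum(piece)


def process_run(scans, n_pieces=4):
    """Process the pieces of one run in parallel and merge the partial results."""
    pieces = split_run(scans, n_pieces)
    # Assumption: a local thread pool stands in for independent cloud workers.
    with ThreadPoolExecutor(max_workers=n_pieces) as pool:
        partials = list(pool.map(process_piece, pieces))
    return sum(partials)


if __name__ == "__main__":
    run = list(range(10))       # toy stand-in for one acquisition run
    print(split_run(run, 4))    # four chunks, the last one shorter
    print(process_run(run, 4))  # merged result across all chunks
```

The point of the sketch is only that the pieces are independent, so adding workers shortens the wall-clock time as long as the per-piece step itself is efficient.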
MC: A major bottleneck to the delivery of proteomics analysis in the clinic is the data load and subsequent time required for analysis. In your opinion, can InfineQ help to alleviate this bottleneck?
AG: Reduction of the processing time after sample acquisition is the main focus of InfineQ. Having said this, actual application to data obtained in a clinical (treatment) setting is still a long way off. At this moment the main objective of ProteiQ is to bridge the gap between research and application in medicine, by creating a robust discovery pipeline which can also be run in a regulated environment, for example clinical Phase II studies. However, additional steps and parties are required to translate the results into certified diagnostic tests.
MC: Are you able to provide any examples of researchers in the clinical proteomics field that are adopting InfineQ and what their work is analyzing?
AG: InfineQ is currently in public beta and we are testing it with a couple of groups.
One of the first applications of the technology, by the start-up company Bioflagz, addresses the health complications related to silicone breast implants in women who have had breast augmentation. They are developing a direct-to-consumer protein biomarker test to better predict these health complications.
MC: Can you talk about InfineQ in terms of ease-of-use?
AG: That was one of the key objectives for us. You can click on the “watch demo” on infineq.com to see the software in action: it literally takes just three steps. For the users, all internal workings such as calibrations are done automatically. The final result is a list of quantified peptides and proteins with which the user can start working immediately.
In addition, we are working on an option for users who are comfortable with coding. The new application programming interface will allow programmatic access to InfineQ and the possibility to interrogate the results directly on the cloud. Other features are also planned that will allow easier analysis of the data directly from InfineQ’s interface.
Arnoud Groen, CSO and co-founder of ProteiQ, was speaking with Molly Campbell, Science Writer, Technology Networks.