Key Developments in Proteomic Pipelines
By analyzing proteins, proteomics directly examines the molecules that carry out much of the cell's activity. Because proteomic data provide a snapshot of a particular cell, tissue, or system at a point in time,1 high-throughput proteomics has become widely used in biological, biomedical and clinical research. Rapid advances in proteomic technologies have enabled scientists to explore proteins and their modifications in complex samples on an unprecedented scale (higher numbers of samples and replicates at increased resolution and coverage).1
However, these advances also pose significant challenges to researchers in terms of storing, managing, and reproducibly analyzing this deluge of proteomic data. High-throughput genomic and transcriptomic analyses have grown to rely on sophisticated analysis pipeline frameworks, which comprise multiple software tools strung together in a specific sequence to form automated analysis workflows for specific tasks.1,2 This was not the case for the field of proteomics, at least until very recently, as most analysis has typically been performed on local workstations, or using “black box” online tools.
Here, we describe some of the key recent advances in proteomic pipeline development and deployment that aim to address core challenges at the forefront of large-scale proteomics.
1. Cloud computing, software engineering and the democratization of bioinformatics
Although cloud computing has revolutionized business and several other sectors, academia has yet to tap into the numerous advantages that it offers.3 Academic institutions still mostly use in-house systems such as high-performance computing (HPC) clusters that require large, up-front capital expenditure. Cloud computing, on the other hand, requires few up-front costs, and users are billed on a recurring basis only for what they use of the virtualized infrastructure.
In proteomics, cloud computing has the potential to accelerate research by providing laboratories with access to large-scale computing resources for developing and implementing their proteomic pipelines, regardless of their location or IT expertise.3 Cloud resources can take on the computationally intensive tasks inherent in proteomic data analysis, greatly reducing analysis times and the local computational burden.
Besides the cost and time savings, the many added benefits of cloud computing include:
• Decreased development and maintenance workload
• Increased reproducibility
• Improved version control
• Improved fault tolerance
• Reduced latency
• Easier sharing of data and software
• Enhanced security
• The potential for agile development
• The potential for serverless computing
• The ability to scale up under increased loads
Access to software engineering expertise would further democratize bioinformatics for generalist bioinformaticians and biologists; however, this avenue has not yet been fully explored.2 To facilitate collaboration between software engineers and bioinformaticians, Docker and similar platforms have been suggested as an ideal crossover technology for creating containers (discussed further below) in which bioinformatic pipelines can be developed, tested and implemented by bioinformaticians and biologists. The Dockerfile, a simple text document, can be easily amended, updated and shared between software engineers and scientists as each pipeline evolves. Platforms such as Docker and Kubernetes can also ease the migration of software and pipelines from local to cloud deployment.3
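To illustrate how small such a Dockerfile can be, the following is a minimal sketch in Python that renders one as plain text. It is not taken from any cited pipeline: the function name `render_dockerfile` and the package `peptide-search` are hypothetical, and the sketch assumes the tool can be installed with pip on top of a standard Python base image.

```python
import json


def render_dockerfile(base_image, pip_packages, entrypoint):
    """Return the text of a minimal Dockerfile as a string."""
    lines = [f"FROM {base_image}"]
    if pip_packages:
        # Pinning versions (name==version) means a later rebuild installs the same tools.
        lines.append("RUN pip install --no-cache-dir " + " ".join(pip_packages))
    # The JSON (exec) form of ENTRYPOINT runs the command without a wrapping shell.
    lines.append("ENTRYPOINT " + json.dumps(entrypoint))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "peptide-search" is a hypothetical package used only for illustration.
    print(render_dockerfile("python:3.11-slim", ["peptide-search==1.2.0"], ["peptide-search"]))
```

Because the result is ordinary text, it can be kept under version control and reviewed by software engineers and scientists alike, in the same way as the rest of a pipeline.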
2. Software containers
Computational proteomics has historically been dominated by desktop software and online tools, which has hampered high-throughput analysis in HPC clusters and cloud environments.1,4 Furthermore, many of these tools are proprietary closed-source solutions which use proprietary data formats and can only run on specific operating systems or vendor hardware. This poses a considerable challenge to reproducible and scalable proteomics research. During the last decade, open-source solutions have slowly begun to appear. However, this has typically come at the cost of increased technical complexity, requiring computational skills that scientists generally do not have. This is further complicated by the fact that tools in different computational environments (such as local workstations, HPC clusters and the cloud) often require different installation procedures, have different software dependencies and may use different file formats.
Software containers offer a way to simplify the distribution and rapid deployment of bioinformatic software and to combine tools into powerful analysis pipelines.1 Containers do this by isolating the desired software and its dependencies into units that can be stably deployed in various computing environments. Pipeline analysis tasks can be broken up into isolated units and scaled up by increasing the number of containers running simultaneously. Once a container for a specific tool has been built, it can be easily distributed by depositing it in an online container registry. From there, users can retrieve the container and execute the enclosed software without any additional installation, and the same container can be executed on different operating systems.
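The idea of scaling up by running several containers at once can be sketched as follows. This is an illustrative example only: the image name, the `--input` and `--output` tool options and the file names are hypothetical, and the commands are printed rather than executed so that the sketch runs without Docker installed. The `docker run` options used (`--rm` and `-v`) are standard.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePosixPath

# Hypothetical image reference; a real pipeline would name an image from a registry.
IMAGE = "registry.example.org/proteomics/peptide-search:1.2.0"


def docker_command(image, data_dir, input_name):
    """Build the argument list for one containerized analysis of one input file."""
    stem = PurePosixPath(input_name).stem
    return [
        "docker", "run", "--rm",      # remove the container when it exits
        "-v", f"{data_dir}:/data",    # mount the host data directory at /data
        image,
        "--input", f"/data/{input_name}",             # hypothetical tool option
        "--output", f"/data/{stem}.results.tsv",      # hypothetical tool option
    ]


def run_all(commands, runner, max_workers=4):
    """Apply runner to every command, with up to max_workers running at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


def dry_run(command):
    """Stand-in runner that only formats the command instead of executing it."""
    return " ".join(command)


if __name__ == "__main__":
    files = ["sample_01.mzML", "sample_02.mzML", "sample_03.mzML"]
    commands = [docker_command(IMAGE, "/home/user/run1", name) for name in files]
    for line in run_all(commands, dry_run):
        print(line)
```

To launch the containers for real, `dry_run` could be replaced by a function that calls `subprocess.run(command, check=True)`; increasing `max_workers` then increases the number of containers running simultaneously.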
In recent years, the use of software containers in bioinformatics has increased rapidly. In October 2017, the Institut Français de Bioinformatique (IFB), European Bioinformatics Institute (EBI) and ELIXIR Tools Platform organized a hackathon in Paris to consolidate a container platform named BioContainers. BioContainers builds on the popular frameworks Conda, Docker, and Singularity and is developed on the GitHub community platform, to which anyone can contribute.4,5 Platforms such as BioContainers and, to some extent, Bioconda (released in 2015), offer thousands of tools in a format that enables users to execute their pipelines in different computing environments without the complexities of installation and software dependencies.1 Users can easily replace independent components with those created using different technologies or programming languages. Furthermore, Bioconda and BioContainers provide software version management, which facilitates reproducible data analysis over time.
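One practical consequence of version management is that a pipeline can refer to each tool by an explicit version rather than "whatever is newest". The sketch below is a simple, assumed check (not a feature of BioContainers or Bioconda themselves) that flags container image references lacking an explicit tag or digest; the image names are hypothetical.

```python
def is_pinned(image_ref):
    """Return True if the reference names an explicit tag (not 'latest') or a digest."""
    if "@" in image_ref:
        # A digest such as name@sha256:... identifies one exact image.
        return True
    # Only inspect the final path component, so a registry port
    # (e.g. localhost:5000/tool) is not mistaken for a tag.
    last_component = image_ref.rsplit("/", 1)[-1]
    _name, separator, tag = last_component.partition(":")
    return bool(separator) and tag not in ("", "latest")


if __name__ == "__main__":
    # Hypothetical image references used only for illustration.
    for ref in [
        "registry.example.org/tools/peptide-search:1.2.0",
        "registry.example.org/tools/peptide-search:latest",
        "localhost:5000/tools/peptide-search",
    ]:
        print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```

Running such a check over every image a pipeline uses is one low-cost way to make sure that a rerun months later uses the same tool versions.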
3. Workflow systems
Although software containers simplify the installation and deployment of bioinformatic tools, scientists are still left with the complex task of combining these tools to create proteomic analysis pipelines that can run on different architectures.1 To address this issue, various workflow systems have been developed. A workflow system is software that allows sequential and parallel steps of tool execution to be set up in such a way that the workflow can be executed in different environments (e.g., local machine, containers, HPC clusters and clouds). Over the past decade, several open-source workflow environments have emerged, the two most popular being Galaxy and Nextflow. It is hoped that the combination of software containers and workflow systems will make proteomic pipelines more reproducible, scalable and accessible, even to scientists without expertise in complex IT infrastructure and command-line environments.
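The core scheduling idea behind a workflow system, deciding which steps must run in sequence and which can run in parallel, can be sketched with Python's standard library. This is not how Galaxy or Nextflow are implemented or configured; the step names are hypothetical and the sketch only computes the order of execution without running any tools.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "convert_a": set(),
    "convert_b": set(),
    "search_a": {"convert_a"},
    "search_b": {"convert_b"},
    "merge_results": {"search_a", "search_b"},
}


def execution_batches(dependencies):
    """Group steps into batches; steps within a batch can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the steps depend on each other in a loop
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are finished
        batches.append(ready)
        sorter.done(*ready)                 # mark the batch as completed
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(execution_batches(PIPELINE), start=1):
        print(f"batch {number}: {', '.join(batch)}")
```

For the pipeline above, the two conversion steps form the first batch, the two search steps the second, and the merge step runs last. A real workflow system adds what this sketch omits, such as dispatching each step to a container, cluster or cloud back end and resuming after failures.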
4. “Bring Your Own Data” (BYOD)
Ideally, bioinformatic pipelines should be easily used by all scientists, not solely bioinformaticians and software engineers. To meet the demand of biologists who wish to have autonomy in their bioinformatic analyses, institutes such as the Dutch Techcentre for Life Sciences and the IFB are offering intensive training courses to equip scientists with the knowledge and skills to develop custom integrated proteomic pipelines (termed the BYOD principle). The first session of the IFB training course took place in February 2019 and all teaching material is freely available. Such courses are hopefully a sign of further impetus in this direction.
5. The ongoing ELIXIR project to benchmark proteomic pipelines
The ever-increasing popularity of proteomic approaches and technological advances in the field have led to a surge in the number of proteomic data analysis tools and pipelines. This, understandably, can be overwhelming to researchers new to the field, and has led to variable quality of pipeline outputs and a lack of harmonization within the proteomics field.6-8 ELIXIR is an intergovernmental organization made up of 23 European countries with a Hub based in Cambridge, UK, at the Wellcome Genome Campus. It brings together life scientists and computer scientists to help coordinate life science resources such as databases, software tools and training materials, and to help researchers agree on best practices. ELIXIR is currently running an Implementation Study that aims to benchmark proteomic pipelines and identify those that conform to the high standards required to ensure reproducible findings. Similarly, for almost two decades now, the Proteomics Standards Initiative of the Human Proteome Organization has been developing and promoting software tools and community standards for data representation in proteomics to facilitate data comparison, exchange, and verification, including continuous updates to the Minimum Information About a Proteomics Experiment (MIAPE) guidelines as proteomic technologies have evolved.9
6. NIST reference materials for proteomic pipeline comparisons and harmonization between laboratories
The US National Institute of Standards and Technology (NIST) develops reference materials for various types of physical and chemical measurements made by government, academia, and industry. It is currently expanding its offering of mass spectral and peptide mass spectral libraries by developing standard reference materials of human tissues for proteomics experiments.10 This will enable benchmarking and harmonization between laboratories and proteomic techniques, as well as head-to-head comparisons of proteomic pipelines.
References
1. Perez-Riverol Y, Moreno P. Scalable data analysis in proteomics and metabolomics using BioContainers and workflows engines. Proteomics. 2020;20:1900147. doi: 10.1002/pmic.201900147
2. Lawlor B, Sleator RD. The democratization of bioinformatics: A software engineering perspective. GigaScience. 2020;9(6):giaa063. doi: 10.1093/gigascience/giaa063
3. Cole BS, Moore JH. Eleven quick tips for architecting biomedical informatics workflows with cloud computing. PLoS Comput Biol. 2018;14(3):e1005994. doi: 10.1371/journal.pcbi.1005994
4. Gruening B, Sallou O, Moreno P, et al. Recommendations for the packaging and containerizing of bioinformatics software [version 2; peer review: 2 approved, 1 approved with reservations]. F1000Research. 2019;7(ELIXIR):742. doi: 10.12688/f1000research.15140.2
5. da Veiga Leprevost F, Grüning BA, Alves Aflitos S, et al. BioContainers: an open-source and community-driven framework for software standardization. Bioinformatics. 2017;33(16):2580-2582. doi: 10.1093/bioinformatics/btx192
6. Thomas SN, Zhang H. Targeted proteomic assays for the verification of global proteomics insights. Expert Rev Proteomics. 2016;13(10):897-899. doi: 10.1080/14789450.2016.1229601
7. Prasad B, Achour B, Artursson P, et al. Toward a consensus on applying quantitative liquid chromatography-tandem mass spectrometry proteomics in translational pharmacology research: A white paper. Clin Pharmacol Ther. 2019;106(3):525-543. doi: 10.1002/cpt.1537
8. Tsiamis V, Ienasescu H, Gabrielaitis D, Palmblad M, Schwämmle V, Ison J. One thousand and one software for proteomics: Tales of the toolmakers of science. J Proteome Res. 2019;18(10):3580-3585. doi: 10.1021/acs.jproteome.9b00219
9. Deutsch EW, Orchard S, Binz PA, et al. Proteomics Standards Initiative: Fifteen years of progress and future work. J Proteome Res. 2017;16(12):4288-4298. doi: 10.1021/acs.jproteome.7b00370
10. Davis WC, Kilpatrick LE, Ellisor DL, Neely BA. Characterization of a human liver reference material fit for proteomics applications. Sci Data. 2019;6:324. doi: 10.1038/s41597-019-0336-7