Mapping the Human Proteome: A Journey
In 2020, we celebrate 90% of the human proteome being mapped by the Human Proteome Project (HPP). In this article, we reflect on the history of this endeavor and the future of the proteomics field.
Proteins: The true actors of life
One aspect of human existence we know to be universal is that we each have a molecular self. Every human on this planet comprises organs, tissues, cells and molecular machinery that shapes both our identity and our human experience. Whilst our lives may be beautifully unique, we all live through them via this molecular self. It binds us – biologically – to those that came before us, those that will follow us and also to other organisms with which we share our planet.
Throughout the history of medicine, we have sought to characterize and analyze the molecular self, in a bid to understand the underpinnings of human physiology and, in turn, pathophysiology. This knowledge holds many applications that transcend biology and circle around the question of what it means to be a human. In the context of human disease, it aids our ability to prevent it, or even cure it.
Almost 17 years after the completion of the Human Genome Project (HGP), the insights garnered by DNA sequencing studies have filtered into modern medicine in several ways. Next-generation sequencing (NGS) of tumors has facilitated the discovery of biomarkers unique to specific cancers. Pharmacogenomics has taught us that certain drug compounds will elicit different therapeutic effects in different people. Arguably more niche fields of research have benefitted too; transgenerational epigenetics, for example, is teaching us how biological memories could be passed on to future generations.
The HGP has undisputedly advanced our knowledge of human biology in ways that scientists of the past could only dream of.
But still, gaps remain.
DNA is only part of the story; there are many steps required to transcribe and translate the DNA code into proteins, the functional workhorses of the cell. As Professor Chris Overall, Canada research chair in proteinase proteomics and systems biology at the University of British Columbia, Centre for Blood Research, described, "The HGP was a massive achievement – it really was brilliant. But it's only the blueprint for us."
The complexity of human biology is attributed to protein variation
"The true actors of life are proteins," says Dr Lydie Lane, co-director of the CALIPHO group at the University of Geneva and SIB Swiss Institute of Bioinformatics. "They act as little machines that are responsible for all our biological functions (nutrition, reproduction, respiration etc.)." To fully comprehend the complexity of human physiology – and in turn, pathophysiology – the insights provided by genomics alone are not enough; we also need proteomics.
The term "proteome" was first used in 1994 by Marc Wilkins, and refers to the complete set of proteins that are expressed in a cell, tissue, organism or system, at a given time.
Proteomics, the large-scale study of proteomes, sits within an area of science known as "omics", which also comprises genomics, transcriptomics and metabolomics.
Historically, proteomics has been somewhat overshadowed by the field of genomics. Stephen Curry, a professor of structural biology at Imperial College London, wrote about an apparent lack of interest in our "inner molecules": "[As soon as] someone mentions gene products, the myriad of protein molecules encoded by genes, suddenly there’s a switch off." Critics of Curry's writing argued that the problem wasn't a lack of interest, but rather a lack of understanding; studying the proteome is far more complex than studying the genome.
Why? The proteome is exceptionally diverse. "A surprise revealed by the success of the HGP was the lower-than-anticipated number of genes identified: ~20,300, rather than the ~100,000 estimated," wrote Lloyd Smith and Neil Kelleher.1 This finding of the HGP ultimately led to the acknowledgement that the complexity of human biology is attributed to protein variation, resulting from processes such as alternative splicing of RNA transcripts and the formation of post-translational modifications.
It cast a spotlight on proteomics.
HUPO and the Human Proteome Project
On February 9, 2001, an international scientific organization was formed with the collective aim of promoting proteomics through international cooperation and collaboration. It was named the Human Proteome Organization, or HUPO.
Professor Emanuel Petricoin, co-leader of the Applied Proteomics and Molecular Medicine team at George Mason University, was one of the co-founders of HUPO. In an interview with Technology Networks he described how, "In proteomics you have so many different technologies and methodologies […]. All of these specialties and subspecialties have different cohorts of scientists that in themselves are in their own little subgroups."
The aim of HUPO, in Petricoin's words, was to represent the efforts of these cohorts across the globe, and develop what he referred to as "campfire" projects that researchers could "congregate" around and participate in together to advance the field.
The Human Proteome Project (HPP) is an example of such a venture; however, as an endeavor that involves scientists, clinicians and industry members from many countries across the world, it's a little larger than a campfire circle. The project – first launched on September 23, 2010 – is a collaborative endeavor designed to map the entire human proteome using current and novel analytical technologies.
The HPP is divided into two subcategories: the Chromosome-Centric HPP (C-HPP) and the Biology/Disease-Driven HPP (B/D-HPP). Lane, the co-chair of the C-HPP, explains why: "One of the very first tasks of the HPP was to get convincing experimental evidence for the existence of each of the ~20,000 proteins predicted by the analysis of the human genome. Since the genes encoding these ~20,000 proteins are distributed on the 24 human chromosomes, it was natural to divide the work chromosome per chromosome." Because the HPP is an international project, it was logical for each participating country to be allocated one of the 24 chromosomes. Each country is then responsible for monitoring the validation status of all the predicted proteins on its chromosome.
"However, once the proteins are produced, the function they perform in the body is independent from the chromosome that originally encoded them. In order to study their role or their involvement in disease it is important to study proteins in their global context," Lane added. "The B/D HPP project tackles this challenge by focusing on broad biological or medical questions and applying more systemic approaches."
The C-HPP and B/D HPP are complementary projects. The first ensures that all proteins are covered, and the second puts all the pieces together.
Milestone: 90% of the human proteome is mapped
Lane's research group at the SIB Swiss Institute of Bioinformatics curates neXtProt, the official knowledgebase of the human proteome. "neXtProt integrates and standardizes all the data generated by the teams participating in the project and produces annual metrics to monitor its progress. This is done in close collaboration with the other key resources involved in HPP, such as the Human Protein Atlas and PeptideAtlas."
There are five levels of supporting data for protein existence (PE) as part of the HPP's data system:
|Level|Supporting evidence|
|---|---|
|PE1|Experimental evidence for the existence of at least one proteoform. This is based on studies using mass spectrometry (MS), Edman sequencing, X-ray or nuclear magnetic resonance structure of the purified protein, protein-protein interaction data or antibody data.|
|PE2|Evidence limited to the corresponding transcript.|
|PE3|Indicates the existence of orthologs in a closely related species.|
|PE4|Entries are based on gene models without evidence at the protein, transcript or homology level.|
|PE5|Coding evidence is doubtful; entry typically corresponds to an in silico translation of a non-coding element.|
A table outlining the five levels of supporting data for protein existence.
On the HPP's 10th anniversary this year, the project celebrated a major milestone: mapping 90.4% of the human proteome at the PE1 level, as reported in Nature Communications.2 In 2011, just 70% of the human proteome had been mapped to this level.
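To make the arithmetic behind a "percentage mapped" figure concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the HPP's or neXtProt's actual method: the `ProteinExistence` enum and `pe1_coverage` function are hypothetical names, the counts are invented round numbers rather than project data, and the choice to leave doubtful PE5 entries out of the denominator is an assumption made for this example.

```python
from enum import IntEnum


class ProteinExistence(IntEnum):
    """The five protein existence (PE) evidence levels described above."""
    PE1 = 1  # experimental evidence at the protein level
    PE2 = 2  # evidence limited to the corresponding transcript
    PE3 = 3  # orthologs exist in a closely related species
    PE4 = 4  # gene model only, no protein/transcript/homology evidence
    PE5 = 5  # coding evidence is doubtful


def pe1_coverage(counts):
    """Return the percentage of considered entries that have PE1 evidence.

    Assumption for this sketch: PE5 (doubtful) entries are excluded
    from the denominator, so only PE1-PE4 are counted.
    """
    considered = [lvl for lvl in ProteinExistence
                  if lvl is not ProteinExistence.PE5]
    total = sum(counts.get(lvl, 0) for lvl in considered)
    if total == 0:
        raise ValueError("no PE1-PE4 entries to compute coverage from")
    return 100.0 * counts.get(ProteinExistence.PE1, 0) / total


# Hypothetical round-number counts, for illustration only (not HPP data).
example_counts = {
    ProteinExistence.PE1: 18_000,
    ProteinExistence.PE2: 1_500,
    ProteinExistence.PE3: 300,
    ProteinExistence.PE4: 200,
    ProteinExistence.PE5: 500,
}

print(f"{pe1_coverage(example_counts):.1f}% at PE1")  # prints "90.0% at PE1"
```

With these made-up counts, 18,000 of the 20,000 PE1 to PE4 entries have protein-level evidence, which gives 90.0%. The remaining PE2 to PE4 entries in such a tally are the kind of "missing" proteins discussed later in the article.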
How is the data collected in the last decade of the HPP being utilized, particularly in a clinical context? This, according to Petricoin – whose research focus lies in oncology – is the "elephant in the room": "Patients stood at the [HGP] announcement and said, 'so what?'. 'How does this information and list of genes help me today, or tomorrow, with my cancer?' The same can rightfully be said here – how does a simple list of proteins help a cancer patient today?"
In the publication, A high-stringency blueprint of the human proteome, Adhikari and colleagues highlight how the HPP data has assisted research groups across the globe in the study of different diseases thus far. The National Cancer Institute’s Office of Cancer Clinical Proteomics Research, for example, is working to improve the prevention, early detection, diagnosis and treatment of cancer via programs such as the Clinical Proteomic Tumor Analysis Consortium. Cardiovascular disease (CVD) proteomics has advanced from simply identifying proteins, to mapping proteoforms that enable subclassification of CVD.3 Most recently, proteomics-based analysis of the SARS-CoV-2 virus has identified potential therapeutic targets for treating COVID-19 and highlighted the potential utility of existing drugs.4
"Like a list of parts for a 747 aircraft, it [mapping the human proteome] doesn't tell you anything about how the aircraft is put together, how it operates, and most importantly how to fly it. This [parts list] is what the field has achieved, and it is fantastic – but just a start – we need to take the parts list and construct the instruction manual, and that is going to take a huge amount of continued effort," – Petricoin.
But mapping the proteome is just the beginning. As Lane says, "it's a solid grounding" – but there is still work to be done. "Validating the existence of proteins is good but understanding what they do is better! We hope that the knowledge and tools we integrate in neXtProt will continue to facilitate and speed up the generation of functional hypotheses for all the understudied proteins," she noted.
"Deciphering this new 'proteome code' is the challenge that lies ahead for the proteomics and HPP communities and for addressing the broken hyperbole springing from the euphoria of the publication of the human genome papers 20 years ago, when the media and pundits predicted the curing of some, if not all, human diseases within a few years," Overall added.
"Mind the gap"
You might think that celebrating the completion of 90% of a task is unusual, particularly in the context of scientific research. Why rejoice now? Why not wait until we have mapped 100% of the human proteome?
To understand why 90% is a huge triumph and a cause for celebration, it helps to appreciate the sheer complexity of the proteome, the growing capabilities of analytical technologies and the pressures faced by an arguably under-funded research field, particularly when compared to genomics. And so, the HPP chooses to "mind the gap", respecting this achievement whilst acknowledging the intention to complete the coverage with "high fidelity" in due course.
"The 10% missing protein gap in completing the overall coverage of the human proteome will hold further keys to understanding human embryonic and childhood development, cell differentiation, and less frequent yet essential responses to disease and environmental and dietary challenges that were essential for hominid survival and evolution from ~2.8 million years ago to today’s modern human," – Overall.
The "missing" proteins of the human proteome may be hiding out of sight in rare cells or tissues, expressed at specific fetal or childhood developmental stages or perhaps in quantities that are too low to be detected by current mass spectrometry approaches. Others may be hiding in plain sight, but their amino acid sequence and chemical composition may render such proteins not amenable to current mass spectrometry instrumentation and methodology. When asked whether it is possible that some proteins may never be mapped, Petricoin says that it would be dogmatic to say never: "Science is always improving and getting better. Right now, it is a product of these 10% being extremely low abundance, having extremely short half-lives and mass spectrometry still not being analytically sensitive enough to 'see' these markers," he explained.
Challenges in proteomics
Despite many impressive advances over the past few decades, proteomics still faces several challenges which influence its utility, particularly in a clinical context. Most notably, data handling is a key issue. "We'll have a massive set of cores in the computer, which are very fast, but the data analysis from just one day of runs can sometimes take days, which is incredible. The computing power has caught up, but we're still struggling to put the data together," said Overall.
The ability to analyze and integrate large omics data sets will be critical for the implementation of proteomic data in the clinical space. This is a core focus for Lane's group. She says, "We will continue to improve the interoperability of neXtProt with resources focusing on clinical and pharmacological data in order to better answer the needs of the medical community."
Funding is also a pertinent issue for the field. The instrumentation adopted in high-throughput proteomics – primarily mass spectrometry – is both expensive and complex, requiring large budgets to fund the equipment and to train specialists to use it.
"One of the biggest challenges is that there is no dedicated funding for the HPP project, in contrast to the former HGP. Nearly all the teams (including ours) participate in the global HPP effort on a voluntary basis, which undoubtedly slows down the whole project," said Lane.
Her thoughts are echoed by Overall, who emphasizes that, whilst he feels privileged to be involved in what he deems a "worthwhile endeavor" that drives him and his lab members, their work would progress much faster and more accurately if more funding was available for essential infrastructure.
"For the next high-fidelity compendium of the full human proteome and to develop a broader understanding of life, human conscience and disease, proteomics needs more data, more patients, more scientists – biochemists, geneticists, engineers, mathematicians, and bioinformaticians, and more doctors to understand life, individuality, personality and disease," – Overall.
Mapping 90% of the human proteome marks a scientific era that embraces a holistic approach to understanding human biology and disease. The timing is almost beautifully ironic, as society faces the greatest health crisis of its time: COVID-19.
"The post SARS-CoV-2 pandemic world will be different. It is likely that new paradigms to accelerate precision medicine will emerge. These will undoubtedly involve global collaboration (even between competing entities) using multi-disciplinary approaches that enable the fast-tracking of novel diagnostic tests and precision therapeutics. Almost certainly these outcomes will require knowledge involving the human proteome – celebrated here in the inaugural HPP High-Stringency Blueprint," Adhikari and colleagues concluded.
Professor Chris Overall, Professor Emanuel Petricoin and Dr Lydie Lane were speaking to Molly Campbell, Science Writer, Technology Networks.
1. Smith LM, Kelleher NL; Consortium for Top Down Proteomics. Proteoform: a single term describing protein complexity. Nat Methods. 2013;10(3):186-187. doi:10.1038/nmeth.2369.
2. Adhikari S, Nice EC, Deutsch EW, et al. A high-stringency blueprint of the human proteome. Nat Commun. 2020;11(1):5301. doi:10.1038/s41467-020-19045-9.
3. Cai W, Zhang J, de Lange WJ, et al. An unbiased proteomics method to assess the maturation of human pluripotent stem cell–derived cardiomyocytes. Circ Res. 2019;125(11):936-953. doi:10.1161/CIRCRESAHA.119.315305.
4. Gordon DE, Jang GM, Bouhaddou M, et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature. 2020;583(7816):459-468. doi:10.1038/s41586-020-2286-9.