Capturing Genetic Diversity in Proteomics
The field of proteomics lacks the ability to capture human diversity. A new tool, ProHap, promises to help.
Proteins are complex biomolecules that orchestrate or contribute to nearly every cellular process, making them a key focus of biomedical research. Many human diseases occur as a result of abnormal proteins, and over 95% of existing drug targets in the human body are proteins.
Our ability to identify, characterize and analyze proteins continues to play a key role in our progress toward personalized medicine. What the field of proteomics lacks, however, is an ability to capture human diversity; most mass spectrometry (MS)-based proteomics research studies compare a cohort’s proteomes to reference proteomes.
A new bioinformatic tool – ProHap – is designed to address this bottleneck. Published in Nature Methods late last year, ProHap is a Python-based tool that creates custom protein sequence databases from large panels of reference human haplotypes.
Jakub Vašíček, PhD candidate at the University of Bergen, is the first author of the paper. He joined Technology Networks to discuss the need for genetic diversity in proteomics, how ProHap can help achieve it and the continued development of the tool.
Why is it important to account for genetic diversity in proteomics? Why has this proven challenging?
When genome sequencing first became available, the field established the “reference” human genome – an arbitrary reference sequence used to map the differences that make us who we are. The field of precision medicine aims to take these differences into account to better personalize treatments – but not all populations are equally represented in medical studies, which can create biases. In genomics, for example, clinical use of established associations between genetic variants and medical conditions (polygenic risk scores) may increase health disparities for underserved communities.
When studying large human cohorts in proteomics, the data are aligned to the reference proteome, the product of the reference genome. All natural and common differences between humans are lost. Furthermore, it is easy to mistake two sequences that are similar. If we only search for reference sequences, some parts of proteins that vary will be falsely matched, while others will remain invisible.
Accounting for the differences between humans makes it possible to avoid such errors and to better account for genetic diversity in medical research.
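The point about variant peptides remaining invisible can be illustrated with a toy example. The sketch below is not ProHap's code; it assumes a simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) and two made-up sequences that differ by a single amino acid. The function and variable names are hypothetical.

```python
import re


def tryptic_digest(protein: str) -> list[str]:
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    return [pep for pep in re.split(r"(?<=[KR])(?!P)", protein) if pep]


# Toy sequences invented for illustration; they differ by one residue (D -> N).
reference_protein = "MASKLEDTRGGAPWKQLVE"
variant_protein = "MASKLENTRGGAPWKQLVE"

reference_peptides = set(tryptic_digest(reference_protein))
variant_peptides = set(tryptic_digest(variant_protein))

# Peptides a variant carrier can produce but a reference-only database lacks.
invisible_in_reference = variant_peptides - reference_peptides
# Reference peptides that the variant protein cannot produce.
absent_in_sample = reference_peptides - variant_peptides

print(sorted(invisible_in_reference))
print(sorted(absent_in_sample))
```

In this toy case, a search restricted to the reference peptides has no entry for the variant peptide, so a spectrum from it would either go unidentified or risk being assigned to a similar-looking sequence.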
Can you discuss the background work that led to the development of ProHap?
This work is part of a broader project funded by the Norwegian Research Council to enable the interpretation of proteomic data in the context of genetic variation. Proteomic data contain the products of genetic variation – we have just been blind to it so far. To be able to chart variation in human samples, we adopted the methods used in genetic epidemiology and wrote ProHap to produce a map of so-called protein haplotypes that would fairly represent the participants of reference genetic panels.
Can you discuss the research that you conducted to showcase ProHap’s utility to the scientific community? What were your key findings?
First, we used ProHap with the genotypes of the 1000 Genomes Project to generate 6 different databases of protein sequences. In the 1000 Genomes Project, participants were pooled into 5 groups: African, American, European, East Asian and South Asian. The first 5 databases therefore represent these groups, while the 6th contains protein sequences commonly expected among all the individuals from the 1000 Genomes panel.
The 1000 Genomes panel is not a perfect sample of human diversity, but it does offer a glimpse into the differences between the reference proteome and individuals in the different populations. We have seen that the share of the proteome that can be affected by genetic variation is highest in the African “superpopulation”, but all individuals would benefit from having their genotypes accounted for in proteomic studies.
We have also used ProHap to create a personalized proteome for a donor whose genome sequence is publicly available online. Stem cells derived from this donor are commercially available and, when analyzing them, we found that we can detect many changes in the protein sequences. In some cases where the donor carries two different versions of the same gene, we can detect the two different proteins that can be encoded.
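To make the idea of two protein versions from one gene concrete, here is a minimal sketch of how variants phased onto two haplotypes could yield two distinct protein sequences. It is an illustration under simplifying assumptions, not ProHap's implementation: the coding sequence, variant positions and function names are all invented, only single-base substitutions are handled, and real genes involve introns, strand orientation and other complications that are ignored here.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_substitutions(cds: str, substitutions: dict[int, str]) -> str:
    """Apply single-base substitutions (0-based position -> alternate base)."""
    bases = list(cds)
    for position, alt_base in substitutions.items():
        bases[position] = alt_base
    return "".join(bases)


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical reference coding sequence: ATG GCT GAT AAA TAA -> M A D K stop.
reference_cds = "ATGGCTGATAAATAA"

# Hypothetical phased variants: one copy matches the reference,
# the other carries two substitutions that are inherited together.
haplotypes = {
    "copy_1": {},
    "copy_2": {4: "T", 6: "A"},
}

proteins = {
    name: translate(apply_substitutions(reference_cds, subs))
    for name, subs in haplotypes.items()
}
print(proteins)
```

Applying all variants of one haplotype together, rather than one variant at a time, is what distinguishes a protein haplotype from a list of isolated substitutions: here the second copy carries both amino acid changes on the same protein molecule.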
Firstly, the proteomics community can readily use the sequence databases that we have published alongside the tool. These databases will be useful for the development of more refined proteomic workflows.
Moreover, as the field of genomics moves to produce various new panels of genotypes, with these data rightfully remaining under the ownership of the respective communities, ProHap can be used on secure servers to produce new protein databases while maintaining the confidentiality of the data.
Finally, we are investigating the usage of databases created by ProHap in studies of the immune system, where refining the protein analyses to the personal level is vital (e.g., when studying graft–host interaction).