Technological developments in next-generation sequencing (NGS) approaches and mass spectrometry (MS)-based methods have advanced the landscape of molecular biology research. Scientists are now able to identify and characterize the various constituents of a cell, tissue or organism and analyze them in their totality – their "ome" – be it the entire complement of genes (genome), proteins (proteome) or metabolites (metabolome), at any given time.
Piecing together this information helps us to understand the molecular journey from genotype to phenotype and, in the case of disease phenotypes, where it can go wrong. As an increasing quantity of omics information becomes accessible to researchers, it has become clear that working with the data sets in isolation can limit their utility. Connectivity between the omics "worlds" is fundamental, but it has been challenging – until now.
In January, the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) announced that it had launched the Genome Integrations with Function and Sequence, or GIFTS, platform. This novel platform enables scientists who use Ensembl and UniProt to access up-to-date genomic and protein data for human and mouse. Technology Networks spoke with Beth Flint, Ensembl applications project leader at EMBL-EBI, Maria Martin, team leader in protein function development at EMBL-EBI, and Daniel Zerbino, team leader in genome analysis at EMBL-EBI, to learn more about GIFTS and how it will be used to help the research community.
Molly Campbell (MC): Please can you talk to us about the rationale behind the Genome Integrations with Function and Sequence (GIFTS)?
Daniel Zerbino (DZ): GIFTS aims to provide a clear and unambiguous bridge between two flagship data resources at the EMBL’s European Bioinformatics Institute (www.ebi.ac.uk), namely Ensembl and UniProt. Together, they offer a wealth of information on protein synthesis: Ensembl describes the upstream nucleotide sequences of protein-coding genes, whose transcripts are translated into downstream protein isoforms, documented in UniProt. Each resource already points to the other, thus connecting genes to proteins and vice versa, but because the two resources follow different release schedules, these links were not 100% consistent. GIFTS now details our shared understanding of which gene maps to which protein.
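The consistency problem described above can be pictured with a small sketch. It is an illustration only, not how GIFTS itself works: the two dictionaries, the function name `inconsistent_links` and every identifier in them are hypothetical stand-ins for the cross-references that each resource might hold at a given release.

```python
# Hypothetical cross-reference tables; all identifiers are made up for illustration.
# Gene-to-protein links as one resource might record them at its latest release.
ensembl_to_uniprot = {
    "GENE_A": {"PROT_1"},
    "GENE_B": {"PROT_2"},
    "GENE_C": {"PROT_3"},
}

# Protein-to-gene links as the other resource might record them at a later release.
uniprot_to_ensembl = {
    "PROT_1": {"GENE_A"},
    "PROT_2": {"GENE_B", "GENE_C"},
    "PROT_4": {"GENE_C"},
}


def inconsistent_links(forward, backward):
    """Return sorted (gene, protein) pairs asserted by only one of the two tables."""
    forward_pairs = {(gene, prot) for gene, prots in forward.items() for prot in prots}
    backward_pairs = {(gene, prot) for prot, genes in backward.items() for gene in genes}
    # The symmetric difference keeps links that appear on one side but not the other.
    return sorted(forward_pairs ^ backward_pairs)


mismatches = inconsistent_links(ensembl_to_uniprot, uniprot_to_ensembl)
for gene, prot in mismatches:
    print(f"{gene} <-> {prot} is recorded by only one resource")
```

In this toy data, the links for GENE_A and GENE_B agree in both directions, while every link involving GENE_C is recorded on one side only. A shared mapping of the kind Zerbino describes would give both resources a single agreed answer for such cases.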
MC: Who is behind the development of GIFTS?
Beth Flint (BF): GIFTS has been developed at EMBL-EBI as a collaboration between Ensembl and UniProt. The project brought together the expertise of these two groups and enabled us to build a tool that lets people easily explore the relationships between the data the groups produce. Collaborative projects of this nature take full advantage of the breadth of knowledge and diverse range of skills at EMBL-EBI. The GIFTS project was possible due to the input of curators, annotators, database and API experts, user interface developers and pipeline automation specialists.
MC: Why is it important to connect the genome and proteome worlds?
Maria Martin (MM): The genome is the storehouse of the genetic material needed for an organism to function. Proteins are the primary effectors of the instructions encoded in our genomes and they and their products ultimately shape our cells, tissues, organs and bodies in response to our environment. Proteins provide an essential link between genome sequence and the eventual phenotype. The functional analysis of genomic and other large-scale biomedical datasets requires integrated information about many distinct types of biological entity, including individual genes, transcripts and proteins.
MC: Why has this been difficult previously?
MM: Ensembl focuses on the annotation of transcripts in reference genomes using available cDNA, EST and RNA-seq data, while UniProt focuses on annotating protein sequences using experimental evidence from the literature, homologs in other species and proteomics experiments. The study of genomes and the study of proteomes each require very specialized scientific knowledge, and both kinds of expertise need to be combined to map one to the other effectively.
MC: What do you hope the outcome of launching GIFTS will be?
BF: The mappings that GIFTS pipelines produce will help the Ensembl and UniProt teams update the data they present via their main sites. Using the GIFTS data will present a uniform view of mappings between the two domains and ensure that consistent information is presented. This provides an enormous benefit to those who use these mappings. Behind the public-facing interface of GIFTS are tools used by the annotators and curators in the Ensembl and UniProt groups. These tools enable them to review and improve mappings. As this process continues, it is hoped that over time canonical UniProt isoforms will be selected for all human genes and that these will match the MANE transcripts from Ensembl.
MC: Are there any intentions to launch a similar platform for other "omics" data?
DZ: EMBL-EBI resources strive to be interoperable with each other, and this new bridge is merely strengthening a very tight network of data resources. For example, other EMBL-EBI resources, such as the Reactome pathway database or the Gene Expression Atlas, already link directly and unambiguously to UniProt proteins or Ensembl genes. A sister project, Structure Integration with Function, Taxonomy and Sequence (SIFTS), now connects UniProt protein sequences to their 3D structures, stored in PDBe. All these interconnections are enabling researchers to make sense of all the resources at EMBL-EBI, as exemplified by our unified search utility.
Daniel Zerbino, Beth Flint and Maria Martin were speaking to Molly Campbell, Science Writer for Technology Networks.