Two decades ago, the full genome sequence of humankind was released. It was funded by international government and philanthropic sources at a cost of billions of dollars.
Fast forward to 2008 and, driven by the need for better genome understanding and the precipitous drop in sequencing costs, the Genome 10K Community of Scientists (G10K) was established to promote and ensure the genome analysis of 10,000 species of vertebrates. The G10K-sponsored Vertebrate Genomes Project embraced dramatic improvements in sequencing bio-technologies in the last few years to expand production of high-quality reference genome assemblies for all ~70,000 living vertebrates in the coming years.
Today, the G10K-sponsored Vertebrate Genomes Project (VGP) announces its flagship study and associated publications focused on genome assembly quality and standardization for the field of genomics. This study includes 16 diploid, high-quality, near error-free, and near-complete vertebrate reference genome assemblies for species across all taxa with backbones (i.e., mammals, amphibians, birds, reptiles, and fishes) from five years of piloting the first phase of the VGP.
In a special issue of Nature, along with simultaneous companion papers published in other scientific journals, the VGP details numerous technological improvements based on these 16 genome assemblies. In this new study, the VGP demonstrates the feasibility of setting and achieving high-quality reference genome metrics using a state-of-the-art automated approach that combines long-read sequencing and long-range chromosome scaffolding with novel algorithms that put the pieces of the genome assembly puzzle together. To date, the current VGP pipelines have led to the submission of 129 diploid assemblies, representing the most complete and accurate versions for those species, and the project is on the path to generating thousands of genome assemblies, demonstrating feasibility not only in quality standardization but also in scale.
The animals in this study included, among others:
- Mammals: Pale spear-nosed bat; Egyptian fruit bat; Canada lynx; vaquita; platypus;
- Birds: Zebra finch; kakapo; Anna's hummingbird;
- Reptile: Goode's thornscrub tortoise;
- Fish: Zig-zag eel; climbing perch; blunt-snouted clingfish.
"When we first started the G10K idea, we gathered a small handful of diverse field zoologists together with genome-centric computer scientists, pledging to work together to develop genome sequence data for thousands of the world's vertebrates," said Stephen O'Brien, Ph.D., a professor and research scientist at Nova Southeastern University's (NSU) Halmos College of Arts and Sciences. "We wanted to offer a gift for the next generation of genome scientists. Today the dream of genome empowerment of so many living species took a giant leap forward."
O'Brien is the co-founder of the Genome 10K Consortium, the Chief Scientific Officer at the Theodosius Dobzhansky Center for Genome Bioinformatics, St. Petersburg State University, Russia, and a member of the National Academy of Sciences.
The G10K-VGP's approach combines assembly pipelines with manual curation to fix misassemblies, major gaps, and other errors, which informs the iterative development of better algorithms. For example, the VGP helped reveal high levels of false gene duplications, losses or gains, due mostly to algorithms not properly separating maternal and paternal chromosomes. One solution includes a trio binning approach of using DNA from the parents to separate out the paternal and maternal sequences in the offspring. For cases where parental data is unavailable, another solution developed by the VGP and collaborators is an algorithm called FALCON-Phase that reduces the computational complexity of phasing maternal and paternal DNA sequences at chromosome scale.
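The idea behind trio binning can be illustrated with a toy sketch. This is a simplified illustration under stated assumptions, not the VGP pipeline: the function names (`kmers`, `parent_specific_kmers`, `bin_read`), the example sequences, and the k-mer size are all hypothetical, and production tools add k-mer count thresholds and error handling that are omitted here. The sketch finds short subsequences (k-mers) seen in only one parent, then assigns each offspring read to whichever parent it shares more of those k-mers with.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets.

    Toy version: any k-mer seen in one parent and not the other counts,
    with no filtering for sequencing errors.
    """
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign an offspring read to the parent whose private k-mers it shares most."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie


# Hypothetical example: the parents differ at a single base (G vs. C).
K = 5
mat_only, pat_only = parent_specific_kmers(["ACGTACGTTAGC"], ["ACGTACCTTAGC"], K)

for read in ["GTACGTTAG", "TACCTTAGC", "ACGTAC"]:
    print(read, bin_read(read, mat_only, pat_only, K))
```

Reads that span the differing base are assigned to one parent, while a read covering only shared sequence stays unassigned. Once reads are separated this way, each parental set can be assembled on its own.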
"When I was asked to take on leadership of the G10K-VGP in 2015, I emphasized the need to work with technology partners and genome assembly experts on approaches that produce the highest quality data possible, as it was taking months per gene for my students and postdocs to correct gene structure and sequences for their experiments, which was causing errors in our biological studies," said Erich Jarvis, lead of the VGP sequencing hub at The Rockefeller University, Chair of the G10K and a Howard Hughes Medical Institute Investigator. "For me this was not only a practical mission, but a moral imperative."
Kerstin Howe, lead of the curation team at the Wellcome Sanger Institute in the UK, said: "Our new approach to produce structurally validated, chromosome-level genome assemblies at scale will be the foundation of ground-breaking insights in comparative and evolutionary genomics."
"It truly was a challenge to design a pipeline applicable to highly diverged genomes - our largest genome, 5GB in size, broke almost every tool commonly used in assembly processes," said Arang Rhie, from the National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, who is the first author of the flagship paper. "The extreme level of heterozygosity or repeat contents posed a big challenge. This is just the beginning; we are continuously improving our pipeline in response to new technology improvements."
Adam Phillippy, chair of the VGP genome assembly and informatics working group of more than 100 members and head of the Genome Informatics Section of the National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland, USA, added: "Completing the first vertebrate reference genome, human, took over 10 years and $3 billion dollars. Thanks to continued research and investment in DNA sequencing technology over the past 20 years, we can now repeat this amazing feat multiple times per day for just a few thousand dollars per genome."
Specific to conservation and in collaboration with the Māori in New Zealand and officials in Mexico, genomic analyses of the kākāpō, a flightless parrot, and the vaquita, a small porpoise and the most endangered marine mammal, respectively, suggest evolutionary and demographic histories of purging harmful mutations in the wild. The implication of these long-term small population sizes at genetic equilibrium gives hope for these species' survival.
Richard Durbin, a Professor at the University of Cambridge and lead of the VGP sequencing hub at the Wellcome Sanger Institute in the UK, said: "These studies mark the start of a new era of genome sequencing that will accelerate over the next decade to enable genomic applications across the whole tree of life, changing our scientific interactions with the living world."
The G10K-VGP consortium has involved hundreds of international scientists from more than 50 institutions in 12 countries since the VGP was initiated in 2016, and it is exemplary in its scientific cooperation, extensive infrastructure, and collaborative leadership. Additionally, as the first large-scale eukaryotic genomes project to produce reference genome assemblies meeting a specific minimum quality standard, the VGP has become a working model for other large consortia, including the Bat 1K, Global Invertebrate Genome Alliance (GIGA), Pan Human Genome Project, Earth BioGenome Project, Darwin Tree of Life, and European Reference Genome Atlas, among others.
"The VGP project is at the vanguard of the creation of a genomic catalog in analogy with Linnaeus' classification of life," said Gene Myers, lead of the VGP sequencing hub at the Max Planck Institute in Dresden, Germany. "I and my colleagues in Dresden are excited to be contributing superb genome reconstructions with the funding of the Max-Planck Society of Germany."
As a next step, the VGP will continue to work collaboratively across the globe and with other consortia to complete Phase 1 of the project, which covers approximately one representative species for each of the 260 vertebrate orders, each separated by a minimum of 50 million years from a common ancestor with the other species in Phase 1. The VGP intends to create comparative genomic resources with these 260 species, including reference-free whole genome alignments, that will provide a means to understand the detailed evolutionary history of these species and create consistent gene annotations. Genome data are primarily generated at three sequencing hubs that have invested in the mission of the VGP: The Rockefeller University's Vertebrate Genome Lab, New York, USA; Wellcome Sanger Institute, UK; and Max Planck Institute, Germany.
Phase 2 will focus on representative species from each vertebrate family and is currently in the process of sample identification and fundraising. The VGP has an open-door policy and welcomes others to join its efforts, ranging from fundraising and sample collection to generating genome assemblies or contributing their own genome assemblies that meet the VGP metrics as part of its overall mission.
The VGP collaborated with genome sequencing companies and tested many of their protocols, including those of Pacific Biosciences, Oxford Nanopore Technologies, Illumina, Arima Genomics, Phase Genomics, and Dovetail Genomics; some of these companies' scientists are also co-authors of the flagship study. The VGP also collaborated with DNAnexus and Amazon to generate a publicly available VGP assembly pipeline and host the genomic data in the Genome Ark database. The genomes, annotations and alignments are also available in international public genome browsing and analysis databases, including the National Center for Biotechnology Information Genome Data Viewer, Ensembl genome browser, and UC Santa Cruz Genomics Institute Genome Browser. All data are open source and publicly available under the G10K data use policies.
Reference: The Vertebrate Genomes Project. Springer Nature. https://www.nature.com/immersive/d42859-021-00001-6/index.html Accessed April 30, 2021.