The First Complete Human Genome Sequence Is Published
March 31 represents a pivotal moment for the scientific community, as the first “gapless” sequence of the human genome is published.
History of the human genome
In the 1990s, researchers embarked on a mission that would reshape the landscape of scientific research forever: The Human Genome Project (HGP). The goals of the project – coordinated by the US Department of Energy and the National Institutes of Health – included:
- Identifying all of the genes in human DNA
- Determining the sequences of the 3 billion chemical base pairs that make up human DNA
- Storing the information
- Improving data analysis tools
- Transferring related technologies to the private sector
- Addressing the societal, ethical and legal issues that might arise as a consequence of the project
Originally anticipated to take 15 years (1990–2005), the HGP was accelerated by advances in sequencing technologies, leading to its early completion in 2003.
The molecular biology insights and novel sequencing technologies generated both directly and as an indirect result of the HGP have changed scientific research, medicine and society more generally. In personalized medicine, physicians can now prescribe tailored, targeted treatment plans based on the unique DNA composition of cancer patients’ tumors. In agriculture, farmers can access genomic information for crops and animals in a timely manner. This helps to improve selective breeding programs that previously relied on visually observing phenotypic changes across generations. A diverse range of large-scale genome-based “mega projects” are ongoing across different corners of the world, such as the All of Us Research Program and the Earth BioGenome Project. All these advances – and many more – exist because of the success of the HGP.
Geneticist Professor Richard Gibbs provides arguably the most apt summary of the project’s global impact in The Human Genome Project changed everything: “It is simply inconceivable today that we would not have the genome at our fingertips,” he writes.
An incomplete picture
When the HGP was declared complete in 2003, it wasn’t “technically” finished; rather, it was finalized to the best of our ability at that time.
“The HGP mapped about 92% of the human genome sequence. The remaining sequences were complex in nature and required advances in technology that were not available at the time,” explains Evan Eichler, professor of genome sciences at the University of Washington School of Medicine, and investigator at Howard Hughes Medical Institute.
The remaining eight percent contains highly repetitive sequences of DNA that were “unreadable” in the early 2000s due to technological, cell line and computational limitations. Consider the staggering size of the human genome: an estimated 3 billion base pairs. That’s a lot of information to process at one time. Consequently, next-generation sequencing (NGS) methods require DNA to be “cut up” into chunks. These chunks are amplified (copied), sequenced, and then matched up and reassembled in the correct order using computational methods to create a larger sequence. If the sequence contains many repetitive elements, the matching-up process becomes incredibly difficult. It’s likened to piecing together a jigsaw puzzle where some of the pieces are identical. How do you know exactly which piece goes where in the larger picture of the puzzle?
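The jigsaw problem can be illustrated with a toy sketch. The Python below is not the assembly method used by the HGP or any real assembler; the function name `read_placement_ambiguity` and the made-up `toy_genome` are assumptions for illustration only. It cuts a short sequence into every possible read of a given length and reports what fraction of those reads match more than one position, which is the situation where an assembler cannot tell where a piece belongs.

```python
from collections import Counter


def read_placement_ambiguity(genome: str, read_len: int) -> float:
    """Return the fraction of reads of length read_len that occur
    at more than one position in the genome (hypothetical helper)."""
    # Every possible error-free read of this length, one per start position.
    reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]
    counts = Counter(reads)
    # A read is ambiguous if the same string appears at two or more positions.
    ambiguous = sum(1 for read in reads if counts[read] > 1)
    return ambiguous / len(reads)


# Toy "genome" (invented): unique flank + six copies of a repeat unit + unique flank.
unique_left = "ACGTTGCA"
repeat_unit = "GATTACA"
unique_right = "TTGGCCAA"
toy_genome = unique_left + repeat_unit * 6 + unique_right

short_reads = read_placement_ambiguity(toy_genome, read_len=5)
long_reads = read_placement_ambiguity(toy_genome, read_len=50)

print(f"5-base reads that fit in more than one place:  {short_reads:.0%}")
print(f"50-base reads that fit in more than one place: {long_reads:.0%}")
```

In this toy case, most of the short reads are identical puzzle pieces, while reads long enough to reach from the repeat into a unique flank can each be placed in only one spot.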
The burdensome technological restraints, coupled with an incomplete understanding of what the unknown genome could be responsible for, led to the partly completed sequence in 2003. “Because this eight percent of the genome was not rich in genes, many scientists were not interested in the additional effort required to finish it,” Eichler adds.
Consequently, the “missing” eight percent of the genome was nicknamed the “dark” genome, or “junk” DNA by some. But Eichler and many of his peers didn’t see junk; they saw potential treasure.
The last 20 years of Eichler’s research have been dedicated to this line of enquiry and resolving the sequences. During that time, both he and other DNA researchers across the world have demonstrated that within this “unknown” territory lie important regulatory elements, among other genome gems.
Repetitive elements like to move around the genome too, and thus have been termed “jumping genes” by some. The functional impact that this movement can have needs to be further understood, as it may contribute to human disease and evolution. “The repetitive regions of our genome are the most dynamic, and as a result they mutate very quickly over short periods of time. I hypothesized that these regions are genomic hotspots that contribute disproportionately to human disease and evolution,” Eichler says.
Credit: National Human Genome Research Institute.
No more unknowns
Since its conception as part of the HGP, the standard reference human genome – known as Genome Reference Consortium build 38, or GRCh38 – has been continuously updated, closing some of the “gaps” in the genome – and our knowledge. But it hadn’t been fully completed, until now.
Eichler is part of a large collaboration – the Telomere-to-Telomere (T2T) Consortium – that has successfully sequenced the entire human genome, including the “missing” eight percent. The new reference genome, which is called T2T-CHM13, can be accessed via the University of California Santa Cruz (UCSC) Genome Browser and is discussed through a series of papers published in the journal Science.
The T2T Consortium is led by Professor Karen Miga, associate director at the University of California Santa Cruz (UCSC) Genomics Institute and Dr. Adam Phillippy, head of the genome informatics section and a senior investigator in the computational and statistical genomics branch at the National Human Genome Research Institute.
How was this pivotal moment in genomics made possible?
The completion of T2T-CHM13 was made possible by several contributing factors, the T2T team explain.
In the 2000s, scientists were contemplating how they could overcome a particular hurdle when sequencing the full genome. Our genomes carry two sets of chromosomes, one from our mother and one from our father. When the DNA sequence is “chopped” into smaller pieces and re-assembled, the sequences we inherit from our mother or our father can get jumbled up, which makes it difficult to identify variation across genomes. Eichler explains: “Large scale differences between your parental chromosomes – especially in the repeats – make it difficult to resolve because sometimes you switch between the two, creating gaps.”
Eichler had an idea. What if the researchers focused on just one of the genomes, instead of navigating both the maternal and paternal genomes at the same time? In 2004, he turned to Professor Urvashi Surti, reproductive geneticist and laboratory director at the University of Pittsburgh School of Medicine. Surti was working with a particular cell line, derived from a hydatidiform mole, that interestingly carried two copies of the paternal DNA and none of the maternal DNA.
“I was one of the three leaders of the project along with Karen Miga and Adam Phillippy. I originally put forward the idea with Urvashi Surti back in 2004 that sequencing the hydatidiform mole (paternal material only) would greatly simplify completion of the human genome,” Eichler says.
What is the hydatidiform mole?
The hydatidiform mole occurs most often when an oocyte lacking an active nucleus is fertilized by a sperm, followed by duplication of the paternal chromosomes.
“By focusing just on one, any difference we found, we knew, represented a different region, so this single genome [kept] us from making mistakes during the assembly. In a diploid genome you would have difficulty distinguishing allelic variants originating from the parents versus variants that corresponded to repetitive regions,” Eichler explains. “By eliminating one parent, we knew any difference we found that was real must correspond to a different (repeat region) [...] When Urvashi agreed to work establishing genomic resources with me back in 2004, it was an exciting time because I knew the resources would allow us to tackle ANY region of the genome [...] In other words, every repetitive region could in principle be resolved.”
Eichler attributes the ability to assemble the full genome to this cell line and advances in gene sequencing technologies, such as long-read sequencing.
Long-read sequencing, sometimes referred to as “third-generation sequencing”, differs from NGS methods that “cut” the DNA up into smaller chunks. Instead, long-read sequencing technologies can sequence single DNA molecules in real-time, often without amplification, which enables the reading of long DNA strands often between 10,000–100,000 base pairs in length. For this work, the research team utilized two different types of long-read sequencing, one that is capable of reading up to 1 million base pairs in a single read, with modest accuracy, and another that can sequence 20,000 base pairs with almost perfect accuracy.
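One way to see why read length matters is to ask whether a single read can cover an entire repeat plus some unique sequence on either side. The sketch below is a back-of-the-envelope calculation, not the T2T Consortium's method. The two read lengths come from the figures quoted above (treating "up to 1 million" as the read length for simplicity); the repeat sizes, the 1,000-base anchor and the name `spanning_start_positions` are hypothetical values chosen for illustration.

```python
def spanning_start_positions(read_len: int, repeat_len: int, anchor: int = 1000) -> int:
    """Count the read start offsets at which one read covers the whole repeat
    plus `anchor` unique bases on each side. Zero means a read of this
    length can never span the repeat (hypothetical helper)."""
    needed = repeat_len + 2 * anchor  # repeat plus a unique anchor on both sides
    return max(0, read_len - needed + 1)


# Read lengths quoted in the article.
READ_LENGTHS = {
    "high-accuracy read (20,000 bp)": 20_000,
    "ultra-long read (1,000,000 bp)": 1_000_000,
}

# Repeat sizes below are ASSUMPTIONS for illustration, not measured values.
HYPOTHETICAL_REPEATS = {
    "15,000 bp duplication": 15_000,
    "500,000 bp satellite array": 500_000,
}

for read_name, read_len in READ_LENGTHS.items():
    for repeat_name, repeat_len in HYPOTHETICAL_REPEATS.items():
        n = spanning_start_positions(read_len, repeat_len)
        verdict = "can span" if n > 0 else "cannot span"
        print(f"{read_name} {verdict} a {repeat_name}")
```

Under these assumed sizes, the shorter, more accurate reads bridge the smaller repeat but not the larger one, which only the very long reads can cross. That is one plausible reading of why combining the two technologies is useful.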
“I was an early adopter of long-read sequencing and showed its potential to more accurately characterize large repeats,” says Eichler. By 2017, Miga and Phillippy had been utilizing long-read sequencing to sequence large stretches of DNA. It occurred to the research team that now, with the cell line and the novel sequencing capabilities in hand, was the time to confront the “missing” eight percent of the genome. And so, the T2T consortium – as the name suggests – sequenced each chromosome, telomere to telomere.
No more “mind the gap”
Once the full genome was available, the T2T researchers each took a closer look at its components to see what novel discoveries could be made. Eichler summarizes the “key gaps” that are filled by T2T-CHM13, compared to GRCh38:
- The first sequences of the ribosomal DNA (rDNA) from acrocentric chromosomes, the centromere satellites and duplicated genes are now available
- We now have a complete genome to improve discovery of variation and more complex variation as we remap the data to this complex genome
- A blueprint for how to sequence and assemble other genomes completely in the future now exists due to the project
Eichler’s laboratory focused largely on the assembly and characterization of the duplicated regions and the new genes identified in the previously “missing” region, he explains: “Most of the new genes were duplicated families, and the data that was generated was used to characterize the genes.”
Accessing the full genome also helped the researchers identify complex regions of variation. “One person might have 10 copies of a particular gene, while others might have only 1 or 2. This variation can spell trouble during fertilization, when chromosomes from mom and dad line up and swap pieces. The mismatched genes may lead to an ‘earthquake’ of gene alterations,” Eichler says. These newly identified regions included in T2T-CHM13 will be crucial for further understanding disease susceptibility and the rapid evolution of humankind, he emphasizes: “We are solving cases of genetic diseases that were previously missed because we are discovering more complex forms of variation.”
Over at UCSC, Miga and colleagues’ work concentrated on satellite DNA.
What is satellite DNA?
Long stretches of DNA that contain many repeats of short units. Satellites are located at very specific points within the genome, such as the short arms of certain chromosomes and near to the centromeres.
Centromeres are important for chromosome segregation in cell division, a process that is known to become dysfunctional in many human diseases, like cancer. “We’ve never been able to sequence them at the sequence level,” says Miga in a news release. “For the first time, we can study ‘base-by-base’ the sequences that define the centromere and can start to understand how it works.”
A diverse human genome reference
T2T-CHM13 is now complete, but the work is far from over for the T2T consortium. Eichler explains that the next steps will be to repeat the project for diploid organisms, i.e., where both paternal and maternal genomes are analyzed. “We are close to achieving this,” he hints. Once accomplished, it will be applied to understand the diversity of human genomes across the globe and also applied to patient samples.
T2T has also teamed up with the Human Pangenome Reference Consortium, which aims to develop a novel human pangenome reference created using the complete genome sequences of 350 people. This effort aligns with increasing calls for genomics research to be more diverse.
As DNA analysis continues to inform a growing amount of clinical medicine, if genetic risk assessments that utilize the reference genome do not consider diverse populations, global health disparities could be widened. “With the unprecedented increase in size and scope of genome sequencing studies, there is an urgent need for an improved reference that can capture additional unique sequences that are prevalent in different human populations,” write Wong et al. in Towards a reference genome that captures global genetic diversity.
Let's not call it "junk" anymore
It took twice as long to finalize the missing 8% of the human genome as it did to sequence the first 92%. These efforts have not been in vain: the approach developed by the team provides a blueprint for how patient genomes will be characterized in the future. Eichler says, “T2T genomes will mean more complete variant discovery, and improved understanding and diagnosis of genetic diseases.”
This project has confirmed the suspicions of Eichler and the whole T2T consortium that the once “missing” regions of the genome are far from a genetic wasteland; they are essential for life. “Centromere satellites are necessary for segregation of chromosomes during cell division, rDNA is essential for cells to produce proteins in cells. The segmental duplication genes distinguish us from chimpanzees and encode some of the genes that are critical for building a bigger brain. In essence, the sequence is critical for life and making us human,” says Eichler, before concluding: “Let’s not call it ‘junk’ anymore.”
Professor Evan Eichler was speaking to Molly Campbell, Senior Science Writer for Technology Networks.