

Non-Coding DNA: The Unexplored Genome

Image: Section of ACGT code. Credit: iStock

Next-generation sequencing (NGS) technologies, used to rapidly read genetic information, have consistently opened new frontiers in biology. The first NGS device – 454 Life Sciences’ Genome Sequencer 20 – was released in 2006. With their speed and efficiency, these machines far outpaced Sanger sequencing, the method used to complete the Human Genome Project. Now, 18 years later, NGS has come of age. The latest advances have helped researchers explore areas of the genome that had previously been dismissed as uninformative.

The dark corners of the genome

Our complex genetic information can be sliced in many ways, but one common division is between coding and non-coding DNA. As the names suggest, only coding regions carry instructions that are transcribed into RNA and then translated into proteins. It was once thought that non-coding regions were effectively “junk” DNA.1 Using very different technologies, two new studies have added to an already large pile of evidence showing that non-coding DNA is instead a rich, unexplored genomic frontier.


Early NGS technologies relied on short-read sequencing approaches. This method chops nucleic acid chains into sections of up to 600 base pairs. Over a decade of development has provided short-read sequencing with an array of data pipelines and complementary technologies that have cemented its position as the most widely used NGS approach.


But short reads have their limitations. After a genome has been fragmented, attached to adapters that the NGS system recognizes and amplified by polymerase chain reaction (PCR), the resulting data deluge resembles a complicated biological jigsaw puzzle. This genomic information has to be assembled and aligned into its correct place. Various techniques are used to achieve this alignment, but shorter reads can run into trouble when facing stretches of repetitive sequences.2
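To make the jigsaw problem concrete, here is a minimal Python sketch of why a read that falls entirely inside a repeat cannot be placed. The toy sequence, the read lengths and the helper `find_all` are all invented for illustration, and exact string matching stands in for a real aligner.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping matches are also found.
        start = genome.find(read, start + 1)
    return positions


# Toy "genome" (hypothetical): a unique flank, a repeated unit, a unique flank.
LEFT, UNIT, RIGHT = "GATTACAGT", "CACA", "TTGCGGATC"
genome = LEFT + UNIT * 5 + RIGHT

short_read = UNIT * 2          # 8 bases, lies entirely inside the repeat
long_read = genome[5:5 + 30]   # 30 bases, spans the repeat plus both flanks

short_hits = find_all(genome, short_read)
long_hits = find_all(genome, long_read)

print("short read fits at", len(short_hits), "positions:", short_hits)
print("long read fits at", len(long_hits), "position:", long_hits)
```

In this toy case the short read matches at several offsets within the repeat, so its true origin is ambiguous, while the longer read reaches the unique flanking sequence on both sides and fits in only one place.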


The latest generation of NGS has adopted the mantra that “bigger is better” by sequencing lengthier chunks of DNA than ever before, providing a new way of looking at the genome’s dark regions.

Looking for landmarks

Analyzing a repetitive sequence is like finding your bearings out on a remote steppe. With short-read sequencing, you don’t have the perspective to spot landmarks on the horizon that will orient you. A long-read sequencer is like pulling a telescope out of your back pocket – now far-away features come into view, allowing what seemed like endless scrub to be placed in context.




“If a repetitive region is longer than the read length, you cannot resolve it,” sums up Dr. Brandon Pickett, a postdoctoral researcher at the National Human Genome Research Institute (NHGRI). This was the challenge facing Pickett and colleagues in a bold study that produced the first complete chromosomal sequences from non-human primates.3 Published in Nature, the research mapped X and Y chromosomes from six species, including the bonobo and chimpanzee. Repetitive sequences comprised as much as 66% of the X chromosomes and 82% of the Y chromosomes analyzed. Short-read approaches were not an option here – some of the repetitive sequences, says Pickett, “stretched tens of thousands of nucleotides.”


Instead, the team used a combination of two techniques – Pickett calls this “the secret sauce” – that use 100,000 bp and 20,000 bp reads, respectively. The first technique proved invaluable for spanning lengthy repeats, while the second was used to identify inexact repeats. These are regions that also occur elsewhere in the genome, but with slight differences that allow a read to be aligned to its “best match.”
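The “best match” idea can be illustrated with a small Python sketch. This is not the study’s pipeline: the three repeat copies, the reads and the helper names are hypothetical, and a simple mismatch count (Hamming distance) on equal-length sequences stands in for a real alignment score, which would also have to handle insertions and deletions.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def best_match(read: str, copies: dict[str, str]) -> tuple[str, int]:
    """Return (name, mismatches) for the repeat copy closest to the read."""
    scores = {name: hamming(read, seq) for name, seq in copies.items()}
    name = min(scores, key=scores.get)
    return name, scores[name]


# Hypothetical inexact repeats: near-identical copies found at different loci.
copies = {
    "copy_A": "ACGTTGCAAGGCTA",
    "copy_B": "ACGTTGCATGGCTA",  # one base different from copy_A
    "copy_C": "ACGATGCAAGGCTT",  # two bases different from copy_A
}

clean_read = "ACGTTGCATGGCTA"  # error-free read drawn from copy_B
noisy_read = "ACGTTGCATGGCTT"  # same read with one sequencing error at the end

print(best_match(clean_read, copies))
print(best_match(noisy_read, copies))
```

Both reads land on copy_B here because the few bases that distinguish the copies outweigh the single error. The more accurate the read, the more confidently such small differences can be used to separate near-identical repeats.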


The analysis revealed that these genomic flatlands contained far more surprises and diversity than expected – especially in the repeat-heavy Y chromosomes. “The Y chromosomes in the great apes have accumulated a lot of deletions and changes to repetitive element composition,” says Pickett. These changes have also occurred rapidly. “Some of these species diverged from the human lineage only seven mya (million years ago), which is tiny in the evolutionary time scale,” he adds.


What’s the effect of this unexpected variation? “We don’t have a full understanding of the role that most non-coding DNA plays in a genome,” says Pickett. But more sequencing is likely to produce an answer, says Pickett’s co-author, Dr. Kateryna Makova, the Francis R. and Helen M. Pentz professor of biology in the Eberly College of Science at Pennsylvania State University. “Right now, we have complete genomes of just one male from each species, but it would be not just interesting, but very important, to also generate complete genomes of other individuals,” she adds.

The secrets of splicing

However, not all non-coding genomic regions consist of repetitive sequences. Some produce RNA molecules that assist in the transcription and translation of coding regions. Even so, we are only now starting to understand how these genomic regions might impact human health. A recent study published in Nature Medicine found that mutations in non-coding DNA might underlie many cases of previously unexplained neurodevelopmental disorders.4 The mutations were found in a non-coding gene called RNU4-2, which encodes the U4 small nuclear RNA (snRNA), a component of a structure called the spliceosome. This molecular machine stitches together sections of RNA that will ultimately be translated into proteins.




Dr. Ernest Turro, a principal investigator at the Icahn School of Medicine at Mount Sinai and the senior author of the study, said the key to these findings was short-read whole-genome sequencing. Previous large-scale genomic analyses of neurodevelopmental disorders had instead used whole-exome sequencing. “Consequently, while those studies were extremely successful in discovering disease-causing variants in coding genes, they were blind to genetic variation in non-coding genes such as RNU4-2,” he added, explaining that such approaches only target around 2% of the genome and omit many non-coding genes.
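The blind spot Turro describes can be sketched in a few lines of Python. Everything below is hypothetical: the capture intervals, the region length and the variant positions are made-up numbers chosen so that the targets cover 2% of a toy region, and they do not correspond to real exon coordinates or to RNU4-2.

```python
from bisect import bisect_right

# Hypothetical exome capture targets as half-open (start, end) intervals,
# sorted and non-overlapping, within a toy region of 40,000 bases.
EXOME_TARGETS = [(1_000, 1_200), (5_000, 5_350), (9_400, 9_650)]
REGION_LENGTH = 40_000


def is_targeted(pos: int, targets: list[tuple[int, int]] = EXOME_TARGETS) -> bool:
    """Return True if `pos` falls inside any capture target."""
    starts = [start for start, _ in targets]
    i = bisect_right(starts, pos) - 1  # last target starting at or before pos
    return i >= 0 and pos < targets[i][1]


# Fraction of the toy region that exome sequencing would read at all.
targeted_fraction = sum(end - start for start, end in EXOME_TARGETS) / REGION_LENGTH

coding_variant = 5_100      # hypothetical variant inside a captured exon
noncoding_variant = 7_777   # hypothetical variant in an uncaptured gene

print(f"targeted fraction: {targeted_fraction:.1%}")
print("coding variant seen by exome sequencing:", is_targeted(coding_variant))
print("non-coding variant seen by exome sequencing:", is_targeted(noncoding_variant))
```

Whole-genome sequencing reads every position in the region, so the second variant would be covered. An exome-only design never generates data there, however many patients are sequenced.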


Turro explains there are two different types of spliceosomes – major and minor. The major spliceosome directs most splicing, but Turro’s study is the first to tie a non-coding mutation in this machine to a rare disease.


Many of these genes are highly expressed in the growing brain, he adds. “It may be that mis-splicing of genes in neuronal cells during development is responsible for the intellectual disability in patients,” explains Turro.


As with Pickett and Makova’s work, the next step that Turro sees is large-scale sequencing. Patients around the world could receive a genetic diagnosis based on his team’s work, but analysis of thousands of genomes will enhance statistical power and help researchers understand genomic disease better.

A new frontier

The Human Genome Project – which set out to map every base of our DNA – was only truly completed in 2022, 19 years after the draft genome was revealed in 2003.5 This delay was caused by repetitive, non-coding gaps that required long-read sequencing to fill in. This milestone might seem like a final step to understanding our genomic information, but the glut of new research in the field shows that the complete sequence was only a landing point for a new world of discovery. Non-coding areas of the genome still hold many secrets, and the technology needed to explore them is improving all the time. What is clear is that the pace of discovery has accelerated from the crawl of the Human Genome Project to a breakneck sprint. “The thrill when we obtained the complete sequences of these chromosomes was an amazing feeling,” says Makova. “I thought that maybe it would not happen in my lifetime.”


References:


1. Fagundes NJR, Bisso-Machado R, Figueiredo PICC, Varal M, Zani ALS. What we talk about when we talk about “junk DNA.” Genome Biol Evol. 2022;14(5):evac055. doi: 10.1093/gbe/evac055

2. Wang P, Meng F, Moore BM, Shiu SH. Impact of short-read sequencing on the misassembly of a plant genome. BMC Genomics. 2021;22(1):99. doi: 10.1186/s12864-021-07397-5

3. Makova KD, Pickett BD, Harris RS, et al. The complete sequence and comparative analysis of ape sex chromosomes. Nature. 2024:1-11. doi: 10.1038/s41586-024-07473-2

4. Greene D, Thys C, Berry IR, et al. Mutations in the U4 snRNA gene RNU4-2 cause one of the most prevalent monogenic neurodevelopmental disorders. Nat Med. 2024:1-5. doi: 10.1038/s41591-024-03085-5

5. Rhie A, Nurk S, Cechova M, et al. The complete sequence of a human Y chromosome. Nature. 2023;621(7978):344-354. doi: 10.1038/s41586-023-06457-y

 

About the interviewees:

Kateryna Makova is the Francis R. and Helen M. Pentz professor of biology in the Eberly College of Science at Pennsylvania State University.


Ernest Turro is a biostatistician and a principal investigator at Icahn School of Medicine at Mount Sinai.


Brandon Pickett is a postdoctoral fellow in the Genome Informatics Section at the National Human Genome Research Institute.