Whole-Genome Sequencing Enables Rare Variant Discovery at Scale


The recent Nature publication of the UK Biobank whole-genome sequencing (WGS) dataset represents a landmark in population genomics. Comprising nearly 500,000 genomes and ~1.5 billion variants, this dataset substantially exceeds the resolution of previous large-scale resources, enabling novel insights into the genetic architecture of health and disease.


Illumina’s DRAGEN technology was at the center of this feat. To explore the technical underpinnings of the dataset’s creation and its broader implications, Technology Networks spoke with Rami Mehio, head of Software and Informatics at Illumina. Mehio discussed the innovations that enabled high-precision variant detection in complex genomic regions, the role of Illumina Connected Analytics in large-scale data processing and how platforms are evolving to support multiomics integration and AI-driven discovery.

Molly Coddington (MC):

The UK Biobank WGS dataset published in Nature is one of the largest of its kind. What makes this dataset uniquely valuable compared to previous large-scale genomic resources?


Rami Mehio (RM):

The UK Biobank WGS dataset stands out as one of the most powerful resources for advancing medical research. Its scale and diversity enable unparalleled opportunities for disease risk prediction, diagnostics and drug target discovery.


Unlike smaller datasets, UK Biobank WGS allows researchers to explore the genetic underpinnings of a wide spectrum of common diseases and uncover complex relationships between genetics, biomarkers and environmental factors. Additionally, the dataset was processed using state-of-the-art variant calling methods, providing high accuracy and consistency – critical for maximizing discovery power and enabling robust comparisons with other biobanks.



MC:

Illumina’s DRAGEN technology identified ~1.5 billion variants – significantly more than previous methods. Could you explain how DRAGEN achieves this increased sensitivity and specificity?


RM:

The version of DRAGEN used to analyze the UK Biobank data – which is closely related to the one that won the PrecisionFDA Truth Challenge V2 in 2020 – represented a major leap forward in variant detection sensitivity.


A key innovation was the introduction of pangenome mapping. This approach allows DRAGEN to accurately align reads in difficult-to-map and highly polymorphic regions of the genome, while reducing ancestry-related reference bias. As a result, DRAGEN identifies tens of thousands more variants per sample compared to traditional pipelines. When scaled to the entire UK Biobank cohort, this translates to approximately 1.5 billion variants.
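
To make the idea concrete, the sketch below (purely illustrative, not DRAGEN's actual algorithm) scores a read against a small set of known haplotypes rather than a single linear reference; the sequences and the simple mismatch-count scoring are invented for demonstration.

```python
# Toy illustration of pangenome-aware read placement (not DRAGEN's algorithm).
# A read carrying a population-specific allele mismatches the linear reference
# but matches an alternate haplotype exactly, so it is placed confidently
# instead of being penalized for "errors" that are really genuine variation.

def hamming(a: str, b: str) -> int:
    """Count mismatches between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 12-bp locus: a linear reference plus two known alternate haplotypes.
linear_ref = "ACGTACGTACGT"
pangenome = {
    "reference": "ACGTACGTACGT",
    "alt_hap_1": "ACGTACCTACGT",   # carries a C allele common in one population
    "alt_hap_2": "ACGAACGTACGT",
}

read = "ACGTACCTACGT"  # sequencing read from a carrier of the alt_hap_1 allele

# Linear-reference view: the read appears to contain a mismatch (possible error).
print("mismatches vs linear reference:", hamming(read, linear_ref))

# Pangenome view: score the read against every known haplotype and keep the best.
best_hap, best_score = min(
    ((name, hamming(read, seq)) for name, seq in pangenome.items()),
    key=lambda item: item[1],
)
print(f"best pangenome match: {best_hap} with {best_score} mismatches")
```

In this toy view, the read that looked error-prone against the linear reference matches an alternate haplotype perfectly, which is the intuition behind reduced reference bias in polymorphic regions.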


It’s worth noting that while this was groundbreaking five years ago, we’ve since made enormous improvements to the product in terms of accuracy, precision and many other critical metrics. This is a great motivator to re-analyze the dataset with DRAGEN 4.4 to uncover even more insights from this resource. Additional details on the pangenome methodology can be found here.



MC:

High-throughput sequencing analysis often involves trade-offs between accuracy, speed and cost. How does DRAGEN balance these factors, particularly at this unprecedented scale?


RM:

DRAGEN is engineered to deliver high accuracy without compromising speed or cost efficiency – even at population scale.


At the sample level, it combines pangenome-based mapping and alignment with machine learning-driven variant calling, accelerated by field-programmable gate array (FPGA) hardware. This architecture achieves state-of-the-art sensitivity and specificity while providing an order-of-magnitude improvement in compute efficiency, cost and speed. For joint calling, DRAGEN aggregates variants and genotypes in a way that preserves per-sample accuracy, regardless of cohort size, with near-linear scalability as sample numbers grow. Post-aggregation, machine learning models score variant quality based on sample-level distributions, ensuring high genotyping rates and consistency across trios and monozygotic twins.
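
The aggregation step can be pictured with a small sketch. The snippet below is a toy illustration of gVCF-style merging, not Illumina's implementation: each sample's calls are folded into a cohort table in a single pass per sample, which is why the cost grows roughly linearly with cohort size. The sample names, positions and genotypes are invented.

```python
# Minimal sketch of gVCF-style aggregation for joint calling (illustrative only,
# not Illumina's implementation). Each sample contributes its own calls; the
# aggregator merges them site by site, so work grows roughly linearly with the
# number of samples rather than with all pairwise combinations.
from collections import defaultdict

# Hypothetical per-sample calls: {sample: {position: genotype}}.
sample_calls = {
    "sample_A": {1001: "0/1", 1042: "1/1"},
    "sample_B": {1001: "0/0", 1077: "0/1"},
    "sample_C": {1042: "0/1", 1077: "0/1"},
}

def aggregate(calls: dict) -> dict:
    """Merge per-sample genotypes into a site-by-sample table in one linear pass."""
    cohort = defaultdict(dict)
    for sample, sites in calls.items():          # one pass per sample
        for pos, genotype in sites.items():
            cohort[pos][sample] = genotype
    # Samples with no record at a site are treated as homozygous reference here;
    # a real joint caller would carry reference-confidence blocks instead.
    all_samples = list(calls)
    for pos in cohort:
        for sample in all_samples:
            cohort[pos].setdefault(sample, "0/0")
    return dict(cohort)

for pos, genotypes in sorted(aggregate(sample_calls).items()):
    print(pos, genotypes)
```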

All of this combines to deliver industry-leading precision alongside the throughput and cost-effectiveness required for projects like UK Biobank.


MC:

One of the promises of WGS is detecting rare and structural variants. What new opportunities does this dataset open up for rare disease research and population-level variant interpretation?


RM:

As the cohort expands from thousands to nearly 500,000 genomes, the number of common variants plateaus. But rare variants, including ultra-rare alleles and singletons, continue to rise.


In the UK Biobank dataset, 47% of variants are singletons, 82% occur at frequencies below 1 in 100,000 and 57% are novel to the dbSNP database. Combined with UK Biobank’s rich phenotypic data, this unprecedented collection enables discovery of gene–phenotype associations that are otherwise missed by common-variant genome-wide association studies.
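
Translated into absolute counts, those percentages look roughly like this (simple arithmetic on the figures quoted above; all numbers approximate):

```python
# Back-of-the-envelope arithmetic on the quoted UK Biobank figures (approximate).
total_variants = 1.5e9          # ~1.5 billion variants in the cohort
genomes = 500_000               # ~500,000 genomes
alleles = 2 * genomes           # ~1,000,000 sampled allele copies at an autosomal site

singletons = 0.47 * total_variants          # 47% seen in exactly one copy
rare = 0.82 * total_variants                # 82% below 1 in 100,000 frequency
rare_copy_cutoff = alleles / 100_000        # frequency 1e-5 is roughly <= 10 copies

print(f"~{singletons / 1e6:.0f} million singletons")
print(f"~{rare / 1e6:.0f} million variants below 1 in 100,000 frequency")
print(f"1 in 100,000 corresponds to roughly {rare_copy_cutoff:.0f} allele copies in the cohort")
```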


Furthermore, population-scale sequencing provides highly accurate allele frequency estimates, which are critical for distinguishing benign from pathogenic variants – an essential step in clinical genomics. These insights underscore the transformative role of large-scale sequencing in advancing rare disease research and improving variant interpretation.
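
A rough illustration of why scale matters for frequency estimates, assuming two sampled allele copies per genome at an autosomal site: the smallest reportable non-zero frequency is one copy among all sampled alleles, and the statistical uncertainty of an estimate shrinks as the cohort grows. The cohort sizes below are chosen only for comparison.

```python
import math

# Rough illustration of allele-frequency resolution at different cohort sizes
# (assumes 2 sampled allele copies per genome at an autosomal site).
for genomes in (1_000, 50_000, 500_000):
    alleles = 2 * genomes
    min_af = 1 / alleles                         # smallest observable non-zero frequency
    af = 1e-4                                    # example frequency of interest
    se = math.sqrt(af * (1 - af) / alleles)      # binomial standard error of the estimate
    print(f"{genomes:>7} genomes: minimum AF {min_af:.1e}, "
          f"standard error at AF=1e-4 is {se:.1e}")
```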



MC:

With such a large dataset, data handling, storage and accessibility become critical. What informatics approaches or infrastructure have been put in place to ensure researchers worldwide can use the resource effectively?


RM:

While the UK Biobank has built a secure, trusted research environment for general research access, we were thrilled to have the opportunity to partner with them to leverage Illumina Connected Analytics (ICA) for its large-scale data processing.


ICA is a modern, cloud-based genomics platform designed for population-scale sequencing and multi-biobank meta-analysis. It provides secure data access and a highly scalable compute environment that supports high concurrency and low-latency job scheduling. The platform includes both DRAGEN secondary analysis workflows and population genomics pipelines, accelerating the transformation of raw genomic data into meaningful insights. For example, the UK Biobank 500K joint-calling project ran on ICA, completing 934,000 analyses over 8.6 million CPU hours in just 86 days.
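
Those throughput figures imply the following averages (back-of-the-envelope arithmetic on the numbers quoted above, nothing more):

```python
# Implied scale of the UK Biobank 500K joint-calling run on ICA, derived from
# the figures quoted above (approximate averages only).
analyses = 934_000
cpu_hours = 8.6e6
days = 86

wall_clock_hours = days * 24
avg_concurrent_cpus = cpu_hours / wall_clock_hours   # average CPUs busy at any moment
cpu_hours_per_analysis = cpu_hours / analyses
analyses_per_day = analyses / days

print(f"~{avg_concurrent_cpus:,.0f} CPUs busy on average")
print(f"~{cpu_hours_per_analysis:.1f} CPU hours per analysis")
print(f"~{analyses_per_day:,.0f} analyses per day")
```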

More recently, ICA has enabled joint analyses across multiple biobanks involving over 1.2 million genomes, demonstrating its scalability and global impact.


MC:

Looking ahead, how might Illumina’s informatics platforms evolve to handle not just bigger datasets, but also multi-omics integration (e.g., transcriptomics, proteomics and epigenomics) alongside genomics?


RM:

Beyond scaling for larger datasets, Illumina’s informatics platforms are expanding to support multiomics analysis through advanced pipelines for quality control, clustering and association testing. This includes the availability of industry-standard tools such as CellBender and Seurat for QC, clustering and differential expression analysis. The platforms also support differential methylation analysis and epigenome-wide association studies, enabling researchers to explore complex regulatory mechanisms.
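
To illustrate the kind of statistical test that sits behind an epigenome-wide association study, the sketch below runs a per-CpG regression of methylation against a phenotype with an age covariate on simulated data. It is a generic example of the method, not the ICA pipeline; the variable names, sample sizes and planted effect are invented.

```python
# Minimal sketch of an epigenome-wide association test (not the ICA pipeline):
# for each CpG site, regress methylation on the phenotype plus covariates and
# record the phenotype coefficient and its p-value. All data are simulated.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_samples, n_sites = 200, 5

phenotype = rng.normal(size=n_samples)                 # e.g. a quantitative trait
age = rng.uniform(40, 70, size=n_samples)              # covariate
methylation = rng.normal(size=(n_samples, n_sites))    # methylation values, simulated
methylation[:, 0] += 0.5 * phenotype                   # plant a true association at site 0

# Design matrix: intercept, phenotype, age.
design = sm.add_constant(np.column_stack([phenotype, age]))

for site in range(n_sites):
    fit = sm.OLS(methylation[:, site], design).fit()
    print(f"site {site}: phenotype effect={fit.params[1]:+.3f}, p={fit.pvalues[1]:.2e}")
```

In a real analysis the same per-site model is fitted across hundreds of thousands of CpG sites with appropriate covariates and multiple-testing correction; the loop structure, however, is the essence of the approach.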


Looking forward, large multiomics datasets will play a critical role in training foundational models in biology. Illumina is investing in AI-driven capabilities to support this evolution, including hosting and developing foundational models as they become available. These advancements position the platform to accelerate discovery across genomics, transcriptomics, epigenomics and beyond.