DNA methylation affects how and when genes are expressed. Hence, abnormal methylation can be linked to the development and progression of various diseases.
However, analysis of methylation data can be challenging. Many traditional tools have complex workflows and lack scalability, meaning that standard analyses require large-scale computing infrastructure.
This poster highlights a bioinformatics workflow for the analysis of genomic and methylome data that provides access to unprecedented insights into current and future biological states.
Download this poster to discover a software package that:
- Offers an extensive range of visualizations and analyses of methylation profiles.
- Provides a user-friendly experience with a simple and intuitive interface.
- Can analyze multiomic data.
Unlocking scalable and efficient multiomic analysis of 5- and 6-base genomes
Analysing methylation data is challenging: many existing analysis tools are difficult to work with and do not scale well as the number of samples increases. This lack of scalability means that standard analyses, such as identifying differentially methylated regions (DMRs) or summarising methylation fractions over genomic regions, require substantial time and memory, typically necessitating large-scale compute infrastructure (e.g. compute clusters or cloud).
duet multiomics solution evoC is a new sequencing technology that simultaneously derives all four genetic bases without ambiguity in C or T calls, while also distinguishing 5-methylcytosine and 5-hydroxymethylcytosine (6-base data), in a single read from a single DNA molecule [1] (Figure 1). The technology consists of a pre-sequencing library preparation and a post-sequencing analysis pipeline, providing single-base resolution of genetics and epigenetics at high accuracy.
This expansion of the biological signal that can be generated in a single sequencing experiment further increases the scale and complexity of downstream analysis, necessitating the development of more efficient analysis software. To address this challenge, we present modality, a fast and scalable array-based Python package for the analysis of 5- and 6-base genomes (genetics, 5-mC and 5-hmC).
1. Introduction
Nicholas Harding, Michael Wilson, Jean Teyssandier, David Currie, Casper Lumby, William Stark, Mark S. Hill, Páidí Creed
DNA sample + Pre-sequencing lab protocol + NGS sequencing + Bioinformatic pipeline = Insight
1 - Strand synthesis - creates a single molecule with a direct copy of the original information, tethered together with a hairpin. The copy strand initially has no cytosine modifications, but importantly, a high-fidelity methyltransferase is used to specifically copy 5mC from the original to the copy strand.
2 - Paired-end read sequencing - generates sequence information after protection of cytosine modifications followed by deamination of all remaining cytosines to uracils, read as thymine in SBS.
3 - Read resolution - aligns original and copy strands to correctly call all 4 canonical bases in addition to 5mC and 5hmC.
4 - Aligned (4-base) reads with 5mC and 5hmC are tagged (6-base information).
Figure 1 | duet multiomics solution evoC is a 6-base calling technology that reads all four canonical bases plus 5mC and 5hmC.
2. duet multiomics solution evoC
3. modality core concepts
Parallel processing and memory efficiency: modality leverages the parallel processing capabilities of dask for efficient computation.
Multi-dimensional data interaction: modality enables seamless interaction with datasets across multiple dimensions using xarray.
zarr storage backend: modality interacts with zarr, leveraging chunked and compressed sets of n-dimensional arrays, allowing analyses to efficiently scale to many samples (>100) even with very limited RAM.
Figure 3 | Performance benchmarking of modality. One common operation that provides useful insight into the relationship between samples, in terms of their methylation profiles, is computing a pairwise Pearson correlation matrix (see Figure 4). Because this operation is computationally expensive, we used it to benchmark modality against an existing tool, methylKit. We used a machine with 16 GB of RAM, to match what might be available on a typical laptop, and computed the matrices for varying numbers of samples (1 to 110) using both modality and methylKit. We were not able to generate the matrices using methylKit for datasets of more than 10 samples: these runs caused an out-of-memory error and crashed. In contrast, modality generated the matrix for 110 samples using <1 GB of RAM in ~6 minutes.
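One way to see why chunked arrays keep memory low for this operation: a Pearson matrix can be accumulated from running sums over chunks of sites, so only one chunk needs to be held in memory at a time. The sketch below is a minimal pure-Python illustration of that idea, not modality's implementation. The function `chunked_pearson` and the toy data are hypothetical, and the naive running sums are chosen for brevity rather than numerical robustness.

```python
import math
from typing import Iterable, List, Sequence


def chunked_pearson(
    chunks: Iterable[Sequence[Sequence[float]]], n_samples: int
) -> List[List[float]]:
    """Pairwise Pearson correlation between samples, accumulated chunk by chunk.

    Each chunk is a list of rows (one row per site); each row holds one
    methylation fraction per sample. Only running sums are kept in memory.
    Assumes every sample has non-zero variance across sites.
    """
    n = 0
    sums = [0.0] * n_samples                                   # sum of x_i
    prods = [[0.0] * n_samples for _ in range(n_samples)]      # sum of x_i * x_j
    for chunk in chunks:
        for row in chunk:
            n += 1
            for i in range(n_samples):
                sums[i] += row[i]
                for j in range(i, n_samples):
                    prods[i][j] += row[i] * row[j]

    result = [[1.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        var_i = prods[i][i] - sums[i] ** 2 / n
        for j in range(i + 1, n_samples):
            var_j = prods[j][j] - sums[j] ** 2 / n
            cov = prods[i][j] - sums[i] * sums[j] / n
            result[i][j] = result[j][i] = cov / math.sqrt(var_i * var_j)
    return result


# Toy data: 3 sites x 3 samples, delivered in two chunks.
chunks = [
    [[0.1, 0.2, 0.9], [0.5, 0.6, 0.5]],
    [[0.9, 1.0, 0.1]],
]
corr = chunked_pearson(chunks, n_samples=3)
```

In this toy input the second sample tracks the first exactly and the third mirrors it, so the off-diagonal entries come out at approximately 1 and -1. Peak memory depends on the chunk size and the number of samples, not on the number of sites.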
Figure 2 | A view of the modality ContigDataset. The core data structure used by modality is the ContigDataset. It contains arrays that represent methylation counts, as well as accompanying arrays that encode the coordinates. The object also provides a set of easy-to-use and efficient methods for working with them. Each array is a chunked Dask array, allowing extremely efficient computation.
4. modality in action
modality is built around three core Python packages: zarr, xarray, and dask. These are powerful modern data science packages which collectively allow modality to deal with larger-than-memory data arrays in an expressive and parallelised fashion. This foundation means that analyses that would previously have required long run times and extensive compute infrastructure can now run quickly on a laptop, speeding up iterative data analysis.
6. References
As well as focussing on performance, modality was built to provide a user-friendly experience with a simple and intuitive API.
For example, calculating pairwise Pearson correlations between samples for a given variable, and plotting the matrix, can be done with the following lines of code:
dataset.plot_pearson_matrix(
    numerator="num_modc",
    denominator="num_total_c",
    min_coverage=10,
)
See Figure 4 for result.
Figure 4 | Pearson matrix plot for Genome in a Bottle data. This plot is the output of the code block above. It was produced using a duet +modC dataset of Genome in a Bottle (GIAB) samples and highlights the similarity between samples in terms of the fraction of modC calls relative to total C calls. This dataset is distributed with the modality package.
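The arguments in the `plot_pearson_matrix` call suggest that the correlated quantity is the fraction `num_modc / num_total_c`, with low-coverage sites excluded. The sketch below shows one plausible reading of that calculation for a single sample. That the `min_coverage` filter is applied per site to the denominator is an assumption here, not confirmed behaviour of modality, and `methylation_fraction` is a hypothetical helper.

```python
from typing import List, Optional, Sequence


def methylation_fraction(
    num_modc: Sequence[int],
    num_total_c: Sequence[int],
    min_coverage: int = 10,
) -> List[Optional[float]]:
    """Per-site modC fraction, masked (None) where coverage is too low.

    Assumption: a site is kept only if its total C count reaches min_coverage.
    """
    fractions: List[Optional[float]] = []
    for mod, total in zip(num_modc, num_total_c):
        if total == 0 or total < min_coverage:
            fractions.append(None)   # too few reads to trust the fraction
        else:
            fractions.append(mod / total)
    return fractions


# Toy counts for four sites in one sample.
num_modc = [8, 3, 0, 15]
num_total_c = [10, 4, 12, 20]
fractions = methylation_fraction(num_modc, num_total_c, min_coverage=10)
```

Here the second site has only 4 reads and is masked, while the other three yield fractions of 0.8, 0.0 and 0.75. Masked sites would then be left out of any correlation between samples.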
Figure 5 | Extensive array of visualisations and analyses for 5- and 6-base genomes. modality offers an extensive range of visualisations and analyses to help understand genome-wide methylation profiles. A, 6-base ternary plot showing density of data over bins of C:mC:hmC methylation fractions in mouse Ese14 cells. B, Methylation profile plots over genomic regions. C, Tile plot from differentially methylated region (DMR) analysis, showing a genomic region with consistently different methylation fractions between two sets of samples.
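Summarising methylation over genomic regions, the quantity behind profile plots such as panel B, amounts to pooling counts from all sites that fall inside each region. The sketch below illustrates this with the standard-library `bisect` module on sorted site positions. The helper `region_fraction`, the half-open interval convention and the toy data are assumptions for illustration, not part of modality.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def region_fraction(
    positions: Sequence[int],
    num_modc: Sequence[int],
    num_total_c: Sequence[int],
    start: int,
    end: int,
) -> Optional[float]:
    """Pooled methylation fraction over the half-open interval [start, end).

    positions must be sorted; returns None when the region has no coverage.
    """
    lo = bisect_left(positions, start)   # first site with position >= start
    hi = bisect_left(positions, end)     # first site with position >= end
    mod = sum(num_modc[lo:hi])
    total = sum(num_total_c[lo:hi])
    return mod / total if total else None


# Toy data: five sites on one contig.
positions = [100, 150, 220, 400, 410]
num_modc = [8, 6, 1, 0, 2]
num_total_c = [10, 10, 10, 5, 5]

regions = [(100, 200), (200, 300), (300, 500), (500, 600)]
fractions = [
    region_fraction(positions, num_modc, num_total_c, start, end)
    for start, end in regions
]
```

Pooling counts before dividing weights each site by its coverage, which differs from averaging per-site fractions. Which of the two a given analysis should use is a design choice.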
[1] Simultaneous sequencing of genetic and epigenetic bases in DNA, Füllgrabe and Gosal et al., Nature Biotechnology (2023) (duet multiomics solution technology paper)
5. Conclusion
To address the difficulties of analysing methylation data, we present modality, an efficient and scalable analysis package for 5- and 6-base genomes.
The package is built on a core set of performant data science libraries and roots the user in a powerful ecosystem for data analysis in Python.
modality has an intuitive API providing powerful analysis and visualisation methods. Moving forward, the underlying data structure used by modality is very extensible, allowing other data modalities to be incorporated and analysed alongside the methylation data modalities shown here.
Efficient and rapid tooling for analysis
Rich set of data visualisations