Second “Code of Life” Cracked by AI
Advances in genetics and artificial intelligence have helped scientists to crack the second “code of life” – gene regulation.
Understanding how genes are regulated
Our DNA code provides the “blueprint” for life, enabling our cellular machinery to produce proteins that carry out essential molecular functions. While each cell possesses the same DNA code, the regulation of specific genes within that cell contributes to its unique function. Genes need to be “switched on” or “switched off”, a process that is coordinated by a number of factors including so-called “enhancers”.
Enhancers are regulatory DNA sequences that interact with a gene's promoter region (where transcription begins) to influence gene expression. Sometimes these enhancers are in close proximity to the promoter region, but in other instances, they are far away.
Enhancers essentially form the “second code of life”, as they orchestrate how our genes are regulated, but studying them has not been easy. Despite being discovered in the 1980s, our understanding of how enhancers operate has only recently started to bloom thanks to advances in genomics and artificial intelligence (AI).
The laboratory of Dr. Alexander Stark, a senior group leader at the Research Institute of Molecular Pathology (IMP), Vienna, has used these tools in its mission to crack the second code of life. In a paper published in Nature, the team reports success in achieving three key goals: predicting the activity of enhancers from their DNA sequence, predicting the consequences of mutations in enhancers, and designing synthetic enhancers from scratch.
Researchers can now read, write and understand the second code of life
Stark and colleagues developed a model that combines deep learning with transfer learning, which they trained using data from a commonly used laboratory species – the fruit fly, Drosophila melanogaster.
The model was trained on genome-wide DNA sequences and corresponding DNA accessibility data. This process was used to fine-tune a second model that could directly link DNA sequences to the activities of specific enhancers. The models could also predict enhancer activity across the central nervous system, a sub-section of the brain, the epidermis, gut and muscle tissue in the fruit fly. Stark and team then tested 40 computationally designed synthetic enhancers in living fruit flies, discovering that the enhancers were active and capable of driving gene expression in specific tissues.
Technology Networks had the pleasure of interviewing Stark about this new work, which he says marks the peak of his career since starting his lab in 2008.
Molly Campbell (MC): The existence of enhancers has been recognized since the early 1980s. Can you talk about any key research works that have been published since then that have helped to lay the foundations for your new study?
Alexander Stark (AS): Key work that enabled this study has been on three fronts:
- The extensive characterization of enhancer function across tissues in living organisms, a study we performed about 10 years ago.
- The realization that enhancers have characteristic chromatin signatures, as well as the ability to measure these signatures across entire genomes by methods based on next-generation sequencing. For the particular property of DNA accessibility, the development of ATAC-seq and single-cell ATAC-seq (Buenrostro et al., Nature Methods 2013; Nature 2015) has been essential.
- The amazing progress in the field of AI and deep learning, which includes the application of deep learning to genomics (e.g. DNA accessibility and transcription factor binding datasets) and to the prediction of enhancer activity in a single defined cell type.
The latter has been an important proof-of-concept for this work.
MC: Can you talk about the key factors that have hampered the prediction and de novo design of enhancers with tissue-specific activities?
AS: Two things needed to come together to make this achievement possible now: sufficiently large training data and sufficiently powerful computational approaches.
Both have only become available in the past few years: genome-wide next-generation-sequencing-based methods to determine chromatin properties of enhancers, together with extensive functional enhancer testing, provide the data, and convolutional neural networks (CNNs) can be trained on this data to model enhancer properties and activities directly from the DNA sequence.
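To make "directly from the DNA sequence" concrete, here is a minimal sketch of the kind of operation a CNN's first layer performs: the sequence is one-hot encoded and a small filter is slid along it, scoring each window. This is an illustration only, not the model from the paper. The helper names (one_hot, scan), the toy sequence and the hand-written GATA filter are assumptions made for the example; a real CNN learns its filters from the training data.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) array; unknown bases (e.g. N) stay all zeros."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded

def scan(encoded, kernel):
    """Slide a (width, 4) filter along the sequence and return one score per window."""
    width = kernel.shape[0]
    n_windows = encoded.shape[0] - width + 1
    return np.array(
        [np.sum(encoded[i:i + width] * kernel) for i in range(n_windows)]
    )

# Hypothetical hand-written filter that matches the four bases "GATA" exactly.
motif_kernel = one_hot("GATA")

# Toy sequence, made up for this example.
sequence = "TTGATAACC"
scores = scan(one_hot(sequence), motif_kernel)
best = int(np.argmax(scores))
print("best window starts at", best, "with score", scores[best])
```

Each score counts how many positions in a window agree with the filter, so the highest score marks the best match. In a trained network, many such filters are learned at once and their outputs are combined by deeper layers.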
MC: Can you explain how you used deep learning and transfer learning to design tissue-specific enhancers in the fruit fly embryo?
AS: CNNs are very powerful AI tools that can learn complex tasks from raw data, including the de novo discovery of patterns relevant for the learning task. However, to get to this power, we require very large datasets for training such “deep learning” models. Transfer learning is a stepping stone approach: when sufficiently large training datasets are not available for the target task, but do exist for a related task, transfer learning allows us to re-use knowledge between tasks.
You first train a model for the related task (in our case, DNA accessibility as measured by scATAC-seq) and then adjust or fine-tune the model for the target task. In our case, the second transfer-learning step adjusted the model from predicting accessibility to predicting activity.
A classic example of transfer learning is the following: imagine you want to train a model to recognize cats in pictures, but you have only relatively few cat pictures available. You could first train a model on dog pictures (of which you have plenty) and then adjust the model in a second step to recognize cats.
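The two-step recipe described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the study's method: the data are random numbers standing in for sequence features, a small one-hidden-layer network written in NumPy replaces the CNN, and the names (train, X_source, X_target and so on) are invented for the example. Step 1 trains the whole network on a large "accessibility-like" dataset. Step 2 keeps the learned hidden layer frozen and fits only a new output layer on a small "activity-like" dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(p, y):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def train(X, y, W1, w2, steps, lr, freeze_hidden=False):
    """Full-batch gradient descent on a one-hidden-layer network with a logistic output."""
    W1, w2 = W1.copy(), w2.copy()
    for _ in range(steps):
        H = np.tanh(X @ W1)              # hidden features, shape (n, hidden)
        p = sigmoid(H @ w2)              # predicted probabilities, shape (n,)
        err = (p - y) / len(y)           # gradient of the loss w.r.t. the logits
        grad_w2 = H.T @ err
        if not freeze_hidden:
            grad_H = np.outer(err, w2) * (1 - H ** 2)
            W1 -= lr * (X.T @ grad_H)    # update the shared feature layer
        w2 -= lr * grad_w2               # update the output layer
    return W1, w2

n_features, n_hidden = 10, 8
true_direction = rng.normal(size=n_features)   # hidden rule behind both toy tasks

# Large dataset for the related task (stand-in for accessibility).
X_source = rng.normal(size=(2000, n_features))
y_source = (X_source @ true_direction > 0).astype(float)

# Small dataset for the target task (stand-in for enhancer activity).
X_target = rng.normal(size=(40, n_features))
y_target = (X_target @ true_direction > 0.5).astype(float)

# Step 1: train the full network on the related task.
W1_init = rng.normal(scale=0.1, size=(n_features, n_hidden))
W1_pre, w2_pre = train(X_source, y_source, W1_init, np.zeros(n_hidden),
                       steps=500, lr=0.5)

# Step 2: reuse the hidden layer as-is and fit a fresh output layer on the target task.
new_head = np.zeros(n_hidden)
loss_before = log_loss(sigmoid(np.tanh(X_target @ W1_pre) @ new_head), y_target)
W1_ft, w2_ft = train(X_target, y_target, W1_pre, new_head,
                     steps=200, lr=0.1, freeze_hidden=True)
loss_after = log_loss(sigmoid(np.tanh(X_target @ W1_ft) @ w2_ft), y_target)

print("target loss before fine-tuning:", loss_before)
print("target loss after fine-tuning: ", loss_after)
```

Freezing the hidden layer is one common design choice when the target dataset is very small; another is to let all layers continue training at a low learning rate. Which variant the authors used is not stated in the interview.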
MC: Why would we want to be able to design synthetic enhancers – what are the implications of being able to do this?
AS: The design of enhancers is a demonstration that the AI models we built to predict enhancers are sufficiently powerful that they can generate synthetic enhancers from scratch. This also validates the models’ rules and advances our understanding in the sense of synthetic biology – as Richard Feynman once said, “what I cannot create, I do not understand”. Such validated models will be game changers for the interpretation of regulatory mutations in the non-coding space of our genomes, the “dark matter” of our DNA.
What’s more, designing enhancers can ultimately allow the precise expression of genes for therapy or diagnostics, for example through enhancers designed to activate only in certain cell types or certain cell states, such as when cells turn cancerous.
MC: Can you discuss the key challenges that you encountered in your study?
AS: One challenge has been the availability of training data, especially the number of functionally characterized enhancers for different tissues, which limited the choice of tissues to rather broad tissue categories. Apart from that, an additional challenge has been the fact that enhancers for some tissues – such as the gut – seem to have more complex rules than, for example, muscle.
This meant that models for muscle were ultimately more successful than expected, while the models for the gut were not, presumably because sequence motifs for GATA transcription factors that are important for gut enhancers are also widely used in other tissues.
MC: Your research demonstrates the feasibility of targeted design of synthetic enhancers for selected tissues. What are your next steps?
AS: With this first demonstration that the gene-regulatory code can be modeled, we want to go in several important directions: we would like to model enhancers for tissue subtypes and individual cell types, including cell-type transitions during development.
We will also model and build synthetic enhancers for cell types and states important during vertebrate regeneration and diseases such as cancer. I am convinced that the diagnostic and therapeutic opportunities that arise from being able to predict the consequences of mutations in non-coding regulatory sequence or being able to direct gene expression precisely to certain cells and not others will be revolutionary.
Dr. Alexander Stark was speaking to Molly Campbell, Senior Science Writer for Technology Networks.