Venturing Into the Dark Unknown of Metabolomics With Deep Learning

Credit: Greg Rakozy on Unsplash.

The art of pursuit has brought science to one of its greatest challenges yet

A paradigm often preached in life is to be content with what you have right now, in this moment. To refine the hard-wired instinct to pursue. Science is an outlier to that basic paradigm. It knocks it right off the table. The art and beauty of science is the endless quest; the quest for new discoveries, new knowledge and evidence that either propels what we already know one step further or makes us firmly stop in our tracks, take a step back and reassess what we think we know.

This continual pursuit has brought scientists to, arguably, one of the greatest challenges yet in modern discovery: the field of metabolomics.

What is metabolomics?

Metabolomics is the latest addition to the umbrella "omics" field within systems biology, which also encompasses genomics, transcriptomics and proteomics. Significant breakthroughs have been made in these research spaces. Take the Human Genome Project, for example, celebrated as one of the greatest feats of exploration in history. Or the recent advances in technologies such as mass spectrometry that have dramatically accelerated our understanding of protein biology.

In the past few decades, we have made tremendous progress in our knowledge and understanding of the complex systems that underlie human life and the organisms around us. Alas, there is still more to uncover.

A new era of biochemical discovery

Whilst genomics and proteomics provide information on what might happen based on a set of chemical instructions, metabolomics is a whole different ball game. It strives to measure the complete set of metabolites (typically defined as intermediates and products of cellular metabolism <1 kDa in size) within a biological sample. This is the metabolome.1 These small molecules are the final products of the aforementioned omics processes; therefore, changes and interactions within these processes are directly reflected in the metabolome. Instead of what might happen, the metabolome can tell us what is happening. It's a field brimming with potential, ready to radically reform our understanding of biology, with applications extending to modern medicine and pharmacology, environmental sciences and synthetic biology.

Clary Clish, Senior Director of the Metabolomics Platform at the Broad Institute of MIT and Harvard, describes metabolomics as "an objective lens to view the complex nature of how physiology is linked to external events and conditions, as well as measure its response to perturbations such as those associated with disease".2 Note the word complex.  

Molecule Man, An den Treptowers, Berlin.  Credit: Daniel Lonn on Unsplash.

Some of the biggest challenges remaining in the field of metabolomics are attributed to fundamental limits in experimental methodology.1 Thus far, we've uncovered some metabolomic pathways and processes. The Human Metabolome Database contains 114,185 metabolite entries, including both water- and lipid-soluble metabolites, and metabolites that would be regarded as either abundant (> 1 µM) or relatively rare (< 1 nM).3

A sophisticated suite of analytical tools has been developed in this space. The identification of a metabolite from a complex sample, where there can be well over 10,000 unique compounds, classically requires obtaining data on its mass-to-charge value, chromatographic retention time, isotopic pattern and fragmentation data. For pure isolated compounds at sufficiently high concentration, nuclear magnetic resonance and micro-electron diffraction can be used to directly elucidate the molecular structure.

However, scientists believe that we've merely scratched the surface. At present, it's unclear how many unknown metabolites exist, but it's estimated to be a very large number.1 The question that stands, therefore, is how can we possibly identify novel molecules that we technically don't know anything about?

This is the new quest. Or as Zamboni, Saghatelian and Patti describe, we might say that we have entered a fourth era of elucidating biochemical pathways: the "metabolomics era".4

Getting down with deep learning

Here to assist in this quest is deep learning, which, if you're a novice to the field, can sound highly intimidating. Deep learning, in simplistic terms, is a subset of machine learning, whereby artificial neural network algorithms learn from large amounts of data. These algorithms are designed using the human brain as a template, and so a helpful analogy might be to think about how we humans learn a task. We repeat it, and each time we modify the way we perform a task until it is optimized. The same applies for deep learning.

"It's complex from a certain perspective, but from another it is actually relatively simple," Sean Colby, a research scientist at Pacific Northwest National Laboratory (PNNL), tells me. Colby is part of an interdisciplinary team that are applying deep learning strategies to delve into the unknown, deep dark matter of the molecular world.

In deep learning, a number of layers of interconnected units each make small decisions; in aggregate, these decisions are combined and fed through the subsequent layers of the network. The layers of interconnected nodes are stacked one after another, hence the apt name "deep learning".

"We ultimately come up with a framework that can learn very complex relationships, which is the advantage of a deep learning model, as opposed to other, explicit forms of modelling. We essentially set up a blank, template architecture, expose it to data, i.e., things that we do know, and the model assembles itself. Once we have this scaffold in place that is designed to learn what we want it to learn, we simply show it the data, cut it loose, and then it will pull together everything it needs to come up with a solution for us."

Included in the dark matter of the molecular world is, of course, the metabolome.

"In the metabolome, we have an estimated 10^60 potential configurations of molecules less than 500 Da in mass, and many happen to share very similar properties. If we can very accurately measure the mass of one molecule, it does not necessarily mean we know what it is; there could be hundreds or thousands of molecules that have that exact same mass," Colby says.

Despite major advances in instrument resolution and mass accuracy, scientists still cannot perform unambiguous identifications of metabolites because of this overlap.
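
To make the overlap concrete, here is a minimal Python sketch that computes monoisotopic masses for two well-known pairs of isomers. The helper `monoisotopic_mass` and the small element table are illustrative assumptions, not part of any tool mentioned in this article, and the parser only handles simple formulas without brackets or charges.

```python
import re
from collections import defaultdict

# Monoisotopic masses (in Da) of the most abundant isotope of each element.
MONOISOTOPIC_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def monoisotopic_mass(formula):
    """Sum the element masses in a simple formula such as 'C6H12O6'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC_MASS[element] * (int(count) if count else 1)
    return total


# Well-known isomer pairs: different structures, identical formulas.
compounds = {
    "glucose": "C6H12O6",
    "fructose": "C6H12O6",
    "leucine": "C6H13NO2",
    "isoleucine": "C6H13NO2",
}

# Group compound names by mass, rounded to four decimal places.
by_mass = defaultdict(list)
for name, formula in compounds.items():
    by_mass[round(monoisotopic_mass(formula), 4)].append(name)

for mass, names in sorted(by_mass.items()):
    print(f"{mass:.4f} Da -> {names}")
```

Each pair lands on a single mass (roughly 131.095 Da and 180.063 Da), so a perfectly accurate mass measurement alone could not tell the two members of a pair apart.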

And so, DarkChem, a research project funded by PNNL's Deep Learning for Scientific Discovery Agile Investment, was born. A team of scientists led by Dr Ryan Renslow, including Colby, are working on harnessing deep learning capabilities to facilitate the identification of such ambiguous metabolites.

“Right now, we’re just skimming what is potentially knowable and saying goodbye to very interesting data because we can’t identify the vast majority of metabolites that our technology detects. Deep learning is providing a new way to solve the puzzle,” says Dr Tom Metz, Integrative Omics Biomedical Scientist at PNNL.

Leveraging the curse of dimensionality

DarkChem is able to learn a continuous numerical, or latent, representation of molecular structure and characterize it. It focuses on properties that can be obtained via experimental instruments and, once trained, can be used to predict chemical properties directly from structure and to generate novel candidate structures that possess chemical properties similar to an input of choice.

As a first step, the team trained DarkChem to be able to derive and predict collision cross-section (CCS), a chemical property of molecules measured using ion mobility spectrometry.5

CCS is roughly the area around a particle in which another particle may interact or collide with it. This area can change subject to the size and makeup of the two particles involved. In metabolomics, mathematically calculating CCS allows scientists to derive information on various chemical features of the metabolite, aiding its identification.
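
As a rough illustration of how the cross-section depends on the size of both particles, the sketch below uses the textbook hard-sphere approximation, in which the cross-section is pi times the square of the summed radii. This is a simplification chosen for illustration only: real CCS values from ion mobility spectrometry also depend on shape, charge and the buffer gas, and this is not how ISiCLE or DarkChem arrive at their values. The radii are hypothetical.

```python
import math


def hard_sphere_cross_section(radius_a, radius_b):
    """Return pi * (r_a + r_b)^2, the hard-sphere collision cross-section."""
    return math.pi * (radius_a + radius_b) ** 2


# Hypothetical radii in angstroms, for illustration only.
buffer_gas_radius = 1.5
small_ion = hard_sphere_cross_section(3.0, buffer_gas_radius)
large_ion = hard_sphere_cross_section(4.0, buffer_gas_radius)

print(f"small ion: {small_ion:.1f} square angstroms")
print(f"large ion: {large_ion:.1f} square angstroms")
```

Even in this crude picture, changing either particle changes the result, which is why CCS carries structural information that mass alone does not.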

"This allows us to leverage what we call in data science the 'curse of dimensionality', which is that problems get harder as you add dimensions because the space becomes more vast," says Colby. "But, in the case of metabolomics and molecule identification, adding dimensions and this 'curse of dimensionality' works in reverse – we get greater separation between individual metabolites."

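The effect Colby describes can be sketched with a toy candidate search in Python. All names, masses, CCS values and tolerances below are invented for illustration; the point is only that a second measured property narrows the list of candidates that a mass match leaves open.

```python
# Hypothetical candidates: (name, mass in Da, CCS in square angstroms).
candidates = [
    ("candidate_A", 180.0634, 130.2),
    ("candidate_B", 180.0634, 134.9),
    ("candidate_C", 180.0634, 141.5),
    ("candidate_D", 180.0700, 130.4),
]


def matches(observed, candidates, tolerances):
    """Return names whose properties all fall within tolerance of `observed`.

    Only as many properties as `observed` contains are compared, so passing
    one value matches on mass alone and two values on mass plus CCS.
    """
    hits = []
    for name, *properties in candidates:
        if all(abs(o - p) <= t for o, p, t in zip(observed, properties, tolerances)):
            hits.append(name)
    return hits


mass_only = matches((180.0634,), candidates, (0.001,))
mass_and_ccs = matches((180.0634, 134.8), candidates, (0.001, 1.0))

print(mass_only)     # ['candidate_A', 'candidate_B', 'candidate_C']
print(mass_and_ccs)  # ['candidate_B']
```

With mass alone, three candidates remain; adding the CCS dimension leaves one.
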
Why is it necessary to teach a deep learning network to predict CCS? Well, sometimes chemical properties such as CCS are difficult to measure experimentally, whether that be due to the fact the compounds are not available, or they're difficult to synthesize.

In this situation, PNNL scientists have traditionally adopted a quantum chemistry-based framework known as ISiCLE, the In Silico Chemical Library Engine, to predict such chemical properties. Unfortunately, this system came with its own set of limitations, including time-consuming and laborious calculations. And so, the researchers applied deep learning using DarkChem, and found that they were able to produce results with the same level of accuracy in a fraction of the time.

“We trained DarkChem in three steps to maximize our training data. First, we exposed the network to ~53 million molecules from PubChem – no CCS yet – to broadly learn chemical structure. Next, we trained on ~700 thousand molecules with CCS computed with ISiCLE. The final step involved ~700 molecules with experimental CCS. This allowed the network to learn as much as possible at each step, enabling success with progressively smaller data sets without overfitting,” Colby says.
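
The general idea of carrying what a model has learned from a large, lower-fidelity data set into a small, higher-fidelity one can be sketched with a deliberately tiny example. This is a toy under stated assumptions, not DarkChem's architecture or training code: the model is a straight line rather than a neural network, the data are synthetic, and the first, label-free stage Colby describes has no counterpart here. All names and numbers are hypothetical.

```python
import random


def fit(params, data, learning_rate, epochs):
    """Continue gradient descent on y = slope * x + intercept from `params`."""
    slope, intercept = params
    for _ in range(epochs):
        grad_slope = grad_intercept = 0.0
        for x, y in data:
            residual = slope * x + intercept - y
            grad_slope += residual * x
            grad_intercept += residual
        slope -= learning_rate * grad_slope / len(data)
        intercept -= learning_rate * grad_intercept / len(data)
    return slope, intercept


def make_data(rng, n, slope, intercept, noise):
    """Draw n noisy points from a line; a stand-in for property measurements."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, slope * x + intercept + rng.gauss(0.0, noise)))
    return data


rng = random.Random(0)
TRUE_SLOPE, TRUE_INTERCEPT = 2.0, 1.0  # hypothetical "experimental" relationship

# Many cheap labels with a systematic bias (standing in for computed values).
computed = make_data(rng, 2000, 1.8, 0.9, 0.05)
# Few accurate labels (standing in for experimental values).
experimental = make_data(rng, 20, TRUE_SLOPE, TRUE_INTERCEPT, 0.02)


def distance_from_truth(params):
    return abs(params[0] - TRUE_SLOPE) + abs(params[1] - TRUE_INTERCEPT)


# Stage 1: learn from the large set. Stage 2: start from those parameters
# and refine on the small set instead of starting from scratch.
pretrained = fit((0.0, 0.0), computed, 0.5, 200)
fine_tuned = fit(pretrained, experimental, 0.5, 200)

print(f"after large, biased set:  {distance_from_truth(pretrained):.3f}")
print(f"after small, accurate set: {distance_from_truth(fine_tuned):.3f}")
```

The second stage starts from the parameters the first stage found, so the small data set only has to correct the remaining bias rather than teach the model everything.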

Deciphering the structure of unknown molecules

DarkChem is clever in that it can rock back and forth between solving a molecule's CCS and other chemical properties, and generating new chemical structures based on the properties that the user is looking for – hence, delving into the deep unknown of dark molecular matter.

Renslow's team have used the network to suggest novel chemical structures that have the potential to influence the NMDA receptor, a glutamate receptor implicated in various aspects of brain function, and a target for certain therapeutics.

I asked Colby if these novel chemical structures could translate to new therapeutics: "100%. That was largely the focus of applying DarkChem here. Current drugs that target the NMDA receptor, for example ketamine, often have negative associated side effects. So, the idea would be to come up with a ketamine-like compound or analogue that has the same therapeutic benefits, but without the negative aspects."

This ability to calculate chemical properties for unknown molecules has a myriad of potential applications and supports the quest towards the fourth era of elucidating biochemical pathways: the "metabolomics era". The team continues to scope out molecular features that they can teach DarkChem to analyze, and to advance deeper into the dark depths of unknown molecular matter.

Sean Colby was speaking to Molly Campbell, Science Writer, Technology Networks.


1. Riekeberg, E., & Powers, R. (2017). New frontiers in metabolomics: from measurement to insight. F1000Research, 6, 1148. https://doi.org/10.12688/f1000research.11495.1

2. Clish, C. B. (2015). Metabolomics: an emerging but powerful tool for precision medicine. Cold Spring Harbor Molecular Case Studies, 1(1), a000588. https://doi.org/10.1101/mcs.a000588

3. Wishart, D. S., Tzur, D., Knox, C., et al. (2007). HMDB: the Human Metabolome Database. Nucleic Acids Res., 35, D521-6. PMID: 17202168.

4. Zamboni, N., Saghatelian, A., & Patti, G. J. (2015). Defining the metabolome: size, flux, and regulation. Mol Cell, 58(4), 699-706. https://doi.org/10.1016/j.molcel.2015.04.021

5. Colby, Nuñez, Hodas, Corley and Renslow. (2020). Deep Learning to Generate in Silico Chemical Property Libraries and Candidate Molecules for Small Molecule Identification in Complex Samples. Analytical Chemistry, 92(2), 1720-1729. https://doi.org/10.1021/acs.analchem.9b02348