Automating Drug Discovery With Machine Learning
The traditional path of drug development is lengthy and expensive, and it suffers from high failure rates: scientists test millions of molecules, but only a handful progress to preclinical or clinical testing.
Embracing innovation, particularly automated technologies, is essential to reduce the complexity associated with drug discovery and circumvent the high cost and time spent bringing a medicine to market.
The rise of automation in drug discovery
Incorporating automation can make the hunt for drugs cheaper, more effective and less time-consuming. The past few decades have witnessed a dramatic growth in the use of novel approaches and technologies in drug discovery.
The availability of massive data sets and advanced algorithms has driven more interest and major improvements in the use of artificial intelligence (AI) in the field. AI can provide substantial improvements at many stages of drug development, reducing the time from target identification to clinical trials.
Machine learning (ML), a subset of AI, is a rapidly evolving field and is increasingly being harnessed by many pharmaceutical companies. Integrating ML approaches into the drug development process can help to automate repetitive data processing and analysis tasks.
Machine learning – Making data-driven decisions
ML solutions are based on big data modeling and analysis. The data can originate from diverse sources (e.g., data repositories, in-house experiments and publications) and can vary in format, which makes aggregating, storing and preparing the data for analysis challenging, albeit necessary.
ML trains a system to make inferences and decisions autonomously, without any external support. The system learns and improves from past experience: it learns from the data it has been provided and deciphers the patterns contained within them. Then, through pattern recognition and analysis, the system delivers the “outcome”, which may be a prediction or a classification.
ML tasks fall broadly into three categories: supervised learning, unsupervised learning, and sequential learning. Data in ML can be of two types: labeled and unlabeled.
Supervised learning relies on a labeled dataset that acts as a trainer, teaching the model or the machine. Once trained, the model can begin making predictions and decisions as new data is received. Deep learning and support vector machines, commonly used in biological settings, fall under supervised learning. Deep learning uses artificial neural networks (ANNs) to identify highly complex patterns in large datasets.
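As a minimal sketch of the supervised workflow described above, the following Python example trains on a small labeled dataset and then classifies samples it has not seen. The descriptor values, labels and function names are invented for illustration, and the nearest-centroid rule is a deliberately simple stand-in for the deep learning and support vector machine models mentioned above.

```python
from math import dist

def train_centroids(samples, labels):
    """Learn one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid lies closest to the new sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labeled training data: two made-up descriptors per compound.
train_x = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
train_y = ["active", "active", "inactive", "inactive"]

model = train_centroids(train_x, train_y)

# New, unlabeled samples are classified using what was learned from the labels.
print(predict(model, [0.85, 0.15]))
print(predict(model, [0.15, 0.95]))
```

The labels do the teaching here: without them, the model would have no way to tell which region of descriptor space corresponds to which class.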
Unsupervised learning identifies the relationships or patterns in unlabeled data. The model learns independently through observation and creates clusters of the observed patterns and relationships in the dataset.
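Clustering of unlabeled data can be illustrated with k-means, one common unsupervised algorithm (the article does not name a specific method, so this choice is an assumption). The data points and starting centers below are made up, and the sketch omits refinements such as random restarts.

```python
from math import dist

def kmeans(points, initial_centers, iterations=10):
    """Plain k-means: alternate between assigning points and updating centers."""
    centers = [list(c) for c in initial_centers]  # work on a copy
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest center for every point.
        assignment = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its assigned points.
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical unlabeled measurements forming two visible groups.
points = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]

clusters, centers = kmeans(points, initial_centers=[points[0], points[-1]])
print(clusters)
print(centers)
```

No labels are supplied at any point; the grouping emerges only from the distances between the observations.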
Sequential learning, more commonly known as reinforcement learning, allows an agent (a goal-oriented entity) to learn in an interactive environment using feedback from its own actions and experiences. It relies on trial and error to make a sequence of decisions.
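The trial-and-error idea can be sketched with an epsilon-greedy agent on a multi-armed bandit, one of the simplest feedback-driven settings. This is a reduced illustration with no environment state, not a full agent. The success probabilities, function names and parameter values are all assumptions chosen for the example.

```python
import random

def epsilon_greedy(pull, n_arms, steps=500, epsilon=0.1, seed=0):
    """Learn the value of each action from the rewards it produces.

    pull(arm) is the environment: it returns the reward for the chosen action.
    """
    rng = random.Random(seed)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running mean reward per action
    for step in range(steps):
        if step < n_arms:
            arm = step  # try every action once before trusting the estimates
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random action
        else:
            arm = max(range(n_arms), key=values.__getitem__)  # exploit
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return values, counts

# Hypothetical environment: each action succeeds with a fixed, hidden probability.
success_prob = [0.2, 0.5, 0.8]
env_rng = random.Random(42)

def pull(arm):
    return 1.0 if env_rng.random() < success_prob[arm] else 0.0

values, counts = epsilon_greedy(pull, n_arms=3)
print("estimated values:", [round(v, 2) for v in values])
print("times each action was tried:", counts)
```

The agent is never told which action is best. It has to balance exploiting its current estimates against exploring actions it has tried less often.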
A surge in machine learning approaches for drug discovery
ML approaches can be applied at several steps during early drug discovery to:
- Predict target structure
- Identify and optimize “hits”
- Explore the biological activity of new ligands
- Design models that predict the pharmacokinetic and toxicological properties of the drug candidates
The subsequent sections will highlight examples of how ML can be used for drug repurposing and to discover novel antibiotics. The application of ML strategies to enhance image-based profiling and accelerate drug discovery will also be discussed.
Predicting drug-induced changes in gene expression using deep learning
DeepCE, a novel deep-learning computer model developed by researchers at the Ohio State University, helps to predict correlations between gene expression and drug response. Using the model, the team has identified ten drug repurposing candidates for COVID-19. Two drugs (cyclosporine and anidulafungin) have received regulatory approval; the remaining eight are currently investigational and are being tested in different indications.
DeepCE relies on two primary sources of publicly available data: L1000 and DrugBank.
- L1000 is a National Institutes of Health-funded data repository that provides “drug signatures” for drug discovery projects. A drug signature is defined as the gene expression changes that occur within a cell when it is exposed to a drug. The L1000 dataset currently contains over one million gene expression profiles of human cell lines perturbed chemically with small drug molecules. The cell lines represent organ tissues, such as kidney and lung.
- DrugBank contains information on the chemical structures and properties of approximately 11,000 approved and investigational drugs.
By comparing the L1000 data with the drug compounds contained within DrugBank, the researchers could predict the effect of a drug on different cell lines and different genes. However, the team at Ohio State University faced a key challenge: the drug signatures within L1000 are not complete and cover just a tiny fraction of potential compounds. For compounds not represented in L1000, the team used a deep learning approach. The DeepCE model was trained by running the entire L1000 dataset through an algorithm against specific chemical compounds and their dosages.
"We developed a deep learning model, DeepCE, using a graph neural net (converts each compound's chemical structure to a set of vectors, each representing an atom's local substructure), a multi-head attention net (captures drug–gene interactions and gene–gene interactions) and several feedforward nets to translate chemical, gene and drug information into a drug-induced gene expression profile. Thus, we were able to compare the predicted gene expression profiles for all 11,179 drugs in DrugBank with the gene expression profiles of COVID-19 patients and selected compounds with the most negative correlations," explained Ping Zhang.
As per Zhang, "This method enables timely drug repurposing for unknown diseases such as COVID-19, which is useful in the current coronavirus pandemic and in the event of future public health emergencies. Based on the theory of bridging drugs and diseases, once we have disease signatures from patients infected with SARS-CoV-2 variants, we can quickly re-rank our predictions for more accurate recommendations for new patient cohorts."
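The final selection step Zhang describes, choosing compounds whose predicted expression profiles correlate most negatively with the disease signature, can be sketched as follows. This is not the DeepCE code. The drug names and expression values are invented, and Pearson correlation is assumed as the correlation measure because the quote does not specify one.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def rank_by_reversal(disease_signature, predicted_profiles):
    """Sort drugs so the most negatively correlated profile comes first."""
    scores = {
        drug: pearson(profile, disease_signature)
        for drug, profile in predicted_profiles.items()
    }
    return sorted(scores.items(), key=lambda item: item[1])

# Hypothetical disease signature: expression change of four genes in patients.
disease = [2.0, -1.0, 1.5, -0.5]

# Hypothetical model outputs: predicted drug-induced change for the same genes.
predicted = {
    "drug_A": [-2.0, 1.0, -1.5, 0.5],   # mirrors the disease signature
    "drug_B": [1.0, -0.5, 0.75, -0.25],  # mimics the disease signature
    "drug_C": [0.5, 0.5, -0.5, -0.5],    # unrelated pattern
}

for drug, score in rank_by_reversal(disease, predicted):
    print(drug, round(score, 3))
```

A strongly negative score suggests the drug pushes gene expression in the opposite direction to the disease, which is the rationale for prioritizing it as a repurposing candidate.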
Expanding the antibiotic armamentarium with deep learning approaches
The rapid emergence of antibiotic-resistant bacteria is a matter of global concern. As a result, there is an urgent need to discover new antibiotics. Experts predict that, if no action is taken, drug-resistant diseases could be responsible for 10 million deaths each year by 2050.
To address this challenge, a team of researchers at the Massachusetts Institute of Technology (MIT) trained a deep neural network capable of predicting molecules with antibacterial activity. By performing predictions on multiple chemical libraries, the researchers discovered a novel antibiotic, which they named halicin.
Halicin differs structurally from conventional antibiotics and displays bactericidal activity against a broad phylogenetic spectrum of pathogens, including Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae.
Jonathan Stokes, a Banting Fellow at the Broad Institute of MIT and Harvard, is the lead author of the study, which was recently published in Cell. Stokes elaborated on how they identified halicin: "We trained a deep learning model on a collection of ~2,500 molecules for those that inhibited the growth of E. coli in vitro. This model learned the relationship between chemical structure and antibacterial activity in a manner that allowed us to show the model sets of chemicals it had never seen before and it could then make predictions about whether these new molecules possessed antibacterial activity against E. coli or not."
Once trained, the model was tested on the Broad Institute's Drug Repurposing Hub, a library of ~6,000 compounds. From the library, the model selected one molecule, halicin, which was predicted to have strong antibacterial activity. Halicin, a drug originally investigated as a treatment for diabetes, was tested on dozens of bacterial strains and was found to work against many drug-resistant bacteria, including Clostridium difficile, Acinetobacter baumannii and Mycobacterium tuberculosis. Halicin was also found to have low predicted toxicity in humans.
ML models can explore, in silico, large chemical spaces that can be tedious and expensive to investigate using conventional approaches. As per Stokes, "ML as a drug discovery tool is likely going to play an important role in how we find new antibiotics. As a predictive tool, properly trained models will allow us to explore vast chemical spaces in silico, which are sufficiently large that we would not be able to empirically screen this number of compounds in the laboratory. Currently, we can screen perhaps a few million molecules in the lab at a large scale, compared to in silico predictions that reach into the billions of compounds."
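The train-then-screen pattern described above can be sketched in a few lines. To be clear about the assumptions: the MIT team used a deep neural network on real chemical structures, whereas this example substitutes a small logistic regression on made-up four-bit "fingerprints". Compound names, bit patterns and training settings are all hypothetical.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def predict_activity(model, bits):
    """Predicted probability that a compound inhibits growth."""
    weights, bias = model
    return sigmoid(bias + sum(w * b for w, b in zip(weights, bits)))

def train(fingerprints, labels, epochs=100, lr=0.1):
    """Fit a logistic regression with simple stochastic gradient descent."""
    model = ([0.0] * len(fingerprints[0]), 0.0)
    for _ in range(epochs):
        for bits, label in zip(fingerprints, labels):
            error = predict_activity(model, bits) - label
            weights, bias = model
            model = ([w - lr * error * b for w, b in zip(weights, bits)],
                     bias - lr * error)
    return model

# Hypothetical training set: 1 = inhibited growth in vitro, 0 = did not.
# By construction, only the first bit is informative.
train_bits = [[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 1, 1],
              [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [0, 1, 1, 1]]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
model = train(train_bits, train_labels)

# Hypothetical screening library of compounds the model has never seen.
library = {"cmpd_1": [1, 1, 1, 0], "cmpd_2": [0, 1, 1, 0],
           "cmpd_3": [1, 0, 1, 1], "cmpd_4": [0, 0, 1, 1]}

ranked = sorted(library, key=lambda name: predict_activity(model, library[name]),
                reverse=True)
for name in ranked:
    print(name, round(predict_activity(model, library[name]), 3))
```

In a real campaign, only the top-ranked predictions would move on to laboratory testing, which is what makes screening very large libraries in silico practical.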
ML approaches can be exploited at every stage of the drug development pipeline. "Beyond chemical prediction to discover new antibiotics at the preclinical stage, I believe machine learning approaches can show utility at every stage of the drug development pipeline – the important question is whether we as scientists can get acceptable training datasets in order to train models that can make reasonable predictions at more advanced stages of drug development," said Stokes.
Image-based profiling for drug discovery
Image-based profiling is a strategy in which the information present in biological images is processed, analyzed and extracted as image-based features, which are then aggregated into profiles. These image profiles can be mined to capture relevant patterns and reveal unanticipated biological activity, such as a previously unexplored disease mechanism. This information can then be applied in the drug discovery process.
Image-based profiling can be used to identify disease-specific phenotypes and explore the mechanism of a disease. It can also be used to predict a drug's activity, such as the mechanism of action and toxicity profile.
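The aggregation and comparison steps can be sketched as below, assuming the per-cell features have already been extracted from the images. The feature values and treatment names are invented. The feature-wise median and cosine similarity are common choices for illustration only, not necessarily those used by any particular group, and real pipelines typically also normalize and select features first.

```python
from math import sqrt
from statistics import median

def aggregate_profile(cell_features):
    """Collapse per-cell feature vectors into one profile (feature-wise median)."""
    return [median(column) for column in zip(*cell_features)]

def cosine_similarity(a, b):
    """Similarity of two profiles, ignoring overall magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Hypothetical per-cell measurements (e.g., area, intensity, texture) for
# three treatments, three cells each.
cells = {
    "compound_X": [[1.0, 0.2, 0.1], [1.2, 0.1, 0.3], [0.8, 0.3, 0.2]],
    "compound_Y": [[1.1, 0.2, 0.2], [0.9, 0.2, 0.1], [1.0, 0.1, 0.3]],
    "compound_Z": [[0.1, 1.0, 0.9], [0.2, 1.2, 1.1], [0.3, 0.8, 1.0]],
}

profiles = {name: aggregate_profile(rows) for name, rows in cells.items()}

# Treatments with similar profiles may share a mechanism of action.
print(cosine_similarity(profiles["compound_X"], profiles["compound_Y"]))
print(cosine_similarity(profiles["compound_X"], profiles["compound_Z"]))
```

Comparing profiles in this way is what allows a compound with an unknown mechanism to be matched against compounds whose activity is already characterized.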
Anne Carpenter, senior director of the Imaging Platform at Broad Institute of MIT and Harvard, and her team of biologists and computer scientists are pioneers in developing image analysis and data exploration solutions.
"Image-based profiling is powerful because looking at patterns in images can accelerate nearly every step of the drug discovery pipeline, from building diverse yet compact chemical libraries to primary screening assays, to target deconvolution for phenotypic screens, to the identification of biomarkers and diagnostics. It has even recently been shown to eliminate the need for primary screening by virtual prediction of some biological activities from existing images," explained Carpenter.
Due to the increasing volumes of images, researchers have started leveraging ML strategies like deep learning to improve the extraction of relevant signals from image-based profiles and accelerate drug discovery.
"Most of the proof of principle experiments in the field have used classical image processing and machine learning techniques, so I think we are about to see a rapid acceleration in the field by applying deep learning methods for feature extraction and prediction," said Carpenter.
Exciting possibilities, but understanding methodologies is key
ML can assist scientists and accelerate the drug discovery pathway. ML, when combined with expert knowledge, can reduce the rate of attrition and enhance the process of drug discovery. Zhang elaborated on its potential, "Hundreds of millions of potential drugs await discovery. The next great antiviral (or antidepressant or anti-inflammatory) may already be in a lab somewhere, previously overlooked (it takes lots of time to be proposed for some indications). Deep learning can tell us which compounds are worth testing."
AI has presented exciting possibilities for new discoveries in diverse fields; however, adoption of this technology is still low. Harnessing AI to its full potential will require training, trust and coordination between key stakeholders.