

Big Data Analysis Approaches for Drug Discovery


Read time: 7 minutes

Big data has long been a buzzword in drug discovery, but as analysis methods become more sophisticated, its potential is beginning to be realized. We look at some of the latest advances in big data analysis for drug discovery.

Image-based cell profiling

Dr Anne Carpenter’s lab at the Broad Institute is dedicated to helping biologists get the most out of their images, whatever problem or disease area they are working on.

“Part of our work is geared towards replacing some of the tedious work that biologists do,” says Carpenter, “whereas the big data mining is focused on doing more than a biologist can do, even if they had an infinite amount of time.”

The lab is famous for its open-access tools CellProfiler and CellProfiler Analyst, used by many of the top pharma companies, which allow users to measure, mine and interactively explore morphological data from images of complex physiological samples in high throughput.

“For example, if you want to find drugs that can keep tuberculosis under control,” explains Carpenter, “you might infect cells in a dish and then test a hundred thousand drugs against those cells. You can then use CellProfiler to analyze images of those cells and see which ones are infected or not.”

Similarly, it can find novel morphological differences between diseased and healthy cells. “If I take images of cells from patients with bipolar disorder and cells from matched controls, I can ask ‘is there any difference between those groups?’ and let's say we found a unique mitochondrial phenotype, this tells us the processing of energy in the cell has something to do with bipolar disorder.” Not only do you get a sense of what the mechanism might be, you now have a morphological phenotype that's associated with the disease for screening hundreds of thousands of drugs.

This type of approach is already showing clinical promise. Recursion Pharmaceuticals, founded on technology developed at the Broad Institute, already has two clinical candidates in cerebral cavernous malformation and neurofibromatosis type 2 and is partnering with Takeda to discover drugs for rare diseases using artificial intelligence.

Moreover, CellProfiler is having an immediate impact on patients through personalized medicine. “Researchers in Vienna are using our software in a clinical trial,¹ where they take tumor cells from patients, grow them in a dish and then test hundreds of different human therapies on each individual patient sample,” says Carpenter. “Then they use our software to measure whether the cells respond to those drugs. It’s pretty cool to see CellProfiler having a direct impact.”

Machine learning vs deep learning

Machine learning relies on being shown many examples of data and being given the ‘correct’ answer for that data. The algorithm learns how to predict the right answer for a new set of data based on previous experience.

In classical machine learning there's an intermediate step that involves the human user deciding what kinds of features might be helpful in the analysis. But this is limiting, because we may not be able to capture everything that is possibly there.

Deep learning is a unique branch of machine learning that takes the raw form of the data and looks for patterns with no prior ‘knowledge’. 

In image analysis, for example, a classical machine learning approach would identify where the cell is and would measure a lot of features about the cell, such as identifying the nucleus area, or measuring how much green or blue staining there is. From that, the algorithm can tell the user which cells have a given disease or not.
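The classical workflow described above can be sketched in a few lines of Python. This is only an illustrative toy, not how CellProfiler works: the tiny synthetic images, the choice of two features (nucleus area and mean green intensity), the threshold value and the nearest-centroid classifier are all assumptions made for the example.

```python
# Toy "images": a grid of (green, blue) pixel intensities in the range 0-255.
# Assumption for this sketch: blue staining marks the nucleus.

def make_image(nucleus_pixels, green, size=4):
    """Build a hypothetical size x size image with a given nucleus area."""
    flat = [(green, 200 if i < nucleus_pixels else 20) for i in range(size * size)]
    return [flat[r * size:(r + 1) * size] for r in range(size)]

def extract_features(image, nucleus_threshold=128):
    """Hand-chosen features: nucleus area (pixel count) and mean green intensity."""
    pixels = [px for row in image for px in row]
    nucleus_area = sum(1 for _green, blue in pixels if blue >= nucleus_threshold)
    mean_green = sum(green for green, _blue in pixels) / len(pixels)
    return [nucleus_area, mean_green]

def train(images, labels):
    """Nearest-centroid 'training': average the feature vectors for each label."""
    by_label = {}
    for image, label in zip(images, labels):
        by_label.setdefault(label, []).append(extract_features(image))
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }

def predict(model, image):
    """Assign the label whose average feature vector is closest to this image."""
    feats = extract_features(image)
    return min(
        model,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(feats, model[label])),
    )

# Labelled examples (the 'correct' answers), then a new, unlabelled image.
healthy = [make_image(4, 40), make_image(5, 50)]
diseased = [make_image(9, 150), make_image(10, 160)]
model = train(healthy + diseased, ["healthy"] * 2 + ["diseased"] * 2)
print(predict(model, make_image(8, 140)))
```

The important point is that `extract_features` is written by a person: anything it does not measure is invisible to the classifier.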

In deep learning, the algorithm is not told whether the nuclear area or the staining has anything to do with the disease. A user could try to specify a long list of features but might miss something. So instead, deep learning takes the pixels of the images and looks for patterns among them.
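Deep learning models for images are commonly built from convolution layers, in which small filters slide across the raw pixels. The sketch below shows that single building block in plain Python. The filter values here are fixed by hand purely for illustration; in a trained network they would be learned from the data, which is what removes the need for hand-chosen features. The image and filter are made up for the example.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over a square image (no padding).

    As in most deep learning libraries, this is technically cross-correlation:
    the kernel is not flipped before it is applied.
    """
    k = len(kernel)
    out_size = len(image) - k + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]

# Raw pixel intensities: dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# A filter that responds to dark-to-bright vertical edges.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

response = convolve2d(image, edge_kernel)
for row in response:
    print(row)
```

The response is strongest in the column where the dark region meets the bright one, so the filter has picked out a pattern directly from pixels without anyone defining "nucleus area" or "staining" first.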

Repurposing image-based assays to find new drug targets and mechanisms

At Janssen R&D, Dr Hugo Ceulemans, Scientific Director of Computational Sciences, leads a team of data scientists who are applying machine learning and artificial intelligence in support of small molecule drug discovery. His group recently published data showing that it is feasible to repurpose existing image-based assays to explore new targets and chemical space.²

“Conventional wisdom says that if you build a bespoke image assay with a certain mechanism or target in mind, the images generated in that exercise will only inform those targets or mechanisms it was designed for,” explains Ceulemans. “But if you think of it, the cells that are used in that image assay host thousands of targets in addition to the one that you are looking at, and all those targets are exposed to the chemistry during a screen. Many of those targets and mechanisms will translate to morphological changes. So, if you interfere with those targets or mechanisms even if you didn’t intend to, it will trigger changes you can see with a microscope. That’s precisely what we try to mine.”

In the latest study, the team took a set of compounds and looked not at the chemistry, but instead at microscopy images from an assay that was designed for a single mechanism. Then they cross-compared these with their other validated assays – where a specific question about drug activity on a whole series of targets and mechanisms had previously been answered.

They found that a single set of images informed the outcome in hundreds of validated assays that the original image screen was not designed for at all.
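The idea of one image set informing many unrelated assays can be sketched as follows. This is not the method used in the study: a simple nearest-neighbour lookup is swapped in to keep the example short, and the compound names, feature values and assay results are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_assays(image_features, assay_results, query):
    """For each assay, borrow the result of the measured compound whose
    image-derived features are closest to those of the query compound."""
    predictions = {}
    for assay, measured in assay_results.items():
        nearest = min(
            measured,
            key=lambda cpd: sq_dist(image_features[cpd], image_features[query]),
        )
        predictions[assay] = measured[nearest]
    return predictions

# One set of image-derived features per compound, from a single imaging screen.
image_features = {
    "cpd_A": [0.9, 0.1],
    "cpd_B": [0.8, 0.2],
    "cpd_C": [0.1, 0.9],
    "cpd_D": [0.2, 0.8],
    "cpd_new": [0.85, 0.15],
}

# Each validated assay was only run on some of the compounds (True = active).
assay_results = {
    "assay_1": {"cpd_A": True, "cpd_C": False},
    "assay_2": {"cpd_B": False, "cpd_D": True},
}

print(predict_assays(image_features, assay_results, "cpd_new"))
```

The same feature table is reused for every assay, which is the point of the approach: the images are collected once, and each additional assay only needs enough measured compounds to anchor its predictions.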

“What this means is that if you were now to embark on a new drug discovery project that requires evaluating drugs in an expensive physiological assay such as stem cells, you can start with a smaller set of datapoints in that assay, and then ask, ‘can extensive image datasets that already exist fill the gaps?’ With the images we have for hundreds of thousands, and in some settings millions, of compounds, we can document a much larger chemical space,” says Ceulemans.

The more diverse the chemistry, the more starting points there are for drug discovery, and the greater the chance a drug will make it all the way through the pipeline, explains Ceulemans. Although machine learning like this will never replace physiological assays, it can significantly reduce the number of experiments needed in those complicated models.

Data scientists are always looking for other types of data that either exist or could be generated, potentially in partnership, says Ceulemans. He sees an emerging role for big data analysis: it could be used not only to help select existing compounds to test, but also to propose novel compounds to make and then test.

“The cost of those data points is higher because chemical synthesis is not free. So artificial intelligence can make an even bigger impact here. While this is harder, the latest methods are becoming more powerful. Beyond that, we see a new world where we would not only be helping drug discovery to select existing compounds and propose new ones, but also providing guidance on how to synthesize them.”

Getting big data ready for machine learning

Dr Brian Marsden’s team at the University of Oxford is working on a different aspect of big data – how it is captured, managed and presented so that people can apply machine learning techniques. His team is part of the Structural Genomics Consortium (SGC), world leaders in solving the structures of human proteins. The sort of data they produce is not classically big data, such as images or large sets of -omics data, but it is highly complex and not obviously amenable to data mining.

Working with an organization called Diamond Light Source in Oxfordshire, they conduct fragment-based screening against human protein targets to identify potential binders that can be developed into chemical probes or drug leads. These screens generate hundreds of data sets in a short period of time.

“The data sets are very complex because they show whether a compound has bound, and importantly where and how it binds to the protein of interest,” explains Marsden. “There could be hundreds, even thousands, of these pieces of information which together give us a fingerprint of where on a protein we might find druggable spaces.”

Conventionally, this data would need to be analyzed by a computational chemist who would look at every single structure individually. “They would need a really good memory to spot the patterns. And when you've got a hundred of these things you've got to be very good to even have a remote chance of spotting patterns.” As this is what machine learning is good at, Marsden’s goal is to take the structure of the protein with the molecule bound and convert that into a representation like an image.

“One of the things we are working on at the moment is whether we can use machine learning, and particularly deep learning, to identify which fragments bind best and therefore which ones we should sink chemistry resource into, to turn them into chemical probes or maybe even lead molecules as we go along.”

“Deep learning algorithms really work best for image analysis, looking for segmentation or pattern matching. It makes sense for us to try and represent our problem as something that a computer might see as an image even though it wouldn't look like an image to us. Then we could plug it straight into existing machine learning algorithms rather than reinventing the wheel and coming up with a specific solution to our problem.”
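One common way to make a 3D structure look like an image to a computer is to place the atoms on a regular grid, so that each grid cell plays the role of a pixel. The sketch below illustrates that general idea only; it is an assumption made for this example, not a description of the representation Marsden’s team uses, and the coordinates, grid size and box size are invented.

```python
def voxelize(atoms, grid_size=4, box=8.0):
    """Count atoms in each cell of a cubic grid spanning [0, box) on every axis.

    atoms: list of (x, y, z) coordinates, e.g. in angstroms.
    Returns a nested list indexed as grid[z][y][x].
    Atoms that fall outside the box are ignored.
    """
    cell = box / grid_size
    grid = [
        [[0] * grid_size for _ in range(grid_size)]
        for _ in range(grid_size)
    ]
    for x, y, z in atoms:
        ix, iy, iz = int(x // cell), int(y // cell), int(z // cell)
        if all(0 <= i < grid_size for i in (ix, iy, iz)):
            grid[iz][iy][ix] += 1
    return grid

# Hypothetical coordinates for a fragment bound in a protein pocket.
atoms = [
    (1.0, 1.0, 1.0),
    (1.5, 0.5, 1.9),
    (7.0, 7.0, 7.0),
    (9.0, 1.0, 1.0),  # outside the box, so it is dropped
]

grid = voxelize(atoms)
print(grid[0][0][0], grid[3][3][3])
```

A fuller version might keep a separate grid for each atom type, much as a colour image has separate red, green and blue channels, so that the result can be passed to existing image-oriented algorithms.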

The future for big data in drug discovery

Big data analysis is providing much more hope than hype in drug R&D these days, but there are still challenges to be solved.

“There are already lots of people working on applying these approaches more in the clinical remit by mining clinical data,” comments Ceulemans. “But so far the data volumes available have been more limiting than in the discovery field.” Until now, even if molecular information or sequence information was collected in a trial it would be for only a few hundred patients at most. But with the lowering of costs for genomic sequencing there are several initiatives connecting profiling of patient material with clinical information on a large scale.

“I see a lot of homologies between the work I described for discovery where we try to match the most promising [compounds] with a relevant assay,” says Ceulemans. “In this case it would be matching patients with optimal treatments or matching patients with trials. Previously the data volumes have been challenging but I think we are near the point with those data sets where data mining analysis will start getting traction.”

Another persistent challenge that stands in the way of using machine learning is the quality and standardization of data, says Marsden. “Five years ago, we were talking about machine learning as a new way of dealing with all the data we’ve got sitting in archives. We thought we just needed to throw the data into a machine learning tool and it would solve all our problems.” However, the data has often proved too noisy, or not normalized in a way that machine learning algorithms can readily use.

“I think people are waking up to the fact that challenges still exist around how to create data sets which are clean, coherent and comparable. Machine learning is still a great way forward but people are having to think about the way that they try and capture complex data.”
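As a small illustration of what making data “comparable” can involve, the sketch below rescales readings within each batch so that batches measured on different scales can be pooled. Per-batch z-scoring is just one simple option chosen for this example; the plate names and values are hypothetical.

```python
from statistics import mean, pstdev

def normalize_per_batch(readings):
    """Z-score each reading against the mean and spread of its own batch.

    readings: list of (batch_id, value) pairs.
    Returns a list of (batch_id, z_score) pairs in the same order.
    """
    by_batch = {}
    for batch, value in readings:
        by_batch.setdefault(batch, []).append(value)
    stats = {b: (mean(v), pstdev(v)) for b, v in by_batch.items()}

    normalized = []
    for batch, value in readings:
        mu, sigma = stats[batch]
        # A batch with no spread carries no ranking information; map it to 0.
        z = (value - mu) / sigma if sigma else 0.0
        normalized.append((batch, z))
    return normalized

# Two plates whose raw values sit on very different scales.
readings = [
    ("plate_1", 10.0),
    ("plate_1", 20.0),
    ("plate_2", 110.0),
    ("plate_2", 120.0),
]

normalized = normalize_per_batch(readings)
for batch, z in normalized:
    print(batch, z)
```

After normalization the low and high readings on each plate line up, whereas the raw values would suggest that everything on the second plate is far higher than anything on the first.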


1. Snijder, B et al. Image-based ex-vivo drug screening for patients with aggressive haematological malignancies: interim results from a single-arm, open-label, pilot study. Lancet Haematol. 2017; 4: e595-e606.
2. Simm, J et al. Repurposing high-throughput image assays enables biological activity prediction for drug discovery. Cell Chem Biol. 2018; 25: 611-618.