The Power of Sparse Data AI in the Pharmaceutical Industry


AI is rapidly being adopted in the pharmaceutical industry, particularly for improving predictive models in drug discovery and early preclinical development. Fueled by the large amounts of data generated in R&D, projects built on these big data AI innovations are leading to exciting new therapies. Nonetheless, many pharmaceutical datasets are small, leaving only restricted amounts of usable data. For AI to be fully exploited, new methodologies must be developed.

Sparse data AI – the application of AI to restricted amounts of data – is opening up novel avenues for enhanced drug development. It provides a way of modeling complex biochemistry backed up with a known explanation of the underlying mechanism behind each inference. By moving away from a big data approach, AI within the pharmaceutical industry can advance with greater transparency and lead to more life-changing medicines reaching the market.

Progressing from the use of big data to sparse data

One of the main aims of computational chemistry is to predict the outcomes of complex organic reactions, because empirical approaches are time-consuming and expensive. In recent years, models have been developed that design efficient reaction sequences for specific target molecules and have been demonstrated to perform at the same level as a chemistry professor1. Despite these advances, one remaining challenge is to understand whether these models can make accurate predictions outside the realms of the training data, which is still limited compared with the vast number of possible molecules.

Big data AI involves masses of data being used to create flexible and generic input/output predictive models with minimum domain knowledge. These are systems induced from data with no a priori knowledge. Deep learning for big data has previously been applied in computational chemistry for the pharmaceutical industry, including the development of models that predict the physico-chemical properties of drug molecules. 

Over the past decade, there has been a dramatic increase in the amount of available compound activity data due to the development of new experimental techniques, such as high-throughput screening and parallel synthesis2. AI is being used to efficiently mine this large-scale chemistry data for drug discovery. Most applications of AI use big data; however, it is not always easy to access large datasets in the pharmaceutical industry since many companies prefer to keep their data in-house. The application of sparse data AI therefore provides a significant opportunity for the utilization of pharmaceutical datasets where information is limited.

The use of Bayesian optimization for sparse data

Big data AI can be defined as a black box approach – using algorithms with an unknown explanation. This approach is limited because of the inability to understand the underlying mechanism behind the model, which would be essential to achieve approval for use in the pharmaceutical industry. Sparse data AI, in contrast, is a white box approach and more suitable for understanding causal inferences. The greater transparency of sparse data AI is important; for AI to be fully implemented in the industry, it needs to be trusted and understandable3.

Furthermore, in contrast to big data AI, sparse data AI directly augments experimental results with detailed expert knowledge for the probabilistic prediction of desired quantities, such as how a given molecule will behave under a specific condition. The extracted expert knowledge encodes an understanding of the phenomenon being modeled, from which predictions can be generated. The augmentation is targeted towards specific, rather than generic, models with a transparent and understandable prediction mechanism.
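The principle of augmenting sparse measurements with expert knowledge can be illustrated with a minimal sketch of a conjugate Bayesian update. All numbers here are hypothetical, chosen only for illustration: an expert's prior belief about a compound property (a normal distribution) is combined with a handful of noisy measurements to give a posterior that is both more accurate and demonstrably less uncertain than the prior alone.

```python
import numpy as np

# Hypothetical expert prior on a compound's log-solubility:
# the expert's best estimate and how uncertain they are about it.
prior_mean, prior_var = -3.0, 1.0
noise_var = 0.25                                # assumed assay noise variance

# Sparse experimental data: just three measurements (illustrative values).
measurements = np.array([-2.4, -2.6, -2.5])
n = len(measurements)

# Conjugate normal-normal Bayesian update: precisions (inverse variances)
# add, and the posterior mean is a precision-weighted average of the
# prior mean and the data.
post_var = 1.0 / (1.0 / prior_var + n / noise_var)
post_mean = post_var * (prior_mean / prior_var + measurements.sum() / noise_var)

print(f"posterior mean {post_mean:.2f}, posterior variance {post_var:.3f}")
```

The posterior sits between the expert's prior and the sparse data, weighted by how precise each source is, and its variance shrinks below both inputs. This is the transparency the white box approach refers to: each inference can be traced back to an explicit prior and an explicit noise assumption.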

A fundamental component of sparse data AI is the use of Bayesian optimization, which uses probabilities to build a sequential model-based approach to problem solving. Bayesian optimization provides the means to efficiently resolve the exploration vs. exploitation dilemma – balancing searches within the local data space against those that venture out into the unknown4. It can be used to search for an optimal procedure or model with a number of unknowns that typically have no simple analytical expression, using probabilistic surrogate models of the unknown to quickly find solutions and quantify their certainty. It enables learning that is closer to the human level, where only one or two examples are required for broad generalization5.
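The sequential loop described above can be sketched in a few dozen lines. This is a generic, minimal illustration, not Nanoform's actual method: the objective function, kernel length scale, and all numerical settings are assumptions chosen for the demo. A Gaussian-process surrogate models the unknown function from the few points evaluated so far, and an expected-improvement acquisition function decides where to "experiment" next, trading off exploitation (high predicted mean) against exploration (high predicted uncertainty).

```python
import numpy as np
from math import erf

def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential covariance between two 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-5):
    """Zero-mean Gaussian-process posterior mean and std at query points."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_query)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    var = 1.0 - np.sum(K_s * (K_inv @ K_s), axis=0)  # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mean, std, best, xi=0.01):
    """EI acquisition: rewards both high mean and high uncertainty."""
    z = (mean - best - xi) / std
    cdf = 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return (mean - best - xi) * cdf + std * pdf

def objective(x):
    # Hypothetical stand-in for an expensive experiment; true optimum at 0.7.
    return np.exp(-(x - 0.7) ** 2 / 0.05)

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, 3)       # three initial "experiments"
y_train = objective(x_train)
grid = np.linspace(0.0, 1.0, 201)        # candidate experimental settings

for _ in range(10):                      # sequential design loop
    mean, std = gp_posterior(x_train, y_train, grid)
    ei = expected_improvement(mean, std, y_train.max())
    x_next = grid[np.argmax(ei)]         # most promising next experiment
    x_train = np.append(x_train, x_next)
    y_train = np.append(y_train, objective(x_next))

print(f"best setting found: {x_train[y_train.argmax()]:.2f}")
```

After only thirteen evaluations in total, the loop homes in on the region around the true optimum, which is exactly the appeal for sparse data settings: each expensive experiment is chosen to be maximally informative rather than drawn at random.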

Sparse data AI and enhanced nanoparticle production

Sparse data AI is being used to enhance drug development, particularly for improving solubility and bioavailability characteristics – a critical goal for improving the rate of attrition in the pharmaceutical pipeline. Nanonization can help drug compounds reach their full therapeutic potential by reducing particle size and thereby improving dissolution rates. Sparse data AI is being applied to particle engineering technology to help define the physical characteristics of drug candidates from limited data and understand how these parameters influence solubility and bioavailability. The work will enable the prediction of nanonization success for new drug candidates and will create a more efficient particle engineering process for drug development.

This project involves building a digital version of the technology, which enables scientists to perform in silico experiments in large quantities. Sparse data AI will ground the digital version with available in vivo and in vitro experimental data, to guide an optimal search of additional experiments to learn model parameter settings. The technology can significantly increase the likelihood of identifying successful compounds that can quickly progress to clinical trials. It can also be used to re-profile molecules and determine whether they can be engineered for different therapeutic applications. While many companies are applying AI to the intelligent design of drugs, the most exciting future work in the sparse data area will explore how AI and particle engineering can be used to find solutions to the most difficult drug development and delivery challenges in the industry. This includes the production of drugs that are capable of penetrating the blood-brain barrier and the deep lung. 

Looking to the future

AI has the potential to positively transform the pharmaceutical industry through its ability to analyze masses of information and augment the capabilities of human experts. To fulfill this promise, it is important that we choose the right solutions for complex problems. The application of sparse data AI is creating a new technological era, one based on transparent and understandable models. Once trust in the technology is earned, the use of AI for drug discovery and development will greatly expand, enabling patients to swiftly benefit from new and enhanced medicines.


1. Jin, W. et al. (2017) Predicting organic reaction outcomes with Weisfeiler-Lehman network, 31st Conference on Neural Information Processing Systems (NIPS 2017). Available online: https://arxiv.org/abs/1709.04555

2. Chen, H. et al. (2018) The rise of deep learning in drug discovery, Drug Discovery Today, 23(6), pp. 1241-1250.

3. Silicon Republic: What are the benefits of white-box models in machine learning? Available online: https://www.siliconrepublic.com/enterprise/white-box-machine-learning

4. Shahriari, B. et al. (2016) Taking the human out of the loop: a review of Bayesian optimization, Proceedings of the IEEE, 104(1), pp. 148-175. https://ieeexplore.ieee.org/abstract/document/7352306

5. Lake, B.M. et al. (2015) Human-level concept learning through probabilistic program induction, Science, 350(6266), pp. 1332-1338.

About the author

Jukka Corander is Head of AI at Nanoform, an innovative nanoparticle medicine-enabling company. He is currently Professor of Biostatistics at the University of Oslo, Norway and Professor of Statistics at the University of Helsinki, Finland. Jukka’s research interests include the use of state-of-the-art machine learning techniques to create simulation-based models from sparse data. His recent work with the Wellcome Sanger Institute in Cambridge, UK involved the application of statistical machine learning and Bayesian inference algorithms to biological data. Jukka is a world-leading expert in his field, having published over 230 research papers in major international journals.