

Strengthening a Commitment to Diversity and Inclusion in R&D With the Help of Novel Machine Learning Approaches

Female doctor and patient seated, smiling and looking at a tablet.
Credit: iStock

Historically, clinical trials have relied heavily on white male participants, but today we know that such practices create knowledge gaps in how we understand the natural progression and treatment of acute and chronic conditions across different patient populations. That’s why enrolling sufficient patients from all gender, racial and ethnic groups in numbers that reflect current population proportions is essential for ensuring a treatment’s efficacy within all groups.


Despite decades of efforts to address disparities, the underrepresentation of gender, racial and ethnic minorities in research and development persists. For example, African Americans make up 13.4% of the US population, but only 5% of trial participants. Hispanics represent 18.1% of the US population, but less than 1% of trial participants. These enrollment disparities have real consequences for health outcomes in traditionally underrepresented communities, and industry stakeholders are recognizing the need to achieve meaningful change.

For example, in 2022, the US Food and Drug Administration issued draft guidance for trial sponsors to develop and submit a “Race and Ethnicity Diversity Plan” for their trial programs prior to finalizing study designs. To ensure that drug developers take tangible steps for progress with early planning and proactive thought, plans should include specific enrollment goals and an operational strategy to reach underrepresented and underserved racial and ethnic patient populations.


As they explore transformative approaches to improve diversity and inclusion in clinical trials, industry stakeholders are leveraging advanced technologies and the wealth of available data sources to better understand patient needs, burdens and more so they may increase interest and participation in trials.

These efforts have shown that machine learning (ML) programs can effectively rank trial sites by their likely patient enrollment, drawing on trial protocols (including eligibility criteria), previous trial performance, claims data and patient demographics. Going further, with continued fine-tuning of ML capabilities, a deep reinforcement learning framework designed with intent can learn to prioritize inclusion while optimizing trial site selection.


In this article, we discuss how a deep learning framework can specifically address real-world challenges to enhance site selection while seeking to improve trial diversity.


Site selection’s real-world challenges


To address enrollment disparity, clinical trial sponsors are using data-driven methodologies to calculate trial protocol burdens according to race and ethnicity, participating in community events (e.g., health fairs) to create local awareness, providing culturally relevant communications to patients for stronger engagement and much more. With early planning, multi-pronged approaches have been shown to be impactful.


To maximize the potential for sites to have adequate representative samples of participants with diverse backgrounds, sponsors have to dig deeper into the nuances of site identification and evaluation methodologies. But there are two notable barriers to site identification that impact improving equity in clinical trial participation:


Missing data

Site identification often starts with an analysis of claims and specialty data, including patient and public involvement, site visits and clinical research coordinator participation, past enrollment performance and recruitment rates. These data sets will best inform the likelihood of local patient enrollment.

However, trial sites with a greater minority population are potentially more likely to lack data due to insufficient data collection and reporting. Though race and ethnicity data reporting from trials is increasing, it is still a work in progress. Existing tools cope poorly with missing data, so sites in minority-rich locations are more easily overlooked, which exacerbates the underlying unfairness.


The enrollment-diversity trade-off

Adding diversity as a qualifier for target sites while maximizing enrollment can be challenging. We cannot simply impose fairness by setting quotas for each racial or ethnic group: because existing approaches select fewer minority-population participants, quotas tied to those smaller numbers would effectively cap overall enrollment. We must balance the trade-off between enrollment needs and fairness, and as such, we need to optimize simultaneously for both objectives.
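
The cap that strict quotas create can be seen with a little arithmetic. The sketch below uses made-up numbers and hypothetical names (`available`, `target_share_pct`, `quota_capped_enrollment`); none of them come from the framework discussed in this article. It assumes every group must be enrolled in exact proportion to its target share, in which case total enrollment can be no larger than what the scarcest group supports.

```python
def quota_capped_enrollment(available, target_share_pct):
    """Largest total enrollment that keeps every group at its exact target share.

    available: reachable participants per group (hypothetical counts).
    target_share_pct: required share of enrollment per group, in whole percent.
    """
    if sum(target_share_pct.values()) != 100:
        raise ValueError("target shares must sum to 100 percent")
    if any(pct <= 0 for pct in target_share_pct.values()):
        raise ValueError("each target share must be positive")
    # Each group limits the total to available / share;
    # the scarcest group sets the binding cap.
    return min(available[g] * 100 // pct for g, pct in target_share_pct.items())


# Illustrative numbers only, not taken from any real trial.
available = {"group_a": 900, "group_b": 60, "group_c": 40}
target_share_pct = {"group_a": 60, "group_b": 20, "group_c": 20}

unconstrained = sum(available.values())                        # 1000
capped = quota_capped_enrollment(available, target_share_pct)  # 200
print(unconstrained, capped)
```

In this toy case a strict quota would shrink a potential enrollment of 1,000 to 200, which is why enrollment and fairness are better optimized together than enforced through fixed quotas.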


Where deep reinforcement learning models can help

Given what must be considered to meet the challenges discussed above, clinical trial sponsors need to determine how best to optimize multiple site parameters to ensure better enrollment rates with diverse patient populations.


The use of ML solutions to improve clinical trial design and execution is broadening with time as skilled data scientists gather more practice-based insights and apply them to further fine-tune ML-based models for the task at hand. Currently, ML is helping to validate assumptions about trial feasibility, extract meaningful patterns of patient outcomes to drive trial design, predict trial outcomes and more. 

Going a step further, the ML subfield of deep learning is being used to predict the optimal physicians to run studies and maximize patient recruitment. However, these approaches do not consider patient diversity.


To improve site selection with diversity and enrollment in mind, data scientists have tested a specific deep reinforcement learning model using data points from nearly 4,400 real-world clinical trials from 2016 through 2021. Results show this framework accounts for several key variables that can better address the challenges of missing site-specific data and trading off enrollment for diversity and vice versa.


Modality encoder for missing but needed data

Most ML-based research assumes datasets are complete and well-cleaned, but that is rarely the case in real-world applications, where incomplete data skews outcomes. Recognizing the need for a more uniform view of sites despite missing or insufficient data, data scientists have tested this framework to work around what is not available: it takes data from multiple sources and then combines, enriches and enhances it to provide a more holistic picture of each site. In addition, where data is missing, its content can be accurately inferred once the holistic overview is available.


Other existing ML-based strategies, such as modality dropout and cascaded residual autoencoders, do not directly model missing data. But, in this framework, it is possible to build a more accurate representation of a clinical trial site without complete site data.
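
To make the idea of working around missing inputs concrete, here is a deliberately simple sketch. It is not the framework's modality encoder: it assumes each data source (for example claims, past enrollment or demographics) has already been turned into a numeric vector of the same length, and it swaps in a plain masked average in place of a learned model. The function name `fuse_modalities` and the sample values are hypothetical.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def fuse_modalities(modalities: Sequence[Optional[Vector]]) -> Vector:
    """Average only the modality vectors that are present (None = missing)."""
    present = [m for m in modalities if m is not None]
    if not present:
        raise ValueError("at least one modality must be available")
    dim = len(present[0])
    if any(len(m) != dim for m in present):
        raise ValueError("all modality vectors must have the same length")
    # Missing modalities are skipped rather than treated as zeros,
    # so they do not drag the site's representation toward zero.
    return [sum(m[i] for m in present) / len(present) for i in range(dim)]


# Hypothetical sites: one with all three sources, one missing the second.
site_complete = [[1.0, 3.0], [2.0, 2.0], [3.0, 4.0]]
site_partial = [[1.0, 3.0], None, [3.0, 5.0]]

print(fuse_modalities(site_complete))  # [2.0, 3.0]
print(fuse_modalities(site_partial))   # [2.0, 4.0]
```

Both sites end up with a representation of the same size, so a site is not dropped from consideration simply because one data source is missing.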


Efficiently trading off: a “reward system”

To rank findings and site features based on what is ideal for a given trial, this deep reinforcement learning model specifically integrates a reward function that emphasizes metrics for enrollment and fairness in terms of inclusion of diverse participants.

Because trial-site representations no longer have “data holes,” this function puts a value on individual sites’ contributions as they relate to the “reward” given to their features. Under this reward system, the final reward for a site is being selected as a target site for the trial. The model also uses an encoding layer that allows each site’s ranking or score to be influenced by other sites’ features. As seen in Figure 1 below, this emphasizes which sites may be both ideal for overall enrollment and able to reach diverse populations.


Figure 1: A visualization of a deep reinforcement learning model that considers fair ranking with missing modalities. This framework uses multi-modal site features and the trial representation to generate scores for rank and selection of a subset of prospective trial sites. The pipeline used to do so consists of modality encoders, a missing data handling mechanism, a scoring network and a reinforcement learning-based ranking approach. Credit: Theodorou B, Glass L, Xiao C, Sun J. 2024. CC BY 4.0.
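
A reward that weighs enrollment against fairness can be sketched in a few lines. Everything below is an assumption made for illustration: the weighting parameter `fairness_weight`, the use of normalized Shannon entropy as the fairness measure, the site names and the enrollment counts. An exhaustive search over small subsets also stands in for the reinforcement learning policy, which would instead learn to pick sites from the reward signal.

```python
import math
from itertools import combinations


def fairness_term(group_counts):
    """Normalized Shannon entropy of group shares: 0 = one group, 1 = even mix."""
    total = sum(group_counts.values())
    if total == 0 or len(group_counts) < 2:
        return 0.0
    shares = [c / total for c in group_counts.values() if c > 0]
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(group_counts))


def reward(selected, all_sites, fairness_weight):
    """Weighted sum of an enrollment term and a fairness term (assumed form)."""
    pooled = {}
    for name in selected:
        for group, count in all_sites[name].items():
            pooled[group] = pooled.get(group, 0) + count
    max_total = sum(sum(site.values()) for site in all_sites.values())
    if max_total == 0:
        return 0.0
    enrollment_term = sum(pooled.values()) / max_total
    return ((1 - fairness_weight) * enrollment_term
            + fairness_weight * fairness_term(pooled))


def best_subset(all_sites, k, fairness_weight):
    """Brute-force stand-in for a learned policy: try every subset of k sites."""
    return max(combinations(sorted(all_sites), k),
               key=lambda subset: reward(subset, all_sites, fairness_weight))


# Hypothetical expected enrollment per site, split by two groups.
sites = {
    "site_1": {"A": 90, "B": 10},
    "site_2": {"A": 40, "B": 40},
    "site_3": {"A": 95, "B": 0},
}

print(best_subset(sites, 2, fairness_weight=0.0))
print(best_subset(sites, 2, fairness_weight=0.5))
```

With these toy numbers, a weight of 0 selects site_1 and site_3 (the highest raw enrollment), while a weight of 0.5 selects site_1 and site_2, giving up a little enrollment for a more balanced mix of participants.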

Consistently building cases for use

ML-driven solutions are one part of a more holistic approach to optimizing site selection that prioritizes sufficient representation in trials from diverse patient populations. As such, it is critical that data scientists and other stakeholders constantly finesse techniques to better uncover insights of interest in an unbiased and accurate way.
There must be assurance that the ML approach used is embedded in the correct science and guided by the right group of subject-matter experts for clinical trials, including medical, clinical and data science experts.


Since deep reinforcement learning learns through trial and error, these models will continue to evolve to meet the needs at hand. For current use, the novel model discussed above helps trial sponsors bypass the need for complete data and limit or eliminate biases within its inputs. This helps tackle the longstanding industry challenge of selecting sites that can improve the diversity of the enrolled patient group while also protecting enrollment rates.


Opportunities for this model and other deep learning tools to help drive smarter decisions in R&D to enhance patient care will come with time and a growing collection of insights to examine.

About the author:

Greg Lever is director of AI solutions delivery at IQVIA. With more than 14 years of life sciences and technology experience, Greg currently helps clients discover innovative ways to bring life-changing therapies to patients faster within IQVIA’s Applied Data Science Center’s consulting sales team. Previously, he led a team of machine learning engineers within IQVIA’s Analytics Center of Excellence. 

Greg has worked with several technology startup companies in London and helped see Genomics England’s 100,000 Genomes Project through to completion. He received his PhD at the University of Cambridge, combining quantum physics and ML to develop new approaches for small-molecule drug discovery, and has worked as a postdoctoral associate at MIT.