How Multimodal Datasets and Models Are Helping To Advance Cancer Care

[Image: A person typing on a laptop, patient notes and a stethoscope alongside, overlaid with graphics representing multimodal patient data. Credit: iStock.]

Read time: 3 minutes

In the era of precision oncology, the integration of high-throughput, multimodal datasets presents both a formidable challenge and a transformative opportunity. From genomic and pharmacological profiles to radiological imaging and chemical perturbation data, the convergence of diverse data types offers unprecedented potential to unravel the complex biological underpinnings of cancer progression and therapeutic response. Yet realizing this potential requires computational frameworks capable of extracting clinically actionable insights from vast, heterogeneous and often incomplete datasets.


We spoke to Dr. Benjamin Haibe-Kains, senior scientist at the Princess Margaret Cancer Centre, University Health Network, and professor in the Medical Biophysics Department of the University of Toronto, at the American Association for Cancer Research (AACR) Annual Meeting 2025. He discussed the challenges of working with clinical data, how AI/ML data models are helping and how the use of virtual biopsies could expand access to precision oncology.

Karen Steward, PhD (KS):

What are some of the greatest challenges in collecting, collating and interrogating clinical data in a useful and meaningful way?


Benjamin Haibe-Kains, PhD (BHK):

Beyond the critical hurdle of data governance, namely, obtaining approvals to access large-scale clinical datasets, the major challenges revolve around the heterogeneity and accessibility of clinical data. Extracting structured data from electronic medical records (EMRs), aligning them with standardized ontologies and integrating information from unstructured sources such as clinical notes, lab tests, radiology and pathology reports remain challenging tasks. Unstructured data, in particular, pose significant difficulties. However, recent advances in large language models (LLMs) and agentic AI systems offer promising solutions to automate and scale the curation of these complex datasets. Once curated, these rich clinical data can be further augmented with high-dimensional modalities such as medical imaging and molecular profiles (genomics and transcriptomics). Together, these multimodal datasets form a strong foundation for developing clinical tools to improve early diagnosis, guide treatment decisions and enable more precise monitoring.
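To make the curation step concrete, here is a minimal sketch of the kind of LLM-driven extraction Haibe-Kains describes: pulling structured fields out of an unstructured clinical note. The call_llm function is a hypothetical stand-in for whichever model endpoint is used (stubbed here with a canned response so the example runs), and the field schema is illustrative rather than a standard ontology.

```python
# Minimal sketch: structuring an unstructured clinical note with an LLM.
# `call_llm` is a placeholder; the field schema is illustrative only.
import json

NOTE = """58F with stage III NSCLC, EGFR exon 19 deletion.
Started osimertinib 80 mg daily on 2024-03-12. CT shows partial response."""

PROMPT = f"""Extract these fields from the clinical note as JSON:
diagnosis, stage, biomarkers (list), treatment, start_date, response.
Use null for anything not stated.

Note:
{NOTE}"""

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call; returns a canned response
    # so the sketch runs end to end.
    return json.dumps({
        "diagnosis": "NSCLC", "stage": "III",
        "biomarkers": ["EGFR exon 19 deletion"],
        "treatment": "osimertinib 80 mg daily",
        "start_date": "2024-03-12", "response": "partial response",
    })

record = json.loads(call_llm(PROMPT))  # structured record, ready for ontology mapping
print(record["biomarkers"])            # ["EGFR exon 19 deletion"]
```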



KS:

Can you discuss some of the key advantages of using spatial analysis over individual information sources?


BHK:

It is now well-established that tumors are highly heterogeneous, composed of multiple cellular clones interacting dynamically with the tumor microenvironment. A growing area of research focuses on understanding how spatially organized niches of cancer cells, often referred to as ecotypes, influence tumor progression and therapeutic response. While we have seen significant advances from bulk to single-cell sequencing, spatial analysis represents the next critical frontier. By preserving the physical context of cells within tissues, spatial profiling allows us to map interactions between cancer cells, stromal components and immune infiltrates with unprecedented resolution. This added dimension of information is essential for uncovering clinically relevant patterns that are invisible when data are analyzed in isolation, ultimately enhancing our ability to predict outcomes and design targeted interventions.
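As a toy illustration of what that spatial dimension adds, the sketch below summarizes each tumor cell's local neighborhood from cell centroids and type labels, the kind of output an imaging-based spatial assay produces. The coordinates, cell types and 50 µm radius are synthetic assumptions, not values from any particular study.

```python
# Minimal sketch: neighborhood composition around tumor cells, a crude
# proxy for the spatially organized niches ("ecotypes") described above.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
n = 500
coords = rng.uniform(0, 1000, size=(n, 2))               # cell centroids (µm)
types = rng.choice(["tumor", "stroma", "immune"], size=n)

tree = cKDTree(coords)
neighbors = tree.query_ball_point(coords, r=50)          # indices within 50 µm

# Fraction of immune cells around each tumor cell: "hot" vs "cold" local
# niches that bulk (and even dissociated single-cell) profiling cannot see.
for i in np.flatnonzero(types == "tumor")[:5]:
    nbrs = [j for j in neighbors[i] if j != i]
    frac = np.mean(types[nbrs] == "immune") if nbrs else 0.0
    print(f"tumor cell {i}: {frac:.2f} immune within 50 µm")
```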


KS:

How can data models be utilized to help improve data quality and usefulness?


BHK:

In the context of clinical data, AI models can significantly enhance data quality and utility through deep structuring and standardization. By extracting and harmonizing information across diverse sources, such as clinical notes, lab results, tumor profiles and circulating tumor DNA, AI enables the creation of richly contextualized patient datasets. Moreover, the integrative power of AI, particularly in multimodal data analysis, allows for the identification of convergent biological or clinical patterns. This not only improves interpretability but also helps flag inconsistencies, outliers or potential data entry errors. These capabilities open the door to automated quality control systems and scalable data aggregation, ultimately strengthening the foundation for robust clinical research and precision medicine.
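A minimal sketch of the automated quality-control idea, assuming a harmonized patient table is already in hand: an isolation forest flags records whose combined features look inconsistent with the rest of the cohort. The features, cohort and contamination rate are illustrative, not a validated QC pipeline.

```python
# Minimal sketch: flagging candidate data-entry errors in a toy cohort.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Toy cohort: columns are [age (years), hemoglobin (g/dL), tumor size (mm)].
cohort = rng.normal([62, 13.5, 30], [10, 1.5, 12], size=(300, 3))
cohort[7] = [62, 1.35, 30]          # plausible decimal-point error in hemoglobin

model = IsolationForest(contamination=0.01, random_state=0).fit(cohort)
flags = model.predict(cohort)       # -1 marks candidate outliers for human review
print(np.flatnonzero(flags == -1))  # should include record 7
```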



KS:

Can you give some examples where data models have been used to predict missing data successfully?


BHK:

There are several compelling examples where AI/ML models have been used to impute missing data effectively in cancer research and clinical practice. Large language models have been applied to electronic health records to fill in missing variables, such as lab test results or medication histories, by learning patterns from similar patients across large cohorts (e.g., Med-BERT). Where metadata or clinical annotations of radiological and pathological images are missing, AI models have been used to infer tumor grade, molecular subtype or biomarker status (e.g., to predict the overall survival of patients diagnosed with brain tumors). Deep learning models have been used to infer missing gene expression values in RNA-seq datasets, leveraging co-expression patterns and network-based relationships; for example, variational autoencoders and matrix completion methods can reconstruct transcriptomic profiles with high accuracy (e.g., stAI). Companies are also leveraging millions of gene expression profiles to generate, purely computationally, the transcriptomic profiles of healthy and cancerous cells resulting from user-specified perturbations (e.g., gene knockdown).
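For readers curious about matrix completion, here is a minimal sketch using iterative low-rank SVD (in the spirit of softImpute-style methods) to fill in missing expression values. The rank, synthetic data and iteration count are illustrative; published methods add regularization and model selection.

```python
# Minimal sketch: impute missing entries of a low-rank expression matrix
# by alternating a rank-5 SVD fit with re-imposing the observed values.
import numpy as np

rng = np.random.default_rng(2)
true = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 200))  # rank-5 "expression"
mask = rng.random(true.shape) < 0.2                           # 20% missing
X = np.where(mask, np.nan, true)

filled = np.where(mask, 0.0, X)                # start missing entries at 0
for _ in range(50):
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = (U[:, :5] * s[:5]) @ Vt[:5]     # rank-5 reconstruction
    filled = np.where(mask, low_rank, X)       # keep observed values fixed

err = np.abs(filled[mask] - true[mask]).mean()
print(f"mean absolute imputation error: {err:.4f}")
```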



KS:

For those who may be unfamiliar with the concept, can you explain what virtual biopsies are?


BHK:

Virtual biopsies refer to non-invasive or minimally invasive methods that use imaging and computational analysis to characterize tumors in ways that traditionally required tissue sampling. Instead of extracting a physical tissue sample, virtual biopsies leverage data from radiological scans (e.g., MRI, CT or PET) or liquid biopsies (e.g., circulating tumor DNA) and apply advanced AI or ML algorithms to infer molecular, histological or prognostic features of the tumor.


The goal is to replicate the insights gained from conventional biopsies while avoiding the risks, limitations and sampling bias associated with invasive procedures. Virtual biopsies are especially valuable for capturing tumor heterogeneity, monitoring disease progression over time and guiding personalized treatment decisions without repeated surgeries or biopsies. Paverd et al. (2024) provide an excellent review of this topic.
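To sketch the virtual-biopsy workflow at its simplest, the example below derives a few radiomics-style intensity features from tumor regions and trains a classifier to call a molecular label non-invasively. The images, features and labels are all synthetic; real pipelines rely on curated feature sets (e.g., PyRadiomics) and rigorous clinical validation.

```python
# Minimal sketch: image-derived features -> molecular label, the core
# pattern behind virtual biopsies. All data here are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

def radiomic_features(roi):
    # A few first-order intensity statistics over a tumor ROI.
    return [roi.mean(), roi.std(), np.median(roi),
            np.percentile(roi, 90) - np.percentile(roi, 10)]

labels = rng.integers(0, 2, size=200)            # 0 = wild type, 1 = mutant
# Mutant tumors are simulated with slightly shifted, noisier intensities.
rois = [rng.normal(100 + 8 * y, 15 + 3 * y, size=(32, 32)) for y in labels]
X = np.array([radiomic_features(r) for r in rois])

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, labels, cv=5).mean())  # cross-validated accuracy
```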



KS:

What impact might virtual biopsies have on cancer diagnosis and monitoring?


BHK:

Virtual biopsies have the potential to transform cancer diagnosis and monitoring by enabling non-invasive, longitudinal and comprehensive assessments of tumors. By applying AI models to imaging data and/or circulating biomarkers to infer molecular and histological features, virtual biopsies could allow clinicians to detect tumors earlier, monitor treatment response in real time and capture spatial and temporal heterogeneity more effectively. As a result, virtual biopsies can support more personalized treatment strategies, reduce the need for invasive procedures and expand access to precision oncology, especially in settings where traditional biopsies are impractical.