Chat-MD? Scientists Put AI Models’ Medical Skills to the Test

In a new study published in Nature, scientists from Google Research have introduced a novel benchmark, known as MultiMedQA, for evaluating the ability of large language models (LLMs) to present accurate answers to medical questions. The study also highlights the development of Med-PaLM, an LLM specifically designed for tackling medical questions. Nonetheless, the team has said that significant hurdles will have to be cleared before LLMs can be reliably consulted for medical advice.
An AI a day…
Accurate medical advice is highly sought after but hard (or at least expensive) to access. This has led to the rise of patient-led online medical advice sites, which can lead to inaccurate diagnoses and a brain-flattening load of hypochondria (that headache probably isn’t brain cancer). The breadth and specificity of information available through LLMs like ChatGPT might make them seem like an attractive alternative, but existing models are prone to generating plausible yet incorrect medical information or harboring biases that can unintentionally accentuate health inequalities.
To address this, researchers have been working on methods to assess the accuracy of LLMs’ medical knowledge, but existing benchmarks remain sorely lacking – even if an LLM can pass a multiple-choice medical exam, it might flounder in the face of real-world medical queries.
Scientists from Google Research set out to build a better benchmark of AI clinical knowledge. They devised MultiMedQA, a benchmark that merges six existing datasets covering professional medical practice, research and questions from consumers. The team also added a new resource, HealthSearchQA, a dataset of 3,173 commonly searched online medical questions.
Tuning up an AI
Following this, the team analyzed the performance of two LLMs against the benchmark: PaLM, a 540-billion-parameter LLM, and its instruction-tuned variant, Flan-PaLM. The latter achieved stellar performance on several of the datasets and proved particularly adept on MedQA, whose questions are drawn from US Medical Licensing Exam-style material, surpassing the previous top-performing LLMs by over 17%.
But like a student doctor who has spent too much time reading textbooks and not enough speaking to patients, Flan-PaLM’s usefulness nosedived once it left the exam hall.
When asked to give long-form answers to consumers’ online health queries, however, Flan-PaLM fared far less well: a panel of clinicians judged its responses to be aligned with medical consensus only 61.9% of the time. Alarmingly, nearly a third (29.7%) of the LLM’s answers were assessed as potentially leading to harmful outcomes.
To address these significant shortcomings, the team turned to a technique called instruction prompt tuning, presented as an efficient way of adapting general-purpose LLMs to specialized domains.
This tuning produced an adapted model, Med-PaLM, which showed promise in its initial evaluations. Clinicians judged 92.6% of Med-PaLM’s long-form answers to align with scientific consensus, roughly on par with a comparison set of answers written by human clinicians. Just 5.8% of Med-PaLM’s responses were rated as potentially harmful; if that still seems a little high to you, know that the equivalent figure for human physicians was very similar at 6.5%.
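For readers curious about the mechanics, the sketch below illustrates the general idea behind prompt tuning in a few lines of PyTorch. It is a simplified, generic soft-prompt-tuning loop, not the study’s exact recipe: a small set of learnable “prompt” vectors is trained and prepended to the model’s input while the base model’s weights stay frozen. The toy model, dimensions and training data here are illustrative placeholders.

```python
# Minimal sketch of soft prompt tuning (assumed setup; not the paper's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_MODEL, N_SOFT = 100, 32, 8   # toy vocabulary size, embedding width, soft-prompt length

class ToyFrozenLM(nn.Module):
    """Stand-in for a pretrained LLM whose weights stay fixed during tuning."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, dim_feedforward=64, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, input_embeds):
        return self.head(self.body(input_embeds))

model = ToyFrozenLM()
for p in model.parameters():          # freeze the base model: only the soft prompt will learn
    p.requires_grad = False

# The learnable "soft prompt": a handful of embedding vectors prepended to every input.
soft_prompt = nn.Parameter(torch.randn(N_SOFT, D_MODEL) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def tuning_step(token_ids, target_ids):
    """One prompt-tuning step on a batch of (input, reference answer) token ids."""
    tok_embeds = model.embed(token_ids)                      # (batch, seq, d_model)
    prompt = soft_prompt.unsqueeze(0).expand(token_ids.size(0), -1, -1)
    logits = model(torch.cat([prompt, tok_embeds], dim=1))   # prepend the soft prompt
    logits = logits[:, N_SOFT:, :]                           # score only the real tokens
    loss = F.cross_entropy(logits.reshape(-1, VOCAB), target_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                          # gradients reach only soft_prompt
    optimizer.step()
    return loss.item()

# Toy usage: random token ids stand in for exemplar question/answer pairs.
batch = torch.randint(0, VOCAB, (4, 16))
print(tuning_step(batch, batch))
```

Because only the small prompt component is trained, this kind of adaptation needs just a handful of exemplar answers rather than large-scale retraining, which is part of what makes it far cheaper than fully fine-tuning a 540-billion-parameter model.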
Nevertheless, the authors emphasize the need for further evaluations. They underline that significant advances must be made before LLMs can be deemed suitable for clinical use, including work to root out underlying errors in the models’ training. “Additional research will be needed to assess LLMs used in healthcare for homogenization and amplification of biases and security vulnerabilities inherited from base models,” they conclude in their paper.
Reference: Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023:1-9. doi:10.1038/s41586-023-06291-2
This article is a rework of a press release issued by Springer Nature. Material has been edited for length and content.