Your Favorite AI Chatbot Might Be Exaggerating Scientific Findings
Overgeneralization by AI chatbots misrepresents research findings, potentially with dangerous consequences.

Chatbots driven by artificial intelligence (AI) have the potential to be a powerful tool for supporting scientific literacy, as they can quickly distill complex scientific papers into more easily digestible summaries.
However, the use of AI tools for scientific summarization remains controversial. In a recent Nature poll of more than 5,000 academics worldwide, 33% believed it would always be ethically inappropriate for a researcher to upload articles they wanted to cite into an AI tool, ask it to generate a summary and then use that summary in their research paper. While the majority found this broadly acceptable, 31% would only approve if the AI use was disclosed, with a further 19% wanting the AI use and any related prompts to be disclosed.
This concern over AI-generated summary text is not just limited to ethical debates between researchers. There are also real concerns that AI-generated summaries of scientific research could omit key details and/or overgeneralize research findings, leading to research being inaccurately represented to the general public and potentially resulting in real-world harm.
Now, a new study by researchers Dr. Uwe Peters, assistant professor at Utrecht University, and Dr. Benjamin Chin-Yee, a hematologist at Western University and PhD candidate at the University of Cambridge, has put AI chatbots to the test.
The pair tested how well 10 of the world’s most prominent large language model (LLM) chatbots – including ChatGPT, Claude and DeepSeek – can summarize abstracts and articles from top science journals. After comparing a total of 4,900 AI-generated summaries with the original scientific texts and with human-written summaries, Peters and Chin-Yee found instances of overgeneralization from the chatbots in up to 73% of cases.
To learn more about this overgeneralization problem, what might be causing it and whether users of AI tools can do anything to combat it, Technology Networks spoke with study author Dr. Uwe Peters.
Why is it so crucial for scientific writing to be exacting and precise?
I think the key problem is that inaccurate summaries of scientific research might lead readers to form false beliefs about the findings, which can cause real-world harm.
For example, if a summary of a clinical trial fails to mention that a drug was tested only on males, a general practitioner reading the summary might incorrectly assume the results apply to all patients, including females. This can be dangerous, as some drugs affect men and women differently.
Take the sleep aid Ambien™. Studies have found that, due to metabolic differences between the sexes, women who take the same dosage of Ambien as men at night may retain more of the drug in their systems the following morning – sometimes enough to impair their ability to drive a car.
Which AI models did you look at in your analysis and how did you test them?
We tested ten different AI models: GPT-3.5 Turbo, GPT-4 Turbo, LLaMA 2 70B, Claude 2, ChatGPT-4o, ChatGPT-4.5, LLaMA 3.3 70B Versatile, Claude 3.5 Sonnet, Claude 3.7 Sonnet and DeepSeek.
To evaluate them, we asked each model to summarize scientific abstracts and full-length research articles. We then compared the conclusions in the original texts with those in the corresponding LLM-generated summaries. Specifically, we looked for shifts from quantified claims (e.g., “75% of Dutch students with obsessive-compulsive disorder [OCD] struggle with attention-deficit/hyperactivity disorder [ADHD]”) to unquantified, generic claims (e.g., “students with OCD struggle with ADHD”), or from past tense statements (e.g., “drug X was effective”) to present tense statements (e.g., “drug X is effective”).
When models made these kinds of shifts, they were inaccurate; they were producing overgeneralizations – statements broader than what the original text supported.
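To make these checks concrete, a minimal sketch of such a shift detector might look like the Python snippet below. The regular expressions, verb lists and function name are illustrative placeholders, not the checks used in the study:

```python
import re

# Illustrative heuristic only -- not the authors' actual analysis pipeline.
# It flags the two kinds of shift described above: a quantified claim in the
# original that becomes an unquantified generic in the summary, and a
# past-tense finding that is restated in the present tense.

QUANTIFIER = re.compile(r"\d+(\.\d+)?\s*%|\b(most|some|many|several|a majority)\b", re.I)
PAST = re.compile(r"\b(was|were|did|struggled|improved|reduced)\b", re.I)
PRESENT = re.compile(r"\b(is|are|struggles?|improves?|reduces?)\b", re.I)

def generalization_shifts(original: str, summary: str) -> dict:
    """Report which kinds of generalization shift appear between the two texts."""
    quantifier_dropped = bool(QUANTIFIER.search(original)) and not QUANTIFIER.search(summary)
    past_to_present = (
        bool(PAST.search(original))
        and bool(PRESENT.search(summary))
        and not PAST.search(summary)
    )
    return {"quantifier_dropped": quantifier_dropped, "past_to_present": past_to_present}

original = "75% of Dutch students with OCD struggled with ADHD."
summary = "Students with OCD struggle with ADHD."
print(generalization_shifts(original, summary))
# {'quantifier_dropped': True, 'past_to_present': True}
```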
We also directly compared LLM-generated summaries with human-written summaries of the same articles. Using regression analyses, we found that the LLM-generated summaries were about five times more likely to contain overly broad conclusions than the human-authored ones.
Your study found that newer versions of AI chatbots tended to perform worse in generalization accuracy than older models, with the exception of ChatGPT-4.5 (which was still in development at the time of the study). What can this tell us about why chatbots are so susceptible to overgeneralization?
We don’t yet have a clear answer as to why new models performed worse. But it might be that these models prioritize generating plausible-sounding, confident responses over precise and cautious ones to appear more helpful.
Newer models often undergo extensive fine-tuning based on human feedback before their wider release. During this process, responses that sound more confident (or more broadly, universally relevant) may be rated more favorably by users than more narrowly framed or tentative ones. As a result, to seem more helpful, newer models may develop a stronger tendency to overgeneralize claims in their summaries – making them appear more informative or relevant, even when the evidence does not support such conclusions. In other words, their perceived helpfulness may come at the cost of accuracy through overgeneralization.
Can using better chatbot prompts help negate this issue? For example, by explicitly asking for accuracy?
Yes, thinking carefully about which prompt to use could help. However, we found that explicitly asking LLMs to avoid inaccuracies actually increased the likelihood of overgeneralized summaries. In some cases, this “accuracy prompt” doubled the risk of overgeneralization.
This counterintuitive result might reflect an “ironic rebound” effect, where directing the model to avoid a behavior inadvertently increases its occurrence, or it might simply trigger the model to fall back on familiar patterns that sound authoritative. So, simply prompting for accuracy might not reliably mitigate the problem and may even exacerbate it. More research on which prompts do and don’t work is needed before we can give a confident answer.
What can be done to combat this oversimplification problem?
Here are some suggestions that we also touch on in our paper. First, using low-temperature settings (such as temperature 0) when generating LLM summaries via an application programming interface (API) can help, as lower temperatures reduce randomness and make outputs more faithful to the source material.
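As a rough illustration of what this looks like in practice, the snippet below requests a summary at temperature 0 using the OpenAI Python SDK; other providers expose an equivalent temperature parameter. The model name, prompt and abstract text are placeholders, not taken from the study:

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Placeholder abstract; in practice this would be the original paper's text.
abstract_text = "75% of Dutch students with OCD struggled with ADHD in our sample."

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,  # less randomness, so the output stays closer to the source text
    messages=[
        {"role": "system", "content": "Summarize the abstract without going beyond its stated claims."},
        {"role": "user", "content": abstract_text},
    ],
)
print(response.choices[0].message.content)
```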
Second, users might prefer models like Claude, which our study found to be more accurate and less prone to overgeneralization than other leading models.
Third, we think that science communicators, researchers, students, and educators should critically evaluate AI-generated summaries and compare key aspects of them against the original research – especially when precision and accuracy are crucial – before using, sharing, or publishing them.
Finally, LLM developers could implement and test their models with benchmarking frameworks like the one we introduce in the study, which involves scanning LLM outputs for overgeneralizations (e.g., shifts from quantified to unquantified claims [generics], or shifts to the present tense that were not in the original text), and then assigning a performance score to the model.
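A scoring loop in that spirit can be sketched in a few lines of Python. The check below is a deliberately crude stand-in for the study’s framework, flagging only dropped quantifiers and scoring a model by the share of summaries it leaves unflagged:

```python
import re

# Illustrative scoring loop, not the paper's exact benchmark: flag a summary
# as overgeneralized if the original conclusion was quantified but the
# summary's is not, then score the model by the share of faithful summaries.

QUANTIFIER = re.compile(r"\d+(\.\d+)?\s*%|\b(most|some|many|several)\b", re.I)

def is_overgeneralized(original: str, summary: str) -> bool:
    return bool(QUANTIFIER.search(original)) and not QUANTIFIER.search(summary)

def generalization_score(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (original, summary) pairs with no detected overgeneralization."""
    faithful = sum(1 for original, summary in pairs if not is_overgeneralized(original, summary))
    return faithful / len(pairs)

pairs = [
    ("75% of Dutch students with OCD struggled with ADHD.",
     "Students with OCD struggle with ADHD."),
    ("The drug reduced symptoms in 40% of participants.",
     "The drug reduced symptoms in 40% of participants."),
]
print(generalization_score(pairs))  # 0.5: one of the two summaries overgeneralizes
```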
More generally, we think that LLMs offer clear benefits in simplifying complex scientific content, because this can make science more accessible to people worldwide. But this simplification should be approached with caution. The problem we uncovered isn’t necessarily about fabricating facts; it’s about stretching true ones beyond their warranted limits. As LLMs become more integrated into education, journalism, and healthcare, it’s important to develop and follow best practices that ensure that these AI tools strike the right balance between accuracy and accessibility of scientific information, especially when lives or policies may be affected by misinterpretation of scientific research.