The Hallucination Muse for Medicine: When LLM Errors Spark Biomedical Discovery

Ryan Mehra, Anshoo Mehra

September 27, 2025

https://doi.org/10.69831/e59eafc04e

This preprint reports new research that has not been peer-reviewed or revised at the time of posting

Copyright © 2025 Mehra, Mehra.
Categories
Biology, Engineering, Health Sciences
Abstract

Large-language-model (LLM) “hallucinations” are usually condemned as reliability faults because they generate confident yet false statements [1]. Emerging research, however, finds that such confabulations mirror divergent thinking and can seed novel hypotheses [2, 3]. This study, conducted by independent investigators with no physical laboratory but unlimited API access to OpenAI models (4o, 4o-mini, 4.1, 4.1-mini), tests whether deliberately elicited hallucinations can accelerate medical innovation. We target three translational aims: (i) epistemological creativity for medicine, where speculative errors inspire fresh research questions; (ii) generative biomedical design, exemplified by hallucinated protein and drug candidates later validated in vitro [4]; and (iii) speculative clinical engineering, where imaginative missteps suggest prototypes such as infection-resistant catheters [5]. A controlled prompt-engineering experiment compares a truth-constrained baseline to a hallucination-promoting condition across the four OpenAI models. Crucially, all outputs are scored for novelty and prospective clinical utility by an autonomous LLM-based “judge” system, adapted from recent self-evaluation frameworks [6], instead of human experts. The LLM judge reports that hallucination-friendly prompts yield 2–3× more ideas rated simultaneously novel and potentially useful, albeit with increased low-quality noise. These findings illustrate a cost-effective workflow in which consumer-accessible LLMs act both as idea generator and evaluator, expanding the biomedical creative search space while automated convergence techniques preserve epistemic rigor, reframing hallucination from flaw to feature in at-home medical R&D.
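As a rough illustration of the workflow the abstract describes, the sketch below pairs a truth-constrained system prompt with a hallucination-promoting one, samples ideas from an OpenAI model, and scores each idea with an LLM judge. This is a minimal sketch assuming the OpenAI Python SDK; the prompt wording, scoring rubric, and the high-value threshold are illustrative placeholders, not the authors' actual materials.

```python
# Minimal sketch of the two-condition generation + LLM-judge workflow.
# Prompt texts, rubric, and thresholds are placeholders, not the
# authors' actual prompts.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-4.1", "gpt-4.1-mini"]

BASELINE_SYSTEM = (  # truth-constrained condition (placeholder wording)
    "You are a careful biomedical researcher. Propose only hypotheses "
    "supported by established literature; state uncertainty explicitly."
)
CREATIVE_SYSTEM = (  # hallucination-promoting condition (placeholder wording)
    "You are an unconstrained biomedical visionary. Propose bold, "
    "speculative hypotheses even if unsupported by current evidence."
)

def generate_ideas(model: str, system_prompt: str, task: str, n: int = 5) -> list[str]:
    """Sample n candidate ideas for one task under one prompt regime."""
    resp = client.chat.completions.create(
        model=model,
        temperature=1.0,
        n=n,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Propose one new idea for: {task}"},
        ],
    )
    return [choice.message.content for choice in resp.choices]

def judge_idea(idea: str, judge_model: str = "gpt-4o") -> dict:
    """Score an idea for novelty and prospective clinical utility (1-10)."""
    resp = client.chat.completions.create(
        model=judge_model,
        temperature=0.0,  # keep judging as deterministic as possible
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a strict biomedical reviewer. "
             'Reply as JSON: {"novelty": 1-10, "utility": 1-10}.'},
            {"role": "user", "content": idea},
        ],
    )
    return json.loads(resp.choices[0].message.content)

ideas = generate_ideas("gpt-4o-mini", CREATIVE_SYSTEM,
                       "slowing Alzheimer's disease progression")
scores = [judge_idea(i) for i in ideas]
# Count an idea as "high-value" only if the judge rates it both novel
# and useful (threshold of 7 is an arbitrary placeholder).
high_value = [i for i, s in zip(ideas, scores)
              if s["novelty"] >= 7 and s["utility"] >= 7]
```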


Scientific Feedback

Chineme Edger Nwatu | eiRxiv Reviewer | Western Illinois University

Comments to Author

First, I want to congratulate you on a bold and highly creative manuscript. You are exploring an unusual and thought-provoking concept by reframing hallucinations in terms of LLMs from errors into a potential tool for biomedical innovation. Your work demonstrates curiosity, initiative, and a genuine passion for pushing ideas forward safely and rigorously. I especially appreciate the care you took in documenting your methodology and providing reproducible code. This is a strong submission that reflects independent thinking and a drive to share novel insights.

At the same time, there are areas where additional clarity, precision, and attention to safeguards could make the manuscript even stronger. My feedback below is intended to help you refine your arguments and ensure readers, particularly those at the pre-college or early-research level, can appreciate both the potential and the limitations of this approach.

Manuscript Summary

This paper investigates whether deliberately encouraging hallucinations from OpenAI models (GPT-4o, GPT-4o-mini, GPT-4.1, GPT-4.1-mini) can accelerate biomedical idea generation. The study compares a truth-constrained prompt regime to a hallucination-promoting prompt regime across three task domains: hypotheses for Alzheimer’s disease, antimicrobial therapies, and infection-control devices. The outputs were automatically scored for novelty and usefulness using an LLM-as-judge framework. Below are the key findings I deduced:

  • Hallucination-promoting prompts generated 2–3 times more ideas rated as both novel and useful.

  • Noise remained minimal, at less than 2%.

  • Representative outputs included plausible biomedical innovations (e.g., a self-sterilizing catheter) alongside speculative or low-utility concepts (e.g., “quantum microtubule dysfunction”).

The authors conclude that, with careful filtering and oversight, controlled hallucinations can act as a creative muse, especially for low-resource or early-stage research contexts.

Science Comments

  1. The central idea, that hallucinations can generate valuable biomedical ideas, is clear, but consider stating it explicitly in the Introduction rather than embedding it within philosophical discussion. This will help readers focus on the research question quickly.

  2. Reliance on a single LLM-as-judge as the evaluation method is a limitation, as acknowledged in the Discussion. I recommend emphasizing early in the manuscript that automated judgments may diverge from human expert assessment, and highlighting this as an area for future validation.

  3. The baseline prompt serves as a control, but the absence of any human-rated validation leaves uncertainty about whether the useful ideas would actually be actionable in medicine. Consider touching on this more in the Limitations section.

  4. Reporting paired t-tests is appropriate; including exact p-values alongside p < 0.01 would improve transparency (a minimal example is sketched after this list).

  5. You rightly acknowledge the risks of harmful hallucinations. Please consider adding concrete suggestions for human oversight or other safeguards in future work so readers can clearly understand how to implement these methods safely.
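To illustrate point 4, here is a minimal sketch of reporting an exact p-value from a paired t-test, assuming paired counts of high-value ideas per run under each regime; the data values below are hypothetical placeholders, not the paper's results.

```python
# Hypothetical illustration of reporting an exact p-value for a paired
# t-test, rather than only "p < 0.01". Data below are made-up placeholders.
from scipy import stats

baseline = [3, 2, 4, 3, 2, 3]   # high-value ideas per run, truth-constrained
creative = [7, 6, 9, 8, 5, 7]   # high-value ideas per run, hallucination-promoting

t_stat, p_value = stats.ttest_rel(creative, baseline)
print(f"t({len(baseline) - 1}) = {t_stat:.2f}, p = {p_value:.4f}")
```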

Presentation Comments

  1. First, consider streamlining the first three paragraphs, as they cover similar ground. This will help readers reach the core research question faster.

  2. Figures are referenced clearly. Ensure captions are fully descriptive (for example, “Proportion of high-value ideas under each prompt regime”) to help first-time readers who may want to skim the figures first to get a broad overview.

  3. The manuscript is well-structured, but some terms (“epistemological creativity,” “convergence stage”) may be challenging for pre-college readers. Simplifying phrasing while maintaining rigor will make the manuscript more accessible. Also, consider replacing terms like “paradoxically,” “confabulation,” and “epistemic guardrails” with simpler alternatives to improve readability.

Figure & Table Comments

  1. Figures 1–4 are informative, but captions should explain what is plotted and why it matters, not just label the figure.

  2. Adding distributions or error bars for novelty/usefulness scores would let readers assess variance and robustness and would improve data transparency (see the sketch after this list).

  3. Table 1 is useful as is, but you could expand it to include one representative idea per task under both baseline and creative prompts for easier comparison.
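To illustrate point 2, a minimal sketch of adding standard-deviation error bars to a per-regime score summary follows, assuming per-run mean novelty scores are available; all values here are hypothetical placeholders.

```python
# Hypothetical sketch: mean novelty scores with standard-deviation error
# bars for each prompt regime. Values are illustrative placeholders.
import numpy as np
import matplotlib.pyplot as plt

regimes = ["Baseline", "Creative"]
novelty_scores = [np.array([4.1, 3.8, 4.5, 4.0]),   # per-run means, baseline
                  np.array([7.2, 6.8, 7.9, 7.4])]   # per-run means, creative

means = [s.mean() for s in novelty_scores]
stds = [s.std(ddof=1) for s in novelty_scores]  # sample standard deviation

plt.bar(regimes, means, yerr=stds, capsize=6)
plt.ylabel("Mean novelty score (1-10)")
plt.title("Novelty by prompt regime (error bars: ±1 SD)")
plt.savefig("novelty_error_bars.png", dpi=150)
```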

Overall Impression

This manuscript is innovative and exciting, demonstrating how LLM hallucinations can be reframed as a source of creative inspiration for biomedical research. With some refinements around validation, human oversight, and accessibility, it has strong potential to guide and inspire pre-college and early-stage researchers. I encourage the authors to continue developing this work, as it provides a unique lens for thinking about AI creativity in scientific discovery.

Copyright to the Scientific Reviewer under CC-BY-4.0

A scientist with subject-specific expertise provided this feedback. Constructive feedback plays a key role in the scientific process because it allows researchers to learn from other scientists, be encouraged, and refine their ideas, research, and presentation.