Rethinking Medical LLM Evals: A Patient Lens
- Maher Khelifi

TL;DR
Medical LLM evaluations must expand beyond technical accuracy to include a patient lens.
Without this lens, AI risks automating today’s healthcare frustrations.
Patient-centered evals should measure Integrity, Clarity, Actionability, and Agency.
The goal is patient agency through understanding — not just simplification.
Most medical LLMs are getting very good at supporting clinical documentation and administrative workflows. What’s far less clear is whether they are helping patients in ways that truly matter. And this gap points to a deeper challenge: the industry still struggles to build experiences people can genuinely trust with their lives.
Simply layering advanced models onto legacy systems—without a patient perspective—risks digitizing the same confusion, fragmentation, and frustration patients already face.
Today’s evaluations largely focus on benchmarks for clinical knowledge, reasoning, and summarization. These metrics are valuable, but incomplete. They tell us whether a system is technically competent, not whether it is understandable, trustworthy, or empowering in real patient contexts.
As Dhaka has noted in recent work on AI evaluation, automated metrics often miss the nuance of human experience. Without grounding evaluation in user research, we risk optimizing for what is easy to measure rather than what people actually need. To build AI that genuinely empowers rather than merely processes information, we must close the gap between technical performance and human perception.
This article proposes augmenting medical LLM evaluations with a patient-centered framework focused on understanding, trust, and patient agency.
How Systems Fail Patients
Healthcare information systems have historically been designed around clinical workflows, documentation standards, and billing requirements, which often results in patient communications that are dense, technical, and difficult to navigate.
Research consistently shows that this gap is real and consequential. For example, when researchers looked at trauma discharge summaries, they found that only 24% of patients could fully understand them, largely because the documents were written well above most people’s reading levels (Weiss et al., 2016). This reflects a longer-standing divide between patients and medical institutions. As Foucault (1973) noted in The Birth of the Clinic, professional language and classification systems can shape power dynamics and limit patient agency.
Acknowledging these structural shortcomings is essential if we want to move beyond technically “correct” systems toward experiences that feel coherent, empowering, and genuinely supportive for patients. This requires placing a patient-centered lens at the core of LLM evaluations — treating it not as an optional refinement, but as a critical prerequisite for building systems that people can truly rely on.
Testing the Waters of Patient-AI Interaction
To understand how AI could better support patients, we went directly to the source: the patients themselves. Through a series of in-depth interviews, we explored how people want to use AI to make sense of their After-Visit Summaries — the often 10+ page documents they receive after clinical encounters to guide next steps in their care.
Participants interacted with three prototype experiences representing a spectrum from low to high structure:
Open Chat (“The Empty Box”): A flexible, unstructured interface that allows for free-form questions and exploration.
Guided Chat: A semi-structured experience that used prompts and suggested questions to help patients navigate their health information.
Smart Dashboard: A highly organized, data-driven interface designed for rapid scanning and prioritization of key information.
We invited participants to engage with, reflect on, and compare each prototype, allowing us to examine how different levels of structure shaped comprehension, confidence, and perceived usefulness.
LLM Evals with a Patient Lens
To bridge the gap between technical performance and human perception, we analyzed patient feedback to understand what makes AI-generated information truly useful and accessible. From this analysis, four core quality pillars emerged.
We recommend expanding LLM evaluations to include these dimensions alongside traditional technical benchmarks: Data Integrity and Grounding, Sense-Making Clarity, Actionable Guidance, and Agency and Autonomy.
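To make this concrete, here is a minimal sketch of how the four pillars could sit alongside technical benchmarks in an eval harness. The rubric questions, the 1–5 scale, and the rater callable are illustrative assumptions for this article, not a published benchmark or a specific vendor API:

```python
# Sketch of a patient-lens eval rubric. Pillar questions, the 1-5 scale,
# and the rater interface are assumptions for illustration only.
from dataclasses import dataclass
from statistics import mean
from typing import Callable

PILLARS: dict[str, str] = {
    "data_integrity": "Is every statement grounded in the patient's actual record?",
    "sense_making_clarity": "Could a layperson follow this without medical training?",
    "actionable_guidance": "Are the next steps concrete, safe, and prioritized?",
    "agency_autonomy": "Can the patient still drill into detail and decide for themselves?",
}

@dataclass
class PillarScore:
    pillar: str
    score: int        # 1 (fails the pillar) to 5 (fully satisfies it)
    rationale: str    # why the rater gave this score

# A rater maps (rubric question, source document, model response) to a
# (score, rationale) pair. In practice this would be a trained human rater
# or an LLM-as-judge prompt; here it is left pluggable.
Rater = Callable[[str, str, str], tuple[int, str]]

def evaluate(source_doc: str, response: str, rate: Rater) -> list[PillarScore]:
    """Score one model response against all four patient-lens pillars."""
    return [
        PillarScore(pillar, *rate(question, source_doc, response))
        for pillar, question in PILLARS.items()
    ]

if __name__ == "__main__":
    # Dummy rater so the sketch runs end to end; swap in a real judge.
    dummy: Rater = lambda q, doc, resp: (3, "placeholder rationale")
    scores = evaluate("10-page after-visit summary...", "plain-language recap...", dummy)
    print(f"mean patient-lens score: {mean(s.score for s in scores):.1f}")
    for s in scores:
        print(f"{s.pillar}: {s.score} ({s.rationale})")
```

The point of the sketch is the shape, not the numbers: pillar scores complement accuracy benchmarks rather than replace them, and keeping a per-pillar rationale preserves the qualitative texture that aggregate metrics tend to flatten.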

Agency Through Cognitive Autonomy, Not Just Simplification
The future of personal health AI depends on its ability to support cognitive autonomy. In today’s fragmented healthcare system, LLMs have real potential to help patients navigate overwhelming amounts of medical information and make sense of complex care journeys.
But without a patient-centered lens, these systems risk doing the opposite — reducing agency rather than strengthening it. When AI is designed only to compress or summarize, it can quietly shift control away from patients. By adopting evaluation frameworks that prioritize data integrity, sense-making, and autonomy, we can ensure that AI does more than “shorten” reports — it helps people truly understand and take ownership of them.
The goal is not simplification, but the right cognitive balance: systems that offer a clear big-picture view for orientation and control, alongside the granular detail needed for confidence and informed decision-making.
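As a purely illustrative example of what that balance could look like at the data level, a response schema might pair a one-line orientation layer with grounded, expandable detail. The field names below are hypothetical, not drawn from any existing system:

```python
# Hedged sketch: encoding "big picture plus drill-down" in the response
# format itself. All field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class DetailItem:
    text: str          # granular explanation in plain language
    source_span: str   # verbatim excerpt from the record it is grounded in

@dataclass
class LayeredSummary:
    headline: str                   # one-sentence orientation for the patient
    key_actions: list[str]          # concrete next steps, in priority order
    details: list[DetailItem] = field(default_factory=list)  # expandable depth

summary = LayeredSummary(
    headline="Your wound is healing as expected; one follow-up is needed.",
    key_actions=["Book a follow-up within 2 weeks", "Continue antibiotics for 5 days"],
    details=[DetailItem("No signs of infection were found.",
                        source_span="Incision clean, dry, intact; no erythema.")],
)
```

Tying each detail back to a verbatim source span serves the Data Integrity pillar, while keeping depth one tap away rather than deleted serves Agency: the patient chooses how far to go, instead of the system choosing for them.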
This article was written with the assistance of AI, and the accompanying image was created using Nano Banana.
References
Foucault, M. (1973). The birth of the clinic: An archaeology of medical perception (A. M. Sheridan, Trans.). Vintage Books. (Original work published 1963)
Weiss, B. D., Brega, A. G., LeBlanc, W. G., Mabachi, N. M., Barnard, J., Albright, K., … Argenbright, K. (2016). Readability of patient discharge instructions with and without the use of electronically generated discharge summaries. Journal of Trauma and Acute Care Surgery, 81(5), 889–895. https://doi.org/10.1097/TA.0000000000001212
Dhaka. How UX research shapes AI evals. UXR @ Microsoft, Medium.