The Thinking Machine at the Slit Lamp
A clinician's evidence-based guide to prompting AI for differential diagnosis, workup planning, and treatment selection in ophthalmology
Two Ophthalmologists, One Red Eye, Two Answers
A junior resident types into an AI chatbot: "Red eye, photophobia, blurred vision: what could it be?"
Down the hall, an experienced attending opens the same tool and writes something very different.
The resident's AI responds with a generic list: conjunctivitis, dry eye, subconjunctival hemorrhage, allergic reaction, corneal abrasion. Not only is the correct diagnosis missing from the list; the conditions ranked most likely are the ones least consistent with the findings. A patient with anterior uveitis might walk out with antibiotic drops and a missed diagnosis.
The attending's AI responds with a ranked differential led by acute anterior uveitis (non-granulomatous, given the fine KPs), noting the elevated IOP as likely secondary to trabecular meshwork inflammation or steroid response if previously treated. It flags the posterior synechia as evidence of significant or recurrent inflammation and recommends a targeted workup including HLA-B27, ACE, chest X-ray, RPR/FTA-ABS, and consideration of sarcoidosis and ankylosing spondylitis given the patient's age and presentation.
The difference wasn't the AI. It was the prompt. The attending had configured the AI with an ophthalmic clinical persona, loaded every slit-lamp finding in a structured format, and asked for a ranked differential with supporting and refuting evidence. The resident typed a sentence. This difference, between a vague symptom query and a structured clinical prompt, is now supported by a substantial body of evidence, and it's what this guide is about.
Teaching the Machine to Think Like an Ophthalmologist
Before you show the AI a single slit-lamp finding, you need to answer a question most clinicians never think to ask: Who should this AI think it is?
In ophthalmology, you instinctively calibrate your consultations. You don't describe a fundus the same way to a retina specialist as you do to a primary care physician. You adjust the vocabulary, the assumed knowledge, the level of detail. An LLM needs the same calibration, but it can't infer it. You have to tell it.
The literature suggests three essential elements. First, its clinical role: "You are a fellowship-trained ophthalmologist with expertise in uveitis and ocular inflammation. You reason using a systematic anatomical approach." Second, its behavioral boundaries: "Always consider sight-threatening diagnoses first. Cite evidence when available. Flag diagnostic uncertainty explicitly." Third, the output structure: "Provide a ranked differential with supporting and refuting evidence, followed by a stepwise workup." Callens formalized this as the RTF (Role, Task, Format) framework, and research shows it materially changes the quality of clinical output.
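Under the RTF framing, the persona can live in a reusable system prompt so it stays consistent across consults. A minimal sketch in Python, reusing the wording above (the variable names and assembly pattern are illustrative assumptions, not a validated template):

```python
# Assemble an RTF (Role, Task, Format) system prompt from the three
# elements above. The assembly pattern is illustrative, not a standard.
ROLE = ("You are a fellowship-trained ophthalmologist with expertise in "
        "uveitis and ocular inflammation. You reason using a systematic "
        "anatomical approach.")
TASK = ("Always consider sight-threatening diagnoses first. Cite evidence "
        "when available. Flag diagnostic uncertainty explicitly.")
FORMAT = ("Provide a ranked differential with supporting and refuting "
          "evidence, followed by a stepwise workup.")

system_prompt = "\n\n".join([ROLE, TASK, FORMAT])
print(system_prompt)
```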
The Power of Persona
This isn't theoretical. Google DeepMind's AMIE system, the most rigorously evaluated clinical LLM to date, was built on exactly this principle. Optimized with clinical reasoning personas and structured diagnostic frameworks, AMIE outperformed unassisted clinicians in a randomized evaluation. Not because it was a fundamentally better model. Because it was better configured.
Now imagine applying this to your ophthalmic practice. The same AI that gives you a bland "conjunctivitis" when asked about a red eye can generate a nuanced differential considering scleritis, keratitis, uveitis, and acute angle-closure glaucoma once it knows it's supposed to think like an ophthalmologist.
How You Present the Case Matters
Just as a disorganized case presentation at grand rounds frustrates the attending and leads to muddled thinking, a disorganized prompt produces muddled AI output. In ophthalmology, this is especially critical because our findings are highly structured (laterality, anatomical location, grading scales, intraocular pressure, visual acuity) and omitting any element degrades the output dramatically.
The ophthalmic exam maps beautifully onto prompt design: it is already inherently structured, anatomically layered, quantified (IOP, VA, cell/flare grading), and lateralized. This translates directly into an effective prompt sequence: demographics → chief complaint → timeline → pertinent history → exam by anatomical layer → quantified measurements → specific clinical question. Ayoub et al. (2026) validated this approach directly: prompts that mimic the clinician's own case-presentation workflow produced human-comparable diagnostic rationales on real clinical data.
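That presentation order can be enforced with a simple fill-in template so no exam element is silently dropped. A sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class OphthalmicCase:
    # Hypothetical field names; the order mirrors a standard case
    # presentation: demographics -> complaint -> timeline -> history ->
    # exam by layer -> measurements -> question.
    demographics: str
    chief_complaint: str
    timeline: str
    pertinent_history: str
    exam_by_layer: str    # lids/conjunctiva -> cornea -> AC -> iris -> lens -> fundus
    measurements: str     # VA, IOP, cell/flare grade, laterality
    clinical_question: str

    def to_prompt(self) -> str:
        return "\n".join(
            f"{field.replace('_', ' ').title()}: {value}"
            for field, value in vars(self).items()
        )
```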
Before entering any clinical data into a consumer AI tool (ChatGPT, Claude, Gemini), you must de-identify all patient information. These tools are not HIPAA-compliant in their standard configurations. Remove names, DOBs, MRNs, and any identifiers. Never upload fundus photos or OCTs that contain patient metadata. Know your institution's policies.
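As a safety net before pasting a note, a scripted redaction pass can catch the obvious structured identifiers. A minimal sketch; the patterns are illustrative assumptions, and no regex list substitutes for institutional de-identification tooling plus a manual read-through (free-text names, for instance, will slip through):

```python
import re

# Illustrative patterns only; always re-read the output yourself
# before pasting it into any consumer AI tool.
REDACTIONS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),   # DOB / visit dates
    (re.compile(r"\bMRN[:#\s]*\d+\b", re.IGNORECASE), "[MRN]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(note: str) -> str:
    for pattern, token in REDACTIONS:
        note = pattern.sub(token, note)
    return note
```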
The Three Pillars of Ophthalmic Reasoning
Clinical decision-making in ophthalmology rests on three pillars: What does this eye have? (differential diagnosis), How do we confirm it? (diagnostic workup), and How do we treat it? (treatment). Each demands a different prompting approach. Let's work through all three with our uveitis patient.
Pillar 1: Building the Differential
How you ask the AI to reason is one of the most nuanced questions in the literature. Chain-of-thought (CoT) prompting, which asks the model to reason explicitly, generally improves diagnostic accuracy. Dai et al. found that Causal Abduction CoT (generate hypotheses, then seek confirming and disconfirming evidence) worked best. In our case, you'd want the AI to reason: "Fine KPs suggest non-granulomatous inflammation → supports HLA-B27-associated uveitis, but must rule out herpetic → posterior synechia suggests significant inflammation → elevated IOP could be trabecular inflammation or steroid response..."
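One way to operationalize that confirm/disconfirm structure is to spell the steps out explicitly. A sketch of such a prompt (the step wording is my assumption about how to phrase the technique, not Dai et al.'s published prompt):

```python
CAUSAL_ABDUCTION_COT = """\
Reason in four explicit steps:
1. Generate every plausible hypothesis for this presentation.
2. For each hypothesis, list the findings that CONFIRM it.
3. For each hypothesis, list the findings that DISCONFIRM it.
4. Re-rank the hypotheses and justify the final ordering.

Case:
{case}"""

print(CAUSAL_ABDUCTION_COT.format(
    case="34-year-old woman, unilateral red eye and photophobia, "
         "fine KPs, posterior synechia, elevated IOP"))
```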
However, a 2025 NEJM AI study found that CoT paradoxically reduced performance on tasks requiring intuitive pattern recognition. For ophthalmology, this means: use CoT for complex diagnostic puzzles (atypical uveitis, optic neuropathy workup), but for quick pattern-matching (classic dendritic ulcer, papilledema on fundoscopy), a direct approach may work better.
The most powerful approach isn't any single technique; it's the combination. Zhou et al.'s evaluation on the NEJM Image Challenge found that combining chain-of-thought with few-shot exemplars (showing the AI an example of good diagnostic reasoning before asking your question) corrected over half of initial errors.
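Mechanically, the combination is just one worked exemplar prepended before the real case. A sketch (the exemplar content is an illustrative assumption):

```python
# One worked exemplar of the reasoning style, shown before the real case.
EXEMPLAR = """\
Example case: sudden floaters and photopsia, pigment cells in the
anterior vitreous (Shafer's sign).
Example reasoning: Shafer's sign strongly suggests a retinal break, so
retinal tear ranks above uncomplicated PVD; each alternative is weighed
against the findings that confirm or refute it."""

case_text = "<structured case presentation from the template above>"

few_shot_prompt = (
    "Here is an example of the diagnostic reasoning style I expect:\n\n"
    + EXEMPLAR
    + "\n\nNow reason the same way, step by step, about this case:\n"
    + case_text
)
print(few_shot_prompt)
```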
This is where ophthalmology-specific prompting becomes essential. You'd follow up: "The anatomical diagnosis is anterior uveitis. Now reason through the etiological differential. Consider: the patient's age and sex, the fine (non-granulomatous) KPs, the unilateral presentation, and the posterior synechia. Rank the etiological diagnoses by likelihood and specify what systemic workup would confirm or exclude each." This two-step approach (anatomical diagnosis first, then etiological reasoning) mirrors how uveitis specialists actually think, and structured two-step prompts have been shown to outperform single-step approaches.
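The two-step sequence translates into two chained calls, with the second prompt conditioned on the first answer. A sketch, where `ask` is a hypothetical stand-in for whichever institution-approved chat API you use:

```python
case_text = "<structured case presentation>"

def ask(prompt: str) -> str:
    """Hypothetical stand-in for your institution-approved chat API."""
    print(prompt)
    return "anterior uveitis"  # placeholder response for illustration

# Step 1: anatomical diagnosis first.
anatomical = ask(
    "Based on this exam, classify the ANATOMICAL diagnosis "
    "(anterior / intermediate / posterior / panuveitis): " + case_text)

# Step 2: etiological reasoning, conditioned on step 1's answer.
etiologies = ask(
    f"The anatomical diagnosis is {anatomical}. Now reason through the "
    "etiological differential. Consider the patient's age and sex, the fine "
    "(non-granulomatous) KPs, the unilateral presentation, and the posterior "
    "synechia. Rank etiologies by likelihood and specify the systemic workup "
    "that would confirm or exclude each.")
```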
LLMs anchor just like humans. HLA-B27 uveitis is the most common cause of acute anterior non-granulomatous uveitis in young adults, but anchoring on it might cause the AI to underweight herpetic uveitis (which can also cause elevated IOP and requires antiviral treatment), Posner-Schlossman syndrome (glaucomatocyclitic crisis), or early sarcoidosis presenting with a non-granulomatous pattern. Practical prompting strategies: ask the AI to "list the three diagnoses most dangerous to miss in this presentation," to "identify findings that could argue against HLA-B27 uveitis," and to "consider infectious etiologies that would change management." Poulain et al. found that CoT prompting reduced anchoring bias, and Zahraei et al.'s BiasMD framework recommends explicitly prompting the model to consider how demographics might shift the differential.
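Those three questions work well as a fixed de-anchoring battery, sent as follow-up turns in the same conversation after the model's first-pass differential:

```python
# Fixed de-anchoring battery (wording taken from the strategies above),
# applied after the model's initial differential.
DEANCHORING_FOLLOWUPS = [
    "List the three diagnoses most dangerous to miss in this presentation.",
    "Identify findings that could argue against HLA-B27 uveitis.",
    "Consider infectious etiologies that would change management.",
]

for turn in DEANCHORING_FOLLOWUPS:
    print(turn)  # each goes into the same conversation, not a fresh one
```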
Pillar 2: Planning the Workup
This is where generic prompts fail catastrophically. Suppose our patient is also 10 weeks pregnant. A prompt that doesn't mention the pregnancy might recommend a chest X-ray. One that doesn't specify "first episode vs. recurrent" might give you an inappropriately aggressive or conservative workup. The evidence is clear: clinical constraints must be part of the prompt.
This is a perfect illustration of why clinical context in prompts is non-negotiable. Pregnancy changes everything: chest X-ray requires shielding or deferral, fluorescein angiography is relatively contraindicated, many immunosuppressants are teratogenic, and even topical steroids and IOP-lowering agents require careful selection. If you omit the pregnancy from your prompt, the AI will generate a plan that could harm the patient. The correct prompt includes: "Patient is 10 weeks pregnant. Recommend a workup that avoids ionizing radiation and a treatment plan using only pregnancy-safe medications. Flag any recommendations that require obstetric consultation."
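Constraints are least likely to be dropped when they live in a dedicated block appended to every prompt about this patient. A sketch (field names are illustrative assumptions):

```python
# Keep hard constraints in a dedicated block that is always appended,
# so they cannot be silently omitted from any follow-up prompt.
case_text = "<structured case presentation>"
constraints = {
    "pregnancy": "10 weeks pregnant",
    "allergies": "sulfa",
    "episode": "first episode, no prior uveitis",
}

constraint_block = "Hard constraints (must be respected):\n" + "\n".join(
    f"- {key}: {value}" for key, value in constraints.items()
)

prompt = (
    f"{case_text}\n\n{constraint_block}\n\n"
    "Recommend a workup that avoids ionizing radiation and a treatment "
    "plan using only pregnancy-safe medications. Flag any recommendation "
    "that requires obstetric consultation."
)
print(prompt)
```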
Retrieval-augmented generation (RAG), which grounds AI responses in specific guidelines like the AAO Preferred Practice Patterns, has been shown to reduce hallucinations in clinical recommendations. But a prospective crossover study found that well-structured prompts with detailed patient context sometimes performed just as well as RAG. The prompt itself is a powerful tool, especially when you load it with the clinical constraints that matter.
When the labs return, you feed them back: "The following results are now available: HLA-B27 positive, serum ACE normal, RPR non-reactive, chest X-ray deferred due to pregnancy. Update the etiological differential and recommend next steps, including whether additional workup is needed and when." This iterative approach mirrors the real clinical reasoning cycle: data accumulates, the differential narrows, and management evolves. Callens emphasized this as the strongest workflow (presentation → generation → result incorporation → refinement) rather than single-shot queries. The AI becomes a thinking partner you return to, just as you'd update a uveitis consultant with new lab values.
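In the common role/content chat format, that cycle is just an append-only message list, as in this sketch:

```python
# The presentation -> generation -> result incorporation -> refinement
# cycle, expressed as turns in the common role/content chat format.
case_presentation = "<structured case presentation>"
initial_differential = "<the model's first-pass differential>"

messages = [
    {"role": "system", "content": "You are a fellowship-trained uveitis specialist."},
    {"role": "user", "content": case_presentation},
    {"role": "assistant", "content": initial_differential},
    {"role": "user", "content": (
        "The following results are now available: HLA-B27 positive, serum "
        "ACE normal, RPR non-reactive, chest X-ray deferred due to "
        "pregnancy. Update the etiological differential and recommend next "
        "steps, including whether additional workup is needed and when."
    )},
]
# Each new result appends another user turn; the differential narrows
# as the data accumulate.
```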
Pillar 3: Selecting Treatment
Treatment prompting demands the most precise specification of any clinical task. Every ocular finding, every systemic condition, every allergy, every pregnancy consideration needs to be in the prompt.
Suppose the AI proposes prednisolone acetate, cyclopentolate, and timolol. Several things to check here. Prednisolone acetate: generally considered safe topically in pregnancy, but you'll need a careful taper plan to avoid steroid-response IOP elevation on top of the inflammatory IOP elevation. Cyclopentolate: reasonable for breaking the synechia, though some ophthalmologists prefer atropine for more sustained effect in significant uveitis. Timolol: this is the critical check. Beta-blockers cross the placenta and can cause fetal bradycardia. In a pregnant patient, brimonidine (with caution near term) or a topical carbonic anhydrase inhibitor may be safer, but dorzolamide is a sulfonamide, and she has a sulfa allergy. You could prompt: "Given sulfa allergy and pregnancy at 10 weeks, identify which IOP-lowering drops are safe and which are contraindicated. Explain the reasoning for each." This is exactly the kind of nuanced multi-constraint problem where the AI is useful, and where verification is non-negotiable.
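For multi-constraint questions like this, enumerating the candidates forces a per-drug verdict instead of a vague paragraph. A sketch (the candidate list and wording are illustrative assumptions):

```python
# Enumerate candidate drops explicitly so the model must commit to a
# verdict on each one; the list here is illustrative, not exhaustive.
candidates = ["timolol", "brimonidine", "dorzolamide", "brinzolamide", "latanoprost"]

prompt = (
    "Patient: 10 weeks pregnant, sulfa allergy, unilateral anterior "
    "uveitis with elevated IOP.\n"
    "For EACH agent below, state safe / use with caution / contraindicated, "
    "with the specific reason (placental transfer, sulfonamide "
    "cross-reactivity, effect on intraocular inflammation):\n- "
    + "\n- ".join(candidates)
)
print(prompt)
```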
The Machine Can Be Wrong
Everything you've read so far might make AI seem like a trustworthy colleague. It can be, but it can also be a confident liar. And in ophthalmology, where a missed diagnosis of acute angle-closure or endophthalmitis can mean permanent vision loss, that combination is especially dangerous.
Every evidence source reviewed for this guide converges on one rule: AI-generated clinical content is a first draft. Always. No exceptions. It requires your clinical judgment before it becomes a plan. The AI will never see what you see through the slit lamp.
When AI Hallucinates Behind the Slit Lamp
Hallucination, the generation of plausible but factually incorrect content, isn't a rare edge case. It's a persistent feature of current LLM architecture. In ophthalmology, a hallucinated drug interaction, a fabricated diagnostic criterion, or an invented AAO guideline recommendation can directly harm a patient's vision.
How do you know a cited guideline is real? You don't, unless you check. And that's the point. LLMs generate text that looks like a guideline recommendation because they've learned the pattern of what guidelines look like, but they don't retrieve from a database of real guidelines. Chelli et al. documented high rates of fabricated references in AI outputs. A specific recommendation about biologics might be a reasonable clinical approach, yet its attribution to a specific AAO guideline could be entirely hallucinated. Never trust an AI-attributed guideline without verifying it against the primary source. This is especially important in ophthalmology, where practice patterns evolve rapidly and subspecialty guidelines vary.
Abdulnour et al.'s NEJM review proposed the DEFT-AI framework (Diagnosis, Evidence, Feedback, Teaching) β treating the AI's output like a resident's presentation that requires attending oversight. In ophthalmology, this means: the AI generates the differential, but you confirm it against what you see through the slit lamp. The AI suggests a workup, but you decide what's appropriate for this specific patient. The AI proposes a treatment, but you write the prescription.
Fighting Back: What Actually Works
The good news: hallucination mitigation is advancing rapidly.
But you don't need a knowledge graph to fight hallucinations in your daily practice; a few practical prompting strategies go a long way.
Three powerful self-critique prompts: (1) "What diagnoses might I be missing that would change management?" forces consideration of herpetic uveitis (needs antivirals), masquerade syndromes (intraocular lymphoma), and drug-induced uveitis. (2) "What are the strongest arguments against this treatment plan for a pregnant patient with sulfa allergy?" surfaces drug safety concerns you might have missed. (3) "Identify any inconsistencies between the recommended treatment and current AAO Preferred Practice Patterns or uveitis society guidelines" leverages the AI's training data against its own output. Lucas et al.'s ensemble reasoning approach (generating multiple reasoning paths and checking consensus) outperformed standard approaches by 2-5%.
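The consensus idea can be approximated without special tooling: sample the same diagnostic question several times and check whether the lead diagnosis agrees. A minimal sketch, with a hypothetical `ask` helper standing in for whichever chat API you use:

```python
from collections import Counter

def ask(prompt: str) -> str:
    """Hypothetical stand-in for your chat API, sampled at temperature > 0."""
    return "reasoning path...\nacute anterior uveitis"  # placeholder

def consensus_diagnosis(prompt: str, n: int = 5) -> tuple[str, float]:
    # Sample n independent reasoning paths and vote on the lead diagnosis,
    # a rough approximation of ensemble reasoning with consensus checking.
    suffix = "\nState the single most likely diagnosis on the final line."
    answers = [ask(prompt + suffix) for _ in range(n)]
    leads = [a.strip().splitlines()[-1].lower() for a in answers]
    lead, votes = Counter(leads).most_common(1)[0]
    return lead, votes / n  # low agreement: treat the output with extra suspicion
```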
Bias at the Slit Lamp
LLMs can encode and amplify demographic biases. In ophthalmology, this could mean underweighting sarcoidosis in certain populations, missing Behçet's disease in patients outside the "classic" demographic, or defaulting to the most common diagnosis without considering the patient's specific epidemiological risk factors.
Present the same findings in a patient from a different demographic and the epidemiological context shifts dramatically, even though the clinical findings are identical. Behçet's disease, which can present with anterior uveitis and is a sight-threatening diagnosis, is far more prevalent in young men from the Silk Road region. If you don't include demographics in your prompt, the AI may not weight this appropriately. But here's the bias paradox: Poulain et al. found that LLMs sometimes over-weight demographics and under-weight clinical findings. The ideal prompt includes demographics as context but explicitly asks the AI to "weight clinical findings above demographic assumptions" and to "consider diagnoses that may be less common but clinically consistent."
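That balance can be written straight into the prompt: demographics supplied as context, with an explicit weighting instruction, as in this sketch:

```python
# Demographics as context, clinical findings as the primary evidence.
demographics = "<age, sex, ancestry, relevant geography>"
findings = "<structured exam findings>"

prompt = (
    f"Patient demographics (context only): {demographics}\n"
    f"Clinical findings: {findings}\n\n"
    "Weight clinical findings above demographic assumptions. Consider "
    "diagnoses that may be less common in this demographic but are "
    "clinically consistent, and state explicitly when epidemiology "
    "shifts your ranking."
)
print(prompt)
```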
Choosing Your AI Partner
Not all LLMs perform equally for ophthalmic reasoning. Model selection interacts with prompt design in clinically meaningful ways. Wang et al.'s work in the Lancet Digital Health found that reasoning-native models (like o1 and DeepSeek) may need less explicit chain-of-thought prompting; their internal reasoning handles some of that work. But this comes with a tradeoff: their reasoning is less transparent, which matters when you need to understand why the AI ranked one diagnosis above another.
Consider: reasoning depth (complex uveitis workup vs. quick refraction check), knowledge recency (does it know the latest anti-VEGF agents and DRCR.net protocols?), context window (can it handle a full surgical history and years of OCT data?), multimodal capability (can it interpret fundus photos or OCT images, and how reliably?), privacy policies (where do your patients' data go?), institutional compliance (is it approved in your hospital system?), and cost. Gaebe et al. showed that even open-source models reached 79% diagnostic accuracy with advanced prompting, so the most expensive tool isn't always necessary for every task.
The Five-Step Discipline
Let's return one last time to our patient, the 34-year-old pregnant woman with anterior uveitis. The evidence from 48 peer-reviewed studies points to a systematic approach that every ophthalmologist can adopt today:
Step 1: Define the persona. "You are a fellowship-trained uveitis specialist." Tell the AI its specialty, reasoning style, evidence standards, and behavioral boundaries.
Step 2: Structure the case. Load every slit-lamp finding, every measurement, every constraint. Present it the way you'd present to a consultant: VA, IOP, exam by anatomical layer, pertinent history, and a clear question.
Step 3: Choose your reasoning technique. Chain-of-thought for complex etiological differentials. Few-shot examples for consistency. Direct queries for pattern recognition. Combine them for the strongest results.
Step 4: Specify the output. Ranked differential with evidence? Stepwise workup respecting pregnancy? Treatment plan with pregnancy-safe alternatives? Tell the AI exactly what you need.
Step 5: Always verify. The AI is a first draft. You are the ophthalmologist. Every recommendation gets checked against your training, your examination findings, your guidelines, and this specific patient's reality. The AI will never hold the slit lamp.
Different readers will take away different lessons, and that's the point. For some ophthalmologists, the revelation is that prompt design is a skill that can be learned, not an innate talent. For others, it's the evidence that the same AI that gives you "conjunctivitis" can give you a sophisticated uveitis workup, depending entirely on how you ask. For many, it's the realization that they've been using these tools like a search engine when they could be using them like a reasoning partner who knows the difference between granulomatous and non-granulomatous KPs.
Whichever lesson resonates, the evidence is clear: the ophthalmologists who learn to prompt well will get more from AI than those who don't. And the patients of ophthalmologists who verify will be safer than the patients of those who trust blindly. The thinking machine is powerful, but it still needs the thinking doctor.