The Thinking Machine at the Slit Lamp
A clinician's evidence-based guide to prompting AI for differential diagnosis, workup planning, and treatment selection in ophthalmology
Two Ophthalmologists, One Red Eye, Two Answers
A junior resident types into an AI chatbot: "Red eye, photophobia, blurred vision: what could it be?"
Down the hall, an experienced attending opens the same tool and writes something very different.
The resident's AI responds with a generic list: conjunctivitis, dry eye, subconjunctival hemorrhage, allergic reaction, corneal abrasion. Not only is the correct diagnosis missing from the list; the conditions ranked most likely are the ones least consistent with the findings. A patient with anterior uveitis might walk out with antibiotic drops and a missed diagnosis.
The attending's AI responds with a ranked differential led by acute anterior uveitis (non-granulomatous, given the fine KPs), noting the elevated IOP as likely secondary to trabecular meshwork inflammation or steroid response if previously treated. It flags the posterior synechia as evidence of significant or recurrent inflammation and recommends a targeted workup including HLA-B27, ACE, chest X-ray, RPR/FTA-ABS, and consideration of sarcoidosis and ankylosing spondylitis given the patient's age and presentation.
The difference wasn't the AI. It was the prompt. The attending had configured the AI with an ophthalmic clinical persona, loaded every slit-lamp finding in a structured format, and asked for a ranked differential with supporting and refuting evidence. The resident typed a sentence. This difference, between a vague symptom query and a structured clinical prompt, is now supported by a substantial body of evidence, and it's what this guide is about.
Teaching the Machine to Think Like an Ophthalmologist
Before you show the AI a single slit-lamp finding, you need to answer a question most clinicians never think to ask: Who should this AI think it is?
In ophthalmology, you instinctively calibrate your consultations. You don't describe a fundus the same way to a retina specialist as you do to a primary care physician. You adjust the vocabulary, the assumed knowledge, the level of detail. An LLM needs the same calibration, but it can't infer it. You have to tell it.
The literature suggests three essential elements. First, its clinical role: "You are a fellowship-trained ophthalmologist with expertise in uveitis and ocular inflammation. You reason using a systematic anatomical approach." Second, its behavioral boundaries: "Always consider sight-threatening diagnoses first. Cite evidence when available. Flag diagnostic uncertainty explicitly." Third, the output structure: "Provide a ranked differential with supporting and refuting evidence, followed by a stepwise workup." Callens formalized this as the RTF (Role, Task, Format) framework, and research shows it materially changes the quality of clinical output.
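Under the RTF framing, the persona can live in a reusable system prompt so it stays consistent across consults. A minimal sketch in Python, reusing the wording above (the variable names and assembly pattern are illustrative assumptions, not a validated template):

```python
# Assemble an RTF (Role, Task, Format) system prompt from the three
# elements above. The assembly pattern is illustrative, not a standard.
ROLE = ("You are a fellowship-trained ophthalmologist with expertise in "
        "uveitis and ocular inflammation. You reason using a systematic "
        "anatomical approach.")
TASK = ("Always consider sight-threatening diagnoses first. Cite evidence "
        "when available. Flag diagnostic uncertainty explicitly.")
FORMAT = ("Provide a ranked differential with supporting and refuting "
          "evidence, followed by a stepwise workup.")

system_prompt = "\n\n".join([ROLE, TASK, FORMAT])
print(system_prompt)
```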
The Power of Persona
This isn't theoretical. Google DeepMind's AMIE system, the most rigorously evaluated clinical LLM to date, was built on exactly this principle. Optimized with clinical reasoning personas and structured diagnostic frameworks, AMIE outperformed unassisted clinicians in a randomized evaluation. Not because it was a fundamentally better model. Because it was better configured.
Now imagine applying this to your ophthalmic practice. The same AI that gives you a bland "conjunctivitis" when asked about a red eye can generate a nuanced differential considering scleritis, keratitis, uveitis, and acute angle-closure glaucoma once it knows it's supposed to think like an ophthalmologist.
How You Present the Case Matters
Just as a disorganized case presentation at grand rounds frustrates the attending and leads to muddled thinking, a disorganized prompt produces muddled AI output. In ophthalmology, this is especially critical because our findings are highly structured (laterality, anatomical location, grading scales, intraocular pressure, visual acuity) and omitting any element degrades the output dramatically.
The ophthalmic exam maps beautifully onto prompt design: it is already inherently structured, anatomically layered, quantified (IOP, VA, cell/flare grading), and lateralized. This translates directly into an effective prompt sequence: demographics → chief complaint → timeline → pertinent history → exam by anatomical layer → quantified measurements → specific clinical question. Ayoub et al. (2026) validated this approach directly: prompts that mimic the clinician's own case-presentation workflow produced human-comparable diagnostic rationales on real clinical data.
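That presentation order can be enforced with a simple fill-in template so no exam element is silently dropped. A sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class OphthalmicCase:
    # Hypothetical field names; the order mirrors a standard case
    # presentation: demographics -> complaint -> timeline -> history ->
    # exam by layer -> measurements -> question.
    demographics: str
    chief_complaint: str
    timeline: str
    pertinent_history: str
    exam_by_layer: str    # lids/conjunctiva -> cornea -> AC -> iris -> lens -> fundus
    measurements: str     # VA, IOP, cell/flare grade, laterality
    clinical_question: str

    def to_prompt(self) -> str:
        return "\n".join(
            f"{field.replace('_', ' ').title()}: {value}"
            for field, value in vars(self).items()
        )
```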
Before entering any clinical data into a consumer AI tool (ChatGPT, Claude, Gemini), you must de-identify all patient information. These tools are not HIPAA-compliant in their standard configurations. Remove names, DOBs, MRNs, and any identifiers. Never upload fundus photos or OCTs that contain patient metadata. Know your institution's policies.
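As a safety net before pasting a note, a scripted redaction pass can catch the obvious structured identifiers. A minimal sketch; the patterns are illustrative assumptions, and no regex list substitutes for institutional de-identification tooling plus a manual read-through (free-text names, for instance, will slip through):

```python
import re

# Illustrative patterns only; always re-read the output yourself
# before pasting it into any consumer AI tool.
REDACTIONS = [
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"), "[DATE]"),   # DOB / visit dates
    (re.compile(r"\bMRN[:#\s]*\d+\b", re.IGNORECASE), "[MRN]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]

def redact(note: str) -> str:
    for pattern, token in REDACTIONS:
        note = pattern.sub(token, note)
    return note
```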
The Three Pillars of Ophthalmic Reasoning
Clinical decision-making in ophthalmology rests on three pillars: What does this eye have? (differential diagnosis), How do we confirm it? (diagnostic workup), and How do we treat it? (treatment). Each demands a different prompting approach. Let's work through all three with our uveitis patient.
Pillar 1: Building the Differential
How you ask the AI to reason is one of the most nuanced questions in the literature. Chain-of-thought (CoT) prompting, which asks the model to reason explicitly, generally improves diagnostic accuracy. Dai et al. found that Causal Abduction CoT (generate hypotheses, then seek confirming and disconfirming evidence) worked best. In our case, you'd want the AI to reason: "Fine KPs suggest non-granulomatous inflammation → supports HLA-B27-associated uveitis, but must rule out herpetic → posterior synechia suggests significant inflammation → elevated IOP could be trabecular inflammation or steroid response..."
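One way to operationalize that confirm/disconfirm structure is to spell the steps out explicitly. A sketch of such a prompt (the step wording is my assumption about how to phrase the technique, not Dai et al.'s published prompt):

```python
CAUSAL_ABDUCTION_COT = """\
Reason in four explicit steps:
1. Generate every plausible hypothesis for this presentation.
2. For each hypothesis, list the findings that CONFIRM it.
3. For each hypothesis, list the findings that DISCONFIRM it.
4. Re-rank the hypotheses and justify the final ordering.

Case:
{case}"""

print(CAUSAL_ABDUCTION_COT.format(
    case="34-year-old woman, unilateral red eye and photophobia, "
         "fine KPs, posterior synechia, elevated IOP"))
```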
However, a 2025 NEJM AI study found that CoT paradoxically reduced performance on tasks requiring intuitive pattern recognition. For ophthalmology, this means: use CoT for complex diagnostic puzzles (atypical uveitis, optic neuropathy workup), but for quick pattern-matching (classic dendritic ulcer, papilledema on fundoscopy), a direct approach may work better.
The most powerful approach isn't any single technique; it's the combination. Zhou et al.'s evaluation on the NEJM Image Challenge found that combining chain-of-thought with few-shot exemplars (showing the AI an example of good diagnostic reasoning before asking your question) corrected over half of initial errors.
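Mechanically, the combination is just one worked exemplar prepended before the real case. A sketch (the exemplar content is an illustrative assumption):

```python
# One worked exemplar of the reasoning style, shown before the real case.
EXEMPLAR = """\
Example case: sudden floaters and photopsia, pigment cells in the
anterior vitreous (Shafer's sign).
Example reasoning: Shafer's sign strongly suggests a retinal break, so
retinal tear ranks above uncomplicated PVD; each alternative is weighed
against the findings that confirm or refute it."""

case_text = "<structured case presentation from the template above>"

few_shot_prompt = (
    "Here is an example of the diagnostic reasoning style I expect:\n\n"
    + EXEMPLAR
    + "\n\nNow reason the same way, step by step, about this case:\n"
    + case_text
)
print(few_shot_prompt)
```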
This is where ophthalmology-specific prompting becomes essential. You'd follow up: "The anatomical diagnosis is anterior uveitis. Now reason through the etiological differential. Consider: the patient's age and sex, the fine (non-granulomatous) KPs, the unilateral presentation, and the posterior synechia. Rank the etiological diagnoses by likelihood and specify what systemic workup would confirm or exclude each." This two-step approach (anatomical diagnosis first, then etiological reasoning) mirrors how uveitis specialists actually think, and structured two-step prompts have been shown to outperform single-step approaches.
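The two-step sequence translates into two chained calls, with the second prompt conditioned on the first answer. A sketch, where `ask` is a hypothetical stand-in for whichever institution-approved chat API you use:

```python
case_text = "<structured case presentation>"

def ask(prompt: str) -> str:
    """Hypothetical stand-in for your institution-approved chat API."""
    print(prompt)
    return "anterior uveitis"  # placeholder response for illustration

# Step 1: anatomical diagnosis first.
anatomical = ask(
    "Based on this exam, classify the ANATOMICAL diagnosis "
    "(anterior / intermediate / posterior / panuveitis): " + case_text)

# Step 2: etiological reasoning, conditioned on step 1's answer.
etiologies = ask(
    f"The anatomical diagnosis is {anatomical}. Now reason through the "
    "etiological differential. Consider the patient's age and sex, the fine "
    "(non-granulomatous) KPs, the unilateral presentation, and the posterior "
    "synechia. Rank etiologies by likelihood and specify the systemic workup "
    "that would confirm or exclude each.")
```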
LLMs anchor just like humans. HLA-B27 uveitis is the most common cause of acute anterior non-granulomatous uveitis in young adults, but anchoring on it might cause the AI to underweight herpetic uveitis (which can also cause elevated IOP and requires antiviral treatment), Posner-Schlossman syndrome (glaucomatocyclitic crisis), or early sarcoidosis presenting with a non-granulomatous pattern. Practical prompting strategies: ask the AI to "list the three diagnoses most dangerous to miss in this presentation," to "identify findings that could argue against HLA-B27 uveitis," and to "consider infectious etiologies that would change management." Poulain et al. found that CoT prompting reduced anchoring bias, and Zahraei et al.'s BiasMD framework recommends explicitly prompting the model to consider how demographics might shift the differential.
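Those three questions work well as a fixed de-anchoring battery, sent as follow-up turns in the same conversation after the model's first-pass differential:

```python
# Fixed de-anchoring battery (wording taken from the strategies above),
# applied after the model's initial differential.
DEANCHORING_FOLLOWUPS = [
    "List the three diagnoses most dangerous to miss in this presentation.",
    "Identify findings that could argue against HLA-B27 uveitis.",
    "Consider infectious etiologies that would change management.",
]

for turn in DEANCHORING_FOLLOWUPS:
    print(turn)  # each goes into the same conversation, not a fresh one
```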
Pillar 2: Planning the Workup
This is where generic prompts fail catastrophically. Suppose our patient is also 10 weeks pregnant. A prompt that doesn't mention the pregnancy might recommend a chest X-ray. One that doesn't specify "first episode vs. recurrent" might give you an inappropriately aggressive or conservative workup. The evidence is clear: clinical constraints must be part of the prompt.
This is a perfect illustration of why clinical context in prompts is non-negotiable. Pregnancy changes everything: chest X-ray requires shielding or deferral, fluorescein angiography is relatively contraindicated, many immunosuppressants are teratogenic, and even topical steroids and IOP-lowering agents require careful selection. If you omit the pregnancy from your prompt, the AI will generate a plan that could harm the patient. The correct prompt includes: "Patient is 10 weeks pregnant. Recommend a workup that avoids ionizing radiation and a treatment plan using only pregnancy-safe medications. Flag any recommendations that require obstetric consultation."
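Constraints are least likely to be dropped when they live in a dedicated block appended to every prompt about this patient. A sketch (field names are illustrative assumptions):

```python
# Keep hard constraints in a dedicated block that is always appended,
# so they cannot be silently omitted from any follow-up prompt.
case_text = "<structured case presentation>"
constraints = {
    "pregnancy": "10 weeks pregnant",
    "allergies": "sulfa",
    "episode": "first episode, no prior uveitis",
}

constraint_block = "Hard constraints (must be respected):\n" + "\n".join(
    f"- {key}: {value}" for key, value in constraints.items()
)

prompt = (
    f"{case_text}\n\n{constraint_block}\n\n"
    "Recommend a workup that avoids ionizing radiation and a treatment "
    "plan using only pregnancy-safe medications. Flag any recommendation "
    "that requires obstetric consultation."
)
print(prompt)
```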
Retrieval-augmented generation (RAG), which grounds AI responses in specific guidelines like the AAO Preferred Practice Patterns, has been shown to reduce hallucinations in clinical recommendations. But a prospective crossover study found that well-structured prompts with detailed patient context sometimes performed just as well as RAG. The prompt itself is a powerful tool, especially when you load it with the clinical constraints that matter.
When the labs return, you feed them back: "The following results are now available: HLA-B27 positive, serum ACE normal, RPR non-reactive, chest X-ray deferred due to pregnancy. Update the etiological differential and recommend next steps, including whether additional workup is needed and when." This iterative approach mirrors the real clinical reasoning cycle: data accumulates, the differential narrows, and management evolves. Callens emphasized this as the strongest workflow (presentation → generation → result incorporation → refinement) rather than single-shot queries. The AI becomes a thinking partner you return to, just as you'd update a uveitis consultant with new lab values.
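In the common role/content chat format, that cycle is just an append-only message list, as in this sketch:

```python
# The presentation -> generation -> result incorporation -> refinement
# cycle, expressed as turns in the common role/content chat format.
case_presentation = "<structured case presentation>"
initial_differential = "<the model's first-pass differential>"

messages = [
    {"role": "system", "content": "You are a fellowship-trained uveitis specialist."},
    {"role": "user", "content": case_presentation},
    {"role": "assistant", "content": initial_differential},
    {"role": "user", "content": (
        "The following results are now available: HLA-B27 positive, serum "
        "ACE normal, RPR non-reactive, chest X-ray deferred due to "
        "pregnancy. Update the etiological differential and recommend next "
        "steps, including whether additional workup is needed and when."
    )},
]
# Each new result appends another user turn; the differential narrows
# as the data accumulate.
```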
Pillar 3: Selecting Treatment
Treatment prompting demands the most precise specification of any clinical task. Every ocular finding, every systemic condition, every allergy, every pregnancy consideration needs to be in the prompt.
Suppose the AI proposes prednisolone acetate, cyclopentolate, and timolol. Several things to check here. Prednisolone acetate: generally considered safe topically in pregnancy, but you'll need a careful taper plan to avoid steroid-response IOP elevation on top of the inflammatory IOP elevation. Cyclopentolate: reasonable for breaking the synechia, though some ophthalmologists prefer atropine for more sustained effect in significant uveitis. Timolol: this is the critical check. Beta-blockers cross the placenta and can cause fetal bradycardia. In a pregnant patient, brimonidine (with caution near term) or a topical carbonic anhydrase inhibitor may be safer, but dorzolamide is a sulfonamide, and she has a sulfa allergy. You could prompt: "Given sulfa allergy and pregnancy at 10 weeks, identify which IOP-lowering drops are safe and which are contraindicated. Explain the reasoning for each." This is exactly the kind of nuanced multi-constraint problem where the AI is useful, and where verification is non-negotiable.
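For multi-constraint questions like this, enumerating the candidates forces a per-drug verdict instead of a vague paragraph. A sketch (the candidate list and wording are illustrative assumptions):

```python
# Enumerate candidate drops explicitly so the model must commit to a
# verdict on each one; the list here is illustrative, not exhaustive.
candidates = ["timolol", "brimonidine", "dorzolamide", "brinzolamide", "latanoprost"]

prompt = (
    "Patient: 10 weeks pregnant, sulfa allergy, unilateral anterior "
    "uveitis with elevated IOP.\n"
    "For EACH agent below, state safe / use with caution / contraindicated, "
    "with the specific reason (placental transfer, sulfonamide "
    "cross-reactivity, effect on intraocular inflammation):\n- "
    + "\n- ".join(candidates)
)
print(prompt)
```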
The Machine Can Be Wrong
Everything you've read so far might make AI seem like a trustworthy colleague. It can be, but it can also be a confident liar. And in ophthalmology, where a missed diagnosis of acute angle-closure or endophthalmitis can mean permanent vision loss, that combination is especially dangerous.
Every evidence source reviewed for this guide converges on one rule: AI-generated clinical content is a first draft. Always. No exceptions. It requires your clinical judgment before it becomes a plan. The AI will never see what you see through the slit lamp.
When AI Hallucinates Behind the Slit Lamp
Hallucination, the generation of plausible but factually incorrect content, isn't a rare edge case. It's a persistent feature of current LLM architecture. In ophthalmology, a hallucinated drug interaction, a fabricated diagnostic criterion, or an invented AAO guideline recommendation can directly harm a patient's vision.
How do you know a cited guideline is real? You don't, unless you check. And that's the point. LLMs generate text that looks like a guideline recommendation because they've learned the pattern of what guidelines look like, but they don't retrieve from a database of real guidelines. Chelli et al. documented high rates of fabricated references in AI outputs. A specific recommendation about biologics might be a reasonable clinical approach, yet its attribution to a specific AAO guideline could be entirely hallucinated. Never trust an AI-attributed guideline without verifying it against the primary source. This is especially important in ophthalmology, where practice patterns evolve rapidly and subspecialty guidelines vary.
Abdulnour et al.'s NEJM review proposed the DEFT-AI framework (Diagnosis, Evidence, Feedback, Teaching) β treating the AI's output like a resident's presentation that requires attending oversight. In ophthalmology, this means: the AI generates the differential, but you confirm it against what you see through the slit lamp. The AI suggests a workup, but you decide what's appropriate for this specific patient. The AI proposes a treatment, but you write the prescription.
Fighting Back: What Actually Works
The good news: hallucination mitigation is advancing rapidly.
But you don't need a knowledge graph to fight hallucinations in your daily practice; a few practical prompting strategies go a long way.
Three powerful self-critique prompts: (1) "What diagnoses might I be missing that would change management?" forces consideration of herpetic uveitis (needs antivirals), masquerade syndromes (intraocular lymphoma), and drug-induced uveitis. (2) "What are the strongest arguments against this treatment plan for a pregnant patient with sulfa allergy?" surfaces drug safety concerns you might have missed. (3) "Identify any inconsistencies between the recommended treatment and current AAO Preferred Practice Patterns or uveitis society guidelines" leverages the AI's training data against its own output. Lucas et al.'s ensemble reasoning approach (generating multiple reasoning paths and checking consensus) outperformed standard approaches by 2-5%.
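The consensus idea can be approximated without special tooling: sample the same diagnostic question several times and check whether the lead diagnosis agrees. A minimal sketch, with a hypothetical `ask` helper standing in for whichever chat API you use:

```python
from collections import Counter

def ask(prompt: str) -> str:
    """Hypothetical stand-in for your chat API, sampled at temperature > 0."""
    return "reasoning path...\nacute anterior uveitis"  # placeholder

def consensus_diagnosis(prompt: str, n: int = 5) -> tuple[str, float]:
    # Sample n independent reasoning paths and vote on the lead diagnosis,
    # a rough approximation of ensemble reasoning with consensus checking.
    suffix = "\nState the single most likely diagnosis on the final line."
    answers = [ask(prompt + suffix) for _ in range(n)]
    leads = [a.strip().splitlines()[-1].lower() for a in answers]
    lead, votes = Counter(leads).most_common(1)[0]
    return lead, votes / n  # low agreement: treat the output with extra suspicion
```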
Bias at the Slit Lamp
LLMs can encode and amplify demographic biases. In ophthalmology, this could mean underweighting sarcoidosis in certain populations, missing Behçet's disease in patients outside the "classic" demographic, or defaulting to the most common diagnosis without considering the patient's specific epidemiological risk factors.
Present the same findings in a patient from a different demographic and the epidemiological context shifts dramatically, even though the clinical findings are identical. Behçet's disease, which can present with anterior uveitis and is a sight-threatening diagnosis, is far more prevalent in young men from the Silk Road region. If you don't include demographics in your prompt, the AI may not weight this appropriately. But here's the bias paradox: Poulain et al. found that LLMs sometimes over-weight demographics and under-weight clinical findings. The ideal prompt includes demographics as context but explicitly asks the AI to "weight clinical findings above demographic assumptions" and to "consider diagnoses that may be less common but clinically consistent."
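That balance can be written straight into the prompt: demographics supplied as context, with an explicit weighting instruction, as in this sketch:

```python
# Demographics as context, clinical findings as the primary evidence.
demographics = "<age, sex, ancestry, relevant geography>"
findings = "<structured exam findings>"

prompt = (
    f"Patient demographics (context only): {demographics}\n"
    f"Clinical findings: {findings}\n\n"
    "Weight clinical findings above demographic assumptions. Consider "
    "diagnoses that may be less common in this demographic but are "
    "clinically consistent, and state explicitly when epidemiology "
    "shifts your ranking."
)
print(prompt)
```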
Choosing Your AI Partner
Not all LLMs perform equally for ophthalmic reasoning. Model selection interacts with prompt design in clinically meaningful ways. Wang et al.'s work in the Lancet Digital Health found that reasoning-native models (like o1 and DeepSeek) may need less explicit chain-of-thought prompting; their internal reasoning handles some of that work. But this comes with a tradeoff: their reasoning is less transparent, which matters when you need to understand why the AI ranked one diagnosis above another.
Consider: reasoning depth (complex uveitis workup vs. quick refraction check), knowledge recency (does it know the latest anti-VEGF agents and DRCR.net protocols?), context window (can it handle a full surgical history and years of OCT data?), multimodal capability (can it interpret fundus photos or OCT images, and how reliably?), privacy policies (where do your patients' data go?), institutional compliance (is it approved in your hospital system?), and cost. Gaebe et al. showed that even open-source models reached 79% diagnostic accuracy with advanced prompting, so the most expensive tool isn't always necessary for every task.
The Five-Step Discipline
Let's return one last time to our patient, the 34-year-old pregnant woman with anterior uveitis. The evidence from 48 peer-reviewed studies points to a systematic approach that every ophthalmologist can adopt today:
Step 1: Define the persona. "You are a fellowship-trained uveitis specialist." Tell the AI its specialty, reasoning style, evidence standards, and behavioral boundaries.
Step 2: Structure the case. Load every slit-lamp finding, every measurement, every constraint. Present it the way you'd present to a consultant: VA, IOP, exam by anatomical layer, pertinent history, and a clear question.
Step 3: Choose your reasoning technique. Chain-of-thought for complex etiological differentials. Few-shot examples for consistency. Direct queries for pattern recognition. Combine them for the strongest results.
Step 4: Specify the output. Ranked differential with evidence? Stepwise workup respecting pregnancy? Treatment plan with pregnancy-safe alternatives? Tell the AI exactly what you need.
Step 5: Always verify. The AI is a first draft. You are the ophthalmologist. Every recommendation gets checked against your training, your examination findings, your guidelines, and this specific patient's reality. The AI will never hold the slit lamp.
Different readers will take away different lessons, and that's the point. For some ophthalmologists, the revelation is that prompt design is a skill that can be learned, not an innate talent. For others, it's the evidence that the same AI that gives you "conjunctivitis" can give you a sophisticated uveitis workup, depending entirely on how you ask. For many, it's the realization that they've been using these tools like a search engine when they could be using them like a reasoning partner who knows the difference between granulomatous and non-granulomatous KPs.
Whichever lesson resonates, the evidence is clear: the ophthalmologists who learn to prompt well will get more from AI than those who don't. And the patients of ophthalmologists who verify will be safer than the patients of those who trust blindly. The thinking machine is powerful, but it still needs the thinking doctor.