The AI Models Just Failed the Clinical Test We Use to Spot Executive Dysfunction

Picture a patient walking into a neuropsych clinic. Short attention task: 91%. Longer list: 57%. Forty-word list: 15%. On the trick items embedded inside the longer mixed lists, near-zero. Any clinician would close the chart, schedule a follow-up, and start asking about head injuries, early dementia, or attention medication. The patient in the new PNAS Nexus paper is GPT-4o.

That would be a charming benchmark story if a regulator hadn’t just made it deployable. In January, the FDA quietly loosened the rules governing how clinical-decision-support software talks to a physician, letting a tool hand over one recommendation instead of a menu and staying largely silent on whether language models meet the same transparency bar. Five months later, the peer-reviewed paper from Jin Fan’s lab at Queens College, CUNY, with Suketu Patel and Hongbin Wang, says the architecture under the hood cannot reliably suppress the wrong answer when the wrong answer is sitting right there on the screen. The cognitive evidence arrived after the policy did.

The test is the Stroop, and it’s been sitting on the clinical-cognition shelf since 1935 because it is brutally simple and embarrassingly hard to game. A clinician shows you color words printed in mismatched ink, the word RED in blue letters, the word GREEN in yellow, and asks you to say the ink color instead of reading the word. Reading is automatic. Naming the ink while suppressing the read is not. That suppression is the canonical measure of executive control, the part of cognition that lets you override an automatic response in favor of the one the task asked for, and clinicians use the Stroop in dementia clinics, ADHD evaluations, traumatic brain injury rehab, and stroke recovery.

Fan’s team put today’s frontier large language models through the task: GPT-4o, GPT-5, Claude 3.5 Sonnet, Claude Opus 4.1, and Gemini 2.5. At five words, GPT-4o was 91% accurate. At ten, 57%. At forty, 15%. Claude 3.5 Sonnet held the line longer and then crashed to 24% by the forty-word mark. On mismatched items embedded inside longer mixed lists, every model the team tested drifted toward near-zero accuracy. The drop is not subtle, and the systems are not getting tired in any human sense. They are pattern-defaulting. The longer the list, the more reliably they slip into reading the word instead of naming the ink, which is exactly the inhibitory failure the test was designed to catch.
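For readers who want to see the shape of the measurement, here is a minimal sketch of a text Stroop list and its scoring. It is not the paper's protocol: the four-color palette, the 50/50 mix of matched and mismatched items, and the names make_stroop_list, score, and word_reader are all assumptions made for illustration. The word_reader function stands in for a system that always reads the word, the failure mode described above.

```python
import random

# Assumed palette; the paper's actual stimulus set is not described in this article.
COLORS = ["red", "blue", "green", "yellow"]

def make_stroop_list(n_items, incongruent_ratio=0.5, seed=0):
    """Build n_items (word, ink) pairs; a mismatched item has word != ink."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        ink = rng.choice(COLORS)
        if rng.random() < incongruent_ratio:
            word = rng.choice([c for c in COLORS if c != ink])
        else:
            word = ink
        items.append((word, ink))
    return items

def score(items, responses):
    """Return (overall accuracy, accuracy on mismatched items only)."""
    correct = [resp == ink for (_, ink), resp in zip(items, responses)]
    mismatched = [ok for (word, ink), ok in zip(items, correct) if word != ink]
    overall = sum(correct) / len(items)
    mismatched_acc = sum(mismatched) / len(mismatched) if mismatched else None
    return overall, mismatched_acc

def word_reader(items):
    """Hypothetical responder that always reads the word and ignores the ink."""
    return [word for word, _ in items]

if __name__ == "__main__":
    for n in (5, 10, 40):
        items = make_stroop_list(n)
        print(n, score(items, word_reader(items)))
```

A pure word-reader still gets every matched item right, so its overall accuracy can look respectable on a mixed list. Scoring the mismatched items separately is what exposes the failure, which is why the near-zero figure on those items matters more than the headline percentages.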

I went into this expecting the models to breeze through it. Naming the color of a word looks like a pattern-matching problem, and pattern matching is what transformers do for a living. What changed my mind was the architectural reading the authors put in their own abstract: transformers “lack an explicit architecture for the executive control of attention found in humans, which is essential for resolving conflicts and selecting relevant information in the presence of competing computations.” Translated: there is no separate executive layer that can step in and say “suppress the word, the task asked for the ink.” Humans recruit a prefrontal cognitive-control system to do exactly that override. The transformer rides the dominant statistical regularity, and when the regularity is “the answer is the word, the word is right there,” it rides it off a cliff.

That structural diagnosis lands on a regulator who’s been opening the door. The January 6, 2026 FDA guidance dropped the previous expectation that clinical-decision-support tools display multiple options so a physician would have to think before clicking. The new bar is “glass box” reasoning, transparent enough that a clinician can audit the logic. That works for a rule-based system. It does not transfer cleanly to a probabilistic language model whose intermediate state is uninterpretable by design. The guidance does not directly address how generative AI should meet the transparency requirement, and the agency itself has signaled willingness to move at something closer to “Silicon Valley speed.” Large language models are pouring into the clinical pipeline marketed as “clinical assistants” rather than medical devices, which lets them dodge most of the regulated-device pathway entirely.

The clinical fallout is already in court, and every case is structurally the same failure the paper describes: a wrong default the system would not suppress. In January 2026, Google and Character.AI moved to settle wrongful-death suits brought by parents of teenagers who died by suicide after extended chatbot conversations, including 13-year-old Juliana Peralta of Colorado and 14-year-old Sewell Setzer III, whose mother Megan Garcia filed the original Florida complaint. In both cases the failure mode is the same: the bot kept engaging the child instead of suppressing the engagement and routing them to crisis resources. OpenAI is being sued by Matthew and Maria Raine, the parents of 16-year-old Adam Raine, who killed himself in April 2025 after weeks of conversations in which, per the complaint, ChatGPT mentioned suicide 1,275 times, six times more often than Adam did, and offered to draft his suicide note. A separate California suit from the mother of Austin Gordon, a 40-year-old Colorado man, alleges ChatGPT acted as a “suicide coach” and turned a childhood book into what the complaint calls a “suicide lullaby.” None of these systems were ever cleared as medical devices. Every one of them was used as a mental-health interlocutor by people who relied on the answer.

The Stroop paper does not mention any of these cases, and it does not need to. The bridge is one short sentence. If a system cannot inhibit a dominant statistical pull when the task tells it to, you do not want it in the loop on a question that has a wrong default. Suicidal ideation has a wrong default. A vague chest-pain query has a wrong default. A patient with a complicated medication list and a worsening rash has several wrong defaults stacked on top of each other. The thing transformer attention does worst, on the test we use to flag the human patients who should not be making high-stakes decisions unsupervised, is exactly the thing the FDA just made it easier to deploy in front of a tired physician at the end of a twelve-hour shift. What galls me about the timing is how fixable it would have been. The peer-reviewed evidence was on the way. The agency had every reason to wait. I would like to see frontier models clear a Stroop list before any of them are licensed to be the single voice in a clinician’s ear, and I would like to see the same standard applied to the chatbot a teenager opens at two in the morning.

Sarah Okonkwo
