Numbers that lie: AI reaches 95% accuracy on medical diagnostics by reward hacking

The odds of a machine getting your diagnosis correct are nowhere near that high.

A team of researchers at RespAI Lab, KIIT Bhubaneswar, KIMS Bhubaneswar, and Monash University in Australia today published a fascinating preprint research paper that takes a mighty bludgeon to the notion that AI can predict medical diagnoses.

First of all, let’s get something straight: a human’s medical diagnosis cannot be predicted. There are more than eight billion of us and we’re all different. A diagnosis is a systematic evaluation of variables, not the answer to a multiple-choice or 50/50 question.

For argument’s sake, though, let’s just say that there is a machine capable of getting a medical diagnosis correct 95% of the time. That would be incredible, right? If it were true, we’d all be idiots not to make that machine our primary physician.

That’s not the case, but there’s a lot of research out there that cites similar numbers. 

For example, according to an unrelated study published in Nature yesterday, April 9, LLMs were able to reach “greater diagnostic accuracy and superior performance” than human experts in nearly 98% of the cases reviewed.

While these two studies are apples and oranges, we bring the second one up to show that LLMs are demonstrating almost perfect performance in diagnostic testing paradigms even in peer-reviewed research.

Numbers that lie

In both studies, the authors caveat the work with disclaimers. The first team mentions that the perceived misalignment “raises concerns about the reliability of LLMs in clinical decision making.” The second team wrote that their research “has several limitations and should be interpreted with caution.”

To be fair, the numbers themselves aren’t lies. On cursory examination, the science behind those results seems sound and we tentatively agree with both teams’ findings. This article is in no way, shape, or form a criticism of the work. In fact, we think the work in both studies is both impressive and important.

But the numbers don’t tell the truth. No LLM is 95% or 98% accurate at diagnosing medical conditions. 

The first paper, titled “Right Prediction, Wrong Reasoning: Uncovering LLM Misalignment in RA Disease Diagnosis,” shows how there’s always more to the story when it comes to AI benchmarks.

Related: Google, OpenAI, Artificial Intelligence benchmarks, and AGI

The researchers tested GPT-3.5 Turbo, GPT-4o, GPT-4o-mini, Gemini 1.5 Flash, Gemini 2.0 Flash, and QWEN-2-7B on a rheumatoid arthritis diagnosis task using real-world patient data.

According to the paper:

“The best-performing model accurately predicts rheumatoid arthritis (RA) diseases approximately 95% of the time. However, when medical experts evaluated the reasoning generated by the model, they found that nearly 68% of the reasoning was incorrect.”

AI models that cheat

This means the model was able to game the system at least 68% of the time. That does not mean it was right for the right reasons the other 32% of the time (and that’s assuming the cases with sound reasoning don’t overlap with the 5% of predictions that were flat-out wrong, which is a poor assumption at best).
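Here’s a quick back-of-the-envelope sketch of that arithmetic, under our assumption (not necessarily the paper’s exact framing) that the 68% figure applies to the cases the model predicted correctly:

```python
# Back-of-the-envelope bounds. Assumption (ours, not the paper's exact
# framing): the 68% "incorrect reasoning" figure applies to the cases
# the model predicted correctly.

accuracy = 0.95        # fraction of cases with the right RA prediction
bad_reasoning = 0.68   # fraction of that reasoning judged incorrect by experts

right_for_wrong_reasons = accuracy * bad_reasoning        # ~65% of all cases
right_for_right_reasons = accuracy * (1 - bad_reasoning)  # at most ~30% of all cases

print(f"Right answer, wrong reasoning: ~{right_for_wrong_reasons:.0%} of cases")
print(f"Right answer, reasoning that held up: at most ~{right_for_right_reasons:.0%}")
```

Even on a charitable reading, “right answer for reasons that held up” tops out at roughly three in ten cases.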

The remaining 32% is just as likely to have been gamed, but there’s no way of knowing. As we’ve explained before, when an AI generates an explanation for its reasoning, it’s as likely to hallucinate that reasoning (including so-called chain-of-thought, or CoT, methods) as it is to hallucinate any other output.

Think of it like someone flipping a coin 99 times and then asking you to predict whether it will land on heads or tails next, along with your reasoning for that prediction. You could say “heads, because I feel lucky” or “tails, because it landed on heads last time.” 

But whether you’re right or wrong doesn’t matter, and neither does your reasoning. The odds remain the same no matter what (with a fair coin, it’s always 50/50). It’s the same with AI.
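If you want to see that for yourself, here’s a small simulation sketch. We use streaks of five heads rather than 99, because 99-head streaks are too rare to sample in a reasonable run, but the independence argument is identical:

```python
import random

# Toy check of the coin-flip analogy: a fair coin has no memory, so the
# flip that follows a streak of heads still comes up heads about half
# the time. (Streaks of 5 instead of 99 because 99-head streaks are
# astronomically rare; the math is the same.)
random.seed(0)

def heads_rate_after_streak(streak: int, n_flips: int = 1_000_000) -> float:
    """Empirical P(heads) on the flip immediately after `streak` heads in a row."""
    flips = [random.random() < 0.5 for _ in range(n_flips)]
    run = following = heads = 0
    for i in range(len(flips) - 1):
        run = run + 1 if flips[i] else 0
        if run >= streak:          # just observed `streak` heads in a row
            following += 1
            heads += flips[i + 1]  # record the very next flip
    return heads / following

print(heads_rate_after_streak(5))  # ~0.5 -- the streak buys you nothing
```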

You could run eight billion tests demonstrating that a system only outputs hallucinations once in every billion outputs, and the odds that the next output will contain hallucinations will still be functionally 50/50.

I’m sure a math expert would eviscerate that logic with statistical analysis, but we’re not discussing a knowable quantity and, frankly, statistical analysis is what’s led people to believe that machines are capable of novel outputs or that they contain enough data to operate as oracles. 

These assumptions aren’t just silly, they’re potentially harmful.

In the medical world, percentages are shortcuts. If you have 20 patients in a row with the flu, and your 21st patient exhibits flu symptoms, that doesn’t mean you can assume they have influenza without following diagnostic protocol. No review board in the world would accept that as an excuse for a poor patient outcome due to misdiagnosis.
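To put some hedged, made-up numbers on why the protocol matters, here’s a quick Bayes’-rule sketch. The prevalence and test figures below are illustrative assumptions, not data from either study:

```python
# Illustrative Bayes'-rule sketch. The prevalence and test characteristics
# are assumptions made up for this example, not figures from either study.

prior_flu = 0.80          # assumed "it's probably flu" prior implied by the recent streak
sens, spec = 0.90, 0.95   # assumed rapid-test sensitivity and specificity

# Diagnose on the streak alone and you're wrong for every non-flu patient.
error_without_test = 1 - prior_flu             # 20%

# Follow the protocol: update on a positive test result with Bayes' rule.
p_positive = sens * prior_flu + (1 - spec) * (1 - prior_flu)
posterior_flu = sens * prior_flu / p_positive  # ~99%

print(f"Misdiagnosis rate if you skip the protocol: {error_without_test:.0%}")
print(f"P(flu | positive test): {posterior_flu:.0%}")
```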

The reality of AI diagnosis

AI doesn’t think, rationalize, or reason. It does what it’s trained to do, and it only does that when it’s forced to through prompting. Once prompted, it doesn’t “consider” the problem or “evaluate” the request (no matter what the so-called “reasoning” models tell you). It just executes the single, simple function it was designed for: it predicts the next token.
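If you’ve never seen what “it predicts the next token” looks like in practice, here’s a deliberately toy sketch. The bigram table and greedy decoding are our stand-ins, not any vendor’s internals; a production LLM replaces the lookup with a neural network over billions of parameters, but the generation loop has the same shape:

```python
# Toy sketch of next-token prediction (ours, not any vendor's internals).
# The "model" is just a lookup table of learned statistics; real LLMs use
# a neural network, but the loop -- score continuations, pick one, append,
# repeat -- is the same shape.

bigram_probs = {  # toy P(next word | current word)
    "patient": {"presents": 0.6, "denies": 0.4},
    "presents": {"with": 1.0},
    "with": {"joint": 0.7, "fever": 0.3},
    "joint": {"pain": 1.0},
}

def generate(prompt: str, max_tokens: int = 4) -> str:
    tokens = prompt.split()
    for _ in range(max_tokens):
        table = bigram_probs.get(tokens[-1])
        if not table:
            break  # no statistics for this context; nothing left to predict
        # Greedy decoding: take the single most probable next token.
        tokens.append(max(table, key=table.get))
    return " ".join(tokens)

print(generate("patient"))  # -> "patient presents with joint pain"
```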

If an AI system is trained to find rheumatoid arthritis, it isn’t actually trying to diagnose a disease. It’s trying to complete the task that satisfies its “reward” function. And the easiest way to satisfy that function is to memorize the answers to the test.

In real-world medical diagnostics, there are no ground truth answers to check against. These models aren’t trained on patients with active cases to see how things turn out. They’re trained on databases full of past cases that already have resolutions. This allows developers to conduct test runs and evaluate results in a single experimental paradigm.

If you give an AI enough data, it learns to build scaffolding between the answers most associated with its “reward” and the data most likely to lead to that reward. 
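Here’s a contrived sketch of what that scaffolding can look like. The dataset and the “model” are entirely made up: if a spurious artifact in the training data, say a referral template phrase, happens to correlate with the label, then keying on the artifact earns the reward without learning anything about the disease:

```python
# Contrived shortcut-learning sketch. The notes, labels, and "model" are
# entirely made up for illustration -- nothing here comes from either paper.
# Suppose RA-positive training records happen to come from a clinic whose
# notes always contain the template phrase "rheum referral".

train = [
    ("rheum referral; morning stiffness, swollen MCP joints", 1),
    ("rheum referral; positive anti-CCP, symmetric synovitis", 1),
    ("knee pain after a fall, no systemic symptoms", 0),
    ("seasonal allergies, joints clear", 0),
]

test = [
    ("rheum referral; mechanical back pain only", 0),                 # referred, but not RA
    ("morning stiffness, swollen MCP joints, positive anti-CCP", 1),  # RA, but no referral phrase
]

def shortcut_model(note: str) -> int:
    """'Predict' RA purely from the spurious template phrase."""
    return int("rheum referral" in note)

train_acc = sum(shortcut_model(n) == y for n, y in train) / len(train)
test_acc = sum(shortcut_model(n) == y for n, y in test) / len(test)
print(f"Training 'accuracy': {train_acc:.0%}")       # 100% -- reward earned
print(f"Accuracy off the shortcut: {test_acc:.0%}")  # 0% -- the reward was gamed
```

The shortcut aced its training reward and learned nothing a rheumatologist would recognize as diagnosis. Swap the referral phrase for any statistical regularity buried in a benchmark and you get the same failure mode at scale.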

At this point, you might be thinking that’s exactly what humans do when they learn medicine. And you’d be very wrong. Human reasoning has, so far, proven impossible to reduce to a binary description.

Human doctors don’t game rewards to arrive at a diagnosis. They also don’t use predictive reasoning to determine outcomes before testing their hypothesis and observing the results of protocol. 

Modern medicine is not a guessing, predicting, or memorization game. It cannot be quantified in a static testing environment. Doctors are not made in the first two years of medical school. 

When doctors are rewarded for giving a specific diagnosis, regardless of whether their reasoning makes sense, that’s malpractice and often results in harm. 

The notion that an AI system could be 95% accurate is also harmful. If you were judging which care provider to trust with your health, and one bragged about its AI doctors’ 95% accuracy while the other just said “we have human doctors,” which one would you be more likely to pick? 

Read more: Scientists suspect that ‘reasoning models don’t always say what they think’

Art by Nicole Greene
