
“Hallucination rate” sounds like a clean engineering metric. The kind you’d put on a slide, right next to uptime and latency, with a comforting arrow pointing down. In practice, it’s closer to an argument that never ends—because people use the phrase to mean three different things at once: how often a model invents facts, how often it’s wrong when it answers, and how often it refuses to answer at all.
That’s why this piece exists. Not to sell you a single number—because there isn’t one—but to translate the numbers that do exist into something usable. Some evaluations measure how faithfully a model summarizes a document you gave it. Others measure how often it fabricates an answer when it should have said, “I don’t know.” And the most revealing results don’t just expose models; they expose incentives. If a system is rewarded for always producing something, “bluffing” stops being a failure mode and becomes an operating strategy.
So think of this as a field guide to the bluff rate: what it measures, what it doesn’t, and why the same model can look almost reliable in one benchmark and wildly unreliable in another—without changing a single parameter.
On a task where the model must summarize a short document it is given (so there is a clear reference text), Vectara’s hallucination leaderboard reports the best listed model at a 1.8% hallucination rate (last updated December 18, 2025). This number is specifically “how often the summary introduces information not supported by the source text,” not “how often the model lies in open chat.”
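A metric like that is usually computed claim by claim against the source document. The sketch below is not Vectara’s pipeline; `claim_supported` is a hypothetical judge (in practice an NLI model or an LLM grader), and the point is only how the rate rolls up: a summary counts as hallucinated if even one of its claims has no support in the source.

```python
from typing import Callable

def summary_hallucination_rate(
    pairs: list[tuple[str, list[str]]],           # (source document, claims extracted from its summary)
    claim_supported: Callable[[str, str], bool],  # hypothetical judge: (source, claim) -> supported?
) -> float:
    """Fraction of summaries that introduce at least one unsupported claim."""
    hallucinated = sum(
        1 for source, claims in pairs
        if any(not claim_supported(source, claim) for claim in claims)
    )
    return hallucinated / len(pairs)
```

That grounding is also why a number like 1.8% can coexist with far worse results on open-ended factual QA: the model is being checked against a document it was handed, not against everything it thinks it knows.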
On the PreciseWikiQA task, HalluLens reports GPT-4o at a 45.15% hallucination rate when it does not refuse, while also noting that GPT-4o rarely refuses in that setup (a 4.13% refusal rate). That pairing illustrates the core trade-off: fewer refusals can correlate with more fabricated answers under uncertainty.
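Because that 45.15% is conditioned on answering, the two figures combine. Assuming both rates are measured over the same question set, the share of all questions that receive a fabricated answer is roughly (1 − 0.0413) × 0.4515, or about 43%. A quick sketch:

```python
def overall_fabrication_share(refusal_rate: float, halluc_rate_when_answering: float) -> float:
    """Share of *all* questions that get a fabricated answer, assuming the
    conditional hallucination rate is measured only over non-refused responses."""
    return (1.0 - refusal_rate) * halluc_rate_when_answering

# GPT-4o on PreciseWikiQA, using the HalluLens figures quoted above.
print(overall_fabrication_share(0.0413, 0.4515))  # ~0.433
```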
OpenAI’s 2025 analysis includes a SimpleQA example where o4-mini shows 1% abstention and 75% error rate, while gpt-5-thinking-mini shows 52% abstention and 26% error rate—a concrete demonstration that optimizing for “never say ‘I don’t know’” can drive dramatically higher wrong-answer rates.
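The incentive problem drops straight out of the arithmetic. If abstentions, errors, and correct answers partition the questions, o4-mini is right on roughly 24% of them and gpt-5-thinking-mini on roughly 22%. An accuracy-only leaderboard therefore prefers the model that always guesses; any scheme that docks points for confident wrong answers prefers the one that abstains. A sketch, with the one-point penalty as an illustrative choice rather than OpenAI’s official rubric:

```python
def accuracy_only(correct: float, error: float, abstain: float) -> float:
    # Classic leaderboard scoring: wrong answers and abstentions both score 0,
    # so only the correct-answer rate matters.
    return correct

def penalized(correct: float, error: float, abstain: float, penalty: float = 1.0) -> float:
    # Dock `penalty` points per wrong answer; abstentions still score 0.
    return correct - penalty * error

o4_mini   = dict(correct=0.24, error=0.75, abstain=0.01)
gpt5_mini = dict(correct=0.22, error=0.26, abstain=0.52)

print(accuracy_only(**o4_mini), accuracy_only(**gpt5_mini))  # 0.24 vs 0.22: guessing "wins"
print(penalized(**o4_mini), penalized(**gpt5_mini))          # -0.51 vs -0.04: guessing loses badly
```

The ranking flips without either model changing at all; only the scoring rule moved.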
SimpleQA is intentionally built to be easy to grade and hard to “game”: it contains 4,326 short, fact-seeking questions, uses human trainers to create and independently verify answers, and grades each response as correct / incorrect / not attempted. That design makes it useful for measuring “does the model know what it knows?” rather than long-form fluency.
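That three-way grading is what makes abstention visible in the final numbers. The roll-up below is a minimal sketch, not SimpleQA’s official aggregation script; it just shows the two figures the design supports: overall accuracy, and accuracy on the questions the model actually attempted.

```python
def simpleqa_style_metrics(grades: list[str]) -> dict[str, float]:
    """Summarize per-question grades of 'correct' / 'incorrect' / 'not_attempted'."""
    n = len(grades)
    correct = grades.count("correct")
    attempted = n - grades.count("not_attempted")
    return {
        "overall_correct": correct / n,
        "attempted_rate": attempted / n,
        # The figure that rewards knowing what you know: abstentions are not
        # silently counted as failures.
        "correct_given_attempted": correct / attempted if attempted else 0.0,
    }

print(simpleqa_style_metrics(["correct", "incorrect", "not_attempted", "correct"]))
# {'overall_correct': 0.5, 'attempted_rate': 0.75, 'correct_given_attempted': 0.666...}
```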
These figures are all legitimate—but they measure different things. “Hallucination rate” is not one global constant; it is a function of task design, grounding, and whether the system is allowed (and incentivized) to abstain.
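To make that concrete, here is a minimal sketch, with entirely made-up tallies, of how one graded run yields three different “hallucination rates” depending on which of the three definitions from the top of this piece you apply.

```python
from dataclasses import dataclass

@dataclass
class EvalTallies:
    """Hypothetical grading of one model over one benchmark run."""
    fabricated: int  # answers containing invented facts
    wrong: int       # answers that are incorrect for any reason (fabrication included)
    refused: int     # responses where the model declined to answer
    total: int       # questions asked

def fabrication_rate(t: EvalTallies) -> float:
    # Definition 1: how often the model invents facts, over all questions.
    return t.fabricated / t.total

def error_rate_when_answering(t: EvalTallies) -> float:
    # Definition 2: how often it is wrong when it does answer.
    answered = t.total - t.refused
    return t.wrong / answered if answered else 0.0

def refusal_rate(t: EvalTallies) -> float:
    # Definition 3: how often it refuses to answer at all.
    return t.refused / t.total

t = EvalTallies(fabricated=90, wrong=150, refused=200, total=1000)
print(fabrication_rate(t), error_rate_when_answering(t), refusal_rate(t))
# 0.09 0.1875 0.2 -- three defensible "hallucination rates" for a single run
```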
If you made it this far, you’ve earned the one sentence that actually matters: the bluff rate isn’t a mysterious flaw hiding in the model—it’s a predictable output of how we train, evaluate, and productize these systems.
When an assistant is punished for refusing, it learns to guess. When it’s rewarded for fluency, it learns to smooth uncertainty into confidence. And when humans are the consumers of that confidence—especially in legal, medical, financial, or policy contexts—the cost of a “helpful” wrong answer isn’t just embarrassment. It becomes rework, liability, reputational damage, and a slow erosion of trust that no UI redesign can fix.
The practical takeaway isn’t “never use chatbots.” It’s to stop demanding the one behavior that makes them most dangerous: always answering. If you want fewer hallucinations, you don’t only need better models. You need better incentives—benchmarks that reward calibrated uncertainty, products that allow abstention without shame, and workflows where claims are grounded, verifiable, and treated like claims, not vibes.
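For the “benchmarks that reward calibrated uncertainty” part, one concrete scheme (illustrative, not any benchmark’s official rubric) is a confidence-threshold rule: +1 for a correct answer, 0 for abstaining, and a penalty of t/(1−t) for a wrong one. Under that rule, answering only has positive expected value when the model’s chance of being right exceeds t, so “I don’t know” below the threshold is the rational move rather than a failure.

```python
def expected_score(p_correct: float, t: float, answer: bool) -> float:
    """Expected score under a confidence-threshold rule with target confidence t:
    +1 for a correct answer, -t/(1-t) for a wrong one, 0 for abstaining."""
    if not answer:
        return 0.0
    wrong_penalty = t / (1.0 - t)
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

# With t = 0.75, a model that is 60% sure should abstain; one that is 90% sure should answer.
print(expected_score(0.60, t=0.75, answer=True))  # -0.6: guessing is a losing bet
print(expected_score(0.90, t=0.75, answer=True))  # +0.6: answering pays off
```

The penalty scales with the confidence target: demand 90% confidence and a wrong answer costs nine times what a right one earns.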
The bluff rate isn’t destiny. But pretending it’s a single number you can average away is.