The Paper Nobody Wanted to Write
In September 2025, OpenAI published a research paper with a conclusion that should have been front-page news: hallucinations in large language models are mathematically inevitable. Not “currently difficult.” Not “solvable with more data.” Inevitable.
The paper demonstrated that even with perfect training data — no errors, no contradictions, no gaps — the fundamental architecture of how these models generate text guarantees they will confidently state things that aren't true.
This isn't a fixable bug. It's load-bearing math.
Why It's Unfixable (In 60 Seconds)
Language models work by predicting the next word in a sequence, based on probabilities. Every word is a small bet. String enough small bets together and the error compounds.
OpenAI proved the math: the error rate for generating a full sentence is at least twice the error rate the same model would have on a simple yes/no question. Each word prediction introduces a chance of drift, and drift accumulates into hallucination.
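To see how fast small bets compound, here's a minimal sketch. It assumes independent per-token errors and a hypothetical per-token error rate — a simplification the paper's actual bound doesn't rely on, but the decay it shows is the same intuition:

```python
# Sketch: per-token error compounds across a sequence.
# `eps` is a hypothetical per-token chance of drifting off-course;
# independence between tokens is an assumption made for illustration.
def p_fully_correct(eps: float, n_tokens: int) -> float:
    """Probability that every one of n_tokens predictions is correct."""
    return (1 - eps) ** n_tokens

# Even a 2% per-token error rate collapses over a long answer:
for n in (10, 50, 200):
    print(n, round(p_fully_correct(0.02, n), 3))
```

At 200 tokens, a model that is right 98% of the time per word produces a fully correct answer less than 2% of the time.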
Think of it like a game of telephone — except the telephone is also making up new words.
There's a second problem. The less frequently a fact appears in training data, the more likely the model is to hallucinate when asked about it. The researchers tested this by asking state-of-the-art models for the birthday of one of the paper's own authors. DeepSeek-V3 confidently provided three different wrong dates across three attempts. The correct date is in autumn. It guessed March, June, and January.
Not even close. Three times. With full confidence each time.
The Benchmark Conspiracy
Here's where it gets really dark.
The researchers examined ten major AI benchmarks — the tests that Google, OpenAI, and every AI lab use to prove their model is better than the competition. Nine out of ten use binary grading: right or wrong, 1 or 0.
Under binary grading, saying “I don't know” scores the same as being completely wrong. Zero points either way.
The math is brutal: abstaining guarantees zero points, while a guess with any chance at all of being right has a positive expected score. The optimal test-taking strategy is to never say "I don't know."
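The incentive fits in four lines. This is a toy model of binary grading, not any benchmark's actual scoring code:

```python
# Toy model of binary (1/0) grading: right = 1 point, wrong or "I don't know" = 0.
def expected_score(p_correct: float, abstain: bool) -> float:
    """Expected benchmark score for one question under binary grading."""
    return 0.0 if abstain else p_correct

# Even a wild guess with a 10% chance of being right beats honest abstention:
assert expected_score(0.10, abstain=False) > expected_score(0.10, abstain=True)
```

No matter how small `p_correct` gets, guessing never scores worse than abstaining — so a model optimized against these benchmarks learns to always answer.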
We have built an entire industry around evaluation systems that actively punish honesty.
Every model climbing the leaderboards has been optimized to guess rather than express uncertainty. Not because the engineers are careless — because the incentive structure demands it.
The Fix That Would Kill the Product
OpenAI's proposed solution is elegant: let models assess their own confidence before answering. Set a threshold — say 75% — and if the model isn't confident enough, it says “I don't know.”
The math works. Hallucinations drop dramatically.
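Mechanically, the proposal looks something like the sketch below. The candidate answers, their confidence scores, and the 75% threshold are illustrative — how a model estimates its own confidence is the hard part, and it's assumed away here:

```python
# Sketch of threshold-based abstention. `candidates` maps possible answers
# to model-estimated confidences (the estimation itself is assumed here).
def answer_or_abstain(candidates: dict[str, float], threshold: float = 0.75) -> str:
    """Return the most confident answer, or abstain below the threshold."""
    best_answer, confidence = max(candidates.items(), key=lambda kv: kv[1])
    if confidence < threshold:
        return "I don't know"
    return best_answer

# Three plausible-but-uncertain birthdays, none above 0.75: the system abstains
# instead of confidently picking one, unlike the DeepSeek-V3 example above.
print(answer_or_abstain({"March 5": 0.31, "June 12": 0.28, "January 3": 0.22}))
```

Note what this costs: evaluating multiple candidates per query is exactly the extra compute described below.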
The problem? If ChatGPT started saying “I don't know” to 30% of queries — a conservative estimate based on the paper's analysis of factual uncertainty in training data — users would leave in droves. People have been trained to expect confident answers to everything. An honest AI feels like a broken AI.
There's also the compute cost. Uncertainty-aware inference requires evaluating multiple possible responses and estimating confidence levels. For a system handling millions of queries per day, that's a dramatic increase in operational cost.
So we're stuck. The fix exists but destroys the user experience. The user experience exists but guarantees hallucinations. And no amount of scaling, training, or fine-tuning changes the underlying math.
What This Means for Business AI
If you're using AI for casual queries — “write me a poem” or “explain quantum physics like I'm five” — none of this matters much. Hallucinations in creative or educational contexts are annoying but not catastrophic.
If you're using AI for business operations — sending emails to customers, routing orders to sales reps, managing inventory, making financial calculations — this paper should keep you up at night.
Because it means your AI will never stop making things up. Not with GPT-6. Not with GPT-10. The architecture guarantees it.
The question isn't “how do we make AI stop hallucinating?” The question is: “what do we build around a system that will always hallucinate?”
The Precondition Pattern
In January 2026, a Stanford researcher named Shuhui Qu published a paper called “Teaching LLMs to Ask” that proposed something counterintuitive: instead of trying to make models more accurate, make them explicitly track what they know versus what they don't.
The framework — called SQ-BCP — labels every precondition of every action as one of three things:
- ✅ Satisfied — confirmed true, proceed
- ❌ Violated — confirmed false, stop
- ❓ Unknown — not confirmed, do not proceed until resolved
When something is unknown, the system does one of two things: asks a targeted question to get the answer, or proposes a “bridging action” that establishes the missing condition before continuing.
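A minimal sketch of the pattern follows. The three labels come from the paper; everything else — the function names, the dict-based source of truth, the gating logic — is an assumed implementation shape, not SQ-BCP's actual code:

```python
from enum import Enum

class Precondition(Enum):
    SATISFIED = "satisfied"  # confirmed true -> proceed
    VIOLATED = "violated"    # confirmed false -> stop
    UNKNOWN = "unknown"      # not confirmed -> resolve before acting

def check_precondition(fact_key: str, source_of_truth: dict) -> Precondition:
    """Classify one precondition against a source of truth.
    A missing key is UNKNOWN -- never a guess."""
    if fact_key not in source_of_truth:
        return Precondition.UNKNOWN
    return Precondition.SATISFIED if source_of_truth[fact_key] else Precondition.VIOLATED

def gate(preconditions: dict[str, Precondition]) -> str:
    """Decide whether an action may fire given its labeled preconditions."""
    if Precondition.VIOLATED in preconditions.values():
        return "stop"
    if Precondition.UNKNOWN in preconditions.values():
        return "ask"  # targeted question, or a bridging action that resolves it
    return "proceed"
```

The key property: an unconfirmed fact can never silently become a confirmed one. It has to pass through "ask" first.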
In testing, this reduced constraint violations from 26% to under 6%. Not by making the model smarter — by making the workflow smarter.
You don't fix a liar by making them smarter. You fix a liar by requiring verification before anyone acts on what they say.
What This Looks Like in Practice
Take a common scenario: an AI agent that sends delivery notifications to the correct sales rep when a customer's shipment arrives. The system has order data, customer IDs, tracking numbers, and rep assignments.
The naive approach lets the AI resolve missing rep assignments by pattern-matching across historical data. It correlates customer appearances, infers mappings, and routes emails accordingly. Every individual step looks clean. The pipeline runs. The emails are well-formatted.
But inference is not verification. The AI treated a statistical correlation as a confirmed fact. Without precondition checking, there's nothing in the system that distinguishes “I looked this up in the source of truth” from “I made an educated guess based on patterns.”
The precondition pattern catches this before anything fires. Each rep assignment gets classified: ✅ confirmed from the CRM, or ❓ unknown — inferred but unverified. The system surfaces the distribution before sending: “12 to Ralph, 8 to Abbey, 76 to Troy.” A human glances at that and immediately asks: “Why does Troy have 76?”
That one question — triggered by a distribution check, not a sample review — prevents the entire batch from going out wrong.
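The distribution check itself is trivial to automate. Here's a sketch using the article's numbers; the 50% flag threshold is an assumption, tune it to your batch sizes:

```python
from collections import Counter

def distribution_check(assignments: list[str], max_share: float = 0.5):
    """Roll up assignments and flag anyone holding more than max_share
    of the batch -- the shape check a human would do by eye."""
    counts = Counter(assignments)
    total = len(assignments)
    flagged = [name for name, n in counts.items() if n / total > max_share]
    return counts, flagged

# The batch from the scenario above: 12 to Ralph, 8 to Abbey, 76 to Troy.
batch = ["Ralph"] * 12 + ["Abbey"] * 8 + ["Troy"] * 76
counts, flagged = distribution_check(batch)
# Troy holds 76 of 96 records -- the outlier that stops the send.
```

Ten lines of code, and the "Why does Troy have 76?" question gets asked automatically, every batch, before anything fires.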
The fix is never a better model. It's a better workflow around the model.
The Five Rules
After building dozens of AI-powered business systems, here's what actually works for operating in a world where your AI will never stop hallucinating:
- Source of truth verification. Never let an AI infer business-critical data from patterns. Look it up directly. If the source doesn't exist, that's the problem to solve — not something the AI should guess at.
- Distribution checks before batch actions. Before any mass operation, show the rollup. A sample of 5 looks fine. The distribution of 139 reveals the outlier. Data problems hide in the shape, not the sample.
- Explicit unknown handling. Every precondition is Satisfied, Violated, or Unknown. If more than 10% of your records have unknowns, stop. Don't default. Don't guess. Resolve.
- Blast radius limits. Cap the damage. Send 10 before sending 1,000. If something's wrong, you find out with a bruise instead of a broken bone.
- Post-execution verification. Just because it ran doesn't mean it worked. Verify the output against the goal — not just that the process completed, but that the result is correct.
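Rules four and five compose naturally into one wrapper. This is a hedged sketch — `send_fn` and `verify_fn` are hypothetical callables standing in for your actual delivery and verification logic:

```python
def staged_send(records: list, send_fn, verify_fn, pilot_size: int = 10):
    """Blast radius limit + post-execution verification:
    send a small pilot, verify the results, and only then
    release the rest of the batch."""
    pilot, rest = records[:pilot_size], records[pilot_size:]
    sent = [send_fn(r) for r in pilot]
    # Verify the pilot's *output*, not just that send_fn returned.
    if not all(verify_fn(result) for result in sent):
        raise RuntimeError("Pilot failed verification; halting full batch")
    return sent + [send_fn(r) for r in rest]
```

If the pilot is wrong, you find out after 10 sends instead of 1,000 — the bruise instead of the broken bone.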
The Uncomfortable Truth About the AI Industry
The OpenAI paper ends with an admission that should make every business leader uncomfortable: the business incentives driving consumer AI development are fundamentally misaligned with reducing hallucinations.
Users want confident answers. Benchmarks reward guessing. Compute costs favor speed over accuracy. And honest AI feels broken to most people.
This means the models you're buying from OpenAI, Anthropic, Google, and everyone else are optimized to be confidently wrong. Not because the companies are evil — because that's what the market rewards.
For consumer chatbots, this is a nuisance. For business operations, it's a liability.
The companies that win with AI won't be the ones with the best models. They'll be the ones with the best verification architecture — the systems built around models that assume the model will be wrong, and catch it before it matters.
The math says your AI will always lie. The question is whether you're building a system that catches the lies — or one that sends them to your customers at scale.
Sources: OpenAI, “Why Language Models Hallucinate” (Sept 2025) · Qu, S., “Teaching LLMs to Ask: Self-Querying Category-Theoretic Planning for Under-Specified Reasoning”, Stanford (Jan 2026)