The Design Assumption
The strongest systems are built on the ugliest assumption: every agent will eventually produce a wrong, malformed, or adversarial output.
Not "might." Will. The question is not whether your financial agent will hallucinate a number. The question is whether your architecture catches it before it reaches a spreadsheet, an email, or a decision.
This is not pessimism. It is the same design philosophy that makes airplane autopilots, nuclear reactors, and banking systems reliable. They do not trust any single component. They verify. They cross-check. They fail safe.
"Resilience is structural, not prompting-dependent. You cannot prompt your way out of an architecture that trusts its own outputs."
The Five Layers of Trust
Layer 1: Structured Outputs – Make Failures Loud
The worst failure mode in AI is the silent one. The agent returns plausible-looking text that contains a wrong number, a hallucinated name, or a fabricated citation. Nobody catches it because it looks right.
The fix is structural: require typed, schema-validated outputs.
When an agent returns {"revenue": null} instead of a number, that is a loud failure: your system catches it, logs it, and escalates. When an agent returns "revenue was about $420K," that is a silent failure: it flows downstream, looking correct, until someone notices the real number was $380K.
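A minimal sketch of this layer using only the Python standard library; the schema, field names, and figures are illustrative, not from a real system:

```python
import json

# Hypothetical schema: each required field maps to its allowed types.
REVENUE_SCHEMA = {"revenue": (int, float)}

class SchemaViolation(Exception):
    """Raised loudly instead of letting a malformed value flow downstream."""

def validate_output(raw: str, schema: dict) -> dict:
    """Parse an agent's raw JSON output and enforce the schema."""
    data = json.loads(raw)
    for field, allowed_types in schema.items():
        value = data.get(field)
        if not isinstance(value, allowed_types):
            # Loud failure: raise and escalate -- never pass it on.
            raise SchemaViolation(
                f"{field!r} must be one of {allowed_types}, got {value!r}"
            )
    return data

validate_output('{"revenue": 380000}', REVENUE_SCHEMA)    # passes
try:
    validate_output('{"revenue": null}', REVENUE_SCHEMA)  # fails loudly
except SchemaViolation as exc:
    print("caught:", exc)
```

The point is that `null` becomes an exception at the boundary, not a plausible-looking sentence three systems later.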
Layer 2: Assumption Echoing – Verify Before Acting
Before any agent takes an irreversible action, it must echo back its key assumptions in a structured format:
- "I am about to send an email to 7 sales reps with last week's performance data."
- "I am using revenue data from ns_orders where status = 'Billed' and date range = 2026-03-01 to 2026-03-07."
- "I am excluding House Account and eBay/Shopify so_origin orders."
The orchestrator (or human) validates the assumptions before the action executes. This is not a prompt technique; it is a mandatory checkpoint in the execution pipeline.
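The checkpoint can be sketched as a hard gate in the pipeline. The field names and expected values below are hypothetical, loosely mirroring the echoed assumptions above:

```python
class AssumptionMismatch(Exception):
    """Any mismatch blocks the irreversible action before it executes."""

def checkpoint(echoed: dict, expected: dict) -> None:
    """Compare the agent's echoed assumptions to what the orchestrator expects."""
    for key, want in expected.items():
        got = echoed.get(key)
        if got != want:
            raise AssumptionMismatch(
                f"{key}: agent assumed {got!r}, orchestrator expected {want!r}"
            )

# Structured assumptions the agent echoes back before acting.
echoed = {
    "action": "send_email",
    "recipients": 7,
    "status_filter": "Billed",
    "date_range": ("2026-03-01", "2026-03-07"),
}
expected = {
    "status_filter": "Billed",
    "date_range": ("2026-03-01", "2026-03-07"),
}
checkpoint(echoed, expected)  # passes: the action may proceed
```

If the agent had echoed `status_filter = "Pending"`, the exception fires and the email never goes out.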
Layer 3: Critic/Verifier Pattern – Two Models, One Truth
The actor agent produces output. A separate verifier agent, ideally running a different model or a different prompt, checks the output against source data. They must agree before the output is accepted.
This is expensive: you are running inference twice. Use it selectively: financial reports, outbound communications, anything where a wrong output has real-world consequences. For internal logging? Not worth it. For a P&L that the CEO reads? Absolutely worth it.
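A sketch of the pattern with plain functions standing in for two separately prompted models; the data and figures are illustrative:

```python
# Ground-truth source data the verifier checks against (illustrative).
SOURCE_DATA = {"revenue": 380_000}

def actor(task: str) -> dict:
    """Stand-in for the actor model's inference call."""
    return {"revenue": 380_000}

def verifier(output: dict, source: dict) -> bool:
    """Independent check against source data, not the actor's own reasoning."""
    return output.get("revenue") == source["revenue"]

draft = actor("weekly revenue report")
if verifier(draft, SOURCE_DATA):
    accepted = draft  # both agree: the output is released downstream
else:
    raise RuntimeError("actor/verifier disagreement -- escalate to a human")
```

The design choice that matters is that `verifier` reads the source data directly; it never sees or reuses the actor's chain of reasoning.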
Layer 4: Blast Radius Containment – Constrain the Damage
Before deploying any agent, answer one question: what is the worst thing this agent can do if it goes completely rogue?
Then constrain its permissions to that boundary:
- Read-only (dashboards, analysis, reporting): minimal risk, no gate needed.
- Internal writes (databases, files, internal messages): logged and reversible.
- External actions (emails, payments, third-party API calls): gated behind an explicit checkpoint.
The rule is simple: an agent should never have the permissions to cause more damage than a human is willing to clean up. If the cleanup cost exceeds the automation value, add a checkpoint.
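One way to enforce the boundary is an ordered permission tier that an agent can never exceed. The tier names below are my own labels for the read-only / internal-write / external-action split, not from a real framework:

```python
from enum import Enum

class Tier(Enum):
    READ_ONLY = 1        # dashboards, analysis: free to run
    INTERNAL_WRITE = 2   # logged, reversible writes
    EXTERNAL_ACTION = 3  # emails, payments: gated

class PermissionDenied(Exception):
    """Raised when an action exceeds the agent's granted tier."""

def authorize(agent_tier: Tier, action_tier: Tier) -> None:
    """An agent may never perform an action above its granted tier."""
    if action_tier.value > agent_tier.value:
        raise PermissionDenied(
            f"agent granted {agent_tier.name} cannot perform {action_tier.name}"
        )

authorize(Tier.INTERNAL_WRITE, Tier.READ_ONLY)  # allowed
try:
    authorize(Tier.INTERNAL_WRITE, Tier.EXTERNAL_ACTION)  # blocked
except PermissionDenied as exc:
    print("blocked:", exc)
```

The tier is set at deploy time from the worst-case answer, so a rogue agent is structurally incapable of exceeding it.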
Layer 5: Human-in-the-Loop – The Ultimate Circuit Breaker
For high-stakes decisions, no amount of automated verification replaces a human looking at the output and saying "yes, send it."
This is not a weakness of the system. It is the system working as designed. The hub-and-spoke model exists precisely so that human consensus, the one node in the network you can trust, remains the final authority on anything that matters.
The art is knowing where to place the checkpoint. Too many and you have a human babysitting every API call. Too few and a hallucinated email reaches a client. The blast radius framework above draws the line: read-only is free, internal writes are logged, external actions are gated.
"The goal is not zero human involvement. The goal is human involvement only at the moments where human judgment actually matters."
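The gated tier above can be sketched as an approval queue: external actions wait until a human explicitly approves them, and nothing executes by default. The class and queue structure are a hypothetical illustration:

```python
class PendingAction:
    """An external action held at the checkpoint until a human approves it."""

    def __init__(self, description, execute):
        self.description = description
        self._execute = execute
        self.approved = False

    def approve(self):
        """Human says 'yes, send it': only now does the action run."""
        self.approved = True
        return self._execute()

review_queue = []

def gate(description, execute):
    """Enqueue an external action for human review instead of running it."""
    action = PendingAction(description, execute)
    review_queue.append(action)
    return action

sent = []
action = gate("Email weekly report to 7 sales reps", lambda: sent.append("report"))
assert sent == []         # the gate held: nothing has happened yet
action.approve()          # a human reviewed the draft and released it
assert sent == ["report"]
```

The default is inaction: if no human ever approves, the action simply never runs, which is the fail-safe direction.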
The Anti-Patterns
Things I see teams do that will eventually destroy them:
- "Our prompts are really good." Prompts are suggestions, not contracts. An LLM can follow a prompt perfectly 99 times and hallucinate on the 100th. Trust architecture is structural, not linguistic.
- "We tested it and it works." You tested twelve scenarios out of 1.5 million possible states. The failure you did not test for is the one that will hit production.
- "The agent can self-correct." An agent that produces wrong output and then evaluates its own output is checking its homework with the same brain that got the answer wrong. Use a separate verifier.
- "We log everything." Logging without alerting is a write-only database. If nobody reads the logs until after the incident, the logs did not help.
Bottom Line
Trust is not a feeling. It is an architecture. You build it in layers: structured outputs that fail loudly, assumption echoing before action, critic verification for high-stakes outputs, blast radius containment for permissions, and human-in-the-loop for anything irreversible.
The system does not trust its agents. It verifies them. Every time. At every layer. And when verification fails, it fails safe: loudly, visibly, and recoverably.
That is not a limitation. That is the whole point.
