Designing agent retries that don't cascade

Mar 12, 2026 · Longform · 4 min read

Most agent demos never fail. Production agents fail constantly — and what the system does on failure is what separates "trustworthy" from "merely impressive."

The default instinct, inherited from web development, is to retry: a step times out, so run it again. With a stateless HTTP GET that's usually safe. With an agent it often isn't, because the step you're retrying may have already changed the world before it reported failure. It sent the email. It posted the ledger entry. It told a downstream system the case was approved. Retry that blindly and you don't recover from the failure — you multiply it.

I call this the cascade: one ambiguous failure, retried at several layers, turning a single missed step into a pile of duplicated side effects nobody asked for. In a demo you never see it. In a regulated environment it's the difference between an audit you pass and one you don't.

Three things make agentic retries harder than the retries most engineers have built before.

Nondeterminism. The same prompt, tools, and inputs can produce a different plan on the second attempt. A retry isn't "do that again" — it's "decide again," and the new decision may not be compatible with the half-finished work the first attempt left behind.

Side effects in the middle. Traditional retries wrap a single idempotent call. An agent step is usually several actions — read, reason, act, write — and the failure can land after the irreversible part.

Blast radius. Agents call other agents. A retry at the top can fan out into retries underneath it, each spawning its own. Three layers that each allow three attempts can run the bottom step up to 27 times.

So the rule I work to is simple: never retry a step you can't prove is safe to repeat. Everything below follows from that.

Make the irreversible parts idempotent. Every action with a side effect carries an idempotency key derived from the task, not the attempt. The second time the agent tries to post that ledger entry, the system recognises the key and returns the original result instead of doing it twice. Unglamorous plumbing, and the single highest-leverage thing you can build.
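A minimal sketch of what that can look like, assuming an in-memory stand-in for the system that receives the side effect. The names `idempotency_key`, `Ledger`, and `post` are hypothetical, chosen for illustration; a real ledger would store keys durably and expire them on its own schedule.

```python
import hashlib


def idempotency_key(task_id: str, step: str) -> str:
    """Derive the key from the task and the step, never from the attempt number."""
    return hashlib.sha256(f"{task_id}:{step}".encode()).hexdigest()


class Ledger:
    """Hypothetical side-effecting system that deduplicates by idempotency key."""

    def __init__(self):
        self.entries = []
        self._results = {}  # key -> result of the first successful post

    def post(self, key: str, entry: dict) -> dict:
        if key in self._results:
            # Seen before: return the original result, do not post again.
            return self._results[key]
        self.entries.append(entry)
        result = {"entry_id": len(self.entries), "entry": entry}
        self._results[key] = result
        return result
```

The key deliberately leaves out the payload. Because a second attempt may produce a slightly different plan, a key that hashed the payload could fail to match and let the duplicate through.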

Classify failures before reacting. A timeout, a rate limit, and a validation error are not the same event. Retryable failures — transient, no committed side effect — get bounded retries with exponential backoff and jitter. Terminal failures — the model refused, the input was malformed, a policy was violated — get escalated, not retried. Most cascades I've seen come from treating a terminal failure as if it were transient.
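One way to sketch the split, under stated assumptions: `RateLimited` is a hypothetical exception type, the mapping of exceptions to classes is illustrative rather than complete, and the `may_have_committed` flag is my shorthand for "a side effect might already have landed." The backoff uses the common "full jitter" variant, a uniform draw up to a capped exponential delay.

```python
import random
import time
from enum import Enum


class FailureClass(Enum):
    RETRYABLE = "retryable"  # transient, no committed side effect
    TERMINAL = "terminal"    # escalate, never retry


class RateLimited(Exception):
    """Hypothetical error raised when a provider throttles the call."""


def classify(exc, may_have_committed=False):
    # An ambiguous failure after a possible side effect is not safe to repeat.
    if may_have_committed:
        return FailureClass.TERMINAL
    if isinstance(exc, (TimeoutError, ConnectionError, RateLimited)):
        return FailureClass.RETRYABLE
    # Refusals, malformed input, policy violations, and anything unrecognised.
    return FailureClass.TERMINAL


def backoff_delay(attempt, base=0.5, cap=30.0):
    # Full jitter: uniform between 0 and the capped exponential delay.
    return random.uniform(0, min(cap, base * 2 ** attempt))


def run_with_retries(step, max_attempts=3, sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return step()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if classify(exc) is FailureClass.TERMINAL or last_attempt:
                raise  # the caller is responsible for escalating
            sleep(backoff_delay(attempt))
```

Anything the classifier does not recognise falls through to terminal. Defaulting the other way is how a terminal failure ends up being treated as transient.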

Bound everything. A maximum attempt count per step, a maximum spend per task, a wall-clock deadline. An agent that can retry without limit isn't resilient; it's a way to turn one stuck task into an unbounded bill.
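The three bounds fit in one small object that every attempt has to pass through. This is a sketch, not a billing system: `TaskBudget`, `BudgetExceeded`, and the dollar-denominated cost are hypothetical names and units, and the caller is assumed to know (or estimate) the cost of an attempt before making it.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when any bound is hit; the task should escalate, not continue."""


@dataclass
class TaskBudget:
    max_attempts_per_step: int
    max_spend_usd: float
    deadline: float  # absolute time on the same clock as `clock`
    spent_usd: float = 0.0
    attempts: dict = field(default_factory=dict)
    clock: Callable[[], float] = time.monotonic

    def charge_attempt(self, step: str, cost_usd: float) -> None:
        """Call before every attempt; raises instead of letting it run."""
        if self.clock() >= self.deadline:
            raise BudgetExceeded("wall-clock deadline passed")
        used = self.attempts.get(step, 0)
        if used >= self.max_attempts_per_step:
            raise BudgetExceeded(f"attempt limit reached for step {step!r}")
        if self.spent_usd + cost_usd > self.max_spend_usd:
            raise BudgetExceeded("spend limit reached for task")
        self.attempts[step] = used + 1
        self.spent_usd += cost_usd
```

Passing the same budget object down to sub-agents is one way to keep nested retries under a single ceiling rather than giving each layer its own.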

Checkpoint between steps. Persist enough state that a retry resumes from the last safe point instead of re-running committed work. This is what lets you retry the part that failed without redoing the parts that succeeded.
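A sketch of the resume logic, assuming steps are named functions run in order and that a plain dict stands in for durable storage. `run_task` and the checkpoint layout are hypothetical; the point is only that a completed step is recorded before the next one starts.

```python
def run_task(steps, checkpoint):
    """Run (name, fn) steps in order, skipping any already recorded as done.

    `checkpoint` is a plain dict standing in for durable storage. Each `fn`
    receives the accumulated state and returns that step's result.
    """
    state = checkpoint.setdefault("state", {})
    done = checkpoint.setdefault("done", [])
    for name, fn in steps:
        if name in done:
            continue  # committed on an earlier attempt, do not re-run
        state[name] = fn(state)
        done.append(name)  # in a real system, persist durably at this point
    return state
```

Calling `run_task` again with the same checkpoint after a failure picks up at the first step not in `done`. A crash between a step's side effect and the checkpoint write is still possible, which is why checkpointing complements idempotency keys rather than replacing them.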

Escalate with context. When retries are exhausted, hand off to a human deliberately. "The agent failed" is not an escalation. "Step 5 of 7 is in an unknown state — a human needs to confirm whether the payment posted" is.
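One way to force that discipline is to make the escalation a structured record that cannot be built without the context. The `Escalation` type and its fields below are hypothetical, a sketch of the shape rather than a schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Escalation:
    task_id: str
    step_index: int
    total_steps: int
    state: str       # e.g. "unknown" or "failed_clean" (illustrative values)
    attempts: int
    last_error: str
    question: str    # the specific thing a human must confirm or decide

    def summary(self) -> str:
        return (
            f"Task {self.task_id}: step {self.step_index} of {self.total_steps} "
            f"is in state '{self.state}' after {self.attempts} attempts. "
            f"Needs a human to: {self.question}"
        )

    def payload(self) -> dict:
        # Full record for the ticket or queue that receives the hand-off.
        return asdict(self)
```

Because every field is required, "the agent failed" with nothing else attached is not a value this type can represent.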

Make all of it observable. Every attempt, its inputs, its classification, and its outcome belong on a trace you can read after the fact. In production the question is never "did it work" — it's "what exactly happened, in what order, and can I prove it." If you can't answer that, you don't have a retry policy. You have hope.
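As a rough sketch, the trace can be an append-only list of attempt records that serialises to JSON lines. `AttemptRecord` and `Trace` are hypothetical names, and the in-memory list is a stand-in for whatever durable log or tracing backend you already run.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AttemptRecord:
    task_id: str
    step: str
    attempt: int
    inputs: dict
    classification: str  # e.g. "success", "retryable", "terminal"
    outcome: str
    timestamp: float


class Trace:
    """Append-only, in-memory trace; a real one would write to durable storage."""

    def __init__(self, clock=time.time):
        self._records = []
        self._clock = clock

    def record(self, task_id, step, attempt, inputs, classification, outcome):
        rec = AttemptRecord(
            task_id, step, attempt, inputs, classification, outcome, self._clock()
        )
        self._records.append(rec)
        return rec

    def for_task(self, task_id):
        # Records come back in the order they were appended.
        return [r for r in self._records if r.task_id == task_id]

    def to_jsonl(self):
        return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in self._records)
```

Recording the failed attempts matters as much as recording the successful one, since the failures are what an auditor will ask about.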

None of this is exotic. It's the discipline I was trained to apply to control systems: assume the component will fail, decide in advance what failure means, and make the safe path the default. Agentic AI doesn't change that contract — it raises the stakes, because the components now make decisions of their own.

A retry policy has a clear definition of done: every step is either provably safe to repeat or explicitly non-retryable; every retry is bounded; every exhausted retry escalates with context; and every attempt is on the record. Get that right and your agents fail the way good engineering fails — quietly, recoverably, and without taking the rest of the system down with them.