Exactly-Once Is a Lie: Designing Replay-Safe Agent Workflows
Exactly-once execution is a comforting promise. In distributed agent systems, it’s also a lie.
The promise everyone eventually makes
Every orchestration system eventually promises some version of “exactly once”: the task will run once, the output will apply once, and the system will never duplicate effects.
This promise sounds responsible. It’s also incompatible with reality.
Where exactly-once breaks
Consider the smallest possible failure window:
apply_side_effect()   // the external world changes here
record_success()      // a crash can land before this line ever runs
If the process crashes between those two lines, the system can no longer know whether:
- the effect never happened,
- the effect happened fully, or
- the effect happened partially.
The only safe response is to retry — which risks repeating the effect.
Exactly-once execution fails not because systems are sloppy, but because crashes are unavoidable.
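Here is a minimal Python sketch of that window; the side effect, the success marker, and the recovery logic are all hypothetical stand-ins. If the process dies between the two steps, recovery cannot tell whether the effect landed, so its only safe option is to run the task again.

import os

EFFECT_LOG = "effect.log"      # stand-in for any external side effect
SUCCESS_MARKER = "success.ok"  # durable record that the task completed

def apply_side_effect():
    # The external world changes here (a payment, a write, an API call).
    with open(EFFECT_LOG, "a") as f:
        f.write("effect applied\n")

def record_success():
    # A crash can land before this ever runs.
    open(SUCCESS_MARKER, "w").close()

def run_task():
    apply_side_effect()
    # crash window: the effect is live, success is not yet recorded
    record_success()

def recover():
    # After a restart, a missing marker is ambiguous: the effect may have
    # happened fully, partially, or not at all. The only safe move is to retry.
    if not os.path.exists(SUCCESS_MARKER):
        run_task()  # retry, which risks repeating the effect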
Agent workflows make this worse
Agent systems add additional sources of ambiguity:
- stochastic outputs
- long-running subprocesses
- tool invocations with side effects
- timeouts that don’t correlate with completion
When agent outputs are applied immediately, retries turn from recovery into corruption.
At-least-once is the honest baseline
At-least-once reality:
Agent
  |
  v
[ Output ] ---- may replay ----+
  |                            |
  v                            v
[ Handler ] <------------------+
  |
  v
[ Side Effects ]
Only safe if side effects are idempotent.
Distributed systems settle for at-least-once delivery because it survives failure.
Agent workflows should do the same — but only if effects are designed to tolerate replay.
At-least-once delivery with exactly-once effect is the real goal.
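One sketch of what that can look like in Python, with a hypothetical handler and an in-memory set standing in for a durable idempotency-key store: delivery stays at-least-once, but the effect is keyed so it applies once.

applied_keys = set()  # in practice: a durable store of idempotency keys

def apply_effect(output: str) -> None:
    print("applying:", output)  # stand-in for the real side effect

def handle(output_id: str, output: str) -> None:
    # Delivery is at-least-once: this handler may run many times for the
    # same output. The key makes the effect itself apply exactly once.
    if output_id in applied_keys:
        return  # replay: the effect already landed
    apply_effect(output)
    applied_keys.add(output_id)

# The same output can be delivered twice; only the first call has an effect.
handle("job-42", "patch A")
handle("job-42", "patch A")

In a real system the set of applied keys has to live in durable, shared storage; if the deduplication state dies with the process, the guarantee dies with it.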
Where agent systems usually go wrong
In practice, most agent workflows violate replay safety in subtle ways:
- applying diffs twice
- appending content on retries
- recreating tasks without deduplication
- re-running fixes that already landed
These bugs don’t throw errors. They accumulate damage.
A concrete failure: the duplicate diff
Here’s a real failure pattern I’ve seen repeatedly in agent workflows.
An agent is asked to apply a small refactor as a patch. The patch is generated correctly. The system applies it immediately.
Then something mundane happens: a timeout, a worker crash, or a lost acknowledgment. The system retries the task.
// first run
apply_patch(diff)
// crash before "success" is recorded
// retry
apply_patch(diff)
The second application doesn’t necessarily fail. It may partially apply, apply cleanly in a shifted context, or subtly corrupt the file.
No exception is thrown.
No alarm is raised.
The damage accumulates silently.
This is not an agent problem. It’s a replay-safety problem.
Lessons from Farcaster: design for replay
While building Farcaster, I stopped trying to prevent retries and started assuming they would happen.
1) Outputs are immutable records
// agent execution
output = run_agent(job)
// no side effects yet
store_output(job_id, output)
Outputs are stored durably before interpretation. If the system crashes, the record remains.
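As a sketch of that shape, using SQLite as a stand-in for whatever durable store you prefer (the table and names are illustrative, not Farcaster’s actual schema): the output is written under the job’s id before anything interprets it, and storing the same job twice is a no-op.

import sqlite3

db = sqlite3.connect("outputs.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS outputs ("
    " job_id TEXT PRIMARY KEY,"       # one immutable record per job
    " output TEXT NOT NULL,"
    " processed INTEGER DEFAULT 0)"   # interpretation happens later
)

def store_output(job_id: str, output: str) -> None:
    # Durable before interpretation. Replaying the store is harmless:
    # the primary key turns a duplicate insert into a no-op.
    db.execute(
        "INSERT OR IGNORE INTO outputs (job_id, output) VALUES (?, ?)",
        (job_id, output),
    )
    db.commit()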
2) Interpretation is separate from execution
// processing loop
for output in unprocessed_outputs():
    interpret(output)
    mark_processed(output)
This allows interpretation to be retried independently of execution.
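Continuing the same illustrative SQLite sketch: the loop reads only unprocessed rows and flips the flag after interpreting, so a crash mid-loop just means the row gets picked up again on the next pass. That is also why interpret() itself has to tolerate replay.

def unprocessed_outputs():
    # Rows that have been executed but not yet interpreted.
    return db.execute(
        "SELECT job_id, output FROM outputs WHERE processed = 0"
    ).fetchall()

def interpret(job_id: str, output: str) -> None:
    print("interpreting", job_id)  # stand-in for the real handler

def process_once() -> None:
    for job_id, output in unprocessed_outputs():
        interpret(job_id, output)
        # A crash before this point leaves the row unprocessed, so the next
        # pass interprets it again. interpret() must tolerate that replay.
        db.execute(
            "UPDATE outputs SET processed = 1 WHERE job_id = ?", (job_id,)
        )
        db.commit()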
3) Idempotency beats cleverness
Handlers are designed so replaying the same output is harmless: deduplication keys, existence checks, and explicit state transitions.
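A few handler-level patterns in that spirit, sketched in Python with hypothetical names: an existence check, a deduplication key, and an explicit state transition.

def apply_patch_once(file_path: str, patch: str) -> None:
    # Existence check: if the patched content is already present, do nothing.
    text = open(file_path).read()
    if patch in text:
        return
    with open(file_path, "a") as f:
        f.write(patch)

def create_task(tasks: dict, dedup_key: str, payload: str) -> None:
    # Deduplication key: recreating the same task is a no-op.
    tasks.setdefault(dedup_key, payload)

def transition(states: dict, task_id: str, expected: str, target: str) -> bool:
    # Explicit state transition: only move from the expected state, so
    # replaying the same transition cannot apply it twice.
    if states.get(task_id) != expected:
        return False
    states[task_id] = target
    return True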
Why this scales
As agents become cheaper, retries become more common — not less.
Systems that assume “this probably won’t replay” fail quietly at scale. Systems that assume replay will happen can survive it.
Replay safety is not an optimization. It’s a prerequisite.