Exactly-Once Is a Lie: Designing Replay-Safe Agent Workflows

· distributed systems · idempotency · agent workflows

Exactly-once execution is a comforting promise. In distributed agent systems, it’s also a lie.

The promise everyone eventually makes

Every orchestration system eventually promises some version of “exactly once”: the task will run once, the output will apply once, and the system will never duplicate effects.

This promise sounds responsible. It’s also incompatible with reality.

Where exactly-once breaks

Consider the smallest possible failure window:

apply_side_effect()   // the irreversible part
record_success()      // the bookkeeping

If the process crashes between those two lines, the system can no longer tell whether the side effect was applied and the crash hit before success was recorded, or whether the effect never ran at all.

The only safe response is to retry, and a retry risks repeating the effect.

Exactly-once execution fails not because systems are sloppy, but because crashes are unavoidable.

Agent workflows make this worse

Agent systems add their own sources of ambiguity: outputs differ between retries, tasks run long enough to hit timeouts and lost acknowledgments, and results are often applied the moment they are produced.

When agent outputs are applied immediately, retries turn from recovery into corruption.

At-least-once is the honest baseline


At-least-once reality:

Agent
  |
  v
[ Output ]  -- may replay -->
  |
  v
[ Handler ]
  |
  v
[ Side Effects ]

Only safe if side effects are idempotent.

Distributed systems settle for at-least-once delivery because it survives failure.

Agent workflows should do the same — but only if effects are designed to tolerate replay.

At-least-once delivery with exactly-once effect is the real goal.
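One common way to get exactly-once effect on top of at-least-once delivery is to deduplicate on a stable key before applying the effect. A minimal sketch in Python (the `processed` set and `handle` function are illustrative names, not from any particular framework):

```python
# Deduplicate on a stable key so redelivery of the same message is harmless.
processed: set[str] = set()   # in production this would be a durable store

def handle(message_id: str, effect) -> str:
    """Apply `effect` at most once per message_id, even under redelivery."""
    if message_id in processed:
        return "skipped"            # replay: effect already applied
    effect()                        # apply the side effect
    processed.add(message_id)       # record success under the same key
    return "applied"

calls = []
assert handle("msg-1", lambda: calls.append(1)) == "applied"
assert handle("msg-1", lambda: calls.append(1)) == "skipped"   # redelivery
assert calls == [1]
```

Note that an in-memory set still leaves the original crash window between applying the effect and recording the key; closing it fully requires the effect and the key write to commit atomically, for example inside one database transaction.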

Where agent systems usually go wrong

In practice, most agent workflows violate replay safety in subtle ways: side effects are applied before the output is durably recorded, handlers assume each output arrives exactly once, and retries re-run effects instead of re-checking state.

These bugs don’t throw errors. They accumulate damage.

A concrete failure: the duplicate diff

Here’s a real failure pattern I’ve seen repeatedly in agent workflows.

An agent is asked to apply a small refactor as a patch. The patch is generated correctly. The system applies it immediately.

Then something mundane happens: a timeout, a worker crash, or a lost acknowledgment. The system retries the task.

// first run
apply_patch(diff)

// crash before "success" is recorded

// retry
apply_patch(diff)

The second application doesn’t necessarily fail. It may partially apply, apply cleanly in a shifted context, or subtly corrupt the file.

No exception is thrown.
No alarm is raised.
The damage accumulates silently.

This is not an agent problem. It’s a replay-safety problem.
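A replay-safe `apply_patch` can make the operation idempotent by checking whether the target already reflects the patch before touching it. A sketch under a simplifying assumption (the patch is modeled as a single old-text to new-text replacement, not a full unified diff):

```python
def apply_patch(content: str, old: str, new: str) -> str:
    """Idempotently replace `old` with `new` in `content`.

    - Already applied (has `new`, lacks `old`): no-op.
    - Not yet applied (has `old`): apply exactly once.
    - Anything else: refuse rather than guess.
    """
    if new in content and old not in content:
        return content                       # replay: already applied
    if old in content:
        return content.replace(old, new, 1)  # first application
    raise ValueError("patch context not found; refusing to apply")

v1 = apply_patch("x = 1\n", "x = 1", "x = 2")
assert v1 == "x = 2\n"
assert apply_patch(v1, "x = 1", "x = 2") == "x = 2\n"   # retry is a no-op
```

The key design choice is the third branch: when the context matches neither the pre- nor the post-patch state, the safe behavior is to fail loudly instead of applying in a shifted context.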

Lessons from Farcaster: design for replay

While building Farcaster, I stopped trying to prevent retries and started assuming they would happen.

1) Outputs are immutable records

// agent execution
output = run_agent(job)

// no side effects yet: record the output durably, keyed by job id
store_output(job.id, output)

Outputs are stored durably before interpretation. If the system crashes, the record remains.
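A minimal sketch of this pattern, using SQLite as the durable store (the table name and schema are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")   # in production: a file on disk
db.execute("CREATE TABLE outputs (job_id TEXT PRIMARY KEY, output TEXT)")

def store_output(job_id: str, output: str) -> None:
    """Record the agent's output durably, before any interpretation.
    INSERT OR IGNORE makes a replayed store a no-op: the first
    recorded output wins."""
    with db:   # commits on success
        db.execute(
            "INSERT OR IGNORE INTO outputs VALUES (?, ?)",
            (job_id, output),
        )

store_output("job-1", "diff --git ...")
store_output("job-1", "diff --git ...")   # retry: harmless
count = db.execute("SELECT COUNT(*) FROM outputs").fetchone()[0]
assert count == 1
```

The primary key on `job_id` is what makes the write replay-safe: storing the same output twice leaves exactly one record.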

2) Interpretation is separate from execution

// processing loop
for output in unprocessed_outputs():
  interpret(output)
  mark_processed(output)

This allows interpretation to be retried independently of execution.
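The loop above can be sketched concretely. The names (`unprocessed_outputs`, `interpret`) follow the pseudocode; the dict-backed store is illustrative:

```python
# Interpretation as a separate, retriable pass over stored outputs.
outputs = {"job-1": {"output": "patch A", "processed": False},
           "job-2": {"output": "patch B", "processed": False}}
interpreted = []

def unprocessed_outputs():
    return [k for k, v in outputs.items() if not v["processed"]]

def interpret(job_id):
    interpreted.append(outputs[job_id]["output"])

def process_pass():
    for job_id in unprocessed_outputs():
        interpret(job_id)
        outputs[job_id]["processed"] = True

process_pass()
assert interpreted == ["patch A", "patch B"]
process_pass()   # rerun after a crash: nothing left to interpret
assert interpreted == ["patch A", "patch B"]
```

A crash between `interpret` and marking the output processed means that output is interpreted again on restart, which is exactly why the handlers themselves must tolerate replay.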

3) Idempotency beats cleverness

Handlers are designed so replaying the same output is harmless: deduplication keys, existence checks, and explicit state transitions.
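The explicit-state-transition technique can be sketched as follows (the ticket dict and the `VALID` transition table are hypothetical examples):

```python
# Explicit state transitions make replays visible and harmless.
VALID = {("open", "resolved"), ("resolved", "closed")}

def transition(ticket: dict, target: str) -> bool:
    """Move `ticket` to `target` idempotently.
    Returns True if the transition happened, False on a replay."""
    if ticket["state"] == target:
        return False                        # replay: already there
    if (ticket["state"], target) not in VALID:
        raise ValueError(f"illegal transition {ticket['state']} -> {target}")
    ticket["state"] = target
    return True

t = {"id": "T-1", "state": "open"}
assert transition(t, "resolved") is True
assert transition(t, "resolved") is False   # replayed message: no-op
assert t["state"] == "resolved"
```

A replayed message either lands on a state that is already the target (a safe no-op) or requests an illegal transition (a loud failure); neither path silently applies the effect twice.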

Why this scales

As agents become cheaper, retries become more common — not less.

Systems that assume “this probably won’t replay” fail quietly at scale. Systems that assume replay will happen can survive it.

Replay safety is not an optimization. It’s a prerequisite.