Exactly-Once Is a Lie: Designing Replay-Safe Agent Workflows
Exactly-once execution is a comforting promise. In distributed agent systems, it’s also a lie.
The promise everyone eventually makes
Every orchestration system eventually promises some version of “exactly once”: the task will run once, the output will apply once, and the system will never duplicate effects.
This promise sounds responsible. It’s also incompatible with reality.
Where exactly-once breaks
Consider the smallest possible failure window:
```
apply_side_effect()
record_success()
```
If the process crashes between those two lines, the system can no longer know whether the effect happened fully, partially, or not at all. The only safe response is to retry — which risks repeating the effect.
Exactly-once execution fails not because systems are sloppy, but because crashes are unavoidable.
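The window above is small but fatal. Here is a minimal, self-contained sketch of it — the names (`run_task`, `effects_applied`, `success_log`) are hypothetical, and the "crash" is simulated with an exception, but the shape is exactly the ambiguity described above: the effect lands, the success record doesn't, and the only safe retry duplicates the effect.

```python
effects_applied = []   # the external side effect (e.g. an email, a charge)
success_log = set()    # the orchestrator's durability record

def run_task(task_id: str, crash_after_effect: bool = False) -> None:
    """Apply the effect, then record success. A crash between the two
    steps leaves the log silent even though the effect happened."""
    effects_applied.append(task_id)          # apply_side_effect()
    if crash_after_effect:
        raise RuntimeError("process died")   # the failure window
    success_log.add(task_id)                 # record_success()

# First attempt crashes inside the window...
try:
    run_task("send-invoice-42", crash_after_effect=True)
except RuntimeError:
    pass

# ...so the retry policy sees no success record and runs the task again.
if "send-invoice-42" not in success_log:
    run_task("send-invoice-42")

# The effect has now been applied twice.
assert effects_applied.count("send-invoice-42") == 2
```

No amount of reordering fixes this: record first and you risk claiming success for an effect that never happened; apply first and you risk the duplicate shown here.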
Monitoring vs. Jailing: The Chaos of Autonomy
In building Farcaster, I encountered a philosophical choice: do I force agents to return structured JSON so the system can control the "apply" step, or do I let them use their own tools?
I chose the latter. Agents are encouraged to use their internal tools (sed, git, python) directly on the target. This makes them faster and more capable, but it turns the "exactly-once" lie into a full-blown hazard.
When an agent applies effects directly, the system only sees a start time and an exit code. If that process crashes halfway through a multi-file sed refactor, we are left with a "charred pile" of partial state.
The orchestrator's job is not to be a jail; it's to be a monitoring and tracking system that survives the chaos.
Designing for Replay
Since we assume retries will happen, and that those retries may be chaotic, we need a design that prioritizes replay safety:
1) Ownership is Rented (Leases + Heartbeats)
A worker doesn't own a job; it leases it. If it stops heartbeating, the lease expires. This prevents a "dead" worker from holding onto a half-finished file edit forever.
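A lease table is small enough to sketch in full. This is an in-memory toy (a real orchestrator would back it with a durable store), and every name in it is hypothetical, but it shows the two rules that matter: acquisition only succeeds against an expired lease, and a heartbeat from a worker that has lost its lease must fail so the worker knows to stop.

```python
import time

LEASE_TTL = 5.0  # seconds a lease survives without a heartbeat

class LeaseTable:
    """Minimal in-memory lease table: workers rent jobs, they don't own them."""
    def __init__(self):
        self._leases = {}  # job_id -> (worker_id, last_heartbeat)

    def try_acquire(self, job_id, worker_id, now=None):
        now = time.monotonic() if now is None else now
        holder = self._leases.get(job_id)
        if holder is not None and now - holder[1] < LEASE_TTL:
            return False  # someone else holds a live lease
        self._leases[job_id] = (worker_id, now)
        return True

    def heartbeat(self, job_id, worker_id, now=None):
        now = time.monotonic() if now is None else now
        holder = self._leases.get(job_id)
        if holder is None or holder[0] != worker_id:
            return False  # lease was lost; this worker must stop
        self._leases[job_id] = (worker_id, now)
        return True

table = LeaseTable()
assert table.try_acquire("refactor-7", "worker-a", now=0.0)
assert not table.try_acquire("refactor-7", "worker-b", now=2.0)   # lease still live
assert table.try_acquire("refactor-7", "worker-b", now=10.0)      # worker-a went silent
assert not table.heartbeat("refactor-7", "worker-a", now=10.5)    # a's lease is gone
```

Note that expiry is decided by the clock, not by the dead worker — which is the whole point: a crashed process can't release anything.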
2) Durable Context as a Safety Net
This is where tools like librarian come in. Before an agent starts a task, we can use the catalog to snapshot file hashes. If the task fails, the next agent (or the human) has a durable record of exactly what changed—even if the agent didn't return a clean JSON diff.
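The snapshot idea doesn't depend on any particular catalog. This is not librarian's actual API — just a hand-rolled sketch of the same safety net: hash every file before the agent starts, and after a crash, diff the hashes to get a durable record of what actually changed.

```python
import hashlib
import pathlib
import tempfile

def snapshot(paths):
    """Record a content hash per file before the agent touches anything."""
    return {str(p): hashlib.sha256(pathlib.Path(p).read_bytes()).hexdigest()
            for p in paths}

def diff_against(snap, paths):
    """After a (possibly crashed) run, report which files actually changed."""
    current = snapshot(paths)
    return sorted(p for p in snap if current.get(p) != snap[p])

# Demo on a throwaway directory
tmp = pathlib.Path(tempfile.mkdtemp())
a, b = tmp / "a.py", tmp / "b.py"
a.write_text("x = 1\n")
b.write_text("y = 2\n")

snap = snapshot([a, b])
a.write_text("x = 999\n")   # the agent edited a.py, then "crashed"

changed = diff_against(snap, [a, b])
assert changed == [str(a)]   # durable record of what actually changed
```

Even when the agent returns nothing usable, the pre-state plus a post-crash diff tells the next agent (or the human) exactly which files are suspect.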
3) Idempotency is a Tool-Chain Property
If you give an agent a tool like librarian apply, you should ensure that tool itself is idempotent. Checking pre-hashes and using symbol-based resolution makes re-running the same "proposal" harmless.
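A pre-hash check is the core of that idempotency. The sketch below is not librarian's implementation — `apply_edit` and its return values are made up for illustration — but it shows the three outcomes a replay-safe apply tool needs: applied once, already applied (harmless replay), or conflict (the file drifted, so refuse to stomp it).

```python
import hashlib
import pathlib
import tempfile

def apply_edit(path, expect_hash, new_text):
    """Idempotent file edit: apply only if the file is in the expected
    pre-state. Re-running the same proposal is a no-op, not a hazard."""
    p = pathlib.Path(path)
    current = hashlib.sha256(p.read_bytes()).hexdigest()
    new_hash = hashlib.sha256(new_text.encode()).hexdigest()
    if current == new_hash:
        return "already-applied"      # replay: nothing to do
    if current != expect_hash:
        return "conflict"             # file drifted; refuse to stomp it
    p.write_text(new_text)
    return "applied"

target = pathlib.Path(tempfile.mkdtemp()) / "config.py"
target.write_text("PORT = 80\n")
pre = hashlib.sha256(target.read_bytes()).hexdigest()

assert apply_edit(target, pre, "PORT = 8080\n") == "applied"
assert apply_edit(target, pre, "PORT = 8080\n") == "already-applied"  # safe replay
```

Because replaying the same proposal is harmless, a retried agent can simply run its whole tool sequence again instead of trying to reason about where the last attempt died.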
The Honest Baseline
Distributed systems settle for at-least-once delivery because it survives failure. Agent workflows should do the same — but only if effects are designed to tolerate replay.
At-least-once delivery with exactly-once effect is the real goal.
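That combination has a standard shape: every effect carries a stable ID, and the handler deduplicates on that ID before applying. The sketch below is a toy (the names are hypothetical, and in a real system the dedup record and the effect would have to commit atomically, e.g. in one transaction), but it shows the baseline.

```python
processed = set()   # durable record of effect IDs already applied
ledger = []         # the external effect (e.g. a payment ledger)

def handle(message):
    """At-least-once delivery, exactly-once effect: the message carries a
    stable ID, and the handler drops duplicates before applying it."""
    if message["id"] in processed:
        return False                  # duplicate delivery; replay is harmless
    ledger.append(message["amount"])  # apply the effect once
    processed.add(message["id"])
    return True

# The broker redelivers msg-1 after an ambiguous crash:
for msg in [{"id": "msg-1", "amount": 50},
            {"id": "msg-1", "amount": 50},   # redelivery
            {"id": "msg-2", "amount": 20}]:
    handle(msg)

assert ledger == [50, 20]   # delivered at least once, applied exactly once
```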
Replay safety is not an optimization. It’s a prerequisite.