Exactly-Once Is a Lie: Designing Replay-Safe Agent Workflows

· distributed systems · idempotency · agent workflows

Exactly-once execution is a comforting promise. In distributed agent systems, it’s also a lie.


The promise everyone eventually makes

Every orchestration system eventually promises some version of “exactly once”: the task will run once, the output will apply once, and the system will never duplicate effects.

This promise sounds responsible. It’s also incompatible with reality.

Where exactly-once breaks

Consider the smallest possible failure window:

apply_side_effect()
record_success()

If the process crashes between those two lines, the system can no longer know whether the effect happened fully, partially, or not at all. The only safe response is to retry — which risks repeating the effect.
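One common way to shrink that window is to key the effect so a retry is harmless. A minimal sketch, with hypothetical in-memory stand-ins for the effect target and the durable log (all names here are illustrative, not from any real system):

```python
import uuid

applied_effects = {}   # effect_key -> payload: the external side effect
success_log = set()    # effect_keys whose completion was recorded

def apply_side_effect(effect_key, payload, crash_before_record=False):
    """Apply an effect, then record success. A crash between the two
    steps leaves the effect applied but unrecorded."""
    if effect_key not in applied_effects:   # idempotency check on replay
        applied_effects[effect_key] = payload
    if crash_before_record:
        raise RuntimeError("crash between apply and record")
    success_log.add(effect_key)

key = str(uuid.uuid4())
try:
    apply_side_effect(key, "charge $5", crash_before_record=True)
except RuntimeError:
    pass  # the caller sees a failure; it cannot know how far we got

# Retry with the SAME key: the effect is not duplicated, and success
# is finally recorded. Without the key, the retry would double-apply.
apply_side_effect(key, "charge $5")
assert list(applied_effects.values()) == ["charge $5"]
assert key in success_log
```

The retry is only safe because the key travels with the request; a retry that mints a fresh key is just a duplicate with extra steps.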

Exactly-once execution fails not because systems are sloppy, but because crashes are unavoidable.

Monitoring vs. Jailing: The Chaos of Autonomy

In building Farcaster, I encountered a philosophical choice: do I force agents to return structured JSON so the system can control the "apply" step, or do I let them use their own tools?

I chose the latter. Agents are encouraged to use their internal tools (sed, git, python) directly on the target. This makes them faster and more capable, but it turns the "exactly-once" lie into a full-blown hazard.

When an agent applies effects directly, the system only sees a start time and an exit code. If that process crashes halfway through a multi-file sed refactor, we are left in a "charred pile" of partial state.

The orchestrator's job is not to be a jail; it's to be a monitoring and tracking system that survives the chaos.

Designing for Replay

Since we assume retries will happen, and that those retries might be "chaotic", we need a design that prioritizes replay safety:

1) Ownership is Rented (Leases + Heartbeats)

A worker doesn't own a job; it leases it. If it stops heartbeating, the lease expires. This prevents a "dead" worker from holding onto a half-finished file edit forever.
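A toy version of that rental model, assuming an in-memory lease table and a hypothetical TTL (a real orchestrator would persist this and use wall-clock coordination):

```python
import time

LEASE_TTL = 5.0  # seconds; hypothetical tuning value

class LeaseTable:
    """Toy lease table: a job is owned only while its lease is fresh."""
    def __init__(self):
        self.leases = {}  # job_id -> (worker_id, expires_at)

    def try_acquire(self, job_id, worker_id, now=None):
        now = time.monotonic() if now is None else now
        holder = self.leases.get(job_id)
        if holder and holder[1] > now and holder[0] != worker_id:
            return False  # someone else holds a live lease
        self.leases[job_id] = (worker_id, now + LEASE_TTL)
        return True

    def heartbeat(self, job_id, worker_id, now=None):
        # Renewal uses the same path as acquisition: it only succeeds
        # while this worker still holds (or can reclaim) the lease.
        return self.try_acquire(job_id, worker_id, now)

table = LeaseTable()
assert table.try_acquire("job-1", "worker-a", now=0.0)
assert not table.try_acquire("job-1", "worker-b", now=1.0)  # lease is live
assert table.try_acquire("job-1", "worker-b", now=6.0)      # lease expired
```

The key property: a dead worker loses ownership by doing nothing. No cleanup code has to run on the corpse.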

2) Durable Context as a Safety Net

This is where tools like librarian come in. Before an agent starts a task, we can use the catalog to snapshot file hashes. If the task fails, the next agent (or the human) has a durable record of exactly what changed—even if the agent didn't return a clean JSON diff.
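A sketch of that snapshot-and-diff idea, using plain content hashes (this is my illustration of the mechanism, not librarian's actual API, which may record richer metadata):

```python
import hashlib
from pathlib import Path

def snapshot_hashes(root: str) -> dict:
    """Record a content hash for every file under root,
    taken before the agent touches anything."""
    return {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(root).rglob("*")) if p.is_file()
    }

def diff_snapshots(before: dict, after: dict) -> dict:
    """Classify what changed, even if the agent reported nothing."""
    return {
        "modified": [p for p in before if p in after and before[p] != after[p]],
        "deleted":  [p for p in before if p not in after],
        "created":  [p for p in after if p not in before],
    }
```

Run `snapshot_hashes` before the task and again after a crash, and `diff_snapshots` tells the next agent (or the human) exactly which files the dead run touched.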

3) Idempotency is a Tool-Chain Property

If you give an agent a tool like librarian apply, you should ensure that the tool itself is idempotent. Checking pre-hashes and using symbol-based resolution makes re-running the same "proposal" harmless.
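The pre-hash check reduces to a small state machine. A hypothetical sketch of the core (symbol-based resolution is not shown, and `apply_edit` and its contract are my invention, not librarian's real interface):

```python
import hashlib

def apply_edit(files: dict, path: str, pre_hash: str, new_text: str) -> str:
    """Apply an edit only if the file still matches the proposal's
    pre-hash, or is already in the desired post state.
    Re-running the same proposal is therefore a no-op."""
    current = files.get(path, "")
    cur_hash = hashlib.sha256(current.encode()).hexdigest()
    post_hash = hashlib.sha256(new_text.encode()).hexdigest()
    if cur_hash == post_hash:
        return "already-applied"   # replay: harmless no-op
    if cur_hash != pre_hash:
        return "conflict"          # file drifted; refuse to clobber
    files[path] = new_text
    return "applied"

fs = {"main.py": "x = 1\n"}
pre = hashlib.sha256(b"x = 1\n").hexdigest()
assert apply_edit(fs, "main.py", pre, "x = 2\n") == "applied"
assert apply_edit(fs, "main.py", pre, "x = 2\n") == "already-applied"
```

Three outcomes, all safe: apply, skip because it already happened, or refuse because something else moved the file. Blind overwrite never occurs.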

The Honest Baseline

Distributed systems settle for at-least-once delivery because it is the guarantee that survives failure. Agent workflows should do the same, provided their effects are designed to tolerate replay.

At-least-once delivery with exactly-once effect is the real goal.
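Concretely, that combination means the transport is allowed to redeliver, and the consumer dedupes on a message id before applying. A minimal sketch (in a real system the `seen` set would be persisted, not held in memory):

```python
def make_consumer():
    """At-least-once delivery, exactly-once effect: duplicates are
    expected, so the consumer filters them before the side effect."""
    seen = set()
    effects = []  # stands in for the external side effect

    def handle(msg_id, payload):
        if msg_id in seen:
            return "duplicate"   # redelivery: absorbed, not re-applied
        effects.append(payload)
        seen.add(msg_id)
        return "applied"

    return handle, effects

handle, effects = make_consumer()
assert handle("m1", "send email") == "applied"
assert handle("m1", "send email") == "duplicate"  # redelivery is safe
assert effects == ["send email"]
```

The delivery count stays at-least-once; only the effect is exactly-once, and only because the consumer was built to make it so.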

Replay safety is not an optimization. It’s a prerequisite.