Stop Doing Infrastructure in HTTP Handlers

An HTTP handler is a mayfly. It is born, it runs for a moment, and then it dies—sometimes gracefully, sometimes mid-sentence. In the long run, we’re all just timeouts with better marketing.

Meanwhile, the work we ask it to do (provisioning, imports, cluster mutations, cloud calls) is… not a mayfly. It’s a three-act play with retries, partial failure, cleanup, and at least one person asking if we can "just refresh".

Rule of thumb: if the work has meaningful failure modes, don’t pretend request/response is a good container for it.

This post is a clean-room demo of a pattern I keep coming back to when I want systems to be boring in the right way:

  • Intent → accept a request, validate it, and write down what should exist
  • CRD → persist intent as durable state in Kubernetes
  • Operator → reconcile that intent into real resources
  • Status → make progress observable (not vibes-based)

You can call this "operator-first" if you want. I think it’s mostly just admitting physics exists.

The problem: “just do it in the handler”

The synchronous approach starts simple: do the work inside POST /thing, return 200 when done.
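
Concretely, the first draft usually looks something like this (a minimal sketch; provisionNetwork and provisionCluster are hypothetical stand-ins for whatever multi-step work you actually do):

package main

import (
    "context"
    "net/http"
)

// Hypothetical stand-ins for the real multi-step work.
func provisionNetwork(ctx context.Context) error { return nil }
func provisionCluster(ctx context.Context) error { return nil }

// The synchronous version: all orchestration lives inside the request.
func handleCreateThing(w http.ResponseWriter, r *http.Request) {
    if err := provisionNetwork(r.Context()); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    if err := provisionCluster(r.Context()); err != nil {
        // Partial failure: the network exists, the cluster doesn't,
        // and nothing durable recorded how far we got.
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.WriteHeader(http.StatusOK) // assumes the client is still listening
}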

Then reality arrives:

  • timeouts (client, load balancer, proxy, server)
  • retries you don’t control (and usually can’t distinguish from the original request)
  • partial failures (some resources get created, others don't)
  • no coherent status model (poll endpoints, scrape logs, guess)

The universe does not care that your handler timed out. It will happily deliver partial failure anyway.

The pattern: Intent → CRD → Operator → Status

Kubernetes is, at its core, a reconciliation engine: desired state in, convergence out. So if your API is doing orchestration, you can stop fighting the grain and just… stop cosplaying as a scheduler.

The API becomes a thin intent writer. The operator becomes the workhorse. Status becomes an API surface you can actually trust.

Client
  |
  | POST /work  (intent)
  v
API server
  |
  | creates WorkRequest CR
  v
Kubernetes API (etcd)
  |
  | watch WorkRequest
  v
Controller/Operator
  |
  | creates Job
  v
Job runs
  |
  | controller mirrors Job state
  v
WorkRequest.status
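
In code, the "creates WorkRequest CR" hop is small on purpose. Here's a sketch of the intent-writing handler, using the dynamic client and the group/version from the example CR below; the demo's actual API differs in its details:

package main

import (
    "encoding/json"
    "net/http"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/client-go/dynamic"
)

// GVR for the demo CRD; group and version match the example CR below.
var workRequestGVR = schema.GroupVersionResource{
    Group:    "demo.demo.bnj.am",
    Version:  "v1alpha1",
    Resource: "workrequests",
}

type apiServer struct {
    dyn dynamic.Interface // built from in-cluster config at startup
}

// handleWork validates the request, writes intent, and gets out of the way.
func (s *apiServer) handleWork(w http.ResponseWriter, r *http.Request) {
    var in struct {
        Message         string `json:"message"`
        DurationSeconds int64  `json:"durationSeconds"`
    }
    if err := json.NewDecoder(r.Body).Decode(&in); err != nil || in.Message == "" {
        http.Error(w, "bad request", http.StatusBadRequest)
        return
    }

    cr := &unstructured.Unstructured{Object: map[string]interface{}{
        "apiVersion": "demo.demo.bnj.am/v1alpha1",
        "kind":       "WorkRequest",
        "metadata":   map[string]interface{}{"generateName": "work-"},
        "spec": map[string]interface{}{
            "message":         in.Message,
            "durationSeconds": in.DurationSeconds,
        },
    }}

    created, err := s.dyn.Resource(workRequestGVR).Namespace("default").
        Create(r.Context(), cr, metav1.CreateOptions{})
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // 202: intent is durable in etcd; the operator takes it from here.
    w.WriteHeader(http.StatusAccepted)
    json.NewEncoder(w).Encode(map[string]string{"name": created.GetName()})
}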

The demo

I built a small clean-room repo that implements this pattern using:

  • a CRD called WorkRequest
  • a controller that creates one Kubernetes Job per WorkRequest
  • a tiny HTTP API that returns 202 Accepted and lets the operator do the work

Repo: github.com/bnjam/intent-crd-demo

Example CR

apiVersion: demo.demo.bnj.am/v1alpha1
kind: WorkRequest
metadata:
  generateName: work-
spec:
  message: "hello world"
  durationSeconds: 3
  shouldFail: false
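
The kubebuilder-style Go types behind a spec like that would look roughly like this; field names follow the YAML above, but treat the repo's actual definitions as authoritative:

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// WorkRequestSpec is pure intent: what should happen, not how.
type WorkRequestSpec struct {
    Message         string `json:"message"`
    DurationSeconds int64  `json:"durationSeconds,omitempty"`
    ShouldFail      bool   `json:"shouldFail,omitempty"`
}

// WorkRequestStatus is written by the controller, never by clients.
type WorkRequestStatus struct {
    Phase      string             `json:"phase,omitempty"`
    JobName    string             `json:"jobName,omitempty"`
    Conditions []metav1.Condition `json:"conditions,omitempty"`
}

type WorkRequest struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   WorkRequestSpec   `json:"spec,omitempty"`
    Status WorkRequestStatus `json:"status,omitempty"`
}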

Status

status:
  phase: Succeeded
  jobName: work-abc12-worker
  conditions:
  - type: JobCreated
    status: "True"
    reason: Created
  - type: Completed
    status: "True"
    reason: Succeeded
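
A compressed sketch of the reconcile loop that produces a status like that, assuming controller-runtime. The import path and buildJob (which turns spec fields into a Job) are hypothetical; the repo's controller handles more edge cases:

package controller

import (
    "context"

    batchv1 "k8s.io/api/batch/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    demov1alpha1 "github.com/bnjam/intent-crd-demo/api/v1alpha1" // assumed path
)

type WorkRequestReconciler struct {
    client.Client
}

func (r *WorkRequestReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    var wr demov1alpha1.WorkRequest
    if err := r.Get(ctx, req.NamespacedName, &wr); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // One Job per WorkRequest; the deterministic name makes replays safe.
    jobName := wr.Name + "-worker"
    var job batchv1.Job
    err := r.Get(ctx, client.ObjectKey{Namespace: wr.Namespace, Name: jobName}, &job)
    switch {
    case apierrors.IsNotFound(err):
        job = buildJob(&wr, jobName) // hypothetical helper
        if err := r.Create(ctx, &job); err != nil {
            return ctrl.Result{}, err
        }
        wr.Status.Phase = "Running"
        wr.Status.JobName = jobName
    case err != nil:
        return ctrl.Result{}, err
    case job.Status.Succeeded > 0:
        wr.Status.Phase = "Succeeded"
    case job.Status.Failed > 0:
        wr.Status.Phase = "Failed"
    }

    // Mirror what we observed; conditions get set the same way.
    return ctrl.Result{}, r.Status().Update(ctx, &wr)
}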

Run it (operator + API in-cluster via k3d)

This setup runs both the operator and the API inside the cluster. You port-forward the API to localhost.

make demo-deploy-all IMG=intent-crd-demo:dev API_IMG=intent-crd-demo-api:dev
make api-port-forward

Then create work and watch it converge:

./scripts/demo.sh
SHOULD_FAIL=true ./scripts/demo.sh

Reset demo objects (for clean screenshots):

make demo-reset

What you get for free

  • Durability: intent survives restarts and transient failure
  • Idempotency: reconciliation can be written to be replay-safe (the deterministic Job name in the sketch above)
  • Observability: status is watchable (kubectl get ... -w) and queryable (see the read-path sketch below)
  • Separation of concerns: API is policy + validation; operator is lifecycle + orchestration
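
The observability bullet is what makes the 202 honest: the API can serve progress by reading status straight off the CR. A read-path sketch reusing the apiServer from earlier (the endpoint shape is my assumption, not necessarily the demo's):

// handleGetWork returns whatever the controller last observed.
func (s *apiServer) handleGetWork(w http.ResponseWriter, r *http.Request) {
    name := r.PathValue("name") // Go 1.22+ pattern routing
    obj, err := s.dyn.Resource(workRequestGVR).Namespace("default").
        Get(r.Context(), name, metav1.GetOptions{})
    if err != nil {
        http.Error(w, "not found", http.StatusNotFound)
        return
    }
    // No guessing, no log scraping: status is the API surface.
    phase, _, _ := unstructured.NestedString(obj.Object, "status", "phase")
    json.NewEncoder(w).Encode(map[string]string{"name": name, "phase": phase})
}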

If you’ve ever built a “background worker” system and then slowly reinvented leases, state machines, retries, and status APIs: congratulations, you discovered why controllers exist. One must imagine Sisyphus happy—because at least the retry loop is idempotent.

Tradeoffs

This pattern is not free. You are building a distributed system with a state machine and a controller loop. Which sounds intimidating until you remember you were already doing that—just badly, inside HTTP handlers.

The cost is that you now need to care about:

  • CRD versioning
  • status design (what does “done” mean?)
  • garbage collection / finalizers
  • concurrency limits and backpressure (see the wiring sketch below)

The absurdity doesn’t go away. You just put it in a place that can retry.
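
At least one of those has a cheap starting point: in controller-runtime, backpressure begins as a knob on the controller wiring (a sketch matching the reconciler above; controller here is sigs.k8s.io/controller-runtime/pkg/controller):

func (r *WorkRequestReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&demov1alpha1.WorkRequest{}).
        Owns(&batchv1.Job{}). // re-reconcile when an owned Job changes
        WithOptions(controller.Options{
            MaxConcurrentReconciles: 4, // crude but real backpressure
        }).
        Complete(r)
}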

Closing thought

Most of platform engineering is taking chaos, naming it, and making it observable. The rest is writing enough guardrails that the next person can be productive without being a hero.

Returning 202 and reconciling later is not a downgrade. It’s admitting what the work actually is. The meaning doesn’t arrive automatically; you manufacture it by making failure states explicit and progress observable.