ARC-ADR-053: Interim Fleet Coordination Plane

One line: how we coordinate multiple relentless, goal-driven coding systems (Antigravity, Codex, Claude Code) running locally today — safely, observably, and without throwing the work away — on the path to ARC-ADR-044 (untool as the ontology-orchestrated swarm runtime).

Context and Problem Statement

We crossed a threshold. Run locally in YOLO / goal mode, closed coding agents (Antigravity, Codex) have become relentless: given a durable goal, they decompose it and build against it autonomously, including overnight and unattended. This is real, observed, and extremely valuable. It is also, right now, semi-manual and only loosely coordinated, and that is starting to hurt:

The operator cannot keep up. A human cannot supervise N fast agents in real time. The operator was asleep while the swarm built against the goal — exhilarating, but there was no heartbeat to wake up to and no safe way to pause it.
Intra-system coordination is already excellent. Inside one fast runtime, a single Release Train Engineer (RTE) thread coordinates ~20 worker threads with near-zero dispatch latency. Observed concretely in the Antigravity swarm's coordination_ledger.json: a callsign roster, a per-agent file-ownership map, a global build gate, and the RTE serializing the dangerous step (merge + build verify) while workers parallelize in their territories. This works.
Cross-system coordination degrades to a barrier. Across runtimes the story is the opposite: you dispatch an epic and then wait for the slowest system to even pick it up while the fast one finishes and idles. Pickup latency is high and there is no shared work queue — so the fleet runs at the speed of its slowest scheduler, not its fastest builder.
There is no shared safety net across systems. Codex runs in its own world (its own goal store, its own git worktrees) and never reads Antigravity's ledger. Today they avoid collisions only because they happen to work different repos. Two agents (one was even a mislabelled second "RTE") have already drifted in the single ledger — a foretaste of split-brain.

The destination is settled: ARC-ADR-044 commits us to untool, an ontology-orchestrated swarm runtime where agents, tools, skills, and signed emissions are first-class holons, and dispatch/locking/observability are the platform's job. But ARC-ADR-044 is ~5 sprints away from v1. We need an interim coordination architecture that (a) makes today's swarms safe and observable, (b) is staged so each layer ships independent value, and (c) graduates cleanly into ARC-ADR-044 instead of being thrown away.

This ADR is that bridge. It is deliberately about the semi-manual present and the migration path, not the end state.

Decision Drivers

Operator span-of-control. A relentless swarm needs a heartbeat up (so a human — even asleep — can catch up in one glance) and a control down (pause / drain / kill). Autonomy without a brake is a liability, not a feature.
The goal is the engine — use it intelligently. The persistent goal is what keeps the fleet productive unattended. The architecture should treat the goal as the apex object that everything decomposes from, not an incidental prompt.
Latency asymmetry is structural. Intra-system dispatch ≈ 0; cross-system pickup is slow and bursty. The design must never impose a barrier across systems — slow pickup by one runtime must not stall a fast one.
Substrate neutrality. Antigravity and Codex are closed runtimes we cannot modify. Coordination must ride only on what they share: the filesystem and an MCP endpoint they can both call.
Conflict safety. N agents in shared working trees ⇒ write-write races. We need ownership/locking, not etiquette. (The single mutable coordination_ledger.json has the same last-writer-wins disease this repo already documents for ADR numbering; see docs/decisions/README.md.)
Graduate, don't discard. Every interim artifact (roster, ownership claim, handoff, heartbeat) must map onto an ARC-ADR-044 holon/emission so the migration is a re-point, not a rewrite.
Reuse what exists. The untool local-fleet MCP already exposes the productized primitives (fleet_agent_join, fleet_agent_memo_write, fleet_agent_handoff, fleet_notify_agent_mention, NATS events); the SAFE RTE role and scrum-master / release-manager agents already exist; ARMY_PRINCIPLES already mandate MECE and Observable Decisions. The interim plane should be an assembly of these, not new invention.

Considered Options

A. Status quo: per-system native coordination, human reconciles

Antigravity keeps its filesystem ledger; Codex keeps its sqlite goal store; the human stitches them together across systems.

Pros: zero build cost; already running.
Cons: the cross-system barrier stall persists; no cross-system conflict safety; operator is blind while away; split-brain already observed; nothing graduates toward ARC-ADR-044.

B. Promote the filesystem ledger as the fleet standard

Make coordination_ledger.json the canonical plane and have every system read/write it.

Pros: cheapest path that "looks unified"; reuses what works intra-system.
Cons: last-writer-wins on one mutable file with 20+ writers; single-host (it lives under one user's home dir); ownership is advisory; doesn't reach Codex's isolated worktrees; enshrines the prototype's flaws and diverges from the ARC-ADR-044 object model. Promoting a known-racy artifact to a standard is a trap.
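
To make the last-writer-wins failure concrete, here is a minimal sketch. It is a deterministic simulation, not the swarm's actual code: two hypothetical agents (the callsigns `alpha` and `bravo` are invented) each read the ledger, add their own roster entry, and write the whole file back. The `SerializedStore` class is an illustrative stand-in for a server that applies each update under a lock; it is not the untool MCP API.

```python
import json
import threading


def read_modify_write_race(ledger_text: str) -> dict:
    """Simulate two agents doing whole-file read-modify-write on one ledger."""
    # Both agents read the same snapshot before either one writes.
    snapshot_a = json.loads(ledger_text)
    snapshot_b = json.loads(ledger_text)

    snapshot_a["roster"]["alpha"] = "building"  # agent alpha's change
    snapshot_b["roster"]["bravo"] = "building"  # agent bravo's change

    # Each writes the whole file back; the second write replaces the first.
    ledger_text = json.dumps(snapshot_a)
    ledger_text = json.dumps(snapshot_b)
    return json.loads(ledger_text)


class SerializedStore:
    """Hypothetical stand-in for a server that applies updates one at a time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state = {"roster": {}}

    def join(self, callsign: str, status: str) -> None:
        # The read and the write happen under one lock, so no update is lost.
        with self._lock:
            self._state["roster"][callsign] = status

    def snapshot(self) -> dict:
        with self._lock:
            return json.loads(json.dumps(self._state))


raced = read_modify_write_race(json.dumps({"roster": {}}))

store = SerializedStore()
store.join("alpha", "building")
store.join("bravo", "building")
serialized = store.snapshot()

print(sorted(raced["roster"]))       # ['bravo'] (alpha's entry was lost)
print(sorted(serialized["roster"]))  # ['alpha', 'bravo']
```

With 20+ writers the same pattern loses updates more often; per-entry writes applied by one serializing party avoid it.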

C. Staged coordination plane: untool MCP canonical, filesystem as cache, schema'd envelopes, RTE-per-system + pull-based federation (CHOSEN)

Keep the protocol the swarm invented (callsigns, file-ownership, RTE-serialized integration, handoff envelopes) — it is genuinely good — but move the source of truth to the untool local-fleet MCP, lock the envelope schemas, and stage the rollout as a maturity ladder that ends inside ARC-ADR-044.

Pros: fixes the race (MCP can serialize writes server-side); reaches every runtime (MCP is callable by Antigravity, Codex, Claude, Copilot); each rung ships value alone; the captured data graduates into ARC-ADR-044 emissions/holons; reuses existing MCP + SAFE roles.
Cons: the MCP must become the write path (needs a serialized write + a claim tool); per-runtime glue to make each closed system honor the plane; the RTE thread burns tokens continuously; a cooperative kill-switch must be baked into agent instructions.

Decision

Adopt Option C. Coordinate the fleet as a staged maturity ladder, with the untool local-fleet MCP as the canonical control plane and the filesystem ledger demoted to a read-cache. Each level is independently valuable and strictly graduates into the next.

The maturity ladder

| Level | Name | Scope | Status | Dispatch latency |
|---|---|---|---|---|
| L0 | Manual handoff | Hand-written split plans; human reconciles | origin (e.g. `peer_cooperation_plan.md`) | n/a |
| L1 | Intra-system RTE swarm | One fast runtime; one RTE coordinates N workers | working today; formalize it | ≈ 0 |
| L2 | Cross-agent federation | Many runtimes share one plane; pull-based work queue | next | bounded by capacity, not by pickup |
| L3 | untool unified runtime (ARC-ADR-044) | Platform owns dispatch/locking/teams/emissions | destination | platform-scheduled |

L1 — Intra-system RTE swarm (current goal). Within one runtime, exactly one Release Train Engineer thread owns: the callsign roster, the file-ownership map (MECE territory partition so workers never collide), conflict resolution, the build-verify gate, and a ≤5-minute operator heartbeat. Workers run in parallel inside their territories; only the integration step is serialized through the RTE. This is the "cross-team orchestration within one fast system" the operator is chasing, and it already works — this ADR just makes it a named, schema'd, pausable discipline rather than an emergent one.
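
The "mutually exclusive" half of the MECE territory partition can be checked mechanically. The sketch below assumes the ownership map is a mapping from callsign to a list of directory prefixes; that shape, the callsigns, and the paths are illustrative assumptions, not the actual ledger schema. It does not check the "collectively exhaustive" half.

```python
from itertools import combinations
from pathlib import PurePosixPath


def _overlaps(a: str, b: str) -> bool:
    """True if one path prefix equals, or is an ancestor of, the other."""
    pa, pb = PurePosixPath(a).parts, PurePosixPath(b).parts
    shorter = min(len(pa), len(pb))
    return pa[:shorter] == pb[:shorter]


def find_collisions(ownership: dict[str, list[str]]) -> list[tuple[str, str, str, str]]:
    """Return (callsign, prefix, other callsign, other prefix) per overlap."""
    claims = [(who, p) for who, prefixes in ownership.items() for p in prefixes]
    return [
        (w1, p1, w2, p2)
        for (w1, p1), (w2, p2) in combinations(claims, 2)
        if w1 != w2 and _overlaps(p1, p2)
    ]


# Hypothetical territories: charlie's claim sits inside alpha's.
ownership = {
    "alpha": ["src/api"],
    "bravo": ["src/ui"],
    "charlie": ["src/api/auth"],
}
collisions = find_collisions(ownership)
print(collisions)  # [('alpha', 'src/api', 'charlie', 'src/api/auth')]
```

An RTE could run a check like this whenever the map changes and refuse to publish a map that reports any collision.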

L2 — Cross-agent federation (next). Multiple runtimes share one plane via the untool MCP. The cure for the barrier stall is to stop pushing epics and waiting: expose a pull-based ready-work queue plus NATS events, and let each system's RTE pull ready work when it has free capacity. A fast system never idles waiting for a slow system's scheduler; a slow system simply pulls less. Cross-system path/work-item claims prevent the two armies colliding when they finally touch the same repo.
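
The difference between pushing an epic and pulling ready work can be seen in a small simulation. Everything here is an assumption made for illustration: the tick costs, the system names, and the simplification that "push" means an even up-front split. It is not a model of the untool queue.

```python
import heapq


def pull_schedule(num_items: int, step_cost: dict[str, int]) -> tuple[int, dict[str, int]]:
    """Each system pulls one ready item whenever it becomes free.

    Returns (finish time, items completed per system).
    """
    free_at = [(0, name) for name in sorted(step_cost)]
    heapq.heapify(free_at)
    done = {name: 0 for name in step_cost}
    finish = 0
    for _ in range(num_items):
        now, name = heapq.heappop(free_at)  # whoever is free first pulls next
        end = now + step_cost[name]
        done[name] += 1
        finish = max(finish, end)
        heapq.heappush(free_at, (end, name))
    return finish, done


def push_schedule(num_items: int, step_cost: dict[str, int]) -> int:
    """Split the work evenly up front; the fleet finishes when the slowest does."""
    share = -(-num_items // len(step_cost))  # ceiling division
    return max(share * cost for cost in step_cost.values())


costs = {"fast": 1, "slow": 5}  # ticks per work item (invented numbers)
pull_finish, pull_done = pull_schedule(12, costs)
push_finish = push_schedule(12, costs)
print(pull_finish, pull_done)  # 10 {'fast': 10, 'slow': 2}
print(push_finish)             # 30
```

In this toy case the pulled fleet finishes in 10 ticks against 30 for the even split, and the slow system simply ends up with fewer items.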

Local vs cloud is a transport detail, not a coordination split. The untool MCP is the rendezvous, reachable both ways by design: local runtimes (Antigravity, Codex, Claude Code) call it on loopback (127.0.0.1:8765, fast, no auth); cloud runtimes (claude.ai routines, GitHub Actions, remote callers) call the same server at mcp.untool.ai behind CF Access service tokens. Same tools, same memo/handoff/join state, same NATS bus — the local↔cloud difference lives only at the transport layer and dissolves at the coordination layer. Two things to design for: (a) the server + NATS run on the operator's host today, so cloud reach depends on that host + the CF tunnel — the L3 graduation is to deploy the untool fleet suite as its own service (ARC-ADR-023 tiering) so the plane outlives any one machine; (b) loopback writes are ~ms while edge writes are ~tens–hundreds of ms, so cloud/far agents pull coarser goals and heartbeat less often — which is the reachabilityTier structural constraint on the teleodynamic hierarchy (local → fine-grained pull; edge → coarse-grained).

L3 — untool (ARC-ADR-044). untool becomes the intermediate interface; agents speak to its unified API; it owns dispatch, locking, right-sized team composition, signed emissions, observability, and billing. The interim plane dissolves: roster entries become Agent holons, ownership claims become scope-tagged relations, handoffs and heartbeats become Emission holons on the replayable DAG.

Core commitments (interim)

untool local-fleet MCP is canonical; the filesystem ledger is a read-cache. Writes go through fleet_agent_join (roster), fleet_agent_memo_write (shared state, key prefix coordination/), and fleet_agent_handoff (handoff + NATS event). The MCP serializes writes server-side, eliminating the last-writer-wins race. This simply inverts today's dual-write — the swarm already mirrors the callsign registry and handoffs into the MCP memo store; we make that the primary, not the copy.
Two envelopes are schema-locked under contracts/: fleet-coordination-ledger.schema.json (roster + ownership) and fleet-coordination-handoff.schema.json (handoff / drop). Drift-proof, machine-consumable, and shaped to graduate into ARC-ADR-044 emissions (optional id / fingerprint / trust fields reserved now).
One RTE per system, holding a single RTE-lock (coordination/rte-lock memo with owner + expiresAt). No two threads may claim RTE at once, which eliminates the observed split-brain. The RTE role binds to the scrum-master / release-manager agents.
File-ownership is a gate, not a guideline. A pre-write / pre-commit hook (or an MCP fleet_claim_path tool) checks the ownership map and refuses out-of-territory writes. Advisory ownership is how peers corrupt each other's files.
The goal is a teleodynamic hierarchy, not a flat objective. The operator sets the telos; the RTE maintains a hierarchy of sub-goals (telos → epics → stories → agent tasks) where each level constrains the level below, and the whole is bounded by structural constraints — contracts, the file-ownership map, container tiers, binding ADRs, the build gate, and each agent's reachabilityTier (local vs edge). This is the missing spine: it turns relentless agent energy into productive work instead of sprawl — constraints, not commands, do the steering (Deacon's teleodynamics: higher-order ends organize lower-order processes through constraint). Agents pull the next ready sub-goal whose constraints they satisfy. The flat goal+board that kept the fleet building overnight is the degenerate one-level case.
Heartbeat up, control down. The RTE pushes a ≤5-minute digest (per-agent progress, build status, blockers, what changed) to the operator; the operator can assert a kill-switch / pause via a coordination/fleet-control memo (run | pause | drain). A relentless swarm MUST be pausable in one step.
Cross-system coordination is pull-based, never a barrier. Ready-work queue + events; no fast system idles on a slow system's pickup.
Status carries a heartbeat + TTL. A reaper marks stale agents unknown, so no agent stays marked as building forever after it has crashed. build_status: GREEN must be a verified fact (CI / build gate), not an agent's self-claim.
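
The heartbeat + TTL commitment above can be sketched as a reaper pass over agent status records. The record fields, the `UNVERIFIED` and `RED` values, and the callsigns are hypothetical; only `unknown`, `building`, and `GREEN` come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentStatus:
    callsign: str
    state: str             # e.g. "building", "idle", "unknown"
    last_heartbeat: float  # seconds since epoch
    ttl_seconds: float


def reap(statuses: list[AgentStatus], now: float) -> list[str]:
    """Mark agents whose heartbeat is older than their TTL as unknown.

    Returns the callsigns reaped on this pass.
    """
    reaped = []
    for s in statuses:
        if s.state != "unknown" and now - s.last_heartbeat > s.ttl_seconds:
            s.state = "unknown"
            reaped.append(s.callsign)
    return reaped


def build_status(gate_passed: Optional[bool]) -> str:
    """Only the build gate's result counts; agent self-claims are not an input."""
    if gate_passed is None:
        return "UNVERIFIED"  # hypothetical value for "gate has not run"
    return "GREEN" if gate_passed else "RED"


agents = [
    AgentStatus("alpha", "building", last_heartbeat=1000.0, ttl_seconds=300.0),
    AgentStatus("bravo", "building", last_heartbeat=400.0, ttl_seconds=300.0),
]
reaped = reap(agents, now=1100.0)
print(reaped)           # ['bravo']
print(agents[1].state)  # unknown
```

Passing `now` in explicitly keeps the reaper testable; a running reaper would supply the current time on a fixed cadence.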

Open Decisions (escalate as Decision Artifacts via `hitl-coordinator`)

Decision A: Where the canonical interim store lives

A1. untool MCP memos (files under the MCP server) only — simplest, single-host.
A2. NATS events + ArcadeDB durable store — closest to ARC-ADR-044.
A3. Hybrid: hot memos + ArcadeDB archive, projection between. Parallels ARC-ADR-044 Decision A — pick once.

Decision B: RTE election across systems

B1. Static — operator names one system's RTE as fleet-RTE.
B2. Lease/lock election over coordination/rte-lock.
B3. A dedicated Claude RTE process that only coordinates + reports (never builds), watching both armies. (Recommended starting point: B3 is the natural home for the heartbeat + kill-switch and avoids burning a builder thread on coordination.)
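
For comparison, option B2 could look roughly like the lease below. This is an in-memory sketch under assumptions: the `RteLock` class, the owner names, and the lease length are invented, and a real implementation would need the store itself to perform the compare-and-set atomically.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Lease:
    owner: str
    expires_at: float


class RteLock:
    """In-memory stand-in for the coordination/rte-lock memo (owner + expiresAt)."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._lease: Optional[Lease] = None

    def try_acquire(self, owner: str, now: float, lease_seconds: float) -> bool:
        """Take or renew the lease; fail if another owner holds an unexpired one."""
        with self._mutex:
            held = self._lease
            if held and held.owner != owner and held.expires_at > now:
                return False
            self._lease = Lease(owner, now + lease_seconds)
            return True

    def holder(self, now: float) -> Optional[str]:
        """Return the current owner, or None if the lease is absent or expired."""
        with self._mutex:
            if self._lease and self._lease.expires_at > now:
                return self._lease.owner
            return None


lock = RteLock()
first = lock.try_acquire("antigravity-rte", now=0.0, lease_seconds=60.0)
rival = lock.try_acquire("codex-rte", now=30.0, lease_seconds=60.0)
renew = lock.try_acquire("antigravity-rte", now=50.0, lease_seconds=60.0)
takeover = lock.try_acquire("codex-rte", now=120.0, lease_seconds=60.0)
print(first, rival, renew, takeover)  # True False True True
```

The holder must renew before `expires_at`; if it crashes, the lease lapses and another candidate can take over without operator action.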

Decision C: Cross-system work-claim granularity

C1. Repo-level lock (coarse, safe, low throughput).
C2. Path / glob-level (matches the L1 ownership map).
C3. Work-item-level only (finest; relies on disjoint file touches).
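
A path / glob-level claim check (C2) might be as small as the sketch below. The `ClaimTable` class, the default-deny rule for unclaimed paths, and the example globs are assumptions for illustration; they are not the behaviour of any existing `fleet_claim_path` tool.

```python
from fnmatch import fnmatchcase
from typing import Optional


class ClaimTable:
    """Hypothetical in-memory table of glob claims."""

    def __init__(self) -> None:
        self._claims: list[tuple[str, str]] = []  # (glob pattern, owner)

    def claim(self, pattern: str, owner: str) -> None:
        # No overlap check here; a real gate would reject conflicting globs.
        self._claims.append((pattern, owner))

    def owner_of(self, path: str) -> Optional[str]:
        """Return the owner of the first claim whose glob matches, else None."""
        for pattern, owner in self._claims:
            # fnmatchcase is case-sensitive on every platform; "*" also spans "/".
            if fnmatchcase(path, pattern):
                return owner
        return None

    def may_write(self, path: str, agent: str) -> bool:
        """Refuse writes to paths that are unclaimed or claimed by someone else."""
        return self.owner_of(path) == agent


table = ClaimTable()
table.claim("src/api/*", "alpha")
table.claim("docs/*", "bravo")

print(table.may_write("src/api/auth/login.py", "alpha"))  # True
print(table.may_write("src/api/auth/login.py", "bravo"))  # False
print(table.may_write("README.md", "alpha"))              # False (unclaimed)
```

A pre-write or pre-commit hook could call `may_write` for each touched path and abort on the first refusal.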

Decision D: Kill-switch semantics

D1. Cooperative — agents poll fleet-control between steps and self-halt.
D2. Hard — supervisor kills the runtime process.
D3. Both tiers (cooperative first, hard as backstop).
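
The cooperative tier (D1) can be sketched as an agent loop that polls the control value between steps. The ADR names the values run | pause | drain but does not define them, so the semantics in the docstring are an assumption, as are the function and class names. The hard tier (D2) is not shown.

```python
from typing import Callable


def run_agent(tasks: list[list[str]], read_control: Callable[[], str]) -> list[str]:
    """Execute tasks step by step, polling the control value between steps.

    Assumed semantics (not defined by the ADR):
      run   -> keep going
      pause -> stop at the next step boundary
      drain -> finish the task in flight, then pull nothing new
    """
    executed: list[str] = []
    for task in tasks:
        if read_control() != "run":
            break  # pause or drain: do not start a new task
        for step in task:
            if read_control() == "pause":
                return executed  # halt within one step
            executed.append(step)
    return executed


class ScriptedControl:
    """Test double: answers "run" until the given poll, then a fixed mode."""

    def __init__(self, switch_at_poll: int, mode: str) -> None:
        self.polls = 0
        self.switch_at_poll = switch_at_poll
        self.mode = mode

    def __call__(self) -> str:
        self.polls += 1
        return self.mode if self.polls >= self.switch_at_poll else "run"


tasks = [["a1", "a2", "a3"], ["b1", "b2"]]
paused = run_agent(tasks, ScriptedControl(3, "pause"))
drained = run_agent(tasks, ScriptedControl(3, "drain"))
print(paused)   # ['a1']
print(drained)  # ['a1', 'a2', 'a3']
```

The sketch only works if every agent actually polls, which is why D3 pairs it with a hard backstop.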

Consequences

Positive

The operator can sleep: a heartbeat to wake up to and a brake to pull.
The cross-system barrier stall is removed — fleet runs at builder speed, not scheduler speed.
The write race is fixed by serializing through the MCP (same remedy as the ADR merge-time assigner).
Everything graduates into ARC-ADR-044 — no throwaway.
Reuses the untool MCP, the SAFE RTE role, and ARMY_PRINCIPLES; little net-new invention.

Negative / Costs

The untool MCP must become the write path: needs a serialized memo write and a fleet_claim_path (or equivalent) gate tool.
Per-runtime glue: Codex AGENTS.md, Antigravity rules, and Claude instructions must each be told to join on start, pull from the queue, honor claims, and poll fleet-control.
A continuously-running RTE (and reaper/heartbeat) costs tokens and a process.
The kill-switch is only as good as the agents' discipline to poll it (mitigated by D3's hard backstop).

Neutral

The filesystem coordination_ledger.json survives as a fast local cache / offline fallback — demoted, not deleted.
Antigravity's native .system_generated/messages/* bus stays for intra-thread task telemetry; it is not the semantic plane.
Copilot and future runtimes join the same way (one more fleet_agent_join caller).

Validation / Spike

Before committing per-runtime glue, run two cheap spikes:

Two-system pull spike. Operator sets one goal. An Antigravity-RTE and Codex both pull from one untool ready-queue. Pass = the fast system never idles waiting for the slow one, and no two agents write the same path (claims hold).
Asleep-operator spike. Run the swarm unattended for 30 min. Pass = a ≤5-min digest arrives each interval, and a pause asserted mid-run halts every agent within one step (cooperative), with a hard backstop if one ignores it.

If the pull queue can't keep the fast system busy, the queue model is wrong — fix that before wiring runtimes. If pause doesn't reliably stop the swarm, do not run it unattended.
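
The digest half of the asleep-operator pass criterion can be checked from recorded timestamps. This is a minimal sketch that assumes digest arrival times are logged as seconds from the start of the run; the function name and the sample data are invented.

```python
def digests_on_time(run_start: float, run_end: float,
                    digest_times: list[float], max_gap: float = 300.0) -> bool:
    """True if no stretch of the run went longer than max_gap without a digest."""
    points = [run_start, *sorted(digest_times), run_end]
    return all(later - earlier <= max_gap
               for earlier, later in zip(points, points[1:]))


# A 30-minute run (1800 s) with a digest every 290 s.
steady = [float(t) for t in range(290, 1800, 290)]
# The same run with the third digest missing.
gappy = [t for t in steady if t != 870.0]

print(digests_on_time(0.0, 1800.0, steady))  # True
print(digests_on_time(0.0, 1800.0, gappy))   # False
```

The pause half of the criterion still needs a live run, since it depends on every agent honoring the control value.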

Links

ARC-ADR-044 — the destination (untool ontology-orchestrated swarm)
ARC-ADR-041 — pace-layered graduation (this ladder is a pace-layer instance)
ARC-ADR-037 — credentials broker (cross-system auth resolution)
ARC-ADR-023 — container tiering
ARC-ADR-042 — temporal persistence (heartbeat/TTL stamps)
tools/mcp-local-fleet/README.md — the local-fleet MCP (canonical plane primitives)
docs/fleet-heartbeat.md — the contract/health heartbeat (sibling cadence to the RTE heartbeat)
Contracts: fleet-coordination-ledger.schema.json, fleet-coordination-handoff.schema.json
Prior art (observed): the Antigravity swarm coordination_ledger.json + coordination_drop_*.json; the untool memo mirror tools/mcp-local-fleet/memos/coordination/*