Coordination & the Holographic Virtual Filesystem

TL;DR

The untool.ai fleet runs many agents (Claude Code, Codex, Copilot, Grok, Antigravity, claude.ai routines) across many repos (hub + spokes) on many execution surfaces (local, Azure Container Apps, ACI, GitHub Actions, KEDA self-hosted runners, microVMs). Classic CI/CD assumes one repo, one runner, one human-paced reviewer — none of which hold here. So we built a thin coordination plane: a holonic board (ADR-035, ADR-058), a planned holographic VFS (ADR-054, design only), a file-first static registry (ADR-071), a NATS JetStream event bus (ADR-022, ADR-049), a self-healing harness (ADR-050), a local loopback runner (ADR-048), Postman-based contract distribution (ADR-034), and a CF-Access-gated MCP control plane.

This page is a research-flavoured tour of why those pieces exist and how they fit together. Several pieces are shipped and several are designed but not yet built — we call that out explicitly per section.

How to read this page

The architecture below is layered. Each layer solves a specific failure mode that bit us in practice; we did not start with a grand design and then build it. We started with a single repo and a single human, added a second repo, hit a coordination wall, added a board; added an agent runtime, hit a rate-limit wall, added an event bus; added a sandbox, hit a filesystem wall, designed HVFS. The ADR numbers are not chronological because the understanding is not chronological — we sometimes wrote the ADR after the implementation stabilised, sometimes before. Treat the section ordering as a logical dependency walk, not a build order.

A useful mental model: there are four kinds of state in the fleet — code (git), contracts (Postman + git), coordination (the board + NATS), and runtime (containers + microVMs). Each piece below owns exactly one of those.


1. The coordination problem

A normal CI/CD setup assumes one repo, one default branch, one queue of human-reviewed PRs, and a small number of long-lived runners. The untool.ai fleet violates every one of those assumptions:

| Axis | Classic CI/CD | untool.ai fleet |
| --- | --- | --- |
| Repos | 1 | Hub + N spokes (frontend-core, backend-core, middle-core, commons-core, agentarmy-forge, …) |
| Agent runtimes | None; humans drive | Claude Code (local), Codex, Copilot coding agent, Grok Build, Antigravity, claude.ai routines |
| Execution surfaces | GitHub Actions only | Local laptop, Azure Container Apps (ACA), Azure Container Instances (ACI), GitHub-hosted runners, KEDA scale-to-zero self-hosted runners, microVMs |
| Reviewers | Human(s) | Human + @copilot + @codex + @gemini-code-assist + @claude review loop |
| Pacing | Hours–days | Minutes; multiple PRs in flight per repo simultaneously |
| Trust model | Centralised | Sandboxed: every spoke is a microVM with no direct path back to the hub filesystem |

Three structural problems fall out of that table:

  1. No shared filesystem. A sandboxed spoke cannot ln -s or submodule the hub. The only delivery channel for hub-owned content (ADRs, contracts, agent definitions) is explicit copy-in sync. That breaks the usual "one source of truth on disk" assumption.
  2. No central scheduler. Six different agent runtimes pick up work on six different cadences. They cannot all poll the GitHub API at full speed without hitting the secondary rate limit (we have, repeatedly — see gh merge rate-limit gotcha). We need a durable, queue-able coordination surface that survives any single agent being offline.
  3. No single audit log. Cloud agents (claude.ai routines, GitHub Actions) have no local shell; local agents (Claude Code) do. Both need to drive the same set of capabilities — fleet_up, fleet_logs, fleet_agent_inbox. The interface has to work without a local Docker socket while staying auditable.

The rest of this page is the architecture we landed on. The unifying idea: make the coordination plane a first-class product surface, not a side effect of GitHub.

Why classic CI/CD doesn't fit

The "classic CI/CD" assumption that fails hardest is synchronicity. A normal pipeline assumes the producer of an event (a push, a merge) and the consumer of that event (a test job, a deploy) are connected by a synchronous chain of webhooks and Action steps. If anyone in the chain is offline, the chain breaks; if anyone is too slow, the chain backs up against an API rate limit.

The fleet's reality is asynchronous, partial, and opportunistic. A claude.ai routine wakes every 5 minutes; a GitHub Actions runner picks up jobs in seconds; a local Claude Code session may be offline for hours and then catch up in a burst. The coordination plane has to absorb all three rhythms without dropping a single event and without serialising them through a single bottleneck. That is the design constraint that drove almost every choice below.

The second assumption that fails is filesystem locality. Classic CI/CD jobs all share the workspace of the repo they're running for. The fleet's sandboxed agents share nothing — each microVM starts cold, with only its own configured slice of state. So everything that has to be visible across agents has to be visible over the network, in a form an agent can ingest without prior context. That is why we lean so hard on a single board (URL-addressable) and a single contract registry (URL-addressable) and a single MCP edge (URL-addressable). URLs are the only inter-agent currency that works in every runtime.


2. Holonic Unified Board (HUB) — ADR-035

Koestler's holon as the design metaphor

Arthur Koestler coined "holon" in The Ghost in the Machine (1967) for an entity that is simultaneously a whole and a part — a Janus-faced unit in a hierarchy ("holarchy"). A cell is a holon: a complete autonomous system, and a component of a tissue. A SAFe Feature is a holon: a complete deliverable, and a child of an Epic.

The fleet board models work the same way. Every item is:

  • A whole — it has its own acceptance criteria, owner, status, PR, mocks, contracts.
  • A part — it sits inside a Feature, inside an Epic, inside a Program Increment, inside a Release Train.

That recursion is not cosmetic. It is the reason a single agent can pick up an item and act on it without loading the whole graph: the item is locally complete. And it is the reason a single Decision Artifact at the Feature level can block a dozen Stories below it without explicit fan-out: the holarchy carries blocking semantics for free.
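
The blocking behaviour can be sketched in a few lines. This is a minimal illustration, not the board's implementation: the `Holon` class, the `blocking_ancestor` helper, and the choice to model a pending Decision Artifact as an `Awaiting Decision` status on the parent are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Holon:
    """One board item: a whole (own status) and a part (optional parent)."""
    key: str
    kind: str                     # e.g. "Epic", "Feature", "Story"
    status: str = "Todo"
    parent: Optional[str] = None  # key of the enclosing holon, if any


def blocking_ancestor(board: Dict[str, Holon], key: str) -> Optional[str]:
    """Walk up the holarchy and return the first ancestor awaiting a decision."""
    seen = set()
    current = board[key].parent
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental parent cycles
        node = board[current]
        if node.status == "Awaiting Decision":
            return node.key
        current = node.parent
    return None


# Hypothetical board: one Feature on hold blocks both of its Stories.
board = {
    "E1": Holon("E1", "Epic"),
    "F1": Holon("F1", "Feature", status="Awaiting Decision", parent="E1"),
    "S1": Holon("S1", "Story", parent="F1"),
    "S2": Holon("S2", "Story", parent="F1"),
    "F2": Holon("F2", "Feature", parent="E1"),
}
```

An agent holding only `S1` needs the parent chain, not the whole graph, to learn that it is blocked; no per-story blocking flag has to be written.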

See ADR-035 — Holonic Unified Board Architecture.

Today: GitHub Projects v2 as the materialization

We materialize the holonic board on GitHub Projects v2 (project number 1 in the hub):

  • Type field carries SAFe semantics (Epic / Feature / Story / Enabler / Bug / Spike / Decision).
  • Parent issue makes holarchy explicit; sub-issues inherit PI/Iteration by convention.
  • Status is the lifecycle state: Todo → Ready → In Progress → In Review → Done, plus Awaiting Decision for HITL holds.
  • Cross-repo items show up because we add spoke issues to the hub project, not just their own boards.

The gh project CLI plus the fleet_project_board / fleet_project_fields MCP tools (see ADR-053) drive the macro/RT tier from any agent runtime.

Tomorrow: sharded holonic board (ADR-058)

GitHub Projects v2 caps out around the low tens of thousands of items and serialises mutations through a single GraphQL endpoint. At fleet scale (target: hundreds of holons per Program Increment across ten spokes), that ceiling becomes the bottleneck.

ADR-058 — Sharded Holonic Board Coordination describes the scale-out path: shard by Release Train or by spoke, keep the holarchy edges in a denormalised graph (ArcadeDB), and reconcile to GitHub Projects v2 asynchronously. The shard boundary respects the holon — a parent and its immediate children always live on the same shard, so local agents never need cross-shard reads to act.

Status: designed; current scale does not yet require sharding. We will switch over when we cross ~5k open items.

Why "Holonic Unified Board" and not just "the board"

Two reasons the name matters in practice:

  1. Holonic is the discipline that prevents the board from becoming a giant flat list. Every item must declare its parent, and every item must be self-contained enough that an agent can act on it without loading the parent's full context. That discipline is what makes the board usable by agents instead of just visible to humans. When we have failed to enforce it (early in the project) we ended up with orphan stories that nobody — human or agent — knew how to triage.
  2. Unified is the architectural commitment that there is exactly one board across all repos. Hub issues and spoke issues both land on the same Projects v2 surface; the Type/PI/Status fields are the same; an agent that knows how to read the hub board can read the whole fleet. That is the only way an agent like @board-manager can answer "what is everyone working on right now" without N round-trips per repo.

3. Holographic Virtual Filesystem (HVFS) — ADR-054

What HVFS is

A git replacement for fleet content. Not a wrapper around git, not a layer on top — a parallel filesystem with git-shaped verbs.

The stack:

  • Storage: content-addressable object store (S3-compatible).
  • Versioning: lakeFS for Git-like data ops over the object store — branches, commits, merges, atomic cross-collection transactions.
  • CLI: ut vfs mirrors the verbs developers already know — ut vfs clone, ut vfs branch, ut vfs commit, ut vfs push, ut vfs merge.
  • Access: every spoke agent gets a credential-scoped view; the hub holds the master.

Why "holographic"

In holography every fragment of the recording medium contains a (lower-resolution) image of the whole. HVFS aims for the same property at the fleet level: every shard carries addressable references to the whole, so any agent — local or sandboxed — can resolve any path without a central round-trip, even if it has only its own slice cached.

Concretely:

  • Every commit in HVFS is content-addressable; the address is the same in every shard that has seen it.
  • A spoke holds a thin slice (just its own contracts + ADRs it cares about), but the reference graph reaches the whole fleet.
  • Resolution is lazy: fetch on first read, cache, invalidate via the event bus.

That gives us the property the sandboxed spokes need most — they can reason about any hub artifact without holding it — while keeping the copy-in sync surface small.
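
Since HVFS is design only, the following is just a sketch of the lazy-resolution idea under stated assumptions: SHA-256 as the content address, an in-memory dictionary standing in for the hub object store, and a hypothetical `LazySlice` class for the spoke-side cache.

```python
import hashlib
from typing import Callable, Dict


def address_of(content: bytes) -> str:
    """Content address: SHA-256 of the bytes (an assumed hash choice)."""
    return hashlib.sha256(content).hexdigest()


class LazySlice:
    """Spoke-side cache that fetches an object on first read only."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}
        self.fetches = 0  # counts round-trips to the remote store

    def read(self, address: str) -> bytes:
        if address not in self._cache:
            data = self._fetch(address)
            self.fetches += 1
            # The address doubles as an integrity check on what was fetched.
            if address_of(data) != address:
                raise ValueError("content does not match its address")
            self._cache[address] = data
        return self._cache[address]

    def invalidate(self, address: str) -> None:
        """Drop a cached object, e.g. after an event-bus notification."""
        self._cache.pop(address, None)


# Hypothetical hub store holding a single contract.
spec = b"openapi: 3.1.0\n"
hub_objects = {address_of(spec): spec}
spoke = LazySlice(hub_objects.__getitem__)
```

Because the address is derived from the bytes, the same object resolves to the same address on every shard, which is the property the bullets above rely on.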

Status: design only

HVFS is not built yet. The design was proposed on PR #405 and merged into main via #473 in June 2026 as the architectural ADR set; no runtime exists yet. It is captured in ADR-054 — Holographic Virtual Filesystem and ADR-056 — Integrated VFS-Board Coordination.

Today the equivalent capability is approximated by:

  • Git for source.
  • The Hub Contract Mirror (ADR-047) for explicit copy-in distribution of contracts.
  • Postman (ADR-034) for live contract surfaces.
  • The file-first static registry (ADR-071) for fleet membership.

HVFS will eventually subsume the first three.

Why a git replacement and not a git layer

A natural question: why not just wrap git? Git already has content addressing, branches, merges, and a vast tooling ecosystem. The honest answer is that git's model is wrong for large binary contracts, generated artifacts, and per-spoke partial views.

  • Git does not ship partial trees by default. Without opting into sparse checkout or partial clone, a spoke that wants three contracts has to clone (a shallow version of) the entire hub.
  • Git's object model is content-addressable but its ref model is not — a branch is mutable, which breaks the holographic property (the same name resolves to different content in different shards).
  • Git LFS exists, but every team that has used it at scale has stories. The integration seams are sharp.
  • The diff/merge model is line-oriented; OpenAPI spec evolution is structural and wants semantic merges.

lakeFS sidesteps all of this by sitting on top of an object store and exposing Git-like verbs over arbitrary content. We get the verbs the team already knows without inheriting the constraints we don't want.

The integration with the board (ADR-056)

ADR-056 — Integrated VFS-Board Coordination describes how the board and HVFS plug into each other: every board item carries a VFS ref (a commit hash) for its acceptance state, every VFS commit carries a board-item link in its metadata, and a merge to the main HVFS branch is what flips the linked board item to Done. That tight coupling is the property that lets a sandboxed agent close the loop without ever touching GitHub directly — it commits to its HVFS branch and the board updates as a downstream projection.


4. File-first fleet static registry — ADR-071

The fleet needs a membership and addressing table: who is in the fleet, what is each member's callsign, where do they live, what capabilities do they expose. The naive answer is "a database." We deliberately chose a file-first static registry instead.

The trade-off:

| Property | DB-of-record | File-first static |
| --- | --- | --- |
| Source of truth | A running service | A versioned file (fleet/registry.yaml) |
| Read path | Network call | Local file read |
| Write path | RPC / API | PR + merge |
| Audit trail | Service logs | git log |
| Offline-friendly | No | Yes |
| Idempotent join | Hard (race) | Trivial (declare entry) |
| Stale view | Possible | Bounded by git pull cadence |

For an agent fleet, offline-friendly and idempotent join dominate. A sandboxed microVM that loses network on cold-start should still be able to look up its own callsign and the hub's URL. A new agent runtime joining the fleet should not need to coordinate with a registry service that may be down.

The registry encodes:

  • Callsign stability — each member's logical name (hub, frontend-core, backend-core, …) is the primary key. Hostnames, URLs, container tags all change; callsigns do not.
  • Capability declarations — what tools/surfaces each member exposes (fleet_logs, fleet_inbox, …).
  • Idempotent join — adding a member is a single PR that adds a row; replays are no-ops.

See ADR-071 — File-first Fleet Static Registry.

Why "callsign" instead of "hostname"

We deliberately use callsigns as the primary identity for fleet members rather than hostnames, URLs, or container IDs. The reason is the same reason aviation uses callsigns: the identity has to survive the physical address changing.

A spoke might run as frontend-core-staging.westus2.azurecontainer.io today and frontend-core.azurewebsites.net tomorrow. The container ID rotates every redeploy. The git SHA changes on every push. Only the callsign — frontend-core — is stable across the entire lifecycle. Every NATS subject, every audit log entry, every inbox message, every memo is keyed on the callsign. The mapping from callsign to current address is exactly one lookup, into the registry.

That indirection is what gives us the property classic CI/CD lacks: an agent can address another agent by name, in code, in a contract, in an issue comment, and that address will resolve correctly forever even if the underlying infrastructure is rebuilt from scratch. URLs are too brittle; container IDs are too short-lived; callsigns are exactly the right granularity.
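
A minimal sketch of that single lookup follows. The real schema of fleet/registry.yaml is not shown on this page, so the field names (`address`, `capabilities`) and the `resolve` and `inbox_subject` helpers are hypothetical; the two addresses are the examples from the paragraph above.

```python
from typing import Any, Dict

# Hypothetical in-memory view of the registry, keyed on callsign.
registry: Dict[str, Dict[str, Any]] = {
    "frontend-core": {
        "address": "frontend-core-staging.westus2.azurecontainer.io",
        "capabilities": ["fleet_logs", "fleet_inbox"],
    },
}


def resolve(reg: Dict[str, Dict[str, Any]], callsign: str) -> str:
    """Map a stable callsign to its current address: exactly one lookup."""
    try:
        return str(reg[callsign]["address"])
    except KeyError:
        raise LookupError(f"unknown callsign: {callsign}") from None


def inbox_subject(callsign: str) -> str:
    """Everything else keys on the callsign, never on the address."""
    return f"fleet.agent.{callsign}.inbox"
```

Re-pointing the `address` field changes what `resolve` returns, while `inbox_subject("frontend-core")` stays the same across the redeploy.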

Idempotent join, in detail

Idempotency matters because the registry is read by every agent on every cold start. If joining required a coordinated write to a service, every cold start would be a coordination point. With the file-first model, the join is just a row in a YAML file. An agent that re-joins is a no-op merge. An agent that never quite joined cleanly can be force-joined by anyone with PR rights. An agent whose entry got corrupted can be rolled back via git revert.

We have used all three of those affordances. None of them would be available with a service-based registry, and none of them required any code we wrote ourselves — they are just git working as intended.
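
The idempotency claim can be illustrated with a pure function. This is a sketch only: the dictionary stands in for the YAML file, and the `join` helper and entry fields are assumptions, not the registry's actual tooling.

```python
import copy
from typing import Any, Dict

Registry = Dict[str, Dict[str, Any]]


def join(registry: Registry, callsign: str, entry: Dict[str, Any]) -> Registry:
    """Declare an entry and return the new registry; the input is untouched.

    Joining is a declaration, not a negotiation, so replaying the same
    join produces an identical registry.
    """
    updated = copy.deepcopy(registry)
    updated[callsign] = copy.deepcopy(entry)
    return updated


# Hypothetical starting state and new member.
base: Registry = {"hub": {"capabilities": ["fleet_logs"]}}
entry = {"capabilities": ["fleet_logs", "fleet_inbox"]}

once = join(base, "backend-core", entry)
twice = join(once, "backend-core", entry)  # replay of the same join
```

Here `once` and `twice` compare equal, and `base` still holds the pre-join state, which is the in-memory analogue of rolling back with git revert.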


5. Event bus bridges — ADR-022 + ADR-049

Why NATS JetStream

The fleet needs a durable, ordered, ack-able event log that:

  • Survives single-consumer outages (an agent dies; the message stays).
  • Supports pull-based consumers (an agent picks up when it is ready, not when the bus pushes).
  • Has a tiny operational footprint (we run it ourselves on a single container).

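NATS JetStream checks every box. It gives us durable streams, durable pull consumers with ack/redeliver, subject hierarchies, and publish-side deduplication through the message-ID dedup window. That is the basis for exactly-once semantics, provided consumers acknowledge reliably or handle redelivery idempotently.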
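
To make the dedup window concrete, here is a small in-memory model of the idea. It is not the NATS client API: in JetStream the publisher supplies the ID in the Nats-Msg-Id header and the stream is configured with a duplicate window, whereas the `DedupWindow` class and explicit `now` argument below are assumptions for illustration.

```python
from typing import Dict


class DedupWindow:
    """Reject a message ID that was already accepted within the window."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self._seen: Dict[str, float] = {}  # message ID -> time accepted

    def accept(self, msg_id: str, now: float) -> bool:
        # Forget IDs that have aged out of the window.
        expired = [k for k, t in self._seen.items() if now - t >= self.window]
        for k in expired:
            del self._seen[k]
        if msg_id in self._seen:
            return False  # duplicate publish inside the window: drop it
        self._seen[msg_id] = now
        return True
```

A retry of the same ID inside the window is dropped; the same ID after the window has passed is treated as new, which is why the window has to exceed the longest plausible publisher retry interval.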

See ADR-022 — Event Bus Bridges.

Why webhook → NATS → projector, not direct fanout

GitHub webhooks land on a single HMAC-verified ingress endpoint, get published onto a NATS subject, and are then projected by N consumers. The alternative — letting each consumer subscribe to GitHub directly — would mean:

  • N webhook URLs (none of which can be a microVM behind NAT).
  • N HMAC secrets to rotate.
  • No durability: a missed delivery is gone.
  • No replay: a new consumer cannot catch up on history.

Two container images implement this pipeline:

  • webhook-projector — verifies HMAC, publishes to NATS, idempotent on GitHub's delivery ID.
  • event-bridge — projects NATS events into agent-specific surfaces (Decision Artifacts, board updates, inbox messages).
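
The ingress step can be sketched as follows. GitHub signs each delivery with HMAC-SHA256 and sends the result as `sha256=<hex>` in the X-Hub-Signature-256 header; everything else here (the `WebhookProjector` class, the in-memory seen-set, the `publish` callback standing in for a NATS publish) is a hypothetical illustration rather than the actual image's code.

```python
import hashlib
import hmac
from typing import Callable, Set


def signature_ok(secret: bytes, body: bytes, header: str) -> bool:
    """Check a 'sha256=<hex>' signature over the raw body in constant time."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), header.encode())


class WebhookProjector:
    """Verify, deduplicate on delivery ID, then publish to the bus."""

    def __init__(self, secret: bytes, publish: Callable[[str, bytes], None]) -> None:
        self._secret = secret
        self._publish = publish
        self._seen: Set[str] = set()  # delivery IDs already published

    def handle(self, delivery_id: str, repo: str, event: str,
               signature: str, body: bytes) -> str:
        if not signature_ok(self._secret, body, signature):
            return "rejected"
        if delivery_id in self._seen:
            return "duplicate"  # GitHub redelivered; publishing again would double-count
        self._seen.add(delivery_id)
        self._publish(f"fleet.board.{repo}.{event}", body)
        return "published"
```

Verification runs on the raw request bytes, before any JSON parsing, because re-serialising the payload would change the digest.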

Harness-isolated agent event bridge

ADR-049 — Harness-Isolated Agent Event Bridge layers on the isolation story: a sandboxed agent must not be able to forge events as another agent. Each guest microVM gets a scoped NATS credential; the bridge enforces per-callsign subject prefixes; cross-callsign writes go through the host with audit.

That gives us the property classic message buses miss: an event's identity is verifiable end-to-end without trusting the producer's runtime.

Subject hierarchy

NATS subjects use dots as separators and support wildcards. We use the hierarchy directly as the access boundary:

  • fleet.agent.<callsign>.inbox — write-only for the addressing agent, read-only for the target.
  • fleet.agent.<callsign>.handoff — write-only for any agent, read-only for the target.
  • fleet.board.<repo>.<event> — projected from GitHub webhooks.
  • fleet.heartbeat.<callsign> — write-only for the heartbeat owner.
  • fleet.audit.> — append-only, read by the observability stack only.

The per-callsign credential's allowed subject list directly mirrors this hierarchy, so authorisation is a string match and the JetStream message itself becomes the audit record.
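
The string match can be sketched with NATS wildcard rules, where `*` matches exactly one token and `>` matches one or more trailing tokens. The `subject_matches` and `may_publish` helpers and the example allow list are assumptions for illustration; the real enforcement lives in the NATS credential, not in application code.

```python
from typing import Iterable


def subject_matches(pattern: str, subject: str) -> bool:
    """Match a subject against a pattern using NATS wildcard semantics."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(p_tokens):
        if tok == ">":
            # '>' is only valid as the last token and needs at least one token to consume.
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


def may_publish(allowed: Iterable[str], subject: str) -> bool:
    """A credential may publish if any allowed pattern matches the subject."""
    return any(subject_matches(pattern, subject) for pattern in allowed)


# Hypothetical publish allow list for the frontend-core credential.
frontend_core_publish = [
    "fleet.heartbeat.frontend-core",
    "fleet.agent.*.handoff",
]
```

With this list, frontend-core can hand off to any callsign and emit its own heartbeat, but cannot emit a heartbeat as backend-core.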

Durable pull consumers, in detail

Pull consumers (as opposed to push) matter for the fleet because agents have wildly different cadences. A claude.ai routine running every 5 minutes wants to drain its inbox on each tick, not have messages pushed at a rate it can't process. A local Claude Code session that just woke up wants to fetch the entire backlog at once. Durable pull consumers handle both transparently: the ack pointer lives server-side, redelivery is automatic on failed ack, and the consumer controls the batch size on every fetch.
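
The following in-memory stand-in shows why the server-side ack state matters. It is a simplified model, not the NATS client API: real JetStream redelivers an unacknowledged message only after its ack wait expires, while this hypothetical `DurableConsumer` offers it again on the very next fetch.

```python
from typing import List, Set, Tuple


class DurableConsumer:
    """Tracks acks on the 'server' side so any client session can resume."""

    def __init__(self, stream: List[str]) -> None:
        self._stream = list(stream)
        self._acked: Set[int] = set()  # sequence numbers already acknowledged

    def fetch(self, batch: int) -> List[Tuple[int, str]]:
        """Return up to `batch` unacknowledged messages in stream order."""
        pending = [
            (seq, msg)
            for seq, msg in enumerate(self._stream)
            if seq not in self._acked
        ]
        return pending[:batch]

    def ack(self, seq: int) -> None:
        self._acked.add(seq)
```

A routine on a 5-minute tick can call `fetch(2)` and ack what it finished; a session that just woke up can call `fetch` with a large batch and receive everything still outstanding, including messages an earlier fetch never acknowledged.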


6. Self-healing bidirectional harness — ADR-050

What "self-healing" means here

Not "kubectl restarts pods that fall over." We mean heartbeat-driven reconcile:

  • Every agent and every container emits a heartbeat to a known NATS subject.
  • A reconciler watches for absences against the declared registry (ADR-071).
  • On absence, it walks a documented recovery procedure: re-pull image, re-issue credential, restart, escalate to HITL if the recovery fails.

The crucial point: the reconciler never improvises. It walks documented procedures. Anything unknown escalates. That keeps the blast radius bounded — a misbehaving heartbeat does not produce a cascade of "fixes."
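
A minimal sketch of the reconcile loop, assuming a fixed heartbeat timeout and a procedure expressed as an ordered list of named steps that each report success or failure. The `absent` and `recover` helpers, and the rule that the first failed step escalates, are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable[[str], bool]]  # (step name, action returning success)


def absent(declared: List[str], last_seen: Dict[str, float],
           now: float, timeout: float) -> List[str]:
    """Callsigns in the registry whose last heartbeat is older than the timeout."""
    return [
        callsign
        for callsign in declared
        if now - last_seen.get(callsign, float("-inf")) > timeout
    ]


def recover(callsign: str, procedure: List[Step]) -> str:
    """Walk the documented steps in order; never improvise past a failure."""
    for name, action in procedure:
        if not action(callsign):
            return f"escalate-to-hitl:{name}"
    return "recovered"
```

The reconciler has no branch for unknown conditions: a step either succeeds or the whole recovery is handed to a human, which is what keeps the blast radius bounded.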

Bidirectional flow (host ⇄ guest microVM via vsock)

The "bidirectional" in the ADR title refers to the host ⇄ guest channel. A microVM agent needs to:

  • Receive work (board polls, event bridge messages).
  • Emit results (PR open, comment, decision request).

Both directions cross the host/guest boundary. We use vsock (virtio socket) for that boundary because it is the cleanest in-kernel guest↔host channel that doesn't expose anything else — no shared filesystem, no network namespace leak.

Per PR #506, each vsock conduit carries:

  • A rate limiter (token-bucket per guest callsign).
  • Phoenix tracing spans for every message, so we can correlate a board action back to the originating agent runtime.
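
A token bucket per guest callsign might look like the sketch below. The rate and capacity values, the class names, and the explicit `now` argument (used instead of a wall clock so the behaviour is deterministic) are assumptions; the actual limiter in PR #506 is not reproduced here.

```python
from typing import Dict


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class ConduitLimiter:
    """One independent bucket per guest callsign."""

    def __init__(self, rate: float, capacity: float) -> None:
        self._rate = rate
        self._capacity = capacity
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, callsign: str, now: float) -> bool:
        bucket = self._buckets.get(callsign)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, now)
            self._buckets[callsign] = bucket
        return bucket.allow(now)
```

Keying the buckets on callsign means one noisy guest exhausts only its own budget; other guests on the same host are unaffected.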

See ADR-050 — Self-Healing Bidirectional Agent Harness.


7. Local loopback runner & hyperautomation — ADR-048

What it is

A local Docker-in-Docker runner pool that mirrors GitHub Actions semantics — same step model, same env-var conventions, same ${{ secrets.X }} resolution — but executes against the host's loopback network. Two reasons:

  1. Off the metered cap. GitHub Actions minutes are billable and finite. Local runners are free and limited only by the box.
  2. Fast loopback access. The runner can hit 127.0.0.1:8765 (local-fleet MCP), localhost:3000 (frontend), and localhost:8100 (middle) without any tunnel overhead, which is exactly what hyperautomation flows need.

See ADR-048 — Local Loopback Runner & Hyperautomation.

Integration with ACA self-hosted KEDA runners

The local loopback runner is the dev path. The production-grade path is the ACA self-hosted runner pool: KEDA scale-to-zero, ACR-pulled image, runs-on: [self-hosted, linux]. Same image, same step model; the difference is the environment (cloud, real ACR, real Azure AD) versus the local box.

That lets a flow that worked locally deploy unchanged: change the runs-on tag, push, done.

Why DinD, despite the warnings

"Don't run Docker in Docker" is well-trodden advice. We do it anyway, for one specific reason: the runner is itself an ephemeral disposable. The whole point of the local runner is that it boots, runs one job, and dies. The classic DinD failure modes (stale state, layer cache pollution, security escape) are bounded because there is no long-lived runner state to pollute. The host's Docker socket is not mounted into the runner; the runner runs its own dockerd. Build cache loss is fine — the next runner pulls from ACR.

This is the same calculus GitHub itself uses for its hosted runners; we are simply applying it locally.

Host-proxied LLM gateway (PR #500)

A subtle but important integration: the local runner can talk to the host-proxied LLM gateway (PR #500). Instead of every runner job re-authenticating to OpenRouter / Anthropic / Cerebras, the host runs a single gateway that holds the credentials and meters usage. The runner just speaks OpenAI-shaped HTTP to host.docker.internal. That keeps secrets off the runner image and gives us a single per-call audit point.


8. Cross-repo access & contract distribution — ADR-034

Postman as the contract registry + mock surface

Every contract — OpenAPI, AsyncAPI, GraphQL — is published to Postman, in the AgentArmy workspace (64b63429-ed44-4078-861a-c8867742eaf4). The PMAK is stored in Azure Key Vault as POSTMANTOKEN. The Postman MCP server is available to every agent.

Why Postman and not just git-tracked specs:

  • Live mocks. Postman mocks return on the public internet the moment a spec is published. A consumer can integrate against the mock URL before the producer exists.
  • Spec versioning. Postman tracks every published version with a stable URL.
  • Workspace ACLs. A single shared workspace is the access boundary; we don't have to thread per-repo secrets.

Contract-first AND mock-first

This is the standing rule, not a guideline: every new contract gets a live mock the moment it is defined. Both halves matter.

  • Contract-first keeps the producer and consumer agreeing on shape before either is built.
  • Mock-first means the consumer can build in parallel against the mock URL, before the producer container exists.

The mock URL is the integration handoff. The producer comes online later. The consumer never has to wait.

Caveat we have been bitten by: Postman mocks always return 200 regardless of auth headers. That is fine for shape and parallel unblocking, but JWT/authz tests must verify against the real producer, not the mock, or you get false greens. See ADR-034 for the full discussion.


9. Cloud agent control plane — mcp.untool.ai

Cloud agents — claude.ai routines, GitHub Actions runners, remote API callers — have no local Docker socket and no local filesystem. They can't run docker ps. They can't tail a file. They need a network surface that gives them the same observe-and-drive capability a local Claude Code session has.

That surface is the local-fleet MCP server, bound to 127.0.0.1:8765 on the host, fronted by Cloudflare Tunnel at mcp.untool.ai, gated by Cloudflare Access service tokens.

Auth model

Two paths into the same server:

  • Loopback (127.0.0.1): legacy bearer token accepted. Used by local-only agents.
  • Public (mcp.untool.ai): CF Access service tokens required — CF-Access-Client-Id + CF-Access-Client-Secret headers. CF Access verifies the token at the edge; the origin also verifies the resulting JWT (defense in depth). Every CF Access principal is logged by Client ID in the CF audit log; we keep a reverse map of "ID → friendly name" in Key Vault as cf-access-svc-<name>-id.
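
For the public path, a non-interactive client attaches the two service-token headers to each request. The sketch below only constructs the request and does not send it; the `build_mcp_request` helper and the JSON content type are assumptions, and the endpoint path on mcp.untool.ai is not specified on this page, so the caller supplies the full URL.

```python
import urllib.request


def build_mcp_request(url: str, client_id: str, client_secret: str,
                      payload: bytes) -> urllib.request.Request:
    """Build (but do not send) a POST carrying CF Access service-token headers."""
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "CF-Access-Client-Id": client_id,
            "CF-Access-Client-Secret": client_secret,
            "Content-Type": "application/json",  # assumed body format
        },
    )
```

In practice the ID and secret would come from a secret store rather than source code; loopback callers skip these headers and use the bearer token instead.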

The four NATS coordination tools on the web

Among the MCP tools the server exposes, four are agent-coordination primitives that work over NATS:

  • fleet_agent_inbox — read messages addressed to a callsign.
  • fleet_agent_handoff — send a message to another callsign (work handoff, decision request, …).
  • fleet_agent_memo_read / fleet_agent_memo_write — durable per-callsign scratchpad.

These are the tools that let a claude.ai cloud routine hand off a piece of work to a local Claude Code session without either of them knowing the other's network location. The routine writes to NATS via the MCP edge; the local session pulls on its next heartbeat. Verified working over mcp.untool.ai via the claude-routine-1 service token as of June 2026.

The remaining tools are the operations tools: fleet_ps / fleet_inspect / fleet_logs (read-only, auto-approve) and fleet_up / fleet_down / fleet_restart / fleet_build / fleet_deploy (write, per-call approval).

Why we don't expose middle-core / backend-core directly

The CF tunnel exposes frontend-core only (:3000) and the MCP server (:8765). Middle-core and backend-core stay on localhost behind the Next.js BFF at /api/copilotkit, which owns the session-cookie → JWT injection (per ARC-ADR-002). Tunneling middle/back directly would bypass JWT injection and orphan rate-limiting, auth tiers, and billing — that slot belongs to Azure APIM when external API monetisation arrives, not to a raw tunnel.

See docs/fleet-coordination.md and docs/fleet-mcp-distribution.md.

Audit and observability

Every MCP call — local or public — is appended to tools/logs/mcp-audit.log.YYYY-MM-DD with the principal, the tool name, the redacted arguments, and the outcome. For CF Access calls the principal is the Client ID (the friendly name is looked up via the Key Vault reverse map). For loopback bearer calls the principal is local-bearer. The audit log is the system of record for "who did what" and is sufficient on its own to reconstruct a session.
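
One way such a record could be produced is sketched below. The on-disk format is not described on this page, so the JSON-lines layout, the field names, and the set of redacted keys are all assumptions; only the file-name pattern and the four recorded items (principal, tool, redacted arguments, outcome) come from the text above.

```python
import json
from datetime import date, datetime
from typing import Any, Dict

# Hypothetical list of argument names whose values must never be logged.
SENSITIVE_KEYS = {"token", "secret", "password", "authorization"}


def redact(args: Dict[str, Any]) -> Dict[str, Any]:
    """Replace sensitive argument values with a fixed marker."""
    return {
        key: ("***" if key.lower() in SENSITIVE_KEYS else value)
        for key, value in args.items()
    }


def audit_path(day: date) -> str:
    """Daily log file: tools/logs/mcp-audit.log.YYYY-MM-DD."""
    return f"tools/logs/mcp-audit.log.{day.isoformat()}"


def audit_line(principal: str, tool: str, args: Dict[str, Any],
               outcome: str, when: datetime) -> str:
    """Serialise one call as a single JSON line (assumed format)."""
    record = {
        "ts": when.isoformat(),
        "principal": principal,
        "tool": tool,
        "args": redact(args),
        "outcome": outcome,
    }
    return json.dumps(record, sort_keys=True)
```

Redaction happens before serialisation, so a secret never reaches the file even if a later reader parses the log with different tooling.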

The MCP server exposes only the named tools listed above; there is no shell-exec tool. That is a deliberate constraint: a generic shell would collapse the access model, because every other tool's allowlist would become moot. The named tools cover every observe-and-drive use we have hit; anything else gets added as a new named tool with its own allowlist.

Tool naming gotcha

Tool names must match ^[a-zA-Z0-9_-]+$. Claude Code's MCP client silently drops dotted names: claude mcp list showed 0 tools when we tried fleet.agent.inbox style names. We standardised on snake_case (fleet_agent_inbox and so on) and the issue went away. It cost half a day; we are documenting it here so the next implementer skips it.
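
A registration-time check would have surfaced the problem immediately. This is a small sketch; the `valid_tool_name` and `dotted_to_snake` helpers are hypothetical and not part of the MCP server.

```python
import re

# Allowed characters for a tool name: letters, digits, underscore, hyphen.
TOOL_NAME = re.compile(r"[a-zA-Z0-9_-]+")


def valid_tool_name(name: str) -> bool:
    """True if the whole name consists of allowed characters only."""
    return TOOL_NAME.fullmatch(name) is not None


def dotted_to_snake(name: str) -> str:
    """Convert a dotted, subject-style name to the snake_case convention."""
    return name.replace(".", "_")
```

Failing loudly at registration is preferable to a client that silently lists zero tools.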


The end-to-end flow

flowchart LR
    Dev[Developer / Agent] -->|create issue| Board[(Holonic Board<br/>GH Projects v2)]
    Board -->|label: copilot-task<br/>or agent-army-task| Dispatcher{Routing<br/>Workflow}

    Dispatcher -->|copilot-task| Copilot[Copilot army<br/>fast, mechanical]
    Dispatcher -->|agent-army-task| Claude[Claude army<br/>deep, strategic]

    Copilot --> PR[Pull Request]
    Claude --> PR

    PR -->|@copilot @codex<br/>@gemini @claude| ReviewLoop[Review loop<br/>review-loop label]
    ReviewLoop -->|fixes pushed| PR
    ReviewLoop -->|Closes #N| Merge[Auto-status → Done]

    Merge --> Heartbeat[fleet-heartbeat.mjs<br/>daily routine]
    Heartbeat -->|inventory<br/>+ detect gaps| Board

    subgraph EventBus [NATS JetStream]
        Webhook[webhook-projector<br/>HMAC verified]
        Projector[event-bridge]
    end

    PR -.->|gh webhook| Webhook
    Merge -.->|gh webhook| Webhook
    Webhook --> Projector
    Projector -.->|Decision Artifacts<br/>inbox messages| Board

    subgraph CloudPlane [mcp.untool.ai - CF Access]
        Inbox[fleet_agent_inbox]
        Handoff[fleet_agent_handoff]
    end

    Claude -.->|loopback bearer| Inbox
    Copilot -.->|service token| Handoff
    Handoff --> EventBus

    classDef built fill:#dcfce7,stroke:#16a34a,color:#052e16
    classDef designed fill:#fef3c7,stroke:#d97706,color:#3f1f00
    class Board,Dispatcher,Copilot,Claude,PR,ReviewLoop,Merge,Heartbeat,Webhook,Projector,Inbox,Handoff built

Every node in green is shipped today. The amber would be the HVFS layer once it exists; for now the board is the system of record and copy-in sync is the distribution channel.


10. Standards & prior art

We did not invent most of this. Where we lean on prior art:

| Component | Standard / prior art | Why |
| --- | --- | --- |
| Event bus | NATS JetStream durable consumers (Synadia, 2020+) | Pull-based, durable, ack/redeliver semantics; tiny operational footprint. |
| Versioned data | lakeFS Git-like data ops over object stores | Branch/commit/merge semantics for content-addressable storage; the HVFS substrate. |
| Concurrent state | CRDTs (Shapiro et al., 2011) | Considered for cross-shard board reconciliation; not yet used. The holarchy's parent-on-same-shard invariant lets us avoid CRDT complexity for now. Revisit if we need cross-shard mutation. |
| Contract evolution | Postel's law ("be conservative in what you send, liberal in what you accept") | Producers must additively evolve specs; consumers must tolerate unknown fields. Enforced via Postman spec diffs in CI. |
| Synchronous contracts | OpenAPI 3.1 | Postman is the registry; consumers generate clients from the spec; raw fetch is an anti-pattern. |
| Asynchronous contracts | AsyncAPI 2.6 | NATS subjects + JetStream message shapes; same registry, same diff discipline. |
| Edge auth | OAuth 2.1 + Cloudflare Access service tokens | Industry-standard token flow at the edge; we add origin-side JWT verification (defense in depth). The Managed OAuth path on the CF Self-Hosted app gives DCR + PKCE for interactive clients (claude.ai web UI); the service-token path is for non-interactive workloads. |
| Board model | SAFe Epics / Features / Stories / Enablers / Spikes | Maps cleanly to the holonic hierarchy; gives us a vocabulary the human side of the fleet already speaks. |
| Holon concept | Koestler 1967, The Ghost in the Machine | The whole-and-part duality is the structural insight that makes a sharded board feasible. |

What we are not doing (yet)

  • No CRDTs. Mutation lives at one site per shard; reconciliation is asynchronous and idempotent. CRDTs are only worth their complexity once you have multi-master writes per holon, which we don't.
  • No central scheduler. The board is the scheduler. Agents poll on their own cadence.
  • No service mesh. Cloudflare Tunnel + CF Access cover the only public ingress; everything else is loopback or vsock.
  • No bespoke RPC. Everything is MCP, gh CLI, or NATS: three protocols, usable from every agent runtime.

Build vs design — honest scorecard

Component ADR Status
Holonic Unified Board (Projects v2 surface) 035 Built — daily-driver
Sharded board 058 Designed; activates above ~5k items
Holographic VFS 054, 056 Design only — PR #405 / landed as ADR set in #473, no runtime
Hub Contract Mirror (interim) 047 Built — explicit copy-in sync
File-first static registry 071 Built
NATS event bus + webhook bridge 022 Built — webhook-projector + event-bridge running
Harness-isolated agent bridge 049 Built — scoped NATS creds
Self-healing bidirectional harness 050 Built — ratelimiter + Phoenix tracing via PR #506
Local loopback runner 048 Built — local DinD pool
ACA self-hosted KEDA runners (ops) Built — rg-github-runner-ci
Postman contract registry + mocks 034 Built — AgentArmy workspace
Fleet coordination plane (mcp.untool.ai) 053 Built — CF Access service tokens live

When the HVFS lands it will replace the Hub Contract Mirror and absorb the copy-in sync helper into a single ut vfs model. Until then, the file-first registry plus explicit sync is the pragmatic equivalent.

Open questions and active threads

A few threads we are still working through, in case a future reader wants to dig in:

  • Cross-shard board mutation. Once the sharded board is live (ADR-058), what happens when a holon needs to move across shards? The current sketch is "close on source, open on target, link," which loses identity but is simple. A CRDT-backed identity layer would preserve it at the cost of significant complexity.
  • HVFS write conflict resolution. lakeFS supports merges but the conflict model for OpenAPI specs is structural, not line-based. We will probably need a custom merge driver per content type.
  • microVM cold-start latency. Sandboxed agents pay a cold-start cost for every job; pooling pre-warmed microVMs would help but introduces state-residue questions.
  • Heartbeat economy. Every agent emits a heartbeat; the reconciler watches every callsign. At fleet scale this could become a hot subject. Per-callsign rate limits and tiered heartbeats (fast for runtime, slow for design-time) are likely.
  • Decision Artifact closure. When a HITL decision lands and unblocks fifty linked items, the projector has to fan that out to fifty board updates. We have not yet seen the spike load that proves the projector handles it gracefully.
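
The "close on source, open on target, link" sketch from the first thread can be written down in a few lines. This is an in-memory illustration only: the `Shard` and `Holon` types, the `shard#n` id format, and the `moved_to` / `moved_from` link fields are all assumptions, not the board's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Holon:
    id: str
    title: str
    state: str = "open"
    moved_to: Optional[str] = None     # forward link, set on the closed source
    moved_from: Optional[str] = None   # back link, set on the new target


@dataclass
class Shard:
    name: str
    items: dict = field(default_factory=dict)
    next_number: int = 1

    def open(self, title: str, moved_from: Optional[str] = None) -> Holon:
        holon = Holon(id=f"{self.name}#{self.next_number}",
                      title=title, moved_from=moved_from)
        self.next_number += 1
        self.items[holon.id] = holon
        return holon


def move(source: Shard, target: Shard, holon_id: str) -> Holon:
    """Move a holon across shards; identity is lost, links preserve history."""
    old = source.items[holon_id]
    # Open on the target first, so a failure never leaves a closed item
    # with no successor.
    new = target.open(old.title, moved_from=old.id)
    old.state = "closed"
    old.moved_to = new.id
    return new
```

The new holon gets a fresh id on the target shard, which is exactly the identity loss the thread describes; the two link fields are the only trace that both items are the same piece of work.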

None of these are blocking for current scale, but they are the things we expect to hit first as the fleet grows.
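
As an illustration of the tiered-heartbeat idea in the list above, the following sketch applies a per-callsign minimum interval that depends on the tier. The tier names and the interval values are assumptions picked for the example, and `HeartbeatLimiter` is a hypothetical helper, not an existing fleet component.

```python
# Assumed minimum seconds between accepted heartbeats, per tier.
TIER_INTERVAL_S = {"runtime": 5.0, "design": 60.0}


class HeartbeatLimiter:
    """Per-callsign rate limit with a tier-dependent minimum interval."""

    def __init__(self, intervals: dict):
        self._intervals = dict(intervals)
        self._last_accepted = {}  # callsign -> timestamp of last accepted beat

    def allow(self, callsign: str, tier: str, now: float) -> bool:
        """Return True if a heartbeat at time `now` should be published."""
        last = self._last_accepted.get(callsign)
        if last is not None and now - last < self._intervals[tier]:
            return False  # too soon for this tier: drop the heartbeat
        self._last_accepted[callsign] = now
        return True
```

Passing `now` explicitly keeps the limiter deterministic and easy to test; a caller would typically supply `time.monotonic()`.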

A note on simplicity

The fleet looks elaborate when summarised on a single page. In practice, day-to-day work touches three surfaces: the board (read), gh (mutate), and Postman (contracts). The rest of this architecture exists so those three surfaces keep working as the number of agents grows. The complexity is in service of the simplicity, not the other way round.

Glossary recap

A short reference for the terms used throughout this page:

  • Holon — an entity that is simultaneously a whole and a part (Koestler 1967). A board item.
  • Holarchy — a hierarchy of holons. The board's parent-issue tree.
  • HUB — the Holonic Unified Board. Today: GitHub Projects v2; tomorrow: sharded.
  • HVFS — the Holographic Virtual Filesystem. Design only. lakeFS-backed, ut vfs CLI.
  • Callsign — a stable logical name for a fleet member. The primary key of the registry.
  • Spoke — a layer repo (frontend-core, backend-core, …) created from the hub template.
  • Hub — the template repository that orchestrates the spokes.
  • MCP edge — mcp.untool.ai, the CF Access-gated public face of the local-fleet MCP server.
  • Postman mock — a live HTTP endpoint that returns the example responses defined in a spec.
  • Heartbeat — a NATS message published periodically by an agent or container to declare liveness.
  • Decision Artifact — a board item that captures a HITL judgment call. Blocks the items below it until closed.
  • Review loop — the autonomous fix-comments-from-AI-reviewers cycle, opt-in via the review-loop label.
  • Fleet heartbeat — tools/fleet-heartbeat.mjs, the inventory-and-dispatch script that runs daily.

The lexicon is stable. The implementations behind each term will evolve.


See also

LLM & Agent Stack · Generative Pipeline · Standards Index · Intellectual Foundations (Bibliography)