LLM & Agent Stack¶

This page is the canonical map of how untool.ai builds agents — which model providers we call, what protocols carry agent traffic, where memory lives, how tools are authorized, and how swarms decompose. It is meant as an on-ramp for new contributors and a single source of truth for the architectural choices that already live as Architecture Decision Records.

Every claim below is anchored in an ADR. When the ADR and this page disagree, the ADR wins — fix the page.

The agent stack at a glance¶

Layer	Choice	Standard / protocol	Alternative considered
LLM provider	Cerebras (`gpt-oss-120b`, `zai-glm-4.7`)	OpenAI-compatible REST	Anthropic direct, OpenAI direct, Groq, Together
Model access	`llm-gateway` image (server-side)	OpenAI-compatible + SSE	Direct provider SDK calls from app code
Browser → LLM	BFF route only (`/api/copilotkit`) with JWT injection	HTTP + session cookie → JWT (ADR-002)	Public API key in browser (rejected — ADR-003)
Tool protocol	Model Context Protocol (MCP)	modelcontextprotocol.io	Bespoke JSON-RPC, OpenAPI-as-tools
Peer messaging	A2A protocol	Google A2A + community extensions	MCP-only (rejected — peers ≠ tools), AMQP, raw NATS
Streaming	Server-Sent Events (SSE)	HTTP/1.1 + HTTP/2 multiplexed	WebSocket, gRPC streaming
Episodic memory	NATS JetStream + ArcadeDB	Append-only event log + property graph	Redis streams, Postgres LISTEN/NOTIFY
Semantic memory	`codebase-memory-mcp` (DeusData)	Per-repo SQLite knowledge graph	Vector store only, Neo4j
Local-dev memory	MemPalace (CLI 3.3.x)	Session-start/stop hooks	None — manual context restate
Embeddings	Cohere `embed-v-4-0` via Azure Foundry	OpenAI-compatible embed API	OpenAI `text-embedding-3-large`, `bge-large`
Embeddings fallback	`local-embedder` image (OpenVINO on Arc iGPU)	OpenAI-compatible embed API	CPU-only `sentence-transformers`
Tool authorization	Capability gating (ADR-052)	Capability registry in platform self-model	Role-based ACLs, OPA-only
Orchestration	Ontology-driven swarm (ADR-044)	Holonic decomposition (Koestler 1967)	Single supervisor, master-worker queue
Observability	OpenTelemetry GenAI semantic conventions	OTLP → Grafana / Tempo / Loki	Per-vendor APM, console logs
Prompt caching	Anthropic prompt cache (5-min TTL)	`cache_control: ephemeral` on stable blocks	None, custom KV cache

The rest of this page walks each row and the trade-offs behind it.

LLM provider — Cerebras (ADR-004) and the gateway pattern (ADR-021)¶

Why Cerebras¶

ADR-004 selects Cerebras as the primary inference provider for the fleet. The decision rests on three properties:

Latency. Cerebras' wafer-scale inference returns first-token in the tens of milliseconds and sustains hundreds–thousands of tokens/sec on open-weights models. For interactive agent loops — where a tool call may wait on a planner LLM, then a critic LLM, then a writer LLM — end-to-end latency is dominated by the chain length, not by any one prompt. Sub-second hops are what make multi-agent chains feel synchronous rather than batch.
Model availability. Cerebras hosts gpt-oss-120b (general reasoning, tool-use, function-calling) and zai-glm-4.7 (long-context, multilingual, strong coding). Both are openly licensed, which keeps our enterprise compliance story clean.
OpenAI-compatible API. The wire format is unchanged from the OpenAI reference. Switching providers is a config-file change, not a code migration.

The alternatives — direct Anthropic, direct OpenAI, Groq, Together — are all viable as secondary providers; the gateway is built so adding one is a registry edit and a credential, not a refactor.

Why a gateway (not direct calls)¶

ADR-021 introduces llm-gateway — a Function-tier container that sits between every agent process and the model providers. Calling Cerebras (or any provider) directly from application code is prohibited by this ADR. The gateway is the single chokepoint, and it earns its keep five ways:

Concern	Why centralize it
Token redaction	Prompts often contain secrets the caller forgot to scrub (`Bearer ...`, `sk_live_...`, `aks-...`). The gateway runs a deterministic redactor before egress and before it writes the audit log.
Audit	Every prompt + completion is logged with caller identity, model, token counts, cost, and capability set. This is what makes incident response and FinOps possible.
Fair-share & per-tier rate limits	A runaway loop in one agent must not starve the rest of the fleet. The gateway implements token-bucket per (caller tier, model) tuples.
Capability gating	Before forwarding a request the gateway consults the capability registry — if the caller's declared capabilities don't authorize the requested model or the tool set being passed, the call is rejected.
Provider abstraction	When a model is deprecated, retired, or a cheaper provider appears, we change the gateway config. No application code moves.

Why no LLM key in the browser¶

ADR-003 is the oldest LLM-related decision in the fleet and the least negotiable. A browser-resident provider key is, in every meaningful sense, public — View Source, devtools Network tab, a stray window.__NEXT_DATA__ dump. Public keys mean:

Anyone can bill our account to exhaustion in an afternoon.
Anyone can call any model we have access to, including models the authenticated user is not entitled to.
Per-user audit is impossible — every request looks like "the fleet."
Rotation requires a frontend deploy.

The accepted path is the BFF route: the Next.js frontend (frontend-core) exposes /api/copilotkit (and friends) as a server route. The browser authenticates with a session cookie; the BFF route mints a short-lived JWT carrying the user's identity + tier + capabilities and forwards the request to the gateway. This is the JWT injection contract from ADR-002, and it is the only acceptable path from a user-facing client to an LLM.

Direct exposure of middle-core (:8100) or backend-core (:8000) through the dev tunnel is, for the same reason, banned — see the CLAUDE.md tunnel policy. Cloud agents do not get raw LLM keys either; they call the gateway through mcp.untool.ai with a CF Access service token, and the gateway issues the LLM credentials.

Model Context Protocol (MCP)¶

The spec¶

The Model Context Protocol is Anthropic's open standard for connecting LLM applications to external context and capabilities. An MCP server exposes three primitives:

Tools — functions the model can invoke (with JSON-schema'd inputs).
Resources — read-only content the model can consult (files, database rows, page contents).
Prompts — parameterized prompt templates the host UI can offer the user as slash-commands or buttons.

The transport is JSON-RPC 2.0 over stdio (for locally-spawned servers) or Server-Sent Events / HTTP (for remote servers). The host (Claude Desktop, Claude Code, Cursor, our own runtime) discovers tools at connect time and surfaces them to the model.

How we use it¶

MCP is the primary protocol for agent → tool interactions in the fleet. We run MCP servers for:

Code knowledge graphs (codebase-memory-mcp, see Agent memory).
Cloud agent control of the local docker fleet (tools/mcp-local-fleet/, next subsection).
Vendored third-party servers (GitHub, Postman, Figma, Azure, GCP, etc.) — see Fleet MCP distribution for the full inventory and the per-agent allowlist.
First-party fleet primitives: contract publishing, ontology mutation, capability dispatch.

The principle is: if a behavior is a discrete, schema-able operation on a service, it belongs behind an MCP tool, not as bespoke prompting. Once it is an MCP tool, every model and every agent in the fleet can use it.

The local-fleet MCP server¶

tools/mcp-local-fleet/ is the control plane that lets cloud agents (Claude.ai routines, GitHub Actions, remote API callers) observe and drive the local Docker fleet without exposing the Docker socket. It is the canonical example of how we ship an MCP server in production.

The server binds to 127.0.0.1:8765 and is exposed publicly at mcp.untool.ai behind Cloudflare Access service tokens (CF-Access-Client-Id + CF-Access-Client-Secret headers). It serves eight allowlisted tools, split into read-only (auto-approve in clients that respect approval prompts) and write (per-call approval):

Tool	Read/Write	What it does
`fleet_ps`	Read	List containers — name, image, status, ports.
`fleet_inspect`	Read	`docker inspect` for one container (config, mounts, env-redacted).
`fleet_logs`	Read	Tail logs from a named container.
`fleet_up`	Write	Start a service (compose-managed).
`fleet_down`	Write	Stop a service.
`fleet_restart`	Write	Restart a service.
`fleet_build`	Write	Rebuild an image.
`fleet_deploy`	Write	Apply a new image tag and roll.

Key invariants — enforced in code, not relaxable by config:

Loopback bind only; public exposure requires the explicit Cloudflare tunnel route, which is fronted by CF Access.
Allowlisted service names only — there is no "run arbitrary container" tool and no exec_shell tool. The threat model is "an agent gets a stale or compromised credential"; the blast radius must be bounded to "the fleet's known services."
CF Access JWT is verified server-side as defense-in-depth.
Every call is audited to tools/logs/mcp-audit.log.YYYY-MM-DD with redacted args and the edge principal (the service token name).
Tool names match [a-zA-Z0-9_-] — Claude Code's MCP client silently drops dotted names, which we learned the hard way.

MCP vs. A2A¶

MCP is vertical: agent → tool → service. The agent is the principal; the tool is a function call. A2A (next section) is horizontal: agent ↔ agent, peers exchanging messages.

The line is sometimes blurry — an agent can be exposed as an MCP server (its skills become tools, callable by another agent's planner). We do this in some places. But the protocol you reach for first is:

MCP when the relationship is "I want this work done, I don't care who does it, just call the function."
A2A when the relationship is "I am specifically talking to that agent, the conversation has state, and the response shape depends on the peer's identity."

Agent-to-Agent (A2A)¶

The protocol¶

The Agent-to-Agent (A2A) protocol is Google's open standard (with growing community contributions) for direct agent ↔ agent communication. Where MCP standardizes the agent → tool boundary, A2A standardizes the boundary between two autonomous agents that may not share a process, host, or trust domain.

A2A defines:

Agent cards — a /.well-known/agent.json-style manifest exposing the agent's capabilities, supported message types, and auth.
Tasks — stateful, long-running units of work with their own lifecycle (submitted → working → input-required → completed / failed / canceled).
Streaming updates — agents push partial results as a task runs.
Multi-modal messages — text, files, structured data, function calls.

How we use it¶

ADR-028 commits the fleet to the dual-protocol pattern: every agent gateway speaks both MCP (tools) and A2A (peers). The A2A contract lives at backend-core/contracts/agentarmy-a2a.openapi.yaml and is the source of truth for fleet agent-card schemas, task envelopes, and the agent-discovery endpoints.

When to reach for A2A instead of MCP tool calls:

Delegation with conversation. A2A's task model carries the conversation forward; one agent can answer "what did the planner decide?" by replaying the task transcript. An MCP tool call is stateless.
Long-running work with progress. A2A's input-required and streaming-update states are first-class. Modeling the same flow as an MCP tool requires custom polling or a webhook escape hatch.
Heterogeneous fleets. When the peer is not ours — a community agent, a partner's agent, an upstream platform's agent — A2A is the lingua franca. Our MCP tool catalog is internal.
Identity matters. When the who of the peer is part of the semantics (audit, billing, contract-bound) rather than the what of the function, A2A's agent card gives you that identity at protocol level.

Streaming protocol (ADR-007)¶

ADR-007 selects Server-Sent Events (SSE) as the canonical transport for agent text streams from any gateway to any consumer (browser or backend). The frame format is JSON lines — one JSON object per data: event, each carrying either a token delta, a tool-call event, a usage update, or a terminal done event.

Why SSE, not WebSocket¶

HTTP/2 multiplexing. SSE rides ordinary HTTP, so a single TCP connection between the browser and our edge multiplexes dozens of concurrent agent streams. WebSocket is one socket per stream, and every middlebox (CDN, ingress, ACA) treats it as a special case.
Simple proxies. Every reverse proxy in the universe understands HTTP. CloudFront, Cloudflare, Azure Front Door, Nginx, Envoy, ACA ingress — they all stream SSE without configuration. WebSocket upgrades require explicit allowlisting and often break on cheap tiers.
One direction is enough. Agent → consumer is the only direction that needs streaming. Consumer → agent is well-served by a separate POST for the prompt and a cancellation endpoint. WebSocket's bidirectionality is unused dead weight for this shape.
Replay. SSE has a built-in Last-Event-ID semantic; resuming a dropped stream from a sequence number is part of the spec.

Why not gRPC streaming¶

gRPC bidi streaming is technically superior for cross-service backend calls (it is what middle-core ↔ backend-core should probably use for plumbing). But it is not browser-reachable without a translation layer (grpc-web), it does not play well with our SSE-aware edge, and it forces a protobuf compilation step on every contract change. SSE keeps the contract in JSON and reachable from curl.

Frame format¶

event: token
data: {"seq":42,"delta":"Hello","model":"gpt-oss-120b"}

event: tool_call
data: {"seq":43,"id":"call_x","name":"fleet_ps","arguments":{}}

event: usage
data: {"input_tokens":312,"output_tokens":118,"cache_read":0}

event: done
data: {"finish_reason":"stop"}

Cancellation¶

The client sends DELETE /v1/streams/{stream_id} (or aborts the EventSource and lets a server-side idle timer fire). The gateway then:

Stops forwarding upstream tokens.
Sends a final event: canceled frame.
Closes the response stream.
Writes a canceled audit record.

Importantly, cancellation does not retroactively un-call tools that already executed. The agent's tool-call ledger is append-only.

Agent memory (ADR-008)¶

ADR-008 splits agent memory into four layers, each chosen for what it does well:

Layer	Persistence	Shape	Backing store
Working memory	In-process, dies with the turn	The prompt	The LLM's context window
Episodic memory	Durable, append-only	Event log: "agent X did Y at T"	NATS JetStream → ArcadeDB
Semantic memory	Durable, queried	Knowledge graph: entities, relations, embeddings	ArcadeDB (graph + vector)
Code knowledge	Durable, per-repo	Symbol graph: files, defs, refs, calls	`codebase-memory-mcp` (SQLite per repo)

Ephemeral vs persisted¶

Ephemeral:

The current prompt and the streaming response.
Tool-call scratch space inside one turn.
Per-stream cancellation tokens.
Anything written to /tmp inside an agent container.

Persisted:

Every prompt + completion + tool-call sequence (audit, not retrieval — we don't replay these into prompts).
Episodic events: "user asked X, planner produced plan Y, executor completed step Z."
Semantic facts extracted from conversations or documents: entities, their relationships, embeddings of their descriptions.
Codebase structural memory: definitions, references, the call graph.

Episodic vs semantic¶

The distinction is the cognitive-science one (Tulving, 1972), repurposed for agents:

Episodic is what happened, when, in what order. It is the source of truth for "did we already try this?" and "what was the state when we made that decision?". It is queried by time and by run-id.
Semantic is what is true, abstracted from the episode. It is the source of truth for "what is the schema of holon?" and "what capabilities does the api-designer agent declare?". It is queried by entity and by embedding similarity.

Both are durable, but they serve different reads. Conflating them — the classic "just throw everything into one vector store" — makes both reads worse.

`codebase-memory-mcp`¶

The DeusData codebase-memory-mcp server is the fleet's standard for code knowledge. It indexes a repository into a per-repo SQLite knowledge graph (one DB file per project, cached under ~/.cache/codebase-memory-mcp/) and exposes tools for searching code, fetching snippets by symbol, tracing call paths, querying the architecture graph, and managing ADRs.

We use it because:

It is per-repo, so a multi-spoke fleet doesn't fight over one shared index.
It is MCP-native, so any agent in any host (Claude Code, Antigravity, claude.ai routines) gets the same tools.
It complements ArcadeDB rather than duplicating it: ArcadeDB is the fleet topology and ontology store; codebase-memory-mcp is the source-code structural store.

The six fleet repos are indexed and are kept current via detect_changes runs.

MemPalace in the local dev loop¶

MemPalace is a CLI tool that captures session-start / session-stop / pre-compact hooks and turns them into "drawers" of remembered context, keyed by topic. It runs exclusively on the operator's local box; cloud agents do not have it and do not need it.

What MemPalace gives the local Claude Code session:

A surface for the operator to dictate "remember this for next time" without writing into the repo.
Automatic capture at session-stop so the next session-start can restate the open thread.
A separation between durable repo memory (ADRs, contracts, docs) and ambient session memory (what was hot last Tuesday).

It is intentionally not part of the agent's memory contract — the fleet's memory is what is in ArcadeDB, the MCP knowledge graphs, and the git tree.

Capability gating (ADR-052)¶

ADR-052 introduces the rule that makes autonomous agents safe at scale:

Every tool call is authorized against the calling agent's declared capabilities, against a registry that lives in the platform self-model.

The mechanism¶

Each agent in .claude/agents/categories/ declares a set of capabilities in its frontmatter — e.g. capabilities: [code:read, code:write, fleet:read]. The capability names are drawn from a controlled vocabulary in the platform self-model.
Each MCP tool (and each LLM model) is annotated with the capability it requires — e.g. fleet_up requires fleet:write, gpt-oss-120b requires model:reasoning.
When an agent calls a tool, the gateway (LLM gateway for models, agent gateway for MCP/A2A) consults the registry and rejects calls where the intersection of declared and required is empty.

Why this matters¶

The naïve alternative — "the agent runs as the user, the user is authorized, therefore the call is authorized" — fails as soon as the fleet starts dispatching work autonomously. A copilot-task issue that spawns a coding agent should not be able to delete production containers, even if the operator who filed the issue technically can. Capabilities decouple "what the operator could do" from "what this agent was hired to do."

It also gives us:

Least-privilege agents. Every new agent starts with an empty capability set and only gets what it explicitly justifies.
Audit-able dispatches. The capability set is recorded with every call, so post-hoc questions like "did anything with fleet:write touch arcadedb in the last hour?" are queryable.
Safe agent updates. Adding a capability is a reviewable PR against the agent definition; CI checks that the agent's prompts and delegations don't require capabilities it has not declared.

This is the rule that lets us run the heartbeat in --apply --auto mode without panic — a runaway Copilot worker still cannot escape the capability fence.

Swarm orchestration (ADR-044)¶

ADR-044 sets the orchestration model: an ontology-driven swarm where the shape of the work, the agents available to do it, and the routing between them are all derived from the same self-model rather than hand-coded.

Holons all the way down¶

Arthur Koestler, in The Ghost in the Machine (1967), coined holon for an entity that is simultaneously a whole (looking down at its parts) and a part (looking up at its container). The holarchy is the nested structure of holons; Koestler argued it is the universal pattern of organization in living and social systems.

We borrow the pattern directly:

Every agent is a holon. A code-reviewer agent is a whole over its sub-skills (style checks, security pattern detection, contract conformance) and a part within the larger PR review process, which is itself a holon within the delivery loop.
Every issue / task is a holon. A Feature is a whole over its Stories and a part within an Epic. The board structure (Epic → Feature → Story / Enabler / Bug / Spike) is a holarchy by construction.
The fleet is a holon. Each spoke repo is a whole within its layer and a part within the hub's orchestration plane.

This is not a metaphor — the decomposition matters. When the orchestrator faces "build feature X," its first move is to decompose X into sub-holons until each leaf is something a single agent can do in one turn. The decomposition rule is encoded against the platform self-model, so adding a new sub-type of work means extending the ontology, not the orchestrator.

Connection to the holonic unified board¶

ADR-035 operationalizes the holarchy at the board level — every work item is a holon with explicit parent and children, and the board renders them as a nested view rather than a flat list. The orchestrator and the human operator see the same holarchy.

Ontology-driven, not topology-driven¶

The naïve swarm — "spin up N workers, give them a shared queue, let them race" — works for embarrassingly-parallel jobs and breaks immediately for jobs with structure. Ontology-driven means the structure of the work informs the routing:

A task whose ontology classifies it as contract-change routes through api-designer → contract-test-engineer → schema-migration-engineer (the contract cluster), not whoever happens to be idle.
A task classified as infrastructure-change routes through the delivery/ops cluster.
A task classified as judgment-call escalates to hitl-coordinator and pauses for the operator.

The router is small. The ontology does the work.

Embeddings — Cohere `embed-v-4-0`¶

The fleet's canonical embedding model is Cohere embed-v-4-0, served through Azure AI Foundry (deployment fndry-01 in the operator's environment). It returns 1536-dimensional vectors and supports input across 100+ languages.

Why Cohere over the alternatives¶

Alternative	Why we passed
OpenAI `text-embedding-3-large`	Strong, but the data-handling clauses on OpenAI's enterprise tier are harder to align with our partner contracts than Cohere via Azure.
OpenAI `text-embedding-3-small`	Adequate for low-stakes use but loses material recall vs. v4 on long-tail retrieval.
Open-source (`bge-large`, `e5-large`)	Strong on MTEB English; weaker multilingual; self-hosting cost (GPU minutes) exceeds Cohere API spend at fleet scale. We do run an open model in the local-embedder fallback (next subsection), but not as primary.
Voyage / Mistral / etc.	Considered, viable, kept on the secondary list. Not a strong enough lead to displace incumbent.

The deciding criteria were, in order: license clarity for enterprise (Cohere via Azure has clean terms for content we ingest from partner repos), MTEB benchmark position (Cohere embed-v-4-0 is at or near top of the public multilingual leaderboard at the time of selection), and multilingual recall (the platform documentation and operator content are not English-only).

Local-embedder fallback¶

ADR-021's "local fallback" clause allows for a self-hosted embedder when network is down, the operator wants to embed proprietary content without leaving the box, or per-call cost matters. The fallback is the local-embedder Function-tier container — an OpenVINO build of a strong open model, tuned for the operator's hardware (Intel Core Ultra 7 265 with Arc iGPU). It exposes the same OpenAI-compatible embed API shape, so callers swap the base URL and nothing else.

The fallback is not for production semantic memory writes; mixing embeddings from two models in the same vector index destroys recall (different geometries). The local embedder is for indices that are explicitly tagged "local-only."

Token budgeting & caching¶

LLM tokens are the dominant cost of an agent fleet, both literally (the bill) and in latency. ADR-021's gateway is the place we account for and optimize them.

Anthropic prompt caching¶

Claude's API supports prompt caching with a 5-minute TTL: stable prefix blocks marked cache_control: ephemeral are stored on Anthropic's side and not re-billed (cached input tokens cost ~10× less) or re-processed for ~5 minutes. To hit the cache we structure prompts as:

A long, stable system block (style guide, capability statements, tool inventory) — marked cacheable.
A long, stable context block (the ADRs, the agent's pinned knowledge) — marked cacheable.
A short, volatile task block (the actual user request and conversation tail) — uncached.

Empirically the system+context block runs 10–40 K tokens for our heavier agents; cache hits during a 5-minute interactive session amortize that across many turns and shrink per-turn cost by 70–90%.

The fleet's prompt builders are written with the cache in mind: any agent prompt is (stable_blob, volatile_blob) and the stable_blob hashes the same across turns. A common mistake — inlining the current date into the stable_blob — breaks the cache silently; the prompt builder takes the date as an explicit volatile input to prevent it.

Cost telemetry¶

ADR-010 mandates that every LLM call emit telemetry under the OpenTelemetry GenAI semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.id, etc. The gateway is the natural place to emit these — it sees every call, knows the caller, and knows the cost.

Cost flows into the FinOps surface: Grafana dashboards by tier, by agent, by capability. A spike in one agent's hourly spend is what triggers finops-engineer to investigate, and the spike is attributable down to the specific tool sequence that caused it.

Token budgeting per agent¶

Capability gating extends to budgets: an agent's declared capabilities include a per-window token budget (e.g. budget:tokens-per-hour:1_000_000). The gateway enforces this with the same machinery as rate limiting. A busted budget is logged, the call rejected, and hitl-coordinator notified — the same way other capability failures escalate. This is what keeps a single misbehaving worker from burning the daily spend in fifteen minutes.

References¶

Protocols & specs¶

Model Context Protocol — modelcontextprotocol.io — Anthropic-authored open spec for connecting LLM apps to tools, resources, and prompts.
Agent-to-Agent (A2A) protocol — github.com/google/A2A — open standard for direct agent-peer messaging.
OpenAI function calling — platform.openai.com/docs/guides/function-calling — the de facto wire format for tool-use we inherit through the OpenAI-compatible API.
Anthropic tool use — docs.anthropic.com/claude/docs/tool-use — Claude's tool-use semantics; semantically equivalent to OpenAI's but with input_schema and parallel tool calls first-class.
Anthropic prompt caching — docs.anthropic.com/claude/docs/prompt-caching — the 5-minute TTL cache contract our prompt builders target.
Server-Sent Events — html.spec.whatwg.org/multipage/server-sent-events.html — the streaming transport spec.
OpenTelemetry semantic conventions for GenAI — opentelemetry.io/docs/specs/semconv/gen-ai/ — the attribute names every LLM call must emit.

Vendors & models¶

Cerebras inference platform — cerebras.ai — wafer-scale inference; primary LLM provider.
gpt-oss-120b model card — huggingface.co/openai/gpt-oss-120b
zai-glm-4.7 model card — huggingface.co/THUDM — long-context, multilingual; secondary model for non-English / heavy-code tasks.
Cohere embed-v-4 model card — docs.cohere.com/docs/cohere-embed — primary embedding model; 1536-d, multilingual.
Azure AI Foundry — learn.microsoft.com/azure/ai-foundry/ — the deployment surface that serves Cohere embed v4 to the fleet.

Background reading¶

Arthur Koestler, The Ghost in the Machine (1967) — origin of the holon / holarchy vocabulary; required reading for ADR-044's swarm decomposition.
Tulving, E. (1972). "Episodic and semantic memory." — the cognitive-science distinction we lift for ADR-008's memory layering.

ADRs cited on this page¶

ADR-002 — JWT BFF injection
ADR-003 — No LLM key in browser
ADR-004 — LLM provider Cerebras
ADR-007 — Agent streaming protocol
ADR-008 — Agent memory store
ADR-010 — Observability standard
ADR-021 — LLM gateway
ADR-028 — Agent gateway A2A + MCP
ADR-035 — Holonic unified board
ADR-044 — Ontology-orchestrated swarm intelligence
ADR-052 — Agent tool authorization (capability gating)

Fleet MCP distribution — every MCP server in the fleet, its allowlist, and which agents get it.
Realtime agent interface — the client-facing contract for SSE streams, cancellation, and replay.

Formal-methods adjacent literature¶

A curated, living bibliography of peer-reviewed research that independently validates the design axes we anchor our agent stack on. Discovery and citation grounded via the Hugging Face papers service — every entry is a clickable HF paper page with linked arXiv ID, structured metadata, and (where available) linked model/dataset artifacts.

How this list is maintained

These are not random picks. Each entry was surfaced by the search workflow in Using the Hugging Face Papers Service, then hand-filtered for direct relevance to one of our ADRs. To propose an addition, run the searches there and open a PR adding a row with the matching ADR anchor.

Capability gating, tool authorization, and verifiable safety¶

These map onto ADR-052 — Agent tool authorization (capability gating) and our gateway-mediated tool surface (ADR-021/028). They confirm the broader research community is converging on programmable privilege control and formal guarantees for agent actions — exactly the axis we're on.

Progent: Programmable Privilege Control for LLM Agents — Shi, He, Wang, Wu et al. (2025). Argues for explicit, programmable least-privilege boundaries for LLM tool calls instead of trust-the-prompt. Direct shape match for our capability registry in ADR-052. _{arXiv: 2504.11703}
VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation — Miculicich, Parmar, Palangi, Dvijotham et al. (2025). Formal guarantees that agent actions adhere to predefined safety constraints in sensitive domains (healthcare). Anchors why we gate codegen + tool invocation at the gateway, not the prompt. _{arXiv: 2510.05156}
Formally Specifying the High-Level Behavior of LLM-Based Agents — Crouse, Abdelaziz, Basu, Dan et al. (2023, IBM Research). Uses Linear Temporal Logic (LTL) to specify agent behavior as time-indexed contracts. Same shape as our ADR-038 unified process-and-time architecture — time-indexed states + decidable constraints. _{arXiv: 2310.08535}

Ontology-grounded LLMs and knowledge-graph retrieval¶

These map onto ADR-008 — Agent memory store, ADR-019 — Ontology + reasoning layer, and ADR-030 — Data→ontology ingestion pipeline.

MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models — Wen, Wang, Sun (2023, UIUC). KG-conditioned prompting as a remedy for hallucination + opacity; LLMs trace reasoning paths through a graph. Sister-thesis to our active hypergraph inference prototype. _{arXiv: 2308.09729}
Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text — Mihindukulasooriya, Tiwari, Enguix, Lata (2023). Benchmarks LLM ability to extract KGs constrained by a target ontology — the exact gate our ADR-030 ingestion pipeline enforces with SHACL. _{arXiv: 2308.02357}
Neurosymbolic AI: The 3rd Wave — d'Avila Garcez & Lamb (2020). The foundational programmatic-statement paper for combining well-founded knowledge representation with deep learning. Cite in every conversation about why we sit ontology + reasoner alongside the LLM rather than throwing the LLM at the world bare. _{arXiv: 2012.05876}

MCP and agent benchmarks¶

These map onto ADR-028 — Agent gateway A2A + MCP and our broader MCP investment (the local-fleet MCP server, mcp.untool.ai, codebase-memory-mcp).

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries — Yin, Shen, Xu, Han et al. (2025). 101 multi-step tasks across diverse MCP tools in dynamic environments — directly the runtime regime our fleet operates in. The benchmark we should be running ourselves. _{arXiv: 2508.15760}
MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP — Real-world MCP-mediated agent evaluation; an external scoreboard our agents' tool-use can be measured against. _{arXiv: 2509.09734}

Multi-agent coordination¶

These map onto ADR-035 — Holonic unified board architecture and ADR-044 — Ontology-orchestrated swarm intelligence.

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems — Qi, Ma, Xing, Guo et al. (2026). Surveys the error-propagation risk that motivates our heartbeat + self-healing harness (ADR-050) and our fleet coordination plane (ADR-053). Particularly relevant: their taxonomy of failure-attribution patterns lines up with our HITL escalation criteria. _{arXiv: 2605.14892}

Open invitation

We update this section when new papers materially change the trade-space. The acceptance bar is: (a) peer-reviewed or substantive preprint, (b) maps onto a specific ADR we hold, (c) either confirms our axis or proposes a falsifiable alternative we should consider. PRs welcome.

See also: Coordination & VFS · Generative Pipeline · Standards Index · Using the HF Papers Service · Intellectual Foundations (Bibliography)

LLM & Agent Stack¶

The agent stack at a glance¶

LLM provider — Cerebras (ADR-004) and the gateway pattern (ADR-021)¶

Why Cerebras¶

Why a gateway (not direct calls)¶

Why no LLM key in the browser¶

Model Context Protocol (MCP)¶

The spec¶

How we use it¶

The local-fleet MCP server¶

MCP vs. A2A¶

Agent-to-Agent (A2A)¶

The protocol¶

How we use it¶

Streaming protocol (ADR-007)¶

Why SSE, not WebSocket¶

Why not gRPC streaming¶

Frame format¶

Cancellation¶

Agent memory (ADR-008)¶

Ephemeral vs persisted¶

Episodic vs semantic¶

codebase-memory-mcp¶

MemPalace in the local dev loop¶

Capability gating (ADR-052)¶

The mechanism¶

Why this matters¶

Swarm orchestration (ADR-044)¶

Holons all the way down¶

Connection to the holonic unified board¶

Ontology-driven, not topology-driven¶

Embeddings — Cohere embed-v-4-0¶

Why Cohere over the alternatives¶

Local-embedder fallback¶

Token budgeting & caching¶

Anthropic prompt caching¶

Cost telemetry¶

Token budgeting per agent¶

References¶

Protocols & specs¶

Vendors & models¶

Background reading¶

ADRs cited on this page¶

Related fleet docs¶

Formal-methods adjacent literature¶

Capability gating, tool authorization, and verifiable safety¶

Ontology-grounded LLMs and knowledge-graph retrieval¶

MCP and agent benchmarks¶

Multi-agent coordination¶

`codebase-memory-mcp`¶

Embeddings — Cohere `embed-v-4-0`¶