LLM & Agent Stack¶
This page is the canonical map of how untool.ai builds agents — which model providers we call, what protocols carry agent traffic, where memory lives, how tools are authorized, and how swarms decompose. It is meant as an on-ramp for new contributors and a single source of truth for the architectural choices that already live as Architecture Decision Records.
Every claim below is anchored in an ADR. When the ADR and this page disagree, the ADR wins — fix the page.
The agent stack at a glance¶
| Layer | Choice | Standard / protocol | Alternative considered |
|---|---|---|---|
| LLM provider | Cerebras (gpt-oss-120b, zai-glm-4.7) |
OpenAI-compatible REST | Anthropic direct, OpenAI direct, Groq, Together |
| Model access | llm-gateway image (server-side) |
OpenAI-compatible + SSE | Direct provider SDK calls from app code |
| Browser → LLM | BFF route only (/api/copilotkit) with JWT injection |
HTTP + session cookie → JWT (ADR-002) | Public API key in browser (rejected — ADR-003) |
| Tool protocol | Model Context Protocol (MCP) | modelcontextprotocol.io | Bespoke JSON-RPC, OpenAPI-as-tools |
| Peer messaging | A2A protocol | Google A2A + community extensions | MCP-only (rejected — peers ≠ tools), AMQP, raw NATS |
| Streaming | Server-Sent Events (SSE) | HTTP/1.1 + HTTP/2 multiplexed | WebSocket, gRPC streaming |
| Episodic memory | NATS JetStream + ArcadeDB | Append-only event log + property graph | Redis streams, Postgres LISTEN/NOTIFY |
| Semantic memory | codebase-memory-mcp (DeusData) |
Per-repo SQLite knowledge graph | Vector store only, Neo4j |
| Local-dev memory | MemPalace (CLI 3.3.x) | Session-start/stop hooks | None — manual context restate |
| Embeddings | Cohere embed-v-4-0 via Azure Foundry |
OpenAI-compatible embed API | OpenAI text-embedding-3-large, bge-large |
| Embeddings fallback | local-embedder image (OpenVINO on Arc iGPU) |
OpenAI-compatible embed API | CPU-only sentence-transformers |
| Tool authorization | Capability gating (ADR-052) | Capability registry in platform self-model | Role-based ACLs, OPA-only |
| Orchestration | Ontology-driven swarm (ADR-044) | Holonic decomposition (Koestler 1967) | Single supervisor, master-worker queue |
| Observability | OpenTelemetry GenAI semantic conventions | OTLP → Grafana / Tempo / Loki | Per-vendor APM, console logs |
| Prompt caching | Anthropic prompt cache (5-min TTL) | cache_control: ephemeral on stable blocks |
None, custom KV cache |
The rest of this page walks each row and the trade-offs behind it.
LLM provider — Cerebras (ADR-004) and the gateway pattern (ADR-021)¶
Why Cerebras¶
ADR-004 selects Cerebras as the primary inference provider for the fleet. The decision rests on three properties:
- Latency. Cerebras' wafer-scale inference returns first-token in the tens of milliseconds and sustains hundreds–thousands of tokens/sec on open-weights models. For interactive agent loops — where a tool call may wait on a planner LLM, then a critic LLM, then a writer LLM — end-to-end latency is dominated by the chain length, not by any one prompt. Sub-second hops are what make multi-agent chains feel synchronous rather than batch.
- Model availability. Cerebras hosts
gpt-oss-120b(general reasoning, tool-use, function-calling) andzai-glm-4.7(long-context, multilingual, strong coding). Both are openly licensed, which keeps our enterprise compliance story clean. - OpenAI-compatible API. The wire format is unchanged from the OpenAI reference. Switching providers is a config-file change, not a code migration.
The alternatives — direct Anthropic, direct OpenAI, Groq, Together — are all viable as secondary providers; the gateway is built so adding one is a registry edit and a credential, not a refactor.
Why a gateway (not direct calls)¶
ADR-021 introduces
llm-gateway — a Function-tier container that sits between every agent
process and the model providers. Calling Cerebras (or any provider)
directly from application code is prohibited by this ADR. The
gateway is the single chokepoint, and it earns its keep five ways:
| Concern | Why centralize it |
|---|---|
| Token redaction | Prompts often contain secrets the caller forgot to scrub (Bearer ..., sk_live_..., aks-...). The gateway runs a deterministic redactor before egress and before it writes the audit log. |
| Audit | Every prompt + completion is logged with caller identity, model, token counts, cost, and capability set. This is what makes incident response and FinOps possible. |
| Fair-share & per-tier rate limits | A runaway loop in one agent must not starve the rest of the fleet. The gateway implements token-bucket per (caller tier, model) tuples. |
| Capability gating | Before forwarding a request the gateway consults the capability registry — if the caller's declared capabilities don't authorize the requested model or the tool set being passed, the call is rejected. |
| Provider abstraction | When a model is deprecated, retired, or a cheaper provider appears, we change the gateway config. No application code moves. |
Why no LLM key in the browser¶
ADR-003 is the
oldest LLM-related decision in the fleet and the least negotiable. A
browser-resident provider key is, in every meaningful sense, public —
View Source, devtools Network tab, a stray
window.__NEXT_DATA__ dump. Public keys mean:
- Anyone can bill our account to exhaustion in an afternoon.
- Anyone can call any model we have access to, including models the authenticated user is not entitled to.
- Per-user audit is impossible — every request looks like "the fleet."
- Rotation requires a frontend deploy.
The accepted path is the BFF route: the Next.js frontend (frontend-core)
exposes /api/copilotkit (and friends) as a server route. The browser
authenticates with a session cookie; the BFF route mints a short-lived
JWT carrying the user's identity + tier + capabilities and forwards the
request to the gateway. This is the JWT injection contract from
ADR-002, and it is
the only acceptable path from a user-facing client to an LLM.
Direct exposure of middle-core (:8100) or backend-core (:8000)
through the dev tunnel is, for the same reason, banned — see the CLAUDE.md
tunnel policy. Cloud agents do not get raw LLM keys either; they call the
gateway through mcp.untool.ai with a CF Access service token, and the
gateway issues the LLM credentials.
Model Context Protocol (MCP)¶
The spec¶
The Model Context Protocol is Anthropic's open standard for connecting LLM applications to external context and capabilities. An MCP server exposes three primitives:
- Tools — functions the model can invoke (with JSON-schema'd inputs).
- Resources — read-only content the model can consult (files, database rows, page contents).
- Prompts — parameterized prompt templates the host UI can offer the user as slash-commands or buttons.
The transport is JSON-RPC 2.0 over stdio (for locally-spawned servers) or Server-Sent Events / HTTP (for remote servers). The host (Claude Desktop, Claude Code, Cursor, our own runtime) discovers tools at connect time and surfaces them to the model.
How we use it¶
MCP is the primary protocol for agent → tool interactions in the fleet. We run MCP servers for:
- Code knowledge graphs (
codebase-memory-mcp, see Agent memory). - Cloud agent control of the local docker fleet (
tools/mcp-local-fleet/, next subsection). - Vendored third-party servers (GitHub, Postman, Figma, Azure, GCP, etc.) — see Fleet MCP distribution for the full inventory and the per-agent allowlist.
- First-party fleet primitives: contract publishing, ontology mutation, capability dispatch.
The principle is: if a behavior is a discrete, schema-able operation on a service, it belongs behind an MCP tool, not as bespoke prompting. Once it is an MCP tool, every model and every agent in the fleet can use it.
The local-fleet MCP server¶
tools/mcp-local-fleet/ is the control plane that lets cloud agents
(Claude.ai routines, GitHub Actions, remote API callers) observe and
drive the local Docker fleet without exposing the Docker socket. It is
the canonical example of how we ship an MCP server in production.
The server binds to 127.0.0.1:8765 and is exposed publicly at
mcp.untool.ai behind Cloudflare Access service tokens
(CF-Access-Client-Id + CF-Access-Client-Secret headers). It serves
eight allowlisted tools, split into read-only (auto-approve in clients
that respect approval prompts) and write (per-call approval):
| Tool | Read/Write | What it does |
|---|---|---|
fleet_ps |
Read | List containers — name, image, status, ports. |
fleet_inspect |
Read | docker inspect for one container (config, mounts, env-redacted). |
fleet_logs |
Read | Tail logs from a named container. |
fleet_up |
Write | Start a service (compose-managed). |
fleet_down |
Write | Stop a service. |
fleet_restart |
Write | Restart a service. |
fleet_build |
Write | Rebuild an image. |
fleet_deploy |
Write | Apply a new image tag and roll. |
Key invariants — enforced in code, not relaxable by config:
- Loopback bind only; public exposure requires the explicit Cloudflare tunnel route, which is fronted by CF Access.
- Allowlisted service names only — there is no "run arbitrary
container" tool and no
exec_shelltool. The threat model is "an agent gets a stale or compromised credential"; the blast radius must be bounded to "the fleet's known services." - CF Access JWT is verified server-side as defense-in-depth.
- Every call is audited to
tools/logs/mcp-audit.log.YYYY-MM-DDwith redacted args and the edge principal (the service token name). - Tool names match
[a-zA-Z0-9_-]— Claude Code's MCP client silently drops dotted names, which we learned the hard way.
MCP vs. A2A¶
MCP is vertical: agent → tool → service. The agent is the principal; the tool is a function call. A2A (next section) is horizontal: agent ↔ agent, peers exchanging messages.
The line is sometimes blurry — an agent can be exposed as an MCP server (its skills become tools, callable by another agent's planner). We do this in some places. But the protocol you reach for first is:
- MCP when the relationship is "I want this work done, I don't care who does it, just call the function."
- A2A when the relationship is "I am specifically talking to that agent, the conversation has state, and the response shape depends on the peer's identity."
Agent-to-Agent (A2A)¶
The protocol¶
The Agent-to-Agent (A2A) protocol is Google's open standard (with growing community contributions) for direct agent ↔ agent communication. Where MCP standardizes the agent → tool boundary, A2A standardizes the boundary between two autonomous agents that may not share a process, host, or trust domain.
A2A defines:
- Agent cards — a
/.well-known/agent.json-style manifest exposing the agent's capabilities, supported message types, and auth. - Tasks — stateful, long-running units of work with their own
lifecycle (
submitted→working→input-required→completed/failed/canceled). - Streaming updates — agents push partial results as a task runs.
- Multi-modal messages — text, files, structured data, function calls.
How we use it¶
ADR-028 commits
the fleet to the dual-protocol pattern: every agent gateway speaks
both MCP (tools) and A2A (peers). The A2A contract lives at
backend-core/contracts/agentarmy-a2a.openapi.yaml and is the source
of truth for fleet agent-card schemas, task envelopes, and the
agent-discovery endpoints.
When to reach for A2A instead of MCP tool calls:
- Delegation with conversation. A2A's task model carries the conversation forward; one agent can answer "what did the planner decide?" by replaying the task transcript. An MCP tool call is stateless.
- Long-running work with progress. A2A's
input-requiredand streaming-update states are first-class. Modeling the same flow as an MCP tool requires custom polling or a webhook escape hatch. - Heterogeneous fleets. When the peer is not ours — a community agent, a partner's agent, an upstream platform's agent — A2A is the lingua franca. Our MCP tool catalog is internal.
- Identity matters. When the who of the peer is part of the semantics (audit, billing, contract-bound) rather than the what of the function, A2A's agent card gives you that identity at protocol level.
Streaming protocol (ADR-007)¶
ADR-007
selects Server-Sent Events (SSE) as the canonical transport for
agent text streams from any gateway to any consumer (browser or
backend). The frame format is JSON lines — one JSON object per data:
event, each carrying either a token delta, a tool-call event, a usage
update, or a terminal done event.
Why SSE, not WebSocket¶
- HTTP/2 multiplexing. SSE rides ordinary HTTP, so a single TCP connection between the browser and our edge multiplexes dozens of concurrent agent streams. WebSocket is one socket per stream, and every middlebox (CDN, ingress, ACA) treats it as a special case.
- Simple proxies. Every reverse proxy in the universe understands HTTP. CloudFront, Cloudflare, Azure Front Door, Nginx, Envoy, ACA ingress — they all stream SSE without configuration. WebSocket upgrades require explicit allowlisting and often break on cheap tiers.
- One direction is enough. Agent → consumer is the only direction
that needs streaming. Consumer → agent is well-served by a separate
POSTfor the prompt and a cancellation endpoint. WebSocket's bidirectionality is unused dead weight for this shape. - Replay. SSE has a built-in
Last-Event-IDsemantic; resuming a dropped stream from a sequence number is part of the spec.
Why not gRPC streaming¶
gRPC bidi streaming is technically superior for cross-service backend
calls (it is what middle-core ↔ backend-core should probably use
for plumbing). But it is not browser-reachable without a translation
layer (grpc-web), it does not play well with our SSE-aware edge, and
it forces a protobuf compilation step on every contract change. SSE
keeps the contract in JSON and reachable from curl.
Frame format¶
event: token
data: {"seq":42,"delta":"Hello","model":"gpt-oss-120b"}
event: tool_call
data: {"seq":43,"id":"call_x","name":"fleet_ps","arguments":{}}
event: usage
data: {"input_tokens":312,"output_tokens":118,"cache_read":0}
event: done
data: {"finish_reason":"stop"}
Cancellation¶
The client sends DELETE /v1/streams/{stream_id} (or aborts the
EventSource and lets a server-side idle timer fire). The gateway then:
- Stops forwarding upstream tokens.
- Sends a final
event: canceledframe. - Closes the response stream.
- Writes a
canceledaudit record.
Importantly, cancellation does not retroactively un-call tools that already executed. The agent's tool-call ledger is append-only.
Agent memory (ADR-008)¶
ADR-008 splits agent memory into four layers, each chosen for what it does well:
| Layer | Persistence | Shape | Backing store |
|---|---|---|---|
| Working memory | In-process, dies with the turn | The prompt | The LLM's context window |
| Episodic memory | Durable, append-only | Event log: "agent X did Y at T" | NATS JetStream → ArcadeDB |
| Semantic memory | Durable, queried | Knowledge graph: entities, relations, embeddings | ArcadeDB (graph + vector) |
| Code knowledge | Durable, per-repo | Symbol graph: files, defs, refs, calls | codebase-memory-mcp (SQLite per repo) |
Ephemeral vs persisted¶
Ephemeral:
- The current prompt and the streaming response.
- Tool-call scratch space inside one turn.
- Per-stream cancellation tokens.
- Anything written to
/tmpinside an agent container.
Persisted:
- Every prompt + completion + tool-call sequence (audit, not retrieval — we don't replay these into prompts).
- Episodic events: "user asked X, planner produced plan Y, executor completed step Z."
- Semantic facts extracted from conversations or documents: entities, their relationships, embeddings of their descriptions.
- Codebase structural memory: definitions, references, the call graph.
Episodic vs semantic¶
The distinction is the cognitive-science one (Tulving, 1972), repurposed for agents:
- Episodic is what happened, when, in what order. It is the source of truth for "did we already try this?" and "what was the state when we made that decision?". It is queried by time and by run-id.
- Semantic is what is true, abstracted from the episode. It is the
source of truth for "what is the schema of
holon?" and "what capabilities does theapi-designeragent declare?". It is queried by entity and by embedding similarity.
Both are durable, but they serve different reads. Conflating them — the classic "just throw everything into one vector store" — makes both reads worse.
codebase-memory-mcp¶
The DeusData codebase-memory-mcp
server is the fleet's standard for code knowledge. It indexes a
repository into a per-repo SQLite knowledge graph (one DB file per
project, cached under ~/.cache/codebase-memory-mcp/) and exposes
tools for searching code, fetching snippets by symbol, tracing call
paths, querying the architecture graph, and managing ADRs.
We use it because:
- It is per-repo, so a multi-spoke fleet doesn't fight over one shared index.
- It is MCP-native, so any agent in any host (Claude Code, Antigravity, claude.ai routines) gets the same tools.
- It complements ArcadeDB rather than duplicating it: ArcadeDB is the
fleet topology and ontology store;
codebase-memory-mcpis the source-code structural store.
The six fleet repos are indexed and are kept current via
detect_changes runs.
MemPalace in the local dev loop¶
MemPalace is a CLI tool that captures session-start / session-stop / pre-compact hooks and turns them into "drawers" of remembered context, keyed by topic. It runs exclusively on the operator's local box; cloud agents do not have it and do not need it.
What MemPalace gives the local Claude Code session:
- A surface for the operator to dictate "remember this for next time" without writing into the repo.
- Automatic capture at
session-stopso the nextsession-startcan restate the open thread. - A separation between durable repo memory (ADRs, contracts, docs) and ambient session memory (what was hot last Tuesday).
It is intentionally not part of the agent's memory contract — the fleet's memory is what is in ArcadeDB, the MCP knowledge graphs, and the git tree.
Capability gating (ADR-052)¶
ADR-052 introduces the rule that makes autonomous agents safe at scale:
Every tool call is authorized against the calling agent's declared capabilities, against a registry that lives in the platform self-model.
The mechanism¶
- Each agent in
.claude/agents/categories/declares a set of capabilities in its frontmatter — e.g.capabilities: [code:read, code:write, fleet:read]. The capability names are drawn from a controlled vocabulary in the platform self-model. - Each MCP tool (and each LLM model) is annotated with the capability
it requires — e.g.
fleet_uprequiresfleet:write,gpt-oss-120brequiresmodel:reasoning. - When an agent calls a tool, the gateway (LLM gateway for models, agent gateway for MCP/A2A) consults the registry and rejects calls where the intersection of declared and required is empty.
Why this matters¶
The naïve alternative — "the agent runs as the user, the user is
authorized, therefore the call is authorized" — fails as soon as the
fleet starts dispatching work autonomously. A copilot-task issue
that spawns a coding agent should not be able to delete production
containers, even if the operator who filed the issue technically can.
Capabilities decouple "what the operator could do" from "what this
agent was hired to do."
It also gives us:
- Least-privilege agents. Every new agent starts with an empty capability set and only gets what it explicitly justifies.
- Audit-able dispatches. The capability set is recorded with every
call, so post-hoc questions like "did anything with
fleet:writetouch arcadedb in the last hour?" are queryable. - Safe agent updates. Adding a capability is a reviewable PR against the agent definition; CI checks that the agent's prompts and delegations don't require capabilities it has not declared.
This is the rule that lets us run the heartbeat in
--apply --auto mode without panic — a runaway Copilot worker still
cannot escape the capability fence.
Swarm orchestration (ADR-044)¶
ADR-044 sets the orchestration model: an ontology-driven swarm where the shape of the work, the agents available to do it, and the routing between them are all derived from the same self-model rather than hand-coded.
Holons all the way down¶
Arthur Koestler, in The Ghost in the Machine (1967), coined holon for an entity that is simultaneously a whole (looking down at its parts) and a part (looking up at its container). The holarchy is the nested structure of holons; Koestler argued it is the universal pattern of organization in living and social systems.
We borrow the pattern directly:
- Every agent is a holon. A
code-revieweragent is a whole over its sub-skills (style checks, security pattern detection, contract conformance) and a part within the larger PR review process, which is itself a holon within the delivery loop. - Every issue / task is a holon. A Feature is a whole over its Stories and a part within an Epic. The board structure (Epic → Feature → Story / Enabler / Bug / Spike) is a holarchy by construction.
- The fleet is a holon. Each spoke repo is a whole within its layer and a part within the hub's orchestration plane.
This is not a metaphor — the decomposition matters. When the orchestrator faces "build feature X," its first move is to decompose X into sub-holons until each leaf is something a single agent can do in one turn. The decomposition rule is encoded against the platform self-model, so adding a new sub-type of work means extending the ontology, not the orchestrator.
Connection to the holonic unified board¶
ADR-035 operationalizes the holarchy at the board level — every work item is a holon with explicit parent and children, and the board renders them as a nested view rather than a flat list. The orchestrator and the human operator see the same holarchy.
Ontology-driven, not topology-driven¶
The naïve swarm — "spin up N workers, give them a shared queue, let them race" — works for embarrassingly-parallel jobs and breaks immediately for jobs with structure. Ontology-driven means the structure of the work informs the routing:
- A task whose ontology classifies it as
contract-changeroutes throughapi-designer→contract-test-engineer→schema-migration-engineer(the contract cluster), not whoever happens to be idle. - A task classified as
infrastructure-changeroutes through the delivery/ops cluster. - A task classified as
judgment-callescalates tohitl-coordinatorand pauses for the operator.
The router is small. The ontology does the work.
Embeddings — Cohere embed-v-4-0¶
The fleet's canonical embedding model is Cohere embed-v-4-0,
served through Azure AI Foundry (deployment fndry-01 in the
operator's environment). It returns 1536-dimensional vectors and
supports input across 100+ languages.
Why Cohere over the alternatives¶
| Alternative | Why we passed |
|---|---|
OpenAI text-embedding-3-large |
Strong, but the data-handling clauses on OpenAI's enterprise tier are harder to align with our partner contracts than Cohere via Azure. |
OpenAI text-embedding-3-small |
Adequate for low-stakes use but loses material recall vs. v4 on long-tail retrieval. |
Open-source (bge-large, e5-large) |
Strong on MTEB English; weaker multilingual; self-hosting cost (GPU minutes) exceeds Cohere API spend at fleet scale. We do run an open model in the local-embedder fallback (next subsection), but not as primary. |
| Voyage / Mistral / etc. | Considered, viable, kept on the secondary list. Not a strong enough lead to displace incumbent. |
The deciding criteria were, in order: license clarity for enterprise
(Cohere via Azure has clean terms for content we ingest from partner
repos), MTEB benchmark position (Cohere embed-v-4-0 is at or near
top of the public multilingual leaderboard at the time of selection),
and multilingual recall (the platform documentation and operator
content are not English-only).
Local-embedder fallback¶
ADR-021's "local fallback" clause allows for a self-hosted embedder
when network is down, the operator wants to embed proprietary content
without leaving the box, or per-call cost matters. The fallback is the
local-embedder Function-tier container — an OpenVINO build of a
strong open model, tuned for the operator's hardware (Intel Core Ultra
7 265 with Arc iGPU). It exposes the same OpenAI-compatible embed API
shape, so callers swap the base URL and nothing else.
The fallback is not for production semantic memory writes; mixing embeddings from two models in the same vector index destroys recall (different geometries). The local embedder is for indices that are explicitly tagged "local-only."
Token budgeting & caching¶
LLM tokens are the dominant cost of an agent fleet, both literally (the bill) and in latency. ADR-021's gateway is the place we account for and optimize them.
Anthropic prompt caching¶
Claude's API supports prompt caching with a 5-minute TTL: stable
prefix blocks marked cache_control: ephemeral are stored on
Anthropic's side and not re-billed (cached input tokens cost ~10× less)
or re-processed for ~5 minutes. To hit the cache we structure prompts
as:
- A long, stable system block (style guide, capability statements, tool inventory) — marked cacheable.
- A long, stable context block (the ADRs, the agent's pinned knowledge) — marked cacheable.
- A short, volatile task block (the actual user request and conversation tail) — uncached.
Empirically the system+context block runs 10–40 K tokens for our heavier agents; cache hits during a 5-minute interactive session amortize that across many turns and shrink per-turn cost by 70–90%.
The fleet's prompt builders are written with the cache in mind: any
agent prompt is (stable_blob, volatile_blob) and the stable_blob
hashes the same across turns. A common mistake — inlining the current
date into the stable_blob — breaks the cache silently; the prompt
builder takes the date as an explicit volatile input to prevent it.
Cost telemetry¶
ADR-010
mandates that every LLM call emit telemetry under the
OpenTelemetry GenAI semantic conventions: gen_ai.system,
gen_ai.request.model, gen_ai.usage.input_tokens,
gen_ai.usage.output_tokens, gen_ai.response.id, etc. The gateway
is the natural place to emit these — it sees every call, knows the
caller, and knows the cost.
Cost flows into the FinOps surface: Grafana dashboards by tier, by
agent, by capability. A spike in one agent's hourly spend is what
triggers finops-engineer to investigate, and the spike is
attributable down to the specific tool sequence that caused it.
Token budgeting per agent¶
Capability gating extends to budgets: an agent's declared capabilities
include a per-window token budget (e.g.
budget:tokens-per-hour:1_000_000). The gateway enforces this with
the same machinery as rate limiting. A busted budget is logged, the
call rejected, and hitl-coordinator notified — the same way other
capability failures escalate. This is what keeps a single misbehaving
worker from burning the daily spend in fifteen minutes.
References¶
Protocols & specs¶
- Model Context Protocol — modelcontextprotocol.io — Anthropic-authored open spec for connecting LLM apps to tools, resources, and prompts.
- Agent-to-Agent (A2A) protocol — github.com/google/A2A — open standard for direct agent-peer messaging.
- OpenAI function calling — platform.openai.com/docs/guides/function-calling — the de facto wire format for tool-use we inherit through the OpenAI-compatible API.
- Anthropic tool use —
docs.anthropic.com/claude/docs/tool-use
— Claude's tool-use semantics; semantically equivalent to OpenAI's
but with
input_schemaand parallel tool calls first-class. - Anthropic prompt caching — docs.anthropic.com/claude/docs/prompt-caching — the 5-minute TTL cache contract our prompt builders target.
- Server-Sent Events — html.spec.whatwg.org/multipage/server-sent-events.html — the streaming transport spec.
- OpenTelemetry semantic conventions for GenAI — opentelemetry.io/docs/specs/semconv/gen-ai/ — the attribute names every LLM call must emit.
Vendors & models¶
- Cerebras inference platform — cerebras.ai — wafer-scale inference; primary LLM provider.
gpt-oss-120bmodel card — huggingface.co/openai/gpt-oss-120bzai-glm-4.7model card — huggingface.co/THUDM — long-context, multilingual; secondary model for non-English / heavy-code tasks.- Cohere
embed-v-4model card — docs.cohere.com/docs/cohere-embed — primary embedding model; 1536-d, multilingual. - Azure AI Foundry — learn.microsoft.com/azure/ai-foundry/ — the deployment surface that serves Cohere embed v4 to the fleet.
Background reading¶
- Arthur Koestler, The Ghost in the Machine (1967) — origin of the holon / holarchy vocabulary; required reading for ADR-044's swarm decomposition.
- Tulving, E. (1972). "Episodic and semantic memory." — the cognitive-science distinction we lift for ADR-008's memory layering.
ADRs cited on this page¶
- ADR-002 — JWT BFF injection
- ADR-003 — No LLM key in browser
- ADR-004 — LLM provider Cerebras
- ADR-007 — Agent streaming protocol
- ADR-008 — Agent memory store
- ADR-010 — Observability standard
- ADR-021 — LLM gateway
- ADR-028 — Agent gateway A2A + MCP
- ADR-035 — Holonic unified board
- ADR-044 — Ontology-orchestrated swarm intelligence
- ADR-052 — Agent tool authorization (capability gating)
Related fleet docs¶
- Fleet MCP distribution — every MCP server in the fleet, its allowlist, and which agents get it.
- Realtime agent interface — the client-facing contract for SSE streams, cancellation, and replay.
Formal-methods adjacent literature¶
A curated, living bibliography of peer-reviewed research that independently validates the design axes we anchor our agent stack on. Discovery and citation grounded via the Hugging Face papers service — every entry is a clickable HF paper page with linked arXiv ID, structured metadata, and (where available) linked model/dataset artifacts.
How this list is maintained
These are not random picks. Each entry was surfaced by the search workflow in Using the Hugging Face Papers Service, then hand-filtered for direct relevance to one of our ADRs. To propose an addition, run the searches there and open a PR adding a row with the matching ADR anchor.
Capability gating, tool authorization, and verifiable safety¶
These map onto ADR-052 — Agent tool authorization (capability gating) and our gateway-mediated tool surface (ADR-021/028). They confirm the broader research community is converging on programmable privilege control and formal guarantees for agent actions — exactly the axis we're on.
- Progent: Programmable Privilege Control for LLM Agents — Shi, He, Wang, Wu et al. (2025). Argues for explicit, programmable least-privilege boundaries for LLM tool calls instead of trust-the-prompt. Direct shape match for our capability registry in ADR-052. arXiv: 2504.11703
- VeriGuard: Enhancing LLM Agent Safety via Verified Code Generation — Miculicich, Parmar, Palangi, Dvijotham et al. (2025). Formal guarantees that agent actions adhere to predefined safety constraints in sensitive domains (healthcare). Anchors why we gate codegen + tool invocation at the gateway, not the prompt. arXiv: 2510.05156
- Formally Specifying the High-Level Behavior of LLM-Based Agents — Crouse, Abdelaziz, Basu, Dan et al. (2023, IBM Research). Uses Linear Temporal Logic (LTL) to specify agent behavior as time-indexed contracts. Same shape as our ADR-038 unified process-and-time architecture — time-indexed states + decidable constraints. arXiv: 2310.08535
Ontology-grounded LLMs and knowledge-graph retrieval¶
These map onto ADR-008 — Agent memory store, ADR-019 — Ontology + reasoning layer, and ADR-030 — Data→ontology ingestion pipeline.
- MindMap: Knowledge Graph Prompting Sparks Graph of Thoughts in Large Language Models — Wen, Wang, Sun (2023, UIUC). KG-conditioned prompting as a remedy for hallucination + opacity; LLMs trace reasoning paths through a graph. Sister-thesis to our active hypergraph inference prototype. arXiv: 2308.09729
- Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text — Mihindukulasooriya, Tiwari, Enguix, Lata (2023). Benchmarks LLM ability to extract KGs constrained by a target ontology — the exact gate our ADR-030 ingestion pipeline enforces with SHACL. arXiv: 2308.02357
- Neurosymbolic AI: The 3rd Wave — d'Avila Garcez & Lamb (2020). The foundational programmatic-statement paper for combining well-founded knowledge representation with deep learning. Cite in every conversation about why we sit ontology + reasoner alongside the LLM rather than throwing the LLM at the world bare. arXiv: 2012.05876
MCP and agent benchmarks¶
These map onto ADR-028 — Agent gateway A2A + MCP and our broader MCP investment (the local-fleet MCP server, mcp.untool.ai, codebase-memory-mcp).
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries — Yin, Shen, Xu, Han et al. (2025). 101 multi-step tasks across diverse MCP tools in dynamic environments — directly the runtime regime our fleet operates in. The benchmark we should be running ourselves. arXiv: 2508.15760
- MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP — Real-world MCP-mediated agent evaluation; an external scoreboard our agents' tool-use can be measured against. arXiv: 2509.09734
Multi-agent coordination¶
These map onto ADR-035 — Holonic unified board architecture and ADR-044 — Ontology-orchestrated swarm intelligence.
- Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems — Qi, Ma, Xing, Guo et al. (2026). Surveys the error-propagation risk that motivates our heartbeat + self-healing harness (ADR-050) and our fleet coordination plane (ADR-053). Particularly relevant: their taxonomy of failure-attribution patterns lines up with our HITL escalation criteria. arXiv: 2605.14892
Open invitation
We update this section when new papers materially change the trade-space. The acceptance bar is: (a) peer-reviewed or substantive preprint, (b) maps onto a specific ADR we hold, (c) either confirms our axis or proposes a falsifiable alternative we should consider. PRs welcome.
See also: Coordination & VFS · Generative Pipeline · Standards Index · Using the HF Papers Service · Intellectual Foundations (Bibliography)