Skip to content

LLM & Agent Stack

This page is the canonical map of how untool.ai builds agents — which model providers we call, what protocols carry agent traffic, where memory lives, how tools are authorized, and how swarms decompose. It is meant as an on-ramp for new contributors and a single source of truth for the architectural choices that already live as Architecture Decision Records.

Every claim below is anchored in an ADR. When the ADR and this page disagree, the ADR wins — fix the page.


The agent stack at a glance

Layer Choice Standard / protocol Alternative considered
LLM provider Cerebras (gpt-oss-120b, zai-glm-4.7) OpenAI-compatible REST Anthropic direct, OpenAI direct, Groq, Together
Model access llm-gateway image (server-side) OpenAI-compatible + SSE Direct provider SDK calls from app code
Browser → LLM BFF route only (/api/copilotkit) with JWT injection HTTP + session cookie → JWT (ADR-002) Public API key in browser (rejected — ADR-003)
Tool protocol Model Context Protocol (MCP) modelcontextprotocol.io Bespoke JSON-RPC, OpenAPI-as-tools
Peer messaging A2A protocol Google A2A + community extensions MCP-only (rejected — peers ≠ tools), AMQP, raw NATS
Streaming Server-Sent Events (SSE) HTTP/1.1 + HTTP/2 multiplexed WebSocket, gRPC streaming
Episodic memory NATS JetStream + ArcadeDB Append-only event log + property graph Redis streams, Postgres LISTEN/NOTIFY
Semantic memory codebase-memory-mcp (DeusData) Per-repo SQLite knowledge graph Vector store only, Neo4j
Local-dev memory MemPalace (CLI 3.3.x) Session-start/stop hooks None — manual context restate
Embeddings Cohere embed-v-4-0 via Azure Foundry OpenAI-compatible embed API OpenAI text-embedding-3-large, bge-large
Embeddings fallback local-embedder image (OpenVINO on Arc iGPU) OpenAI-compatible embed API CPU-only sentence-transformers
Tool authorization Capability gating (ADR-052) Capability registry in platform self-model Role-based ACLs, OPA-only
Orchestration Ontology-driven swarm (ADR-044) Holonic decomposition (Koestler 1967) Single supervisor, master-worker queue
Observability OpenTelemetry GenAI semantic conventions OTLP → Grafana / Tempo / Loki Per-vendor APM, console logs
Prompt caching Anthropic prompt cache (5-min TTL) cache_control: ephemeral on stable blocks None, custom KV cache

The rest of this page walks each row and the trade-offs behind it.


LLM provider — Cerebras (ADR-004) and the gateway pattern (ADR-021)

Why Cerebras

ADR-004 selects Cerebras as the primary inference provider for the fleet. The decision rests on three properties:

  1. Latency. Cerebras' wafer-scale inference returns first-token in the tens of milliseconds and sustains hundreds–thousands of tokens/sec on open-weights models. For interactive agent loops — where a tool call may wait on a planner LLM, then a critic LLM, then a writer LLM — end-to-end latency is dominated by the chain length, not by any one prompt. Sub-second hops are what make multi-agent chains feel synchronous rather than batch.
  2. Model availability. Cerebras hosts gpt-oss-120b (general reasoning, tool-use, function-calling) and zai-glm-4.7 (long-context, multilingual, strong coding). Both are openly licensed, which keeps our enterprise compliance story clean.
  3. OpenAI-compatible API. The wire format is unchanged from the OpenAI reference. Switching providers is a config-file change, not a code migration.

The alternatives — direct Anthropic, direct OpenAI, Groq, Together — are all viable as secondary providers; the gateway is built so adding one is a registry edit and a credential, not a refactor.

Why a gateway (not direct calls)

ADR-021 introduces llm-gateway — a Function-tier container that sits between every agent process and the model providers. Calling Cerebras (or any provider) directly from application code is prohibited by this ADR. The gateway is the single chokepoint, and it earns its keep five ways:

Concern Why centralize it
Token redaction Prompts often contain secrets the caller forgot to scrub (Bearer ..., sk_live_..., aks-...). The gateway runs a deterministic redactor before egress and before it writes the audit log.
Audit Every prompt + completion is logged with caller identity, model, token counts, cost, and capability set. This is what makes incident response and FinOps possible.
Fair-share & per-tier rate limits A runaway loop in one agent must not starve the rest of the fleet. The gateway implements token-bucket per (caller tier, model) tuples.
Capability gating Before forwarding a request the gateway consults the capability registry — if the caller's declared capabilities don't authorize the requested model or the tool set being passed, the call is rejected.
Provider abstraction When a model is deprecated, retired, or a cheaper provider appears, we change the gateway config. No application code moves.

Why no LLM key in the browser

ADR-003 is the oldest LLM-related decision in the fleet and the least negotiable. A browser-resident provider key is, in every meaningful sense, public — View Source, devtools Network tab, a stray window.__NEXT_DATA__ dump. Public keys mean:

  • Anyone can bill our account to exhaustion in an afternoon.
  • Anyone can call any model we have access to, including models the authenticated user is not entitled to.
  • Per-user audit is impossible — every request looks like "the fleet."
  • Rotation requires a frontend deploy.

The accepted path is the BFF route: the Next.js frontend (frontend-core) exposes /api/copilotkit (and friends) as a server route. The browser authenticates with a session cookie; the BFF route mints a short-lived JWT carrying the user's identity + tier + capabilities and forwards the request to the gateway. This is the JWT injection contract from ADR-002, and it is the only acceptable path from a user-facing client to an LLM.

Direct exposure of middle-core (:8100) or backend-core (:8000) through the dev tunnel is, for the same reason, banned — see the CLAUDE.md tunnel policy. Cloud agents do not get raw LLM keys either; they call the gateway through mcp.untool.ai with a CF Access service token, and the gateway issues the LLM credentials.


Model Context Protocol (MCP)

The spec

The Model Context Protocol is Anthropic's open standard for connecting LLM applications to external context and capabilities. An MCP server exposes three primitives:

  • Tools — functions the model can invoke (with JSON-schema'd inputs).
  • Resources — read-only content the model can consult (files, database rows, page contents).
  • Prompts — parameterized prompt templates the host UI can offer the user as slash-commands or buttons.

The transport is JSON-RPC 2.0 over stdio (for locally-spawned servers) or Server-Sent Events / HTTP (for remote servers). The host (Claude Desktop, Claude Code, Cursor, our own runtime) discovers tools at connect time and surfaces them to the model.

How we use it

MCP is the primary protocol for agent → tool interactions in the fleet. We run MCP servers for:

  • Code knowledge graphs (codebase-memory-mcp, see Agent memory).
  • Cloud agent control of the local docker fleet (tools/mcp-local-fleet/, next subsection).
  • Vendored third-party servers (GitHub, Postman, Figma, Azure, GCP, etc.) — see Fleet MCP distribution for the full inventory and the per-agent allowlist.
  • First-party fleet primitives: contract publishing, ontology mutation, capability dispatch.

The principle is: if a behavior is a discrete, schema-able operation on a service, it belongs behind an MCP tool, not as bespoke prompting. Once it is an MCP tool, every model and every agent in the fleet can use it.

The local-fleet MCP server

tools/mcp-local-fleet/ is the control plane that lets cloud agents (Claude.ai routines, GitHub Actions, remote API callers) observe and drive the local Docker fleet without exposing the Docker socket. It is the canonical example of how we ship an MCP server in production.

The server binds to 127.0.0.1:8765 and is exposed publicly at mcp.untool.ai behind Cloudflare Access service tokens (CF-Access-Client-Id + CF-Access-Client-Secret headers). It serves eight allowlisted tools, split into read-only (auto-approve in clients that respect approval prompts) and write (per-call approval):

Tool Read/Write What it does
fleet_ps Read List containers — name, image, status, ports.
fleet_inspect Read docker inspect for one container (config, mounts, env-redacted).
fleet_logs Read Tail logs from a named container.
fleet_up Write Start a service (compose-managed).
fleet_down Write Stop a service.
fleet_restart Write Restart a service.
fleet_build Write Rebuild an image.
fleet_deploy Write Apply a new image tag and roll.

Key invariants — enforced in code, not relaxable by config:

  • Loopback bind only; public exposure requires the explicit Cloudflare tunnel route, which is fronted by CF Access.
  • Allowlisted service names only — there is no "run arbitrary container" tool and no exec_shell tool. The threat model is "an agent gets a stale or compromised credential"; the blast radius must be bounded to "the fleet's known services."
  • CF Access JWT is verified server-side as defense-in-depth.
  • Every call is audited to tools/logs/mcp-audit.log.YYYY-MM-DD with redacted args and the edge principal (the service token name).
  • Tool names match [a-zA-Z0-9_-] — Claude Code's MCP client silently drops dotted names, which we learned the hard way.

MCP vs. A2A

MCP is vertical: agent → tool → service. The agent is the principal; the tool is a function call. A2A (next section) is horizontal: agent ↔ agent, peers exchanging messages.

The line is sometimes blurry — an agent can be exposed as an MCP server (its skills become tools, callable by another agent's planner). We do this in some places. But the protocol you reach for first is:

  • MCP when the relationship is "I want this work done, I don't care who does it, just call the function."
  • A2A when the relationship is "I am specifically talking to that agent, the conversation has state, and the response shape depends on the peer's identity."

Agent-to-Agent (A2A)

The protocol

The Agent-to-Agent (A2A) protocol is Google's open standard (with growing community contributions) for direct agent ↔ agent communication. Where MCP standardizes the agent → tool boundary, A2A standardizes the boundary between two autonomous agents that may not share a process, host, or trust domain.

A2A defines:

  • Agent cards — a /.well-known/agent.json-style manifest exposing the agent's capabilities, supported message types, and auth.
  • Tasks — stateful, long-running units of work with their own lifecycle (submittedworkinginput-requiredcompleted / failed / canceled).
  • Streaming updates — agents push partial results as a task runs.
  • Multi-modal messages — text, files, structured data, function calls.

How we use it

ADR-028 commits the fleet to the dual-protocol pattern: every agent gateway speaks both MCP (tools) and A2A (peers). The A2A contract lives at backend-core/contracts/agentarmy-a2a.openapi.yaml and is the source of truth for fleet agent-card schemas, task envelopes, and the agent-discovery endpoints.

When to reach for A2A instead of MCP tool calls:

  • Delegation with conversation. A2A's task model carries the conversation forward; one agent can answer "what did the planner decide?" by replaying the task transcript. An MCP tool call is stateless.
  • Long-running work with progress. A2A's input-required and streaming-update states are first-class. Modeling the same flow as an MCP tool requires custom polling or a webhook escape hatch.
  • Heterogeneous fleets. When the peer is not ours — a community agent, a partner's agent, an upstream platform's agent — A2A is the lingua franca. Our MCP tool catalog is internal.
  • Identity matters. When the who of the peer is part of the semantics (audit, billing, contract-bound) rather than the what of the function, A2A's agent card gives you that identity at protocol level.

Streaming protocol (ADR-007)

ADR-007 selects Server-Sent Events (SSE) as the canonical transport for agent text streams from any gateway to any consumer (browser or backend). The frame format is JSON lines — one JSON object per data: event, each carrying either a token delta, a tool-call event, a usage update, or a terminal done event.

Why SSE, not WebSocket

  • HTTP/2 multiplexing. SSE rides ordinary HTTP, so a single TCP connection between the browser and our edge multiplexes dozens of concurrent agent streams. WebSocket is one socket per stream, and every middlebox (CDN, ingress, ACA) treats it as a special case.
  • Simple proxies. Every reverse proxy in the universe understands HTTP. CloudFront, Cloudflare, Azure Front Door, Nginx, Envoy, ACA ingress — they all stream SSE without configuration. WebSocket upgrades require explicit allowlisting and often break on cheap tiers.
  • One direction is enough. Agent → consumer is the only direction that needs streaming. Consumer → agent is well-served by a separate POST for the prompt and a cancellation endpoint. WebSocket's bidirectionality is unused dead weight for this shape.
  • Replay. SSE has a built-in Last-Event-ID semantic; resuming a dropped stream from a sequence number is part of the spec.

Why not gRPC streaming

gRPC bidi streaming is technically superior for cross-service backend calls (it is what middle-corebackend-core should probably use for plumbing). But it is not browser-reachable without a translation layer (grpc-web), it does not play well with our SSE-aware edge, and it forces a protobuf compilation step on every contract change. SSE keeps the contract in JSON and reachable from curl.

Frame format

event: token
data: {"seq":42,"delta":"Hello","model":"gpt-oss-120b"}

event: tool_call
data: {"seq":43,"id":"call_x","name":"fleet_ps","arguments":{}}

event: usage
data: {"input_tokens":312,"output_tokens":118,"cache_read":0}

event: done
data: {"finish_reason":"stop"}

Cancellation

The client sends DELETE /v1/streams/{stream_id} (or aborts the EventSource and lets a server-side idle timer fire). The gateway then:

  1. Stops forwarding upstream tokens.
  2. Sends a final event: canceled frame.
  3. Closes the response stream.
  4. Writes a canceled audit record.

Importantly, cancellation does not retroactively un-call tools that already executed. The agent's tool-call ledger is append-only.


Agent memory (ADR-008)

ADR-008 splits agent memory into four layers, each chosen for what it does well:

Layer Persistence Shape Backing store
Working memory In-process, dies with the turn The prompt The LLM's context window
Episodic memory Durable, append-only Event log: "agent X did Y at T" NATS JetStream → ArcadeDB
Semantic memory Durable, queried Knowledge graph: entities, relations, embeddings ArcadeDB (graph + vector)
Code knowledge Durable, per-repo Symbol graph: files, defs, refs, calls codebase-memory-mcp (SQLite per repo)

Ephemeral vs persisted

Ephemeral:

  • The current prompt and the streaming response.
  • Tool-call scratch space inside one turn.
  • Per-stream cancellation tokens.
  • Anything written to /tmp inside an agent container.

Persisted:

  • Every prompt + completion + tool-call sequence (audit, not retrieval — we don't replay these into prompts).
  • Episodic events: "user asked X, planner produced plan Y, executor completed step Z."
  • Semantic facts extracted from conversations or documents: entities, their relationships, embeddings of their descriptions.
  • Codebase structural memory: definitions, references, the call graph.

Episodic vs semantic

The distinction is the cognitive-science one (Tulving, 1972), repurposed for agents:

  • Episodic is what happened, when, in what order. It is the source of truth for "did we already try this?" and "what was the state when we made that decision?". It is queried by time and by run-id.
  • Semantic is what is true, abstracted from the episode. It is the source of truth for "what is the schema of holon?" and "what capabilities does the api-designer agent declare?". It is queried by entity and by embedding similarity.

Both are durable, but they serve different reads. Conflating them — the classic "just throw everything into one vector store" — makes both reads worse.

codebase-memory-mcp

The DeusData codebase-memory-mcp server is the fleet's standard for code knowledge. It indexes a repository into a per-repo SQLite knowledge graph (one DB file per project, cached under ~/.cache/codebase-memory-mcp/) and exposes tools for searching code, fetching snippets by symbol, tracing call paths, querying the architecture graph, and managing ADRs.

We use it because:

  • It is per-repo, so a multi-spoke fleet doesn't fight over one shared index.
  • It is MCP-native, so any agent in any host (Claude Code, Antigravity, claude.ai routines) gets the same tools.
  • It complements ArcadeDB rather than duplicating it: ArcadeDB is the fleet topology and ontology store; codebase-memory-mcp is the source-code structural store.

The six fleet repos are indexed and are kept current via detect_changes runs.

MemPalace in the local dev loop

MemPalace is a CLI tool that captures session-start / session-stop / pre-compact hooks and turns them into "drawers" of remembered context, keyed by topic. It runs exclusively on the operator's local box; cloud agents do not have it and do not need it.

What MemPalace gives the local Claude Code session:

  • A surface for the operator to dictate "remember this for next time" without writing into the repo.
  • Automatic capture at session-stop so the next session-start can restate the open thread.
  • A separation between durable repo memory (ADRs, contracts, docs) and ambient session memory (what was hot last Tuesday).

It is intentionally not part of the agent's memory contract — the fleet's memory is what is in ArcadeDB, the MCP knowledge graphs, and the git tree.


Capability gating (ADR-052)

ADR-052 introduces the rule that makes autonomous agents safe at scale:

Every tool call is authorized against the calling agent's declared capabilities, against a registry that lives in the platform self-model.

The mechanism

  1. Each agent in .claude/agents/categories/ declares a set of capabilities in its frontmatter — e.g. capabilities: [code:read, code:write, fleet:read]. The capability names are drawn from a controlled vocabulary in the platform self-model.
  2. Each MCP tool (and each LLM model) is annotated with the capability it requires — e.g. fleet_up requires fleet:write, gpt-oss-120b requires model:reasoning.
  3. When an agent calls a tool, the gateway (LLM gateway for models, agent gateway for MCP/A2A) consults the registry and rejects calls where the intersection of declared and required is empty.

Why this matters

The naïve alternative — "the agent runs as the user, the user is authorized, therefore the call is authorized" — fails as soon as the fleet starts dispatching work autonomously. A copilot-task issue that spawns a coding agent should not be able to delete production containers, even if the operator who filed the issue technically can. Capabilities decouple "what the operator could do" from "what this agent was hired to do."

It also gives us:

  • Least-privilege agents. Every new agent starts with an empty capability set and only gets what it explicitly justifies.
  • Audit-able dispatches. The capability set is recorded with every call, so post-hoc questions like "did anything with fleet:write touch arcadedb in the last hour?" are queryable.
  • Safe agent updates. Adding a capability is a reviewable PR against the agent definition; CI checks that the agent's prompts and delegations don't require capabilities it has not declared.

This is the rule that lets us run the heartbeat in --apply --auto mode without panic — a runaway Copilot worker still cannot escape the capability fence.


Swarm orchestration (ADR-044)

ADR-044 sets the orchestration model: an ontology-driven swarm where the shape of the work, the agents available to do it, and the routing between them are all derived from the same self-model rather than hand-coded.

Holons all the way down

Arthur Koestler, in The Ghost in the Machine (1967), coined holon for an entity that is simultaneously a whole (looking down at its parts) and a part (looking up at its container). The holarchy is the nested structure of holons; Koestler argued it is the universal pattern of organization in living and social systems.

We borrow the pattern directly:

  • Every agent is a holon. A code-reviewer agent is a whole over its sub-skills (style checks, security pattern detection, contract conformance) and a part within the larger PR review process, which is itself a holon within the delivery loop.
  • Every issue / task is a holon. A Feature is a whole over its Stories and a part within an Epic. The board structure (Epic → Feature → Story / Enabler / Bug / Spike) is a holarchy by construction.
  • The fleet is a holon. Each spoke repo is a whole within its layer and a part within the hub's orchestration plane.

This is not a metaphor — the decomposition matters. When the orchestrator faces "build feature X," its first move is to decompose X into sub-holons until each leaf is something a single agent can do in one turn. The decomposition rule is encoded against the platform self-model, so adding a new sub-type of work means extending the ontology, not the orchestrator.

Connection to the holonic unified board

ADR-035 operationalizes the holarchy at the board level — every work item is a holon with explicit parent and children, and the board renders them as a nested view rather than a flat list. The orchestrator and the human operator see the same holarchy.

Ontology-driven, not topology-driven

The naïve swarm — "spin up N workers, give them a shared queue, let them race" — works for embarrassingly-parallel jobs and breaks immediately for jobs with structure. Ontology-driven means the structure of the work informs the routing:

  • A task whose ontology classifies it as contract-change routes through api-designercontract-test-engineerschema-migration-engineer (the contract cluster), not whoever happens to be idle.
  • A task classified as infrastructure-change routes through the delivery/ops cluster.
  • A task classified as judgment-call escalates to hitl-coordinator and pauses for the operator.

The router is small. The ontology does the work.


Embeddings — Cohere embed-v-4-0

The fleet's canonical embedding model is Cohere embed-v-4-0, served through Azure AI Foundry (deployment fndry-01 in the operator's environment). It returns 1536-dimensional vectors and supports input across 100+ languages.

Why Cohere over the alternatives

Alternative Why we passed
OpenAI text-embedding-3-large Strong, but the data-handling clauses on OpenAI's enterprise tier are harder to align with our partner contracts than Cohere via Azure.
OpenAI text-embedding-3-small Adequate for low-stakes use but loses material recall vs. v4 on long-tail retrieval.
Open-source (bge-large, e5-large) Strong on MTEB English; weaker multilingual; self-hosting cost (GPU minutes) exceeds Cohere API spend at fleet scale. We do run an open model in the local-embedder fallback (next subsection), but not as primary.
Voyage / Mistral / etc. Considered, viable, kept on the secondary list. Not a strong enough lead to displace incumbent.

The deciding criteria were, in order: license clarity for enterprise (Cohere via Azure has clean terms for content we ingest from partner repos), MTEB benchmark position (Cohere embed-v-4-0 is at or near top of the public multilingual leaderboard at the time of selection), and multilingual recall (the platform documentation and operator content are not English-only).

Local-embedder fallback

ADR-021's "local fallback" clause allows for a self-hosted embedder when network is down, the operator wants to embed proprietary content without leaving the box, or per-call cost matters. The fallback is the local-embedder Function-tier container — an OpenVINO build of a strong open model, tuned for the operator's hardware (Intel Core Ultra 7 265 with Arc iGPU). It exposes the same OpenAI-compatible embed API shape, so callers swap the base URL and nothing else.

The fallback is not for production semantic memory writes; mixing embeddings from two models in the same vector index destroys recall (different geometries). The local embedder is for indices that are explicitly tagged "local-only."


Token budgeting & caching

LLM tokens are the dominant cost of an agent fleet, both literally (the bill) and in latency. ADR-021's gateway is the place we account for and optimize them.

Anthropic prompt caching

Claude's API supports prompt caching with a 5-minute TTL: stable prefix blocks marked cache_control: ephemeral are stored on Anthropic's side and not re-billed (cached input tokens cost ~10× less) or re-processed for ~5 minutes. To hit the cache we structure prompts as:

  1. A long, stable system block (style guide, capability statements, tool inventory) — marked cacheable.
  2. A long, stable context block (the ADRs, the agent's pinned knowledge) — marked cacheable.
  3. A short, volatile task block (the actual user request and conversation tail) — uncached.

Empirically the system+context block runs 10–40 K tokens for our heavier agents; cache hits during a 5-minute interactive session amortize that across many turns and shrink per-turn cost by 70–90%.

The fleet's prompt builders are written with the cache in mind: any agent prompt is (stable_blob, volatile_blob) and the stable_blob hashes the same across turns. A common mistake — inlining the current date into the stable_blob — breaks the cache silently; the prompt builder takes the date as an explicit volatile input to prevent it.

Cost telemetry

ADR-010 mandates that every LLM call emit telemetry under the OpenTelemetry GenAI semantic conventions: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.id, etc. The gateway is the natural place to emit these — it sees every call, knows the caller, and knows the cost.

Cost flows into the FinOps surface: Grafana dashboards by tier, by agent, by capability. A spike in one agent's hourly spend is what triggers finops-engineer to investigate, and the spike is attributable down to the specific tool sequence that caused it.

Token budgeting per agent

Capability gating extends to budgets: an agent's declared capabilities include a per-window token budget (e.g. budget:tokens-per-hour:1_000_000). The gateway enforces this with the same machinery as rate limiting. A busted budget is logged, the call rejected, and hitl-coordinator notified — the same way other capability failures escalate. This is what keeps a single misbehaving worker from burning the daily spend in fifteen minutes.


References

Protocols & specs

Vendors & models

Background reading

  • Arthur Koestler, The Ghost in the Machine (1967) — origin of the holon / holarchy vocabulary; required reading for ADR-044's swarm decomposition.
  • Tulving, E. (1972). "Episodic and semantic memory." — the cognitive-science distinction we lift for ADR-008's memory layering.

ADRs cited on this page

  • Fleet MCP distribution — every MCP server in the fleet, its allowlist, and which agents get it.
  • Realtime agent interface — the client-facing contract for SSE streams, cancellation, and replay.

Formal-methods adjacent literature

A curated, living bibliography of peer-reviewed research that independently validates the design axes we anchor our agent stack on. Discovery and citation grounded via the Hugging Face papers service — every entry is a clickable HF paper page with linked arXiv ID, structured metadata, and (where available) linked model/dataset artifacts.

How this list is maintained

These are not random picks. Each entry was surfaced by the search workflow in Using the Hugging Face Papers Service, then hand-filtered for direct relevance to one of our ADRs. To propose an addition, run the searches there and open a PR adding a row with the matching ADR anchor.

Capability gating, tool authorization, and verifiable safety

These map onto ADR-052 — Agent tool authorization (capability gating) and our gateway-mediated tool surface (ADR-021/028). They confirm the broader research community is converging on programmable privilege control and formal guarantees for agent actions — exactly the axis we're on.

Ontology-grounded LLMs and knowledge-graph retrieval

These map onto ADR-008 — Agent memory store, ADR-019 — Ontology + reasoning layer, and ADR-030 — Data→ontology ingestion pipeline.

MCP and agent benchmarks

These map onto ADR-028 — Agent gateway A2A + MCP and our broader MCP investment (the local-fleet MCP server, mcp.untool.ai, codebase-memory-mcp).

Multi-agent coordination

These map onto ADR-035 — Holonic unified board architecture and ADR-044 — Ontology-orchestrated swarm intelligence.

Open invitation

We update this section when new papers materially change the trade-space. The acceptance bar is: (a) peer-reviewed or substantive preprint, (b) maps onto a specific ADR we hold, (c) either confirms our axis or proposes a falsifiable alternative we should consider. PRs welcome.


See also: Coordination & VFS · Generative Pipeline · Standards Index · Using the HF Papers Service · Intellectual Foundations (Bibliography)