ARC-ADR-050 — Self-Healing, Bidirectional, and Hypergraph-Attached Agent Harness

One line: Architecture of the runner harness enabling guest-to-host control loops, self-correction execution loops, ontology MCP queries, and operator steering hooks.


Context and Problem Statement

As we move toward an agent-army-first self-building platform, our agents run isolated code operations (compiling, code emitting, running tests) inside sandboxed runner environments (Firecracker microVMs or Docker WSL containers).

To scale this to autonomous operation, we must solve several critical coordination and usability challenges:
1. Brittle Failures: If a compiler fails, a dependency is missing, or a database migration is needed, standard sandboxes simply crash or report failure. The agent has no programmatic framework to diagnose and self-heal the environment.
2. One-Way Directionality: Host-to-guest execution is currently a one-way street (host commands guest to run). We lack a secure, bidirectional communication mechanism for guests to request resources or escalate decisions (Human-in-the-Loop) reactively.
3. Hypergraph Isolation: The agents write code that changes the system, but they cannot dynamically query or write to our core knowledge graph (the hypergraph mapping SystemComponents, relators, and task states in ArcadeDB/Fuseki).
4. Black-Box Execution: Operators (like Nicky) have no easy interface to inspect running VMs, view active trace events, or dynamically inject guidance into live agent processes.


Decision

Adopt the Self-Healing, Bidirectional Agent Harness (ARC-ADR-050) as our standard hyperautomation runner envelope. We will build a unified guest-host proxy interface using virtual sockets (vsock) under Firecracker (falling back to TCP under Docker WSL) to orchestrate self-correction, bidirectional control streams, and hypergraph tool bindings.


Technical Specifications

1. Self-Healing Execution Envelope

Every runner command (e.g. pytest or cargo build) executes inside a wrapped diagnostic transaction:
1. Pre-Flight Snapshot: The harness writes a VFS checkout checkpoint (using the local VFS SQLite staging buffer).
2. Sandbox Run: The command is executed. If it exits with 0, the transaction is committed.
3. Diagnostic Trap: If a non-zero exit code or traceback occurs:
   - The Guest Execution Agent intercepts stderr.
   - It runs a local Traceback Parser to classify the error (e.g., ImportError: no module named X, DatabaseMigrationError, SyntaxError).
   - Self-Healing Strategies:
     * Missing Dependency: Runs pip install or npm install dynamically inside the guest runtime overlay and retries.
     * Database Out-of-Sync: Triggers a local database schema migration run and retries.
     * Code-Level Bug: Invokes a localized self-repair loop (a lightweight agent LLM instance) to patch the offending file in the OverlayFS writable layer and compile again.
4. VFS Rollback: If the error is unrecoverable or a recursion loop is detected, the harness reverts the OverlayFS state to the pre-flight VFS checkpoint, discarding dirty intermediate writes, and escalates to the host.
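The transaction above can be sketched as a bounded retry loop. This is a minimal illustration, not the harness implementation: the regex patterns, the `classify` and `run_with_healing` names, the `max_attempts` default, and the rule "the same error seen twice counts as a loop" are all assumptions. The checkpoint, rollback, and healing strategies are passed in as callables because the ADR does not define their interfaces.

```python
import re
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical classification rules; the ADR names the error categories
# but not how the Traceback Parser recognises them. Order matters: the
# most specific patterns are checked first.
PATTERNS = [
    ("missing_dependency",
     re.compile(r"(?:ModuleNotFoundError|ImportError): No module named '?([\w\.]+)'?", re.I)),
    ("database_out_of_sync", re.compile(r"DatabaseMigrationError")),
    ("code_bug", re.compile(r"SyntaxError|Traceback \(most recent call last\)")),
]


def classify(stderr: str) -> Tuple[str, Optional[str]]:
    """Return (error_kind, detail), e.g. ("missing_dependency", "requests")."""
    for kind, pattern in PATTERNS:
        match = pattern.search(stderr)
        if match:
            return kind, (match.group(1) if match.groups() else None)
    return "unknown", None


def run_command(argv: List[str]) -> Tuple[int, str]:
    """Sandbox Run step: execute the command and capture its stderr."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    return proc.returncode, proc.stderr


@dataclass
class Result:
    ok: bool
    attempts: int
    actions: List[str] = field(default_factory=list)


def run_with_healing(
    run: Callable[[], Tuple[int, str]],
    strategies: Dict[str, Callable[[Optional[str]], None]],
    checkpoint: Callable[[], str],
    rollback: Callable[[str], None],
    max_attempts: int = 3,  # assumed bound; not specified in the ADR
) -> Result:
    token = checkpoint()                 # 1. Pre-Flight Snapshot
    seen = set()
    actions: List[str] = []
    attempt = 0
    for attempt in range(1, max_attempts + 1):
        code, stderr = run()             # 2. Sandbox Run
        if code == 0:
            return Result(True, attempt, actions)
        kind, detail = classify(stderr)  # 3. Diagnostic Trap
        heal = strategies.get(kind)
        if heal is None or (kind, detail) in seen:
            break                        # unrecoverable, or the same error recurred
        seen.add((kind, detail))
        heal(detail)
        actions.append(kind)
    rollback(token)                      # 4. VFS Rollback, then escalate to the host
    return Result(False, attempt, actions)
```

A caller would pass something like `lambda: run_command(["pytest"])` as `run`, and map `"missing_dependency"` to a function that installs the named package. Escalation to the host is left to the caller, which can act on `Result.ok` being false.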

2. Bidirectional VSOCK / MCP Control Loop

Rather than exposing standard TCP ports (which introduce security risks), the guest uses Firecracker virtual sockets (vsock) to connect back to the host.
* Bidirectional Channels: The guest Execution Agent establishes an MCP (Model Context Protocol) connection over vsock to CID 2 (the well-known host CID), where the host runner node daemon listens.
* Host-to-Guest RPCs:
  - runner_pause() / runner_resume() / runner_drain(): Instruct the hypervisor scheduler to pause, resume, or gracefully finish execution.
  - runner_inject_prompt(text): Forces a high-priority prompt into the active agent queue.
* Guest-to-Host RPCs:
  - escalate_to_operator(reason, schema): Requests Human-in-the-Loop input (rendered as a cockpit selector UI) when a destructive or ambiguous option is reached.
  - acquire_vfs_lock(path): Requests the host daemon to coordinate file-path locks across other concurrent runner nodes.
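As a rough sketch of what a guest-to-host call could look like on the wire: MCP messages are JSON-RPC 2.0, and tools are invoked through the `tools/call` method. Everything else below is an assumption made for illustration: the port number, the newline-delimited framing over the vsock stream, and the helper names. Treating `escalate_to_operator` as an MCP tool is also an assumption; the ADR lists it as an RPC without saying how it is exposed.

```python
import itertools
import json
import socket
from typing import Any, Dict

HOST_CID = 2      # well-known vsock CID of the host (VMADDR_CID_HOST)
MCP_PORT = 5000   # hypothetical port; the ADR does not specify one

_request_ids = itertools.count(1)


def tool_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request that invokes an MCP tool by name."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def encode_frame(message: Dict[str, Any]) -> bytes:
    """Serialise one message as a single newline-terminated line (assumed framing)."""
    return (json.dumps(message, separators=(",", ":")) + "\n").encode("utf-8")


def decode_frame(line: bytes) -> Dict[str, Any]:
    """Parse one newline-terminated frame back into a message."""
    return json.loads(line.decode("utf-8"))


def connect_to_host(port: int = MCP_PORT) -> socket.socket:
    """Open a stream to the host daemon from inside the guest (Linux only)."""
    if not hasattr(socket, "AF_VSOCK"):
        raise OSError("AF_VSOCK is not available on this platform")
    sock = socket.socket(socket.AF_VSOCK, socket.SOCK_STREAM)
    sock.connect((HOST_CID, port))
    return sock
```

Inside a guest, `connect_to_host().sendall(encode_frame(tool_call("acquire_vfs_lock", {"path": "src/app.py"})))` would send one request. Under the Docker WSL fallback the same frames could travel over a TCP socket instead, with only the connect step changing.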

3. Hypergraph Attachment via MCP Tools

The host MCP server acts as a bridge to the ArcadeDB/Postgres hypergraph. Guest agents call these tools to query and mutate the platform's self-model:
* hypergraph_query(query_string, depth): Resolves dependencies, components, and mission-to-goal hierarchies. Allows the agent to understand: "What other services depend on this API structure?"
* hypergraph_mutate(vertex_payload): Registers newly created components or updates active deployment status.
* evidence_record(component_id, test_result_payload): Appends verifiable test traces and audit blocks directly to the hypergraph component vertex, satisfying L1-L6 Verification Level constraints.
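To make the role of the `depth` parameter concrete, here is a small in-memory stand-in for the three tools. It is only a sketch: the real bridge talks to the graph database, the payload keys (`id`, `depends_on`) are hypothetical, and the query string is treated as a bare component id rather than a real query language.

```python
from collections import deque
from typing import Dict, List, Set


class InMemoryHypergraph:
    """Toy stand-in for the host-side hypergraph tools (illustrative only)."""

    def __init__(self) -> None:
        self.vertices: Dict[str, dict] = {}
        self.dependents: Dict[str, Set[str]] = {}   # component -> who depends on it
        self.evidence: Dict[str, List[dict]] = {}

    def hypergraph_mutate(self, vertex_payload: dict) -> str:
        """Register or update a component and index its dependency edges."""
        vertex_id = vertex_payload["id"]
        self.vertices.setdefault(vertex_id, {}).update(vertex_payload)
        for target in vertex_payload.get("depends_on", []):
            self.dependents.setdefault(target, set()).add(vertex_id)
        return vertex_id

    def hypergraph_query(self, query_string: str, depth: int) -> List[str]:
        """Answer "what depends on this component?" up to `depth` hops away."""
        found: Set[str] = set()
        queue = deque([(query_string, 0)])
        while queue:
            node, hops = queue.popleft()
            if hops == depth:
                continue                      # do not expand past the depth limit
            for dependent in self.dependents.get(node, ()):
                if dependent not in found:
                    found.add(dependent)
                    queue.append((dependent, hops + 1))
        return sorted(found)

    def evidence_record(self, component_id: str, test_result_payload: dict) -> int:
        """Append a test trace to a known component; return the new trace count."""
        if component_id not in self.vertices:
            raise KeyError(f"unknown component: {component_id}")
        traces = self.evidence.setdefault(component_id, [])
        traces.append(dict(test_result_payload))
        return len(traces)
```

With `api`, `billing` (depending on `api`), and `invoices` (depending on `billing`) registered, a depth of 1 returns only the direct dependent and a depth of 2 also returns the transitive one. This is the kind of blast-radius question an agent would ask before changing an API.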

4. Operator Steering CLI: ut-agent

A local command-line interface ut-agent runs on the host Windows/WSL OS to expose steering capabilities:
* ut-agent list: Lists active microVM CIDs, base repos, branches, and active goals.
* ut-agent inspect <cid> --follow: Streams real-time console logs, trace spans, and vsock packets from the guest.
* ut-agent inject <cid> --prompt "...": Injects guidance directly into a running process.
* ut-agent checkpoint <cid> / ut-agent rollback <cid>: Controls sandbox snapshots.
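The command surface above maps naturally onto subcommands. The following is a sketch of the argument parsing only, using Python's standard `argparse`; the implementation language of ut-agent is not stated in this ADR, and parsing the CID as an integer is an assumption. No handler logic is shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the ut-agent subcommands listed above."""
    parser = argparse.ArgumentParser(
        prog="ut-agent", description="Operator steering CLI (sketch)"
    )
    sub = parser.add_subparsers(dest="command", required=True)

    sub.add_parser("list", help="list active microVM CIDs, repos, branches, goals")

    inspect_cmd = sub.add_parser("inspect", help="show logs and trace spans for a guest")
    inspect_cmd.add_argument("cid", type=int)   # assumed integer CID
    inspect_cmd.add_argument("--follow", action="store_true",
                             help="keep streaming in real time")

    inject_cmd = sub.add_parser("inject", help="inject guidance into a running process")
    inject_cmd.add_argument("cid", type=int)
    inject_cmd.add_argument("--prompt", required=True)

    # checkpoint and rollback take the same single positional argument.
    for name in ("checkpoint", "rollback"):
        snapshot_cmd = sub.add_parser(name, help=f"{name} the sandbox snapshot")
        snapshot_cmd.add_argument("cid", type=int)

    return parser
```

For example, `build_parser().parse_args(["inject", "7", "--prompt", "prefer the v2 API"])` yields a namespace with `command="inject"`, `cid=7`, and the prompt text, which a dispatcher could forward to `runner_inject_prompt`.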


Consequences

  • + Self-Correcting Runtimes: Sandboxes heal local compile/install issues autonomously, which should reduce the number of tasks aborted on recoverable errors.
  • + Secure Real-Time Communication: Under Firecracker, vsock avoids network routing overhead and lets the guest reach the host without exposing TCP ports, keeping the network perimeter isolated.
  • + Hypergraph Grounding: Agents read and record their architectural mutations directly to the knowledge graph, closing the systems engineering loop.
  • − Memory overhead: Writable OverlayFS layers and local self-repair LLM calls require higher memory footprints per runner node.