Realtime Agent Interface

Position: realtime audio and video are interaction surfaces, not the authority layer. The live agent can listen, speak, see, point, interrupt, and guide, but platform truth still comes from the operational hypergraph, contracts, evidence, policies, and HITL gates.

Goal

Design a live agent interface for untool.ai that supports:

  • low-latency voice interaction;
  • optional video and screen context;
  • live captions and transcript;
  • agent/human interruption;
  • tool and runbook gating;
  • demo/twin evidence capture;
  • operational hypergraph traceability.

The interface should work for demos first, then mature into an operational twin control surface.

Current Baseline

The platform already has streaming foundations:

  • AG-UI stream: SSE event protocol for frontend agent interaction.
  • Active hypergraph stream: object-native SSE events for activation, gate, final, and done.
  • OpenAI-compatible adapters: chat and Responses streaming over the same hypergraph authority path.
  • Demo Harness: evidence, approval, simulation, twin bindings, and review surface.

The missing plane is media: browser microphone, speaker, camera, screen share, realtime model transport, recording, transcript, consent, and media evidence.

Architecture

flowchart LR
  Browser["Browser UI<br/>mic, speaker, camera, screen, captions"]
  MediaBroker["Realtime Media Broker<br/>session, consent, ephemeral tokens"]
  MediaPlane["Media Plane<br/>WebRTC / SFU / recording"]
  ModelAdapter["Realtime Model Adapter<br/>speech, vision, proposals"]
  Hypergraph["Operational Hypergraph<br/>truth, policy, sufficiency, trace"]
  Policy["Policy + HITL Gate<br/>CapabilityGrant, approval, rollback posture"]
  Tools["Tool Gateway<br/>MCP / A2A / runbooks"]
  Evidence["Evidence Store<br/>transcript, trace, media refs, HITL"]
  Demo["Demo / Twin Harness<br/>guided review and control"]

  Browser <--> MediaBroker
  MediaBroker <--> MediaPlane
  MediaBroker <--> ModelAdapter
  ModelAdapter <--> Hypergraph
  ModelAdapter -->|tool.proposed| Policy
  Hypergraph --> Policy
  Policy -->|approved only| Tools
  Policy --> Evidence
  Hypergraph --> Evidence
  MediaPlane --> Evidence
  Browser --> Demo
  Demo --> Evidence

1. Browser Client

The browser owns user interaction:

  • microphone, speaker, camera, and screen-share controls;
  • consent and recording indicators;
  • live captions and transcript timeline;
  • push-to-talk and mute;
  • barge-in / interrupt;
  • tool-call and HITL approval panels;
  • current hypergraph trace link;
  • current demo/twin run link.

The browser must never hold a long-lived provider API key. It receives only a short-lived session credential from the backend.

2. Realtime Media Broker

The broker is the server-side control point.

Responsibilities:

  • create realtime sessions;
  • mint ephemeral client credentials;
  • attach user, tenant, repo, demo, work unit, and policy context;
  • authorize media modes: audio, video, screen, recording, transcription;
  • enforce JWT/BFF identity, media-mode scope, TTL, revocation, and consent version;
  • route model events to the hypergraph authority path;
  • convert model tool requests into typed tool.proposed events;
  • route approved tool calls through MCP/A2A only after policy and HITL gates;
  • write session evidence.

This broker belongs behind backend-core auth and policy. It is not a direct frontend-to-provider pass-through.
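As a rough illustration of "mint ephemeral client credentials" with mode scope, TTL, and revocation, the sketch below signs a small claims payload with HMAC. It is a minimal sketch, not the platform's token format: `BROKER_SECRET`, `mint_session_credential`, `verify_session_credential`, and the claim names are all hypothetical, and a real broker would more likely issue a JWT or a provider-issued ephemeral token.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical signing key; a real broker would load this from a secret store.
BROKER_SECRET = b"replace-with-secret-from-a-vault"
ALLOWED_MODES = {"audio", "video", "screen", "recording", "transcription"}


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).decode("ascii")


def _sign(payload: str) -> str:
    return _b64(hmac.new(BROKER_SECRET, payload.encode("ascii"), hashlib.sha256).digest())


def mint_session_credential(user_id, session_id, modes, consent_version,
                            ttl_seconds=60, now=None):
    """Return a signed, mode-scoped credential that expires after ttl_seconds."""
    unknown = set(modes) - ALLOWED_MODES
    if unknown:
        raise ValueError(f"unsupported media modes: {sorted(unknown)}")
    issued = int(now if now is not None else time.time())
    claims = {
        "sub": user_id,
        "sessionId": session_id,
        "modes": sorted(modes),
        "consentVersion": consent_version,
        "exp": issued + ttl_seconds,
    }
    payload = _b64(json.dumps(claims, sort_keys=True).encode("utf-8"))
    return f"{payload}.{_sign(payload)}"


def verify_session_credential(token, required_mode, revoked_sessions=frozenset(), now=None):
    """Return the claims if the token is authentic, unexpired, unrevoked, and in scope."""
    payload, _, signature = token.partition(".")
    if not hmac.compare_digest(signature, _sign(payload)):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload))
    current = now if now is not None else time.time()
    if current >= claims["exp"]:
        raise PermissionError("credential expired")
    if claims["sessionId"] in revoked_sessions:
        raise PermissionError("session revoked")
    if required_mode not in claims["modes"]:
        raise PermissionError(f"mode not granted: {required_mode}")
    return claims
```

The browser would only ever see the returned token; the signing key and the revocation set stay on the broker side.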

3. Media Plane

Use WebRTC for live media. WebSockets/SSE are acceptable for control/event streams, but not as the primary audio/video transport.

Recommended path:

| Stage | Media plane |
| --- | --- |
| Prototype | Browser WebRTC to provider or local broker with broker-minted, mode-scoped, short-lived session credentials. |
| Team demos | LiveKit or equivalent SFU-backed room for voice, video, screen share, recording, and multi-party presence. |
| Production | Media broker + SFU + provider adapter + recording/transcript pipeline + policy gates. |

Avoid building low-level WebRTC infrastructure from scratch until a concrete requirement forces it.

Direct browser-to-provider media is allowed only for media transport. Tools, policy, HITL, evidence, trace correlation, and session creation still route through backend-core controlled services.

4. Realtime Model Adapter

The adapter hides provider details. It should support two execution styles:

| Style | Use |
| --- | --- |
| Native realtime model | Low-latency speech-to-speech, interruption, multimodal input, fast demos. |
| Cascaded pipeline | Provider flexibility: streaming STT -> hypergraph/LLM reasoning -> streaming TTS. |

The adapter emits the normalized events listed below. In v0 they are not a parallel public stream grammar; they are an adapter-local vocabulary that maps onto AG-UI, active hypergraph SSE, provider events, and the durable RealtimeSession evidence contract.

  • session.created
  • media.input.audio
  • media.input.video
  • transcript.delta
  • transcript.final
  • agent.audio.delta
  • agent.text.delta
  • agent.interruption
  • tool.proposed
  • tool.blocked
  • tool.approved
  • hypergraph.trace.started
  • hypergraph.trace.updated
  • hypergraph.trace.final
  • hitl.requested
  • hitl.decided
  • session.ended

Every externally persisted event must carry sessionId, sequence, and timestamp, plus whichever of these correlation IDs are available: demoId, runId, traceId, traceparent, toolProposalId, and hitlDecisionId.
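One way to picture the envelope requirement is a small sequencer that stamps each event with the mandatory fields and attaches only the correlation IDs that are actually known. This is a sketch under assumptions: `RealtimeEvent` and `EventSequencer` are hypothetical names, and the per-session counter starting at 1 is an illustrative choice, not something the contract specifies.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Correlation IDs named in the text; each one is optional on any given event.
CORRELATION_KEYS = (
    "demoId", "runId", "traceId", "traceparent", "toolProposalId", "hitlDecisionId",
)


@dataclass
class RealtimeEvent:
    type: str
    sessionId: str
    sequence: int
    timestamp: str
    correlation: dict = field(default_factory=dict)

    def to_record(self) -> dict:
        """Flatten into the dict that would be persisted."""
        record = {
            "type": self.type,
            "sessionId": self.sessionId,
            "sequence": self.sequence,
            "timestamp": self.timestamp,
        }
        record.update(self.correlation)
        return record


class EventSequencer:
    """Assigns a monotonically increasing sequence number per session."""

    def __init__(self, session_id: str):
        self.session_id = session_id
        self._counter = count(1)

    def emit(self, event_type: str, **correlation) -> RealtimeEvent:
        unknown = set(correlation) - set(CORRELATION_KEYS)
        if unknown:
            raise ValueError(f"unknown correlation IDs: {sorted(unknown)}")
        # Keep only IDs that are actually available for this event.
        present = {k: v for k, v in correlation.items() if v is not None}
        return RealtimeEvent(
            type=event_type,
            sessionId=self.session_id,
            sequence=next(self._counter),
            timestamp=datetime.now(timezone.utc).isoformat(),
            correlation=present,
        )
```

For example, `EventSequencer("ras-1").emit("tool.proposed", traceId="t-1", toolProposalId=None)` would yield a record with `traceId` but no `toolProposalId` key.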

Event Crosswalk

Realtime control must compose with existing streams instead of replacing them.

| Realtime event | Maps to | Durable evidence |
| --- | --- | --- |
| session.created / session.ended | AG-UI lifecycle status; media broker audit | realtime-agent-session.v0 |
| media.input.* | Provider/WebRTC media event; optional AG-UI status | media ref, consent state, redaction status |
| transcript.delta / transcript.final | AG-UI text delta or caption state | transcript media ref |
| agent.text.delta / agent.audio.delta | AG-UI assistant delta; provider audio output | transcript/caption ref, optional ephemeral audio ref |
| hypergraph.trace.* | active hypergraph activation, gate, final, done SSE | hypergraphTraceIds and trace refs |
| tool.proposed | AG-UI tool proposal panel | typed proposal with risk, args summary, trace ID |
| tool.blocked / tool.approved | backend policy/HITL gate result | HITL decision, CapabilityGrant, rollback posture |
| agent.interruption | AG-UI cancellation/turn state | event log only unless retained in transcript |

Cancellation is fail-closed: ending, pausing, revoking consent, disconnecting, or losing policy context stops media capture, stops provider streaming, rejects pending tool proposals, and writes a terminal session event.
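The fail-closed rule can be sketched as a single `cancel` path that every trigger funnels into. The class and field names here are hypothetical, and using `session.ended` as the terminal event for every trigger (including pause) is an assumption made for the illustration.

```python
from dataclasses import dataclass, field

# Triggers named in the paragraph above; anything else is handled the same way.
CANCEL_TRIGGERS = {
    "ended", "paused", "consent_revoked", "disconnected", "policy_context_lost",
}


@dataclass
class LiveSession:
    session_id: str
    capturing: bool = True
    provider_streaming: bool = True
    proposals: dict = field(default_factory=dict)  # proposalId -> status
    events: list = field(default_factory=list)
    closed: bool = False

    def cancel(self, trigger: str) -> None:
        """Stop everything and record one terminal event, whatever the trigger."""
        if self.closed:
            return  # idempotent: never write a second terminal event
        reason = trigger if trigger in CANCEL_TRIGGERS else "unknown"
        self.capturing = False
        self.provider_streaming = False
        for proposal_id, status in self.proposals.items():
            if status == "pending":
                self.proposals[proposal_id] = "rejected"
        self.events.append(
            {"type": "session.ended", "sessionId": self.session_id, "reason": reason}
        )
        self.closed = True
```

Routing unknown triggers through the same path keeps the default outcome "stopped" rather than "still streaming".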

Hypergraph Authority Boundary

Live model output is not automatically platform truth.

For factual, operational, or action-bearing claims:

  1. The realtime agent normalizes intent.
  2. The hypergraph activates the relevant object slice.
  3. Policy filters the active set.
  4. The system judges sufficiency.
  5. The model renders language/audio only from the admitted result.
  6. Tool calls remain blocked until policy and HITL gates pass.

If the graph is insufficient, the agent must say so and ask for missing context, instead of improvising.
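Steps 2 through 5 can be sketched as a single function that either admits a set of objects for rendering or reports what is missing. Everything about the data shapes is an assumption made for the example: the graph is a plain list of dicts with `id`, `kind`, and `topics`, `policy_allows` is a caller-supplied predicate, and sufficiency is reduced to "every required kind is present".

```python
def answer_claim(intent, graph, policy_allows, required_kinds):
    """Activate, filter, and judge sufficiency before anything is rendered."""
    # Step 2: activate the slice of objects relevant to the normalized intent.
    active = [obj for obj in graph if intent in obj["topics"]]
    # Step 3: policy filters the active set.
    admitted = [obj for obj in active if policy_allows(obj)]
    # Step 4: judge sufficiency (here: are all required object kinds present?).
    missing = sorted(set(required_kinds) - {obj["kind"] for obj in admitted})
    if missing:
        # Insufficient graph: say so and ask for context instead of improvising.
        return {"status": "insufficient", "missing": missing}
    # Step 5: the model may render language/audio only from these objects.
    return {"status": "admitted", "render_from": [obj["id"] for obj in admitted]}
```

The point of the shape is that the model never receives anything outside `render_from`, so a policy-filtered object cannot leak into the spoken answer.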

Realtime media is advisory input. A realtime session may not create, update, delete, deploy, rotate, assign, or mutate production or twin state unless the proposed action is represented as a typed tool proposal, bound to an active hypergraph trace, authorized by backend policy and a time-boxed CapabilityGrant, and approved through the required HITL gate.

Screen, camera, audio, transcript, and model-visible context are untrusted inputs. They cannot grant tools, change policy, bypass HITL, or override system/developer instructions.
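The conditions for an action-bearing proposal can be read as a conjunction that defaults to blocked. The sketch below is illustrative only: `ToolProposal`, `CapabilityGrant`, and `HitlDecision` are hypothetical shapes (the real contracts may carry more fields), and the gate returns the `tool.approved` / `tool.blocked` event names from the adapter vocabulary along with the reasons.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolProposal:
    proposal_id: str
    tool: str
    trace_id: Optional[str]  # active hypergraph trace, if any


@dataclass(frozen=True)
class CapabilityGrant:
    tool: str
    not_before: float  # epoch seconds
    expires_at: float  # epoch seconds; the grant is time-boxed


@dataclass(frozen=True)
class HitlDecision:
    proposal_id: str
    approved: bool


def gate_tool_proposal(proposal, grant, hitl, now):
    """Return (event_name, reasons); anything short of all checks passing is blocked."""
    reasons = []
    if not proposal.trace_id:
        reasons.append("not bound to an active hypergraph trace")
    if grant is None or grant.tool != proposal.tool:
        reasons.append("no CapabilityGrant for this tool")
    elif not (grant.not_before <= now < grant.expires_at):
        reasons.append("CapabilityGrant outside its time box")
    if hitl is None or hitl.proposal_id != proposal.proposal_id:
        reasons.append("no HITL decision for this proposal")
    elif not hitl.approved:
        reasons.append("HITL decision was a rejection")
    return ("tool.blocked" if reasons else "tool.approved", reasons)
```

Note that nothing in the function reads transcript, screen, or model output: untrusted inputs have no parameter through which they could influence the decision.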

Demo And Twin Integration

Realtime sessions should become another Demo Harness evidence source.

Suggested manifest evidence kinds:

[
  { "kind": "realtime-session", "retention": "pr-artifact", "sensitivity": "internal" },
  { "kind": "transcript", "retention": "pr-artifact", "sensitivity": "internal" },
  { "kind": "audio", "retention": "ephemeral", "sensitivity": "restricted" },
  { "kind": "video", "retention": "ephemeral", "sensitivity": "restricted" },
  { "kind": "hypergraph-trace", "retention": "release-evidence", "sensitivity": "internal" }
]

Demo/twin run outputs should include:

  • session ID;
  • media mode;
  • consent state;
  • transcript reference;
  • model/provider reference;
  • hypergraph trace IDs;
  • tool proposals;
  • HITL decisions;
  • redaction status;
  • retention policy;
  • reviewer notes.

Contract anchors:

  • contracts/realtime-agent-session.schema.json
  • contracts/realtime-agent-session.example.json
  • tools/validate-realtime-agent-session.py

Product Modes

| Mode | Description | First use |
| --- | --- | --- |
| Voice review | Human talks through a demo with an agent. | Demo Harness. |
| Voice + screen | Agent sees shared screen or app state while guiding review. | Guided UI demos. |
| Voice + operational trace | Live conversation is grounded to active hypergraph traces. | Capability/adoption questions. |
| Video co-presence | Human, agent, and optional peers share a live room. | Design/HITL sessions. |
| Twin control room | Live session drives a digital/operational twin with gates. | Self-modeled systems engineering. |

Safety Requirements

Consent is per mode:

  • microphone input;
  • speaker/audio output;
  • camera input;
  • screen share;
  • transcription/captions;
  • recording;
  • model-visible context;
  • evidence retention.

Consent changes are session events. Revocation immediately stops capture, provider streaming, recording, and transcription for the revoked mode, and cancels any tool approvals still pending against it. Multi-party rooms require visible participant and recording state for all participants, including late joiners.
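A per-mode consent ledger might look like the sketch below. The mode keys, the `consent.changed` event name, and the idea of tagging each pending approval with the mode it depends on are all assumptions for illustration; none of them come from the session contract.

```python
from dataclasses import dataclass, field

# One key per consent mode listed above (key spellings are hypothetical).
CONSENT_MODES = frozenset({
    "microphone", "speaker", "camera", "screenShare",
    "transcription", "recording", "modelContext", "evidenceRetention",
})


@dataclass
class ConsentLedger:
    session_id: str
    granted: set = field(default_factory=set)
    pending_approvals: dict = field(default_factory=dict)  # proposalId -> mode
    events: list = field(default_factory=list)

    def _record(self, mode: str, state: str) -> None:
        # Every consent change is written as a session event.
        self.events.append(
            {"type": "consent.changed", "sessionId": self.session_id,
             "mode": mode, "state": state}
        )

    def grant(self, mode: str) -> None:
        if mode not in CONSENT_MODES:
            raise ValueError(f"unknown consent mode: {mode}")
        self.granted.add(mode)
        self._record(mode, "granted")

    def revoke(self, mode: str) -> list:
        """Revoke one mode and return the approvals cancelled as a result."""
        if mode not in CONSENT_MODES:
            raise ValueError(f"unknown consent mode: {mode}")
        self.granted.discard(mode)
        cancelled = [pid for pid, m in self.pending_approvals.items() if m == mode]
        for pid in cancelled:
            del self.pending_approvals[pid]
        self._record(mode, "revoked")
        return cancelled

    def is_allowed(self, mode: str) -> bool:
        return mode in self.granted
```

Because consent is tracked per mode, revoking `screenShare` leaves `microphone` untouched, which matches the "per mode" requirement.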

Media Retention Matrix

| Artifact | Default retention | Default sensitivity | Promotion rule |
| --- | --- | --- | --- |
| Raw audio | ephemeral | restricted | Do not promote to PR/release evidence without explicit privacy approval. |
| Raw video/screen | ephemeral | restricted | Prefer hashes, transcript refs, and reviewer notes over raw media. |
| Transcript/captions | pr-artifact | internal | Redaction must pass before release evidence. |
| Hypergraph trace IDs | release-evidence | internal | Store IDs/refs, not raw media. |
| HITL/tool decisions | release-evidence | internal | Required for action-bearing sessions. |

Raw audio, video, and screen media default to ephemeral retention and must not be promoted to PR artifacts or release evidence. Evidence should prefer hashes, transcript references, redaction status, trace IDs, and reviewer decisions over raw media content.
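The matrix can be encoded as data plus a small promotion check. This is a sketch under assumptions: the artifact keys and `can_promote` are hypothetical, the ordering ephemeral < pr-artifact < release-evidence is inferred from the retention names, and extending the "explicit privacy approval" exception from raw audio to raw video/screen is a guess rather than something the matrix states.

```python
# Retention tiers from least to most durable (ordering is an assumption).
RETENTION_ORDER = ["ephemeral", "pr-artifact", "release-evidence"]

# Defaults copied from the matrix; the keys are hypothetical identifiers.
MATRIX = {
    "raw-audio": {"retention": "ephemeral", "sensitivity": "restricted"},
    "raw-video-screen": {"retention": "ephemeral", "sensitivity": "restricted"},
    "transcript": {"retention": "pr-artifact", "sensitivity": "internal"},
    "hypergraph-trace-ids": {"retention": "release-evidence", "sensitivity": "internal"},
    "hitl-tool-decisions": {"retention": "release-evidence", "sensitivity": "internal"},
}


def can_promote(artifact, target, *, redaction_passed=False, privacy_approved=False):
    """Return True if the artifact may be stored at the target retention tier."""
    entry = MATRIX[artifact]
    if RETENTION_ORDER.index(target) <= RETENTION_ORDER.index(entry["retention"]):
        return True  # at or below the default tier, so nothing is being promoted
    if entry["sensitivity"] == "restricted":
        # Raw media: blocked unless privacy approval was explicitly recorded.
        return privacy_approved
    if artifact == "transcript" and target == "release-evidence":
        return redaction_passed
    return False  # no rule permits this promotion, so refuse by default
```

For example, `can_promote("transcript", "release-evidence")` is refused until `redaction_passed=True` is supplied.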

Baseline Controls

Minimum requirements:

  • explicit consent before microphone, camera, screen, transcription, recording, model processing, or evidence retention;
  • visible capture, recording, transcription, and model-visible indicators;
  • user can mute, pause, and end session;
  • no raw provider key in browser;
  • short-lived media credentials with mode scope, user/session binding, and revocation;
  • raw audio/video/screen default to ephemeral retention and restricted sensitivity;
  • transcripts pass redaction before release evidence;
  • tool calls require policy gate and HITL when action-bearing;
  • no production mutation from live conversation alone;
  • every action proposal links to a hypergraph trace or states that none exists.

Special handling:

  • voiceprints, faces, and biometrics are sensitive;
  • voice cannot authenticate a user; identity comes from backend auth/session state, not speaker recognition;
  • face/voiceprint identification, emotion inference, and biometric enrollment are out of scope until separate privacy/security review;
  • employee/customer conversations require stricter retention and privacy review;
  • screen-share may expose secrets and must be redacted or ephemeral by default;
  • screen-share text cannot grant tools or override policy;
  • provider/subprocessor inventory is required before employee, customer, or external production use;
  • realtime agents must support interruption and correction.

First Implementation Slice

The first slice should not attempt full production video rooms.

Build this:

  1. Add realtime.agent.interface as a planned capability.
  2. Add RealtimeSession evidence shape to the Demo Harness plan.
  3. Add a browser panel that can attach a transcript/reference to a demo run.
  4. Add provider-agnostic session metadata:
{
  "schemaVersion": "realtime-agent-session.v0",
  "sessionId": "ras-...",
  "demoId": "demo-harness.twin-driver",
  "modes": ["audio", "transcript"],
  "transport": "webrtc",
  "providerMetadata": {
    "provider": "openai-realtime",
    "adapter": "native-realtime"
  },
  "hypergraphTraceIds": [],
  "retention": "pr-artifact",
  "sensitivity": "internal",
  "consent": {
    "recording": false,
    "transcription": true,
    "screenShare": false
  }
}
  5. Bind the session to the mock demo-harness.twin-driver capsule.
  6. Require HITL approval before the session can promote any recommendation.
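For the metadata in step 4, a lightweight consistency check could look like the sketch below. It is not the `tools/validate-realtime-agent-session.py` tool and does not use the JSON Schema contract; `check_session_metadata` and the `MODE_CONSENT` mapping from a mode to the consent flag it needs are assumptions made for the example.

```python
import json

REQUIRED_KEYS = {
    "schemaVersion", "sessionId", "demoId", "modes", "transport",
    "providerMetadata", "hypergraphTraceIds", "retention", "sensitivity", "consent",
}

# Assumed mapping from a media mode to the consent flag that must be true for it.
MODE_CONSENT = {
    "transcript": "transcription",
    "screen": "screenShare",
    "recording": "recording",
}


def check_session_metadata(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    missing = REQUIRED_KEYS - set(doc)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    if doc["schemaVersion"] != "realtime-agent-session.v0":
        problems.append(f"unexpected schemaVersion: {doc['schemaVersion']}")
    for mode in doc["modes"]:
        flag = MODE_CONSENT.get(mode)
        if flag and not doc["consent"].get(flag, False):
            problems.append(f"mode '{mode}' lacks consent.{flag}")
    return problems


# The example metadata from step 4; "ras-..." is the placeholder used there.
SAMPLE = json.loads("""
{
  "schemaVersion": "realtime-agent-session.v0",
  "sessionId": "ras-...",
  "demoId": "demo-harness.twin-driver",
  "modes": ["audio", "transcript"],
  "transport": "webrtc",
  "providerMetadata": {"provider": "openai-realtime", "adapter": "native-realtime"},
  "hypergraphTraceIds": [],
  "retention": "pr-artifact",
  "sensitivity": "internal",
  "consent": {"recording": false, "transcription": true, "screenShare": false}
}
""")
```

Calling `check_session_metadata(SAMPLE)` returns no problems, while adding `"screen"` to `modes` without setting `consent.screenShare` would be reported.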

| Concern | Recommendation |
| --- | --- |
| Browser media | WebRTC. |
| Multi-party rooms/SFU | LiveKit first; keep adapter boundary open. |
| Provider-native realtime AI | OpenAI Realtime as a primary adapter, not a lock-in. |
| Provider-flexible voice | Cascaded STT -> hypergraph/LLM -> TTS path. |
| UI control stream | Existing AG-UI/SSE and active hypergraph SSE. |
| Tool execution | MCP/A2A through backend policy. |
| Evidence | Demo Harness artifacts plus transcript/media refs. |
