Realtime Agent Interface¶

Position: realtime audio and video are interaction surfaces, not the authority layer. The live agent can listen, speak, see, point, interrupt, and guide, but platform truth still comes from the operational hypergraph, contracts, evidence, policies, and HITL gates.

Goal¶

Design a live agent interface for untool.ai that supports:

low-latency voice interaction;
optional video and screen context;
live captions and transcript;
agent/human interruption;
tool and runbook gating;
demo/twin evidence capture;
operational hypergraph traceability.

The interface should work for demos first, then mature into an operational twin control surface.

Current Baseline¶

The platform already has streaming foundations:

AG-UI stream: SSE event protocol for frontend agent interaction.
Active hypergraph stream: object-native SSE events for activation, gate, final, and done.
OpenAI-compatible adapters: chat and Responses streaming over the same hypergraph authority path.
Demo Harness: evidence, approval, simulation, twin bindings, and review surface.

The missing plane is media: browser microphone, speaker, camera, screen share, realtime model transport, recording, transcript, consent, and media evidence.

Architecture¶

flowchart LR
  Browser["Browser UI<br/>mic, speaker, camera, screen, captions"]
  MediaBroker["Realtime Media Broker<br/>session, consent, ephemeral tokens"]
  MediaPlane["Media Plane<br/>WebRTC / SFU / recording"]
  ModelAdapter["Realtime Model Adapter<br/>speech, vision, proposals"]
  Hypergraph["Operational Hypergraph<br/>truth, policy, sufficiency, trace"]
  Policy["Policy + HITL Gate<br/>CapabilityGrant, approval, rollback posture"]
  Tools["Tool Gateway<br/>MCP / A2A / runbooks"]
  Evidence["Evidence Store<br/>transcript, trace, media refs, HITL"]
  Demo["Demo / Twin Harness<br/>guided review and control"]

  Browser <--> MediaBroker
  MediaBroker <--> MediaPlane
  MediaBroker <--> ModelAdapter
  ModelAdapter <--> Hypergraph
  ModelAdapter -->|tool.proposed| Policy
  Hypergraph --> Policy
  Policy -->|approved only| Tools
  Policy --> Evidence
  Hypergraph --> Evidence
  MediaPlane --> Evidence
  Browser --> Demo
  Demo --> Evidence

1. Browser Client¶

The browser owns user interaction:

microphone, speaker, camera, and screen-share controls;
consent and recording indicators;
live captions and transcript timeline;
push-to-talk and mute;
barge-in / interrupt;
tool-call and HITL approval panels;
current hypergraph trace link;
current demo/twin run link.

The browser must never hold a long-lived provider API key. It receives only a short-lived session credential from the backend.

2. Realtime Media Broker¶

The broker is the server-side control point.

Responsibilities:

create realtime sessions;
mint ephemeral client credentials;
attach user, tenant, repo, demo, work unit, and policy context;
authorize media modes: audio, video, screen, recording, transcription;
enforce JWT/BFF identity, media-mode scope, TTL, revocation, and consent version;
route model events to the hypergraph authority path;
convert model tool requests into typed tool.proposed events;
route approved tool calls through MCP/A2A only after policy and HITL gates;
write session evidence.

This broker belongs behind backend-core auth and policy. It is not a direct frontend-to-provider pass-through.
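
A minimal sketch of how the broker might mint and check a mode-scoped, short-lived credential. The HMAC token format, `BROKER_SECRET`, `mint_credential`, and `verify_credential` are illustrative assumptions, not the platform's credential contract; a real deployment would likely build on the existing JWT/BFF identity layer instead.

```python
import base64
import hashlib
import hmac
import json
import time

# Hypothetical broker-side secret; a real broker would load this from a secret manager.
BROKER_SECRET = b"demo-only-secret"

# Media modes named in the broker responsibilities list.
ALLOWED_MODES = {"audio", "video", "screen", "recording", "transcription"}


def mint_credential(session_id, user_id, modes, ttl_seconds=60, now=None):
    """Return a signed token bound to one session, one user, and a set of media modes."""
    unknown = set(modes) - ALLOWED_MODES
    if unknown:
        raise ValueError(f"unknown media modes: {sorted(unknown)}")
    issued = int(now if now is not None else time.time())
    claims = {
        "sessionId": session_id,
        "userId": user_id,
        "modes": sorted(modes),
        "exp": issued + ttl_seconds,
    }
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(BROKER_SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{signature}"


def verify_credential(token, mode, revoked_sessions=frozenset(), now=None):
    """Return the claims if the token is valid for `mode`; return None otherwise (fail closed)."""
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(BROKER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    current = now if now is not None else time.time()
    if current >= claims["exp"]:
        return None  # TTL elapsed
    if claims["sessionId"] in revoked_sessions:
        return None  # session revoked by the broker
    if mode not in claims["modes"]:
        return None  # mode was never authorized for this credential
    return claims
```

The browser would only ever see the returned token. The provider key and the signing secret stay server-side, and every failure path returns `None` so that a caller cannot mistake an invalid credential for a usable one.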

3. Media Plane¶

Use WebRTC for live media. WebSockets/SSE are acceptable for control/event streams, but not as the primary audio/video transport.

Recommended path:

| Stage | Media plane |
| --- | --- |
| Prototype | Browser WebRTC to provider or local broker with broker-minted, mode-scoped, short-lived session credentials. |
| Team demos | LiveKit or equivalent SFU-backed room for voice, video, screen share, recording, and multi-party presence. |
| Production | Media broker + SFU + provider adapter + recording/transcript pipeline + policy gates. |

Avoid building low-level WebRTC infrastructure from scratch until a concrete requirement forces it.

A direct browser-to-provider connection is allowed only for media transport. Session creation, tools, policy, HITL, evidence, and trace correlation still route through services controlled by backend-core.

4. Realtime Model Adapter¶

The adapter hides provider details. It should support two execution styles:

| Style | Use |
| --- | --- |
| Native realtime model | Low-latency speech-to-speech, interruption, multimodal input, fast demos. |
| Cascaded pipeline | Provider flexibility: streaming STT -> hypergraph/LLM reasoning -> streaming TTS. |

The adapter emits normalized events. In v0 these are not a parallel public stream grammar; they are the adapter-local vocabulary that maps onto AG-UI, active hypergraph SSE, provider events, and the durable RealtimeSession evidence contract.

session.created
media.input.audio
media.input.video
transcript.delta
transcript.final
agent.audio.delta
agent.text.delta
agent.interruption
tool.proposed
tool.blocked
tool.approved
hypergraph.trace.started
hypergraph.trace.updated
hypergraph.trace.final
hitl.requested
hitl.decided
session.ended

Every externally persisted event must carry sessionId, sequence, and timestamp, plus whichever of these correlation IDs are available: demoId, runId, traceId, traceparent, toolProposalId, and hitlDecisionId.
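
One way to enforce that envelope is to stamp it at a single emit point. The sketch below is illustrative only: `RealtimeEvent`, `EventEmitter`, and the nested `correlation` dict are assumed names and shapes, not the `realtime-agent-session.v0` contract.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Correlation IDs listed in the requirement above.
CORRELATION_KEYS = (
    "demoId", "runId", "traceId", "traceparent", "toolProposalId", "hitlDecisionId",
)


@dataclass
class RealtimeEvent:
    type: str
    sessionId: str
    sequence: int
    timestamp: str
    correlation: dict = field(default_factory=dict)
    payload: dict = field(default_factory=dict)


class EventEmitter:
    """Stamps every event with the session ID, a per-session sequence, and a UTC timestamp."""

    def __init__(self, session_id):
        self.session_id = session_id
        self._sequence = count(1)

    def emit(self, event_type, payload=None, **correlation):
        unknown = set(correlation) - set(CORRELATION_KEYS)
        if unknown:
            raise ValueError(f"unknown correlation IDs: {sorted(unknown)}")
        # Keep only the IDs that are actually available for this event.
        present = {key: value for key, value in correlation.items() if value is not None}
        return RealtimeEvent(
            type=event_type,
            sessionId=self.session_id,
            sequence=next(self._sequence),
            timestamp=datetime.now(timezone.utc).isoformat(),
            correlation=present,
            payload=payload or {},
        )
```

Keeping the sequence counter inside the emitter means no caller can persist an event without ordering information, which is what makes the evidence log replayable.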

Event Crosswalk¶

Realtime control must compose with existing streams instead of replacing them.

| Realtime event | Maps to | Durable evidence |
| --- | --- | --- |
| `session.created` / `session.ended` | AG-UI lifecycle status; media broker audit | `realtime-agent-session.v0` |
| `media.input.*` | Provider/WebRTC media event; optional AG-UI status | media ref, consent state, redaction status |
| `transcript.delta` / `transcript.final` | AG-UI text delta or caption state | transcript media ref |
| `agent.text.delta` / `agent.audio.delta` | AG-UI assistant delta; provider audio output | transcript/caption ref, optional ephemeral audio ref |
| `hypergraph.trace.*` | active hypergraph `activation`, `gate`, `final`, `done` SSE | `hypergraphTraceIds` and trace refs |
| `tool.proposed` | AG-UI tool proposal panel | typed proposal with risk, args summary, trace ID |
| `tool.blocked` / `tool.approved` | backend policy/HITL gate result | HITL decision, `CapabilityGrant`, rollback posture |
| `agent.interruption` | AG-UI cancellation/turn state | event log only unless retained in transcript |
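
The crosswalk can be held as data so that the adapter never invents a mapping at runtime. This is a hedged sketch: the `CROSSWALK` list and `map_event` helper are hypothetical, and the target strings are copied from the table rather than from any real routing API.

```python
from fnmatch import fnmatchcase

# (event pattern, stream target) pairs transcribed from the crosswalk table.
CROSSWALK = [
    ("session.created", "AG-UI lifecycle status; media broker audit"),
    ("session.ended", "AG-UI lifecycle status; media broker audit"),
    ("media.input.*", "Provider/WebRTC media event; optional AG-UI status"),
    ("transcript.delta", "AG-UI text delta or caption state"),
    ("transcript.final", "AG-UI text delta or caption state"),
    ("agent.text.delta", "AG-UI assistant delta; provider audio output"),
    ("agent.audio.delta", "AG-UI assistant delta; provider audio output"),
    ("hypergraph.trace.*", "active hypergraph activation, gate, final, done SSE"),
    ("tool.proposed", "AG-UI tool proposal panel"),
    ("tool.blocked", "backend policy/HITL gate result"),
    ("tool.approved", "backend policy/HITL gate result"),
    ("agent.interruption", "AG-UI cancellation/turn state"),
]


def map_event(event_type):
    """Return the stream target for an adapter event, or None if the table has no row for it."""
    for pattern, target in CROSSWALK:
        if fnmatchcase(event_type, pattern):
            return target
    return None
```

Event types that the table does not list, such as `hitl.requested`, return `None` here, which makes any gap between the adapter vocabulary and the crosswalk visible instead of silently routed.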

Cancellation is fail-closed: ending, pausing, revoking consent, disconnecting, or losing policy context stops media capture, stops provider streaming, rejects pending tool proposals, and writes a terminal session event.
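
A fail-closed cancellation path might look like the sketch below. `LiveSession`, the trigger names, and the choice of `tool.blocked` and `session.ended` as the recorded events are assumptions made for illustration; the real session object and terminal event are defined elsewhere.

```python
from dataclasses import dataclass, field

# Triggers named in the paragraph above, under hypothetical identifiers.
CANCEL_TRIGGERS = {
    "ended", "paused", "consent_revoked", "disconnected", "policy_context_lost",
}


@dataclass
class LiveSession:
    session_id: str
    capturing: bool = True
    provider_streaming: bool = True
    ended: bool = False
    pending_proposals: list = field(default_factory=list)
    events: list = field(default_factory=list)

    def cancel(self, reason):
        """Stop everything first, then record what was rejected and why."""
        if reason not in CANCEL_TRIGGERS:
            raise ValueError(f"unknown cancellation trigger: {reason}")
        if self.ended:
            return  # already terminal; do not write a second terminal event
        self.capturing = False
        self.provider_streaming = False
        for proposal_id in self.pending_proposals:
            self.events.append(
                {"type": "tool.blocked", "toolProposalId": proposal_id, "reason": reason}
            )
        self.pending_proposals.clear()
        self.events.append(
            {"type": "session.ended", "sessionId": self.session_id, "reason": reason}
        )
        self.ended = True
```

Capture and streaming are switched off before any evidence is written, so a failure while logging cannot leave media flowing.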

Hypergraph Authority Boundary¶

Live model output is not automatically platform truth.

For factual, operational, or action-bearing claims:

1. The realtime agent normalizes intent.
2. The hypergraph activates the relevant object slice.
3. Policy filters the active set.
4. The system judges sufficiency.
5. The model renders language/audio only from the admitted result.
6. Tool calls remain blocked until policy and HITL gates pass.

If the graph is insufficient, the agent must say so and ask for missing context, instead of improvising.
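
The grounding steps can be sketched as a small pipeline. Everything here is a toy stand-in: the dict-backed `GRAPH`, the `visibility` field, and the `policy_allows`, `is_sufficient`, and `render` helpers are hypothetical and do not reflect the hypergraph's actual API.

```python
def normalize_intent(utterance):
    """Toy normalizer: lowercase the utterance and join words with hyphens."""
    return "-".join(utterance.lower().split())


def answer_claim(utterance, graph, policy_allows, is_sufficient, render):
    intent = normalize_intent(utterance)                       # normalize intent
    active = graph.get(intent, [])                             # activate the object slice
    admitted = [obj for obj in active if policy_allows(obj)]   # policy filters the active set
    if not is_sufficient(admitted):                            # judge sufficiency
        return {
            "status": "insufficient",
            "text": "I do not have enough grounded context for that. What should I look at?",
        }
    return {"status": "grounded", "text": render(admitted)}    # render only admitted results


# Toy data and helpers, for illustration only.
GRAPH = {
    "deploy-status": [
        {"id": "svc-a", "state": "healthy", "visibility": "internal"},
        {"id": "svc-b", "state": "degraded", "visibility": "restricted"},
    ]
}


def policy_allows(obj):
    return obj["visibility"] == "internal"


def is_sufficient(admitted):
    return len(admitted) > 0


def render(admitted):
    return "; ".join(f"{obj['id']} is {obj['state']}" for obj in admitted)
```

The renderer never sees objects that policy filtered out, and an empty admitted set yields a request for context instead of generated text.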

Realtime media is advisory input. A realtime session may not create, update, delete, deploy, rotate, assign, or mutate production or twin state unless the proposed action is represented as a typed tool proposal, bound to an active hypergraph trace, authorized by backend policy and a time-boxed CapabilityGrant, and approved through the required HITL gate.
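
The four conditions in that rule can be checked together so that a blocked proposal reports every missing precondition. This is a sketch under stated assumptions: `ToolProposal`, its fields, and `gate` are hypothetical names, and the policy, grant, and HITL results are modelled as plain fields rather than calls to real services.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolProposal:
    proposal_id: str
    tool: str
    trace_id: Optional[str]            # active hypergraph trace, if any
    policy_allowed: bool               # backend policy result
    grant_expires_at: Optional[float]  # expiry of the time-boxed CapabilityGrant, epoch seconds
    hitl_approved: bool                # outcome of the required HITL gate


def gate(proposal, now):
    """Return ("tool.approved", []) only when every precondition holds."""
    reasons = []
    if not proposal.trace_id:
        reasons.append("no active hypergraph trace")
    if not proposal.policy_allowed:
        reasons.append("backend policy denied")
    if proposal.grant_expires_at is None or proposal.grant_expires_at <= now:
        reasons.append("CapabilityGrant missing or expired")
    if not proposal.hitl_approved:
        reasons.append("HITL approval missing")
    if reasons:
        return ("tool.blocked", reasons)
    return ("tool.approved", [])
```

Because the default outcome is `tool.blocked`, adding a new precondition later can only make the gate stricter.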

Screen, camera, audio, transcript, and model-visible context are untrusted inputs. They cannot grant tools, change policy, bypass HITL, or override system/developer instructions.

Demo And Twin Integration¶

Realtime sessions should become another Demo Harness evidence source.

Suggested manifest evidence kinds:

[
  { "kind": "realtime-session", "retention": "pr-artifact", "sensitivity": "internal" },
  { "kind": "transcript", "retention": "pr-artifact", "sensitivity": "internal" },
  { "kind": "audio", "retention": "ephemeral", "sensitivity": "restricted" },
  { "kind": "video", "retention": "ephemeral", "sensitivity": "restricted" },
  { "kind": "hypergraph-trace", "retention": "release-evidence", "sensitivity": "internal" }
]

Demo/twin run outputs should include:

session ID;
media mode;
consent state;
transcript reference;
model/provider reference;
hypergraph trace IDs;
tool proposals;
HITL decisions;
redaction status;
retention policy;
reviewer notes.

Contract anchors:

contracts/realtime-agent-session.schema.json
contracts/realtime-agent-session.example.json
tools/validate-realtime-agent-session.py

Product Modes¶

| Mode | Description | First use |
| --- | --- | --- |
| Voice review | Human talks through a demo with an agent. | Demo Harness. |
| Voice + screen | Agent sees shared screen or app state while guiding review. | Guided UI demos. |
| Voice + operational trace | Live conversation is grounded to active hypergraph traces. | Capability/adoption questions. |
| Video co-presence | Human, agent, and optional peers share a live room. | Design/HITL sessions. |
| Twin control room | Live session drives a digital/operational twin with gates. | Self-modeled systems engineering. |

Safety Requirements¶

Consent is per mode:

microphone input;
speaker/audio output;
camera input;
screen share;
transcription/captions;
recording;
model-visible context;
evidence retention.

Consent changes are session events. Revocation immediately stops capture, provider streaming, recording, transcription, and pending tool approvals for the revoked mode. Multi-party rooms require visible participant and recording state for all participants, including late joiners.
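
Per-mode consent can be modelled as a default-deny ledger whose changes are themselves logged. The sketch is illustrative: `ConsentLedger`, the mode identifiers, and the `consent.granted` / `consent.revoked` event names are assumptions and are not part of the adapter vocabulary listed earlier.

```python
# One identifier per consent mode in the list above (names are hypothetical).
CONSENT_MODES = (
    "microphone", "speaker", "camera", "screen_share",
    "transcription", "recording", "model_context", "evidence_retention",
)


class ConsentLedger:
    """Tracks consent per mode; nothing is allowed until it is explicitly granted."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.granted = set()
        self.events = []

    def _check(self, mode):
        if mode not in CONSENT_MODES:
            raise ValueError(f"unknown consent mode: {mode}")

    def _record(self, event_type, mode):
        # Every consent change becomes a sequenced session event.
        self.events.append({
            "type": event_type,
            "sessionId": self.session_id,
            "mode": mode,
            "sequence": len(self.events) + 1,
        })

    def grant(self, mode):
        self._check(mode)
        self.granted.add(mode)
        self._record("consent.granted", mode)

    def revoke(self, mode):
        self._check(mode)
        self.granted.discard(mode)
        self._record("consent.revoked", mode)

    def allows(self, mode):
        return mode in self.granted
```

Capture code would call `allows(mode)` before touching a device or forwarding a frame, so revoking one mode stops that mode without affecting the others.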

Media Retention Matrix¶

| Artifact | Default retention | Default sensitivity | Promotion rule |
| --- | --- | --- | --- |
| Raw audio | `ephemeral` | `restricted` | Do not promote to PR/release evidence without explicit privacy approval. |
| Raw video/screen | `ephemeral` | `restricted` | Prefer hashes, transcript refs, and reviewer notes over raw media. |
| Transcript/captions | `pr-artifact` | `internal` | Redaction must pass before release evidence. |
| Hypergraph trace IDs | `release-evidence` | `internal` | Store IDs/refs, not raw media. |
| HITL/tool decisions | `release-evidence` | `internal` | Required for action-bearing sessions. |

Raw audio, video, and screen media default to ephemeral retention and must not be promoted to PR artifacts or release evidence without explicit privacy approval. Evidence should prefer hashes, transcript references, redaction status, trace IDs, and reviewer decisions over raw media content.
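
The promotion rules can be expressed as one check. This sketch makes two labeled assumptions: the artifact identifiers are hypothetical, and the "explicit privacy approval" exception stated for raw audio is applied to all raw media, which is one reading of the matrix and not a confirmed policy.

```python
# Retention levels from least to most durable.
LEVELS = ["ephemeral", "pr-artifact", "release-evidence"]

# Default retention per artifact, taken from the matrix (identifiers are hypothetical).
DEFAULT_RETENTION = {
    "raw-audio": "ephemeral",
    "raw-video-screen": "ephemeral",
    "transcript": "pr-artifact",
    "hypergraph-trace-ids": "release-evidence",
    "hitl-tool-decisions": "release-evidence",
}

RAW_MEDIA = {"raw-audio", "raw-video-screen"}


def can_promote(artifact, target, privacy_approved=False, redaction_passed=False):
    """Return True if `artifact` may be stored at the `target` retention level."""
    current = DEFAULT_RETENTION[artifact]
    if LEVELS.index(target) <= LEVELS.index(current):
        return True  # same or lower level: not a promotion
    if artifact in RAW_MEDIA:
        # Assumption: raw media needs explicit privacy approval for any promotion.
        return privacy_approved
    if artifact == "transcript" and target == "release-evidence":
        return redaction_passed
    return True
```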

Baseline Controls¶

Minimum requirements:

explicit consent before microphone, camera, screen, transcription, recording, model processing, or evidence retention;
visible capture, recording, transcription, and model-visible indicators;
the user can mute, pause, and end the session;
no raw provider key in browser;
short-lived media credentials with mode scope, user/session binding, and revocation;
raw audio/video/screen default to ephemeral retention and restricted sensitivity;
transcripts pass redaction before release evidence;
tool calls require policy gate and HITL when action-bearing;
no production mutation from live conversation alone;
every action proposal links to a hypergraph trace or states that none exists.

Special handling:

voiceprints, faces, and biometrics are sensitive;
voice cannot authenticate a user; identity comes from backend auth/session state, not speaker recognition;
face/voiceprint identification, emotion inference, and biometric enrollment are out of scope until a separate privacy/security review is completed;
employee/customer conversations require stricter retention and privacy review;
screen-share may expose secrets and must be redacted or ephemeral by default;
screen-share text cannot grant tools or override policy;
provider/subprocessor inventory is required before employee, customer, or external production use;
realtime agents must support interruption and correction.

First Implementation Slice¶

The first slice should not attempt full production video rooms.

Build this:

Add realtime.agent.interface as a planned capability.
Add RealtimeSession evidence shape to the Demo Harness plan.
Add a browser panel that can attach a transcript/reference to a demo run.
Add provider-agnostic session metadata:

{
  "schemaVersion": "realtime-agent-session.v0",
  "sessionId": "ras-...",
  "demoId": "demo-harness.twin-driver",
  "modes": ["audio", "transcript"],
  "transport": "webrtc",
  "providerMetadata": {
    "provider": "openai-realtime",
    "adapter": "native-realtime"
  },
  "hypergraphTraceIds": [],
  "retention": "pr-artifact",
  "sensitivity": "internal",
  "consent": {
    "recording": false,
    "transcription": true,
    "screenShare": false
  }
}

Bind the session to the mock demo-harness.twin-driver capsule.
Require HITL approval before the session can promote any recommendation.
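
A shape check for the session metadata could start as small as the sketch below. It does not reproduce `contracts/realtime-agent-session.schema.json` or the existing `tools/validate-realtime-agent-session.py`; the `validate_session` helper and the decision to treat every field from the example as required are assumptions.

```python
SCHEMA_VERSION = "realtime-agent-session.v0"

# Field names come from the example above; requiring all of them is an assumption.
REQUIRED_FIELDS = {
    "schemaVersion": str,
    "sessionId": str,
    "demoId": str,
    "modes": list,
    "transport": str,
    "providerMetadata": dict,
    "hypergraphTraceIds": list,
    "retention": str,
    "sensitivity": str,
    "consent": dict,
}


def validate_session(doc):
    """Return a list of problems; an empty list means the shape looks plausible."""
    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in doc:
            errors.append(f"missing field: {name}")
        elif not isinstance(doc[name], expected):
            errors.append(f"{name} must be of type {expected.__name__}")
    version = doc.get("schemaVersion")
    if isinstance(version, str) and version != SCHEMA_VERSION:
        errors.append(f"unsupported schemaVersion: {version}")
    consent = doc.get("consent")
    if isinstance(consent, dict):
        for mode, value in consent.items():
            if not isinstance(value, bool):
                errors.append(f"consent.{mode} must be true or false")
    return errors


# The example document from this section, with the elided session ID left as written.
SAMPLE = {
    "schemaVersion": "realtime-agent-session.v0",
    "sessionId": "ras-...",
    "demoId": "demo-harness.twin-driver",
    "modes": ["audio", "transcript"],
    "transport": "webrtc",
    "providerMetadata": {"provider": "openai-realtime", "adapter": "native-realtime"},
    "hypergraphTraceIds": [],
    "retention": "pr-artifact",
    "sensitivity": "internal",
    "consent": {"recording": False, "transcription": True, "screenShare": False},
}
```

Returning a list of problems instead of raising on the first one lets a reviewer see every defect in a session record in one pass.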

Recommended Technology Posture¶

| Concern | Recommendation |
| --- | --- |
| Browser media | WebRTC. |
| Multi-party rooms/SFU | LiveKit first; keep adapter boundary open. |
| Provider-native realtime AI | OpenAI Realtime as a primary adapter, not a lock-in. |
| Provider-flexible voice | Cascaded STT -> hypergraph/LLM -> TTS path. |
| UI control stream | Existing AG-UI/SSE and active hypergraph SSE. |
| Tool execution | MCP/A2A through backend policy. |
| Evidence | Demo Harness artifacts plus transcript/media refs. |
