Realtime Agent Interface
Position: realtime audio and video are interaction surfaces, not the authority layer. The live agent can listen, speak, see, point, interrupt, and guide, but platform truth still comes from the operational hypergraph, contracts, evidence, policies, and HITL gates.
Goal
Design a live agent interface for untool.ai that supports:
- low-latency voice interaction;
- optional video and screen context;
- live captions and transcript;
- agent/human interruption;
- tool and runbook gating;
- demo/twin evidence capture;
- operational hypergraph traceability.
The interface should work for demos first, then mature into an operational twin control surface.
Current Baseline
The platform already has streaming foundations:
- AG-UI stream: SSE event protocol for frontend agent interaction.
- Active hypergraph stream: object-native SSE events for activation, gate, final, and done.
- OpenAI-compatible adapters: chat and Responses streaming over the same hypergraph authority path.
- Demo Harness: evidence, approval, simulation, twin bindings, and review surface.
The missing plane is media: browser microphone, speaker, camera, screen share, realtime model transport, recording, transcript, consent, and media evidence.
Architecture
flowchart LR
Browser["Browser UI<br/>mic, speaker, camera, screen, captions"]
MediaBroker["Realtime Media Broker<br/>session, consent, ephemeral tokens"]
MediaPlane["Media Plane<br/>WebRTC / SFU / recording"]
ModelAdapter["Realtime Model Adapter<br/>speech, vision, proposals"]
Hypergraph["Operational Hypergraph<br/>truth, policy, sufficiency, trace"]
Policy["Policy + HITL Gate<br/>CapabilityGrant, approval, rollback posture"]
Tools["Tool Gateway<br/>MCP / A2A / runbooks"]
Evidence["Evidence Store<br/>transcript, trace, media refs, HITL"]
Demo["Demo / Twin Harness<br/>guided review and control"]
Browser <--> MediaBroker
MediaBroker <--> MediaPlane
MediaBroker <--> ModelAdapter
ModelAdapter <--> Hypergraph
ModelAdapter -->|tool.proposed| Policy
Hypergraph --> Policy
Policy -->|approved only| Tools
Policy --> Evidence
Hypergraph --> Evidence
MediaPlane --> Evidence
Browser --> Demo
Demo --> Evidence
1. Browser Client
The browser owns user interaction:
- microphone, speaker, camera, and screen-share controls;
- consent and recording indicators;
- live captions and transcript timeline;
- push-to-talk and mute;
- barge-in / interrupt;
- tool-call and HITL approval panels;
- current hypergraph trace link;
- current demo/twin run link.
The browser must never hold a long-lived provider API key. It receives only a short-lived session credential from the backend.
2. Realtime Media Broker
The broker is the server-side control point.
Responsibilities:
- create realtime sessions;
- mint ephemeral client credentials;
- attach user, tenant, repo, demo, work unit, and policy context;
- authorize media modes: audio, video, screen, recording, transcription;
- enforce JWT/BFF identity, media-mode scope, TTL, revocation, and consent version;
- route model events to the hypergraph authority path;
- convert model tool requests into typed `tool.proposed` events;
- route approved tool calls through MCP/A2A only after policy and HITL gates;
- write session evidence.
This broker belongs behind backend-core auth and policy. It is not a direct frontend-to-provider pass-through.
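The credential-minting and scope-enforcement responsibilities can be sketched as follows. This is a minimal illustration, not the platform's actual scheme: the HMAC-signed token layout, the function names `mint_session_credential` and `verify_session_credential`, the claim names, and the 60-second default TTL are all assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Iterable, Optional, Set

# Media modes the broker may authorize (taken from the responsibilities list).
ALLOWED_MODES = {"audio", "video", "screen", "recording", "transcription"}


def _sign(secret: bytes, payload: bytes) -> str:
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def mint_session_credential(secret: bytes, session_id: str, user_id: str,
                            modes: Iterable[str], ttl_seconds: int = 60,
                            now: Optional[float] = None) -> str:
    """Return a short-lived, mode-scoped token bound to one user and session."""
    modes = sorted(set(modes))
    unknown = set(modes) - ALLOWED_MODES
    if unknown:
        raise ValueError(f"unsupported media modes: {sorted(unknown)}")
    issued_at = time.time() if now is None else now
    claims = {"sid": session_id, "sub": user_id, "modes": modes,
              "exp": issued_at + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode("utf-8"))
    return payload.decode("ascii") + "." + _sign(secret, payload)


def verify_session_credential(secret: bytes, token: str, mode: str,
                              revoked_sessions: Set[str],
                              now: Optional[float] = None) -> dict:
    """Fail closed on bad signature, expiry, revocation, or out-of-scope mode."""
    payload, _, signature = token.partition(".")
    # Check the signature before decoding anything from the payload.
    if not hmac.compare_digest(signature, _sign(secret, payload.encode("ascii"))):
        raise PermissionError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload.encode("ascii")))
    current = time.time() if now is None else now
    if current >= claims["exp"]:
        raise PermissionError("credential expired")
    if claims["sid"] in revoked_sessions:
        raise PermissionError("session revoked")
    if mode not in claims["modes"]:
        raise PermissionError(f"mode not granted: {mode}")
    return claims
```

The browser would only ever see the returned token; the signing secret and any provider API key stay server-side.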
3. Media Plane
Use WebRTC for live media. WebSockets/SSE are acceptable for control/event streams, but not as the primary audio/video transport.
Recommended path:
| Stage | Media plane |
|---|---|
| Prototype | Browser WebRTC to provider or local broker with broker-minted, mode-scoped, short-lived session credentials. |
| Team demos | LiveKit or equivalent SFU-backed room for voice, video, screen share, recording, and multi-party presence. |
| Production | Media broker + SFU + provider adapter + recording/transcript pipeline + policy gates. |
Avoid building low-level WebRTC infrastructure from scratch until a concrete requirement forces it.
Direct browser-to-provider media is allowed only for media transport. Tools, policy, HITL, evidence, trace correlation, and session creation still route through backend-core controlled services.
4. Realtime Model Adapter
The adapter hides provider details. It should support two execution styles:
| Style | Use |
|---|---|
| Native realtime model | Low-latency speech-to-speech, interruption, multimodal input, fast demos. |
| Cascaded pipeline | Provider flexibility: streaming STT -> hypergraph/LLM reasoning -> streaming TTS. |
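To make the cascaded style concrete, here is a minimal sketch of one user turn flowing through STT, reasoning, and TTS. The function name `cascaded_turn` and the three callables are hypothetical stand-ins for whatever providers the adapter wraps; the emitted event names are borrowed from the adapter vocabulary, and the event payload fields (`text`, `audio`) are assumptions.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def cascaded_turn(audio_chunks: Iterable[bytes],
                  stt: Callable[[bytes], str],
                  reason: Callable[[str], str],
                  tts: Callable[[str], bytes]) -> Iterator[Dict[str, object]]:
    """Yield adapter-local events for one turn: STT -> reasoning -> TTS."""
    pieces: List[str] = []
    for chunk in audio_chunks:
        partial = stt(chunk)
        if partial:  # skip silence or chunks with no recognized speech
            pieces.append(partial)
            yield {"type": "transcript.delta", "text": partial}
    utterance = " ".join(pieces)
    yield {"type": "transcript.final", "text": utterance}
    # In the real system this step would go through the hypergraph
    # authority path; here it is an opaque callable.
    reply = reason(utterance)
    yield {"type": "agent.text.delta", "text": reply}
    yield {"type": "agent.audio.delta", "audio": tts(reply)}
```

Because each stage is injected, STT, reasoning, and TTS providers can be swapped independently, which is the flexibility the cascaded row describes. A native realtime model would replace all three callables with one provider session.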
The adapter emits normalized events. In v0 these are not a parallel public
stream grammar; they are the adapter-local vocabulary that maps onto AG-UI,
active hypergraph SSE, provider events, and the durable
RealtimeSession evidence contract.
- `session.created`
- `media.input.audio`
- `media.input.video`
- `transcript.delta`
- `transcript.final`
- `agent.audio.delta`
- `agent.text.delta`
- `agent.interruption`
- `tool.proposed`
- `tool.blocked`
- `tool.approved`
- `hypergraph.trace.started`
- `hypergraph.trace.updated`
- `hypergraph.trace.final`
- `hitl.requested`
- `hitl.decided`
- `session.ended`
Every externally persisted event must carry sessionId, sequence,
timestamp, and available correlation IDs: demoId, runId, traceId,
traceparent, toolProposalId, and hitlDecisionId.
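A minimal sketch of an envelope that enforces these required fields is shown below. The class names `RealtimeEvent` and `SessionEventLog`, the choice to flatten correlation IDs into the persisted record, and the use of an ISO 8601 string for `timestamp` are assumptions for illustration; the event names and field names come from the text above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

EVENT_TYPES = frozenset({
    "session.created", "media.input.audio", "media.input.video",
    "transcript.delta", "transcript.final", "agent.audio.delta",
    "agent.text.delta", "agent.interruption", "tool.proposed",
    "tool.blocked", "tool.approved", "hypergraph.trace.started",
    "hypergraph.trace.updated", "hypergraph.trace.final",
    "hitl.requested", "hitl.decided", "session.ended",
})
CORRELATION_KEYS = frozenset({
    "demoId", "runId", "traceId", "traceparent",
    "toolProposalId", "hitlDecisionId",
})


@dataclass(frozen=True)
class RealtimeEvent:
    type: str
    sessionId: str
    sequence: int
    timestamp: str  # assumed to be an ISO 8601 UTC string
    correlation: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type}")
        if not self.sessionId or self.sequence < 0 or not self.timestamp:
            raise ValueError("sessionId, sequence, and timestamp are required")
        unknown = set(self.correlation) - CORRELATION_KEYS
        if unknown:
            raise ValueError(f"unknown correlation IDs: {sorted(unknown)}")

    def to_record(self) -> Dict[str, object]:
        """Flatten into the dict that would be persisted as evidence."""
        record: Dict[str, object] = {
            "type": self.type, "sessionId": self.sessionId,
            "sequence": self.sequence, "timestamp": self.timestamp,
        }
        record.update(self.correlation)  # only the IDs that are available
        return record


class SessionEventLog:
    """Assigns a gap-free, increasing sequence number per session."""

    def __init__(self, session_id: str) -> None:
        self.session_id = session_id
        self.records: List[Dict[str, object]] = []

    def append(self, event_type: str, timestamp: str,
               **correlation: str) -> Dict[str, object]:
        event = RealtimeEvent(event_type, self.session_id,
                              len(self.records), timestamp, dict(correlation))
        record = event.to_record()
        self.records.append(record)
        return record
```

A rejected event raises before it is appended, so it never consumes a sequence number.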
Event Crosswalk
Realtime control must compose with existing streams instead of replacing them.
| Realtime event | Maps to | Durable evidence |
|---|---|---|
| `session.created` / `session.ended` | AG-UI lifecycle status; media broker audit | `realtime-agent-session.v0` |
| `media.input.*` | Provider/WebRTC media event; optional AG-UI status | media ref, consent state, redaction status |
| `transcript.delta` / `transcript.final` | AG-UI text delta or caption state | transcript media ref |
| `agent.text.delta` / `agent.audio.delta` | AG-UI assistant delta; provider audio output | transcript/caption ref, optional ephemeral audio ref |
| `hypergraph.trace.*` | active hypergraph `activation`, `gate`, `final`, `done` SSE | `hypergraphTraceIds` and trace refs |
| `tool.proposed` | AG-UI tool proposal panel | typed proposal with risk, args summary, trace ID |
| `tool.blocked` / `tool.approved` | backend policy/HITL gate result | HITL decision, CapabilityGrant, rollback posture |
| `agent.interruption` | AG-UI cancellation/turn state | event log only unless retained in transcript |
Cancellation is fail-closed: ending, pausing, revoking consent, disconnecting, or losing policy context stops media capture, stops provider streaming, rejects pending tool proposals, and writes a terminal session event.
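The fail-closed rule can be sketched as a small state holder. The class name `LiveSessionState`, the trigger labels, and the `reason` field are hypothetical, and modelling every trigger (including pause) as a single `session.ended` record is an assumption about how the terminal event is written.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Triggers named in the cancellation rule (labels are illustrative).
CANCEL_TRIGGERS = frozenset({
    "ended", "paused", "consent_revoked", "disconnected", "policy_context_lost",
})


@dataclass
class LiveSessionState:
    session_id: str
    capturing: bool = True
    provider_streaming: bool = True
    closed: bool = False
    pending_proposals: Dict[str, str] = field(default_factory=dict)
    events: List[Dict[str, str]] = field(default_factory=list)

    def propose_tool(self, proposal_id: str) -> None:
        if self.closed:
            raise RuntimeError("session is closed; proposal refused")
        self.pending_proposals[proposal_id] = "pending"

    def cancel(self, trigger: str) -> None:
        """Stop capture and streaming, reject pending proposals, log once."""
        if trigger not in CANCEL_TRIGGERS:
            raise ValueError(f"unknown cancellation trigger: {trigger}")
        if self.closed:
            return  # idempotent: only one terminal event is ever written
        self.capturing = False
        self.provider_streaming = False
        for proposal_id, status in list(self.pending_proposals.items()):
            if status == "pending":
                self.pending_proposals[proposal_id] = "rejected"
        self.events.append({"type": "session.ended",
                            "sessionId": self.session_id,
                            "reason": trigger})
        self.closed = True
```

Making `cancel` idempotent means overlapping triggers, such as a disconnect arriving right after a consent revocation, cannot produce two terminal events or reopen a rejected proposal.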
Hypergraph Authority Boundary
Live model output is not automatically platform truth.
For factual, operational, or action-bearing claims:
- The realtime agent normalizes intent.
- The hypergraph activates the relevant object slice.
- Policy filters the active set.
- The system judges sufficiency.
- The model renders language/audio only from the admitted result.
- Tool calls remain blocked until policy and HITL gates pass.
If the graph is insufficient, the agent must say so and ask for missing context, instead of improvising.
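The steps above can be sketched as one function in which the model only renders from what policy admitted. Everything here is illustrative: `grounded_answer` and its four callables are hypothetical names, and the wording of the insufficiency message is a placeholder.

```python
from typing import Callable, Dict, Iterable, List


def grounded_answer(intent: str,
                    activate: Callable[[str], Iterable[str]],
                    admits: Callable[[str], bool],
                    sufficient: Callable[[str, List[str]], bool],
                    render: Callable[[str, List[str]], str]) -> Dict[str, object]:
    """Activate a graph slice, filter by policy, judge sufficiency, then render."""
    active = list(activate(intent))               # hypergraph activation
    admitted = [obj for obj in active if admits(obj)]  # policy filter
    if not sufficient(intent, admitted):
        # Say so and ask for context instead of improvising.
        return {"status": "insufficient", "admitted": admitted,
                "text": "I do not have enough admitted context to answer. "
                        "What additional detail can you provide?"}
    # The renderer sees only the admitted result, never the raw active set.
    return {"status": "answered", "admitted": admitted,
            "text": render(intent, admitted)}
```

Passing only `admitted` to `render` is the design point: objects that policy filtered out cannot leak into spoken or written output.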
Realtime media is advisory input. A realtime session may not create, update,
delete, deploy, rotate, assign, or mutate production or twin state unless the
proposed action is represented as a typed tool proposal, bound to an active
hypergraph trace, authorized by backend policy and a time-boxed
CapabilityGrant, and approved through the required HITL gate.
Screen, camera, audio, transcript, and model-visible context are untrusted inputs. They cannot grant tools, change policy, bypass HITL, or override system/developer instructions.
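A minimal sketch of the gate conditions follows, assuming simplified shapes for the proposal and grant. `ToolProposal`, `CapabilityGrantStub`, `gate_tool_proposal`, and their fields are hypothetical; the real `CapabilityGrant` contract is not shown in this document.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CapabilityGrantStub:
    """Stand-in for a time-boxed CapabilityGrant (fields are assumed)."""
    tool: str
    expires_at: float


@dataclass(frozen=True)
class ToolProposal:
    proposal_id: str
    tool: str
    trace_id: Optional[str]  # active hypergraph trace, if any
    action_bearing: bool


def gate_tool_proposal(proposal: ToolProposal,
                       grant: Optional[CapabilityGrantStub],
                       hitl_approved: bool,
                       now: float) -> Tuple[str, str]:
    """Return (event_type, reason); anything not fully authorized is blocked."""
    if not proposal.trace_id:
        return "tool.blocked", "not bound to an active hypergraph trace"
    if grant is None or grant.tool != proposal.tool:
        return "tool.blocked", "no CapabilityGrant for this tool"
    if now >= grant.expires_at:
        return "tool.blocked", "CapabilityGrant expired"
    if proposal.action_bearing and not hitl_approved:
        return "tool.blocked", "HITL approval required"
    return "tool.approved", "policy and HITL gates passed"
```

Nothing in the signature accepts transcript, screen, or camera content, so untrusted media has no path to influence the decision.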
Demo And Twin Integration
Realtime sessions should become another Demo Harness evidence source.
Suggested manifest evidence kinds:
[
{ "kind": "realtime-session", "retention": "pr-artifact", "sensitivity": "internal" },
{ "kind": "transcript", "retention": "pr-artifact", "sensitivity": "internal" },
{ "kind": "audio", "retention": "ephemeral", "sensitivity": "restricted" },
{ "kind": "video", "retention": "ephemeral", "sensitivity": "restricted" },
{ "kind": "hypergraph-trace", "retention": "release-evidence", "sensitivity": "internal" }
]
Demo/twin run outputs should include:
- session ID;
- media mode;
- consent state;
- transcript reference;
- model/provider reference;
- hypergraph trace IDs;
- tool proposals;
- HITL decisions;
- redaction status;
- retention policy;
- reviewer notes.
Contract anchors:
- `contracts/realtime-agent-session.schema.json`
- `contracts/realtime-agent-session.example.json`
- `tools/validate-realtime-agent-session.py`
Product Modes
| Mode | Description | First use |
|---|---|---|
| Voice review | Human talks through a demo with an agent. | Demo Harness. |
| Voice + screen | Agent sees shared screen or app state while guiding review. | Guided UI demos. |
| Voice + operational trace | Live conversation is grounded to active hypergraph traces. | Capability/adoption questions. |
| Video co-presence | Human, agent, and optional peers share a live room. | Design/HITL sessions. |
| Twin control room | Live session drives a digital/operational twin with gates. | Self-modeled systems engineering. |
Safety Requirements
Consent Lifecycle
Consent is per mode:
- microphone input;
- speaker/audio output;
- camera input;
- screen share;
- transcription/captions;
- recording;
- model-visible context;
- evidence retention.
Consent changes are session events. Revocation immediately stops capture, provider streaming, recording, transcription, and pending tool approvals for the revoked mode. Multi-party rooms require visible participant and recording state for all participants, including late joiners.
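Per-mode consent with revocation can be sketched as a small ledger. The class name `ConsentLedger`, the mode identifiers, and the `consent.changed` event name are assumptions; the adapter vocabulary in this document does not define a consent event.

```python
from typing import Dict, List, Set

# One identifier per consent mode listed above (spellings are illustrative).
CONSENT_MODES = frozenset({
    "microphone", "speaker", "camera", "screen_share",
    "transcription", "recording", "model_context", "evidence_retention",
})


class ConsentLedger:
    def __init__(self, session_id: str, consent_version: str) -> None:
        self.session_id = session_id
        self.consent_version = consent_version
        self.granted: Set[str] = set()
        self.events: List[Dict[str, object]] = []
        # proposalId -> consent mode the pending approval depends on
        self.pending_approvals: Dict[str, str] = {}

    def _record(self, mode: str, granted: bool, timestamp: str) -> None:
        self.events.append({
            "type": "consent.changed",  # hypothetical event name
            "sessionId": self.session_id,
            "consentVersion": self.consent_version,
            "mode": mode, "granted": granted, "timestamp": timestamp,
        })

    def _check(self, mode: str) -> None:
        if mode not in CONSENT_MODES:
            raise ValueError(f"unknown consent mode: {mode}")

    def grant(self, mode: str, timestamp: str) -> None:
        self._check(mode)
        self.granted.add(mode)
        self._record(mode, True, timestamp)

    def revoke(self, mode: str, timestamp: str) -> List[str]:
        """Revoke one mode; return the pending approvals cancelled with it."""
        self._check(mode)
        self.granted.discard(mode)
        cancelled = sorted(pid for pid, dep in self.pending_approvals.items()
                           if dep == mode)
        for pid in cancelled:
            del self.pending_approvals[pid]
        self._record(mode, False, timestamp)
        return cancelled

    def allows(self, mode: str) -> bool:
        return mode in self.granted
```

Capture, streaming, recording, and transcription code would call `allows(mode)` before every operation, so a revocation takes effect on the next check rather than at the next session boundary.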
Media Retention Matrix
| Artifact | Default retention | Default sensitivity | Promotion rule |
|---|---|---|---|
| Raw audio | `ephemeral` | `restricted` | Do not promote to PR/release evidence without explicit privacy approval. |
| Raw video/screen | `ephemeral` | `restricted` | Prefer hashes, transcript refs, and reviewer notes over raw media. |
| Transcript/captions | `pr-artifact` | `internal` | Redaction must pass before release evidence. |
| Hypergraph trace IDs | `release-evidence` | `internal` | Store IDs/refs, not raw media. |
| HITL/tool decisions | `release-evidence` | `internal` | Required for action-bearing sessions. |
Raw audio, video, and screen media default to ephemeral retention and must not be promoted to PR artifacts or release evidence without explicit privacy approval. Evidence should prefer hashes, transcript references, redaction status, trace IDs, and reviewer decisions over raw media content.
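One way to encode the matrix as a promotion check is sketched below. The function `can_promote`, the `hitl-decision` kind, the ordering of retention tiers, and the choice to treat explicit privacy approval as the only path for promoting restricted raw media are assumptions drawn from the table, not a confirmed policy implementation.

```python
from typing import Dict, Tuple

# Retention tiers from shortest-lived to longest-lived (ordering is assumed).
RETENTION_ORDER = ["ephemeral", "pr-artifact", "release-evidence"]

# kind -> (default retention, default sensitivity), following the matrix.
DEFAULTS: Dict[str, Tuple[str, str]] = {
    "audio": ("ephemeral", "restricted"),
    "video": ("ephemeral", "restricted"),
    "transcript": ("pr-artifact", "internal"),
    "hypergraph-trace": ("release-evidence", "internal"),
    "hitl-decision": ("release-evidence", "internal"),  # hypothetical kind name
}


def can_promote(kind: str, target: str, *, redaction_passed: bool = False,
                privacy_approved: bool = False) -> bool:
    """Return True if an artifact of `kind` may be retained at `target`."""
    default_retention, sensitivity = DEFAULTS[kind]
    if RETENTION_ORDER.index(target) <= RETENTION_ORDER.index(default_retention):
        return True  # at or below the default tier: always allowed
    if sensitivity == "restricted":
        return privacy_approved  # raw media needs explicit privacy approval
    if kind == "transcript" and target == "release-evidence":
        return redaction_passed  # redaction must pass first
    return False  # fail closed for anything the matrix does not cover
```

Unknown kinds raise `KeyError` and unknown targets raise `ValueError`, so a typo cannot silently widen retention.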
Baseline Controls
Minimum requirements:
- explicit consent before microphone, camera, screen, transcription, recording, model processing, or evidence retention;
- visible capture, recording, transcription, and model-visible indicators;
- user can mute, pause, and end session;
- no raw provider key in browser;
- short-lived media credentials with mode scope, user/session binding, and revocation;
- raw audio/video/screen default to ephemeral retention and restricted sensitivity;
- transcripts pass redaction before release evidence;
- tool calls require policy gate and HITL when action-bearing;
- no production mutation from live conversation alone;
- every action proposal links to a hypergraph trace or states that none exists.
Special handling:
- voiceprints, faces, and biometrics are sensitive;
- voice cannot authenticate a user; identity comes from backend auth/session state, not speaker recognition;
- face/voiceprint identification, emotion inference, and biometric enrollment are out of scope until separate privacy/security review;
- employee/customer conversations require stricter retention and privacy review;
- screen-share may expose secrets and must be redacted or ephemeral by default;
- screen-share text cannot grant tools or override policy;
- provider/subprocessor inventory is required before employee, customer, or external production use;
- realtime agents must support interruption and correction.
First Implementation Slice
The first slice should not attempt full production video rooms.
Build this:
- Add `realtime.agent.interface` as a planned capability.
- Add `RealtimeSession` evidence shape to the Demo Harness plan.
- Add a browser panel that can attach a transcript/reference to a demo run.
- Add provider-agnostic session metadata:
{
  "schemaVersion": "realtime-agent-session.v0",
  "sessionId": "ras-...",
  "demoId": "demo-harness.twin-driver",
  "modes": ["audio", "transcript"],
  "transport": "webrtc",
  "providerMetadata": {
    "provider": "openai-realtime",
    "adapter": "native-realtime"
  },
  "hypergraphTraceIds": [],
  "retention": "pr-artifact",
  "sensitivity": "internal",
  "consent": {
    "recording": false,
    "transcription": true,
    "screenShare": false
  }
}
- Bind the session to the mock `demo-harness.twin-driver` capsule.
- Require HITL approval before the session can promote any recommendation.
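A lightweight check of the session metadata shape might look like the sketch below. It is not the contents of `tools/validate-realtime-agent-session.py` or the JSON Schema contract, neither of which is shown here. The function name `check_session_metadata`, the allowed retention and sensitivity values (inferred from values that appear in this document), and the rule tying the `transcript` mode to transcription consent are assumptions.

```python
from typing import Any, Dict, List

RETENTIONS = {"ephemeral", "pr-artifact", "release-evidence"}
SENSITIVITIES = {"internal", "restricted"}  # only the values seen in this doc
CONSENT_KEYS = ("recording", "transcription", "screenShare")


def check_session_metadata(doc: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means it passed."""
    errors: List[str] = []
    if doc.get("schemaVersion") != "realtime-agent-session.v0":
        errors.append("schemaVersion must be realtime-agent-session.v0")
    if not isinstance(doc.get("sessionId"), str) or not doc["sessionId"]:
        errors.append("sessionId must be a non-empty string")
    modes = doc.get("modes")
    if not isinstance(modes, list) or not modes:
        errors.append("modes must be a non-empty list")
    if doc.get("retention") not in RETENTIONS:
        errors.append("retention is not a known retention tier")
    if doc.get("sensitivity") not in SENSITIVITIES:
        errors.append("sensitivity is not a known sensitivity level")
    if not isinstance(doc.get("hypergraphTraceIds"), list):
        errors.append("hypergraphTraceIds must be a list")
    consent = doc.get("consent")
    if not isinstance(consent, dict):
        errors.append("consent must be an object")
        consent = {}
    for key in CONSENT_KEYS:
        if not isinstance(consent.get(key), bool):
            errors.append(f"consent.{key} must be a boolean")
    # Assumed cross-field rule: a transcript mode needs transcription consent.
    if isinstance(modes, list) and "transcript" in modes \
            and consent.get("transcription") is not True:
        errors.append("transcript mode requires transcription consent")
    return errors
```

Returning every problem at once, rather than raising on the first, suits a review surface where a human fixes the metadata in one pass.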
Recommended Technology Posture
| Concern | Recommendation |
|---|---|
| Browser media | WebRTC. |
| Multi-party rooms/SFU | LiveKit first; keep adapter boundary open. |
| Provider-native realtime AI | OpenAI Realtime as a primary adapter, not a lock-in. |
| Provider-flexible voice | Cascaded STT -> hypergraph/LLM -> TTS path. |
| UI control stream | Existing AG-UI/SSE and active hypergraph SSE. |
| Tool execution | MCP/A2A through backend policy. |
| Evidence | Demo Harness artifacts plus transcript/media refs. |