ARC-ADR-DRAFT — Disambiguator: a high-speed streaming sense-resolution service
- ID: ARC-ADR-DRAFT
- Status: Proposed (draft / vision)
- Date: 2026-05-30
- Deciders: platform-architect, knowledge-engineer, ontologist-generalist, async-messaging-engineer
- Related: Acronym Squasher (acronym-squasher.md) is the static, batch, single-sense special case of this service.
Context
Natural language — in docs, code, chat, agent traffic — is saturated with ambiguous surface forms. Three families, one underlying problem:
| Family | One ↔ many | Example |
|---|---|---|
| Acronyms | one surface → many expansions | MC = middle-core │ Master of Ceremonies │ Marginal Cost |
| Homonyms / polysemy | one surface → many senses | bridge = the agent-army-docker-bridge repo │ a network bridge │ a verb |
| Synonyms / aliases | many surfaces → one referent | backend-core = BE = "the API layer" |
The right sense depends on context (surrounding tokens, active domain, who's speaking, what was just resolved) and shifts over time within a stream — a later sentence can retroactively change what an earlier token meant. The Acronym Squasher solved the easy corner (one fixed expansion per acronym, whole-doc, offline). The general capability is a streaming Word-Sense Disambiguation (WSD) + entity-linking service: ingest a token stream, emit a continuously-revised array of resolutions.
Decision drivers
- Speed — bounded per-token latency; O(n) over the stream; most tokens resolve in microseconds. (Realized: the Rust core tools/disambiguator-rs/ runs ~13–14 M tokens/sec single-thread, ~72 ns/token, zero-dependency.)
- Context-sensitivity — same surface resolves differently per stream/domain/conversation state.
- Revisability — resolutions are provisional; later context emits revisions, not just appends.
- Cost-awareness — escalate to embeddings / LLM only for the genuinely ambiguous minority.
- Knowledge-graph-backed — reuse the fleet's ontology stack so acronyms, synonyms, homonyms, and entities share one sense inventory.
- Fleet-native — ride NATS / CloudEvents / event-bridge; no new bus.
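The two speed figures quoted for the Rust core can be cross-checked with a line of arithmetic. The sketch below is only a sanity check on the numbers stated above; the helper names are made up for illustration.

```python
def ns_per_token(tokens_per_sec: float) -> float:
    """Mean time per token in nanoseconds for a given throughput."""
    return 1e9 / tokens_per_sec


def tokens_per_sec(ns: float) -> float:
    """Throughput implied by a mean per-token time in nanoseconds."""
    return 1e9 / ns


slow_end = ns_per_token(13e6)   # roughly 76.9 ns/token
fast_end = ns_per_token(14e6)   # roughly 71.4 ns/token
implied = tokens_per_sec(72.0)  # roughly 13.9 M tokens/sec

print(f"{fast_end:.1f} to {slow_end:.1f} ns/token; 72 ns is about {implied / 1e6:.1f} M tokens/sec")
```

So ~72 ns/token sits at the fast end of the 13–14 M tokens/sec range, and both figures are well inside a microsecond-scale per-token budget.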
Core model
stream of tokens ─▶ [surface detection] ─▶ [candidate senses] ─▶ [context scoring]
                                                                         │
                           ┌─────────────────────────────────────────────┘
                           ▼
        per-stream resolution state (the "array that shifts")
        surface ─▶ { sense, confidence, provenance, validInterval, version }
                           │
                           ├─▶ emit Disambiguation event (new resolution)
                           └─▶ emit Revision event (confidence/sense changed)
The central object is a per-stream resolution map: a live collection of
Resolution records, keyed by surface, that is mutated and versioned as the stream
advances. This is the user's "array of mapped disambiguations which may shift over time."
// Resolution record (emitted as a CloudEvent)
{
"surface": "MC",
"span": { "start": 1043, "end": 1045 },
"sense": "kb://entity/middle-core", // canonical referent in the knowledge graph
"confidence": 0.91,
"alternates": [ { "sense": "kb://concept/marginal-cost", "confidence": 0.06 } ],
"tier": 1, // which resolver tier decided it (0..3)
"provenance": ["domain:fleet-ops", "window:±12", "prior:0.7"],
"validInterval": { "from": "<HLC>", "to": null }, // time-varying; revisable
"version": 2 // bumped on each revision
}
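As a rough illustration of how a consumer or resolver might hold this record in memory, here is a minimal Python sketch. The field names in the emitted dictionary follow the record above; the class name, the `revised` helper, and the `hlc-1` / `hlc-2` placeholder timestamps are assumptions made for the example, not part of the service.

```python
import json
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Resolution:
    """Immutable snapshot of one resolved span (hypothetical in-memory shape)."""
    surface: str
    start: int
    end: int
    sense: str
    confidence: float
    tier: int
    alternates: Tuple[Tuple[str, float], ...] = ()
    provenance: Tuple[str, ...] = ()
    valid_from: str = "<HLC>"
    valid_to: Optional[str] = None
    version: int = 1

    def revised(self, sense: str, confidence: float, tier: int, at: str) -> "Resolution":
        """Return the successor record: same span, new decision, version + 1."""
        return replace(self, sense=sense, confidence=confidence, tier=tier,
                       valid_from=at, valid_to=None, version=self.version + 1)

    def to_event_data(self) -> dict:
        """Serialize using the field names of the record shown above."""
        return {
            "surface": self.surface,
            "span": {"start": self.start, "end": self.end},
            "sense": self.sense,
            "confidence": self.confidence,
            "alternates": [{"sense": s, "confidence": c} for s, c in self.alternates],
            "tier": self.tier,
            "provenance": list(self.provenance),
            "validInterval": {"from": self.valid_from, "to": self.valid_to},
            "version": self.version,
        }


# "hlc-1" / "hlc-2" stand in for real HLC timestamps.
first = Resolution("MC", 1043, 1045, "kb://concept/marginal-cost", 0.55,
                   tier=0, valid_from="hlc-1")
second = first.revised("kb://entity/middle-core", 0.91, tier=1, at="hlc-2")
print(json.dumps(second.to_event_data(), indent=2))
```

Keeping each version immutable and deriving the next one with `revised` is one way to make the per-span history replayable; a mutable map with an append-only log would work as well.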
Architecture
A tiered pipeline — cheap-and-fast first, expensive-and-smart only when needed.
| Stage | Responsibility | Realized by |
|---|---|---|
| Ingest | accept a stream (HTTP push or subject subscribe), CloudEvents-wrapped | event-bridge (:8080) → NATS JetStream FLEET |
| Lexicon / sense inventory | surface form → candidate senses, with domain tags, priors, validity intervals | knowledge graph (ArcadeDB / Fuseki); SKOS altLabel/prefLabel for synonyms, the glossary for acronyms, a UFO/BFO ontology for entity senses |
| Surface detection | O(n) multi-pattern match over the token window | Aho-Corasick automaton compiled from the lexicon (generalizes acronyms.mjs's regex bank) |
| Context scorer (tiered) | pick the sense | T0 lexicon prior + active-domain filter (µs) → T1 sliding-window co-occurrence + recent-resolution state → T2 embedding similarity (local-embedder, #184) → T3 llm-gateway for the hard residue |
| Stream resolver | maintain the revisable per-stream map; order revisions; emit deltas | windowed, stateful; HLC-ordered so out-of-order/late context revises deterministically |
| Output | emit Disambiguation / Revision events; serve current-state snapshot | CloudEvents on NATS; a queryable GET /streams/{id}/resolutions snapshot |
| Feedback loop | grow the lexicon; tune priors; escalate novelty | low-confidence / novel surfaces → hitl-coordinator Decision; confirmed senses → knowledge-synthesizer updates priors; new entries authored by taxonomist (SKOS) / ontologist-generalist (senses) |
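To make the surface-detection row concrete, the following is a small, self-contained Aho-Corasick sketch in Python. It matches over tokens rather than characters, which is an assumption made here to sidestep word-boundary handling; the class name `SurfaceMatcher` and the sample surfaces are illustrative and are not taken from the Rust core or from acronyms.mjs.

```python
from collections import deque


class SurfaceMatcher:
    """Aho-Corasick automaton over token sequences (illustrative sketch)."""

    def __init__(self, surfaces):
        self.goto = [{}]   # goto[node] maps token -> child node; node 0 is the root
        self.fail = [0]    # failure link per node
        self.out = [[]]    # surfaces (as token tuples) that end at each node
        for surface in surfaces:
            self._insert(tuple(surface.split()))
        self._build_failure_links()

    def _insert(self, tokens):
        node = 0
        for tok in tokens:
            nxt = self.goto[node].get(tok)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][tok] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(tokens)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for tok, child in self.goto[node].items():
                queue.append(child)
                f = self.fail[node]
                while f and tok not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(tok, 0)
                # Inherit surfaces that end at the failure target (proper suffixes).
                self.out[child] = self.out[child] + self.out[self.fail[child]]

    def find(self, tokens):
        """Yield (start, end_exclusive, surface) for every match, in one pass."""
        node = 0
        for i, tok in enumerate(tokens):
            while node and tok not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(tok, 0)
            for pattern in self.out[node]:
                yield (i - len(pattern) + 1, i + 1, " ".join(pattern))


matcher = SurfaceMatcher(["MC", "BE", "backend-core", "API", "the API layer"])
matches = list(matcher.find("we moved MC behind the API layer".split()))
print(matches)
```

Overlapping surfaces such as "API" and "the API layer" are both reported; choosing between them (longest match, or scoring both as candidates) is left to the later stages.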
Why tiered is the whole game
Empirically, most tokens are either unambiguous or have a dominant sense, so T0
resolves the vast majority at lexicon-lookup speed. Only the ambiguous minority climbs
to embeddings (T2) or the LLM (T3). This keeps the service streaming-fast and cheap,
and it applies the same cost discipline as the fleet's existing llm-gateway / FinOps posture.
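The escalation logic can be sketched as a loop over scorers that stops at the first tier whose confidence clears an acceptance threshold. Everything below is a toy under stated assumptions: the lexicon entries, priors, cue words, the 0.6 threshold, and the scoring formulas are invented for illustration, and T2/T3 are stubs rather than real calls to local-embedder or llm-gateway.

```python
# Hypothetical toy lexicon: surface -> [(sense, prior, domain, cue words)].
LEXICON = {
    "MC": [
        ("kb://entity/middle-core", 0.5, "fleet-ops", {"deploy", "service", "repo"}),
        ("kb://concept/marginal-cost", 0.3, "finance", {"price", "revenue", "unit"}),
        ("kb://concept/master-of-ceremonies", 0.2, "events", {"stage", "host"}),
    ],
}
ACCEPT = 0.6  # assumed acceptance threshold


def _best(scores):
    """Normalize raw scores and return (sense, confidence), or None if empty."""
    total = sum(scores.values())
    if total == 0:
        return None
    sense = max(scores, key=scores.get)
    return sense, scores[sense] / total


def t0_prior(surface, window, domain):
    """T0: filter by active domain (if it matches anything), then use priors."""
    entries = LEXICON.get(surface, [])
    in_domain = [e for e in entries if e[2] == domain] or entries
    return _best({e[0]: e[1] for e in in_domain})


def t1_window(surface, window, domain):
    """T1: boost each prior by the number of cue words seen in the window."""
    words = set(window)
    return _best({e[0]: e[1] * (1 + 2 * len(e[3] & words))
                  for e in LEXICON.get(surface, [])})


def t2_embedding(surface, window, domain):
    return None  # stub: embedding similarity would go here


def t3_llm(surface, window, domain):
    return None  # stub: LLM call for the hard residue would go here


TIERS = [t0_prior, t1_window, t2_embedding, t3_llm]


def resolve(surface, window, domain):
    """Return (sense, confidence, tier) from the cheapest tier that is confident."""
    best = None
    for tier, scorer in enumerate(TIERS):
        scored = scorer(surface, window, domain)
        if scored is None:
            continue
        if best is None or scored[1] > best[1]:
            best = (scored[0], scored[1], tier)
        if scored[1] >= ACCEPT:
            break
    return best


easy = resolve("MC", ["deploy", "MC"], "fleet-ops")
harder = resolve("MC", ["the", "unit", "price", "of", "MC"], "unknown")
print(easy, harder)
```

In this toy run the first call is settled at T0 by the domain filter alone, while the second has no usable domain and is settled at T1 by the cue words in its window. If no tier clears the threshold, the best guess so far is returned with the tier that produced it.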
Revisability (the "shifts over time" requirement)
Each Resolution carries a validInterval and a version. When later context moves
a score past a hysteresis threshold, the resolver emits a Revision event (a delta,
not a re-send) and bumps the version. Consumers can replay the resolution history of a
span. Hybrid logical clock (HLC) timestamps, with a node-ID tie-break, give a total
order across late and out-of-order arrivals without a central clock, so two workers
resolving the same stream converge on the same result.
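A minimal sketch of the two mechanisms named here: a hybrid logical clock whose timestamps compare as `(l, c, node)` tuples, and a hysteresis check that only emits a revision when a challenger sense beats the current one by a margin. The 0.15 margin, the function and field names, and the shape of the delta are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class HLC:
    """Hybrid logical clock; timestamps are (l, c, node) tuples."""
    node: str
    l: int = 0  # largest physical time seen so far
    c: int = 0  # logical counter used when physical time does not advance

    def tick(self, physical_ms):
        """Timestamp a local event."""
        if physical_ms > self.l:
            self.l, self.c = physical_ms, 0
        else:
            self.c += 1
        return (self.l, self.c, self.node)

    def merge(self, remote, physical_ms):
        """Timestamp the receipt of a remote event carrying `remote`."""
        rl, rc, _ = remote
        new_l = max(self.l, rl, physical_ms)
        if new_l == self.l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c, self.node)


HYSTERESIS = 0.15  # assumed margin a challenger must clear


def maybe_revise(current, challenger_sense, challenger_conf, stamp):
    """Return a Revision delta (changed fields only), or None if below the margin."""
    if challenger_sense == current["sense"]:
        return None
    if challenger_conf < current["confidence"] + HYSTERESIS:
        return None
    return {
        "sense": challenger_sense,
        "confidence": challenger_conf,
        "validFrom": stamp,
        "version": current["version"] + 1,
    }


a, b = HLC("a"), HLC("b")
s1 = a.tick(100)
s2 = a.tick(100)       # same millisecond: the counter breaks the tie
s3 = b.merge(s2, 90)   # b's wall clock lags, but its stamp still sorts after s2

current = {"sense": "kb://concept/marginal-cost", "confidence": 0.55, "version": 1}
ignored = maybe_revise(current, "kb://entity/middle-core", 0.60, s3)
revision = maybe_revise(current, "kb://entity/middle-core", 0.91, s3)
print(s1, s2, s3, ignored, revision)
```

Because the stamps are plain tuples, sorting buffered events by stamp gives every worker the same order. The margin keeps a span from flapping between two near-tied senses.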
Phased roadmap
- P1 — Static lexicon, batch (DONE). Acronym Squasher: glossary as lexicon, whole-doc, one sense. The seed.
- P2 — Streaming T0/T1. Aho-Corasick matcher + lexicon priors + window heuristics over a NATS stream; emit Disambiguation events. No ML yet. → runnable PoC: tools/disambiguator/ (context-sensitivity + revision proven; transport/KG/T2-T3 still stubbed).
- P3 — Context + embeddings (T2). Domain/conversation state; local-embedder for semantic tie-breaks; synonyms via SKOS. → the ontology-fed lexicon is already prototyped: tools/disambiguator/skos-adapter.mjs turns a Fuseki SKOS CONSTRUCT into surfaces/senses/cues (graph topology = model). This is the shared-service loop: ontology pipelines feed the lexicon (SKOS → senses) and consume the output (resolved concept IRIs → entity links back into the graph).
- P4 — Revisable online (T3 + feedback). HLC-ordered Revision events; llm-gateway for the hard residue; HITL-driven lexicon growth.
- P5 — Cross-stream entity resolution. Same-as / identity resolution across streams (the Data-Vault same-as discipline applied to language), so BE / backend-core / "the API layer" collapse to one referent fleet-wide.
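For P5, one simple way to picture "collapse to one referent" is a union-find structure over surfaces and senses. This is a stand-in technique chosen for the sketch, not a description of how the Data-Vault same-as links are or will be stored; the `SameAs` class and its methods are hypothetical.

```python
class SameAs:
    """Union-find over surface forms; the second argument of link() stays canonical."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the canonical referent for x (x itself if never linked)."""
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def link(self, alias, canonical):
        """Record that alias and canonical denote the same referent."""
        alias_root, canonical_root = self.find(alias), self.find(canonical)
        if alias_root != canonical_root:
            self.parent[alias_root] = canonical_root


same_as = SameAs()
same_as.link("BE", "backend-core")
same_as.link("the API layer", "BE")  # linked via an alias, still reaches backend-core

print(same_as.find("the API layer"))
```

Links can arrive from different streams in any order; as long as each one names a pair known to co-refer, every alias ends up in the same group.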
Consequences
Good: one capability spans acronyms + homonyms + synonyms + entities; comprehension debt is paid down continuously, not just in docs; agents get a shared, context-aware "what does this token mean here" service; reuses the existing bus + KG + embedder + LLM gateway rather than new infra.
Costs / risks: lexicon governance becomes load-bearing (who curates senses? → knowledge-ontology cluster + HITL); the LLM tier needs a hard budget cap or it eats latency/cost; eventual-consistency of revisions means consumers must handle "the answer changed"; evaluation needs real WSD/entity-linking metrics (precision/recall, accuracy@k) and a labelled set, or quality is unknowable.
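The evaluation metrics named above are simple to compute once a labelled set exists. The sketch below assumes gold labels keyed by a span identifier and a system that may abstain on some spans; the function names and the tiny data set are invented for illustration.

```python
def accuracy_at_k(ranked, gold, k):
    """Fraction of gold spans whose correct sense is among the top-k candidates.

    ranked: {span_id: [sense, ...]} with the best candidate first
    gold:   {span_id: sense}
    """
    if not gold:
        return 0.0
    hits = sum(1 for span, sense in gold.items() if sense in ranked.get(span, [])[:k])
    return hits / len(gold)


def precision_recall(predicted, gold):
    """Precision over emitted resolutions, recall over all gold spans."""
    correct = sum(1 for span, sense in predicted.items() if gold.get(span) == sense)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Made-up labelled set: four spans, the system abstains on span 4.
gold = {1: "middle-core", 2: "marginal-cost", 3: "network-bridge", 4: "backend-core"}
ranked = {
    1: ["middle-core", "marginal-cost"],
    2: ["middle-core", "marginal-cost"],
    3: ["bridge-repo", "bridge-verb"],
}
predicted = {span: senses[0] for span, senses in ranked.items()}

acc1 = accuracy_at_k(ranked, gold, 1)
acc2 = accuracy_at_k(ranked, gold, 2)
precision, recall = precision_recall(predicted, gold)
print(acc1, acc2, precision, recall)
```

Reporting precision and recall separately matters for a tiered design: a resolver that abstains on hard spans can look precise while leaving much of the stream unresolved.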
Open questions
- Sense inventory store: extend the glossary format, or model senses natively as SKOS concepts + ontology individuals from day one?
- Per-stream state lifetime / eviction policy for long-lived streams.
- Is the revision hysteresis global, or per-surface-learned?
- Where's the line between this service and the existing ontology "sieve" (Fuseki/SHACL)?