ARC-ADR-DRAFT — Disambiguator: a high-speed streaming sense-resolution service

ID: ARC-ADR-DRAFT
Status: Proposed (draft / vision)
Date: 2026-05-30
Deciders: platform-architect, knowledge-engineer, ontologist-generalist, async-messaging-engineer
Related: Acronym Squasher (acronym-squasher.md) is the static, batch, single-sense special case of this service.

Context

Natural language — in docs, code, chat, agent traffic — is saturated with ambiguous surface forms. Three families, one underlying problem:

| Family | One ↔ many | Example |
|---|---|---|
| Acronyms | one surface → many expansions | `MC` = middle-core │ Master of Ceremonies │ Marginal Cost |
| Homonyms / polysemy | one surface → many senses | `bridge` = the `agent-army-docker-bridge` repo │ a network bridge │ a verb |
| Synonyms / aliases | many surfaces → one referent | `backend-core` = `BE` = "the API layer" |

The right sense depends on context (surrounding tokens, active domain, who's speaking, what was just resolved) and shifts over time within a stream: a later sentence can retroactively change what an earlier token meant. The Acronym Squasher solved the easy corner (one fixed expansion per acronym, whole-doc, offline). The general capability is a streaming Word-Sense Disambiguation (WSD) + entity-linking service that ingests a token stream and emits a continuously revised array of resolutions.
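The three families can share one lookup structure. The following is a minimal sketch, assuming a hypothetical in-memory `Lexicon` class; it is not the format of the real glossary or knowledge graph, and the priors, domain tags, and most of the `kb://` IRIs are invented for illustration. Acronyms and homonyms are one surface with several candidates; synonyms are several surfaces pointing at the same sense ID.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    sense: str    # canonical referent ID
    prior: float  # context-free weight of this sense for the surface
    domain: str   # domain tag, usable as a filter


class Lexicon:
    """Hypothetical in-memory sense inventory (illustration only)."""

    def __init__(self):
        self._by_surface = defaultdict(list)

    def add(self, surface, sense, prior, domain):
        self._by_surface[surface].append(Candidate(sense, prior, domain))

    def candidates(self, surface):
        """One surface -> many senses, highest prior first."""
        found = self._by_surface.get(surface, [])
        return sorted(found, key=lambda c: c.prior, reverse=True)

    def surfaces_for(self, sense):
        """Many surfaces -> one referent (the synonym/alias direction)."""
        return {surface for surface, cands in self._by_surface.items()
                if any(c.sense == sense for c in cands)}


lex = Lexicon()
# Priors, domains and most IRIs below are made up for the example.
lex.add("MC", "kb://entity/middle-core", 0.7, "fleet-ops")
lex.add("MC", "kb://concept/marginal-cost", 0.2, "economics")
lex.add("MC", "kb://concept/master-of-ceremonies", 0.1, "events")
for alias in ("backend-core", "BE", "the API layer"):
    lex.add(alias, "kb://entity/backend-core", 1.0, "fleet-ops")

print([c.sense for c in lex.candidates("MC")])
print(sorted(lex.surfaces_for("kb://entity/backend-core")))
```

Lookups are case-sensitive on purpose in this sketch, since folding `BE` to `be` would collide with an ordinary English word.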

Decision drivers

Speed — bounded per-token latency; O(n) over the stream; most tokens resolve in microseconds. (Realized: the Rust core tools/disambiguator-rs/ runs ~13–14 M tokens/sec single-thread, ~72 ns/token, zero-dependency.)
Context-sensitivity — same surface resolves differently per stream/domain/conversation state.
Revisability — resolutions are provisional; later context emits revisions, not just appends.
Cost-awareness — escalate to embeddings / LLM only for the genuinely ambiguous minority.
Knowledge-graph-backed — reuse the fleet's ontology stack so acronyms, synonyms, homonyms, and entities share one sense inventory.
Fleet-native — ride NATS / CloudEvents / event-bridge; no new bus.

Core model

stream of tokens ─▶ [surface detection] ─▶ [candidate senses] ─▶ [context scoring]
                                                                        │
                          ┌─────────────────────────────────────────────┘
                          ▼
            per-stream resolution state  (the "array that shifts")
            surface ─▶ { sense, confidence, provenance, validInterval, version }
                          │
                          ├─▶ emit Disambiguation event   (new resolution)
                          └─▶ emit Revision event         (confidence/sense changed)

The central object is a per-stream resolution map: a live collection of Resolution records, keyed by surface, that is mutated and versioned as the stream advances. That is the user's "array of mapped disambiguations which may shift over time."

// Resolution record (emitted as a CloudEvent)
{
  "surface": "MC",
  "span": { "start": 1043, "end": 1045 },
  "sense": "kb://entity/middle-core",      // canonical referent in the knowledge graph
  "confidence": 0.91,
  "alternates": [ { "sense": "kb://concept/marginal-cost", "confidence": 0.06 } ],
  "tier": 1,                                // which resolver tier decided it (0..3)
  "provenance": ["domain:fleet-ops", "window:±12", "prior:0.7"],
  "validInterval": { "from": "<HLC>", "to": null },  // time-varying; revisable
  "version": 2                              // bumped on each revision
}
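To make the record concrete, here is a hedged sketch of how it could be modelled and wrapped in a CloudEvents 1.0 envelope. The `Resolution` dataclass, the `source` path, and the event `type` string are assumptions made for this example; only the four required CloudEvents attributes (`specversion`, `id`, `source`, `type`) come from the CloudEvents specification.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Resolution:
    surface: str
    span: dict
    sense: str
    confidence: float
    tier: int
    alternates: list = field(default_factory=list)
    provenance: list = field(default_factory=list)
    valid_from: str = "<HLC>"  # placeholder carried over from the record above
    version: int = 1

    def to_cloudevent(self, stream_id, event_type):
        """Wrap the record in a structured-mode CloudEvents 1.0 envelope."""
        data = asdict(self)
        data["validInterval"] = {"from": data.pop("valid_from"), "to": None}
        return {
            "specversion": "1.0",
            "id": str(uuid.uuid4()),
            "source": f"/disambiguator/streams/{stream_id}",  # hypothetical
            "type": event_type,                                # hypothetical
            "datacontenttype": "application/json",
            "data": data,
        }


res = Resolution(
    surface="MC",
    span={"start": 1043, "end": 1045},
    sense="kb://entity/middle-core",
    confidence=0.91,
    tier=1,
    alternates=[{"sense": "kb://concept/marginal-cost", "confidence": 0.06}],
    provenance=["domain:fleet-ops", "window:±12", "prior:0.7"],
    version=2,
)
event = res.to_cloudevent("stream-42", "fleet.disambiguator.revision.v1")
print(json.dumps(event, indent=2, ensure_ascii=False))
```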

Architecture

A tiered pipeline — cheap-and-fast first, expensive-and-smart only when needed.

| Stage | Responsibility | Realized by |
|---|---|---|
| Ingest | accept a stream (HTTP push or subject subscribe), CloudEvents-wrapped | `event-bridge` (`:8080`) → NATS JetStream `FLEET` |
| Lexicon / sense inventory | surface form → candidate senses, with domain tags, priors, validity intervals | knowledge graph (ArcadeDB / Fuseki); SKOS `altLabel`/`prefLabel` for synonyms, the glossary for acronyms, a UFO/BFO ontology for entity senses |
| Surface detection | O(n) multi-pattern match over the token window | Aho-Corasick automaton compiled from the lexicon (generalizes `acronyms.mjs`'s regex bank) |
| Context scorer (tiered) | pick the sense | T0 lexicon prior + active-domain filter (µs) → T1 sliding-window co-occurrence + recent-resolution state → T2 embedding similarity (`local-embedder`, #184) → T3 `llm-gateway` for the hard residue |
| Stream resolver | maintain the revisable per-stream map; order revisions; emit deltas | windowed, stateful; HLC-ordered so out-of-order/late context revises deterministically |
| Output | emit Disambiguation / Revision events; serve current-state snapshot | CloudEvents on NATS; a queryable `GET /streams/{id}/resolutions` snapshot |
| Feedback loop | grow the lexicon; tune priors; escalate novelty | low-confidence / novel surfaces → `hitl-coordinator` Decision; confirmed senses → `knowledge-synthesizer` updates priors; new entries authored by `taxonomist` (SKOS) / `ontologist-generalist` (senses) |
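The surface-detection stage can be illustrated with a small Aho-Corasick automaton. This is a sketch under stated assumptions, not the code in `tools/disambiguator-rs/`: the `SurfaceMatcher` class is hypothetical, it matches over characters rather than tokens, and it omits the word-boundary check a real detector would need (it would find `MC` inside `MCU`).

```python
from collections import deque


class SurfaceMatcher:
    """Minimal Aho-Corasick automaton over characters (illustration only)."""

    def __init__(self, surfaces):
        self.goto = [{}]  # goto[state][char] -> next state
        self.fail = [0]   # failure link per state
        self.out = [[]]   # surfaces recognized on reaching a state
        for surface in surfaces:
            self._insert(surface)
        self._link()

    def _insert(self, surface):
        state = 0
        for ch in surface:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(surface)

    def _link(self):
        # Breadth-first pass: depth-1 states fail to the root; deeper states
        # fail to the longest proper suffix that is also a trie prefix.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[nxt] = self.out[nxt] + self.out[self.fail[nxt]]

    def find(self, text):
        """Yield (start, end, surface) for every occurrence; end is exclusive."""
        state = 0
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for surface in self.out[state]:
                yield (i + 1 - len(surface), i + 1, surface)


matcher = SurfaceMatcher(["MC", "BE", "backend-core", "core"])
hits = list(matcher.find("MC asked BE to fix backend-core"))
for hit in hits:
    print(hit)
```

The scan visits each character once and follows failure links instead of restarting, which is what keeps detection linear in the stream length regardless of how many surfaces the lexicon holds. Overlapping surfaces (`backend-core` and `core`) are both reported, leaving the choice between them to the scorer.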

Why tiered is the whole game

Empirically, most tokens are unambiguous or have a dominant sense, so T0 resolves the vast majority at lexicon-lookup speed. Only the ambiguous minority climbs to embeddings (T2) or the LLM (T3). This keeps the service streaming-fast and cheap, and it is the same cost discipline as the fleet's existing llm-gateway / FinOps posture.
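The escalation logic can be sketched as a loop over tiers ordered by cost. Everything here is an assumption for illustration: the `PRIORS` and `CUES` tables, the `boost` factor, the `accept` threshold of 0.8, and the `t2_stub` stand-in that only records that the expensive tier was reached.

```python
PRIORS = {
    "NATS": [("kb://entity/nats", 0.99)],
    "MC": [("kb://entity/middle-core", 0.7), ("kb://concept/marginal-cost", 0.3)],
}
CUES = {
    "kb://entity/middle-core": {"deploy", "service", "repo"},
    "kb://concept/marginal-cost": {"price", "revenue", "curve"},
}
escalated = []  # surfaces that reached the expensive tier


def t0_prior(surface, context):
    """T0: highest lexicon prior, no context."""
    candidates = PRIORS.get(surface)
    return max(candidates, key=lambda c: c[1]) if candidates else None


def t1_window(surface, context, boost=4.0):
    """T1: reweight priors by cue words found in the surrounding window."""
    window = set(context.get("window", ()))
    scored = [(sense, prior * boost ** len(window & CUES.get(sense, set())))
              for sense, prior in PRIORS.get(surface, [])]
    total = sum(score for _, score in scored)
    if not total:
        return None
    sense, score = max(scored, key=lambda c: c[1])
    return sense, score / total


def t2_stub(surface, context):
    """Stand-in for the embedding tier: only records that it was reached."""
    escalated.append(surface)
    return None


def resolve(surface, context, tiers, accept=0.8):
    """Run tiers cheapest-first; stop at the first answer that clears `accept`."""
    best = None
    for level, tier in enumerate(tiers):
        result = tier(surface, context)
        if result is None:
            continue
        best = (result[0], result[1], level)  # later tiers see more context
        if result[1] >= accept:
            break
    return best


TIERS = [t0_prior, t1_window, t2_stub]
easy = resolve("NATS", {}, TIERS)
contextual = resolve("MC", {"window": ["the", "price", "curve"]}, TIERS)
hard = resolve("MC", {"window": []}, TIERS)
print(easy, contextual, hard, escalated)
```

In this toy run the unambiguous surface stops at T0, the ambiguous surface with helpful neighbours stops at T1, and only the ambiguous surface with no context reaches the stub for the expensive tier.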

Revisability (the "shifts over time" requirement)

Each Resolution carries a validInterval and a version. When later context moves a score past a hysteresis threshold, the resolver emits a Revision event (a delta, not a re-send) and bumps the version. Consumers can replay the resolution history of a span. HLC timestamps order late and out-of-order arrivals without a central clock; combined with a deterministic tie-break for equal timestamps (for example, a worker ID), that order is total, so two workers resolving the same stream converge.
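A sketch of the two mechanisms in this section, assuming a textbook hybrid logical clock and a simple margin rule. The `HLC`, `State`, and `maybe_revise` names, the node-ID tie-break in the timestamp tuple, and the 0.15 threshold are all assumptions for the example, not the service's actual parameters.

```python
from dataclasses import dataclass


class HLC:
    """Hybrid logical clock; timestamps are (wall, counter, node) tuples."""

    def __init__(self, node, now):
        self.node, self.now = node, now  # now: callable returning int millis
        self.wall, self.counter = 0, 0

    def tick(self):
        """Timestamp a local or send event."""
        physical = self.now()
        if physical > self.wall:
            self.wall, self.counter = physical, 0
        else:
            self.counter += 1
        return (self.wall, self.counter, self.node)

    def merge(self, remote):
        """Timestamp a receive event, folding in a remote timestamp."""
        r_wall, r_counter, _ = remote
        wall = max(self.wall, r_wall, self.now())
        if wall == self.wall == r_wall:
            counter = max(self.counter, r_counter) + 1
        elif wall == self.wall:
            counter = self.counter + 1
        elif wall == r_wall:
            counter = r_counter + 1
        else:
            counter = 0
        self.wall, self.counter = wall, counter
        return (wall, counter, self.node)


@dataclass
class State:
    sense: str
    confidence: float
    version: int = 1


def maybe_revise(state, scores, stamp, hysteresis=0.15):
    """Return a Revision delta if the change clears the margin, else None."""
    top = max(scores, key=scores.get)
    incumbent = scores.get(state.sense, 0.0)
    if top != state.sense and scores[top] - incumbent >= hysteresis:
        changed = {"sense": top, "confidence": scores[top]}
    elif top == state.sense and abs(scores[top] - state.confidence) >= hysteresis:
        changed = {"confidence": scores[top]}
    else:
        return None  # inside the hysteresis band: no event
    state.version += 1
    state.sense = changed.get("sense", state.sense)
    state.confidence = changed["confidence"]
    return {"version": state.version, "at": stamp, **changed}


a = HLC("worker-a", now=lambda: 100)
b = HLC("worker-b", now=lambda: 90)  # b's wall clock lags a's
t1 = a.tick()
t2 = b.merge(t1)
t3 = b.tick()

MIDDLE, MARGINAL = "kb://entity/middle-core", "kb://concept/marginal-cost"
state = State(MIDDLE, 0.70)
noise = maybe_revise(state, {MIDDLE: 0.55, MARGINAL: 0.60}, t2)
flip = maybe_revise(state, {MIDDLE: 0.20, MARGINAL: 0.80}, t3)
print(t1, t2, t3, noise, flip)
```

The first re-score puts the challenger only slightly ahead, so nothing is emitted; the second clears the margin and yields a delta carrying only the changed fields plus the new version and timestamp. Tuple comparison on `(wall, counter, node)` is what supplies the tie-break.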

Phased roadmap

P1 — Static lexicon, batch (DONE). Acronym Squasher: glossary as lexicon, whole-doc, one sense. The seed.
P2 — Streaming T0/T1. Aho-Corasick matcher + lexicon priors + window heuristics over a NATS stream; emit Disambiguation events. No ML yet. → runnable PoC: tools/disambiguator/ (context-sensitivity + revision proven; transport/KG/T2-T3 still stubbed).
P3 — Context + embeddings (T2). Domain/conversation state; local-embedder for semantic tie-breaks; synonyms via SKOS. → the ontology-fed lexicon is already prototyped: tools/disambiguator/skos-adapter.mjs turns a Fuseki SKOS CONSTRUCT into surfaces/senses/cues (graph topology = model). This is the shared-service loop: ontology pipelines feed the lexicon (SKOS → senses) and consume the output (resolved concept IRIs → entity links back into the graph).
P4 — Revisable online (T3 + feedback). HLC-ordered Revision events; llm-gateway for the hard residue; HITL-driven lexicon growth.
P5 — Cross-stream entity resolution. Same-as / identity resolution across streams (the Data-Vault same-as discipline applied to language), so BE/backend-core/"the API layer" collapse to one referent fleet-wide.
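One common way to implement the same-as collapse described in P5 is a union-find (disjoint-set) structure. The sketch below is an assumption about how that could look, not a description of existing code; the `SameAs` class is hypothetical, and for brevity it keys on raw surfaces, whereas a real implementation would more plausibly key on resolved referents so that an ambiguous surface such as `MC` is not merged wholesale.

```python
class SameAs:
    """Union-find with path halving and union by size (illustration only)."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        """Return the representative of x's identity cluster."""
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Record that a and b denote the same referent."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra

    def same(self, a, b):
        return self.find(a) == self.find(b)


identity = SameAs()
# Same-as assertions as they might arrive from two different streams.
identity.union("BE", "backend-core")             # asserted in one stream
identity.union("the API layer", "backend-core")  # asserted in another

print(identity.same("BE", "the API layer"))
print(identity.same("BE", "MC"))
```

Because the two assertions share `backend-core`, `BE` and "the API layer" land in one cluster even though no stream ever linked them directly.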

Consequences

Good: one capability spans acronyms + homonyms + synonyms + entities; comprehension debt is paid down continuously, not just in docs; agents get a shared, context-aware "what does this token mean here" service; reuses the existing bus + KG + embedder + LLM gateway rather than new infra.

Costs / risks: lexicon governance becomes load-bearing (who curates senses? → knowledge-ontology cluster + HITL); the LLM tier needs a hard budget cap or it eats latency/cost; eventual consistency of revisions means consumers must handle "the answer changed"; evaluation needs real WSD/entity-linking metrics (precision/recall, accuracy@k) and a labelled set, or quality is unknowable.
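The evaluation metrics named above can be computed as follows. This is a sketch using the usual WSD conventions (precision over attempted spans, recall over all gold spans, abstentions allowed); the `evaluate` function, the span IDs, and the sense labels are invented for the example and do not come from a real labelled set.

```python
def evaluate(gold, predictions, k=3):
    """gold: {span_id: sense}; predictions: {span_id: ranked list of senses}.

    An empty list means the resolver abstained on that span.
    """
    answered = {span: ranked for span, ranked in predictions.items() if ranked}
    top1 = sum(1 for span, ranked in answered.items()
               if gold.get(span) == ranked[0])
    at_k = sum(1 for span, ranked in answered.items()
               if gold.get(span) in ranked[:k])
    precision = top1 / len(answered) if answered else 0.0
    recall = top1 / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        f"accuracy@{k}": at_k / len(gold) if gold else 0.0,
    }


# Toy labelled set: four spans with known senses.
gold = {"s1": "A", "s2": "B", "s3": "C", "s4": "D"}
predictions = {
    "s1": ["A"],       # correct at rank 1
    "s2": ["X", "B"],  # correct only at rank 2
    "s3": [],          # abstained
    "s4": ["Y"],       # wrong
}
metrics = evaluate(gold, predictions)
print(metrics)
```

Separating precision from recall matters for a tiered resolver, because abstaining on hard spans raises precision while lowering recall, and accuracy@k shows how often the right sense was at least among the alternates.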

Open questions

Sense inventory store: extend the glossary format, or model senses natively as SKOS concepts + ontology individuals from day one?
Per-stream state lifetime / eviction policy for long-lived streams.
Is the revision hysteresis global, or per-surface-learned?
Where's the line between this service and the existing ontology "sieve" (Fuseki/SHACL)?