The Ontology Pipeline

Turn unstructured documents into proven, queryable, provenance-tracked knowledge — not plausible LLM extractions. This is the implemented pattern; the decisions behind it are ARC-ADR-030 / 032 / 033 / 019 / 016; the broader vision is the Labs north-star (obsidian/labs/AgentArmyLabs/Ontology-Pipeline.md).

The pattern in one line

The LLM proposes; the formal layer disposes. A concept reaches the canonical graph only when it is proven — schema-valid, anti-pattern-free, reasoner-consistent across a dual gUFO+BFO grounding, and SHACL-conformant — and every step is traceable back to its source. Snapped, not plausible.

What a document becomes: four substrates

document (.docx/.pdf/.txt)
  ├─ ingest ───────────────▶ VECTOR INDEX     (ArcadeDB LSM_VECTOR, Cohere embed-v-4-0)
  └─ sift (Cerebras) ─┬────▶ HOLOGRAPHIC LPG  (ArcadeDB — every candidate + lifecycle state)
                      ├────▶ MID-LEVEL MAP    (proven concepts aligned to business-mid.ttl)
                      └────▶ CANONICAL RDF     (Fuseki — only the PROVEN, + PROV-O lineage)

| Substrate | Store | Holds | Role |
|---|---|---|---|
| Vector index | ArcadeDB `LSM_VECTOR` | doc chunks + 1536-d embeddings | semantic retrieval (RAG) |
| Holographic LPG | ArcadeDB graph | every candidate + per-level results + `state` | the working/staging graph; quarantine is a `state`, not a separate store |
| Canonical RDF | Fuseki (TDB2, `/knowledge`) | only proven gUFO+BFO triples + lineage + mappings | the authoritative knowledge graph |
| Mid-level map | RDF (in the canonical graph) | proven concepts aligned to a shared vocabulary | cross-document integration layer |

The vector index and the ontology graph trace to the same source document, so a consumer can pivot concept → source span → chunk.
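The pivot from a concept's source span to the chunks that contain it can be pictured with a small sketch. It assumes both spans and chunks carry character offsets into the same source document; the `Chunk` and `SourceSpan` types, their field names, and the sample values are hypothetical and are not taken from the pipeline's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    source_doc: str
    start: int  # inclusive character offset (assumed convention)
    end: int    # exclusive character offset (assumed convention)


@dataclass(frozen=True)
class SourceSpan:
    source_doc: str
    start: int
    end: int


def chunks_for_span(span: SourceSpan, chunks: list[Chunk]) -> list[Chunk]:
    """Return the chunks of the same document whose range overlaps the span."""
    return [
        c for c in chunks
        if c.source_doc == span.source_doc
        and c.start < span.end
        and span.start < c.end
    ]


# Hypothetical data: two overlapping chunks and one concept span.
chunks = [Chunk("c0", "tiger.txt", 0, 500), Chunk("c1", "tiger.txt", 450, 950)]
span = SourceSpan("tiger.txt", 470, 490)
hits = [c.chunk_id for c in chunks_for_span(span, chunks)]
```

Because chunks may overlap, a single span can resolve to more than one chunk, so the lookup returns a list rather than a single hit.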

The discipline: propose → sift → snap | quarantine

The sift ladder (ARC-ADR-032):

| Level | Gate | Proves |
|---|---|---|
| L1 | JSON-Schema on the IR fragment | well-formed, valid stereotypes |
| L2 | OntoUML anti-pattern check | role bindings reference declared entities |
| L3 | OWL reasoner closure (owlrl) | gUFO and BFO classifications agree (an Event grounded to a Continuant is inconsistent, not merely odd) |
| L4 | SHACL conformance (pyshacl) | relator under-mediation and other shape rules |
| (prod) | the Fuseki sieve (`sieve.sh`) | an independent SHACL re-check before promotion |

Only an all-green candidate snaps (projects to canonical RDF). Anything that can't be proven within the repair budget lands in quarantine — retained with its full violation report, never auto-promoted.
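The control flow of the ladder can be sketched in a few lines of Python. This is a minimal illustration, not the code in `sift_engine.py`: the gate bodies are simple stand-ins for JSON-Schema, the anti-pattern check, owlrl, and pyshacl, and the fragment layout (`entities`, `role_bindings`, `gufo`, `bfo`, `mediates`) is assumed. L1 short-circuits, L2 to L4 accumulate their violations, and the repair budget is not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Union

# A gate returns a list of violation messages; an empty list means "green".
Gate = Callable[[dict], list[str]]


@dataclass(frozen=True)
class Snapped:
    fragment: dict


@dataclass(frozen=True)
class Quarantined:
    fragment: dict
    violations: list[str]


Outcome = Union[Snapped, Quarantined]


def l1_schema(fragment: dict) -> list[str]:
    # Stand-in for JSON-Schema validation of the IR fragment.
    return [] if "entities" in fragment else ["L1: missing 'entities'"]


def l2_antipatterns(fragment: dict) -> list[str]:
    # Stand-in: every role binding must reference a declared entity.
    declared = {e["id"] for e in fragment["entities"]}
    return [
        f"L2: role binding references undeclared entity {ref!r}"
        for ref in fragment.get("role_bindings", [])
        if ref not in declared
    ]


def l3_grounding(fragment: dict) -> list[str]:
    # Stand-in for reasoner closure: gUFO and BFO groundings must agree.
    return [
        f"L3: {e['id']} is a gUFO Event grounded to a BFO Continuant"
        for e in fragment["entities"]
        if e.get("gufo") == "Event" and e.get("bfo") == "Continuant"
    ]


def l4_shapes(fragment: dict) -> list[str]:
    # Stand-in for SHACL: a relator that mediates fewer than two entities.
    return [
        f"L4: relator {e['id']} is under-mediated"
        for e in fragment["entities"]
        if e.get("gufo") == "Relator" and len(e.get("mediates", [])) < 2
    ]


def sift(fragment: dict) -> Outcome:
    # L1 short-circuits: later gates assume a well-formed fragment.
    l1 = l1_schema(fragment)
    if l1:
        return Quarantined(fragment, l1)
    # L2 to L4 all run, so the violation report is complete.
    violations: list[str] = []
    for gate in (l2_antipatterns, l3_grounding, l4_shapes):
        violations.extend(gate(fragment))
    return Quarantined(fragment, violations) if violations else Snapped(fragment)
```

Accumulating L2 to L4 rather than stopping at the first failure is what lets a quarantined candidate carry its full violation report.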

Mid-level mapping and the evidence ladder

A proven concept is still document-local (one doc's "Rumor" ≠ another's). The mid-level mapper aligns it to a shared vocabulary (business-mid.ttl, ~40 business/epistemic/risk classes) so the fleet integrates across documents.

Primary signal — embedding cosine in the same Cohere space the RAG index uses (a measurable number), gated by gUFO archetype compatibility: a reified relation may only map to a relator class; an occurrent never to an object.
Thresholds → skos:exactMatch / skos:closeMatch / rdfs:subClassOf.
Below the floor → escalate to the gateway model, and its verdict is cited as PROV-O provenance (sift:citedSource "cerebras:zai-glm-4.7"). A decision a measurable signal couldn't make is handed to a model — and the model is named.

Every mapping records its method, cosine, embed model, and any cited source. The "propose / dispose" rule holds here too: the model only picks among cosine-ranked candidates; it cannot invent a class.
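The mapping decision can be sketched as follows. The threshold values, the archetype compatibility table, and the class names are hypothetical, since this page does not state the real ones; the sketch also assumes that higher cosine bands select the stronger relation. The point is the shape of the decision: filter by archetype, rank by cosine, pick a relation by threshold, and below the floor hand the model only the ranked candidates.

```python
import math

# Hypothetical thresholds; the real values are not given on this page.
EXACT, CLOSE, FLOOR = 0.90, 0.80, 0.65

# Hypothetical compatibility table: concept archetype -> allowed class archetypes.
COMPATIBLE = {
    "relator": {"relator"},
    "event": {"event"},
    "object": {"object"},
}


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_concept(vec: list[float], archetype: str,
                mid_classes: list[tuple[str, str, list[float]]]) -> dict:
    """mid_classes holds (class IRI, archetype, embedding) triples."""
    allowed = COMPATIBLE.get(archetype, set())
    ranked = sorted(
        ((cosine(vec, class_vec), iri)
         for iri, class_archetype, class_vec in mid_classes
         if class_archetype in allowed),
        reverse=True,
    )
    if not ranked:
        return {"method": "no_candidate"}
    score, iri = ranked[0]
    if score >= EXACT:
        relation = "skos:exactMatch"
    elif score >= CLOSE:
        relation = "skos:closeMatch"
    elif score >= FLOOR:
        relation = "rdfs:subClassOf"
    else:
        # Below the floor: the model may only choose among these candidates.
        return {"method": "escalate", "cosine": score,
                "candidates": [candidate for _, candidate in ranked]}
    return {"method": "cosine", "relation": relation, "target": iri, "cosine": score}
```

Returning the ranked candidate list on escalation keeps the "propose / dispose" rule intact: whatever the model answers can be checked against that list before it is recorded.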

Functional core, imperative shell

The same project + sift logic exists in two forms (ARC-ADR-033):

Imperative shell — Python (backend-core): all the IO — HTTP, Cerebras gateway, Cohere embeddings, ArcadeDB, Fuseki. The live /api/v1/ontology/* routes.
Functional core — F# (tools/ontology-sift/fsharp): the provable transformations. The category theory pays rent here: the IR is a coproduct (discriminated union), project is a functor (a catamorphism folding the fragment to triples), the ladder is a Result monad (L1 short-circuit) into a Validation applicative (L2–L4 accumulate), and the outcome is the coproduct Snapped | Quarantined. With FS0025-as-error, adding a stereotype case fails the build until every projection handles it — "snapped, not plausible" enforced by the compiler, not by tests. See tools/ontology-sift/fsharp/README.md.
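For readers more at home in Python than F#, the coproduct-plus-projection idea can be approximated like this. The three node types and the emitted triples are illustrative only (including the hypothetical `ex:mediates` predicate) and are not the pipeline's actual IR or projection. Python also cannot reproduce the compile-time guarantee: the final `raise` is only a runtime analogue of what FS0025-as-error enforces at build time.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Kind:
    id: str
    label: str


@dataclass(frozen=True)
class Event:
    id: str
    label: str


@dataclass(frozen=True)
class Relator:
    id: str
    label: str
    mediates: tuple[str, ...]


# The IR node as a coproduct: exactly one of these cases.
Node = Union[Kind, Event, Relator]
Triple = tuple[str, str, str]


def project_node(node: Node) -> list[Triple]:
    """Map one IR node to triples; every case must be handled."""
    if isinstance(node, Kind):
        return [(node.id, "rdf:type", "gufo:Kind"),
                (node.id, "rdfs:label", node.label)]
    if isinstance(node, Event):
        return [(node.id, "rdfs:subClassOf", "gufo:Event"),
                (node.id, "rdfs:label", node.label)]
    if isinstance(node, Relator):
        return ([(node.id, "rdfs:subClassOf", "gufo:Relator"),
                 (node.id, "rdfs:label", node.label)]
                + [(node.id, "ex:mediates", target) for target in node.mediates])
    # Runtime stand-in for the exhaustiveness check the F# compiler performs.
    raise TypeError(f"unhandled IR case: {type(node).__name__}")


def project(fragment: list[Node]) -> list[Triple]:
    """Fold the whole fragment into one flat list of triples."""
    return [triple for node in fragment for triple in project_node(node)]
```

Adding a fourth case to `Node` without extending `project_node` fails here only when such a node is actually projected, which is exactly the gap the F# core closes at build time.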

Provenance, end to end

PROV-O threads the whole chain, so the canonical commit is auditable:

concept ─ wasDerivedFrom ─▶ source span
concept ─ wasGeneratedBy ─▶ propose activity (proposer + prompt hash)
concept ─ skos:exactMatch / skos:closeMatch / rdfs:subClassOf ─▶ mid:Class
   └─ wasGeneratedBy ─▶ map activity (method, cosine, embed model, citedSource?)
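As a rough illustration, the lineage can be written out as N-Triples using only the standard library. In PROV-O the derived entity is the subject of `prov:wasDerivedFrom`, so the concept points at its source span. The `sift:` namespace IRI, the `method`, `cosine`, and `embedModel` property names, the separate mapping node, and all example IRIs are assumptions made for this sketch; only `sift:citedSource` is named on this page.

```python
import json
from typing import Optional

PROV = "http://www.w3.org/ns/prov#"
SIFT = "https://example.org/sift#"  # hypothetical namespace IRI
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def iri(value: str) -> str:
    return f"<{value}>"


def literal(value: object, datatype: Optional[str] = None) -> str:
    # json.dumps handles quote and backslash escaping for simple strings.
    text = json.dumps(str(value), ensure_ascii=False)
    return f"{text}^^<{datatype}>" if datatype else text


def lineage_ntriples(concept: str, span: str, propose_activity: str,
                     mapping: str, map_activity: str, *,
                     method: str, cosine: float, embed_model: str,
                     cited_source: Optional[str] = None) -> list[str]:
    """Emit the provenance chain for one concept and one mapping."""
    triples = [
        (iri(concept), iri(PROV + "wasDerivedFrom"), iri(span)),
        (iri(concept), iri(PROV + "wasGeneratedBy"), iri(propose_activity)),
        (iri(mapping), iri(PROV + "wasGeneratedBy"), iri(map_activity)),
        (iri(map_activity), iri(SIFT + "method"), literal(method)),
        (iri(map_activity), iri(SIFT + "cosine"), literal(cosine, XSD_DOUBLE)),
        (iri(map_activity), iri(SIFT + "embedModel"), literal(embed_model)),
    ]
    if cited_source is not None:
        # Only escalated mappings cite a model as their source.
        triples.append(
            (iri(map_activity), iri(SIFT + "citedSource"), literal(cited_source))
        )
    return [f"{s} {p} {o} ." for s, p, o in triples]
```

A production implementation would more likely build the graph with an RDF library; plain strings are used here only to keep the sketch dependency-free.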

Run it

| Want | How |
|---|---|
| See the whole thing on a real doc | `backend-core/notebooks/ontology_pipeline_e2e.ipynb` → Run All (needs backend-core `:8000` + Fuseki `:3030`) |
| Call the pipeline | `POST /api/v1/ontology/pipeline` `{ "source_text", "source_doc", "proposer": "gateway" }` |
| Prove the F# core | `dotnet run --project tools/ontology-sift/fsharp` (parity doctor) |
| Offline, no LLM | `tools/ontology-sift/doctor.py` (fixture proposer; deterministic) |
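A minimal client for the pipeline route might look like the sketch below, using only the standard library. The `localhost` host, the timeout, and the helper names are assumptions; the route and the three body fields come from the table above. The response schema is not described on this page, so the sketch returns the decoded JSON untouched rather than guessing at its fields.

```python
import json
import urllib.request


def build_pipeline_request(source_text: str, source_doc: str,
                           proposer: str = "gateway",
                           base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) the POST request for the pipeline route."""
    body = json.dumps({
        "source_text": source_text,
        "source_doc": source_doc,
        "proposer": proposer,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/ontology/pipeline",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call_pipeline(source_text: str, source_doc: str, **kwargs) -> dict:
    """Send the request; requires backend-core to be running."""
    request = build_pipeline_request(source_text, source_doc, **kwargs)
    # The timeout is an arbitrary choice for this sketch.
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)
```

Splitting request construction from sending makes the payload easy to inspect or test without a live backend.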

Where it lives

backend-core/app/ontology/ — sift_engine.py, proposer.py, midlevel.py, pipeline.py, arcade_schema.py, api.py; discipline/ (IR schema, gUFO/BFO-lite TTLs, SHACL shapes, midlevel/business-mid.ttl).
tools/ontology-sift/ — the Python reference + offline doctor + spikes/ + fsharp/ (the compiler core).
Contract: backend-core/contracts/backend-core.openapi.json (/api/v1/ontology/*).

Status

Implemented and proven live on the "Three makes a tiger" document (2026-05-29): 57 chunks → vector index; one fragment snapped (all four gates green); 11 concepts mapped (3 exact / 4 close / 4 escalated + cited zai-glm-4.7); 183 triples in Fuseki. Shipped in backend-core #130 and hub #336.