Verification Levels — spike plan for L3-DL, L5 (Alloy), L6 (Z3)¶
Companion to the roadmap in obsidian/labs/AgentArmyLabs/Factory-Loop-Test-Infrastructure.md
(north-star Levels 1–6). This note scopes the later levels so we can stand each up
when — and only when — a concrete invariant needs it, rather than speculatively.
Where we are (2026-05-25)¶
| L | Level | Status |
|---|---|---|
| 1 | Syntactic (JSON-Schema) | ~ validator does this ad-hoc |
| 2 | OntoUML/UFO (validator rules) | ~ partial |
| 3 | Semantic (OWL) | ✅ structural — tools/modelgen/owl_check.py (pure rdflib) in CI |
| 4 | Constraint (SHACL) | ✅ shipped (#123) — pyshacl in CI |
| 5 | Structural (Alloy) | ◻ later — this note |
| 6 | Arithmetic/temporal (Z3) | ◻ later — this note |
Each level is a projection of the IR (model/middle-core/model.yaml) plus a checker.
The generator already emits the L3/L4 projections (model-runtime.owl.ttl,
model-runtime.shacl.ttl) next to the RDF ABox fixture (model-runtime.fixture.ttl).
The pattern for every new level is identical: add an emitter to
generate_middle_core.py, add a checker under tools/modelgen/, gate it in
middle-core-model.yml, and let the drift gate lock the new artifact to the model.
L3-DL — full OWL 2 DL reasoner (upgrade to today's structural check)¶
What today's check does NOT do. owl_check.py is a pure-rdflib structural pass
over the OWL RL subset the projection encodes: it catches an individual asserted into two
disjoint classes, and object-property triples that violate rdfs:domain/rdfs:range
(honouring the subclass hierarchy). It does not prove class satisfiability
(is some class necessarily empty given the axioms?) or subsumption (does class A
necessarily imply class B?). Those need a real DL reasoner.
Trigger to build it. The first time the model gains axioms whose consistency is not
obvious by inspection — e.g. an object type that is simultaneously a subclass of two
classes declared disjoint (an unsatisfiable class), property characteristics
(owl:FunctionalProperty, owl:inverseOf, transitive roles), or cardinality
restrictions (owl:maxCardinality) that can contradict each other. Reification
(relators: + roles, see [[Reification-and-Hyperedges]]) is the likely first source:
role-filler constraints + relator disjointness are exactly the shape where a silent
unsatisfiable class can slip in.
How. owlready2 (bundles the HermiT and Pellet reasoners; needs a JVM — Java 24 is
already on the dev box and actions/setup-java in CI). Load model-runtime.owl.ttl,
call sync_reasoner(), and fail if Nothing gains subclasses (unsatisfiable classes)
or if the ontology is inconsistent. Keep it a separate CI step / nightly — it is
slow (JVM warmup + reasoning) relative to the per-PR rdflib pass, which stays as the
fast gate. Alternative without a JVM: the ELK reasoner via ROBOT, but ELK only covers
OWL 2 EL — fine for subsumption, not for the full DL the cardinality cases need.
Effort. S–M. The emitter already produces the TBox; the work is the reasoner step + JVM provisioning + a deliberately-unsatisfiable fixture to prove the gate bites.
L5 — Alloy (structural, finite-scope)¶
What it proves that L3/L4 can't. "Can a model instance like this even exist?" Alloy searches a finite scope for a structure satisfying all constraints (or a counterexample to an asserted property). It is the right tool for relational invariants that are awkward in OWL/SHACL: "every relator binds exactly the roles its type declares", "no role is played by two fillers in the same relator", "the state-machine transition relation has no unreachable state and no sink that isn't terminal". These are existence/reachability questions, not per-instance data checks.
Trigger. When state machines or relators grow structural invariants we want to prove hold for all small instances, not just the one fixture — i.e. when a hand-written fixture stops being convincing evidence. The state-machine reachability check is the cheapest first win and is model-derived.
How. Emit an .als projection from model.yaml: sig per object type / relator,
facts for cardinalities and role bindings, pred/assert + check ... for N for the
invariants. Run headless via the Alloy CLI (java -jar org.alloytools.alloy.dist.jar
exec model-runtime.als). On-demand / nightly, not per-PR (scope search is exponential).
Counterexamples are the payload — Alloy prints the offending instance.
Effort. M. New emitter (alloy_model() in the generator) + a checker that parses the
CLI result. The transition-table and role-binding facts are mechanical from the IR.
L6 — Z3 / SMT (arithmetic & temporal)¶
What it proves that L5 can't. Alloy is finite-scope and relational; it is weak at
unbounded arithmetic and temporal/ordering math. Z3 handles: cardinality
arithmetic that must hold for any N ("sum of allocations ≤ capacity"), time-window
logic (bitemporal valid_from ≤ valid_to, no overlapping validity for the same key —
see [[RT5-PIN-S2-bitemporal-queries]]), ordering/scheduling, and the "time-series
pulse" alignment invariant (every reified edge shares a common temporal axis).
Trigger. When the model carries numeric or temporal constraints whose correctness is arithmetic, not structural — concretely, the bitemporal validity rules and any capacity/allocation cardinality once reification lands. This is the natural partner to the bitemporal spike work already in this folder.
How. Emit SMT-LIB2 (or build constraints via z3-solver Python bindings, pure-pip,
no JVM) from the IR's cardinality/temporal facts; check-sat per asserted invariant,
report any sat model that witnesses a violation (or unsat of the negation as proof).
On-demand. z3-solver is the lightest of the three to provision (no Java).
Effort. M. Emitter for the arithmetic/temporal slice + a Z3 driver. Scope it to the bitemporal + cardinality invariants first; don't try to encode the whole model.
Sequencing recommendation¶
- L3-DL when reification introduces the first non-obvious satisfiability question (likely next, and the emitter already exists — lowest marginal cost).
- L6 Z3 alongside bitemporal/cardinality enforcement (pure-pip, no JVM, and there is already concrete temporal invariant work to anchor it).
- L5 Alloy when we want all-small-instances structural proofs (state-machine reachability is the cheap first target).
Hold the line from the roadmap: don't stand up a prover until a concrete invariant needs it. Each level above is a projection + checker + CI step + drift-locked artifact — the same four moves L3 and L4 already made.