Persistence & Time¶
A research/reference page for how the untool.ai platform persists data, how it stamps time, and how it lets the same facts be re-shaped for many access patterns without ever rewriting the source of truth.
This page is the persistence + time companion to the ontology and coordination pages. It covers the temporal stance (ADR-042), the warehouse methodology (Data Vault 2.1, ADR-026), the wire format (Apache Arrow, ADR-009), the pace-layered projection model (ADR-041), the unified process-and-time architecture (ADR-038), and the hybrid object query system (ADR-055) — plus the supporting standards: Snodgrass bitemporal SQL, ISO SQL:2011 system-versioned tables, Kimball dimensional modeling, Arrow, Parquet, and W3C PROV-O.
One-line summary
Stamp every record once at write time, store it in whichever shape the access pattern needs, and let pace-layered projections rebuild the fast layers from the slow source of truth. Time is a contract, not a store; the data vault is the slow layer; information marts and the operational graph are the fast layers.
1. The temporal stance — "stamp first, store by access pattern"¶
ARC-ADR-042 is the supreme law of time on this platform. It answered an audit question that surfaces on every platform that grows past one database: "now that we have a time standard, do we need a dedicated time-series engine?" The answer was a polite, evidence-backed no.
1.1 The reframe¶
"This was never a time-series store decision. It is a time-stamping decision." — the five-seat panel that ratified ADR-042 (2026-05-30)
A time standard exists to decouple order from storage. Once every record carries a trustworthy temporal envelope at the moment it is born, the choice of where the bytes live becomes a cheap two-way door. The store may change; the time stays correct.
1.2 The trade-space — bitemporal as the baseline¶
The intellectual backbone is the bitemporal model from Richard Snodgrass's Developing Time-Oriented Database Applications in SQL (Morgan Kaufmann, 1999) and the later work of C. J. Date, Hugh Darwen, and Nikos Lorentzos in Temporal Data and the Relational Model (Morgan Kaufmann, 2003). Two axes are kept separate and never collapsed:
| Axis | Meaning | Synonyms |
|---|---|---|
| Transaction time | When the system knew the fact (the row's birth, never edited) | system time, recorded-at |
| Valid time | When the fact is true in the world (independent of when we learned it) | application time, business time, world time |
This baseline is now codified in ISO SQL:2011 as system-versioned tables (transaction time) and application-time period tables (valid time). Engines implement subsets: SQL Server covers the system-versioned side natively (SYSTEM_VERSIONING = ON), while PostgreSQL deployments commonly approximate application-time periods with the tstzrange + EXCLUDE … WITH && pattern.
ADR-042 layers two more axes on top of the bitemporal pair for distributed-system honesty:
- Decision time — when a human or agent made the judgment that produced this assertion (separate from when we recorded it).
- Process time — when a workflow step executed (separate again — DBOS-durable runs inherit a different clock than the actor that scheduled them).
A Hybrid Logical Clock (HLC, from Kulkarni et al., Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases, 2014) gives causal order across containers without requiring tight wall-clock sync. NTP/chrony is the physical baseline; the HLC is the ordering contract.
Why bitemporal first, store choice second
Wall-clock skew can invert causation permanently. If you persist before the HLC seam is wired, every row carries skew-prone time forever and the audit trail is unrecoverable. The one near-irreversible move in ADR-042 is wiring the stamp before the first persist. Everything else is reversible.
1.3 Three classes, three homes¶
ADR-042 split "time data" into three classes that are routinely conflated and routes each to the store that already owns its access pattern:
| Class | What | Home | Why |
|---|---|---|---|
| A — Ops time | metrics, span latency, skew SLI | OTel / Prometheus / Tempo / Loki (ARC-ADR-010) | This is a TSDB. Never a domain store. |
| B — Semantic bitemporal ledger | pinned elements, relators, valid + transaction time | append-only ledger behind the IPinStore / ISerializationClock seam | The actual contested store. Decided on a measured trigger. |
| C — Value/drift telemetry | pace-layer ICEs (Information Content Entities), drift notes | NATS JetStream log → materialized view | An ordered causal event log. |
The crucial discipline is the anti-dual-write invariant: the temporal envelope is written once by the stamping authority; every other store copies only the slice it owns and never back-writes another store's slice. This is enforced as a contract row, not folklore.
The append-only invariant is enforced with an insert-only DB role, not developer
discipline. Every world-time assertion is written once, never UPDATEd, and closed by
superseded_at. Violate it once and the audit trail is unrecoverable.
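The two-axis, insert-only discipline can be illustrated with a small in-memory sketch. This is not the platform's IPinStore implementation: the `Assertion` and `Ledger` names are hypothetical, and the sketch assumes supersession is derived at read time (the most recently recorded matching row wins) instead of being stored in a superseded_at column.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a written row can never be mutated
class Assertion:
    key: str
    value: str
    valid_from: datetime            # valid time: when the fact holds in the world
    valid_to: Optional[datetime]    # None means "until further notice"
    recorded_at: datetime           # transaction time: when the system learned it


class Ledger:
    """Insert-only store: rows are appended, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: List[Assertion] = []

    def append(self, row: Assertion) -> None:
        self._rows.append(row)

    def as_of(self, key: str, valid_at: datetime, known_at: datetime) -> Optional[Assertion]:
        """What did we believe at known_at about the world at valid_at?"""
        candidates = [
            r for r in self._rows
            if r.key == key
            and r.recorded_at <= known_at
            and r.valid_from <= valid_at
            and (r.valid_to is None or valid_at < r.valid_to)
        ]
        # Assumption: the latest-recorded matching row supersedes earlier ones.
        return max(candidates, key=lambda r: r.recorded_at, default=None)


def utc(y: int, m: int, d: int) -> datetime:
    return datetime(y, m, d, tzinfo=timezone.utc)


ledger = Ledger()
ledger.append(Assertion("sku-1/price", "10", utc(2026, 1, 1), None, utc(2026, 1, 10)))
# A retroactive correction arrives later; the original row stays untouched.
ledger.append(Assertion("sku-1/price", "12", utc(2026, 1, 1), None, utc(2026, 2, 5)))

before = ledger.as_of("sku-1/price", utc(2026, 1, 15), known_at=utc(2026, 1, 20))
after = ledger.as_of("sku-1/price", utc(2026, 1, 15), known_at=utc(2026, 2, 10))
print(before.value, after.value)
```

The same valid-time question gets two answers depending on transaction time, and both remain reproducible because neither row was ever edited.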
2. Data Vault 2.1 — the slow layer of the warehouse¶
ARC-ADR-026 adopts Data Vault 2.1 (Daniel Linstedt, Michael Olschimke, Building a Scalable Data Warehouse with Data Vault 2.0, Morgan Kaufmann 2015 + the 2.1 supplement from the Data Vault Alliance) as the enterprise warehouse methodology for every analytics-bearing spoke. The methodology is loader-agnostic; the reference loader is dbt + Datavault4dbt.
2.1 The three layers¶
SOURCE SYSTEMS (CRM, billing, events, files, APIs)
│ ELT (dlt, Fivetran, Kafka Connect, custom)
▼
┌─────────────────┐
│ Raw Vault │ insert-only · history-preserving · hash-keyed
│ hubs/links/sat │ load_date + record_source on every row
└────────┬────────┘
│ business rules in dbt models
▼
┌─────────────────┐
│ Business Vault │ same DV constructs · _bv suffix
│ same-as · PIT │ computed sats · bridge tables
└────────┬────────┘
│ virtualize where possible, materialize where SLAs demand
▼
┌─────────────────┐
│ Information │ star · snowflake · OBT · graph
│ Marts │ shaped per consumer
└─────────────────┘
2.2 The three constructs¶
| Construct | Holds | Audit columns | Hash columns |
|---|---|---|---|
| Hub | a unique business key (e.g. customer_id) and the moment it first appeared | load_date, record_source | <hub>_hk (SHA-256 of the canonical key) |
| Link | the association between two or more hubs (e.g. order_customer) | load_date, record_source | <link>_hk (SHA-256 of concatenated hub keys) |
| Satellite | descriptive context for a hub or link, change-tracked over time | load_date, load_end_date, record_source, hash_diff | <sat>_hk (inherits parent), <sat>_hd (hash of payload) |
A hub is the noun. A link is the verb. A satellite is the adjective stream over time. That separation — and only that separation — gives the warehouse its three superpowers: schema-on-write integration, full audit by default, and parallel idempotent loads.
2.3 Raw vault vs business vault¶
- Raw vault is sacred. Insert-only. Never edited. Never deleted. It is the integration layer — its job is to faithfully record what each source said, when it said it, exactly as it was said. No business rules. No transforms beyond hash-keying and column passthrough.
- Business vault is where interpretation lives. Same hub/link/sat constructs with a _bv suffix, but the columns are computed. Same-as links resolve identity. PIT (Point-In-Time) tables snapshot the "current as of T" state. Bridge tables flatten link-chains for query.
This split is the warehouse equivalent of ADR-041's pace separation: raw vault is the slow layer (changes only when the source schema changes), business vault is the medium-fast layer (changes when interpretation changes), information marts are the fast layer (change when the consumer's question changes).
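As an illustration of the business-vault PIT idea, the sketch below computes, for one snapshot date, which satellite row is current for each hub key. It assumes a simplified satellite of (hub hash key, load_date) pairs; `build_pit` and the sample keys are hypothetical names, and a real loader such as Datavault4dbt generates this in SQL.

```python
from datetime import date
from typing import Dict, Iterable, Tuple

SatRow = Tuple[str, date]  # (hub hash key, satellite load_date)


def build_pit(sat_rows: Iterable[SatRow], snapshot: date) -> Dict[str, date]:
    """For each hub key, point at the latest satellite row loaded on or before snapshot."""
    pit: Dict[str, date] = {}
    for hub_hk, load_date in sat_rows:
        if load_date <= snapshot and (hub_hk not in pit or load_date > pit[hub_hk]):
            pit[hub_hk] = load_date
    return pit


sat_customer = [
    ("hk_a", date(2026, 1, 5)),
    ("hk_a", date(2026, 3, 1)),
    ("hk_b", date(2026, 2, 10)),
]

pit_feb = build_pit(sat_customer, date(2026, 2, 15))
pit_mar = build_pit(sat_customer, date(2026, 3, 15))
print(pit_feb)
print(pit_mar)
```

The PIT holds pointers (key plus load_date), not payloads, so it can be dropped and rebuilt from the raw vault at any time.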
2.4 Why DV fits a sandboxed-spoke fleet¶
The fleet's spokes are sandboxed — each runs in its own container with no shared filesystem, no cross-mount, and writes only through contracts. Data Vault's parallelism is the exact match: every producer writes its own satellite for its hub/link participation, keyed by SHA-256 of the business key. There is no shared sequence, no surrogate-key coordination, no write-side contention. Two spokes loading "the same" customer record from two different source systems land in two different satellites attached to the same hub. Identity reconciliation happens later, in the business vault, via same-as links.
Same-as links — the identity-resolution primitive
A same_as_link records the assertion "hub X in CRM and hub Y in billing are the same
real-world entity." It is itself a hub-pair with a satellite carrying the rule that
produced the assertion, a confidence, and a valid-time window. Identity is declared,
not assumed, and the declaration is itself bitemporal.
2.5 Streaming as a first-class load shape¶
DV 2.1 (vs 2.0) formalizes streaming. Late-arriving keys, micro-batch and continuous loads, and idempotency under retry are first-class methodology concerns. The platform's NATS JetStream bus (ARC-ADR-022) is the streaming substrate; CloudEvents v1.0 is the envelope; an idempotency key (the message ID) plus the hub's hash key together make every load deterministic on replay.
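A minimal sketch of replay-safe loading, assuming an in-memory vault and a message tuple of (message ID, business key, payload). `ReplaySafeLoader` is a hypothetical name, and the hashing here is deliberately simplified (upper-cased key, raw payload) compared with the canonical rules in section 3.

```python
import hashlib
from typing import Dict, List, Set, Tuple


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ReplaySafeLoader:
    def __init__(self) -> None:
        self.seen_message_ids: Set[str] = set()
        self.hub: Dict[str, str] = {}                 # hub_hk -> business key
        self.sat: List[Tuple[str, str, str]] = []     # (hub_hk, hash_diff, payload)
        self._latest_diff: Dict[str, str] = {}        # hub_hk -> last hash_diff

    def load(self, message_id: str, business_key: str, payload: str) -> bool:
        """Return True if the message changed the vault, False if it was a no-op."""
        if message_id in self.seen_message_ids:
            return False                              # redelivery: skip entirely
        self.seen_message_ids.add(message_id)

        hub_hk = sha256_hex(business_key.upper())
        changed = hub_hk not in self.hub
        self.hub.setdefault(hub_hk, business_key)     # create the hub on first sight

        hash_diff = sha256_hex(payload)
        if self._latest_diff.get(hub_hk) != hash_diff:
            # Only a changed payload produces a new satellite version.
            self.sat.append((hub_hk, hash_diff, payload))
            self._latest_diff[hub_hk] = hash_diff
            changed = True
        return changed


loader = ReplaySafeLoader()
stream = [("m1", "cust-42", "tier=gold"), ("m2", "cust-42", "tier=platinum")]
for msg in stream + stream:   # the second pass simulates a full replay
    loader.load(*msg)
print(len(loader.hub), len(loader.sat))
```

Both guards matter in this sketch: without the message-ID check, replaying m1 after m2 would look like a payload change and append a spurious third satellite row.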
3. Hash keys & hash diffs¶
tools/data-vault/hash.mjs and its Python sibling implement the
canonical hashing rules. Hash-based business keys are the rule, not surrogate sequences,
for four reasons:
- Parallel load. A SHA-256 of the canonical business key is computable independently in every loader, on every node, with no central coordination. Surrogate sequences require a lock-step counter, which is the antithesis of horizontal scale.
- Idempotency. Reloading the same row produces the same hash, so dedupe is implicit. With sequences, replay creates a new surrogate every time and you need an out-of-band reconcile.
- Cross-system referential integrity. Two systems hashing the same business key produce the same hub key, with no negotiation. Sequences can't do this without a central registry.
- Reproducibility. A reload from cold storage produces the same warehouse, byte for byte (modulo load_date). With sequences, the warehouse is path-dependent.
3.1 SHA-256 over MD5¶
DV 2.0 sometimes specified MD5. DV 2.1 (and ADR-026) require SHA-256:
- MD5 is cryptographically broken (collision attacks since 2004). For business keys this rarely matters in practice, but the cost differential is tiny and the collision surface matters at planet-scale warehouses.
- SHA-256 is 64 hex characters (LENGTH 64), exactly 2× MD5's 32. On a billion-row hub the extra 32 bytes per row is ~32 GB, which on cloud storage is rounding error.
- All target warehouses (Snowflake, Postgres, BigQuery, Databricks) have a native SHA-256. No vendor lock-in.
3.2 Canonical ordering, normalization, separator, null sentinel¶
A hash is only as deterministic as its input. ADR-026's hashing rules:
| Rule | Value |
|---|---|
| Field order | Canonical (alphabetical by attribute name, source-system-independent) |
| Unicode | NFC normalization before hashing |
| Separator between fields | \|\| (double pipe; configurable but consistent within a model) |
| Null sentinel | ^^ (configurable; never the empty string, which is a valid value) |
| Trimming | Trailing whitespace trimmed; leading whitespace preserved |
| Casing | UPPER() on business keys; payload casing preserved |
The hash_diff on a satellite is the SHA-256 of the payload (descriptive columns), so a
satellite load can decide "is this a new version?" in a single hash comparison. Cheap on every
warehouse engine; portable across them.
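The rules in the table can be sketched in a few lines of standard-library Python. This is an illustration of the stated rules, not the contents of tools/data-vault/hash.mjs or its Python sibling; the function names `hash_key` and `hash_diff` and the order of the normalization steps are assumptions.

```python
import hashlib
import unicodedata
from typing import Mapping, Optional

SEPARATOR = "||"       # separator between fields
NULL_SENTINEL = "^^"   # stands in for NULL; the empty string stays a real value


def _canon(value: Optional[object], upper: bool) -> str:
    if value is None:
        return NULL_SENTINEL
    # NFC first, then trim trailing whitespace only (leading is preserved).
    text = unicodedata.normalize("NFC", str(value)).rstrip()
    return text.upper() if upper else text


def _digest(fields: Mapping[str, Optional[object]], upper: bool) -> str:
    # Canonical field order: alphabetical by attribute name.
    parts = [_canon(fields[name], upper) for name in sorted(fields)]
    return hashlib.sha256(SEPARATOR.join(parts).encode("utf-8")).hexdigest()


def hash_key(business_key: Mapping[str, Optional[object]]) -> str:
    """Hub or link hash key: business keys are upper-cased before hashing."""
    return _digest(business_key, upper=True)


def hash_diff(payload: Mapping[str, Optional[object]]) -> str:
    """Satellite hash diff: payload casing is preserved."""
    return _digest(payload, upper=False)


old = hash_diff({"tier": "gold", "city": "Oslo"})
new = hash_diff({"city": "Oslo", "tier": "Gold"})
print("new satellite version needed:", old != new)
```

Because the payload keeps its casing, "gold" and "Gold" count as different versions, while the same comparison on a business key would treat them as one key.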
4. ArcadeDB — multi-model engine for the platform plane¶
The platform plane (the slow layer for live operational reads) runs on ArcadeDB, the multi-model engine that speaks document + graph + key-value + time series + vector in one process. The fleet runs it locally (Docker container in templates/local-stack/) and on Azure ACI (rg-arcadedb-test).
4.1 Why one multi-model engine, not three specialist engines¶
The naive instinct is: Neo4j for graph, Postgres for relational, Pinecone (or pgvector) for vectors. ADR-026 and ARC-ADR-055 both push back on this for the platform plane:
- Cross-store joins are the dominant query pattern for an agent stack. A search like "give me the documents semantically similar to X, in the same project, authored by anyone who has reviewed Y" spans vector + graph + relational. Three engines means three round-trips, three transaction boundaries, three failure modes.
- Operational tier budget. Per ARC-ADR-023, every stateful store is backup / HA / upgrade / monitoring surface. Three stores is three upgrade cycles and three on-call runbooks. One is one.
- Schema congruence. Multi-model means one schema (the ontology's RDF/LPG projection) maps to one engine. Multiple engines means multiple shadow schemas drift apart.
ArcadeDB's trade-off: it is good enough at every model, best in class at none. That trade is correct for the platform plane (the operational read-and-write layer for live agents). For the analytics plane, the data vault sits on a real OLAP warehouse (Snowflake / BigQuery / Databricks), and that's where star-schema marts get materialized for BI.
4.2 ADR-055 — the hybrid object query system¶
ARC-ADR-055 decided the read seam: the Universal Data Adapter (UDA) orchestrates a three-phase retrieval pipeline:
- Vector phase — semantic similarity over Chunk elements in the vector store.
- Graph phase — context expansion (e.g. fetch the chunk's Document, its parent Repository, related Decision references).
- Relational/axiomatic phase — security clearance, project scope, SHACL/OWL constraints.
The three result sets are merged via Reciprocal Rank Fusion (RRF, Cormack et al. 2009) and
hydrated into IHyperElement Object Model instances. Agents never call databases directly;
they invoke Search(QueryString) on the UDA and receive a fully entity-resolved graph
sub-network.
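Reciprocal Rank Fusion itself is small enough to sketch. Each item scores the sum of 1 / (k + rank) over every ranked list it appears in, with k = 60 as in Cormack et al. The function name, the tie-break by ID, and the sample chunk IDs are illustrative assumptions, not the UDA's actual code.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


vector_hits = ["c1", "c2", "c3"]
graph_hits = ["c2", "c1", "c4"]
relational_hits = ["c2", "c3", "c1"]

fused = reciprocal_rank_fusion([vector_hits, graph_hits, relational_hits])
print(fused)
```

RRF uses only ranks, never raw scores, which is why it can merge a cosine similarity list with graph and relational results that have no comparable score scale.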
4.3 Multi-model also means multi-temporal¶
Time fits ArcadeDB's mixed-mode posture: the bitemporal ledger (Class B from ADR-042)
lives as vertices with valid_from / valid_to / recorded_at / superseded_at properties, with
relator vertices (ARC-ADR-016)
carrying time-indexed roles. Time-series buckets are not used — that was an explicit ADR-042
decision. ArcadeDB's TS-bucket feature is its least-proven; metrics belong in Prometheus.
5. Apache Arrow — the canonical wire format¶
ARC-ADR-009 settled the type vocabulary across the Universal Data Adapter: every connector normalizes to Apache Arrow record batches at the boundary.
5.1 Why Arrow over JSON or Protobuf¶
| Dimension | Arrow | JSON | Protobuf |
|---|---|---|---|
| Representation | Columnar | Row (text) | Row (binary) |
| Zero-copy reads | Yes | No | No (deserialization required) |
| Cross-language | C++, Rust, Python, Java, Go, JS, C#, Julia | Universal text | Code-gen per language |
| Decimal / temporal precision | First-class (decimal128, timestamp(unit, tz)) | Lossy (strings) | Schema-dependent |
| Nested / repeated types | First-class (struct, list, map) | Native | Schema-dependent |
| Analytical scan cost | Optimal (vectorized) | Worst | Middling |
| Streaming chunks | Native (RecordBatch) | Manual framing | Manual framing |
The decisive driver is analytical workload shape. The UDA's primary load is bulk reads from BigQuery, Snowflake, Postgres, and Parquet object stores. JSON imposes row-by-row parse and string-typed decimals/timestamps; Protobuf imposes per-message deserialize. Arrow's columnar batches are read once, scanned vectorized, and shipped over Arrow Flight RPC with zero copy.
5.2 Arrow + ADBC + dlt¶
The connector substrate is ADBC (Arrow Database Connectivity) — the Arrow-native answer to JDBC/ODBC. Drivers exist for Postgres, BigQuery, Snowflake, DuckDB, and SQLite, all returning Arrow record batches directly without an intermediate row-tuple layer.
For ingestion, dlt (Data Load Tool) is the framework of choice. dlt pipelines emit Arrow batches to staging, where the data vault loaders pick them up. The whole flow stays columnar from source to staging to raw vault.
5.3 Arrow + Parquet — the cold-storage twin¶
Apache Arrow's in-memory format is paired with Apache Parquet for on-disk storage. The two projects share many contributors and are designed to work together: Parquet column chunks decode directly into Arrow arrays, with decompression and decoding as the main remaining cost. The data vault's staging layer is typically Parquet in object storage (Azure Blob, S3, GCS), giving cheap cold storage that the raw vault loader can scan vectorized at high throughput.
6. Pace-layered projection — Stewart Brand as architecture¶
ARC-ADR-041 borrows pace layering from Stewart Brand's The Clock of the Long Now (Basic Books, 1999) and his earlier How Buildings Learn (Viking, 1994). The thesis: a healthy complex system has layers that change at different speeds, loosely coupled so the fast layers can churn without destabilizing the slow ones.
| Pace | Layer | Cadence | Job |
|---|---|---|---|
| Slow | Canonical ontology (RDF + OWL + SHACL) | Quarterly | Meaning, rules, rigor |
| Slow | Raw data vault | Source-schema changes only | History, integration, audit |
| Medium | Business vault | Interpretation changes | Same-as resolution, computed sats |
| Medium | LPG CANON zone | Slow-layer changes ratchet down | Operational graph reads |
| Fast | Information marts | Per-question, per-sprint | BI, ML features, OBT |
| Fast | LPG FRONTIER zone | Operator-namespaced extensions | Operational metadata, drift telemetry |
| Faster | Trace + metric layer | Per-second | Observability |
6.1 Down-projection (slow → fast)¶
The canonical RDF is projected deterministically into the LPG CANON zone. Every LPG node
carries the source iri; every LPG edge is the binarized projection of a relator vertex. The
projection is asymmetrically lossy — binarized edges drop the time index — so all relation
reasoning stays in RDF (and in ArcadeDB's relator vertices), never in the LPG shadow.
The information marts work the same way for the analytics plane: PIT tables, bridge tables, and star schemas are projections of the raw + business vault, materialized when an SLA demands it, virtualized otherwise. The marts can be rebuilt at any time from the vault; the vault never gets rebuilt from the marts.
6.2 Up-graduation (fast → slow) — the gated ratchet¶
The harder direction is up: when useful structure emerges in the fast layer (an operator
adds an ops: property, a vector search consistently surfaces a cluster, a same-as link is
proposed by a heuristic), can it graduate into canon?
ADR-041's answer is the two-stage graduation:
- Nominate (fast) — a statistical signal flags a candidate. Popularity, frequency, RRF rank — any fast measurement may nominate.
- Classify → dispose (slow) — only an ontological classifier (a slow-layer agent running the ARC-ADR-032 sift-sort loop) may decide which BFO/UFO category the candidate joins. Categories have disjointness axioms; mis-categorization is rejected, not silently flattened into "class."
This is the platform's enforcement of the propose-dispose maxim: operations propose meaning; the formal layer disposes; nothing auto-promotes. Popularity nominates; classification decides.
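The two-stage gate can be sketched as two functions with a hard boundary between them. Everything here is a toy under stated assumptions: `nominate`, `dispose`, the frequency threshold, and the lookup-table classifier are hypothetical stand-ins for the fast signal and the slow ontological classifier, and disjointness is modelled simply as "exactly one category or rejection".

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple


def nominate(usage_counts: Dict[str, int], min_uses: int) -> List[str]:
    """Fast stage: a statistical signal (here, raw frequency) flags candidates."""
    return sorted(name for name, n in usage_counts.items() if n >= min_uses)


def dispose(
    candidates: Iterable[str],
    classify: Callable[[str], Set[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Slow stage: promote only candidates that land in exactly one category."""
    promoted: Dict[str, str] = {}
    rejected: List[str] = []
    for name in candidates:
        categories = classify(name)
        if len(categories) == 1:
            promoted[name] = next(iter(categories))
        else:
            rejected.append(name)   # unclassified, or would violate disjointness
    return promoted, rejected


usage = {"ops:owner": 40, "ops:tmp_flag": 2, "ops:status": 55}
toy_categories = {"ops:owner": {"Role"}, "ops:status": {"Quality", "Process"}}

candidates = nominate(usage, min_uses=10)
promoted, rejected = dispose(candidates, lambda name: toy_categories.get(name, set()))
print(promoted, rejected)
```

Note that the most popular property in this example is the one rejected: frequency gets a candidate in front of the classifier and has no say after that.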
7. Unified process & time architecture (ADR-038)¶
ARC-ADR-038 is the binding layer that fuses the four time axes (transaction, valid, decision, process) into one canonical temporal envelope that rides on every CloudEvent, every pin, every workflow step.
7.1 The five-axis envelope¶
{
"transaction_time": "2026-06-06T12:34:56.789Z", // when the system recorded it
"valid_time": { // when it is true in the world
"from": "2026-06-01T00:00:00Z",
"to": "2026-06-30T23:59:59Z"
},
"decision_time": "2026-06-06T12:30:00Z", // when the human/agent judged
"process_time": "2026-06-06T12:34:55.000Z", // when the workflow step ran
"hlc": "2026-06-06T12:34:56.789Z|42|node-7" // causal-order tag (HLC)
}
The HLC tag combines a physical timestamp, a logical counter, and the originating node ID (Kulkarni et al. 2014). Two events with identical wall-clock timestamps are still ordered by the counter, and the node ID breaks any remaining tie, so stamps form a total order that is consistent with causality. The counter is bounded (it resets when the wall clock advances), so it is not a Lamport clock that grows forever.
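The update rules from the HLC paper fit in a short class. The sketch below follows the send/receive rules of Kulkarni et al.; the `Stamp` and `HybridLogicalClock` names, the millisecond resolution, and the injectable clock function are assumptions made for illustration, not the platform's ISerializationClock.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True, order=True)
class Stamp:
    wall_ms: int   # physical component
    counter: int   # logical component
    node: str      # final tie-breaker


class HybridLogicalClock:
    def __init__(self, node: str, now_ms: Optional[Callable[[], int]] = None) -> None:
        self.node = node
        self._now_ms = now_ms or (lambda: time.time_ns() // 1_000_000)
        self._l = 0
        self._c = 0

    def now(self) -> Stamp:
        """Stamp a local or send event."""
        pt = self._now_ms()
        if pt > self._l:
            self._l, self._c = pt, 0      # wall clock advanced: counter resets
        else:
            self._c += 1                  # same or earlier wall time: bump counter
        return Stamp(self._l, self._c, self.node)

    def observe(self, remote: Stamp) -> Stamp:
        """Stamp a receive event so it sorts after the remote stamp."""
        pt = self._now_ms()
        new_l = max(self._l, remote.wall_ms, pt)
        if new_l == self._l == remote.wall_ms:
            new_c = max(self._c, remote.counter) + 1
        elif new_l == self._l:
            new_c = self._c + 1
        elif new_l == remote.wall_ms:
            new_c = remote.counter + 1
        else:
            new_c = 0
        self._l, self._c = new_l, new_c
        return Stamp(new_l, new_c, self.node)


# Node b's wall clock runs 10 ms behind node a's.
a = HybridLogicalClock("node-a", now_ms=lambda: 100)
b = HybridLogicalClock("node-b", now_ms=lambda: 90)

sent = a.now()
received = b.observe(sent)
later = b.now()
print(sent, received, later)
```

Even though node b's physical clock is behind, its receive stamp sorts after the send stamp, which is the property that keeps skew from inverting causation.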
7.2 Process time and event-sourcing¶
The "process time" axis is inspired by event-sourcing (Greg Young, Martin Fowler, Event Sourcing) and the CACAO 2.0 playbook standard (OASIS), along with BPMN 2.0 (OMG). A workflow step has a process time (when DBOS-durable execution actually ran it) that is distinct from the transaction time of the pin it produced and the decision time of the human/agent that scheduled it. Conflating them destroys the audit trail.
The fleet's durable runtime is DBOS Transact (ARC-ADR-018)
— durable workflows with checkpoint/replay, durable queues, scheduled workflows, and
list / cancel / resume / fork. BPMN/CACAO playbooks parse to one intermediate
representation, executed by a ~400-line kernel inside DBOS
(ARC-ADR-031).
7.3 Provenance as a primitive — W3C PROV-O¶
Time alone is not provenance. The platform encodes provenance per the W3C PROV-O standard (W3C Recommendation 2013) with OWL-Time (W3C Recommendation 2017, revised as a Candidate Recommendation Draft in 2022) for temporal intervals. Every write to the bitemporal ledger emits PROV-O triples linking the activity (the workflow step), the agent (the human or AI), and the entity (the pin) to the temporal envelope. This is what makes "evidence as a primitive" (ADR-041) more than a slogan.
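A rough sketch of what one ledger write might emit, using plain tuples instead of an RDF library. The `prov:` terms are real PROV-O classes and properties; the `urn:example:` identifiers, the `provenance_triples` helper, and the choice to attach only the transaction time are illustrative assumptions.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def provenance_triples(pin: str, step: str, agent: str, recorded_at: str) -> List[Triple]:
    """Link entity (pin), activity (workflow step), and agent, plus one timestamp."""
    return [
        (pin, "rdf:type", "prov:Entity"),
        (step, "rdf:type", "prov:Activity"),
        (agent, "rdf:type", "prov:Agent"),
        (pin, "prov:wasGeneratedBy", step),
        (step, "prov:wasAssociatedWith", agent),
        (pin, "prov:wasAttributedTo", agent),
        (pin, "prov:generatedAtTime", f'"{recorded_at}"^^xsd:dateTime'),
    ]


triples = provenance_triples(
    pin="urn:example:pin/123",          # hypothetical identifiers
    step="urn:example:step/9",
    agent="urn:example:agent/reviewer",
    recorded_at="2026-06-06T12:34:56.789Z",
)
for s, p, o in triples:
    print(s, p, o, ".")
```

The remaining envelope axes (valid, decision, and process time) would need their own properties, for example OWL-Time intervals; that mapping is not specified here.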
8. Information marts — the fast consumption layer¶
The vault's job is integration and history. Information marts shape that history for particular consumers. Four mart patterns are first-class in ADR-026:
8.1 Kimball star schema¶
Ralph Kimball's The Data Warehouse Toolkit (3rd ed., Wiley 2013) is the canon for dimensional modeling. A star schema is one fact table (the measure events) surrounded by dimension tables (the descriptive context, with slowly-changing-dimension Type 2 history). Fast for BI tools, easy for analysts, well-supported across every OLAP engine.
In a DV 2.1 architecture, the star is built from PIT and bridge tables in the business vault. The mart materializes when an SLA demands it; otherwise it stays a virtual view.
8.2 Snowflake schema¶
A snowflake is a star with normalized dimensions (dimension tables that point to sub-dimension tables, instead of flattened). Use it when a dimension is genuinely multi-level and the duplication cost of flattening exceeds the join cost of normalizing. In practice, BI tools prefer flat stars; snowflakes show up most often as an intermediate build step.
8.3 OBT — One Big Table¶
The OBT mart is one denormalized table with every column the consumer needs, joined and flattened at build time. It is the dominant pattern for modern columnar warehouses (Snowflake, BigQuery, Databricks) because columnar compression and partition pruning make the storage cost trivial and the query cost optimal. ML feature stores especially favor OBT.
8.4 Graph projection¶
The fourth mart pattern is the graph projection: hubs become nodes, links become edges, satellites become time-indexed property streams. This mart materializes into the operational graph on the platform plane (ArcadeDB) and is the input to the LPG CANON zone (ADR-041).
8.5 Materialize vs virtualize¶
The methodology choice is not "which mart shape?" — it is all four, as consumer demand warrants. The methodology choice is materialize vs virtualize:
| Choice | When | Cost |
|---|---|---|
| Virtualize (view) | Latency SLA permits compute on read; data volume moderate | Compute per query |
| Materialize | Latency SLA strict; query frequency high; data volume large | Storage + refresh job |
The default is virtualize. Materialize only when measurement says virtualization fails the SLA.
9. Vector + embedding storage¶
9.1 The embedder¶
The fleet's canonical embedder is Cohere embed-v-4-0
served via Azure AI Foundry (fndry-01 in subscription AASub1), producing 1536-dimensional
dense vectors. The LLM Gateway (ARC-ADR-021)
provides a single REST surface so spokes never hardcode a vendor SDK.
A local-embedder image is pinned at hub issue #184 — a CPU/NPU/iGPU local model serving embeddings without the Foundry round-trip, for offline development and for the platform plane when sub-100ms embedding latency matters.
9.2 Storing vectors — property vs sidecar¶
Two patterns coexist, chosen by access pattern:
Pattern A — vector as ArcadeDB property. The vector is stored as a LIST<FLOAT> or
ARRAY<FLOAT> property on the node. Search is via ArcadeDB's HNSW index or via in-process
cosine. Best for: small-to-medium corpora (≤ 1M vectors), tight graph-context queries (the
node-and-its-vector come back in one read).
Pattern B — sidecar vector index (pgvector / Chroma). The vector lives in a dedicated ANN-optimized store; only an opaque ID joins back to the graph. Best for: large corpora, specialized ANN algorithms (IVF-Flat, IVF-PQ), independent scale.
ADR-055's hybrid query system orchestrates both patterns under the same Search() seam.
The UDA's query planner decides per query which substrate to hit, then merges via RRF. The
caller never sees the topology.
9.3 Embedding versioning¶
A neglected but critical point: every stored vector is the output of a specific embedder version. When the embedder upgrades (Cohere ships v5; the local model retrains), the vector space shifts and old vectors are no longer comparable to new ones. The platform records the embedder version as a satellite attribute on the embedded entity, and re-embeds on a versioned schedule rather than mixing vector generations in the same index.
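One way to keep vector generations apart is to make the embedder version part of the index's contract. The sketch below is an in-process cosine index that refuses vectors from any other version; `VersionedVectorIndex`, the version labels, and the three-dimensional toy vectors are hypothetical, and the platform itself records the version as a satellite attribute.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VersionedVectorIndex:
    """One index per embedder version; vectors from other versions are refused."""

    def __init__(self, embedder_version: str, dim: int) -> None:
        self.embedder_version = embedder_version
        self.dim = dim
        self._vectors: Dict[str, List[float]] = {}

    def _check(self, vector: Sequence[float], embedder_version: str) -> None:
        if embedder_version != self.embedder_version:
            raise ValueError(
                f"index holds {self.embedder_version!r} vectors, got {embedder_version!r}"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")

    def add(self, item_id: str, vector: Sequence[float], embedder_version: str) -> None:
        self._check(vector, embedder_version)
        self._vectors[item_id] = list(vector)

    def search(
        self, query: Sequence[float], embedder_version: str, top_k: int = 3
    ) -> List[Tuple[str, float]]:
        self._check(query, embedder_version)
        scored = [(item_id, cosine(query, v)) for item_id, v in self._vectors.items()]
        return sorted(scored, key=lambda pair: -pair[1])[:top_k]


index = VersionedVectorIndex(embedder_version="v4", dim=3)
index.add("doc-1", [1.0, 0.0, 0.0], "v4")
index.add("doc-2", [0.0, 1.0, 0.0], "v4")
best = index.search([0.9, 0.1, 0.0], "v4", top_k=1)
print(best[0][0])
```

Re-embedding on a versioned schedule then amounts to building a fresh index for the new version and switching reads over once it is complete.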
10. Standards & references¶
10.1 Temporal data¶
- Snodgrass, R.T. (1999). Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann.
- Date, C.J., Darwen, H., Lorentzos, N. (2003). Temporal Data and the Relational Model. Morgan Kaufmann.
- ISO/IEC 9075:2011 (SQL:2011). System-versioned tables, application-time period tables.
- Kulkarni, S. et al. (2014). Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases.
- Lamport, L. (1978). Time, Clocks, and the Ordering of Events in a Distributed System.
10.2 Data Vault¶
- Linstedt, D. & Olschimke, M. (2015). Building a Scalable Data Warehouse with Data Vault 2.0. Morgan Kaufmann.
- Data Vault Alliance — DV 2.1 standard documents.
- Datavault4dbt — the reference loader.
10.3 Dimensional modeling¶
- Kimball, R. & Ross, M. (2013). The Data Warehouse Toolkit (3rd ed.). Wiley.
- Kimball Group — design tips, articles, the official reference.
10.4 Wire & storage formats¶
- Apache Arrow — columnar format spec.
- Arrow Flight RPC — high-performance transport over gRPC.
- ADBC — Arrow Database Connectivity, the JDBC/ODBC successor.
- Apache Parquet — columnar storage format.
10.5 Provenance & process¶
- W3C PROV-O (2013). Recommendation.
- W3C OWL-Time (2017; revised draft 2022). Recommendation.
- OMG BPMN 2.0. Specification.
- OASIS CACAO 2.0. Specification.
- CloudEvents v1.0. Specification.
10.6 Pace layering¶
- Brand, S. (1999). The Clock of the Long Now: Time and Responsibility. Basic Books.
- Brand, S. (1994). How Buildings Learn. Viking.
11. End-to-end flow — one diagram¶
flowchart LR
SRC["Source systems<br/>(CRM · billing · events · APIs · files)"]
DLT["dlt pipeline<br/>(Arrow record batches)"]
STG["Staging layer<br/>(Parquet on object store)"]
RV["Raw Vault<br/>hubs · links · sats<br/>load_date · record_source"]
BV["Business Vault<br/>same-as · PIT · bridge<br/>computed sats"]
MART["Information Marts<br/>star · snowflake · OBT · graph"]
LPG["LPG CANON zone<br/>(ArcadeDB)"]
AGENT["Agents · BI · ML<br/>via UDA Search()"]
SRC -->|extract| DLT
DLT -->|Arrow batches| STG
STG -->|hash-key + sat-load| RV
RV -->|business rules<br/>same-as resolution| BV
BV -->|materialize or virtualize| MART
BV -->|down-project<br/>A-Box only| LPG
MART --> AGENT
LPG --> AGENT
subgraph SLOW [Slow pace · meaning · history]
RV
BV
end
subgraph FAST [Fast pace · operations · queries]
MART
LPG
end
classDef slow fill:#1e293b,stroke:#60a5fa,color:#e2e8f0
classDef fast fill:#0f172a,stroke:#38bdf8,color:#e2e8f0
class RV,BV slow
class MART,LPG fast
The flow is one-way at the layer boundary: raw vault is built from staging, business vault is built from raw, marts are built from business. Nothing back-writes. The fast layers can be torn down and rebuilt at any time without losing a single fact, because every fact's authoritative home is the raw vault and every fact carries its bitemporal envelope from the moment it was born.
12. Invariants — the short list¶
Five rules that bind every layer regardless of engine choice:
- Stamp before persist. The HLC seam is wired before the first persistent write. Skew-stamped rows are permanent disorder.
- One writer of the envelope. The temporal envelope is written once by the stamping authority. Every other store copies only the slice it owns and never back-writes.
- Append-only ledger. Every world-time assertion is written once, never UPDATEd, closed by superseded_at. Enforced with an insert-only DB role, not developer discipline.
- Valid time and transaction time are separate column pairs. Never collapsed into a single timestamp.
- Popularity nominates; classification decides. Fast-layer signals may flag candidates for graduation; only the slow ontological classifier may decide their category.
These five are the contract. Stores may change beneath them; the contract does not.
See also: Ontology Foundations · Ontology Stack · Coordination & VFS · Intellectual Foundations (Bibliography)