
Universal Data Adapter for backend-core

Context and Problem Statement

backend-core today talks to exactly one datastore — ArcadeDB (multi-model graph/document/vector), reached over its HTTP/JSON API. We need a system that lets us connect to many backend types from one place: register a connection, see its settings and health in a human UI, manipulate its features, and move data in/out through a pipeline. ArcadeDB is first and essential; Google BigQuery is next; more (Postgres, etc.) will follow. Bulk data movement should ultimately route through the dlt (dlthub) pipeline. We prefer mature packages over bespoke code, must keep secrets out of source and the app database, and are open on implementation language where it makes the system more powerful.

How should we structure a "universal data adapter" that is extensible to new backend types without rewrites, secure by default, and contract-first?

Decision Drivers

  • Extensibility — adding a new backend type should not require touching routing/core code.
  • Heterogeneous capabilities — relational, graph, vector, and warehouse "job" semantics differ; the abstraction must not assume SQL.
  • Pipeline reuse — lean on dlt for ELT (sources/destinations, incremental state, schema evolution) rather than reinventing it.
  • Security — credentials referenced by pointer, resolved at runtime; never persisted in plaintext (consistent with the ArcadeDB secret-file hardening, PR #11).
  • Contract-first — the API is defined and managed via the OpenAPI contract; the Svelte client is generated from it.
  • Powerful, not dogmatic, language choice — willing to use Rust/C# where it wins, but pragmatic about ecosystem gravity.
  • Buy over build where a mature package fits; build the thin glue ourselves.

Considered Options

  1. Python (FastAPI) connector core on dlt + per-type drivers (recommended)
  2. Buy an embedded connector platform — PyAirbyte / Singer+Meltano / Trino / Steampipe
  3. Fully custom connector + pipeline framework (no dlt)
  4. Rust/C# connector core that drives dlt as a subprocess/sidecar

Decision Outcome

Chosen: Option 1 — a Python connector core on dlt, with per-type drivers, a connection registry in ArcadeDB, a capability-based connector abstraction, and a contract-first API + Svelte UI. Rust (rust-api-v2) is retained for hot serving paths, not the connector core.

Architecture

  • Connection registry (ArcadeDB documents):
      • connection_types — catalogue: name, capability flags, a settings JSON Schema, and a dlt slug.
      • connection_instances — display_name, environment, settings (validated against the type schema; no secrets), secret_ref (opaque pointer), health_state + last_checked_at + last_error, derived capabilities.
  • Connector capability interface — a ConnectorBase (test_connection, introspect_schema, read, write) plus opt-in mixins declared per type: GraphCapable, VectorCapable, JobCapable (e.g. BigQuery jobs). Connectors are registered via a decorator (@register_connector("arcadedb")); a new type = a small package + one catalogue row, no routing changes.
  • dlt orchestration layer — a thin module that materializes a dlt source/destination from a registry record (resolving the secret_ref) and runs/schedules the pipeline. Native ops (graph traversal, vector search, ad-hoc reads) bypass dlt and use the type's client directly.
  • Secrets — the registry stores only a secret_ref (Docker secret path / Key Vault name / env key). Resolution happens at pipeline instantiation via dlt's VaultProvider tier (or a Docker secret file). A CI/semgrep rule blocks any code path that persists a resolved credential.
  • Contract-first API (FastAPI → OpenAPI → Svelte client):
      • GET /v1/connection-types — list the type catalogue.
      • GET /v1/connections, POST /v1/connections — list / create connections.
      • GET /v1/connections/{id}, PATCH /v1/connections/{id}, DELETE /v1/connections/{id} — read / update / delete one connection.
      • POST /v1/connections/{id}/test — run a connectivity test.
      • GET /v1/connections/{id}/schema — introspect schema.
      • POST /v1/connections/{id}/pipelines — start a pipeline run (returns 202 + a polling Location); GET /v1/connections/{id}/pipelines — list runs; GET /v1/connections/{id}/pipelines/{jobId} — poll one run.
      • POST /v1/connections/{id}/query — native ad-hoc read.
      • RFC 9457 Problem Details for errors; async pipeline endpoints return a polling Location header.
  • UI gates capability-specific actions on the instance's capability flags, not the type name.
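The connector capability interface and decorator registration described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: ConnectorBase, the three mixins, and register_connector are named in this ADR, but the mixin method names (traverse, vector_search, submit_job), the capability flag strings, and the capabilities_of helper are assumptions made for the example, and the ArcadeDB connector is an in-memory stub.

```python
from abc import ABC, abstractmethod
from typing import Any, Iterable


class ConnectorBase(ABC):
    """Operations every connector type must support."""

    def __init__(self, settings: dict[str, Any]) -> None:
        self.settings = settings  # validated, secret-free settings

    @abstractmethod
    def test_connection(self) -> bool: ...

    @abstractmethod
    def introspect_schema(self) -> dict[str, Any]: ...

    @abstractmethod
    def read(self, query: str) -> Iterable[dict[str, Any]]: ...

    @abstractmethod
    def write(self, rows: Iterable[dict[str, Any]]) -> int: ...


# Opt-in capability mixins. Method names are assumptions for this sketch.
class GraphCapable:
    def traverse(self, start_id: str, depth: int = 1) -> list[str]:
        raise NotImplementedError


class VectorCapable:
    def vector_search(self, embedding: list[float], k: int = 10) -> list[dict[str, Any]]:
        raise NotImplementedError


class JobCapable:
    def submit_job(self, spec: dict[str, Any]) -> str:
        raise NotImplementedError


# Hypothetical flag names the UI could gate on.
_CAPABILITY_FLAGS = {"graph": GraphCapable, "vector": VectorCapable, "job": JobCapable}
_CONNECTORS: dict[str, type] = {}


def register_connector(slug: str):
    """Class decorator: add a connector type to the registry under `slug`."""
    def decorator(cls: type) -> type:
        if not issubclass(cls, ConnectorBase):
            raise TypeError(f"{cls.__name__} must subclass ConnectorBase")
        if slug in _CONNECTORS:
            raise ValueError(f"connector type already registered: {slug}")
        _CONNECTORS[slug] = cls
        return cls
    return decorator


def capabilities_of(slug: str) -> list[str]:
    """Derive capability flags from the mixins a registered type declares."""
    cls = _CONNECTORS[slug]
    return sorted(flag for flag, mixin in _CAPABILITY_FLAGS.items() if issubclass(cls, mixin))


@register_connector("arcadedb")
class ArcadeDBConnector(ConnectorBase, GraphCapable, VectorCapable):
    """Stub only: a real connector would call ArcadeDB's HTTP API."""

    def test_connection(self) -> bool:
        return True

    def introspect_schema(self) -> dict[str, Any]:
        return {"types": []}

    def read(self, query: str) -> Iterable[dict[str, Any]]:
        return iter(())

    def write(self, rows: Iterable[dict[str, Any]]) -> int:
        return sum(1 for _ in rows)
```

Deriving the flags from the class hierarchy, rather than from the type name, is one way to keep the UI gating rule above honest: a type that does not inherit JobCapable never advertises the "job" flag.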

Packages (adopt)

| Package | Role |
| --- | --- |
| dlt (Python) | ELT core; native BigQuery; runtime-dynamic pipelines; VaultProvider secrets; custom @dlt.destination for ArcadeDB |
| arcadedb-python | ArcadeDB over HTTP (the Postgres-wire path is fragile; avoid it for ArcadeDB) |
| ADBC (+ sqlalchemy-bigquery) | Arrow-native warehouse extract; SQLAlchemy dialects for SQL-standard sources |
| Vault / Azure Key Vault (+ External Secrets Operator) | Runtime secret resolution; never plaintext in app DB |

Avoid (this iteration): PyAirbyte / Singer / Meltano (heavy, subprocess-bound), Steampipe / Trino (daemon/cluster, read-only), connectorx / Ibis (too narrow / query-only).

Language

Python owns the connector core. dlt is Python-only and the connector ecosystem (arcadedb-python, ADBC-python, SQLAlchemy) is richest there; an FFI/sidecar to a Python dlt worker on every operation would be fragile for no gain. Rust (rust-api-v2) owns serving — streaming Arrow batches, vector result serving, edge rate-limiting. This is "Python where it is strongest, Rust where it is strongest," not "everything in Python."

To avoid a serialization bottleneck where Python and Rust do exchange data (Arrow batches), the Python↔Rust transfer stays in the Arrow format end to end instead of re-serializing through JSON: either Apache Arrow Flight (gRPC-based, the default), which sends batches in Arrow's columnar wire format, or a shared-memory buffer for co-located processes, which allows zero-copy reads. The chosen mechanism is recorded when the serving path is implemented (phase 3).

Consequences

Good:

  • New backend types are additive (package + catalogue row); no core rewrites.
  • Leverages dlt's incremental/state/schema machinery instead of rebuilding it.
  • Secret posture matches PR #11 (pointer + runtime resolution); nothing plaintext at rest.
  • Contract-first keeps the UI and API in lockstep.

Bad / risks:

  • The connector core is Python-centric despite a preference to spread languages.
  • ArcadeDB needs a custom dlt destination (medium effort; type mapping + idempotency are ours).
  • A common schema vocabulary (CDM) for introspect_schema must be defined before BigQuery lands; otherwise BigQuery's type system forces a breaking change. To contain this risk, the CDM adopts an existing standard rather than a bespoke vocabulary: Apache Arrow schema metadata as the canonical type system for introspect_schema, with per-type adapters mapping native types into it.
  • Dynamic pipeline parameters (cursors, partitions) vary per type. The pipelines request body therefore carries a per-type extensions object validated against the connection type's settings JSON Schema (already defined in the connection_types registry), so the Svelte UI can render the correct inputs per connector. Without this, operators would bypass the API with raw dlt scripts.
  • dlt is not yet a dependency of backend-core (it lives in the AgentArmy hub today); adding it expands the dependency surface.
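To make the CDM mitigation concrete, a per-type adapter that maps native column types into Arrow type names might look like the sketch below. Everything here is illustrative: the CdmField shape, the to_cdm helper, and the column dictionary format are assumptions, the Arrow types are written as plain strings (loosely following pyarrow's factory names) to keep the example dependency-free, and the BigQuery mapping covers only a handful of scalar types.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Sequence

# Illustrative subset of a BigQuery -> Arrow mapping (assumption, not exhaustive).
BIGQUERY_TO_ARROW: dict[str, str] = {
    "INT64": "int64",
    "FLOAT64": "float64",
    "BOOL": "bool",
    "STRING": "string",
    "BYTES": "binary",
    "DATE": "date32",
    "TIMESTAMP": "timestamp[us, tz=UTC]",
}


@dataclass(frozen=True)
class CdmField:
    """One column in the common schema vocabulary (hypothetical shape)."""
    name: str
    arrow_type: str
    nullable: bool


def to_cdm(columns: Sequence[Mapping[str, Any]], type_map: Mapping[str, str]) -> list[CdmField]:
    """Map native column descriptions into CDM fields.

    Unknown native types raise instead of being guessed, so a gap in an
    adapter surfaces at introspection time rather than as silent data loss.
    """
    fields = []
    for col in columns:
        native = str(col["type"]).upper()
        if native not in type_map:
            raise ValueError(f"no Arrow mapping for native type {native!r}")
        fields.append(CdmField(col["name"], type_map[native], bool(col.get("nullable", True))))
    return fields
```

A connector's introspect_schema could then return the to_cdm result for each table, so the API response uses the same vocabulary regardless of backend.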

Acceptance criteria

This ADR is accepted (ratified 2026-05-25); the following are the conditions each implementation increment must satisfy as Phase 1 is built out:

  • The ArcadeDB connector must pass the existing agentarmy-doctor checks and a contract conformance test.
  • A semgrep rule must assert that no secret_ref is ever resolved into a persisted field.
  • The OpenAPI contract drift gate must stay green and the Svelte client must regenerate cleanly.
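For context on the secret_ref criterion, runtime resolution could be sketched as below. This is a hedged illustration only: the ADR treats secret_ref as an opaque pointer, so the "scheme:name" format, the resolve_secret_ref function, and SecretResolutionError are assumptions invented for the example. Only the Docker secret file and environment variable cases are shown; a Key Vault lookup is omitted. The default directory /run/secrets is where Docker mounts secrets in a container.

```python
import os
from pathlib import Path


class SecretResolutionError(RuntimeError):
    """Raised when a secret_ref cannot be resolved at runtime."""


def resolve_secret_ref(secret_ref: str, secrets_dir: Path = Path("/run/secrets")) -> str:
    """Resolve a pointer such as 'docker:arcadedb_password' or 'env:BQ_KEY'.

    The returned value is meant to be handed straight to the client or
    pipeline and never written back to the registry.
    """
    scheme, sep, name = secret_ref.partition(":")
    if not sep or not name:
        raise SecretResolutionError(f"malformed secret_ref: {secret_ref!r}")

    if scheme == "docker":
        # Accept a bare file name only, so a ref cannot escape secrets_dir.
        if Path(name).name != name:
            raise SecretResolutionError(f"invalid secret name: {name!r}")
        try:
            return (secrets_dir / name).read_text(encoding="utf-8").strip()
        except OSError as exc:
            raise SecretResolutionError(f"cannot read secret file for {name!r}") from exc

    if scheme == "env":
        value = os.environ.get(name)
        if value is None:
            raise SecretResolutionError(f"environment variable not set: {name}")
        return value

    raise SecretResolutionError(f"unsupported secret_ref scheme: {scheme!r}")
```

Keeping resolution in a single function like this gives a static rule one place to watch: any flow from its return value into a persisted field is a violation.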

Phased Plan

  1. ArcadeDB — registry schema, ConnectorBase, ArcadeDB connector (graph+vector mixins), secret_ref resolution via the Docker secret file (PR #11), the API endpoints, Svelte connection CRUD + health badge.
  2. BigQuery — BigQueryConnector + JobCapable, dlt BigQuery destination, pipeline run/poll UI; define the CDM schema vocabulary here.
  3. Universal — settings-schema validation at create time, connector auto-discovery via entry points, multi-environment credential namespacing, optional ADBC fast-path reads.

More Information

  • Confirmed state: backend-core already ships an app/ package, and dlt is not currently in requirements.txt — adopting it (phase 1) adds a new dependency, as noted in the risks above.
  • Related: PR #11 (ArcadeDB secret-file hardening) establishes the secret_ref/secret-file precedent this design extends.
  • Research basis: parallel agent research (dlt/ADBC/connectorx/SQLAlchemy landscape; build-vs-buy; architecture), 2026-05-24.