Merge remote-tracking branch 'origin/main' into ragnorc/omnigraph-mcp-crate

Ragnor Comerford 2026-06-16 16:44:11 +02:00
commit c08e8dbac4
173 changed files with 20828 additions and 10366 deletions

@ -1,6 +1,6 @@
# Architecture
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a single `omnigraph.yaml`.
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a per-operator `~/.omnigraph/config.yaml` plus team-owned cluster directories.
## Reading guide
@ -10,7 +10,7 @@ Three views, increasing zoom:
2. **Layer view** — the eight-layer stack inside one OmniGraph process.
3. **Component zoom-ins** — what's inside each layer.
For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a graph, see [`docs/user/storage.md`](../user/storage.md).
For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a graph, see [`docs/user/storage.md`](../user/concepts/storage.md).
L1 (orange in the diagrams) is what we inherit from Lance; L2 (blue) is what OmniGraph adds. The L1/L2 framing is also called out in prose at the bottom of this doc.
@ -280,7 +280,7 @@ flowchart LR
eng --> wq
```
The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc<WriteQueueManager>` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/server.md) "Per-actor admission control" and [docs/dev/writes.md](writes.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly.
The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc<WriteQueueManager>` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/operations/server.md) "Per-actor admission control" and [docs/dev/writes.md](writes.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly.
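The per-key serialization described above can be sketched with a map of locks, one per `(table, branch)`. This is a minimal illustration under stated assumptions, not the engine's `WriteQueueManager`: the `KeyedWriteQueues` type and its methods are hypothetical names, and it uses blocking `std::sync::Mutex` locks where the real engine may use an async primitive.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical key type: one queue per (table, branch) pair.
type QueueKey = (String, String);

/// Illustrative stand-in for a per-key write queue manager (assumed shape).
#[derive(Default)]
pub struct KeyedWriteQueues {
    queues: Mutex<HashMap<QueueKey, Arc<Mutex<()>>>>,
}

impl KeyedWriteQueues {
    /// Returns the lock for `(table, branch)`, creating it on first use.
    /// The outer map lock is held only long enough to look up the entry.
    pub fn queue_for(&self, table: &str, branch: &str) -> Arc<Mutex<()>> {
        let mut queues = self.queues.lock().unwrap();
        queues
            .entry((table.to_string(), branch.to_string()))
            .or_default()
            .clone()
    }

    /// Runs `write` while holding the queue for `(table, branch)`.
    /// Writers on the same key wait for each other; writers on
    /// different keys take different locks and do not contend.
    pub fn with_queue<T>(&self, table: &str, branch: &str, write: impl FnOnce() -> T) -> T {
        let queue = self.queue_for(table, branch);
        let _guard = queue.lock().unwrap();
        write()
    }
}
```

Two calls to `with_queue("person", "main", ...)` run one after the other, while a call for `("person", "feature")` proceeds independently because it resolves to a different lock.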
Code paths:

@ -8,7 +8,7 @@ This page explains what the policy says and how to change it.
| Setting | Value | Why |
|---|---|---|
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test Workspace`, `Test omnigraph-server --features aws`, `CODEOWNERS matches source`, `CODEOWNERS not hand-edited` | Every PR must pass workspace tests, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. The two CODEOWNERS contexts must equal the job `name:` values in `.github/workflows/codeowners.yml` **verbatim** — a context naming a job that never reports (the old `CODEOWNERS / drift` used the job *id*, and the job was path-filtered) leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws`, `CODEOWNERS matches source`, `CODEOWNERS not hand-edited` | Every PR must pass the AWS-feature build/test, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. **`Test Workspace` is deliberately NOT required** — it runs only on push to `main` (post-merge), tags, and manual `workflow_dispatch`, to keep PR turnaround fast (it was the ~15min+ slow gate). It is therefore *not* listed here: a required check that never reports on PRs (the `test` job is `if: github.event_name != 'pull_request'`) would leave every PR permanently pending — the same job-never-reports trap the CODEOWNERS contexts call out below. The trade-off (a regression lands on `main` and is caught by the post-merge run, so `main` can briefly go red) and its mitigations are documented in [ci.md](ci.md). The two CODEOWNERS contexts must equal the job `name:` values in `.github/workflows/codeowners.yml` **verbatim** — a context naming a job that never reports (the old `CODEOWNERS / drift` used the job *id*, and the job was path-filtered) leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
| **Required approving reviews** | `1` | At least one reviewer. With a 2-person team, going higher would block all merges when one person is unavailable. |
| **Require code-owner reviews** | `true` | The reviewer must be a code owner per `.github/CODEOWNERS`. This is what makes the codeowners chassis enforced. |
| **Dismiss stale reviews on new commits** | `true` | A push after approval invalidates the prior review. Prevents the "approve, then sneak in unreviewed changes" pattern. |

@ -3,6 +3,9 @@
`.github/workflows/`:
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repository PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
- **`Test Workspace` does not run on pull requests.** The job is gated `if: github.event_name != 'pull_request'`, so the full workspace + failpoints suite runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. This was a deliberate PR-latency trade-off: it was the slowest gate (~15min warm, up to the 75min cold ceiling). `RustFS S3 Integration` declares `needs: test`, so it is push-/dispatch-only for the same reason. The fast PR gates remain: `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws`, and the two CODEOWNERS checks. `Test Workspace` is correspondingly **not** in the required-check list (`.github/branch-protection.json`); see [branch-protection.md](branch-protection.md).
- **Consequences to internalize:** (1) a regression that the suite would catch now lands on `main` and turns the post-merge run red, rather than being blocked pre-merge — `main` can briefly break, so run `cargo test --workspace --locked` locally before merging anything non-trivial, or trigger this workflow on your branch via the Actions "Run workflow" button. (2) `openapi.json` is no longer auto-regenerated on PRs (that step is inside the `test` job); for server/API changes, regenerate it locally with `OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi` and commit it, or the strict drift check fails the post-merge `main` run.
- **Applying this policy:** removing `Test Workspace` from the JSON is inert until an admin runs `./scripts/apply-branch-protection.sh`. **Run it immediately after this change merges** — until then GitHub still requires a `Test Workspace` context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap).
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
- **Windows binary build job**: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` on windows-latest with smoke checks for `omnigraph.exe version`, `omnigraph-server.exe --help`, and PowerShell installer syntax.
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_graph_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.

@ -3,11 +3,11 @@
**Status:** Draft / thinking-in-progress
**Type:** Architecture direction
**Date:** 2026-06-07
**Relationship:** generalizes today's `omnigraph.yaml` graph/query/policy configuration surface ([CLI reference](../user/cli-reference.md), [server docs](../user/server.md)) into a future cluster control plane. The distilled rules are in [cluster-axioms.md](cluster-axioms.md); detailed downstream implementation spec and blast-radius assessment in [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md). This is a proposed architecture, not an implemented RFC.
**Relationship:** generalizes today's `omnigraph.yaml` graph/query/policy configuration surface ([CLI reference](../user/cli/reference.md), [server docs](../user/operations/server.md)) into a future cluster control plane. The distilled rules are in [cluster-axioms.md](cluster-axioms.md); detailed downstream implementation spec and blast-radius assessment in [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md). This is a proposed architecture, not an implemented RFC.
> **Implementation status.** The examples below describe the full target schema.
> Stage 2B only accepts the read-only subset documented in
> [cluster-config.md](../user/cluster-config.md). Future-phase fields such as
> [cluster-config.md](../user/clusters/config.md). Future-phase fields such as
> `env_file`, `apply`, `providers`, `pipelines`, `embeddings`, `ui`, `aliases`,
> and `bindings` are intentionally rejected with typed diagnostics until their
> reconciler semantics are implemented.

@ -177,4 +177,4 @@ For all three modes, a mid-load failure (RI / cardinality violation, validation
## Embeddings during load
If a node type has `@embed` properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. See [embeddings.md](../user/embeddings.md).
The loader does **not** embed `@embed` properties at load time. `@embed` is a catalog annotation consumed by query typecheck/lint; vectors are supplied directly in the load data, or pre-filled by the offline `omnigraph embed` pipeline. Query-time `nearest($v, "string")` auto-embeds the query string via the provider-independent embedding client. See [embeddings.md](../user/search/embeddings.md). (Ingest-time `@embed` execution is a planned RFC-012 phase.)

@ -20,13 +20,13 @@ constraints. User-facing behavior should still be documented through
| Area | Read |
|---|---|
| System structure, L1/L2 framing, component diagrams | [architecture.md](architecture.md) |
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/storage.md) |
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/concepts/storage.md) |
| Direct-publish writes, D2, staged writes, recovery sidecars | [writes.md](writes.md) |
| Query execution, mutation execution, loader flow | [execution.md](execution.md) |
| Index lifecycle and graph topology indexes | [indexes.md](../user/indexes.md) |
| Branch and commit internals | [branches-commits.md](../user/branches-commits.md) |
| Index lifecycle and graph topology indexes | [indexes.md](../user/search/indexes.md) |
| Branch and commit internals | [branches-commits.md](../user/branching/index.md) |
| Three-way merge implementation and conflicts | [merge.md](merge.md) |
| Diff/change-feed implementation | [changes.md](../user/changes.md) |
| Diff/change-feed implementation | [changes.md](../user/branching/changes.md) |
| Branch protection policy | [branch-protection.md](branch-protection.md) |
| CODEOWNERS source of truth | [codeowners.md](codeowners.md) |
@ -34,14 +34,14 @@ constraints. User-facing behavior should still be documented through
| Area | Read |
|---|---|
| Schema grammar, catalog, migration planner | [schema-language.md](../user/schema-language.md) |
| Query grammar, IR, lints, mutation restrictions | [query-language.md](../user/query-language.md) |
| Embedding client and `@embed` integration | [embeddings.md](../user/embeddings.md) |
| Cedar policy surface and server gating | [policy.md](../user/policy.md) |
| Server auth, OpenAPI, endpoint handlers | [server.md](../user/server.md) |
| Error taxonomy and serialization | [errors.md](../user/errors.md) |
| Constants and tunables | [constants.md](../user/constants.md) |
| Transaction model public contract | [transactions.md](../user/transactions.md) |
| Schema grammar, catalog, migration planner | [schema-language.md](../user/schema/index.md) |
| Query grammar, IR, lints, mutation restrictions | [query-language.md](../user/queries/index.md) |
| Embedding client and `@embed` integration | [embeddings.md](../user/search/embeddings.md) |
| Cedar policy surface and server gating | [policy.md](../user/operations/policy.md) |
| Server auth, OpenAPI, endpoint handlers | [server.md](../user/operations/server.md) |
| Error taxonomy and serialization | [errors.md](../user/operations/errors.md) |
| Constants and tunables | [constants.md](../user/reference/constants.md) |
| Transaction model public contract | [transactions.md](../user/branching/transactions.md) |
## Project Operations
@ -79,6 +79,9 @@ Working documents for in-flight feature work. Removed when the work lands.
| Per-operator config — `~/.omnigraph/` identity, keyed credentials, named servers (the operator slice of RFC-002) | [rfc-007-operator-config.md](rfc-007-operator-config.md) |
| Deprecate `omnigraph.yaml` — one concern per config surface; key-by-key migration map and staged retirement | [rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md) |
| Unify CLI embedded/remote access paths — parity referee, shared wire-DTO crate, `GraphClient` trait, declared plane capabilities | [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) |
| Restructure the CLI around explicit planes — one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | [rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) |
| CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
## Boundary

@ -15,6 +15,38 @@ Use it this way:
- Keep implementation ledgers, roadmap detail, and historical MR notes in the
per-area docs. This file is the filter, not the encyclopedia.
## Governing principle: logical contract over physical state
The hard invariants below are instances of one rule. Keep it in view whenever
a change touches the boundary between what the graph *means* and how it is
physically stored.
> **Logical state is the contract. Physical state — index coverage, fragment
> layout, compaction versions, staged writes — is derived, rebuildable, and may
> be produced asynchronously. A physical operation must never fail a logical
> one. Preconditions are checked against logical state; physical reconciliation
> is idempotent and may lag or retry. Genuine logical conflicts still fail
> loudly: the licence to lag covers physical convergence, not correctness.**
Invariants that instantiate it: **2** (manifest-atomic visibility) and **5**
(recovery is part of the commit protocol) — a partially-written physical layer
never changes what a graph commit means; **7** (indexes are derived state) — a
query is correct under partial index coverage, and expensive index work
converges from manifest state instead of gating the write path; **13** (failures
bounded and observable) — the licence to lag is not a licence to drop, so a
physical step that cannot make progress is surfaced, not swallowed. Deny-list
items that enforce it: synchronous inline vector/FTS index rebuilds on the
commit path; state that drifts from Lance or the manifest when it can be
derived; job queues for manifest-derivable state where a reconciler fits.
The failure shape it rules out: a legitimate background operation on the
physical layer (compaction, an index build, an interrupted staged write) is
allowed to break a logical operation (a query's correctness, a migration's
success, a branch's writability). The smell to watch for is a logical operation
whose precondition is a *physical* fact — a cached file version, an index's
existence, a fragment count. Make the precondition logical and let a reconciler
converge the physical state.
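A toy model may make the principle concrete. The sketch below is an assumption-laden illustration, not OmniGraph code: `Table`, `ids_named`, and `reconcile` are hypothetical names. The rows and the declared intent are logical state; the index and its coverage counter are physical state that can lag and be rebuilt.

```rust
use std::collections::BTreeMap;

/// Hypothetical table. `rows` and `index_declared` are logical state;
/// `name_index` and `indexed_rows` are physical state derived from `rows`.
#[derive(Default)]
pub struct Table {
    pub rows: Vec<(u64, String)>,
    pub index_declared: bool,
    pub name_index: BTreeMap<String, Vec<u64>>,
    pub indexed_rows: usize,
}

impl Table {
    /// Logical write: appends a row and never waits on index work.
    pub fn insert(&mut self, id: u64, name: &str) {
        self.rows.push((id, name.to_string()));
    }

    /// Logical read: uses the index for covered rows and scans the
    /// uncovered tail, so the answer is correct under partial coverage.
    pub fn ids_named(&self, name: &str) -> Vec<u64> {
        let mut ids = self.name_index.get(name).cloned().unwrap_or_default();
        ids.extend(
            self.rows[self.indexed_rows..]
                .iter()
                .filter(|(_, n)| n.as_str() == name)
                .map(|(id, _)| *id),
        );
        ids
    }

    /// Physical reconciliation: folds uncovered rows into the index and
    /// returns how many rows were folded. Calling it again with nothing
    /// pending is a no-op, so it is safe to retry.
    pub fn reconcile(&mut self) -> usize {
        if !self.index_declared {
            return 0;
        }
        let pending = &self.rows[self.indexed_rows..];
        for (id, name) in pending {
            self.name_index.entry(name.clone()).or_default().push(*id);
        }
        let folded = pending.len();
        self.indexed_rows = self.rows.len();
        folded
    }
}
```

Note that `insert` and `ids_named` never check whether the index exists or is current; only `reconcile` touches physical state, which is the shape the governing principle asks for.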
## Hard Invariants
1. **Respect the substrate.** Lance owns columnar storage, per-dataset
@ -58,7 +90,7 @@ Use it this way:
branch they read even when index coverage is partial. Expensive index work
should converge from manifest state instead of extending the critical write
path. Scalar staged index builds and vector inline residuals are documented
in [writes.md](writes.md) and [indexes.md](../user/indexes.md).
in [writes.md](writes.md) and [indexes.md](../user/search/indexes.md).
8. **Schema identity survives renames.** Accepted schema identity must remain
stable across type and property renames. Rename support belongs in migration
@ -100,14 +132,14 @@ Use it this way:
|---|---|---|
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) |
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) |
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [writes.md](writes.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branches-commits.md), [maintenance.md](../user/maintenance.md) |
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema-language.md), [execution.md](execution.md) |
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema-language.md) |
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/queries/index.md), [writes.md](writes.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) |
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema/index.md), [execution.md](execution.md) |
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema/index.md) |
| Storage trait | `TableStorage` (via `db.storage()`) is staged-only; the inline-commit residuals (`delete_where`, `create_vector_index`) are split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so §1 holds by construction; capability/stat surfaces are roadmap | [writes.md](writes.md), [architecture.md](architecture.md) |
| Index lifecycle | `ensure_indices` is explicit today; reconciler-based convergence is roadmap | [indexes.md](../user/indexes.md), [maintenance.md](../user/maintenance.md) |
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/query-language.md) |
| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/server.md), [policy.md](../user/policy.md) |
| Index lifecycle | `@index`/`@key` declares *intent*; the physical index is derived state and never fails a logical op. `schema apply` builds no indexes (records intent only; index-only changes touch no table data). `load`/`mutate` build inline through one chokepoint (`build_indices_on_dataset_for_catalog`, type-dispatched by `node_prop_index_kind`: enum + orderable scalar → BTREE, free-text String → FTS, Vector → vector) that fault-isolates an untrainable Vector column into a *pending* index instead of aborting. `optimize`/`ensure_indices` is the reconciler: it creates declared-but-missing indexes and folds appended/rewritten fragments into existing ones (`optimize_indices`), reporting still-pending columns. Explicit maintenance call, not yet a background loop | [indexes.md](../user/search/indexes.md), [maintenance.md](../user/operations/maintenance.md) |
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/queries/index.md) |
| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/operations/server.md), [policy.md](../user/operations/policy.md) |
| Tests | Tempdir-backed Lance tests are the current substrate; the storage adapter has an in-memory backend for adapter-level contract tests, but Lance datasets bypass it | [testing.md](testing.md) |
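The type dispatch and fault isolation described in the Index lifecycle row can be sketched as follows. Everything here is hypothetical and simplified: the enums, `index_kind`, `build_index`, and the `MIN_VECTOR_TRAINING_ROWS` threshold are illustrative assumptions, and the real `node_prop_index_kind` and `build_indices_on_dataset_for_catalog` may decide "untrainable" differently.

```rust
/// Simplified property categories, following the mapping in the table row.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PropType {
    Enum,
    OrderableScalar,
    FreeText,
    Vector,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IndexKind {
    BTree,
    Fts,
    Vector,
}

/// Outcome of one build attempt: a failed vector build becomes
/// `Pending` instead of an error, so the logical write still succeeds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BuildOutcome {
    Built(IndexKind),
    Pending(IndexKind),
}

/// Hypothetical threshold below which vector training is assumed to fail.
pub const MIN_VECTOR_TRAINING_ROWS: usize = 256;

/// Maps a property type to the index kind it should get.
pub fn index_kind(prop: PropType) -> IndexKind {
    match prop {
        PropType::Enum | PropType::OrderableScalar => IndexKind::BTree,
        PropType::FreeText => IndexKind::Fts,
        PropType::Vector => IndexKind::Vector,
    }
}

/// Builds (or defers) the index for one column without aborting the caller.
pub fn build_index(prop: PropType, row_count: usize) -> BuildOutcome {
    let kind = index_kind(prop);
    if kind == IndexKind::Vector && row_count < MIN_VECTOR_TRAINING_ROWS {
        return BuildOutcome::Pending(kind);
    }
    BuildOutcome::Built(kind)
}
```

A later reconciler pass would retry the `Pending` columns once enough rows exist, which matches the "reporting still-pending columns" behaviour the row describes.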
The branch-delete reconciler is authority-derived: it reclaims orphaned forks
@ -132,13 +164,18 @@ them explicit.
new writer cannot couple a write with a HEAD advance through the default
surface. The dead legacy methods (`append_batch` on the trait,
`merge_insert_batch{,es}`, `create_{btree,inverted}_index`) were removed. The
remaining residuals are `delete_where` (gated on MR-A — Lance v7.x bump)
and `create_vector_index` (gated on Lance #6666); see
[lance.md](lance.md) and [writes.md](writes.md). New write paths should use
the staged shape unless a documented Lance blocker applies.
remaining residuals are `delete_where` and `create_vector_index`. The Lance
6.0.1 → 7.0.0 bump landed, so the staged two-phase delete API
(`DeleteBuilder::execute_uncommitted`, Lance #6658) is now available and MR-A
is unblocked — but the migration itself is still pending, so `delete_where`
stays inline for now. `create_vector_index` remains gated on Lance #6666
(still open). See [lance.md](lance.md) and [writes.md](writes.md). New write
paths should use the staged shape unless a documented Lance blocker applies.
- **Deletes and vector indexes:** `delete_where` and vector index creation still
advance Lance HEAD inline because the required public Lance APIs are missing.
Keep D2 and recovery coverage in place until those residuals are removed.
advance Lance HEAD inline. The public delete two-phase API now exists (Lance
#6658 shipped in 7.0.0), so the delete residual is unblocked pending the MR-A
migration; vector index creation is still blocked (Lance #6666 open). Keep D2
and recovery coverage in place until those residuals are removed.
- **Blob-column compaction:** Lance `compact_files` mis-decodes blob-v2 columns
under its forced `BlobHandling::AllBinary` read ("more fields in the schema
than provided column indices"), so `optimize` skips any table with a `Blob`
@ -160,6 +197,22 @@ them explicit.
one-winner-CAS territory; closing this fully needs a cross-process
serialization primitive (e.g. lease-based use of the schema-apply lock
branch) — design it before promoting multi-process write topologies.
- **Fork reclaim is in-process-safe only:** the first write to a table on a
branch forks it (a Lance `create_branch` that advances state before the
manifest publish). An interrupted fork (crash, or a cancelled request
future) leaves a manifest-unreferenced branch ref. The next write self-heals
it — `reclaim_orphaned_fork_and_refork` (`force_delete_branch` + re-fork)
— but reclaim is only safe because the writer holds the per-`(table,
branch)` write queue from before the fork through the publish AND re-checks
the live manifest under it, so no *in-process* writer can be mid-fork. A
reclaim cannot serialize against a foreign-*process* in-flight fork: it may
force-delete a peer's just-created ref, which makes that peer's commit fail
and retry — the same one-winner-CAS exposure as above, not corruption. The
reclaim never fires unless in-process-queue + manifest authority both prove
the ref is manifest-unreferenced. `cleanup`'s per-table reconciler
(`reconcile_orphaned_branches`) is the guaranteed backstop for any fork the
write path never revisits. Both degrade to a no-op if Lance ships an atomic
multi-dataset branch op.
- **Local `write_text_if_match` is not a cross-process CAS:** object-store
backends use a true conditional put (ETag If-Match; the in-memory test
backend too), but upstream `object_store` leaves `PutMode::Update`

@ -156,7 +156,24 @@ If a future need pulls one of these into scope, add a row to the matching domain
When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers.
### Last alignment audit: 2026-05-22 (Lance 6.0.1 upstream; omnigraph pinned at 6.0.1)
### Last alignment audit: 2026-06-15 (Lance 7.0.0 upstream; omnigraph pinned at 7.0.0)
Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, DataFusion stayed 53** (no change) — the only transitive bump is `object_store` 0.12.5 → 0.13.2. 141 upstream commits reviewed (6.0.1 → 7.0.0); no fixes lost (the 6.0.x release-branch backports are all forward-ported into 7.0.0). Behavior-affecting findings:
- **object_store 0.13 moved convenience methods behind a new `ObjectStoreExt` trait** (`get`/`put`/`head`/`rename`/`delete`; `list`/`list_with_delimiter`/`put_opts` stay on the core `ObjectStore` trait). Fix = add `use object_store::ObjectStoreExt;` to `storage.rs` and `db/manifest/namespace.rs`; no call-site changes. Mirrors Lance's own migration in PR #6672. The local-FS `PutMode::Update` gap is unchanged (still unimplemented upstream), so `storage.rs::write_text_if_match`'s local content-token emulation stays.
- **`roaring` must be pinned to 0.11.4** (`cargo update -p roaring --precise 0.11.4`). Lance 7.0.0's `UpdatedFragmentOffsets` newtype (PR #6650) derives `Eq` over `HashMap<u64, RoaringBitmap>`, which needs `RoaringBitmap: Eq`, added only in roaring 0.11.4 (roaring-rs PR #341). Lance's loose `roaring = "0.11"` constraint otherwise resolves to the broken 0.11.3 and **lance itself fails to compile** (`RoaringBitmap: Eq is not satisfied`). roaring is transitive (no direct workspace dep); the pin lives only in `Cargo.lock`.
- **`_row_created_at_version` for merge-insert INSERT rows now = the commit version** (PR #6774; was a fallback of 1 / dataset-creation version). Flipped `lance_version_columns.rs::lance_merge_insert_new_row_stamps_created_at_version` to assert `== v2`. Production change-detection keys on `_row_last_updated_at_version` + ID-set membership, so classification logic is unaffected (the `changes/mod.rs` rationale comment was corrected).
- **BTREE range-query bound inclusiveness fixed** (PR #6796, issue #6792): `x <= hi AND x > lo` returned the wrong boundary row on 6.0.1. omnigraph today builds BTREE only on string `@key` columns (`id`/`src`/`dst`) and queries them by equality/IN, not range, so its *current* query patterns almost certainly never hit this bug — but the corrected boundary semantics are a contract we rely on the moment a BTREE-range path appears (BTREE-on-properties via the index-type tickets, or a range-on-key query). Pinned by `lance_surface_guards.rs::btree_range_query_boundary_is_correct` (reproduces #6792's 5-row + BTREE shape).
- **`WriteParams::auto_cleanup` default flipped from on (every-20-commits) to `None`** (PR #6755). On 6.0.1 the on-by-default hook could GC versions the `__manifest` pins for snapshots/time-travel. omnigraph owns cleanup explicitly (`optimize.rs::cleanup_all_tables`). Two parts to the fix, because `auto_cleanup` is **create-time config only and has no effect on existing datasets** (Lance `write.rs` docs): (1) `auto_cleanup: None` at all 11 `WriteParams` sites so *new* datasets store no cleanup config; (2) the load-bearing half: `skip_auto_cleanup: true` on every commit path, because graphs created **before** the bump still carry the on-config in their datasets, and Lance's hook fires off the *dataset's stored* config at commit time (`io/commit.rs`: `if !commit_config.skip_auto_cleanup`). So the staged commit path (`commit_staged` → `CommitBuilder::with_skip_auto_cleanup(true)`), the `__manifest` publisher (`MergeInsertBuilder::skip_auto_cleanup(true)`), and the direct `WriteParams` paths all skip the hook. Without this, an upgraded graph would still auto-cleanup and delete `__manifest`-pinned versions. Pinned by `lance_surface_guards.rs::skip_auto_cleanup_suppresses_version_gc` (negative control + with-skip survival).
- **Lance #6658 SHIPPED in 7.0.0** (`DeleteBuilder::execute_uncommitted`, exposed via PR #6781) → MR-A (migrate `delete_where` to the staged two-phase API, retire the parse-time D2 rule) is now **unblocked**, tracked separately (dev-graph `iss-950`). The bump itself keeps `delete_where` inline; the `_compile_delete_result_field_shape` guard is left untouched until MR-A.
- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L2472-2480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS; the marker is baked into `manifest_schema()` at init and added by the `migrate_v1_to_v2` internal-schema migration for pre-v0.4.0 graphs. The migration relied on Lance 6's idempotent re-apply for crash-recovery (a crash after the field-set but before the stamp bump re-enters the migration with the PK already present); under v7 that re-apply errors, so a real v1 graph could never finish migrating. Fixed by guarding the set on the manifest's unenforced-PK field (`db/manifest/migrations.rs::migrate_v1_to_v2`): `["object_id"]` → no-op, `[]` → set, any other PK field → loud refusal (the wrong CAS key, unchangeable under v7). Pinned by `lance_surface_guards.rs::unenforced_primary_key_is_immutable_once_set` (red if Lance relaxes immutability); regression: `db::manifest::tests::test_publish_migrates_pre_stamp_manifest_to_current_version` (was red under v7).
- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`. The decoupling is *contingent on omnigraph's legacy boolean PK key*, not an unconditional v7 property: v7's namespace eagerly adds the new `lance-schema:unenforced-primary-key:position` key to any `__manifest` lacking it; that write hits the immutable-PK rule above (the boolean key already set the PK), so `ensure_manifest_table_up_to_date` errors and the namespace silently falls back to directory listing. omnigraph keeps the boolean key deliberately — Lance honors it permanently (maps to PK position 0), and one uniform on-disk format beats a new-vs-old split (existing graphs can't be re-keyed to the position key under that same immutability rule). omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis).
- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open), so the `create_vector_index` inline residual is retained; and blob-column compaction, so the `compact_files_still_fails_on_blob_columns` guard still stands (it turns red only once the bug is fixed) and `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`.
- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable. Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done.
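The three-way guard described above for `migrate_v1_to_v2` can be sketched as a pure decision function. This is a minimal illustration and not the code in `db/manifest/migrations.rs`: `PkAction` and `plan_pk_migration` are hypothetical names, and the real migration reads the unenforced-PK field from the Lance schema rather than taking a slice of names.

```rust
/// What the migration should do given the manifest's current unenforced-PK fields.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
enum PkAction {
    /// `["object_id"]`: a previous (possibly crashed) run already set it; do nothing.
    AlreadySet,
    /// `[]`: no PK yet; safe to set `object_id`.
    Set,
    /// Anything else: the wrong CAS key, which v7 will never let us change.
    Refuse(String),
}

fn plan_pk_migration(current_pk: &[&str]) -> PkAction {
    match current_pk {
        [] => PkAction::Set,
        ["object_id"] => PkAction::AlreadySet,
        other => PkAction::Refuse(format!(
            "manifest already has unenforced primary key {other:?}; expected [\"object_id\"]"
        )),
    }
}

fn main() {
    // Fresh v1 manifest: set the key.
    assert_eq!(plan_pk_migration(&[]), PkAction::Set);
    // Re-entry after a crash between field-set and stamp bump: no-op.
    assert_eq!(plan_pk_migration(&["object_id"]), PkAction::AlreadySet);
    // A foreign key cannot be repaired under the immutability rule.
    println!("{:?}", plan_pk_migration(&["row_id"]));
}
```

Deciding before writing is what makes the re-entry safe: the no-op arm never issues the schema write that v7 would reject.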
Bump this date stanza on the next alignment pass.
### Prior alignment audit: 2026-05-22 (Lance 6.0.1 upstream; omnigraph pinned at 6.0.1)
Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53, Arrow 57 → 58, lance-tokenizer 6.0.1 added, tantivy* removed). Direct 4 → 6 jump; v5.x was not used as an intermediate (rationale in `~/.claude/plans/shimmering-percolating-duckling.md`). Behavior-affecting findings:
@ -169,6 +186,7 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53,
- **`Dataset::checkout_version(N).await?.restore().await?`**: `restore()` takes `&mut self` and returns `Result<()>` (mutates in place, does not consume + return a new dataset). The recovery rollback hammer at `db/manifest/recovery.rs:505-522` continues to work. Pinned by `lance_surface_guards.rs::_compile_checkout_version_then_restore_signature`.
- **`DatasetBuilder::from_namespace(...).with_branch(...).with_version(...).load()`** surface preserved (the namespace builder chain at `db/manifest/namespace.rs:162-174`). Pinned by `lance_surface_guards.rs::_compile_dataset_builder_from_namespace_signature`.
- **`compact_files(&mut ds, CompactionOptions::default(), None)`** signature stable. `CompactionOptions` still does not expose `data_storage_version`; `compact_files` builds its own `WriteParams { ..Default::default() }`. Note: `LanceFileVersion::default()` is now V2_1 in v6, so optimize-rewritten fragments come out at V2_1 by default (was V2_0 in v4). Existing explicit V2_2 pins on creates/appends still apply.
- **`Dataset::optimize_indices(&mut self, &lance_index::optimize::OptimizeOptions)`** (via `DatasetIndexExt`) is a depended-on surface as of the index-coverage work: `db/omnigraph/optimize.rs` calls it after `compact_files` to fold appended/rewritten fragments into existing indexes (incremental merge, not retrain). It is a **committing** call (mutates in place, advances HEAD; no uncommitted variant in v6.0.1), so optimize treats it as an inline-commit residual under the `SidecarKind::Optimize` recovery sidecar. Signature pinned by `lance_surface_guards.rs::_compile_optimize_indices_signature`; the incremental-coverage behavior pinned by `optimize_indices_extends_fragment_coverage` (appended fragment uncovered before, covered after).
- **`Dataset::delete(predicate)` returns `DeleteResult { new_dataset: Arc<Dataset>, num_deleted_rows: u64 }`** — unchanged shape. Pinned by `lance_surface_guards.rs::_compile_delete_result_field_shape`. MR-A will repurpose this guard to the staged two-phase variant once `DeleteBuilder::execute_uncommitted` migration lands.
- **File reader read methods now async** (Lance PR #6710, v6.0). No effect — omnigraph reaches Lance exclusively through `Dataset::scan` and the staged-write API.
- **Tokenizer vendored as `lance-tokenizer`** (Lance PR #6512, v6.0). No effect — no direct tokenizer imports.
@ -178,6 +196,4 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53,
- **`Dataset::force_delete_branch`** (`branches().delete(name, force=true)`, dataset.rs:524) tolerates a missing branch-*contents* ref (vs plain `delete_branch`'s `RefNotFound`), but on the local store still errors `NotFound` if the branch `tree/` directory is fully absent (`remove_dir_all`'s NotFound is not caught for Lance's native error variant, refs.rs:526-549). Both variants still refuse a branch with referencing descendants (`RefConflict`). `TableStore::force_delete_branch` wraps this to be fully idempotent (tolerates already-absent). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim + cleanup reconciler). Pinned by `lance_surface_guards.rs::force_delete_branch_semantics`. Branch delete is "flip the ref atomically, then `remove_dir_all(tree/{branch})`"; branch-exclusive data lives under `tree/{branch}/` so a drop reclaims it immediately without touching `main`.
- **Lance blob-v2 `compact_files` bug** (no public issue found as of 2026-06): `compact_files` disables binary-copy for blob datasets and forces `BlobHandling::AllBinary` on the read side; the v2.1+ structural decoder then mis-counts column infos for the blob-v2 struct and fails with `Invalid user input: there were more fields in the schema than provided column indices / infos` (`lance-encoding/src/decoder.rs::ColumnInfoIter::expect_next`). This fails even a pristine uniform-V2_2 multi-fragment blob table; vector/list/scalar/ragged columns and mixed file versions all compact fine. Reads/queries use descriptor handling (`BlobHandling::default()`) and are unaffected. `optimize` skips blob-bearing tables behind `LANCE_SUPPORTS_BLOB_COMPACTION = false` (`db/omnigraph/optimize.rs`), reporting `SkipReason::BlobColumnsUnsupportedByLance`. Pinned by `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns`, which turns red when the bug is fixed → flip the gate, remove the skip branch + the `maintenance.rs::optimize_skips_blob_table_and_reports_skip` skip assertions.
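The "fully idempotent" wrapping described for `TableStore::force_delete_branch` can be illustrated with a local-filesystem analogue. This is a sketch under stated assumptions, not the omnigraph wrapper: `remove_branch_tree_idempotent` is a hypothetical helper that only models the `tree/{branch}` directory removal with `std::fs`, whereas the real code goes through Lance's branch API and its error variants.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Remove a branch's `tree/{branch}` directory, treating "already absent" as success.
/// Returns `Ok(true)` if something was removed, `Ok(false)` if it was already gone.
/// (Hypothetical helper; models only the local-store directory step.)
fn remove_branch_tree_idempotent(tree_dir: &Path) -> io::Result<bool> {
    match fs::remove_dir_all(tree_dir) {
        Ok(()) => Ok(true),
        // The case the plain call surfaces as an error: the directory is fully absent.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(false),
        // Every other failure (permissions, I/O) still propagates.
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let root = std::env::temp_dir().join(format!("omnigraph-sketch-{}", std::process::id()));
    let branch = root.join("tree").join("feature");
    fs::create_dir_all(&branch)?;

    let first = remove_branch_tree_idempotent(&branch)?;
    let second = remove_branch_tree_idempotent(&branch)?; // second call must not error
    println!("first removed: {first}, second removed: {second}");

    fs::remove_dir_all(&root)?;
    Ok(())
}
```

Only the not-found case is swallowed, so a reconciler can retry the reclaim safely while genuine storage failures stay visible.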
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
Bump this date stanza on the next alignment pass.
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only; plus the index-coverage work's `_compile_optimize_indices_signature` and `optimize_indices_extends_fragment_coverage`). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).


@ -348,4 +348,4 @@ Callers move at their own pace. The envelope upgrades + URL rename ship in v0.6.
- RFC 8288 (`Link` relations, `successor-version`)
- MCP spec: [modelcontextprotocol.io](https://modelcontextprotocol.io)
- [invariants.md](./invariants.md) — substrate boundaries this work respects
- [../user/server.md](../user/server.md) — current HTTP surface (post-MR-656 picks up the `/query`+`/mutate` rename and deprecation)
- [../user/server.md](../user/operations/server.md) — current HTTP surface (post-MR-656 picks up the `/query`+`/mutate` rename and deprecation)


@ -68,7 +68,7 @@ anything moves — mirroring the storage collapse, where the pinned contract
tests gated the swap, and the test-monolith modularization (#192/#193), which
makes Phase 3 tractable: the CLI dispatch is 1,184 lines today, not 4,200.
### Phase 1 — Parity matrix (the referee; do first, no refactor)
### Phase 1 — Parity matrix (the referee; do first, no refactor) *(landed)*
A CLI integration test (extend the `system_local.rs` harness, which already
spawns both binaries): one fixture graph; for every forked verb, run the
@ -81,7 +81,16 @@ This pins today's behavior so Phase 3 can't silently change it, and catches
every future fork drift. It also incidentally covers utoipa annotation↔route
mismatches (a lying `#[utoipa::path]` makes the remote leg 404).
### Phase 2 — One wire-DTO crate
**Phase 1 outcome (landed):** `crates/omnigraph-cli/tests/parity_matrix.rs`
— 11 rows green with an **empty divergence ledger**: with matched Cedar
policy on both arms, embedded and remote agree on every forked verb's
scrubbed JSON and exit codes. Two findings along the way: like-for-like
requires the same policy bundle on both arms (a tokens-only server is
default-deny by design — the harness encodes this), and inline execution's
unbound-param matches-all vs the invoke path's hard error is a cross-path
asymmetry, filed as #207 and pinned (not repaired) by the matrix.
### Phase 2 — One wire-DTO crate *(landed)*
Move the HTTP request/response types and the single `engine result → DTO`
mapping per verb into a shared crate (working name `omnigraph-api-types`),
@ -113,6 +122,15 @@ neither axum nor the engine's internals. The engine crate does not depend on
it — the `engine result → DTO` mapping lives in the shared crate (or the CLI/
server side), taking engine result types as input.
**Phase 2 outcome (landed):** `crates/omnigraph-api-types` holds the wire
DTOs + their `engine-result → DTO` mappings; `omnigraph-server::api` is a
`pub use` re-export (so `openapi.json` is byte-identical — the referee
passed with zero diff), and the CLI consumes the crate directly. One
deliberate refinement of the original sketch: `LoadOutput` is a rendered
CLI output type, not a wire DTO, so it stayed CLI-side — both its mappings
(local `LoadResult`, remote `IngestOutput`) now sit together in
`output.rs`. The parity matrix passed textually unchanged.
### Phase 3 — `GraphClient` trait, two implementations
```text
@ -143,15 +161,20 @@ and cluster commands must work with the server down) explicit in code.
"Server" targets include operator-config named servers (RFC-007), not only
literal `http(s)://` URIs.
### Phase 5 — Route alignment
### Phase 5 — Route alignment (landed)
Add a canonical `/load` endpoint (the handler already exists behind the
`/ingest` shim); point `RemoteClient` at it; keep `/ingest` on its existing
deprecation path. While here, check whether the server uses `utoipa-axum`'s
router-coupled registration (`OpenApiRouter`/`routes!`); if it hand-mounts
routes beside `#[utoipa::path]` annotations, prefer migrating registration so
path annotations and mount points are the same declaration (the modularization
already hit one orphaned-attribute incident of exactly this class).
Added a canonical `POST /load` (shared `run_ingest` body; the deprecated
`/ingest` is now a thin alias carrying `#[deprecated]` + RFC 9745/8288
`Deprecation`/`Link: </load>` headers, exactly mirroring `/mutate` → `/change`)
and pointed the CLI's remote `load` arm at it; `/ingest` stays on its
deprecation path. `/load` reuses `IngestRequest`/`IngestOutput` (as canonical
`/mutate` reuses `Change*`); a DTO rename is a separate change.
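As a rough illustration of the alias headers, the pair could be assembled like this. It is a sketch, not the server's handler code: `deprecation_headers` is a hypothetical helper, the timestamp is a placeholder, and the value shapes assume RFC 9745's `@<unix-seconds>` date form and an RFC 8288 `successor-version` link relation.

```rust
/// Build the header pair a deprecated alias route would attach to its response.
/// (Hypothetical helper; names and timestamp are illustrative.)
fn deprecation_headers(successor_path: &str, deprecated_at_unix: i64) -> Vec<(&'static str, String)> {
    vec![
        // RFC 9745: the Deprecation field carries a structured-field Date, written `@<seconds>`.
        ("Deprecation", format!("@{deprecated_at_unix}")),
        // RFC 8288: point callers at the canonical replacement route.
        ("Link", format!("<{successor_path}>; rel=\"successor-version\"")),
    ]
}

fn main() {
    // Placeholder date; the actual deprecation date is not stated in this document.
    for (name, value) in deprecation_headers("/load", 1_767_225_600) {
        println!("{name}: {value}");
    }
}
```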
Registration finding: the server **hand-mounts** routes (`.route(...)`) beside a
manual `#[openapi(paths(...))]` list, not `utoipa-axum`'s `OpenApiRouter`/
`routes!`. This PR followed the existing manual pattern (one `.route` + one
`paths(...)` entry + the `#[utoipa::path]` annotation) rather than migrating
registration — the migration is a worthwhile but orthogonal cleanup, deferred.
## Non-goals


@ -0,0 +1,449 @@
# RFC: Restructure the CLI Around Explicit Planes
**Status:** Proposed
**Date:** 2026-06-13
**Audience:** CLI/server/cluster maintainers
**Builds on:** [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md)
(Phases 3a–3c landed; the embedded/remote data-plane fork is now one
`GraphClient` enum; this RFC **expands RFC-009 Phase 4** from a narrow
embedded-vs-remote capability table into the full plane model, and leaves
Phase 5 route alignment where it is),
[rfc-007-operator-config.md](rfc-007-operator-config.md) (operator
`--server`/`--graph`/`--target` addressing — the surfaces this RFC makes
uniform across planes),
[rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md).
**Sequencing:** post-v0.7.0, after RFC-009 Phase 3c (done).
## Summary
The CLI silently spans **three planes** — data, storage/maintenance, and
control — and forces the operator to know which plane each verb lives on *and*
address a graph differently per plane. The same graph you query as
`--server prod --graph knowledge` you must maintain as
`s3://bucket/knowledge.omni`. Plane restrictions (`graphs list` is server-only,
`optimize` is storage-only) are *accidental* — discovered by hitting a cryptic
error, not *declared*.
This RFC makes the plane model **explicit and coherent** with three moves:
1. **One graph-addressing model** across every verb (`--target`/`--graph`/
positional URI/`--server`), resolving to a storage URI for maintenance and a
remote client for data — instead of two different ways to name one graph.
2. **A declared, per-subcommand capability surface** (RFC-009 Phase 4): each
verb declares its plane(s); wrong-plane invocations get an honest "this is
storage-plane, `--server` doesn't apply" error from one table, not scattered
`bail!`s.
3. **Plane-grouped `--help`** so the model is legible at a glance.
No new server feature. Storage maintenance stays off the wire — deliberately.
## Current state of affairs
The CLI has 23 top-level commands. They divide into three planes, addressed
three different ways:
| Plane | Verbs | Reaches the graph by | Addressing surface |
|---|---|---|---|
| **Data** | `query`, `mutate`, `load`, `ingest`, `branch *`, `snapshot`, `export`, `commit *`, `schema show/apply` (and `graphs list`, **remote-only today** — see note) | embedded engine **or** HTTP server (one `GraphClient`) | positional URI **or** `--target` / `--graph` / `--server` (config aliases) |
| **Storage / maintenance** | `init`, `optimize`, `repair`, `cleanup`, `schema plan`, `queries validate` | embedded engine **only**, directly on storage (`file://` or `s3://`) | positional URI **or** `--target`**no `--server` / `--graph`** (except `init`, which today takes **only a required positional URI** — no `--target`) |
| **Control** | `cluster validate/plan/apply/approve/status/refresh/import/force-unlock` | a cluster **directory** (`file://` or `s3://`), not a graph URI | `--config <dir>` |
### What's confusing (validated facts)
1. **Two names for one graph.** Data verbs resolve `--server prod --graph
knowledge` through `GraphClient::resolve*` (the embedded/remote fork collapsed
in RFC-009 Phases 3a–3c; only the two `GraphClient` factories call
`apply_server_flag`). Maintenance verbs instead use
`resolve_uri`/`resolve_local_uri` and accept only a positional URI or
`--target` — so to compact the graph you *query* as `--server prod --graph
knowledge` you must *type* `s3://bucket/knowledge.omni`. One graph, two
addressing vocabularies.
> **Note (`graphs list`).** It is routed through `GraphClient` only to share
> the addressing/token resolver; its embedded arm fails loudly, so it is
> **remote-only today** (the later capability table and *Relationship to
> RFC-009* record it as remote-now / embedded-cluster-later).
2. **Plane restrictions are accidental, not declared.** `graphs list` is
server-only and `optimize`/`repair`/`cleanup`/`init` are storage-only purely
by code shape. Point `optimize` at an `https://` URL and you get whatever
`Omnigraph::open` says about an https URI — accidental error text that, per
Hyrum's Law, is already someone's dependency. The capability is real but
unstated.
3. **The split is per-subcommand, and the family names hide it.** `schema plan`
is storage-only (`resolve_local_uri`) while `schema show`/`schema apply` are
data-plane (the graph client). `queries validate` opens the graph to
typecheck while `queries list` only reads the registry config. The plane is
a property of the *subcommand*, not the family.
4. **Maintenance has no server/cluster counterpart at all.** There is no HTTP
route and no `cluster` subcommand for `optimize`/`cleanup`/`repair` (verified:
nothing in the server route table, nothing in `omnigraph-cluster/src`). For a
server-backed deployment you run the *same CLI* against the storage URI,
out-of-band from the serving process. This is correct (maintenance is
heavyweight, destructive, single-operator — it should not be a multi-tenant
HTTP surface), but it is **undocumented in the CLI's own shape**, so it reads
as an omission rather than a decision.
5. **`init` has a hidden control-plane twin.** Bare `init` creates a single
graph from storage; in cluster mode the equivalent is `cluster apply`
(graph-creation stage, with ledger/recovery/approval semantics). Same intent,
two entry points, no signpost between them.
6. **Flat `--help`.** All 23 commands list as one undifferentiated block, so the
plane a verb belongs to is tribal knowledge.
The net effect: a new operator must already know OmniGraph's plane architecture
to predict which flags work on which verb and how to name a graph. The CLI does
not teach its own model.
## Target CLI ergonomics
The throughline: **you name a graph one way, and the CLI tells you what works
where.** Simple examples of the end state:
### One name for a graph, everywhere
A config target `knowledge` works on every verb that touches that graph:
```bash
omnigraph query --target knowledge --query q.gq # data (embedded or remote, auto)
omnigraph load --target knowledge --data rows.jsonl # data
omnigraph optimize --target knowledge # maintenance (resolves to its storage URI)
omnigraph cleanup --target knowledge --keep 10 --confirm
omnigraph repair --target knowledge --confirm
```
The positional URI form still works everywhere, unchanged:
```bash
omnigraph optimize s3://bucket/knowledge.omni
```
### Data plane: same command, embedded or remote
You don't pick "local vs server" syntax — resolution decides:
```bash
omnigraph query ./local.omni --query q.gq # opens engine directly
omnigraph query --server prod --graph knowledge --query q.gq # over HTTP
omnigraph query --target knowledge --query q.gq # whichever the config says
```
### Maintenance: `--target` must resolve to direct storage (loud if not)
```bash
$ omnigraph optimize --target prod
error: `--target prod` resolves to a remote server (https://prod…).
`optimize` is a storage-plane command and needs direct storage access.
Pass the graph's s3://… URI, or use --cluster <dir> --graph <id>.
```
Cluster-managed graphs get an explicit, intentional path (no implicit
`cluster.yaml` peeking):
```bash
omnigraph optimize --cluster ./cluster --graph knowledge
```
### Wrong-plane = one honest, stable error
```bash
$ omnigraph optimize --server prod
error: `optimize` is a storage-plane command; `--server` addresses the data
plane and does not apply here. Use --target <name> or a storage URI.
$ omnigraph graphs list ./local.omni
error: `graphs list` needs a remote multi-graph server (http/https) today.
(Embedded cluster-catalog enumeration is planned — RFC-009.)
```
### `--help` teaches the model
```
DATA PLANE run against a graph (embedded or --server)
query mutate load branch snapshot export commit schema show schema apply
STORAGE / MAINTENANCE direct storage access; no server
init optimize repair cleanup schema plan queries validate
CONTROL PLANE manage a cluster directory
cluster
INSPECT / SESSION
graphs list queries list lint policy embed login logout config
```
### Exceptions, signposted (not silent)
```bash
omnigraph init --schema s.pg ./new.omni # plain path: fine
$ omnigraph init --target knowledge --schema s.pg # cluster-managed target: redirected
error: `knowledge` is a cluster-managed graph. Create it via `cluster apply`
(which records ledger + recovery + approvals), not `init`.
```
**In one line:** one way to name a graph, the right flags accepted per verb, and
a CLI that tells you its planes instead of making you memorize them.
## Proposed shape (mechanism)
### One addressing model for every graph-addressing verb
Route **all** graph-addressing verbs — data *and* maintenance — through one
resolver that turns `(positional URI | --target | --graph | --server)` into
either a **storage URI** (`file://`/`s3://`) → embedded execution, or a **remote
`GraphClient`** → HTTP execution, per the verb's declared plane.
**Authority rule (the precedence must not be silent).** `--target` is an
operator/legacy target lookup; `cluster.yaml` is a *different* authority surface
(read only by `cluster` commands and `--cluster` boot). A maintenance verb must
not quietly consult both and invent a precedence. The rule:
- A maintenance verb's `--target` resolves through the **operator/legacy**
config and its URI must already be **direct storage**; a target that resolves
to a remote (`http(s)://`) URL **fails loudly** (see the example above).
- **Cluster-managed graphs are addressed explicitly** via a cluster-root +
graph-id pair (spelled `--cluster <dir> --graph <id>` for illustration), so
reading cluster state is an intentional mode — never an implicit fallback
between operator config and `cluster.yaml`.
> **Flag-shape caveat (deferred).** `--graph` is *already* a global flag that
> `requires = "server"` and appends `/graphs/<id>` to a **remote** URL — a
> different meaning, and clap won't permit `--graph` without `--server`. So the
> cluster-maintenance addressing needs either a distinct flag (e.g.
> `--cluster-graph <id>`) or an explicit global-flag migration. This is why
> the cluster-managed resolver is **deferred to a later slice** (it also rides
> the applied-state-vs-declared-config open question below); the
> operator/legacy `--target` path lands first.
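The first half of the authority rule (a maintenance `--target` must already be direct storage, and a remote one fails loudly) can be sketched as a small check. This is an assumption-laden illustration rather than the planned resolver: `resolve_maintenance_target` is a hypothetical function, the config lookup is assumed to have already produced `resolved_uri`, and the error wording only approximates the example shown earlier.

```rust
/// Accept a target whose URI is direct storage; reject one that points at a server.
/// (Hypothetical function; the operator/legacy config lookup happens before this.)
fn resolve_maintenance_target(verb: &str, target: &str, resolved_uri: &str) -> Result<String, String> {
    let is_remote = resolved_uri.starts_with("http://") || resolved_uri.starts_with("https://");
    if is_remote {
        // Fail loudly instead of falling back to any other authority surface.
        return Err(format!(
            "`--target {target}` resolves to a remote server ({resolved_uri}). \
             `{verb}` is a storage-plane command and needs direct storage access."
        ));
    }
    // file://, s3://, or a plain local path: hand the URI to the embedded engine.
    Ok(resolved_uri.to_string())
}

fn main() {
    println!("{:?}", resolve_maintenance_target("optimize", "knowledge", "s3://bucket/knowledge.omni"));
    println!("{:?}", resolve_maintenance_target("optimize", "prod", "https://prod.example.com"));
}
```

Note that the function never consults `cluster.yaml`; cluster-managed graphs would go through the separate, explicit resolver that the caveat above defers.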
### A declared, per-subcommand capability surface (RFC-009 Phase 4, expanded)
One table, **per subcommand** (family-level rows hide exactly the cases the
table exists to make non-accidental):
| Command | Data (embedded) | Data (remote) | Storage (direct) | Config / session | Notes |
|---|---|---|---|---|---|
| `query`, `mutate`, `load`, `ingest` | ✅ | ✅ | — | — | `ingest` is the deprecated alias of `load` |
| `branch create/list/delete/merge` | ✅ | ✅ | — | — | |
| `snapshot`, `export`, `commit list/show` | ✅ | ✅ | — | — | |
| `schema show` | ✅ | ✅ | — | — | |
| `schema apply` | ✅ | ✅ | — | — | declarative alternative: `cluster apply` |
| `schema plan` | — | — | ✅ | — | local resolver today |
| `queries validate` | — | — | ✅ | — | opens the graph to typecheck |
| `init` | — | — | ✅ | — | cluster-managed graphs → `cluster apply` |
| `optimize`, `repair`, `cleanup` | — | — | ✅ | — | |
| `graphs list` | (later) | ✅ | — | — | remote today; embedded-cluster later (RFC-009) |
| `queries list` | — | — | — | ✅ | reads the registry config; no graph |
| `lint` | — | — | ✅ | ✅ | `--schema` file, or opens a local graph |
| `policy validate/test/explain` | — | — | — | ✅ | reads policy files + config |
| `embed` | — | — | — | ✅ | local tooling (files + embedding API) |
| `login`, `logout`, `config`, `version` | — | — | — | ✅ | session / config; no graph |
The resolver consults this table. A wrong-plane invocation produces one honest,
stable message instead of N ad-hoc `bail!`s and accidental `open` errors.
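One way to picture "the resolver consults this table" is a const slice keyed by subcommand, with a single function producing the wrong-plane message. The sketch below is illustrative only: `Capability`, `CAPABILITIES`, and `check_server_flag` are hypothetical names, only a few rows are transcribed, and the fallback message for config-only commands is a placeholder that this RFC does not specify.

```rust
/// One row of the per-subcommand capability table (hypothetical shape).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct Capability {
    command: &'static str,
    data_embedded: bool,
    data_remote: bool,
    storage_direct: bool,
}

/// A few rows transcribed from the table above; the real table would cover every command.
const CAPABILITIES: &[Capability] = &[
    Capability { command: "query", data_embedded: true, data_remote: true, storage_direct: false },
    Capability { command: "schema plan", data_embedded: false, data_remote: false, storage_direct: true },
    Capability { command: "optimize", data_embedded: false, data_remote: false, storage_direct: true },
    Capability { command: "graphs list", data_embedded: false, data_remote: true, storage_direct: false },
    Capability { command: "queries list", data_embedded: false, data_remote: false, storage_direct: false },
];

/// Decide whether `--server` is meaningful for a subcommand, from the table alone.
fn check_server_flag(command: &str) -> Result<(), String> {
    let cap = CAPABILITIES
        .iter()
        .find(|c| c.command == command)
        .ok_or_else(|| format!("unknown command `{command}`"))?;
    if cap.data_remote {
        return Ok(());
    }
    if cap.storage_direct {
        return Err(format!(
            "`{command}` is a storage-plane command; `--server` addresses the data \
             plane and does not apply here. Use --target <name> or a storage URI."
        ));
    }
    // Placeholder wording for config/session commands (not specified by the RFC).
    Err(format!("`{command}` does not address a graph; `--server` does not apply here."))
}

fn main() {
    for cmd in ["query", "optimize", "graphs list", "queries list"] {
        println!("{cmd}: {:?}", check_server_flag(cmd));
    }
}
```

Because every message is derived from one row lookup, changing a verb's plane is a one-line table edit, and the pinned error strings in the test plan follow from it.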
### Plane-grouped `--help`
Group the command list by plane (the `--help` block shown under Target CLI
ergonomics). Cosmetic, zero behavior change, highest legibility-per-line.
### Maintenance stays off the wire (decision, not omission)
This RFC **does not** add server routes for `optimize`/`cleanup`/`repair`:
- **Serving = the server.** Multi-tenant, safe-for-many-callers data plane.
- **Storage maintenance = the CLI against storage**, addressed uniformly,
run by an operator or a scheduled job with storage access.
Adding maintenance-over-HTTP would re-introduce a heavyweight, destructive
multi-tenant surface and *add* a plane rather than clarify the three we have.
A future cluster-driven maintenance reconciler (scheduled compaction/GC as a
control-plane policy) is explicitly **out of scope** — net-new design (who runs
it, with what resource bounds), not a CLI restructure.
### `init` is an explicit exception (decision)
Direct-storage `init` against a plain URI/target stays. But if a target resolves
to a **cluster-managed** graph root, `init` **refuses and signposts** `cluster
apply` (which records ledger, recovery, and approval artifacts) rather than
initializing that root out of band. This closes the "hidden twin" of the current
state.
## Compatibility
Additive and low-risk:
- **`--target`/`--graph` on maintenance verbs** is a new capability; the positional
URI form keeps working unchanged.
- **Grouped `--help`** is cosmetic.
- **Capability-surface error text** changes the message you get on a wrong-plane
or misaddressed invocation. Per Hyrum's Law that text is observable; the change
is deliberate, release-noted, and replaces an *accidental* `Omnigraph::open`
string with a *stable, declared* one — a net improvement, but flagged.
No engine, server, or wire-protocol change. The work is CLI-internal: the shared
resolver, the capability table, and help grouping.
## Test plan
Extend the existing CLI suites rather than adding a duplicate harness:
- **`parity_matrix.rs`** — capability exclusions (the per-subcommand plane table
becomes the source of truth for which verbs are remote-only / storage-only).
- **`cli_data.rs`** — maintenance wrong-plane errors (`optimize --server`,
`optimize --target <remote>`), and `--target` resolving to direct storage.
- **`cli_schema_config.rs`** — `graphs list` plane behavior, `schema plan`
vs `schema show/apply` plane split, and plane-grouped `--help` output.
- **`system_local.rs`** — `--server` / operator-targeting edge cases end-to-end.
Pin the new wrong-plane error strings deliberately: this RFC is intentionally
replacing accidental `Omnigraph::open` strings with stable capability errors, and
those strings become observable behavior (Hyrum).
## Relationship to RFC-009
RFC-009 Phase 4 was scoped as "declared plane capabilities" for the
embedded-vs-remote axis only. This RFC **subsumes and broadens** that phase into
the full three-plane, per-subcommand model (adds uniform maintenance addressing,
the authority rule, and help grouping). RFC-009 Phase 5 (remote `load`
`/load` route alignment) is unaffected and remains in RFC-009.
**`graphs list` reconciliation:** RFC-009's answered open question (pinned in
`parity_matrix.rs`'s exclusions comment) targets `graphs list` becoming
Both-capability once the embedded arm enumerates the cluster catalog. This RFC
**aligns** with that rather than superseding it: the capability table shows
`graphs list` as remote today, embedded-cluster later.
## Open questions
1. **Capability-table location** — a CLI-internal const, or surfaced (e.g. in
`--help` and a machine-readable `omnigraph capabilities` for tooling)?
2. **`--cluster <dir> --graph <id>` for maintenance** — does the maintenance
command resolve the storage URI from the applied cluster state, or from the
declared `cluster.yaml`? (Applied state is the truth the server serves;
declared config may be ahead of it.)
## Review comments (Codex, 2026-06-13)
Overall take: the direction is right. The planes already exist; making them
declared in code, help text, and error messages should reduce operator surprise.
Keeping storage maintenance off HTTP is also the right boundary: `optimize`,
`repair`, and `cleanup` are direct-storage operator actions, not a multi-tenant
serving surface.
Before implementation, tighten these points:
1. **Resolver authority needs a sharper rule.** The proposal says maintenance
resolves storage URIs "from `cluster.yaml` / operator config", but those are
different authority surfaces. Today `--target` is an operator/legacy
graph-target lookup; cluster config is read by `cluster` commands and by
`--cluster` server boot. Do not make a maintenance command silently consult
both and pick a precedence. Either:
- `--target` on maintenance means an operator/legacy target whose URI is
already direct storage, with remote targets failing loudly; or
- add an explicit cluster-root/config resolver for this case, so reading
cluster state is an intentional mode.
**Resolution (accepted):** both — `--target` resolves through operator/legacy
config and must be direct storage (remote → loud fail); cluster-managed graphs
use the explicit `--cluster <dir> --graph <id>` resolver. See *Authority
rule* under Proposed shape.
2. **`graphs list` conflicts with RFC-009's target shape.** This RFC classifies
`graphs list` as remote-only, while RFC-009's answered open question says it
becomes Both-capability once the embedded arm enumerates the cluster catalog.
Pick one direction here: either this RFC explicitly supersedes that target,
or the capability table should show `graphs list` as remote today and
embedded-cluster later.
**Resolution (accepted):** align, don't supersede. The table shows `graphs
list` remote-today / embedded-cluster-later. See *Relationship to RFC-009*.
3. **The capability table should be per subcommand, not per family.** The
family-level rows hide the exact cases the table is supposed to make
non-accidental. At minimum, call out:
- `schema plan` as local/storage-backed today, while `schema show` and
`schema apply` route through the graph client;
- `queries validate` versus `queries list`, which do not have the same
plane shape;
- `lint`, `policy`, `embed`, `login`, `logout`, `config`, and `version`, so
enumeration/session/tooling commands are intentionally classified instead
of falling outside the model.
**Resolution (accepted):** the capability table is now per-subcommand and
classifies every command, including the session/tooling group.
4. **`init` should be an explicit exception.** Direct-storage `init` is fine.
A cluster-managed graph should be created by `cluster apply`, with ledger,
recovery, and approval semantics. If a named target resolves to a
cluster-managed graph root, `init` should signpost `cluster apply` rather
than quietly initializing that root out of band.
**Resolution (accepted):** promoted from open question to a decision. See
*`init` is an explicit exception*.
Testing notes for the implementation slice:
- Extend the existing CLI suites rather than adding a new duplicate harness:
`parity_matrix.rs` for capability exclusions, `cli_data.rs` for maintenance
wrong-plane errors, `cli_schema_config.rs` for `graphs list` / help behavior,
and `system_local.rs` for `--server` / operator-targeting edge cases.
- Pin the new wrong-plane error strings deliberately. This RFC is intentionally
replacing accidental `Omnigraph::open` strings with stable capability errors,
and those strings become observable behavior.
**Resolution (accepted):** captured as the *Test plan* section.
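A minimal sketch of what pinning a wrong-plane string could look like. Everything here is hypothetical: the `Plane` enum, the `wrong_plane_error` helper, and the message wording are illustrations, not the strings the CLI actually emits.

```rust
/// Hypothetical plane labels, used only for this illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Plane {
    Storage,
    Server,
}

fn plane_name(plane: Plane) -> &'static str {
    match plane {
        Plane::Storage => "storage",
        Plane::Server => "server",
    }
}

/// Builds a stable capability error instead of surfacing whatever the
/// engine's open call happened to say. The wording is an assumption.
fn wrong_plane_error(verb: &str, needs: Plane, got: Plane) -> String {
    format!(
        "{} needs a {} target, but the resolved target is a {} target",
        verb,
        plane_name(needs),
        plane_name(got)
    )
}

/// The exact text a suite would pin. Editing it is a visible behavior change.
const OPTIMIZE_ON_SERVER: &str =
    "optimize needs a storage target, but the resolved target is a server target";
```

A suite would compare the full string for equality against the pinned constant rather than matching a substring, so any rewording shows up as a deliberate test change.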
## Verification comments (Codex, 2026-06-13)
Follow-up verification against the current CLI/server code found a few
remaining current-state nits. These are doc-shape issues, not objections to the
proposal:
1. **Current-state table overstates `graphs list`.** The table under *Current
state of affairs* still lists `graphs list` with data verbs that reach the
graph by embedded engine or HTTP. Current code routes it through `GraphClient`
only to share the resolver, but the embedded arm fails loudly; the later
RFC text correctly says remote today / embedded-cluster later. Make the
current-state row match that.
**Resolution (accepted):** the Data row now marks `graphs list` **remote-only
today**, with a note that it rides `GraphClient` only to share the resolver.
2. **Current-state table overstates `init` addressing.** `init` is grouped with
maintenance verbs whose addressing surface is positional URI or `--target`.
Current `init` only accepts a required positional URI and has no `--target`
or config path. The proposal can add that capability, but the current-state
table should not describe it as already present.
**Resolution (accepted):** the Storage row now calls out that `init` takes
**only a required positional URI** today (no `--target`); adding `--target` to
`init` is part of the proposal, entangled with the `init` → `cluster apply`
signpost, not current state.
3. **`apply_server_flag` call-site count is stale.** The text says data verbs
resolve `--server prod --graph knowledge` through `apply_server_flag` at
16 call sites. Current code has the fork collapsed: data verbs call
`GraphClient::resolve*`, and only the two `GraphClient` factories call
`apply_server_flag`. Rephrase the verified fact around `GraphClient`, not
the old pre-collapse call-site count.
**Resolution (accepted):** validated-fact #1 now describes the post-collapse
reality (`GraphClient::resolve*`; the two factories call `apply_server_flag`),
dropping the stale count.
4. **`--cluster <dir> --graph <id>` collides with today's global `--graph`
semantics.** The target ergonomics section proposes that flag shape for
maintenance, but current `--graph` is a global flag that requires
`--server` and appends `/graphs/<id>` to a remote server URL. Either choose
a separate cluster-maintenance graph flag shape, or call out the clap/global
flag migration explicitly as part of the implementation.
**Resolution (accepted):** the *Authority rule* now carries a flag-shape
caveat — the cluster-managed resolver (and its flag shape, e.g.
`--cluster-graph` vs a `--graph` migration) is **deferred to a later slice**;
the operator/legacy `--target` path lands first. The illustrative
`--cluster <dir> --graph <id>` spelling is marked as not-final.


@ -0,0 +1,756 @@
# RFC-011: CLI refactoring — one addressing & config model
**Status:** Accepted — implemented (the `omnigraph.yaml` excision landed as
#250/#251/#252; D1-D4, D6, D7, D9, D10 shipped). Two items remain: **D11**
(server-side maintenance jobs) is gated on the bulk-data-plane RFC #219; **D5**
(combined admin scope) stays deferred by design.
**Date:** 2026-06-14
**Audience:** CLI/server maintainers
**Builds on:** [rfc-007-operator-config.md](rfc-007-operator-config.md)
(per-operator config, keyed credentials, named servers),
[rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md)
(the legacy file this RFC finishes removing),
[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md)
(`GraphClient` — embedded ≡ remote at the execution layer),
[rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md)
(declared planes + the wrong-plane guard this RFC subsumes).
**Sequencing:** lands as / after RFC-008 stage 5 (the `omnigraph.yaml` removal).
## Summary
Refactor the CLI around one coherent model once `omnigraph.yaml` is gone. The
shape:
- **One ontology** (store, server, cluster; cluster config vs operator config;
catalog; profile; capability) where each term names exactly one concept.
- **Addressing = scope + `--graph`, with the access path *derived*.** A command
resolves a *scope* (operator defaults, an optional named *profile*, or one
explicit primitive address — `--store` / `--server` / `--cluster`), selects a
graph inside it with `--graph`, and the **served-vs-direct access path falls out
of the scope's bindings × the verb's capability** — it is never a per-command
toggle and never inferred from a URI scheme.
- **Served is the front door; direct storage is privileged.** The everyday scope
is a *server* (a bearer token, no bucket credentials). Reading or writing a
remote store/cluster directly is an explicit, credentialed, admin/break-glass
act — never the default, never baked into everyday operator config.
- **The CLI is stateless per command.** No `current_profile` pointer, no
`USE`-style mode; every command is fully determined by its flags + static
config. You *select* a graph, you do not *switch into* one.
- **Definitions are named; payloads are passed.** Queries (`.gq`) and schema
(`.pg`) live in the catalog and are invoked by name; params and bulk data are
the only per-call inputs.
This removes `--target`, `--cluster-graph`, `--uri` scheme-dispatch, and the
plane guard's "a `--target` that resolves to a remote URL" special case — and it
collapses the four-plane vocabulary, for users, into a single capability rule.
## Motivation: the legacy file pollutes the taxonomy
Today the CLI exposes four overlapping addressing forms but the system has only
three real entities; the mismatch is the whole problem, and `omnigraph.yaml` is
the carrier:
1. **`--target` straddles kinds.** It resolves through the legacy
`omnigraph.yaml` `graphs:` map (`config.rs::resolve_target_uri`), and that
`.uri` can be a **storage location** (`file`/`s3`) *or* a **remote server**
(`http`). One flag, two access paths with different capability and trust
models. The wrong-plane guard's storage-plane remote rejection
(`helpers.rs:467`) exists *only* to compensate for this overload.
2. **Scheme-inferred transport.** `<URI>`/`--uri` has the same disease a level
down: `is_remote_uri` (`helpers.rs:15`) silently picks embedded vs remote from
the scheme. Transport is guessed from a string, not declared.
3. **No single environment concept.** Defaults are smeared across the deprecated
`omnigraph.yaml` (`cli.graph`, `server.graph`) with no clean way to name or
switch environments.
Removing `omnigraph.yaml` is the moment to fix all three at once.
## Ontology
Every term is one concept. The rest of this RFC uses them precisely.
### Entities — the things that exist
- **Graph** — a typed property graph (node/edge types over Lance); the thing you
query and mutate. *Example: the `knowledge` graph.*
- **Store** — the storage location of a **single** graph: its Lance datasets at a
`file://`/`s3://` URI. Addressed directly with `--store`. *Example:
`s3://acme/clusters/brain/graphs/knowledge.omni`.*
- **Cluster** — a storage root holding **many** graphs plus the catalog and
control-plane state (state ledger, approvals, recovery). Managed as-code by the
team. *Example: the `brain` cluster at `s3://acme/clusters/brain`.*
- **Server** — an `omnigraph-server` process serving graphs over HTTP with bearer
auth and Cedar policy; boots from a bare graph or a cluster. *Example: `prod` at
`https://graph.example.com`, serving the `brain` cluster.*
### Config & catalog — the descriptions
- **Cluster config** — `cluster.yaml` in the cluster root, declaring the **desired
state** (graphs, schemas, stored queries, policies, storage), applied with
`cluster apply`. Team-owned; the source of truth for *what the system is*.
- **Catalog** — the **applied** registry the cluster owns in storage: the graphs,
stored queries, and policies `cluster apply` materialized. What a server serves
and what `query <name>` resolves against. *(Cluster config is the spec; the
catalog is the applied result.)*
- **Operator config** — `~/.omnigraph/config.yaml`, your **personal** file:
identity (actor), default graph, named servers/clusters, output prefs, optional
profiles. Declares *who I am*, never what the system is.
- **Profile** — an optional named bundle of **defaults inside the operator
config** (one of {cluster, server, store} + a default graph). Config data,
**not state**: selecting one fills in omitted flags for a command; it does not
put you "in" a mode. Chosen per command (`--profile <name>`) or per shell
(`OMNIGRAPH_PROFILE`).
- **Credential** — a bearer token keyed to a **server name**, resolved via
`OMNIGRAPH_TOKEN_<NAME>` or `~/.omnigraph/credentials` (`0600`); sent only to
the server it is keyed to. (Per RFC-007 — the operator config holds endpoints,
never tokens.)
### What you run — definitions vs payloads
- **Schema** — the `.pg` type definitions for a graph; authored as a file, applied
via `schema apply` (or `cluster apply`).
- **Stored query** — a named query in the catalog, the team's reusable contract;
invoked by name. *Example: `find_people`.*
- **Query file (`.gq`)** — an authoring artifact holding `query <name>`
declarations; becomes a stored query when `cluster apply` adopts it. For
authoring/ad-hoc, not everyday invocation.
- **Payload** — the per-call inputs that vary each run: params (`--params`,
positional args) and bulk data (`--data`). Never part of config.
### How a command resolves
- **Scope** — the resolved environment a command addresses: operator defaults, a
named profile, or one explicit primitive address.
- **Access path** — **served** (through a server) or **direct** (open storage
in-process). Derived from scope × capability; see "Access path" below.
- **Capability** — what a verb requires: `any`, `served`, `direct`, `control`,
or `local`.
- **Target shape** — whether the verb is **graph-scoped** (selects one graph
inside the scope), **scope-scoped** (operates on the whole server/cluster
scope), or **local** (does not resolve scope or graph).
- **Actor** — the identity a write is attributed to: server-resolved from the
bearer token (served), or `--as` ?? `operator.actor` (direct).
### The relationships that prevent confusion
- **Exactly two config surfaces:** **cluster config** (team) and **operator
config** (personal). Nothing else is "a config."
- A **profile is not a third config** — it lives *inside* the operator config, and
it is **defaults, not state**.
- A **catalog is not config** — it is the *applied state* the cluster owns.
- A **store is one graph; a cluster is many graphs** + catalog + control state.
- A **graph is the logical thing**; store/server/cluster are ways to reach it.
- "State" elsewhere is not the profile: *graph state* is committed data in Lance;
*cluster state* is the applied control-plane ledger. Neither is operator config.
## Design
### First principles
> Addressing should be 1:1 with the system's real entities; the access path
> (served vs direct) should be **derived**, never inferred from a string or
> toggled per command; the CLI should be **terse by config and stateless per
> command**; and **definitions are named while payloads are passed**.
Every command answers five orthogonal questions — kept orthogonal here:
| Axis | Question | Today | Target |
|---|---|---|---|
| Scope | which environment? | `omnigraph.yaml` defaults / `--target` | operator defaults · `--profile` · one primitive |
| Target shape | whole scope or one graph? | implicit in command family | declared per verb |
| Graph | which graph in it? | tangled into the address | `--graph` only for graph-scoped server/cluster verbs |
| Access path | served or direct? | inferred from scheme / target | **derived** from scope × capability |
| Actor | who am I? | `--as` > `cli.actor` (yaml) > `operator.actor` | `--as`/`operator.actor` (direct) · token (served) |
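The Actor row can be sketched as a small function. This is an illustration under stated assumptions: the type and function names are hypothetical, rejecting `--as` on the served path models the end state after the staged warning release, and the `NoActor` error for a direct write with no actor configured is an assumption the RFC does not spell out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AccessPath {
    Served,
    Direct,
}

#[derive(Debug, PartialEq, Eq)]
enum Actor {
    /// Served path: the server derives the actor from the bearer token.
    ServerResolved,
    /// Direct path: the actor is self-declared.
    Declared(String),
}

#[derive(Debug, PartialEq, Eq)]
enum ActorError {
    /// `--as` was passed on a served command (end state of the staged removal).
    AsOnServedPath,
    /// Direct path with neither `--as` nor `operator.actor` (assumed behavior).
    NoActor,
}

fn resolve_actor(
    path: AccessPath,
    as_flag: Option<&str>,
    operator_actor: Option<&str>,
) -> Result<Actor, ActorError> {
    match path {
        AccessPath::Served => match as_flag {
            Some(_) => Err(ActorError::AsOnServedPath),
            None => Ok(Actor::ServerResolved),
        },
        // `--as` ?? `operator.actor`: the flag wins, the config value is the fallback.
        AccessPath::Direct => as_flag
            .or(operator_actor)
            .map(|name| Actor::Declared(name.to_string()))
            .ok_or(ActorError::NoActor),
    }
}
```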
### A scope binds one entity — and served is the default
A scope (a profile, the flat defaults, or one primitive flag) binds **exactly one
of** {server, cluster, store}. Server and cluster scopes may contain many graphs
and can carry a `default_graph`; a store scope is already one graph and does not
accept `--graph`. They differ by privilege, and **the everyday default is a
server**:
- **server** → served (the everyday scope). A bearer token, **no storage
credentials**. Data verbs run through it, policy-enforced; maintenance verbs are
unavailable from this scope — there is no server route for them, so you must
name storage explicitly. This is what a normal operator's config binds.
- **cluster** → direct storage to a managed cluster, for **control,
maintenance, and graph-backed validation only** (`cluster *`,
`optimize`/`repair`/`cleanup`/`schema plan`, graph-backed `lint`, and
`queries validate`). Data verbs are **not** run directly against a cluster —
they go served, or `--store` for ad-hoc. **Privileged:** requires bucket
credentials, so it appears only in a maintainer's config or as an explicit
`--cluster` flag — never in an everyday operator's defaults.
- **store** → one graph's storage, direct. A **local file** store is ordinary
local dev; a **remote `s3://`** store is break-glass. No catalog (named queries
do not resolve — the ad-hoc lane).
A scope names **one** thing, so there is no independent `server`+`cluster` pair
that could disagree (the audit's coherence hazard is gone by construction — the
default is just a server). And the storage root lives only where it must:
### Direct storage access is privileged (the storage-root rule)
> The storage root (`s3://…`) is **server-and-admin knowledge, never
> everyday-operator knowledge.** Everyday operator config binds a server (a bearer
> token, no bucket credentials). Direct remote access — opening a cluster root or
> an `s3://` store — is always **explicit and privileged**: you name
> `--cluster`/`--store`, and only someone with bucket credentials can. The CLI
> never opens a remote store from a default scope.
This is the least-privilege posture — revoke a bearer token, don't rotate bucket
keys; only the **server process** and an occasional **maintenance admin** ever
hold storage credentials. It makes "use the server, not raw storage"
**structural**, not advisory: direct access requires credentials a normal operator
does not have *and* a flag they must type. The only storage root in an everyday
setup is the one the **server** boots from; operators never see it. (Local *file*
stores for dev are unaffected — a local file is not the production bucket.)
### Access path is derived, not chosen
The two access paths are genuinely different — not two transports for one thing:
- **Served** (through a server): the server resolves your actor from a token and
enforces Cedar policy at the HTTP boundary. In cluster mode the **catalog and
config** (graph set, stored queries, policy bundles) are pinned to the applied
serving revision and move only on restart; **graph data** is read through the
server's engine handle against the requested branch/snapshot (it is not frozen
at boot, though a long-running server will not observe *out-of-band direct
writes* to storage until its handle refreshes). No storage credentials needed.
- **Direct** (open the Lance storage in-process): a **privileged** path — it needs
your own storage credentials, so only an admin/maintainer (or a local-dev file
store) takes it. Actor self-declared (`--as` ?? `operator.actor`), reads **live
storage HEAD**. There is **no server-side identity/auth gate** — but engine-level
Cedar policy *is* still enforced when the graph selection provides a policy
(enforcement is engine-wide; embedded `_as` writers call the same `enforce`).
"Direct" means "no HTTP boundary," not "unpoliced."
Because they differ in authority, freshness, and availability, a graph reached via
a server and that graph's raw storage are **different things you name
differently** — not one identity you flip. Making the access path a per-command
toggle (`--via`) is the `--target` mistake in new clothes; it is rejected.
> **The access path follows from the scope and the verb.** A **server** scope →
> served (data/catalog). A **cluster** scope → direct control, maintenance, and
> validation. A **store** scope → direct ad-hoc data (no catalog). The verb's
> capability picks which applies and rejects the mismatches.
State the bound plainly: the everyday data path
(`query`/`mutate`/`load`/`branch`/`export`/`commit`) against a served graph
**never needs direct storage access**, and direct access is legitimate only in
bounded places: **bootstrap** (`init`), **storage-native maintenance**
(`optimize`/`repair`/`cleanup`/`schema plan`), **graph-backed validation**
(`lint`), **catalog validation** (`queries validate`), the **control plane**
(`cluster *`), **local dev** with no server, and **break-glass** (recovery, or
checking whether a long-running server's handle lags live HEAD). Everything else
is served. This is what makes "discourage direct storage" enforceable rather
than aspirational.
This list is expected to **shrink**: Decision 11 moves
`optimize`/`cleanup` (and healthy-path `repair`) to server-managed jobs, which
would leave direct access to just standalone/local dev, the control plane, and
break-glass — and remove the last routine reason an admin needs bucket
credentials.
### Capability semantics
The CLI validates through verb capability, not plane jargon:
| Capability | Meaning | Examples |
|---|---|---|
| `any` | graph-scoped data; served via a server scope; direct only against a **store** scope (local dev / break-glass); **errors on a cluster scope** | `query`, `mutate`, `load`, `export`, branch reads, `schema show/apply` |
| `served` | requires an HTTP server; may be graph-scoped or scope-scoped | `graphs list`, `queries list` |
| `direct` | graph-scoped storage-native or graph-backed validation; no server form exists | `init`, `optimize`, `repair`, `cleanup`, `schema plan`, graph-backed `lint` |
| `control` | cluster-scoped catalog/control-plane work; addresses the cluster, not a single raw store | `cluster *`, `queries validate` |
| `local` | does not address a graph or scope | `config`, `profile`, `lint --query ... --schema ...` |
`any` does **not** mean "the user picks": the resolver picks from the scope.
Internally the exhaustive `command_plane` match (`planes.rs`) stays as the drift
guard; user-facing errors speak in terms of what the command needs.
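The capability table, combined with step 6 of the resolution rule, reduces to one match over (capability, scope). The sketch below is a reading of those rules, not the contents of `planes.rs`; all type and variant names are hypothetical.

```rust
/// Capabilities that address a scope. `local` is left out on purpose:
/// local verbs reject scope flags and never derive an access path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Capability {
    Any,
    Served,
    Direct,
    Control,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScopeKind {
    Server,
    Cluster,
    Store,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AccessPath {
    Served,
    Direct,
}

/// Which axis is missing, so the error can name it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mismatch {
    NeedsServer,
    NeedsStorage,
    NeedsCluster,
    ClusterIsMaintenanceOnly,
}

fn derive_access_path(cap: Capability, scope: ScopeKind) -> Result<AccessPath, Mismatch> {
    match (cap, scope) {
        // `any`: the resolver picks from the scope; the user never does.
        (Capability::Any, ScopeKind::Server) => Ok(AccessPath::Served),
        (Capability::Any, ScopeKind::Store) => Ok(AccessPath::Direct),
        (Capability::Any, ScopeKind::Cluster) => Err(Mismatch::ClusterIsMaintenanceOnly),
        (Capability::Served, ScopeKind::Server) => Ok(AccessPath::Served),
        (Capability::Served, _) => Err(Mismatch::NeedsServer),
        (Capability::Direct, ScopeKind::Server) => Err(Mismatch::NeedsStorage),
        (Capability::Direct, _) => Ok(AccessPath::Direct),
        // Control work opens the cluster root in-process, so it is a direct path.
        (Capability::Control, ScopeKind::Cluster) => Ok(AccessPath::Direct),
        (Capability::Control, _) => Err(Mismatch::NeedsCluster),
    }
}
```

Keeping the match exhaustive, with no catch-all across capabilities, means a newly added capability or scope kind fails to compile until someone classifies it.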
### Definitions vs payloads
Queries and schema are **definitions** — contracts that live in the catalog and
are invoked **by name**; params and data are **payloads** passed per call. So the
everyday form is `omnigraph query <name> [params]`, not
`omnigraph query --file find.gq`. A `.gq` path on a routine query is a smell: the
query is not in the catalog yet. Lifecycle: **author a `.gq` → `cluster apply`
adopts it → invoke by name thereafter.**
Named queries resolve through a **server** (which serves the cluster's catalog).
`queries list` is therefore a served catalog read. `queries validate` is a
control/catalog check against the cluster-owned query definitions. A bare
`--store` has **no catalog**, so it is the ad-hoc lane (`-e` / `--file`), and
`--cluster` does not invoke stored queries. So named-query invocation is a
**served** convenience; direct access (`--store`) is always ad-hoc.
| Kind | Examples | How it enters a command |
|---|---|---|
| Definition | stored query, schema | named in the catalog; authored as a file, adopted by `cluster apply` |
| Payload | params, bulk data | passed per call (`--params`, positional args, `--data`) |
| Authoring / ad-hoc | a `.gq` you're writing | `-e '…'`, `--file new.gq`, `lint --query new.gq --schema schema.pg`, `schema apply --schema` |
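A sketch of the by-name check implied here and in Decision 3: a name resolves only where a catalog is served, and the verb must match the definition's kind. The in-memory `catalog` map is a stand-in for whatever the server exposes, and every name in the block is hypothetical.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Query,
    Mutation,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScopeKind {
    Server,
    Cluster,
    Store,
}

#[derive(Debug, PartialEq, Eq)]
enum InvokeError {
    /// A store has no catalog; a cluster scope does not invoke stored queries.
    NoCatalogInScope,
    UnknownName,
    /// The name exists but is the other kind; `use_verb` is the verb to suggest.
    WrongVerb { use_verb: &'static str },
}

fn check_stored_invocation(
    catalog: &HashMap<String, Kind>,
    scope: ScopeKind,
    verb: Kind,
    name: &str,
) -> Result<(), InvokeError> {
    if scope != ScopeKind::Server {
        return Err(InvokeError::NoCatalogInScope);
    }
    match catalog.get(name) {
        None => Err(InvokeError::UnknownName),
        Some(&kind) if kind == verb => Ok(()),
        Some(&Kind::Mutation) => Err(InvokeError::WrongVerb { use_verb: "mutate" }),
        Some(&Kind::Query) => Err(InvokeError::WrongVerb { use_verb: "query" }),
    }
}
```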
### Resolution rule
1. If the verb is `local`, reject graph/scope flags and run without resolving a
scope.
2. If a primitive address is supplied (`--store`/`--server`/`--cluster`), use it
and ignore operator-config scope defaults. *(A **named** primitive — `--server
prod`, `--cluster brain` — still resolves through the operator-config registry;
a **literal** — `--server https://…`, `--store s3://…` — bypasses it. Per
Decision 2: a value containing `://` is a literal, otherwise a config-name
lookup.)*
3. Else if `--profile <name>` (or `OMNIGRAPH_PROFILE`) selects a profile, use it.
4. Else use the operator config's flat defaults. Error only if neither resolves.
*(No sticky "current" pointer — each command resolves scope fresh.)*
5. Resolve the graph only for **graph-scoped** verbs. Server/cluster scopes:
exactly one graph in scope → use it; else `default_graph`; else require
`--graph <id>`. Store scopes are already one graph, so `--graph` is rejected.
**Scope-scoped** verbs (`graphs list`, `queries list`, `queries validate`,
and `cluster *`) do not select a graph unless their own resource argument says
otherwise.
6. Derive the access path from capability × scope:
- `direct` verb → the scope's cluster/store; if the scope is a server, error
(name storage explicitly — it is privileged).
- `served` verb → the scope's server; if the scope is a cluster/store, error.
- `control` verb → the scope's cluster; if the scope is a server/store, error
(name a cluster explicitly — it is privileged).
- `any` verb → **served** if the scope is a server; **direct** against a
**store** scope (ad-hoc); on a **cluster** scope, error — cluster is
maintenance-only, so use a server for data or `--store` for ad-hoc.
7. Reject mismatches with an error naming the missing axis.
Good errors:
```text
scope "prod" has 4 graphs; pass --graph <id> or set default_graph
optimize needs direct storage access; scope "prod" is a server — name storage with --cluster s3://… or --store (requires storage credentials)
graphs list enumerates a server scope; do not pass --graph
--store opens raw storage directly, bypassing any server (no HTTP auth gate, live HEAD); for recovery/inspection
```
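Step 5 of the resolution rule (graph selection for graph-scoped verbs) can be sketched as follows. The names are hypothetical. Two points are assumptions rather than RFC text: an explicit `--graph` is checked before the fallbacks, and the flag's value is not validated against the scope's graph list.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ScopeKind {
    Server,
    Cluster,
    Store,
}

#[derive(Debug, PartialEq, Eq)]
enum GraphChoice {
    /// A store scope is already exactly one graph.
    StoreItself,
    Named(String),
}

#[derive(Debug, PartialEq, Eq)]
enum GraphError {
    /// `--graph` was passed against a store scope.
    GraphFlagOnStore,
    /// No flag, no single graph, no default: list candidates, never auto-pick.
    NeedGraphFlag { candidates: Vec<String> },
}

fn select_graph(
    scope: ScopeKind,
    graphs_in_scope: &[&str],
    default_graph: Option<&str>,
    graph_flag: Option<&str>,
) -> Result<GraphChoice, GraphError> {
    if scope == ScopeKind::Store {
        return match graph_flag {
            Some(_) => Err(GraphError::GraphFlagOnStore),
            None => Ok(GraphChoice::StoreItself),
        };
    }
    if let Some(id) = graph_flag {
        return Ok(GraphChoice::Named(id.to_string()));
    }
    // Order follows the rule: exactly one graph, then default_graph, then error.
    if graphs_in_scope.len() == 1 {
        return Ok(GraphChoice::Named(graphs_in_scope[0].to_string()));
    }
    if let Some(id) = default_graph {
        return Ok(GraphChoice::Named(id.to_string()));
    }
    Err(GraphError::NeedGraphFlag {
        candidates: graphs_in_scope.iter().map(|g| g.to_string()).collect(),
    })
}
```

The `NeedGraphFlag` variant carries the candidates so the caller can render a message of the same shape as the first error shown above.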
### Config shape (operator config)
`~/.omnigraph/config.yaml` — your personal file; the cluster config
(`cluster.yaml` + catalog) is the separate, team-owned surface. The default-graph
key is `default_graph` everywhere (the per-command flag is `--graph`).
**Everyday operator — binds a server, holds no storage root:**
```yaml
defaults:
server: prod
default_graph: knowledge
output: table
servers:
prod: { url: https://graph.example.com } # token keyed by name (RFC-007); no creds here
staging: { url: https://staging.example.com }
profiles: # optional, only for multiple environments
staging: { server: staging, default_graph: knowledge }
```
A normal operator never has a storage root or bucket credentials. Their default
scope is served; `optimize`/`repair`/`cleanup` error with a pointer to name
storage explicitly.
**Maintainer — opts into a cluster root (and has bucket credentials):**
```yaml
profiles:
brain-admin: { cluster: brain, default_graph: knowledge } # direct; admin/control/maintenance
clusters:
brain: { root: s3://acme/clusters/brain } # the s3:// root lives ONLY here
```
The `clusters:` block — the only place a storage root appears in operator config —
is **admin-only and opt-in**, absent from a normal operator's file. Equivalently,
skip config and name it per command:
`omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge`. The
cluster stays the source of truth for the managed catalog; tokens live in the
keyed credential store, never in this file.
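The two spellings above (`cluster: brain` through the `clusters:` registry, or a literal `s3://` root on the command line) follow Decision 2's content typing: a value containing `://` is a literal, anything else is a config-name lookup. A minimal sketch, with hypothetical names and a plain map standing in for the parsed `servers:` or `clusters:` block:

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
enum Address {
    /// Contains `://`: used as given, bypassing the registry.
    Literal(String),
    /// A config name, resolved through the operator-config registry.
    Named { name: String, resolved: String },
}

#[derive(Debug, PartialEq, Eq)]
enum AddressError {
    UnknownName(String),
}

fn resolve_primitive(
    value: &str,
    registry: &HashMap<String, String>,
) -> Result<Address, AddressError> {
    if value.contains("://") {
        return Ok(Address::Literal(value.to_string()));
    }
    match registry.get(value) {
        Some(target) => Ok(Address::Named {
            name: value.to_string(),
            resolved: target.clone(),
        }),
        None => Err(AddressError::UnknownName(value.to_string())),
    }
}
```

This applies to `--server` and `--cluster` only; per Decision 2, `--store` is always a URI and has no name lookup.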
### Command shape
Assume the everyday flat defaults: server `prod`, default graph `knowledge`.
| Intent | Command | Path |
|---|---|---|
| Run a catalog query | `omnigraph query find_people` | served |
| …with params | `omnigraph query find_people --params '{"title":"Eng"}'` | served |
| Another graph in scope | `omnigraph query find_people --graph archive` | served |
| Write | `omnigraph load --data batch.jsonl --mode append` | served |
| A different environment | `omnigraph --profile staging query find_people` | served |
| One-off server, no config | `omnigraph query find_people --server https://graph.example.com --graph knowledge` | served |
| Maintain (admin, explicit storage) | `omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge` | direct (privileged) |
| Maintain (admin, via admin profile) | `omnigraph --profile brain-admin optimize --graph knowledge` | direct (privileged) |
| List catalog queries | `omnigraph queries list` | served |
| Validate cluster query catalog | `omnigraph queries validate --cluster s3://acme/clusters/brain` | control (privileged) |
| Offline query lint | `omnigraph lint --query new.gq --schema schema.pg` | local |
| Graph-backed query lint | `omnigraph lint --query new.gq --cluster s3://acme/clusters/brain --graph knowledge` | direct (privileged) |
| Local dev, no server | `omnigraph query -e 'match { … } return { … }' --store graph.omni` | direct (local file) |
| Break-glass: raw storage of a served graph | `omnigraph query --file find.gq --store s3://acme/clusters/brain/graphs/knowledge.omni` | direct (privileged, rare) |
Note what the everyday rows are: **all served.** `optimize` does *not* appear in
the default-scope rows — from a server scope it errors and points you to name
storage (see the resolution rule), so maintenance is always a deliberate,
credentialed act. There is no "force served/direct" row — you never toggle the
path on a configured graph; the only way to reach raw storage is to *name it*
(`--cluster`/`--store`), which makes the privileged bypass unmistakable. Everyday
rows invoke a query **by name**; a `.gq` file appears only where there is no
catalog (bare store, break-glass) via `-e`/`--file`.
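How the rows above pick their scope follows steps 2 to 4 of the resolution rule: an explicit primitive wins, then a profile, then the flat defaults, with the scope taken wholesale from whichever source wins. The sketch uses hypothetical names. It assumes that an unknown profile name is an error and that an explicit primitive carries no `default_graph`, which is one reading of "ignore operator-config scope defaults".

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
enum Binding {
    Server(String),
    Cluster(String),
    Store(String),
}

/// A scope binds exactly one entity, plus an optional default graph.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Scope {
    binding: Binding,
    default_graph: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
enum ScopeError {
    UnknownProfile(String),
    NoScope,
}

fn resolve_scope(
    primitive: Option<Binding>,
    profile: Option<&str>,
    profiles: &HashMap<String, Scope>,
    flat_defaults: Option<&Scope>,
) -> Result<Scope, ScopeError> {
    // Step 2: an explicit --store/--server/--cluster replaces the whole scope.
    if let Some(binding) = primitive {
        return Ok(Scope { binding, default_graph: None });
    }
    // Step 3: --profile (or OMNIGRAPH_PROFILE) supplies the scope wholesale.
    if let Some(name) = profile {
        return profiles
            .get(name)
            .cloned()
            .ok_or_else(|| ScopeError::UnknownProfile(name.to_string()));
    }
    // Step 4: flat defaults. Nothing is remembered between commands.
    flat_defaults.cloned().ok_or(ScopeError::NoScope)
}
```

The function takes every input as an argument and keeps no state, which is the "stateless per command" property in code form.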
## Before / after
**Before** = best available today (legacy `omnigraph.yaml` `--target`, `.gq`
files, `--cluster-graph`, scheme inference). **After** = this model.
| Intent | Before | After |
|---|---|---|
| Run a query | `omnigraph query --target knowledge --query find.gq --name find_people` | `omnigraph query find_people` |
| Another graph | `omnigraph query --target archive --query find.gq --name find_people` | `omnigraph query find_people --graph archive` |
| Load | `omnigraph load --data b.jsonl --mode append --target knowledge` | `omnigraph load --data b.jsonl --mode append` |
| Maintain (admin) | `omnigraph optimize --cluster brain --cluster-graph knowledge` | `omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge` |
| Another environment | edit `omnigraph.yaml`, or re-address with full URIs | `--profile staging …` or `OMNIGRAPH_PROFILE=staging` |
| One-off remote | `omnigraph query --uri https://… --query find.gq` *(scheme→remote)* | `omnigraph query find_people --server https://… --graph knowledge` |
| Raw storage of a served graph | `omnigraph query s3://…/knowledge.omni --query find.gq` *(looks like a normal query)* | `omnigraph query --file find.gq --store s3://…/knowledge.omni` *(explicit bypass)* |
**Removed:** `--target`; `--cluster-graph` (`--graph` is the graph selector only
for graph-scoped server/cluster verbs); `--uri` http-scheme dispatch; `--via`
(never ships); everyday `--query <file>` (definitions are named);
`omnigraph.yaml` and its `cli.graph`/`server.graph` defaults.
## Server-side corollary
The same ontology applies to `omnigraph-server` boot: with `omnigraph.yaml` gone,
a server boots from a single bare graph URI **or** a cluster (`--cluster <dir|s3>`,
RFC-005), never a `graphs:` map. The store/server/cluster ontology is then
consistent across CLI and server.
## Migration & compatibility
Addressing flags and config keys are observable contract (Hyrum's Law); every removal is
staged and release-noted.
- **`config migrate`** (shipped) maps each legacy `graphs:` entry **by what it
actually is**: `http(s)` URIs → a `server:` (the recommended everyday shape);
`file` URIs → a local `store:`; an `s3://` **graph** URI → an **admin** `store:`
(it is a single graph, not a cluster); an `s3://` **cluster root** (one that
carries cluster state) → an **admin** `cluster:`. Everyday `s3://` graph usage
migrates with a **warning** — prefer serving it via a server rather than
re-establishing direct remote access. It reports dropped keys.
- **Operators move to a server-default scope.** Where a legacy setup pointed
`cli.graph` at an `s3://` graph for everyday use, migration flags it: the
recommended shape is a `server:` scope (bearer token, no bucket creds), with the
`s3://` root kept only in a maintainer's config — not every operator's.
- **`--target`** warns for one release, then errors; **`OMNIGRAPH_NO_LEGACY_CONFIG=1`**
(already the strict switch) becomes the default — loading `omnigraph.yaml` is a
hard error.
- **`--cluster-graph` → `--graph`**: `--cluster-graph` is accepted with a warning
for one release, then removed.
- **`--graph` meaning change**: today `--graph` is "graph id on a multi-graph
server" (paired with `--server`); it generalizes to "select the graph for
graph-scoped verbs in server/cluster scopes." Existing `--server --graph`
usage keeps working (it is a strict superset); release-note the broadened
meaning and the fact that store/scope-scoped verbs reject it.
- **`--uri http://…`** warns, then errors with a pointer to `--server`.
- **`--as` on served paths**: today global `--as` is accepted (a no-op on remote
writes — the server resolves the actor from the token); rejecting it on the
served path is staged — warn for one release, then error.
- **`--alias`** → the `alias` namespace (`omnigraph alias <name>`, Decision 4);
the old `--alias` flag warns for one release, then is removed.
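The `config migrate` mapping in the first bullet can be read as a classification by URI scheme plus one storage probe. This is a sketch, not the shipped implementation: the names are hypothetical, `carries_cluster_state` stands in for however the tool detects a cluster root, and returning `None` for any other input is an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Migrated {
    /// `http(s)` URI: the recommended everyday shape.
    Server,
    /// `file` URI: ordinary local dev.
    LocalStore,
    /// `s3://` single-graph URI: admin store, migrated with a warning.
    AdminStore,
    /// `s3://` root that carries cluster state.
    AdminCluster,
}

fn classify_legacy_entry(uri: &str, carries_cluster_state: bool) -> Option<Migrated> {
    if uri.starts_with("http://") || uri.starts_with("https://") {
        Some(Migrated::Server)
    } else if uri.starts_with("file://") {
        Some(Migrated::LocalStore)
    } else if uri.starts_with("s3://") {
        if carries_cluster_state {
            Some(Migrated::AdminCluster)
        } else {
            Some(Migrated::AdminStore)
        }
    } else {
        // Anything else is not covered by the mapping described above.
        None
    }
}

/// Only the direct remote single-graph case warns: prefer serving it.
fn warns_on_migrate(migrated: Migrated) -> bool {
    migrated == Migrated::AdminStore
}
```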
## Non-goals
- **No change to the direct/served capability split.** Maintenance stays
storage-direct by design (no server routes for `optimize`/`repair`/`cleanup`);
this RFC only makes the split explicit.
- **No new transport.** Addressing surface, not protocol.
- **No positional sigil grammar** (`@server/graph`, `%cluster/graph`). Considered
and rejected: explicit flags are more discoverable; profiles already give
brevity. Revisit only on demonstrated expert-terseness demand.
## Decisions
The questions this RFC opened are resolved as follows. Two are explicitly
deferred (see below); they do not block the model.
1. **Local-dev path → embedded `--store` scope.** Local dev runs the engine
in-process against a `--store <file>` (or a store-scoped profile); `omnigraph
serve` stays available but is not required. Consistent with embedded ≡ remote
(RFC-009).
2. **Primitives are one flag, typed by content.** `--server` and `--cluster`
accept either a config name or a literal URI: a value containing `://` is a
literal (bypasses the registry); otherwise it is a config-name lookup (error if
unknown). `--store` is always a URI. (Replaces the earlier "literal-vs-named"
question — no `--server-url`/`--cluster-root` split.)
3. **Stored invocation: `query <name>` (read) / `mutate <name>` (write), one
catalog namespace.** A name maps to one definition; the verb asserts its kind
and the CLI errors on mismatch (`'apply_labels' is a mutation — use
omnigraph mutate apply_labels`). No `invoke` verb.
4. **Aliases live under an `alias` namespace** — `omnigraph alias <name> [args]`,
never bare top-level. An alias can therefore neither shadow nor be shadowed by a
built-in (current or future) verb.
6. **Profile merge: scope wholesale, prefs layered.** The entity binding +
`default_graph` come *wholesale* from the active scope (a profile, or flat
defaults if none) — never per-key merged across the entity dimension (that would
yield "server *and* cluster"). Only non-scope preferences (`output`, table
layout) take flat defaults as a base. Precedence: explicit flag > profile > flat
defaults.
7. **No default graph → error + list candidates.** A graph-scoped verb with no
`--graph`, no `default_graph`, and >1 graph in scope errors and lists candidates
(served: `GET /graphs`; cluster-direct: catalog enumeration). If enumeration is
policy-gated/unavailable, it says so and asks for `--graph`. Never auto-pick.
9. **Diagnostics & safety.** Writes echo the resolved scope + access path to stderr
(suppress with `--quiet`). Destructive verbs (`cleanup`, overwrite `load`,
`branch delete`) require confirmation when the scope is not local; `--yes` skips
it; **no TTY without `--yes` errors** (never silently proceed). `--json`/CI never
prompt — destructive without `--yes` errors.
10. **Cluster graphs evolve only via `cluster apply`.** `schema apply` (an `any`
verb) targets standalone graphs; against a cluster-managed graph it errors and
points at `cluster apply` (which records ledger/recovery/approvals — RFC-004).
Mirrors `init`'s refusal of a cluster-managed path.
11. **Maintenance moves server-side (committed direction).** `optimize`/`cleanup`
(and healthy-path `repair`) become server/cluster-managed async jobs —
policy-gated, audited, single-coordinator — with `direct` retained only as
break-glass (`repair` when the server is down). Runs out-of-band (a worker +
async job routes, the `POST …` / `GET …/{id}` shape of the bulk-data-plane RFC
(`docs/rfcs/0001-bulk-data-plane.md`, PR #219, not yet merged)), never inline in
serving; `schema plan` is excluded (≈ `cluster plan` in cluster mode). The
**mechanism** (job routes,
worker, scheduling) is a follow-up RFC; until it lands the capability table above
stands, and maintenance is `direct`. When it lands, the maintenance verbs'
capability becomes "served-job + direct break-glass."
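
The content-typing rule in Decision 2 can be sketched as below. This is a minimal illustration, not the CLI's actual code: `resolve_primitive`, `Resolved`, and `ResolveError` are hypothetical names, and the registry is modelled as a plain name-to-URI map.

```rust
use std::collections::HashMap;

/// Outcome of resolving a `--server` or `--cluster` value.
#[derive(Debug, PartialEq)]
enum Resolved {
    /// The value contained `://`: used verbatim, registry bypassed.
    Literal(String),
    /// The value was a config name: the URI registered under it.
    Named { name: String, uri: String },
}

#[derive(Debug, PartialEq)]
enum ResolveError {
    /// No `://` in the value and no registry entry with that name.
    UnknownName(String),
}

/// Hypothetical helper (an assumption, not the real CLI function):
/// classify a primitive's value by its content.
fn resolve_primitive(
    value: &str,
    registry: &HashMap<String, String>,
) -> Result<Resolved, ResolveError> {
    if value.contains("://") {
        return Ok(Resolved::Literal(value.to_string()));
    }
    match registry.get(value) {
        Some(uri) => Ok(Resolved::Named {
            name: value.to_string(),
            uri: uri.clone(),
        }),
        None => Err(ResolveError::UnknownName(value.to_string())),
    }
}
```

Because a name can never contain `://`, the two cases cannot collide, which is what lets one flag replace a `--server-url`/`--cluster-root` split.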
## Deferred
Non-blocking; settle when convenient.
- **D5 — combined admin scope.** A scope binds one entity; admins read via a
server scope and maintain via `--cluster`. A `deployments: { … }` object
(server + cluster validated coherent, referenced by a profile) is revisited only
if admin ergonomics demand it — and Decision 11 largely removes the need.
- **D8 — the `profile` command surface.** *Shipped:* `profile list` / `profile
show [<name>]` (read-only inspection). The *no sticky `profile use`* constraint
holds — it is a design principle, not a command.
## Safety
Dropping the sticky `current_profile` pointer removes the main footgun — a
destructive command silently inheriting a "current" environment from an earlier
session. Because each command resolves scope fresh, what is on the command line is
what runs. Two guards remain (a flat default or `OMNIGRAPH_PROFILE` can still point
at prod): echo the resolved scope + access path on writes, and require
confirmation (or `--yes`) for destructive verbs when the resolved scope is not
local (Decision 9). The most dangerous direct writes (`cleanup`, overwrite
`load`) are *structurally* rare now — unavailable from the everyday server scope,
and gated behind bucket credentials plus an explicit `--cluster`/`--store` — so a
normal operator's setup mostly cannot issue them by accident at all.
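
The confirmation guard from Decision 9 reduces to a small decision function. The sketch below is one reading of that rule under stated assumptions: the `Invocation` fields and the `Gate` enum are hypothetical, what counts as a "local" scope is decided elsewhere, and CI is approximated by "stdin is not a TTY".

```rust
/// What the CLI should do before running a verb.
#[derive(Debug, PartialEq)]
enum Gate {
    /// No confirmation needed, or `--yes` was given.
    Proceed,
    /// Interactive session: ask the operator to confirm.
    Prompt,
    /// Cannot prompt and `--yes` is absent: fail instead of proceeding.
    Refuse,
}

/// Hypothetical view of one resolved command line.
#[derive(Debug, Clone, Copy)]
struct Invocation {
    destructive: bool,    // cleanup, overwrite load, branch delete
    scope_is_local: bool, // assumption: computed from the resolved scope
    yes: bool,            // --yes
    json: bool,           // --json
    stdin_is_tty: bool,
}

fn confirmation_gate(inv: &Invocation) -> Gate {
    if !inv.destructive || inv.scope_is_local || inv.yes {
        return Gate::Proceed;
    }
    // Never prompt in machine mode or without a terminal; error out instead.
    if inv.json || !inv.stdin_is_tty {
        return Gate::Refuse;
    }
    Gate::Prompt
}
```

The ordering matters: `--yes` is checked before the TTY test, so a scripted destructive run succeeds only when it opts in explicitly.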
## Invariants & deny-list check
- **§10 query semantics first-class / §11 transport at the boundary:** preserved —
addressing resolves CLI-side to a `GraphClient`; no transport concepts leak into
engine crates.
- **§12 no client-set actor:** strengthened — the served path's actor stays
token-resolved and `--as` is rejected there; direct self-declares.
- **Least privilege (security posture):** everyday operators hold a revocable
bearer token, not bucket credentials; only the server process and maintenance
admins hold storage creds. Direct remote access is structural opt-in, not a
default — narrowing the blast radius of a leaked operator config.
- **§6 strong consistency:** both paths are snapshot-isolated per query; this RFC
changes addressing, not isolation.
- **Deny-list (no state that drifts):** profiles and aliases are static config
sugar that resolve to canonical scopes; they declare nothing the cluster or
server doesn't already own. No sticky session state is introduced.
- No Hard Invariant is weakened; the change is CLI surface + config removal.
## Relationship to prior work
The completion of the config/CLI lineage: RFC-007 added the operator config and
keyed credentials; RFC-008 demoted `omnigraph.yaml`; RFC-009 unified execution
behind `GraphClient`; RFC-010 declared the planes. This RFC removes the last
legacy addressing surface so the plane model becomes a clean function of the three
real entities, and folds the planes into a single capability rule. It is adjacent
to the public-track bulk-data-plane RFC (`docs/rfcs/0001-bulk-data-plane.md`,
PR #219, not yet merged), which canonicalizes `load`/`export` verbs; this RFC
canonicalizes how every verb *addresses* a graph.
## Appendix: target CLI taxonomy (end state)
The full command set under this model, organized by **capability** (the new
classifying axis) instead of plane — the end-state counterpart to the
current-taxonomy appendix below. Every command, with its end-state addressing.
```
omnigraph
├─ any — data verbs · served by default (server scope, or --server <url|name>);
│        --graph selects the graph in scope; --store forces ad-hoc direct (no catalog)
│  ├─ query (alias: read*)                       invoke a stored query by NAME; -e/--file for ad-hoc
│  ├─ mutate (alias: change*)                    invoke a stored mutation by name; -e/--file for ad-hoc
│  ├─ load                                       bulk write — --data, --mode required; --from forks a missing branch
│  ├─ export                                     dump graph data (NDJSON / Arrow)
│  ├─ snapshot                                   current per-table versions
│  ├─ branch { create | list | delete | merge }  merge takes --into <target>
│  ├─ commit { list | show }                     inspect the commit graph
│  └─ schema { show (alias: get) | apply }       cluster graphs evolve via cluster apply (Decision 10)
├─ served — needs a server (errors on a store/cluster scope)
│  ├─ graphs list                                enumerate the graphs a server serves
│  └─ queries list                               list stored queries in the served catalog
├─ direct — storage-native, PRIVILEGED · --cluster <root> | --store <uri> + bucket creds; never a server
│  ├─ init                                       bootstrap a graph (--store <uri>); refuses a cluster-managed path
│  ├─ optimize                                   compaction; --graph selects
│  ├─ repair                                     publish uncovered drift; --confirm / --force
│  ├─ cleanup                                    version GC; --keep / --older-than / --confirm
│  ├─ schema plan                                migration preview (reads storage directly)
│  └─ lint --query <path>                        graph-backed query lint (with --graph on cluster scope)
├─ control — cluster/catalog control, PRIVILEGED · --cluster <dir|s3>
│  ├─ cluster { validate | plan | apply | approve | status | refresh | import | force-unlock }
│                                                apply/approve take --as <actor>; force-unlock takes <LOCK_ID>
│  └─ queries validate                           validate cluster-owned stored queries against graph schemas
└─ local — no graph
   ├─ policy { validate | test | explain }       offline Cedar tooling
   ├─ profile { list | show }                    read-only; NO mutating `use` (no sticky state)
   ├─ alias <name> [args]                        personal shortcut; expands to its bound stored-query call (D4)
   ├─ config { migrate }                         finish the omnigraph.yaml split (RFC-008)
   ├─ login / logout                             per-server bearer credentials
   ├─ embed                                      offline embedding pipeline
   ├─ lint --query <path> --schema <path>        file-only query lint
   └─ version (-v)
```
`*` `read`/`change` remain as deprecated aliases (warn on use); `ingest` and the
`check` → `lint` argv-shim are **removed**. `get` aliases `schema show`.
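
The stored-invocation rule (Decision 3: one catalog namespace, with the verb asserting the definition's kind) can be illustrated with a small check. This is a sketch only: `Kind`, `InvokeError`, and `check_invocation` are hypothetical names, the catalog is modelled as a name-to-kind map, and the error wording is approximate.

```rust
use std::collections::HashMap;
use std::fmt;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Query,
    Mutation,
}

impl Kind {
    /// CLI verb that invokes a definition of this kind.
    fn verb(self) -> &'static str {
        match self {
            Kind::Query => "query",
            Kind::Mutation => "mutate",
        }
    }
    fn noun(self) -> &'static str {
        match self {
            Kind::Query => "query",
            Kind::Mutation => "mutation",
        }
    }
}

#[derive(Debug, PartialEq)]
enum InvokeError {
    Unknown(String),
    KindMismatch { name: String, actual: Kind },
}

impl fmt::Display for InvokeError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            InvokeError::Unknown(name) => write!(f, "no stored definition named '{name}'"),
            InvokeError::KindMismatch { name, actual } => write!(
                f,
                "'{name}' is a {}: use omnigraph {} {name}",
                actual.noun(),
                actual.verb()
            ),
        }
    }
}

/// `verb` is the kind the command line asserted (`query` or `mutate`).
fn check_invocation(
    catalog: &HashMap<String, Kind>,
    verb: Kind,
    name: &str,
) -> Result<(), InvokeError> {
    match catalog.get(name) {
        None => Err(InvokeError::Unknown(name.to_string())),
        Some(&actual) if actual != verb => Err(InvokeError::KindMismatch {
            name: name.to_string(),
            actual,
        }),
        Some(_) => Ok(()),
    }
}
```

Since a name maps to exactly one definition, the mismatch error can always name the correct verb.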
### Addressing forms (end state)
Three scope forms — one per real entity — plus the graph selector. No `--target`,
no `--cluster-graph`, no `--uri` scheme-dispatch, no `--via`.
| Form | Resolves to | Access | Privilege |
|---|---|---|---|
| **server scope** — operator default, a `--profile`, or `--server <url\|name>` | a served endpoint + keyed token | served | everyday (bearer token) |
| **cluster scope** — an admin profile, or `--cluster <root>` | a managed cluster's storage + catalog | direct | privileged (bucket creds) |
| **store scope** (`--store <uri>`) | one graph's storage (no catalog) | direct | local-dev (file) / break-glass (s3) |
| **`--graph <id>`** | selects the graph for graph-scoped verbs in server/cluster scopes; invalid for store scopes and scope-scoped verbs | — | — |
Resolution: explicit primitive (`--server`/`--cluster`/`--store`) → `--profile` /
`OMNIGRAPH_PROFILE` → operator flat defaults. Access path is then derived from the
scope kind × the verb's capability (see the Resolution rule); it is never inferred
from a URI scheme and never toggled.
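
A compact way to read the resolution rule is as two pure functions: one picks the scope by precedence, the other derives the access path from scope kind and verb capability. The sketch below encodes one reading of the taxonomy in this appendix. All type and function names are hypothetical, and the capability matrix is an interpretation of the tree above, not a copy of the RFC's capability table.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Scope {
    Server(String),
    Cluster(String),
    Store(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Capability {
    Any,
    Served,
    Direct,
    Control,
    Local,
}

#[derive(Debug, PartialEq)]
enum Access {
    Served,
    Direct,
    NoGraph,
}

/// Precedence: explicit primitive, then profile, then operator flat defaults.
/// Each source contributes a whole scope; nothing is merged per key.
fn pick_scope(
    explicit: Option<Scope>,
    profile: Option<Scope>,
    flat_default: Option<Scope>,
) -> Option<Scope> {
    explicit.or(profile).or(flat_default)
}

/// Access is a function of scope kind and capability, never of a URI scheme.
fn access_path(scope: &Scope, capability: Capability) -> Result<Access, String> {
    match (capability, scope) {
        (Capability::Local, _) => Ok(Access::NoGraph),
        (Capability::Any | Capability::Served, Scope::Server(_)) => Ok(Access::Served),
        (Capability::Any | Capability::Direct, Scope::Store(_)) => Ok(Access::Direct),
        (Capability::Direct | Capability::Control, Scope::Cluster(_)) => Ok(Access::Direct),
        (cap, sc) => Err(format!("{cap:?} verbs cannot run against {sc:?}")),
    }
}
```

Every combination not listed is an error, so an `any` verb on a cluster scope or a `direct` verb on a server scope fails at resolution rather than silently choosing a transport.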
### What moved vs today
| Command(s) | Today (plane) | End state (capability) |
|---|---|---|
| `query`/`mutate`/`load`/`export`/`snapshot`/`branch`/`commit`/`schema show`/`schema apply` | Data | **`any`** (served-default; `--store` ad-hoc) |
| `graphs list` | Data (remote-only) | **`served`** |
| `queries list` | Session | **`served`** (catalog read) |
| `init`/`optimize`/`repair`/`cleanup`/`schema plan`/graph-backed `lint` | Storage | **`direct`** (privileged) |
| `queries validate` | Storage | **`control`** (catalog validation) |
| `cluster *` | Control | **`control`** (unchanged) |
| `policy *`/`embed`/`login`/`logout`/`config`/`version`/offline `lint --query --schema` | Session | **`local`** |
| `ingest`; `--target`; `--cluster-graph`; `--uri http` dispatch | present | **removed** |
| — | — | **added:** `profile { list \| show }` (read-only) |
Cross-capability families: `schema` (`plan` is `direct`, `show`/`apply` are
`any`), `queries` (`list` is `served`, `validate` is `control`), and `lint`
(offline with `--schema` is `local`, graph-backed is `direct`) split per
subcommand/mode, exactly where their authority and data dependencies differ.
## Appendix: current CLI taxonomy (today)
The **as-is** command surface this RFC transforms, kept so the RFC is
self-contained. The source of truth is the exhaustive `command_plane` match in
`crates/omnigraph-cli/src/planes.rs`.
Where it disagrees with the design above (four planes, `--target`,
`--cluster-graph`, scheme-inferred transport), the design is the *target* and this
is *today*.
### The four planes (today)
| Plane | What it touches | Addressing accepted |
|---|---|---|
| **Data** | a graph — embedded **or** via a server | `<URI>` · `--target` · `--server` (+`--graph`) |
| **Storage** | direct storage, no server | `<URI>` · `--target` (local/S3 only) · some also `--cluster`+`--cluster-graph` |
| **Control** | a cluster *directory* | `--config <dir>` |
| **Session** | no graph | — |
`--server`/`--graph` are gated strictly to the data plane; `guard_addressing`
(`planes.rs:128`) rejects them elsewhere (RFC-010 Slice 1).
### Command tree by plane (today)
```
omnigraph
├─ DATA ────────── run against a graph; embedded or --server
│ ├─ query (alias: read) · mutate (alias: change) · load · ingest (hidden, deprecated)
│ ├─ branch { create | list | delete | merge } · snapshot · export · commit { list | show }
│ ├─ graphs { list } (remote-only)
│ └─ schema { show (alias: get) | apply } ← show/apply are DATA
├─ STORAGE ─────── direct file://|s3:// access; --server rejected
│ ├─ init · optimize · repair · cleanup (optimize/repair/cleanup also: --cluster --cluster-graph)
│ ├─ lint (check shim) · schema plan ← plan is STORAGE
│ └─ queries validate
├─ CONTROL ─────── cluster directory via --config <dir>
│ └─ cluster { validate | plan | apply | approve | status | refresh | import | force-unlock }
└─ SESSION ─────── no graph
├─ policy { validate | test | explain } · embed · login / logout
├─ config { migrate } · queries list ← list is SESSION
└─ version (-v)
```
`read`/`change` are visible clap aliases (deprecated names, warn); `check` is an
argv-shim → `lint`; `get` aliases `schema show`; `ingest` is hidden but runs.
### Cross-plane families (today)
- **`schema`**: `schema plan` is Storage; `schema show`/`apply` are Data.
- **`queries`**: `queries validate` is Storage; `queries list` is Session.
### Addressing forms (today)
| Form | Looks up in | Resolves to | Source |
|---|---|---|---|
| `<URI>` / `--uri` | nothing (explicit) | the literal URI | — |
| `--target <name>` | `omnigraph.yaml` `graphs:` | that graph's `uri` (local / S3 / **http**) | `config.rs::resolve_target_uri` |
| `--server <name>` (+`--graph`) | `~/.omnigraph/config.yaml` `servers:` | a remote server URL | `helpers.rs::resolve_server_flag` |
| `--cluster <dir\|s3> --cluster-graph <id>` | served cluster state | the graph's storage URI | `helpers.rs` (RFC-010 Slice 3) |
Precedence (`resolve_target_uri`): explicit `<URI>`/`--uri` → `--target` →
`cli.graph` default → error. `is_remote_uri` (`helpers.rs:15`) then selects
`GraphClient::Remote` vs `Embedded` (`client.rs:86`).
### Enforcement points (today)
- **`guard_addressing`** (`planes.rs:128`): `--server`/`--graph` on a non-data verb
fails with a declared message.
- **Storage-plane remote rejection** (`helpers.rs:467`): a storage verb whose
`--target` resolves to `http(s)://` is rejected.
- **`init` into a cluster layout** is refused (use `cluster apply`).
## Audit comments
Reviewed against the current CLI taxonomy, `planes.rs`, `cli.rs`, `helpers.rs`,
`client.rs`, RFC-007/RFC-010, and the user-facing CLI/server docs.
### Validated
- The target taxonomy now has a stable classifier: `any`, `served`, `direct`,
`control`, and `local` are all declared capabilities.
- Cluster scope is coherent: it is privileged direct storage for control,
maintenance, and validation, not a direct data path. `any` data verbs are served by
default and reject cluster scope.
- Graph selection is no longer universal. Graph-scoped verbs select a graph;
scope-scoped verbs such as `graphs list`, `queries list`, `queries validate`,
and `cluster *` address the whole server/cluster scope.
- The current-state appendix still matches the implemented CLI: four planes,
`--target`, `--cluster-graph`, scheme-inferred transport, `schema plan` as
Storage, and `schema show/apply` as Data.
Decisions and deferrals are tracked in [Decisions](#decisions) above — not
duplicated here.

@ -0,0 +1,295 @@
# RFC: Provider-Independent Embedding Configuration
**Status:** Accepted — Phases 1-5 implemented
**Date:** 2026-06-15
**Builds on:** the engine embedding client (`crates/omnigraph/src/embedding.rs`), the `@embed` catalog
annotation (`omnigraph-compiler/src/catalog`), the cluster `providers.embedding` surface
([cluster-config-specs.md](cluster-config-specs.md), [rfc-007-operator-config.md](rfc-007-operator-config.md)
for the secret-resolution pattern).
**Target release:** staged — NFR floor first, then the provider-independent config core; ingest-time `@embed`
execution is a separate later phase.
## Summary
OmniGraph's embedding subsystem is **hardwired to a single provider (Google Gemini)** and has no recorded
link between the model that produced a stored vector and the model that embeds a query string. Today that
happens to be self-consistent (one live client embeds both sides), but it is consistent by accident, not by
construction: the provider is hardcoded, the model is a moving `-preview` target, nothing validates that a
query vector and a stored vector share a space, and the one configurable knob (key + base URL) cannot change
the provider or model.
This RFC makes embedding **provider-independent**: one resolved `EmbeddingConfig { provider, model, base_url,
api_key, dim, normalize }` behind a sealed provider abstraction, resolved once and shared by every embedder.
The **primary variant is OpenAI-compatible** — a single request/response shape (`POST {base}/embeddings`,
`{model, input, dimensions}`) that covers **OpenRouter** (the recommended default gateway, one key for Gemini,
OpenAI, Mistral, BGE, Qwen, sentence-transformers, …), OpenAI direct, and any self-hosted OpenAI-compatible
endpoint (vLLM, Ollama, LM Studio, Together). A native **Gemini** (`generativelanguage`) variant is retained
for shops that want to hit Google directly with its `RETRIEVAL_QUERY`/`RETRIEVAL_DOCUMENT` task-type
asymmetry, plus a deterministic **Mock**. The embedding *identity* (provider + model + dim) is recorded in the
schema IR so it travels with the data, and a query whose resolved embedder cannot match the stored vectors'
recorded identity is **rejected with a typed error instead of silently ranking across vector spaces.**
Provider/endpoint wiring lands on the already-reserved cluster `providers.embedding` field; secrets follow the
existing operator-credential pattern; no secret ever enters the schema.
This RFC supersedes the framing in `docs/user/search/embeddings.md` that described "two embedding clients
with different defaults" — one of those clients was dead code with zero callers and has been removed (see
Phase 1); the OpenAI request shape returns as a first-class *provider variant* of the one client, not as a
second parallel client.
## Motivation
This work originated in an external handoff that reported a live cross-provider bug: gemini-3072 stored
vectors compared against OpenAI-1536 query vectors, silently. Investigation against the current source showed
the reported mechanism is **inaccurate** — the OpenAI client it blamed (`omnigraph-compiler/src/embedding.rs`)
was `pub(crate)`, `#![allow(dead_code)]`, and had **zero callers**; the live `nearest("string")` path and the
offline `omnigraph embed` CLI both use the engine **Gemini** client; and `@embed` does no ingest-time
embedding at all. So the documented happy path is self-consistent. But the investigation surfaced four real
problems the handoff's instincts correctly smelled:
- **P1 — Provider is hardwired.** The one live client builds Google `generativelanguage` requests; only key +
base URL are configurable, not the provider or model. A non-Gemini shop cannot use `nearest("string")`
without a Gemini key, and cannot make it produce non-Gemini vectors. If they store their own vectors and
query with `nearest("string")`, the query is embedded with Gemini → a silent cross-space ranking. This is
the handoff's failure, reached by a different cause.
- **P2 — A dead, divergent second client + stale docs** invited exactly the misdiagnosis the handoff made.
- **P3 — No same-space guarantee recorded with the data.** Nothing stamps which model/dim produced a stored
vector, so write-side and read-side embedders can drift with no validation.
- **P4 — `@embed` is declarative-in-name-only.** It records a source property for the typechecker but never
embeds at ingest; the docs claimed otherwise.
Per the project's first principle, the lower-liability shape is **one provider-independent client with the
identity recorded next to the data**, not N independently-defaulted clients kept in lockstep by discipline.
Hardcoding one provider mortgages every future "we need OpenAI / a local model / Vertex" against a rewrite;
recording identity once closes the silent-wrong-results class by construction.
## Current state — which API we actually use
| | Live engine client (`crates/omnigraph/src/embedding.rs`) | Deleted dead client (was `omnigraph-compiler/src/embedding.rs`) |
|---|---|---|
| Provider | **Google Gemini Developer API** (`generativelanguage`, *not* Vertex AI) | OpenAI |
| Endpoint | `POST {base}/models/{model}:embedContent` | `POST {base}/embeddings` |
| Auth | header `x-goog-api-key`, env `GEMINI_API_KEY` | `Authorization: Bearer`, env `OPENAI_API_KEY` |
| Model | `gemini-embedding-2-preview` (hardcoded) | `text-embedding-3-small` (env `NANOGRAPH_EMBED_MODEL`) |
| Base default | `https://generativelanguage.googleapis.com/v1beta` | `https://api.openai.com/v1` |
| Request body | `{model, content:{parts:[{text}]}, taskType, outputDimensionality}` | `{model, input:[…], dimensions}` |
| Response | `{embedding:{values:[f32]}}` | `{data:[{index, embedding:[f32]}]}` |
| Task types | `RETRIEVAL_QUERY` / `RETRIEVAL_DOCUMENT` | none |
| Status | **live** — used by `nearest("string")` and `omnigraph embed` | **removed in Phase 1** (zero callers) |
Both shapes honour a requested output dimensionality (Gemini `outputDimensionality`, OpenAI `dimensions`)
driven by the target column width, so dimension is already schema-driven. The two known shapes are exactly the
two initial provider variants this RFC defines — the OpenAI shape returns from git history as a `Provider`
variant of the single client.
## Guide-level explanation
### Configuring a provider (operator view)
Pick a provider for the graph in `cluster.yaml` (the team-owned surface), referencing a secret by name. The
recommended default routes through OpenRouter (OpenAI-compatible, one key for many models):
```yaml
providers:
  embedding:
    default:
      kind: openai-compatible           # openai-compatible | gemini | mock
      base_url: https://openrouter.ai/api/v1
      model: google/gemini-embedding-2  # or openai/text-embedding-3-large, mistralai/mistral-embed, …
      api_key: ${OPENROUTER_API_KEY}
graphs:
  knowledge:
    schema: knowledge.pg
    embedding_provider: default
```
The same `openai-compatible` kind points at OpenAI direct (`base_url: https://api.openai.com/v1`,
`model: text-embedding-3-large`) or a self-hosted endpoint (vLLM/Ollama/LM Studio) by changing `base_url`. Use
`kind: gemini` only to reach Google's `generativelanguage` API directly (it keeps the query/document
task-type asymmetry that the OpenAI-compatible shape does not expose). Dimensions are schema-driven by the
target `Vector(N)` column, not duplicated in the provider profile.
The zero-config tier keeps working with env only (`OMNIGRAPH_EMBED_PROVIDER`, `OMNIGRAPH_EMBED_BASE_URL`,
`OMNIGRAPH_EMBED_MODEL`, and the provider api-key env — `OPENROUTER_API_KEY` / `OPENAI_API_KEY` /
`GEMINI_API_KEY`), so no cluster file is required for a single-graph setup.
### Recording identity in the schema
`@embed` grows optional arguments that pin the embedding identity to the vector column:
```pg
node Doc {
    slug: String @key
    text: String
    v: Vector(3072) @embed("text", model="gemini-embedding-2", dim=3072) @index
}
```
The single-argument form `@embed("text")` keeps working unchanged. The recorded identity persists in the
schema IR (`_schema.ir.json`) and so travels with `schema apply` and `schema show`.
### What a mismatch looks like
If the resolved read-side embedder cannot produce the recorded identity (wrong model, wrong dim, wrong
provider), `nearest($v, "string")` fails with a typed error naming both sides, instead of returning a
plausible-but-meaningless ranking. Changing the recorded identity on an existing column is a loud schema-apply
refusal (it is a re-embed, a deliberate migration step), reusing the migration planner's existing
annotation-change rejection.
## Reference-level design
### One client, sealed provider abstraction
Replace the two-variant `EmbeddingTransport` with a resolved config plus a sealed provider enum:
```text
EmbeddingConfig { provider: Provider, model, base_url, api_key, dim, normalize }
enum Provider {
    OpenAiCompatible, // POST {base}/embeddings, Bearer auth, {model, input, dimensions} → {data:[{embedding,index}]}
                      // covers OpenRouter (default gateway), OpenAI direct, vLLM/Ollama/LM Studio/Together
    Gemini,           // POST {base}/models/{model}:embedContent, x-goog-api-key, with RETRIEVAL_QUERY/DOCUMENT task types
    Mock,             // deterministic, offline
}
struct EmbeddingClient { config, http, retry, deadline }
```
`Provider` owns the per-API differences (endpoint suffix, auth header, request JSON, response JSON, task-type
support); the client owns retry/backoff, the deadline, normalization, and tracing — all provider-independent.
**OpenRouter is not a distinct variant** — it is `OpenAiCompatible` with `base_url =
https://openrouter.ai/api/v1`, which is the point: one OpenAI-compatible shape gives provider-independence
across every model OpenRouter fronts, so the gateway does the multi-provider fan-out and OmniGraph carries one
request shape. The native `Gemini` variant exists only for direct-to-Google with task-type asymmetry. An enum
(not a trait) is the earned complexity for this small, first-party set; if third-party plug-in providers are
ever needed, the enum becomes a trait behind the same `EmbeddingConfig` surface without touching callers.
The OpenAI-compatible `input` accepts an **array**, giving batch embedding for free — which the later
ingest phase needs for throughput, and which removes the open dependency on Gemini's native
`batchEmbedContents`.
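
To make "the provider owns the per-API differences" concrete, the sketch below shows the endpoint and auth-header split taken from the current-state table. It is illustrative only: the method names (`endpoint`, `auth_header`, `supports_task_types`) are assumptions, and request/response JSON is omitted to keep the example dependency-free.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Provider {
    OpenAiCompatible,
    Gemini,
    Mock,
}

impl Provider {
    /// Request URL for one embed call; `None` for the offline Mock.
    fn endpoint(self, base_url: &str, model: &str) -> Option<String> {
        let base = base_url.trim_end_matches('/');
        match self {
            Provider::OpenAiCompatible => Some(format!("{base}/embeddings")),
            Provider::Gemini => Some(format!("{base}/models/{model}:embedContent")),
            Provider::Mock => None,
        }
    }

    /// Header name and value carrying the API key; `None` for Mock.
    fn auth_header(self, api_key: &str) -> Option<(&'static str, String)> {
        match self {
            Provider::OpenAiCompatible => Some(("Authorization", format!("Bearer {api_key}"))),
            Provider::Gemini => Some(("x-goog-api-key", api_key.to_string())),
            Provider::Mock => None,
        }
    }

    /// Only the native Gemini shape exposes RETRIEVAL_QUERY / RETRIEVAL_DOCUMENT.
    fn supports_task_types(self) -> bool {
        matches!(self, Provider::Gemini)
    }
}
```

Pointing `OpenAiCompatible` at `https://openrouter.ai/api/v1` versus `https://api.openai.com/v1` changes only the `base_url` argument, which is why OpenRouter needs no variant of its own.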
### Config resolution (resolved once, shared)
Precedence, highest first for served cluster graphs: applied cluster `providers.embedding.<name>` profile →
env (`OMNIGRAPH_EMBED_*`, provider api-key env) → built-in defaults. The cluster `api_key` value is a
`${NAME}` env reference resolved at server boot; plaintext never lives in the schema, state ledger, or any
checked-in file. Resolution happens once per graph handle; the resolved client is shared by
every `nearest("string")` call on that handle. Direct single-graph serving, embedded callers, and the offline CLI keep the env path
unless they inject an `EmbeddingConfig` directly.
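
The precedence chain and the `${NAME}` reference can be sketched as two small functions. This is a hedged illustration: `resolve_setting`, `parse_env_ref`, and `resolve_api_key` are hypothetical names, the environment is injected as a closure so the logic stays testable, and rejecting a plaintext `api_key` is an assumption drawn from the "plaintext never lives in a checked-in file" rule rather than documented behaviour.

```rust
/// Cluster profile value, then env, then the built-in default.
fn resolve_setting(
    cluster: Option<&str>,
    env_key: &str,
    default: &str,
    env: &dyn Fn(&str) -> Option<String>,
) -> String {
    cluster
        .map(str::to_string)
        .or_else(|| env(env_key))
        .unwrap_or_else(|| default.to_string())
}

/// Returns `NAME` for a value of the exact form `${NAME}`.
fn parse_env_ref(value: &str) -> Option<&str> {
    value.strip_prefix("${")?.strip_suffix('}')
}

/// Resolve a cluster `api_key` reference at boot.
/// Assumption: a value that is not a `${NAME}` reference is refused.
fn resolve_api_key(
    value: &str,
    env: &dyn Fn(&str) -> Option<String>,
) -> Result<String, String> {
    let name = parse_env_ref(value)
        .ok_or_else(|| "api_key must be a ${NAME} reference".to_string())?;
    env(name).ok_or_else(|| format!("environment variable {name} is not set"))
}
```

In production the closure would wrap `std::env::var(..).ok()`; a map-backed closure keeps tests hermetic.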
### Identity recorded in the schema IR (not a new store)
The `@embed` args serialize into `PropertyIR.annotations` → `_schema.ir.json`, which `schema apply` already
persists atomically and which the catalog (the one thing `nearest()` reads at query time) is built from. No
new metadata store, no manifest column, no extra read on the query path. The migration planner already rejects
non-description annotation changes as `UnsupportedChange`, so "recorded identity is immutable without a
deliberate re-embed migration" is the default behaviour, not new code. (A second, optional copy in Lance
field metadata — co-located with the vectors — is available later by activating the currently no-op
`UpdatePropertyMetadata` migration step; out of scope here.)
### Query-time validation
`resolve_nearest_query_vec` compares the resolved read-side identity against the column's recorded identity
before embedding; on mismatch it returns a typed `OmniError` naming recorded vs resolved (model, dim,
provider). This is the only behaviour that closes P3 by construction.
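
The comparison itself is small. The sketch below assumes each recorded field is optional (the single-argument `@embed("text")` records nothing, so nothing is checked) and that fields compare by exact equality. `RecordedIdentity`, `ResolvedIdentity`, `SameSpaceError`, and `check_same_space` are hypothetical names, not the engine's types.

```rust
use std::fmt;

/// Identity recorded by `@embed(...)`; absent fields are not validated.
#[derive(Debug, Default, Clone, PartialEq)]
struct RecordedIdentity {
    provider: Option<String>,
    model: Option<String>,
    dim: Option<u32>,
}

/// Identity of the embedder resolved for this query.
#[derive(Debug, Clone, PartialEq)]
struct ResolvedIdentity {
    provider: String,
    model: String,
    dim: u32,
}

/// Typed error naming both sides of the first mismatching field.
#[derive(Debug, PartialEq)]
struct SameSpaceError {
    field: &'static str,
    recorded: String,
    resolved: String,
}

impl fmt::Display for SameSpaceError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "embedding {} mismatch: column records {}, resolved embedder is {}",
            self.field, self.recorded, self.resolved
        )
    }
}

fn check_same_space(
    recorded: &RecordedIdentity,
    resolved: &ResolvedIdentity,
) -> Result<(), SameSpaceError> {
    let dim = resolved.dim.to_string();
    let checks = [
        ("provider", recorded.provider.clone(), &resolved.provider),
        ("model", recorded.model.clone(), &resolved.model),
        ("dim", recorded.dim.map(|d| d.to_string()), &dim),
    ];
    for (field, want, got) in checks {
        if let Some(want) = want {
            if &want != got {
                return Err(SameSpaceError { field, recorded: want, resolved: got.clone() });
            }
        }
    }
    Ok(())
}
```

The check runs before any provider call, so a mismatch costs no network round trip and no tokens.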
### NFR floor (independent of the provider work)
- **Deadline:** wrap every embed call (query or document) in a total-operation deadline
(`OMNIGRAPH_EMBED_DEADLINE_MS`) so a degraded provider cannot hang the caller for the current ~121 s worst
case (4 × 30 s timeout + backoff).
- **Observability:** `tracing` span per embed call (provider, model, dim, attempts, outcome, elapsed; `warn!`
per retry; token usage when the provider returns it). The subsystem has zero instrumentation today.
- **Single normalization:** one `normalize_vector` (the dead client carried a divergent second copy; removed
in Phase 1).
- **Stable model:** make the model configurable and default to a stable (non-`-preview`) model once the GA
name is confirmed.
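
The deadline item above is a total-operation budget wrapped around the retry loop, as opposed to a per-request timeout. A minimal synchronous sketch follows. The real client is presumably async, and `embed_with_deadline` and `EmbedError` are hypothetical names; the closure receives the remaining budget so a per-request timeout can be capped to it.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum EmbedError<E> {
    /// The total budget ran out before an attempt could succeed.
    DeadlineExceeded { attempts: u32 },
    /// Every allowed attempt failed within the budget.
    Exhausted { attempts: u32, last: E },
}

/// Retry `attempt` up to `max_attempts` times, never exceeding `deadline`
/// in total. At least one attempt runs if any budget remains.
fn embed_with_deadline<T, E>(
    deadline: Duration,
    max_attempts: u32,
    backoff: Duration,
    mut attempt: impl FnMut(Duration) -> Result<T, E>,
) -> Result<T, EmbedError<E>> {
    let start = Instant::now();
    let mut attempts = 0;
    loop {
        let remaining = deadline.saturating_sub(start.elapsed());
        if remaining.is_zero() {
            return Err(EmbedError::DeadlineExceeded { attempts });
        }
        attempts += 1;
        match attempt(remaining) {
            Ok(value) => return Ok(value),
            Err(last) if attempts >= max_attempts => {
                return Err(EmbedError::Exhausted { attempts, last });
            }
            Err(_) => {
                // Back off, but never sleep past the deadline.
                let left = deadline.saturating_sub(start.elapsed());
                sleep(backoff.min(left));
            }
        }
    }
}
```

With a 30 s per-request timeout and four attempts, an unbounded loop gives the roughly 121 s worst case cited above; with this wrapper the worst case is the configured deadline plus however long the final attempt overruns its own timeout.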
### Ingest-time `@embed` (later phase, not this RFC's core)
Making `@embed` embed at ingest is a separate phase with a hard constraint: embedding is a slow, external,
**non-idempotent** side effect, so it must run **entirely before staging** — in the pure in-memory phase,
before any `stage_*`/Lance HEAD move, alongside the existing constraint validation — so a mid-load provider
failure aborts with zero drift. It must never sit inside or after the commit protocol, because the recovery
sweep cannot re-run or undo an external embedding. It also needs a content-hash skip (so `load --mode
overwrite` does not re-bill every row), batching, and a bounded-concurrency stage. Specified here only to fix
the design constraint; deferred to its own RFC/phase.
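
The ordering constraint can be shown in miniature: every provider call happens in a pure preparation phase, and staging runs only if that phase fully succeeds. The sketch below is an assumption-heavy illustration of the deferred design, not a specification. `Row`, `Staged`, `prepare`, and `load` are hypothetical, a `Vec` stands in for staging, and FNV-1a stands in for whatever content hash the later RFC chooses.

```rust
use std::collections::HashMap;

/// FNV-1a (64-bit): a small deterministic hash, used here only for illustration.
fn content_hash(text: &str) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in text.as_bytes() {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

#[derive(Debug)]
struct Row {
    key: String,
    text: String,
}

#[derive(Debug)]
struct Staged {
    key: String,
    hash: u64,
    vector: Vec<f32>,
}

/// Pure in-memory phase: embed every row that needs it. A provider failure
/// returns `Err` here, before anything is staged.
fn prepare<E>(
    rows: &[Row],
    existing: &HashMap<String, (u64, Vec<f32>)>, // key -> (content hash, stored vector)
    mut embed: impl FnMut(&str) -> Result<Vec<f32>, E>,
) -> Result<Vec<Staged>, E> {
    let mut out = Vec::with_capacity(rows.len());
    for row in rows {
        let hash = content_hash(&row.text);
        let vector = match existing.get(&row.key) {
            // Unchanged text: reuse the stored vector, no provider call.
            Some((h, v)) if *h == hash => v.clone(),
            _ => embed(&row.text)?,
        };
        out.push(Staged { key: row.key.clone(), hash, vector });
    }
    Ok(out)
}

/// Staging (a stand-in for `stage_*` and commit) runs only after `prepare` succeeds.
fn load<E>(
    rows: &[Row],
    existing: &HashMap<String, (u64, Vec<f32>)>,
    embed: impl FnMut(&str) -> Result<Vec<f32>, E>,
    store: &mut Vec<Staged>,
) -> Result<usize, E> {
    let prepared = prepare(rows, existing, embed)?;
    let n = prepared.len();
    store.extend(prepared);
    Ok(n)
}
```

Batching and bounded concurrency are omitted; they would change how `prepare` calls the provider, not where it sits relative to staging.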
### Phasing (implementation order)
| Phase | Scope | Demo |
|---|---|---|
| **1 — NFR floor + dead-client removal** | deadline, observability, single normalize, configurable model, delete dead client + `NANOGRAPH_*` | a hung provider fails at the deadline; embed calls traced; `rg NANOGRAPH_` empty |
| **2 — Provider-independent config** | `EmbeddingConfig` + `Provider` enum (OpenAiCompatible covering OpenRouter/OpenAI/local, Gemini, Mock); env-first resolution; client reuse | point `base_url` at OpenRouter, run `nearest("string")`, get correct neighbours vs OpenRouter-stored vectors; CLI shares the config |
| **3 — Record identity in schema IR** | `@embed` args grammar + catalog + IR persistence | `schema show` reflects recorded model/dim |
| **4 — Query-time validation** | compare resolved vs recorded; typed error; planner refusal on identity change | stored model A vs read model B → loud error, never silent garbage |
| **5 — Cluster provider wiring** | `providers.embedding` resources; `graphs.<id>.embedding_provider`; `${NAME}` resolution at server boot | provider profile resolved from applied cluster state; legacy `omnigraph.yaml` untouched |
| later | ingest-time `@embed` (Shape C) | separate RFC |
**Status:** Phases 1-5 are implemented (`@embed("…", model="…")` is recorded in the schema IR and validated at
query time with a typed same-space error; an unrecorded `@embed` keeps working with no check; cluster-served
graphs can bind an applied `providers.embedding` profile). Ingest-time `@embed` remains.
## Invariants & deny-list check
- **Invariant 9 (integrity failures are loud):** strengthened — query-time identity mismatch becomes a typed
error instead of silent wrong results.
- **Invariant 10 (query semantics are first-class IR concepts):** embedding identity becomes IR/catalog data,
not an out-of-band env guess.
- **Invariant 11 (transport stays at the boundary):** strengthened — Phase 1 removes the HTTP client + async
runtime (`reqwest`, `tokio`) from `omnigraph-compiler`, whose own manifest advertises "Zero Lance
dependency"; the embedding HTTP client lives only in the engine.
- **Invariant 12 / secret handling:** api-keys resolve through the existing credential chain; never in schema
or checked-in config.
- **Invariant 13 (bounded & observable):** addressed — the deadline bounds latency; tracing makes the
subsystem observable.
- **Deny-list — "silent fallback / dropped rows":** the cross-space ranking is exactly a silent-wrong-result;
this RFC closes it.
- **Deny-list — "new write paths that advance Lance HEAD before manifest publish without a recovery
sidecar":** the ingest phase (deferred) explicitly keeps embedding *before* staging, so it does not create a
new HEAD-advancing write path. No invariant is weakened.
## Drawbacks & alternatives
- **Do nothing.** The happy path works today, so the live risk is narrow (P1 + P3). But the provider hardwiring
and missing validation are a latent silent-wrong-results class that bites the first non-Gemini user.
- **Interim env-only provider switch (no schema record).** Cheaper, but leaves the same-space guarantee to
operator discipline (fails P3). Folded in as Phase 2's env-first resolution, with Phases 3-4 adding the
record/validate guarantee.
- **Trait-based provider plug-ins now.** Rejected as unearned complexity for two first-party providers; the
enum upgrades to a trait behind the same surface if needed.
- **Stamp identity in the manifest or Lance field metadata instead of the IR.** The manifest is the wrong
granularity; field metadata needs net-new wiring and a query-path dataset open. The IR is where `@embed`
already lives and is already read at query time (see spike).
## Reversibility
Mostly reversible. Phases 1-2 and 5 are code/config (env, CLI, cluster keys) and cheap to undo. Phase 3
(recording identity in the schema IR) is **near-permanent** — it changes the on-disk `_schema.ir.json` shape
and the schema hash — so it earns the most scrutiny: the single-arg `@embed` form stays byte-compatible, and
recorded identity is additive (absent identity = today's behaviour). Provider request/response shapes are
external API contracts, not our format, so adding providers is reversible.
## Gateway tradeoff (OpenRouter)
Routing through OpenRouter (the default) buys provider-independence with one key and one billing relationship,
batch input, and access to the GA `google/gemini-embedding-2`. Costs to accept, all controllable:
- **Extra network hop** → more query-path latency. The Phase-1 deadline bounds it; the cache mitigates repeats.
- **Text transits a third party.** OpenRouter's `provider: { data_collection }` routing preference controls
retention; shops with strict residency requirements use `kind: gemini`/`openai-compatible` pointed at the
provider (or a self-hosted endpoint) directly instead of the gateway. Provider-independence means this is a
config change, not a code change.
- **Loses Gemini's task-type asymmetry** when Gemini is reached via the OpenAI-compatible gateway (both sides
embed symmetrically). This is a retrieval-quality cost, **not** a same-space correctness cost — both stored
and query vectors take the identical path, so they stay in one space by construction. Shops that want the
asymmetry use `kind: gemini`.
## Unresolved questions
- GA Gemini model name — **resolved:** `google/gemini-embedding-2` (via OpenRouter) / `gemini-embedding-2`
(direct), 128–3072 dims (recommended 768/1536/3072). Default flips off `-preview` in Phase 2.
- Gemini `batchEmbedContents` availability — **moot** when going through the OpenAI-compatible gateway (its
`input` array batches); still relevant only for the direct `kind: gemini` path.
- Identity granularity: per-vector-property args vs one graph-level default profile referenced by name.
- Whether to backfill recorded identity for existing graphs, or treat absent-identity as "unvalidated, legacy"
permanently.
- Default model for the zero-config tier: `google/gemini-embedding-2` vs `openai/text-embedding-3-large`
(both 3072-capable) — pick the project default.


@ -7,7 +7,7 @@ This file is the always-on map of the test surface. **Consult it before every ta
| Crate | Path | Style |
|---|---|---|
| `omnigraph` (engine) | `crates/omnigraph/tests/` | Integration tests (28 files), fixture-driven, share `tests/helpers/mod.rs` |
| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | Per-area suites (post-modularization): `cli_cluster.rs` (cluster command surface + operator-actor cascade), `cli_cluster_e2e.rs` (spawned-binary lifecycle compositions — lost-state re-import recovery, out-of-band drift, graph-root destruction, multi-graph mixed-disposition convergence), `cli_data.rs` (load/read/change/branch/commit/export/snapshot/policy/embed/maintenance + operator format cascade), `cli_schema_config.rs` (init/config, schema plan/apply, RFC-008 deprecation warnings + `config migrate` + strict mode), `cli_queries.rs`, `system_local.rs` (full-cycle cluster lifecycle with a spawned `--cluster` server, applied-policy enforcement over HTTP, keyed-credential auth, operator aliases), `system_remote.rs`; share `tests/support/mod.rs` (hermetic `OMNIGRAPH_HOME` by default) |
| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | Per-area suites (post-modularization): `cli_cluster.rs` (cluster command surface + operator-actor cascade), `cli_cluster_e2e.rs` (spawned-binary lifecycle compositions — lost-state re-import recovery, out-of-band drift, graph-root destruction, multi-graph mixed-disposition convergence), `cli_data.rs` (load/read/change/branch/commit/export/snapshot/policy/embed/maintenance + operator format cascade), `cli_schema_config.rs` (init/config, schema plan/apply), `cli_queries.rs`, `parity_matrix.rs` (RFC-009 Phase 1: the embedded-vs-remote referee — every forked verb run against both arms with matched Cedar policy and the same actor, scrubbed-JSON + exit-code equality; divergences are pinned in its `KNOWN_DIVERGENCES` ledger, never silently repaired), `system_local.rs` (full-cycle cluster lifecycle with a spawned `--cluster` server, applied-policy enforcement over HTTP, keyed-credential auth, operator aliases), `system_remote.rs`; share `tests/support/mod.rs` (hermetic `OMNIGRAPH_HOME` by default) |
| `omnigraph-cluster` | mostly in-source `#[cfg(test)] mod tests`; `tests/failpoints.rs` (feature-gated); `tests/s3_cluster.rs` (bucket-gated full lifecycle on object storage) | Cluster config parser, local JSON state diff, state CAS/lock handling/recovery, read-only validate/plan/status plus explicit refresh/import graph observations, config-only apply (content-addressed payload publish, disposition gating, composite-digest convergence, idempotent re-apply), catalog payload verification (status read-only, refresh drift + self-heal), failpoint crash-mid-apply / CAS-race coverage, Stage 4A graph creation (create executor, recovery sidecars + sweep rows, create crash windows), Stage 4B schema apply (migration previews in plan, schema executor, schema-apply sweep classification, schema crash windows), Stage 4C gated deletes (digest-bound approvals, delete executor + tombstones, delete sweep rows, delete crash windows), and 5A policy binding metadata (applies_to in the applied revision, binding-change diffing + convergence, pre-5A backfill), and the 5B serving-snapshot read API (converged read, refusal rows) |
| `omnigraph-server` | `crates/omnigraph-server/tests/` | Per-area suites (post-modularization): `auth_policy.rs`, `data_routes.rs`, `schema_routes.rs`, `stored_queries.rs`, `multi_graph.rs` (cluster-mode boot — converged serving, policy binding wiring, boot refusals — + the concurrent branch-ops matrix), `boot_settings.rs` (mode inference, PolicySource), `s3.rs` (bucket-gated: single-graph serving + config-free `--cluster s3://` boot), `openapi.rs` (OpenAPI drift / regeneration); share `tests/support/mod.rs` |
| `omnigraph-compiler` | mostly in-source `#[cfg(test)] mod tests` | Parser, type-checker, IR lowering, lint |
@ -29,7 +29,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
| `changes.rs` | `diff_between` / `diff_commits` |
| `consistency.rs` | Cross-table snapshot isolation, atomic publish |
| `schema_apply.rs` | Migration plan + apply, schema-apply lock |
| `schema_apply.rs` | Migration plan + apply, schema-apply lock; index materialization deferred to the reconciler (iss-848): `apply_schema_defers_vector_index_on_empty_table` (an empty-table Vector `@index` never aborts the apply) and `index_only_constraint_apply_touches_no_table_data` (adding an `@index` is metadata-only — no table-version bump) |
| `search.rs` | FTS / vector / hybrid (`bm25`, `nearest`, `rrf`) |
| `traversal.rs` | `Expand`, variable-length hops, anti-join (CSR path — `OMNIGRAPH_TRAVERSAL_MODE` unset) |
| `traversal_indexed.rs` | BTREE-indexed Expand (`execute_expand_indexed`) forced via `OMNIGRAPH_TRAVERSAL_MODE`, asserted semantically equal to the CSR path; own binary, all `#[serial]` so env writes never race |
@ -42,7 +42,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice |
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). |
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |


@ -19,8 +19,14 @@ publisher's row-level CAS on `__manifest` is the single fence.
`__run__*` branch on an upgraded graph is swept off `__manifest` by the
v2→v3 internal-schema migration on first read-write open. (The inert
`_graph_runs.lance` bytes remain until a `delete_prefix` primitive lands.)
- Cancelled mutation futures leave **no graph-level state** — only orphaned
Lance fragments, which the existing `omnigraph cleanup` pipe reclaims.
- Cancelled mutation futures leave **no graph-visible state** — the manifest
is never advanced. They can leave two kinds of unreferenced residue, both
self-healing: orphaned Lance fragments (reclaimed by `omnigraph cleanup`),
and — on the *first* write to a table on a branch, which forks it before the
publish — a manifest-unreferenced branch ref. The next write to that table
reclaims the stale fork and re-forks (`reclaim_orphaned_fork_and_refork`),
and `cleanup`'s per-table reconciler is the guaranteed backstop; see the
fork-reclaim note in [invariants.md](invariants.md).
## Read-your-writes within a multi-statement mutation
@ -80,10 +86,17 @@ deferred to a follow-up cycle — tracked).
Three writers have been migrated onto staged primitives:
* **`ensure_indices`** (`db/omnigraph/table_ops.rs::build_indices_on_dataset_for_catalog`)
— scalar indices (BTree, Inverted) now use `stage_create_*_index` +
`commit_staged`. Vector indices stay inline (residual — Lance
`build_index_metadata_from_segments` is `pub(crate)` in 6.0.1;
companion ticket to lance-format/lance#6658 needed).
— scalar indices (BTree, Inverted) use `stage_create_*_index` +
`commit_staged`. Which index a `@index`/`@key` property gets is dispatched by
type via `node_prop_index_kind` (enum + orderable scalar → BTree, free-text
String → Inverted/FTS, Vector → vector). Vector indices stay inline (residual
— Lance `build_index_metadata_from_segments` is `pub(crate)` in 6.0.1;
companion ticket to lance-format/lance#6658 needed). This build is
existence-gated (it creates a *missing* index over current fragments); folding
fragments appended afterward into an *existing* index is `optimize`'s
`optimize_indices` pass — an inline-commit residual, not a staged write (Lance
exposes no uncommitted index-optimize), covered by the optimize recovery
sidecar (see [maintenance.md](../user/operations/maintenance.md)).
* **`branch_merge::publish_rewritten_merge_table`**
(`exec/merge.rs`) — merge_insert now uses `stage_merge_insert` +
`commit_staged`. Deletes stay inline (Lance #6658 residual).
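
The type-driven index dispatch described for `ensure_indices` above can be sketched as a single `match`. This is a simplified illustration, not the real `node_prop_index_kind`: the `PropType` and `IndexKind` enums and the function name `index_kind_for` are assumptions made for the example, and the real property type system is richer than these variants.

```rust
/// Hypothetical, simplified property types (illustration only).
#[derive(Debug, Clone, PartialEq)]
enum PropType {
    Enum,
    Int,
    Float,
    Date,
    /// Free-text String property.
    FreeText,
    Vector { dims: usize },
}

/// The three index families named in the text.
#[derive(Debug, PartialEq, Eq)]
enum IndexKind {
    BTree,
    Inverted,
    Vector,
}

/// Enum and orderable scalars get a BTree, free text gets an
/// Inverted (FTS) index, and vectors get a vector index.
fn index_kind_for(ty: &PropType) -> IndexKind {
    match ty {
        PropType::Enum | PropType::Int | PropType::Float | PropType::Date => IndexKind::BTree,
        PropType::FreeText => IndexKind::Inverted,
        PropType::Vector { .. } => IndexKind::Vector,
    }
}
```

Because the `match` is exhaustive, adding a property type in a design like this forces a compile-time decision about which index it receives.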
@ -305,7 +318,7 @@ success and one failure. The losing writer's error is
`ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected,
actual }`. The HTTP server maps this to **409 Conflict** with body
`{"error": "...", "code": "conflict", "manifest_conflict": { "table_key":
"...", "expected": N, "actual": M }}` — see [docs/user/server.md](../user/server.md).
"...", "expected": N, "actual": M }}` — see [docs/user/server.md](../user/operations/server.md).
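
As a rough illustration of the mapping above, the sketch below turns an expected-version mismatch into a status code and a JSON body with the documented keys. It is std-only and hypothetical: the struct, `conflict_response`, and `json_string` are illustrative names, the hand-rolled JSON escaping stands in for whatever serializer the server actually uses, and the table key in the usage note is made up.

```rust
/// Hypothetical stand-in for the details carried by
/// `ManifestConflictDetails::ExpectedVersionMismatch`.
struct ExpectedVersionMismatch {
    table_key: String,
    expected: u64,
    actual: u64,
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Map the losing writer's error to (HTTP status, JSON body).
fn conflict_response(message: &str, d: &ExpectedVersionMismatch) -> (u16, String) {
    let body = format!(
        r#"{{"error": {}, "code": "conflict", "manifest_conflict": {{"table_key": {}, "expected": {}, "actual": {}}}}}"#,
        json_string(message),
        json_string(&d.table_key),
        d.expected,
        d.actual
    );
    (409, body)
}
```

For a mismatch on a made-up key `node:Person` with `expected: 7` and `actual: 8`, this returns status `409` and a body whose `manifest_conflict` object carries those three fields, which a client can use to refresh and retry against the new version.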
## Audit