mirror of https://github.com/ModernRelay/omnigraph.git (synced 2026-06-18 02:24:27 +02:00)
Merge remote-tracking branch 'origin/main' into ragnorc/omnigraph-mcp-crate
commit c08e8dbac4
173 changed files with 20828 additions and 10366 deletions
|
|
@ -1,6 +1,6 @@
|
|||
# Architecture
|
||||
|
||||
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a single `omnigraph.yaml`.
|
||||
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets, with Git-style branches and commits across the whole graph, multi-modal querying (vector + FTS + BM25 + RRF + graph traversal) in one runtime, an HTTP server with Cedar policy, and a CLI driven by a per-operator `~/.omnigraph/config.yaml` plus team-owned cluster directories.
|
||||
|
||||
## Reading guide
|
||||
|
||||
|
|
@ -10,7 +10,7 @@ Three views, increasing zoom:
|
|||
2. **Layer view** — the eight-layer stack inside one OmniGraph process.
|
||||
3. **Component zoom-ins** — what's inside each layer.
|
||||
|
||||
For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a graph, see [`docs/user/storage.md`](../user/storage.md).
|
||||
For runtime flows (read query, mutation), see [`docs/dev/execution.md`](execution.md). For the on-disk layout of a graph, see [`docs/user/storage.md`](../user/concepts/storage.md).
|
||||
|
||||
L1 (orange in the diagrams) is what we inherit from Lance; L2 (blue) is what OmniGraph adds. The L1/L2 framing is also called out in prose at the bottom of this doc.
|
||||
|
||||
|
|
@ -280,7 +280,7 @@ flowchart LR
|
|||
eng --> wq
|
||||
```
|
||||
|
||||
The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc<WriteQueueManager>` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/server.md) "Per-actor admission control" and [docs/dev/writes.md](writes.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly.
|
||||
The server applies Cedar policy at the HTTP boundary today. The roadmap, called out in [docs/dev/invariants.md](invariants.md) as a known gap, is to push policy into the planner as predicates. After Cedar, mutating handlers go through `WorkloadController` (per-actor admission cap + byte budget; PR 2 / MR-686) before reaching the engine. The engine itself holds an `Arc<WriteQueueManager>` so concurrent mutations on the same `(table, branch)` serialize at the queue, while disjoint keys run in parallel — see [docs/user/server.md](../user/operations/server.md) "Per-actor admission control" and [docs/dev/writes.md](writes.md). The CLI bypasses the HTTP layer (and admission) and calls the engine API directly.
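The per-key serialization described above can be illustrated with a minimal sketch. `KeyedWriteQueue` and its methods are hypothetical names invented for this example; this is not the real `WriteQueueManager`, only one plausible shape for "same `(table, branch)` serializes, disjoint keys run in parallel", assuming a plain mutex per key.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for a per-key write queue (not the real `WriteQueueManager`).
#[derive(Default)]
pub struct KeyedWriteQueue {
    // One lock per (table, branch) key, created lazily on first use.
    locks: Mutex<HashMap<(String, String), Arc<Mutex<()>>>>,
}

impl KeyedWriteQueue {
    /// Returns the lock guarding one (table, branch) key.
    /// The map lock is released before this function returns.
    pub fn lock_for(&self, table: &str, branch: &str) -> Arc<Mutex<()>> {
        let mut locks = self.locks.lock().unwrap();
        locks
            .entry((table.to_string(), branch.to_string()))
            .or_default()
            .clone()
    }

    /// Runs `write` while holding only the lock for this key, so writers on
    /// the same key take turns and writers on other keys are not blocked.
    pub fn with_key<T>(&self, table: &str, branch: &str, write: impl FnOnce() -> T) -> T {
        let key_lock = self.lock_for(table, branch);
        let _guard = key_lock.lock().unwrap();
        write()
    }
}
```

Two calls with the same `(table, branch)` share one lock, while a different branch of the same table gets its own. That is the property the paragraph relies on.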
|
||||
|
||||
Code paths:
|
||||
|
||||
|
|
|
|||
|
|
@ -8,7 +8,7 @@ This page explains what the policy says and how to change it.
|
|||
|
||||
| Setting | Value | Why |
|
||||
|---|---|---|
|
||||
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test Workspace`, `Test omnigraph-server --features aws`, `CODEOWNERS matches source`, `CODEOWNERS not hand-edited` | Every PR must pass workspace tests, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. The two CODEOWNERS contexts must equal the job `name:` values in `.github/workflows/codeowners.yml` **verbatim** — a context naming a job that never reports (the old `CODEOWNERS / drift` used the job *id*, and the job was path-filtered) leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
|
||||
| **Required status checks (strict)** | `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws`, `CODEOWNERS matches source`, `CODEOWNERS not hand-edited` | Every PR must pass the AWS-feature build/test, AGENTS.md link integrity, and the CODEOWNERS hygiene checks. **`Test Workspace` is deliberately NOT required** — it runs only on push to `main` (post-merge), tags, and manual `workflow_dispatch`, to keep PR turnaround fast (it was the ~15min+ slow gate). It is therefore *not* listed here: a required check that never reports on PRs (the `test` job is `if: github.event_name != 'pull_request'`) would leave every PR permanently pending — the same job-never-reports trap the CODEOWNERS contexts call out below. The trade-off (a regression lands on `main` and is caught by the post-merge run, so `main` can briefly go red) and its mitigations are documented in [ci.md](ci.md). The two CODEOWNERS contexts must equal the job `name:` values in `.github/workflows/codeowners.yml` **verbatim** — a context naming a job that never reports (the old `CODEOWNERS / drift` used the job *id*, and the job was path-filtered) leaves every PR permanently pending and forces admin overrides. `strict: true` requires the branch to be up-to-date with `main` before merge. |
|
||||
| **Required approving reviews** | `1` | At least one reviewer. With a 2-person team, going higher would block all merges when one person is unavailable. |
|
||||
| **Require code-owner reviews** | `true` | The reviewer must be a code owner per `.github/CODEOWNERS`. This is what makes the codeowners chassis enforced. |
|
||||
| **Dismiss stale reviews on new commits** | `true` | A push after approval invalidates the prior review. Prevents the "approve, then sneak in unreviewed changes" pattern. |
|
||||
|
|
|
|||
|
|
@ -3,6 +3,9 @@
|
|||
`.github/workflows/`:
|
||||
|
||||
- **ci.yml**: text-only changes skip; otherwise `cargo test --workspace --locked` on ubuntu-latest with protobuf compiler. OpenAPI-drift check that auto-commits the regenerated `openapi.json` for same-repository PRs. Also runs the AGENTS.md cross-link integrity check (`scripts/check-agents-md.sh`).
|
||||
- **`Test Workspace` does not run on pull requests.** The job is gated `if: github.event_name != 'pull_request'`, so the full workspace + failpoints suite runs only on push to `main` (post-merge), on `v*` tags, and on manual `workflow_dispatch`. This was a deliberate PR-latency trade-off: it was the slowest gate (~15min warm, up to the 75min cold ceiling). `RustFS S3 Integration` declares `needs: test`, so it is push-/dispatch-only for the same reason. The fast PR gates remain: `Classify Changes`, `Check AGENTS.md Links`, `Test omnigraph-server --features aws`, and the two CODEOWNERS checks. `Test Workspace` is correspondingly **not** in the required-check list (`.github/branch-protection.json`); see [branch-protection.md](branch-protection.md).
|
||||
- **Consequences to internalize:** (1) a regression that the suite would catch now lands on `main` and turns the post-merge run red, rather than being blocked pre-merge — `main` can briefly break, so run `cargo test --workspace --locked` locally before merging anything non-trivial, or trigger this workflow on your branch via the Actions "Run workflow" button. (2) `openapi.json` is no longer auto-regenerated on PRs (that step is inside the `test` job); for server/API changes, regenerate it locally with `OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi` and commit it, or the strict drift check fails the post-merge `main` run.
|
||||
- **Applying this policy:** removing `Test Workspace` from the JSON is inert until an admin runs `./scripts/apply-branch-protection.sh`. **Run it immediately after this change merges** — until then GitHub still requires a `Test Workspace` context that no longer reports on PRs, which leaves every open PR permanently pending (the job-never-reports trap).
|
||||
- **AWS feature build job**: `cargo build/test -p omnigraph-server --features aws` on ubuntu-latest.
|
||||
- **Windows binary build job**: `cargo build --release --locked -p omnigraph-cli -p omnigraph-server` on windows-latest with smoke checks for `omnigraph.exe version`, `omnigraph-server.exe --help`, and PowerShell installer syntax.
|
||||
- **RustFS S3 integration**: spins up RustFS in Docker, runs `s3_storage`, `server_opens_s3_graph_directly_and_serves_snapshot_and_read`, and `local_cli_s3_end_to_end_init_load_read_flow`.
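The job-never-reports trap mentioned in the list above comes down to a set difference: any required context that no PR-triggered job reports stays pending forever. The sketch below is only an illustration of that check. The function name `never_reporting` is hypothetical and is not part of the repository's scripts.

```rust
use std::collections::BTreeSet;

/// Returns the required status-check contexts that no job reports on pull
/// requests. A non-empty result means every PR would stay pending.
/// Hypothetical helper for illustration only.
pub fn never_reporting<'a>(required: &[&'a str], reported_on_pr: &[&str]) -> Vec<&'a str> {
    let reported: BTreeSet<&str> = reported_on_pr.iter().copied().collect();
    required
        .iter()
        .copied()
        .filter(|ctx| !reported.contains(ctx))
        .collect()
}
```

With `Test Workspace` still listed as required but gated off for pull requests, the function returns that context. Once it is dropped from the required list, the result is empty.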
|
||||
|
|
|
|||
|
|
@ -3,11 +3,11 @@
|
|||
**Status:** Draft / thinking-in-progress
|
||||
**Type:** Architecture direction
|
||||
**Date:** 2026-06-07
|
||||
**Relationship:** generalizes today's `omnigraph.yaml` graph/query/policy configuration surface ([CLI reference](../user/cli-reference.md), [server docs](../user/server.md)) into a future cluster control plane. The distilled rules are in [cluster-axioms.md](cluster-axioms.md); detailed downstream implementation spec and blast-radius assessment in [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md). This is a proposed architecture, not an implemented RFC.
|
||||
**Relationship:** generalizes today's `omnigraph.yaml` graph/query/policy configuration surface ([CLI reference](../user/cli/reference.md), [server docs](../user/operations/server.md)) into a future cluster control plane. The distilled rules are in [cluster-axioms.md](cluster-axioms.md); detailed downstream implementation spec and blast-radius assessment in [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md). This is a proposed architecture, not an implemented RFC.
|
||||
|
||||
> **Implementation status.** The examples below describe the full target schema.
|
||||
> Stage 2B only accepts the read-only subset documented in
|
||||
> [cluster-config.md](../user/cluster-config.md). Future-phase fields such as
|
||||
> [cluster-config.md](../user/clusters/config.md). Future-phase fields such as
|
||||
> `env_file`, `apply`, `providers`, `pipelines`, `embeddings`, `ui`, `aliases`,
|
||||
> and `bindings` are intentionally rejected with typed diagnostics until their
|
||||
> reconciler semantics are implemented.
|
||||
|
|
|
|||
|
|
@ -177,4 +177,4 @@ For all three modes, a mid-load failure (RI / cardinality violation, validation
|
|||
|
||||
## Embeddings during load
|
||||
|
||||
If a node type has `@embed` properties, the loader calls the engine embedding client (Gemini, RETRIEVAL_DOCUMENT) per row to populate the vector column. See [embeddings.md](../user/embeddings.md).
|
||||
The loader does **not** embed `@embed` properties at load time. `@embed` is a catalog annotation consumed by query typecheck/lint; vectors are supplied directly in the load data, or pre-filled by the offline `omnigraph embed` pipeline. Query-time `nearest($v, "string")` auto-embeds the query string via the provider-independent embedding client. See [embeddings.md](../user/search/embeddings.md). (Ingest-time `@embed` execution is a planned RFC-012 phase.)
|
||||
|
|
|
|||
|
|
@ -20,13 +20,13 @@ constraints. User-facing behavior should still be documented through
|
|||
| Area | Read |
|
||||
|---|---|
|
||||
| System structure, L1/L2 framing, component diagrams | [architecture.md](architecture.md) |
|
||||
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/storage.md) |
|
||||
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/concepts/storage.md) |
|
||||
| Direct-publish writes, D2, staged writes, recovery sidecars | [writes.md](writes.md) |
|
||||
| Query execution, mutation execution, loader flow | [execution.md](execution.md) |
|
||||
| Index lifecycle and graph topology indexes | [indexes.md](../user/indexes.md) |
|
||||
| Branch and commit internals | [branches-commits.md](../user/branches-commits.md) |
|
||||
| Index lifecycle and graph topology indexes | [indexes.md](../user/search/indexes.md) |
|
||||
| Branch and commit internals | [branches-commits.md](../user/branching/index.md) |
|
||||
| Three-way merge implementation and conflicts | [merge.md](merge.md) |
|
||||
| Diff/change-feed implementation | [changes.md](../user/changes.md) |
|
||||
| Diff/change-feed implementation | [changes.md](../user/branching/changes.md) |
|
||||
| Branch protection policy | [branch-protection.md](branch-protection.md) |
|
||||
| CODEOWNERS source of truth | [codeowners.md](codeowners.md) |
|
||||
|
||||
|
|
@ -34,14 +34,14 @@ constraints. User-facing behavior should still be documented through
|
|||
|
||||
| Area | Read |
|
||||
|---|---|
|
||||
| Schema grammar, catalog, migration planner | [schema-language.md](../user/schema-language.md) |
|
||||
| Query grammar, IR, lints, mutation restrictions | [query-language.md](../user/query-language.md) |
|
||||
| Embedding client and `@embed` integration | [embeddings.md](../user/embeddings.md) |
|
||||
| Cedar policy surface and server gating | [policy.md](../user/policy.md) |
|
||||
| Server auth, OpenAPI, endpoint handlers | [server.md](../user/server.md) |
|
||||
| Error taxonomy and serialization | [errors.md](../user/errors.md) |
|
||||
| Constants and tunables | [constants.md](../user/constants.md) |
|
||||
| Transaction model public contract | [transactions.md](../user/transactions.md) |
|
||||
| Schema grammar, catalog, migration planner | [schema-language.md](../user/schema/index.md) |
|
||||
| Query grammar, IR, lints, mutation restrictions | [query-language.md](../user/queries/index.md) |
|
||||
| Embedding client and `@embed` integration | [embeddings.md](../user/search/embeddings.md) |
|
||||
| Cedar policy surface and server gating | [policy.md](../user/operations/policy.md) |
|
||||
| Server auth, OpenAPI, endpoint handlers | [server.md](../user/operations/server.md) |
|
||||
| Error taxonomy and serialization | [errors.md](../user/operations/errors.md) |
|
||||
| Constants and tunables | [constants.md](../user/reference/constants.md) |
|
||||
| Transaction model public contract | [transactions.md](../user/branching/transactions.md) |
|
||||
|
||||
## Project Operations
|
||||
|
||||
|
|
@ -79,6 +79,9 @@ Working documents for in-flight feature work. Removed when the work lands.
|
|||
| Per-operator config — `~/.omnigraph/` identity, keyed credentials, named servers (the operator slice of RFC-002) | [rfc-007-operator-config.md](rfc-007-operator-config.md) |
|
||||
| Deprecate `omnigraph.yaml` — one concern per config surface; key-by-key migration map and staged retirement | [rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md) |
|
||||
| Unify CLI embedded/remote access paths — parity referee, shared wire-DTO crate, `GraphClient` trait, declared plane capabilities | [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) |
|
||||
| Restructure the CLI around explicit planes — one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | [rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) |
|
||||
| CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
|
||||
| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
|
||||
|
||||
## Boundary
|
||||
|
||||
|
|
|
|||
|
|
@ -15,6 +15,38 @@ Use it this way:
|
|||
- Keep implementation ledgers, roadmap detail, and historical MR notes in the
|
||||
per-area docs. This file is the filter, not the encyclopedia.
|
||||
|
||||
## Governing principle: logical contract over physical state
|
||||
|
||||
The hard invariants below are instances of one rule. Keep it in view whenever
|
||||
a change touches the boundary between what the graph *means* and how it is
|
||||
physically stored.
|
||||
|
||||
> **Logical state is the contract. Physical state — index coverage, fragment
|
||||
> layout, compaction versions, staged writes — is derived, rebuildable, and may
|
||||
> be produced asynchronously. A physical operation must never fail a logical
|
||||
> one. Preconditions are checked against logical state; physical reconciliation
|
||||
> is idempotent and may lag or retry. Genuine logical conflicts still fail
|
||||
> loudly: the licence to lag covers physical convergence, not correctness.**
|
||||
|
||||
Invariants that instantiate it: **2** (manifest-atomic visibility) and **5**
|
||||
(recovery is part of the commit protocol) — a partially-written physical layer
|
||||
never changes what a graph commit means; **7** (indexes are derived state) — a
|
||||
query is correct under partial index coverage, and expensive index work
|
||||
converges from manifest state instead of gating the write path; **13** (failures
|
||||
bounded and observable) — the licence to lag is not a licence to drop, so a
|
||||
physical step that cannot make progress is surfaced, not swallowed. Deny-list
|
||||
items that enforce it: synchronous inline vector/FTS index rebuilds on the
|
||||
commit path; state that drifts from Lance or the manifest when it can be
|
||||
derived; job queues for manifest-derivable state where a reconciler fits.
|
||||
|
||||
The failure shape it rules out: a legitimate background operation on the
|
||||
physical layer (compaction, an index build, an interrupted staged write) is
|
||||
allowed to break a logical operation (a query's correctness, a migration's
|
||||
success, a branch's writability). The smell to watch for is a logical operation
|
||||
whose precondition is a *physical* fact — a cached file version, an index's
|
||||
existence, a fragment count. Make the precondition logical and let a reconciler
|
||||
converge the physical state.
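A toy model may make the rule concrete. Everything in it is an assumption made for illustration (the `Table` struct, its fields, and both methods are hypothetical, not engine types): declared index intent is logical state, built indexes are physical state, a query gives the same answer with or without the index, and the reconciler is idempotent.

```rust
use std::collections::BTreeSet;

/// Hypothetical model: declared index intent (logical) vs. built indexes (physical).
pub struct Table {
    pub declared_indexes: BTreeSet<String>, // logical: what the schema asks for
    pub built_indexes: BTreeSet<String>,    // physical: what exists right now
    pub ages: Vec<i64>,                     // the data of a single `age` column
}

impl Table {
    /// The query has no physical precondition. An index would only change
    /// the access path, never the answer.
    pub fn count_age(&self, age: i64) -> usize {
        let _indexed = self.built_indexes.contains("age"); // plan choice only
        self.ages.iter().filter(|a| **a == age).count()
    }

    /// Idempotent reconciler: builds declared-but-missing indexes and
    /// returns the names it built. A second run finds nothing to do.
    pub fn reconcile(&mut self) -> Vec<String> {
        let missing: Vec<String> = self
            .declared_indexes
            .difference(&self.built_indexes)
            .cloned()
            .collect();
        self.built_indexes.extend(missing.iter().cloned());
        missing
    }
}
```

The query result is identical before and after `reconcile`, and calling `reconcile` again is a no-op, which is the "may lag or retry" property in miniature.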
|
||||
|
||||
## Hard Invariants
|
||||
|
||||
1. **Respect the substrate.** Lance owns columnar storage, per-dataset
|
||||
|
|
@ -58,7 +90,7 @@ Use it this way:
|
|||
branch they read even when index coverage is partial. Expensive index work
|
||||
should converge from manifest state instead of extending the critical write
|
||||
path. Scalar staged index builds and vector inline residuals are documented
|
||||
in [writes.md](writes.md) and [indexes.md](../user/indexes.md).
|
||||
in [writes.md](writes.md) and [indexes.md](../user/search/indexes.md).
|
||||
|
||||
8. **Schema identity survives renames.** Accepted schema identity must remain
|
||||
stable across type and property renames. Rename support belongs in migration
|
||||
|
|
@ -100,14 +132,14 @@ Use it this way:
|
|||
|---|---|---|
|
||||
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) |
|
||||
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) |
|
||||
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [writes.md](writes.md) |
|
||||
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branches-commits.md), [maintenance.md](../user/maintenance.md) |
|
||||
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema-language.md), [execution.md](execution.md) |
|
||||
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema-language.md) |
|
||||
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/queries/index.md), [writes.md](writes.md) |
|
||||
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) |
|
||||
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema/index.md), [execution.md](execution.md) |
|
||||
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema/index.md) |
|
||||
| Storage trait | `TableStorage` (via `db.storage()`) is staged-only; the inline-commit residuals (`delete_where`, `create_vector_index`) are split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so §1 holds by construction; capability/stat surfaces are roadmap | [writes.md](writes.md), [architecture.md](architecture.md) |
|
||||
| Index lifecycle | `ensure_indices` is explicit today; reconciler-based convergence is roadmap | [indexes.md](../user/indexes.md), [maintenance.md](../user/maintenance.md) |
|
||||
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/query-language.md) |
|
||||
| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/server.md), [policy.md](../user/policy.md) |
|
||||
| Index lifecycle | `@index`/`@key` declares *intent*; the physical index is derived state and never fails a logical op. `schema apply` builds no indexes (records intent only; index-only changes touch no table data). `load`/`mutate` build inline through one chokepoint (`build_indices_on_dataset_for_catalog`, type-dispatched by `node_prop_index_kind`: enum + orderable scalar → BTREE, free-text String → FTS, Vector → vector) that fault-isolates an untrainable Vector column into a *pending* index instead of aborting. `optimize`/`ensure_indices` is the reconciler: it creates declared-but-missing indexes and folds appended/rewritten fragments into existing ones (`optimize_indices`), reporting still-pending columns. Explicit maintenance call, not yet a background loop | [indexes.md](../user/search/indexes.md), [maintenance.md](../user/operations/maintenance.md) |
|
||||
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/queries/index.md) |
|
||||
| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/operations/server.md), [policy.md](../user/operations/policy.md) |
|
||||
| Tests | Tempdir-backed Lance tests are the current substrate; the storage adapter has an in-memory backend for adapter-level contract tests, but Lance datasets bypass it | [testing.md](testing.md) |
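The "Unique constraints" row calls the composite key a separator-free `Vec<String>` tuple. A small sketch shows why that shape matters. Both functions below are hypothetical illustrations, not the real `loader::composite_unique_key`: a key joined with a separator collides when a value contains that separator, while a per-column tuple keeps the boundaries.

```rust
/// Joined-string key: ambiguous when a value contains the separator.
/// Hypothetical counter-example, shown only to illustrate the collision.
pub fn joined_key(values: &[&str]) -> String {
    values.join("|")
}

/// Separator-free key: one element per column, so column boundaries are
/// part of the key itself. Hypothetical sketch of the tuple shape.
pub fn tuple_key(values: &[&str]) -> Vec<String> {
    values.iter().map(|v| v.to_string()).collect()
}
```

The rows `("a|b", "c")` and `("a", "b|c")` produce the same joined key but different tuple keys, so only the tuple form can tell them apart in a uniqueness check.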
|
||||
|
||||
The branch-delete reconciler is authority-derived: it reclaims orphaned forks
|
||||
|
|
@ -132,13 +164,18 @@ them explicit.
|
|||
new writer cannot couple a write with a HEAD advance through the default
|
||||
surface. The dead legacy methods (`append_batch` on the trait,
|
||||
`merge_insert_batch{,es}`, `create_{btree,inverted}_index`) were removed. The
|
||||
remaining residuals are `delete_where` (gated on MR-A — Lance v7.x bump)
|
||||
and `create_vector_index` (gated on Lance #6666); see
|
||||
[lance.md](lance.md) and [writes.md](writes.md). New write paths should use
|
||||
the staged shape unless a documented Lance blocker applies.
|
||||
remaining residuals are `delete_where` and `create_vector_index`. The Lance
|
||||
6.0.1 → 7.0.0 bump landed, so the staged two-phase delete API
|
||||
(`DeleteBuilder::execute_uncommitted`, Lance #6658) is now available and MR-A
|
||||
is unblocked — but the migration itself is still pending, so `delete_where`
|
||||
stays inline for now. `create_vector_index` remains gated on Lance #6666
|
||||
(still open). See [lance.md](lance.md) and [writes.md](writes.md). New write
|
||||
paths should use the staged shape unless a documented Lance blocker applies.
|
||||
- **Deletes and vector indexes:** `delete_where` and vector index creation still
|
||||
advance Lance HEAD inline because the required public Lance APIs are missing.
|
||||
Keep D2 and recovery coverage in place until those residuals are removed.
|
||||
advance Lance HEAD inline. The public delete two-phase API now exists (Lance
|
||||
#6658 shipped in 7.0.0), so the delete residual is unblocked pending the MR-A
|
||||
migration; vector index creation is still blocked (Lance #6666 open). Keep D2
|
||||
and recovery coverage in place until those residuals are removed.
|
||||
- **Blob-column compaction:** Lance `compact_files` mis-decodes blob-v2 columns
|
||||
under its forced `BlobHandling::AllBinary` read ("more fields in the schema
|
||||
than provided column indices"), so `optimize` skips any table with a `Blob`
|
||||
|
|
@ -160,6 +197,22 @@ them explicit.
|
|||
one-winner-CAS territory; closing this fully needs a cross-process
|
||||
serialization primitive (e.g. lease-based use of the schema-apply lock
|
||||
branch) — design it before promoting multi-process write topologies.
|
||||
- **Fork reclaim is in-process-safe only:** the first write to a table on a
|
||||
branch forks it (a Lance `create_branch` that advances state before the
|
||||
manifest publish). An interrupted fork (crash, or a cancelled request
|
||||
future) leaves a manifest-unreferenced branch ref. The next write self-heals
|
||||
it — `reclaim_orphaned_fork_and_refork` (`force_delete_branch` + re-fork)
|
||||
— but reclaim is only safe because the writer holds the per-`(table,
|
||||
branch)` write queue from before the fork through the publish AND re-checks
|
||||
the live manifest under it, so no *in-process* writer can be mid-fork. A
|
||||
reclaim cannot serialize against a foreign-*process* in-flight fork: it may
|
||||
force-delete a peer's just-created ref, which makes that peer's commit fail
|
||||
and retry — the same one-winner-CAS exposure as above, not corruption. The
|
||||
reclaim never fires unless in-process-queue + manifest authority both prove
|
||||
the ref is manifest-unreferenced. `cleanup`'s per-table reconciler
|
||||
(`reconcile_orphaned_branches`) is the guaranteed backstop for any fork the
|
||||
write path never revisits. Both degrade to a no-op if Lance ships an atomic
|
||||
multi-dataset branch op.
|
||||
- **Local `write_text_if_match` is not a cross-process CAS:** object-store
|
||||
backends use a true conditional put (ETag If-Match; the in-memory test
|
||||
backend too), but upstream `object_store` leaves `PutMode::Update`
|
||||
|
|
|
|||
|
|
@ -156,7 +156,24 @@ If a future need pulls one of these into scope, add a row to the matching domain
|
|||
|
||||
When Lance ships a major release that changes any of the above (file format bump, new index type, transaction semantics change, new branching primitive), refresh this index in the same change as the omnigraph upgrade. Stale Lance pointers are worse than no pointers.
|
||||
|
||||
### Last alignment audit: 2026-05-22 (Lance 6.0.1 upstream; omnigraph pinned at 6.0.1)
|
||||
### Last alignment audit: 2026-06-15 (Lance 7.0.0 upstream; omnigraph pinned at 7.0.0)
|
||||
|
||||
Migration from Lance 6.0.1 → 7.0.0 landed in this cycle. **Arrow stayed 58, DataFusion stayed 53** (no change) — the only transitive bump is `object_store` 0.12.5 → 0.13.2. 141 upstream commits reviewed (6.0.1 → 7.0.0); no fixes lost (the 6.0.x release-branch backports are all forward-ported into 7.0.0). Behavior-affecting findings:
|
||||
|
||||
- **object_store 0.13 moved convenience methods behind a new `ObjectStoreExt` trait** (`get`/`put`/`head`/`rename`/`delete`; `list`/`list_with_delimiter`/`put_opts` stay on the core `ObjectStore` trait). Fix = add `use object_store::ObjectStoreExt;` to `storage.rs` and `db/manifest/namespace.rs`; no call-site changes. Mirrors Lance's own migration in PR #6672. The local-FS `PutMode::Update` gap is unchanged (still unimplemented upstream), so `storage.rs::write_text_if_match`'s local content-token emulation stays.
|
||||
- **`roaring` must be pinned to 0.11.4** (`cargo update -p roaring --precise 0.11.4`). Lance 7.0.0's `UpdatedFragmentOffsets` newtype (PR #6650) derives `Eq` over `HashMap<u64, RoaringBitmap>`, which needs `RoaringBitmap: Eq` — added only in roaring 0.11.4 (roaring-rs PR #341). Lance's loose `roaring = "0.11"` constraint otherwise resolves the broken 0.11.3 and **lance itself fails to compile** (`RoaringBitmap: Eq is not satisfied`). roaring is transitive (no direct workspace dep); the pin lives only in `Cargo.lock`.
|
||||
- **`_row_created_at_version` for merge-insert INSERT rows now = the commit version** (PR #6774; was a fallback of 1 / dataset-creation version). Flipped `lance_version_columns.rs::lance_merge_insert_new_row_stamps_created_at_version` to assert `== v2`. Production change-detection keys on `_row_last_updated_at_version` + ID-set membership, so classification logic is unaffected (the `changes/mod.rs` rationale comment was corrected).
|
||||
- **BTREE range-query bound inclusiveness fixed** (PR #6796, issue #6792): `x <= hi AND x > lo` returned the wrong boundary row on 6.0.1. omnigraph today builds BTREE only on string `@key` columns (`id`/`src`/`dst`) and queries them by equality/IN, not range, so its *current* query patterns almost certainly never hit this bug — but the corrected boundary semantics are a contract we rely on the moment a BTREE-range path appears (BTREE-on-properties via the index-type tickets, or a range-on-key query). Pinned by `lance_surface_guards.rs::btree_range_query_boundary_is_correct` (reproduces #6792's 5-row + BTREE shape).
|
||||
- **`WriteParams::auto_cleanup` default flipped from on (every-20-commits) to `None`** (PR #6755). On 6.0.1 the on-by-default hook could GC versions the `__manifest` pins for snapshots/time-travel. omnigraph owns cleanup explicitly (`optimize.rs::cleanup_all_tables`). Two parts to the fix, because `auto_cleanup` is **create-time config only and has no effect on existing datasets** (Lance `write.rs` docs): (1) `auto_cleanup: None` at all 11 `WriteParams` sites so *new* datasets store no cleanup config; (2) — the load-bearing half — `skip_auto_cleanup: true` on every commit path, because graphs created **before** the bump still carry the on-config in their datasets, and Lance's hook fires off the *dataset's stored* config at commit time (`io/commit.rs`: `if !commit_config.skip_auto_cleanup`). So the staged commit path (`commit_staged` → `CommitBuilder::with_skip_auto_cleanup(true)`), the `__manifest` publisher (`MergeInsertBuilder::skip_auto_cleanup(true)`), and the direct `WriteParams` paths all skip the hook. Without this, an upgraded graph would still auto-cleanup and delete `__manifest`-pinned versions. Pinned by `lance_surface_guards.rs::skip_auto_cleanup_suppresses_version_gc` (negative control + with-skip survival).
|
||||
- **Lance #6658 SHIPPED in 7.0.0** (`DeleteBuilder::execute_uncommitted`, exposed via PR #6781) → MR-A (migrate `delete_where` to the staged two-phase API, retire the parse-time D2 rule) is now **unblocked**, tracked separately (dev-graph `iss-950`). The bump itself keeps `delete_where` inline; the `_compile_delete_result_field_shape` guard is left untouched until MR-A.
|
||||
- **The unenforced primary key is now immutable once set** (`lance::dataset::transaction`, ~L2472–2480: `if !primary_key_before.is_empty() && (writes_primary_key || primary_key_after != primary_key_before) → "the unenforced primary key is a reserved key and cannot be changed once set"`). omnigraph marks `__manifest.object_id` as the unenforced PK (`lance-schema:unenforced-primary-key`) for merge-insert row-level CAS — baked into `manifest_schema()` at init, and added by the `migrate_v1_to_v2` internal-schema migration for pre-v0.4.0 graphs. The migration relied on Lance 6's idempotent re-apply for crash-recovery (a crash after the field-set but before the stamp bump re-enters the migration with the PK already present); under v7 that re-apply errors, so a real v1 graph could never finish migrating. Fixed by guarding the set on the manifest's unenforced-PK field (`db/manifest/migrations.rs::migrate_v1_to_v2`): `["object_id"]` → no-op, `[]` → set, any other PK field → loud refusal (the wrong CAS key, unchangeable under v7). Pinned by `lance_surface_guards.rs::unenforced_primary_key_is_immutable_once_set` (red if Lance relaxes immutability); regression: `db::manifest::tests::test_publish_migrates_pre_stamp_manifest_to_current_version` (was red under v7).
|
||||
- **Native `DirectoryNamespace` no longer recognizes omnigraph's manifest-tracked tables** (`lance-namespace-impls` dir.rs ~L1310): `list/describe/create_table_version` route through `check_table_status`, which reports an omnigraph table absent → `TableNotFound`. The decoupling is *contingent on omnigraph's legacy boolean PK key*, not an unconditional v7 property: v7's namespace eagerly adds the new `lance-schema:unenforced-primary-key:position` key to any `__manifest` lacking it; that write hits the immutable-PK rule above (the boolean key already set the PK), so `ensure_manifest_table_up_to_date` errors and the namespace silently falls back to directory listing. omnigraph keeps the boolean key deliberately — Lance honors it permanently (maps to PK position 0), and one uniform on-disk format beats a new-vs-old split (existing graphs can't be re-keyed to the position key under that same immutability rule). omnigraph production never uses Lance's native namespace (its publisher writes `__manifest` directly via merge_insert; its own `namespace.rs` impls are custom), so this is test-only — the `test_directory_namespace_direct_publish_cannot_replace_native_omnigraph_write_path` surface guard was realigned to the v7 behavior (it now asserts the native namespace is fully decoupled, which only strengthens the guard's thesis).
|
||||
- **Still NOT fixed in 7.0.0:** vector-index two-phase (Lance #6666 open): `create_vector_index` inline residual retained; blob-column compaction: the `compact_files_still_fails_on_blob_columns` guard still passes (it turns red only once Lance fixes the bug), so `optimize` still skips blob tables behind `LANCE_SUPPORTS_BLOB_COMPACTION`.
|
||||
- **No Lance API surface omnigraph uses changed at *compile* time** (the only compile break was object_store) — but **two runtime behaviors did** (the unenforced-PK immutability and the native-namespace `TableNotFound`, above), each caught by the full engine test suite rather than the build. `CleanupPolicy`, `WriteParams` (apart from the `auto_cleanup` default), `CompactionOptions`, the namespace models (resolved via `lance-namespace-reqwest-client` 0.7.7, unchanged across the bump), `Operation`, `ManifestLocation`, and `MergeInsertBuilder` shapes are all stable. Lesson: a clean build is not a clean alignment — run `cargo test --workspace` before declaring a Lance bump done.
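
The three-way guard described in the unenforced-PK bullet above (`["object_id"]` → no-op, `[]` → set, anything else → refusal) can be sketched as follows. This is only an illustration of the decision shape, not the code in `db/manifest/migrations.rs`: the `PkAction` enum, the `plan_pk_step` name, and the error wording are hypothetical, and the real migration reads the PK fields from the Lance schema rather than from a string slice.

```rust
/// What the v1 -> v2 migration should do about the manifest's unenforced PK.
/// Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
enum PkAction {
    /// The PK is already `["object_id"]` (e.g. a crash after the field-set but
    /// before the stamp bump): re-entering the migration must not touch it.
    AlreadySet,
    /// No PK recorded yet: the migration sets it.
    Set,
}

/// Decide the PK step from the manifest's current unenforced-PK fields.
/// Hypothetical helper; the input stands in for the fields read from the schema.
fn plan_pk_step(current_pk: &[&str]) -> Result<PkAction, String> {
    match current_pk {
        [] => Ok(PkAction::Set),
        ["object_id"] => Ok(PkAction::AlreadySet),
        // Any other key is the wrong CAS key and cannot be changed once set,
        // so the only safe answer is to refuse loudly.
        other => Err(format!(
            "manifest unenforced primary key is {other:?}, expected [\"object_id\"]; refusing to migrate"
        )),
    }
}
```

The point of the shape is that the re-entry case is decided by inspecting state rather than by relying on the write being idempotent, which is the property the Lance 7 immutability rule removed.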
|
||||
|
||||
Bump this date stanza on the next alignment pass.
|
||||
|
||||
### Prior alignment audit: 2026-05-22 (Lance 6.0.1 upstream; omnigraph pinned at 6.0.1)
|
||||
|
||||
Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53, Arrow 57 → 58, lance-tokenizer 6.0.1 added, tantivy* removed). Direct 4 → 6 jump; v5.x was not used as an intermediate (rationale in `~/.claude/plans/shimmering-percolating-duckling.md`). Behavior-affecting findings:
|
||||
|
||||
|
|
@ -169,6 +186,7 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53,
|
|||
- **`Dataset::checkout_version(N).await?.restore().await?`**: `restore()` takes `&mut self` and returns `Result<()>` (mutates in place, does not consume + return a new dataset). The recovery rollback hammer at `db/manifest/recovery.rs:505-522` continues to work. Pinned by `lance_surface_guards.rs::_compile_checkout_version_then_restore_signature`.
|
||||
- **`DatasetBuilder::from_namespace(...).with_branch(...).with_version(...).load()`** surface preserved (the namespace builder chain at `db/manifest/namespace.rs:162-174`). Pinned by `lance_surface_guards.rs::_compile_dataset_builder_from_namespace_signature`.
|
||||
- **`compact_files(&mut ds, CompactionOptions::default(), None)`** signature stable. `CompactionOptions` still does not expose `data_storage_version`; `compact_files` builds its own `WriteParams { ..Default::default() }`. Note: `LanceFileVersion::default()` is now V2_1 in v6, so optimize-rewritten fragments come out at V2_1 by default (was V2_0 in v4). Existing explicit V2_2 pins on creates/appends still apply.
|
||||
- **`Dataset::optimize_indices(&mut self, &lance_index::optimize::OptimizeOptions)`** (via `DatasetIndexExt`) is a depended-on surface as of the index-coverage work: `db/omnigraph/optimize.rs` calls it after `compact_files` to fold appended/rewritten fragments into existing indexes (incremental merge, not retrain). It is a **committing** call (mutates in place, advances HEAD; no uncommitted variant in v6.0.1), so optimize treats it as an inline-commit residual under the `SidecarKind::Optimize` recovery sidecar. Signature pinned by `lance_surface_guards.rs::_compile_optimize_indices_signature`; the incremental-coverage behavior pinned by `optimize_indices_extends_fragment_coverage` (appended fragment uncovered before, covered after).
|
||||
- **`Dataset::delete(predicate)` returns `DeleteResult { new_dataset: Arc<Dataset>, num_deleted_rows: u64 }`** — unchanged shape. Pinned by `lance_surface_guards.rs::_compile_delete_result_field_shape`. MR-A will repurpose this guard to the staged two-phase variant once `DeleteBuilder::execute_uncommitted` migration lands.
|
||||
- **File reader read methods now async** (Lance PR #6710, v6.0). No effect — omnigraph reaches Lance exclusively through `Dataset::scan` and the staged-write API.
|
||||
- **Tokenizer vendored as `lance-tokenizer`** (Lance PR #6512, v6.0). No effect — no direct tokenizer imports.
|
||||
|
|
@ -178,6 +196,4 @@ Migration from Lance 4.0.0 → 6.0.1 landed in this cycle (DataFusion 52 → 53,
|
|||
- **`Dataset::force_delete_branch`** (`branches().delete(name, force=true)`, dataset.rs:524) tolerates a missing branch-*contents* ref (vs plain `delete_branch`'s `RefNotFound`), but on the local store still errors `NotFound` if the branch `tree/` directory is fully absent (`remove_dir_all`'s NotFound is not caught for Lance's native error variant, refs.rs:526-549). Both variants still refuse a branch with referencing descendants (`RefConflict`). `TableStore::force_delete_branch` wraps this to be fully idempotent (tolerates already-absent). The single-authority branch-delete redesign uses it for orphan reclamation (eager best-effort reclaim + cleanup reconciler). Pinned by `lance_surface_guards.rs::force_delete_branch_semantics`. Branch delete is "flip the ref atomically, then `remove_dir_all(tree/{branch})`"; branch-exclusive data lives under `tree/{branch}/` so a drop reclaims it immediately without touching `main`.
|
||||
- **Lance blob-v2 `compact_files` bug** (no public issue found as of 2026-06): `compact_files` disables binary-copy for blob datasets and forces `BlobHandling::AllBinary` on the read side; the v2.1+ structural decoder then mis-counts column infos for the blob-v2 struct and fails with `Invalid user input: there were more fields in the schema than provided column indices / infos` (`lance-encoding/src/decoder.rs::ColumnInfoIter::expect_next`). This fails even a pristine uniform-V2_2 multi-fragment blob table; vector/list/scalar/ragged columns and mixed file versions all compact fine. Reads/queries use descriptor handling (`BlobHandling::default()`) and are unaffected. `optimize` skips blob-bearing tables behind `LANCE_SUPPORTS_BLOB_COMPACTION = false` (`db/omnigraph/optimize.rs`), reporting `SkipReason::BlobColumnsUnsupportedByLance`. Pinned by `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns`, which turns red when the bug is fixed → flip the gate, remove the skip branch + the `maintenance.rs::optimize_skips_blob_table_and_reports_skip` skip assertions.
|
||||
|
||||
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
|
||||
|
||||
Bump this date stanza on the next alignment pass.
|
||||
Surface guards added: `crates/omnigraph/tests/lance_surface_guards.rs` (10 named guards; 5 runtime + 5 compile-only; plus the index-coverage work's `_compile_optimize_indices_signature` and `optimize_indices_extends_fragment_coverage`). Future Lance bumps re-run this file first as the smoke check. Two additional guards from the original plan deferred to follow-up (`manifest_cas_returns_row_level_contention_variant` needs full publisher-race harness; `table_version_metadata_byte_compatible_with_v4` needs `pub(crate)` reach extension).
|
||||
|
|
|
|||
|
|
@ -348,4 +348,4 @@ Callers move at their own pace. The envelope upgrades + URL rename ship in v0.6.
|
|||
- RFC 8288 (`Link` relations, `successor-version`)
|
||||
- MCP spec: [modelcontextprotocol.io](https://modelcontextprotocol.io)
|
||||
- [invariants.md](./invariants.md) — substrate boundaries this work respects
|
||||
- [../user/server.md](../user/server.md) — current HTTP surface (post-MR-656 picks up the `/query`+`/mutate` rename and deprecation)
|
||||
- [../user/server.md](../user/operations/server.md) — current HTTP surface (post-MR-656 picks up the `/query`+`/mutate` rename and deprecation)
|
||||
|
|
|
|||
|
|
@ -68,7 +68,7 @@ anything moves — mirroring the storage collapse, where the pinned contract
|
|||
tests gated the swap, and the test-monolith modularization (#192/#193), which
|
||||
makes Phase 3 tractable: the CLI dispatch is 1,184 lines today, not 4,200.
|
||||
|
||||
### Phase 1 — Parity matrix (the referee; do first, no refactor)
|
||||
### Phase 1 — Parity matrix (the referee; do first, no refactor) *(landed)*
|
||||
|
||||
A CLI integration test (extend the `system_local.rs` harness, which already
|
||||
spawns both binaries): one fixture graph; for every forked verb, run the
|
||||
|
|
@ -81,7 +81,16 @@ This pins today's behavior so Phase 3 can't silently change it, and catches
|
|||
every future fork drift. It also incidentally covers utoipa annotation↔route
|
||||
mismatches (a lying `#[utoipa::path]` makes the remote leg 404).
|
||||
|
||||
### Phase 2 — One wire-DTO crate
|
||||
**Phase 1 outcome (landed):** `crates/omnigraph-cli/tests/parity_matrix.rs`
|
||||
— 11 rows green with an **empty divergence ledger**: with matched Cedar
|
||||
policy on both arms, embedded and remote agree on every forked verb's
|
||||
scrubbed JSON and exit codes. Two findings along the way: like-for-like
|
||||
requires the same policy bundle on both arms (a tokens-only server is
|
||||
default-deny by design — the harness encodes this), and inline execution's
|
||||
unbound-param matches-all vs the invoke path's hard error is a cross-path
|
||||
asymmetry, filed as #207 and pinned (not repaired) by the matrix.
|
||||
|
||||
### Phase 2 — One wire-DTO crate *(landed)*
|
||||
|
||||
Move the HTTP request/response types and the single `engine result → DTO`
|
||||
mapping per verb into a shared crate (working name `omnigraph-api-types`),
|
||||
|
|
@ -113,6 +122,15 @@ neither axum nor the engine's internals. The engine crate does not depend on
|
|||
it — the `engine result → DTO` mapping lives in the shared crate (or the CLI/
|
||||
server side), taking engine result types as input.
|
||||
|
||||
**Phase 2 outcome (landed):** `crates/omnigraph-api-types` holds the wire
|
||||
DTOs + their `engine-result → DTO` mappings; `omnigraph-server::api` is a
|
||||
`pub use` re-export (so `openapi.json` is byte-identical — the referee
|
||||
passed with zero diff), and the CLI consumes the crate directly. One
|
||||
deliberate refinement of the original sketch: `LoadOutput` is a rendered
|
||||
CLI output type, not a wire DTO, so it stayed CLI-side — both its mappings
|
||||
(local `LoadResult`, remote `IngestOutput`) now sit together in
|
||||
`output.rs`. The parity matrix passed textually unchanged.
|
||||
|
||||
### Phase 3 — `GraphClient` trait, two implementations
|
||||
|
||||
```text
|
||||
|
|
@ -143,15 +161,20 @@ and cluster commands must work with the server down) explicit in code.
|
|||
"Server" targets include operator-config named servers (RFC-007), not only
|
||||
literal `http(s)://` URIs.
|
||||
|
||||
### Phase 5 — Route alignment
|
||||
### Phase 5 — Route alignment *(landed)*
|
||||
|
||||
Add a canonical `/load` endpoint (the handler already exists behind the
|
||||
`/ingest` shim); point `RemoteClient` at it; keep `/ingest` on its existing
|
||||
deprecation path. While here, check whether the server uses `utoipa-axum`'s
|
||||
router-coupled registration (`OpenApiRouter`/`routes!`); if it hand-mounts
|
||||
routes beside `#[utoipa::path]` annotations, prefer migrating registration so
|
||||
path annotations and mount points are the same declaration (the modularization
|
||||
already hit one orphaned-attribute incident of exactly this class).
|
||||
Added a canonical `POST /load` (shared `run_ingest` body; the deprecated
|
||||
`/ingest` is now a thin alias carrying `#[deprecated]` + RFC 9745/8288
|
||||
`Deprecation`/`Link: </load>` headers, exactly mirroring `/mutate`↔`/change`)
|
||||
and pointed the CLI's remote `load` arm at it; `/ingest` stays on its
|
||||
deprecation path. `/load` reuses `IngestRequest`/`IngestOutput` (as canonical
|
||||
`/mutate` reuses `Change*`); a DTO rename is a separate change.
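
As a rough sketch of what the deprecated alias adds to its response, the header pair could be built like this. The function name, the timestamp argument, and the `rel="successor-version"` relation are assumptions made for illustration (the relation is taken from the RFC 8288 reference listed earlier in these docs); the actual server handler may construct its headers differently.

```rust
/// Hypothetical helper: headers a deprecated alias route attaches to its response.
/// `deprecated_since` is a Unix timestamp; RFC 9745 writes the date as `@<seconds>`.
fn deprecation_headers(
    deprecated_since: u64,
    successor_path: &str,
) -> Vec<(&'static str, String)> {
    vec![
        // RFC 9745: the Deprecation field carries a structured-field Date.
        ("Deprecation", format!("@{deprecated_since}")),
        // RFC 8288: a Link header pointing at the canonical replacement route.
        ("Link", format!("<{successor_path}>; rel=\"successor-version\"")),
    ]
}
```

Calling `deprecation_headers(ts, "/load")` for the `/ingest` alias and `deprecation_headers(ts, "/mutate")` for `/change` would keep the two deprecations textually parallel, which is the mirroring the paragraph above describes.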
|
||||
|
||||
Registration finding: the server **hand-mounts** routes (`.route(...)`) beside a
|
||||
manual `#[openapi(paths(...))]` list, not `utoipa-axum`'s `OpenApiRouter`/
|
||||
`routes!`. This PR followed the existing manual pattern (one `.route` + one
|
||||
`paths(...)` entry + the `#[utoipa::path]` annotation) rather than migrating
|
||||
registration — the migration is a worthwhile but orthogonal cleanup, deferred.
|
||||
|
||||
## Non-goals
|
||||
|
||||
|
|
|
|||
449
docs/dev/rfc-010-cli-planes-restructure.md
Normal file
|
|
@ -0,0 +1,449 @@
|
|||
# RFC: Restructure the CLI Around Explicit Planes
|
||||
|
||||
**Status:** Proposed
|
||||
**Date:** 2026-06-13
|
||||
**Audience:** CLI/server/cluster maintainers
|
||||
**Builds on:** [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md)
(Phases 3a–3c landed — the embedded/remote data-plane fork is now one
`GraphClient` enum; this RFC **expands RFC-009 Phase 4** from a narrow
embedded-vs-remote capability table into the full plane model, and leaves
Phase 5 route alignment where it is),
[rfc-007-operator-config.md](rfc-007-operator-config.md) (operator
`--server`/`--graph`/`--target` addressing — the surfaces this RFC makes
uniform across planes),
[rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md).
|
||||
**Sequencing:** post-v0.7.0, after RFC-009 Phase 3c (done).
|
||||
|
||||
## Summary
|
||||
|
||||
The CLI silently spans **three planes** — data, storage/maintenance, and
|
||||
control — and forces the operator to know which plane each verb lives on *and*
|
||||
address a graph differently per plane. The same graph you query as
|
||||
`--server prod --graph knowledge` you must maintain as
|
||||
`s3://bucket/knowledge.omni`. Plane restrictions (`graphs list` is server-only,
|
||||
`optimize` is storage-only) are *accidental* — discovered by hitting a cryptic
|
||||
error, not *declared*.
|
||||
|
||||
This RFC makes the plane model **explicit and coherent** with three moves:
|
||||
|
||||
1. **One graph-addressing model** across every verb (`--target`/`--graph`/
   positional URI/`--server`), resolving to a storage URI for maintenance and a
   remote client for data — instead of two different ways to name one graph.
2. **A declared, per-subcommand capability surface** (RFC-009 Phase 4): each
   verb declares its plane(s); wrong-plane invocations get an honest "this is
   storage-plane, `--server` doesn't apply" error from one table, not scattered
   `bail!`s.
3. **Plane-grouped `--help`** so the model is legible at a glance.
|
||||
|
||||
No new server feature. Storage maintenance stays off the wire — deliberately.
|
||||
|
||||
## Current state of affairs
|
||||
|
||||
The CLI has 23 top-level commands. They divide into three planes, addressed
|
||||
three different ways:
|
||||
|
||||
| Plane | Verbs | Reaches the graph by | Addressing surface |
|---|---|---|---|
| **Data** | `query`, `mutate`, `load`, `ingest`, `branch *`, `snapshot`, `export`, `commit *`, `schema show/apply` (and `graphs list`, **remote-only today** — see note) | embedded engine **or** HTTP server (one `GraphClient`) | positional URI **or** `--target` / `--graph` / `--server` (config aliases) |
| **Storage / maintenance** | `init`, `optimize`, `repair`, `cleanup`, `schema plan`, `queries validate` | embedded engine **only**, directly on storage (`file://` or `s3://`) | positional URI **or** `--target` — **no `--server` / `--graph`** (except `init`, which today takes **only a required positional URI** — no `--target`) |
| **Control** | `cluster validate/plan/apply/approve/status/refresh/import/force-unlock` | a cluster **directory** (`file://` or `s3://`), not a graph URI | `--config <dir>` |
|
||||
|
||||
### What's confusing (validated facts)
|
||||
|
||||
1. **Two names for one graph.** Data verbs resolve `--server prod --graph
|
||||
knowledge` through `GraphClient::resolve*` (the embedded/remote fork collapsed
|
||||
in RFC-009 Phases 3a–3c; only the two `GraphClient` factories call
|
||||
`apply_server_flag`). Maintenance verbs instead use
|
||||
`resolve_uri`/`resolve_local_uri` and accept only a positional URI or
|
||||
`--target` — so to compact the graph you *query* as `--server prod --graph
|
||||
knowledge` you must *type* `s3://bucket/knowledge.omni`. One graph, two
|
||||
addressing vocabularies.
|
||||
|
||||
> **Note (`graphs list`).** It is routed through `GraphClient` only to share
> the addressing/token resolver; its embedded arm fails loudly, so it is
> **remote-only today** (the later capability table and *Relationship to
> RFC-009* record it as remote-now / embedded-cluster-later).
|
||||
|
||||
2. **Plane restrictions are accidental, not declared.** `graphs list` is
|
||||
server-only and `optimize`/`repair`/`cleanup`/`init` are storage-only purely
|
||||
by code shape. Point `optimize` at an `https://` URL and you get whatever
|
||||
`Omnigraph::open` says about an https URI — accidental error text that, per
|
||||
Hyrum's Law, is already someone's dependency. The capability is real but
|
||||
unstated.
|
||||
|
||||
3. **The split is per-subcommand, and the family names hide it.** `schema plan`
|
||||
is storage-only (`resolve_local_uri`) while `schema show`/`schema apply` are
|
||||
data-plane (the graph client). `queries validate` opens the graph to
|
||||
typecheck while `queries list` only reads the registry config. The plane is
|
||||
a property of the *subcommand*, not the family.
|
||||
|
||||
4. **Maintenance has no server/cluster counterpart at all.** There is no HTTP
|
||||
route and no `cluster` subcommand for `optimize`/`cleanup`/`repair` (verified:
|
||||
nothing in the server route table, nothing in `omnigraph-cluster/src`). For a
|
||||
server-backed deployment you run the *same CLI* against the storage URI,
|
||||
out-of-band from the serving process. This is correct (maintenance is
|
||||
heavyweight, destructive, single-operator — it should not be a multi-tenant
|
||||
HTTP surface), but it is **undocumented in the CLI's own shape**, so it reads
|
||||
as an omission rather than a decision.
|
||||
|
||||
5. **`init` has a hidden control-plane twin.** Bare `init` creates a single
|
||||
graph from storage; in cluster mode the equivalent is `cluster apply`
|
||||
(graph-creation stage, with ledger/recovery/approval semantics). Same intent,
|
||||
two entry points, no signpost between them.
|
||||
|
||||
6. **Flat `--help`.** All 23 commands list as one undifferentiated block, so the
|
||||
plane a verb belongs to is tribal knowledge.
|
||||
|
||||
The net effect: a new operator must already know OmniGraph's plane architecture
|
||||
to predict which flags work on which verb and how to name a graph. The CLI does
|
||||
not teach its own model.
|
||||
|
||||
## Target CLI ergonomics
|
||||
|
||||
The throughline: **you name a graph one way, and the CLI tells you what works
|
||||
where.** Simple examples of the end state:
|
||||
|
||||
### One name for a graph, everywhere
|
||||
|
||||
A config target `knowledge` works on every verb that touches that graph:
|
||||
|
||||
```bash
omnigraph query --target knowledge --query q.gq # data (embedded or remote, auto)
omnigraph load --target knowledge --data rows.jsonl # data
omnigraph optimize --target knowledge # maintenance (resolves to its storage URI)
omnigraph cleanup --target knowledge --keep 10 --confirm
omnigraph repair --target knowledge --confirm
```
|
||||
|
||||
The positional URI form still works everywhere, unchanged:
|
||||
|
||||
```bash
omnigraph optimize s3://bucket/knowledge.omni
```
|
||||
|
||||
### Data plane: same command, embedded or remote
|
||||
|
||||
You don't pick "local vs server" syntax — resolution decides:
|
||||
|
||||
```bash
omnigraph query ./local.omni --query q.gq # opens engine directly
omnigraph query --server prod --graph knowledge --query q.gq # over HTTP
omnigraph query --target knowledge --query q.gq # whichever the config says
```
|
||||
|
||||
### Maintenance: `--target` must resolve to direct storage (loud if not)
|
||||
|
||||
```bash
$ omnigraph optimize --target prod
error: `--target prod` resolves to a remote server (https://prod…).
`optimize` is a storage-plane command and needs direct storage access.
Pass the graph's s3://… URI, or use --cluster <dir> --graph <id>.
```
|
||||
|
||||
Cluster-managed graphs get an explicit, intentional path (no implicit
|
||||
`cluster.yaml` peeking):
|
||||
|
||||
```bash
omnigraph optimize --cluster ./cluster --graph knowledge
```
|
||||
|
||||
### Wrong-plane = one honest, stable error
|
||||
|
||||
```bash
$ omnigraph optimize --server prod
error: `optimize` is a storage-plane command; `--server` addresses the data
plane and does not apply here. Use --target <name> or a storage URI.

$ omnigraph graphs list ./local.omni
error: `graphs list` needs a remote multi-graph server (http/https) today.
(Embedded cluster-catalog enumeration is planned — RFC-009.)
```
|
||||
|
||||
### `--help` teaches the model
|
||||
|
||||
```
DATA PLANE run against a graph (embedded or --server)
  query mutate load branch snapshot export commit schema show schema apply

STORAGE / MAINTENANCE direct storage access; no server
  init optimize repair cleanup schema plan queries validate

CONTROL PLANE manage a cluster directory
  cluster

INSPECT / SESSION
  graphs list queries list lint policy embed login logout config
```
|
||||
|
||||
### Exceptions, signposted (not silent)
|
||||
|
||||
```bash
omnigraph init --schema s.pg ./new.omni # plain path: fine

$ omnigraph init --target knowledge --schema s.pg # cluster-managed target: redirected
error: `knowledge` is a cluster-managed graph. Create it via `cluster apply`
(which records ledger + recovery + approvals), not `init`.
```
|
||||
|
||||
**In one line:** one way to name a graph, the right flags accepted per verb, and
|
||||
a CLI that tells you its planes instead of making you memorize them.
|
||||
|
||||
## Proposed shape (mechanism)
|
||||
|
||||
### One addressing model for every graph-addressing verb
|
||||
|
||||
Route **all** graph-addressing verbs — data *and* maintenance — through one
|
||||
resolver that turns `(positional URI | --target | --graph | --server)` into
|
||||
either a **storage URI** (`file://`/`s3://`) → embedded execution, or a **remote
|
||||
`GraphClient`** → HTTP execution, per the verb's declared plane.
|
||||
|
||||
**Authority rule (the precedence must not be silent).** `--target` is an
|
||||
operator/legacy target lookup; `cluster.yaml` is a *different* authority surface
|
||||
(read only by `cluster` commands and `--cluster` boot). A maintenance verb must
|
||||
not quietly consult both and invent a precedence. The rule:
|
||||
|
||||
- A maintenance verb's `--target` resolves through the **operator/legacy**
|
||||
config and its URI must already be **direct storage**; a target that resolves
|
||||
to a remote (`http(s)://`) URL **fails loudly** (see the example above).
|
||||
- **Cluster-managed graphs are addressed explicitly** via a cluster-root +
|
||||
graph-id pair (spelled `--cluster <dir> --graph <id>` for illustration), so
|
||||
reading cluster state is an intentional mode — never an implicit fallback
|
||||
between operator config and `cluster.yaml`.
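
The first half of the authority rule (a maintenance `--target` must already point at direct storage) can be sketched as below. Everything here is illustrative: `StorageUri`, `resolve_maintenance_target`, and the `lookup` callback are hypothetical stand-ins for the operator/legacy config lookup, and the error wording only approximates the example shown under *Target CLI ergonomics*.

```rust
/// A URI the embedded engine can open directly (`file://`, `s3://`, or a path).
/// Hypothetical newtype, for illustration only.
#[derive(Debug, PartialEq, Eq)]
struct StorageUri(String);

/// Resolve `--target <name>` for a storage-plane verb.
/// `lookup` stands in for the operator/legacy config; it never reads `cluster.yaml`,
/// so there is no implicit precedence between the two authority surfaces.
fn resolve_maintenance_target(
    verb: &str,
    target: &str,
    lookup: &dyn Fn(&str) -> Option<String>,
) -> Result<StorageUri, String> {
    let uri = lookup(target).ok_or_else(|| format!("unknown target `{target}`"))?;
    // A remote URL is a data-plane address; a storage-plane verb must fail loudly.
    if uri.starts_with("http://") || uri.starts_with("https://") {
        return Err(format!(
            "`--target {target}` resolves to a remote server ({uri}). \
             `{verb}` is a storage-plane command and needs direct storage access."
        ));
    }
    Ok(StorageUri(uri))
}
```

Keeping the config lookup behind a single callback makes the "one authority per mode" rule visible in the signature: a cluster-managed resolver would be a separate function with its own explicit inputs, not a fallback inside this one.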
|
||||
|
||||
> **Flag-shape caveat (deferred).** `--graph` is *already* a global flag that
> `requires = "server"` and appends `/graphs/<id>` to a **remote** URL — a
> different meaning, and clap won't permit `--graph` without `--server`. So the
> cluster-maintenance addressing needs either a distinct flag (e.g.
> `--cluster-graph <id>`) or an explicit global-flag migration. This is why
> the cluster-managed resolver is **deferred to a later slice** (it also rides
> the applied-state-vs-declared-config open question below); the
> operator/legacy `--target` path lands first.
|
||||
|
||||
### A declared, per-subcommand capability surface (RFC-009 Phase 4, expanded)
|
||||
|
||||
One table, **per subcommand** (family-level rows hide exactly the cases the
|
||||
table exists to make non-accidental):
|
||||
|
||||
| Command | Data (embedded) | Data (remote) | Storage (direct) | Config / session | Notes |
|---|---|---|---|---|---|
| `query`, `mutate`, `load`, `ingest` | ✅ | ✅ | — | — | `ingest` is the deprecated alias of `load` |
| `branch create/list/delete/merge` | ✅ | ✅ | — | — | |
| `snapshot`, `export`, `commit list/show` | ✅ | ✅ | — | — | |
| `schema show` | ✅ | ✅ | — | — | |
| `schema apply` | ✅ | ✅ | — | — | declarative alternative: `cluster apply` |
| `schema plan` | — | — | ✅ | — | local resolver today |
| `queries validate` | — | — | ✅ | — | opens the graph to typecheck |
| `init` | — | — | ✅ | — | cluster-managed graphs → `cluster apply` |
| `optimize`, `repair`, `cleanup` | — | — | ✅ | — | |
| `graphs list` | (later) | ✅ | — | — | remote today; embedded-cluster later (RFC-009) |
| `queries list` | — | — | — | ✅ | reads the registry config; no graph |
| `lint` | — | — | ✅ | ✅ | `--schema` file, or opens a local graph |
| `policy validate/test/explain` | — | — | — | ✅ | reads policy files + config |
| `embed` | — | — | — | ✅ | local tooling (files + embedding API) |
| `login`, `logout`, `config`, `version` | — | — | — | ✅ | session / config; no graph |
|
||||
|
||||
The resolver consults this table. A wrong-plane invocation produces one honest,
|
||||
stable message instead of N ad-hoc `bail!`s and accidental `open` errors.
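
One possible shape for that table, as a minimal sketch: a const slice keyed by subcommand plus a single check function that every wrong-plane error flows through. The `Plane` enum, the `CAPABILITIES` const, and `check_plane` are hypothetical names, only a handful of rows are reproduced, and the message text is a placeholder rather than the strings this RFC intends to pin.

```rust
/// The columns of the capability table. Hypothetical enum, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Plane {
    DataEmbedded,
    DataRemote,
    Storage,
    Config,
}

/// A few rows of the per-subcommand table (not the full list).
const CAPABILITIES: &[(&str, &[Plane])] = &[
    ("query", &[Plane::DataEmbedded, Plane::DataRemote]),
    ("schema show", &[Plane::DataEmbedded, Plane::DataRemote]),
    ("schema plan", &[Plane::Storage]),
    ("optimize", &[Plane::Storage]),
    ("graphs list", &[Plane::DataRemote]),
    ("queries list", &[Plane::Config]),
];

/// Single choke point for plane checks: one lookup, one error shape.
fn check_plane(command: &str, requested: Plane) -> Result<(), String> {
    let (_, planes) = CAPABILITIES
        .iter()
        .find(|(name, _)| *name == command)
        .ok_or_else(|| format!("unknown command `{command}`"))?;
    if planes.contains(&requested) {
        Ok(())
    } else {
        Err(format!(
            "`{command}` does not run on the {requested:?} plane; it supports {planes:?}"
        ))
    }
}
```

Because the rows are keyed by subcommand rather than family, `schema plan` and `schema show` can disagree without special cases, and the same const could later feed grouped `--help` output or a machine-readable listing if the first open question is resolved that way.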
|
||||
|
||||
### Plane-grouped `--help`
|
||||
|
||||
Group the command list by plane (the `--help` block shown under Target CLI
|
||||
ergonomics). Cosmetic, zero behavior change, highest legibility-per-line.
|
||||
|
||||
### Maintenance stays off the wire (decision, not omission)
|
||||
|
||||
This RFC **does not** add server routes for `optimize`/`cleanup`/`repair`:
|
||||
|
||||
- **Serving = the server.** Multi-tenant, safe-for-many-callers data plane.
|
||||
- **Storage maintenance = the CLI against storage**, addressed uniformly,
|
||||
run by an operator or a scheduled job with storage access.
|
||||
|
||||
Adding maintenance-over-HTTP would re-introduce a heavyweight, destructive
|
||||
multi-tenant surface and *add* a plane rather than clarify the three we have.
|
||||
A future cluster-driven maintenance reconciler (scheduled compaction/GC as a
|
||||
control-plane policy) is explicitly **out of scope** — net-new design (who runs
|
||||
it, with what resource bounds), not a CLI restructure.
|
||||
|
||||
### `init` is an explicit exception (decision)
|
||||
|
||||
Direct-storage `init` against a plain URI/target stays. But if a target resolves
|
||||
to a **cluster-managed** graph root, `init` **refuses and signposts** `cluster
|
||||
apply` (which records ledger, recovery, and approval artifacts) rather than
|
||||
initializing that root out of band. This closes the "hidden twin" of the current
|
||||
state.
|
||||
|
||||
## Compatibility
|
||||
|
||||
Additive and low-risk:
|
||||
|
||||
- **`--target`/`--graph` on maintenance verbs** is new capability; the positional
|
||||
URI form keeps working unchanged.
|
||||
- **Grouped `--help`** is cosmetic.
|
||||
- **Capability-surface error text** changes the message you get on a wrong-plane
|
||||
or misaddressed invocation. Per Hyrum's Law that text is observable; the change
|
||||
is deliberate, release-noted, and replaces an *accidental* `Omnigraph::open`
|
||||
string with a *stable, declared* one — a net improvement, but flagged.
|
||||
|
||||
No engine, server, or wire-protocol change. The work is CLI-internal: the shared
|
||||
resolver, the capability table, and help grouping.
|
||||
|
||||
## Test plan
|
||||
|
||||
Extend the existing CLI suites rather than adding a duplicate harness:
|
||||
|
||||
- **`parity_matrix.rs`** — capability exclusions (the per-subcommand plane table
|
||||
becomes the source of truth for which verbs are remote-only / storage-only).
|
||||
- **`cli_data.rs`** — maintenance wrong-plane errors (`optimize --server`,
|
||||
`optimize --target <remote>`), and `--target` resolving to direct storage.
|
||||
- **`cli_schema_config.rs`** — `graphs list` plane behavior, `schema plan`
|
||||
vs `schema show/apply` plane split, and plane-grouped `--help` output.
|
||||
- **`system_local.rs`** — `--server` / operator-targeting edge cases end-to-end.
|
||||
|
||||
Pin the new wrong-plane error strings deliberately: this RFC is intentionally
|
||||
replacing accidental `Omnigraph::open` strings with stable capability errors, and
|
||||
those strings become observable behavior (Hyrum).
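
As a rough illustration of what pinning an error string means in practice, the sketch below compares the full message rather than a substring. The function name `wrong_plane_error` and the message wording are hypothetical, invented for this example; they are not the strings the CLI emits.

```rust
/// Hypothetical stable capability error for a direct-storage verb that was
/// given a server address. The wording is illustrative only.
pub fn wrong_plane_error(verb: &str, flag: &str) -> String {
    format!(
        "{} is a direct-storage command and cannot run with {}; pass a storage URI",
        verb, flag
    )
}

#[cfg(test)]
mod tests {
    use super::wrong_plane_error;

    #[test]
    fn optimize_with_server_flag_error_is_pinned() {
        // Compare the whole string: any rewording must then be a deliberate,
        // release-noted change instead of an accidental one.
        assert_eq!(
            wrong_plane_error("optimize", "--server"),
            "optimize is a direct-storage command and cannot run with --server; pass a storage URI"
        );
    }
}
```

An exact-match assertion is stricter than a `contains` check, which is the point: once the text is observable behavior, the test should fail on any drift.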
|
||||
|
||||
## Relationship to RFC-009
|
||||
|
||||
RFC-009 Phase 4 was scoped as "declared plane capabilities" for the
|
||||
embedded-vs-remote axis only. This RFC **subsumes and broadens** that phase into
|
||||
the full three-plane, per-subcommand model (adds uniform maintenance addressing,
|
||||
the authority rule, and help grouping). RFC-009 Phase 5 (remote `load` →
|
||||
`/load` route alignment) is unaffected and remains in RFC-009.
|
||||
|
||||
**`graphs list` reconciliation:** RFC-009's answered open question (pinned in
|
||||
`parity_matrix.rs`'s exclusions comment) targets `graphs list` becoming
|
||||
Both-capability once the embedded arm enumerates the cluster catalog. This RFC
|
||||
**aligns** with that rather than superseding it: the capability table shows
|
||||
`graphs list` as remote today, embedded-cluster later.
|
||||
|
||||
## Open questions
|
||||
|
||||
1. **Capability-table location** — a CLI-internal const, or surfaced (e.g. in
|
||||
`--help` and a machine-readable `omnigraph capabilities` for tooling)?
|
||||
2. **`--cluster <dir> --graph <id>` for maintenance** — does the maintenance
|
||||
command resolve the storage URI from the applied cluster state, or from the
|
||||
declared `cluster.yaml`? (Applied state is the truth the server serves;
|
||||
declared config may be ahead of it.)
|
||||
|
||||
## Review comments (Codex, 2026-06-13)
|
||||
|
||||
Overall take: the direction is right. The planes already exist; making them
|
||||
declared in code, help text, and error messages should reduce operator surprise.
|
||||
Keeping storage maintenance off HTTP is also the right boundary: `optimize`,
|
||||
`repair`, and `cleanup` are direct-storage operator actions, not a multi-tenant
|
||||
serving surface.
|
||||
|
||||
Before implementation, tighten these points:
|
||||
|
||||
1. **Resolver authority needs a sharper rule.** The proposal says maintenance
|
||||
resolves storage URIs "from `cluster.yaml` / operator config", but those are
|
||||
different authority surfaces. Today `--target` is an operator/legacy
|
||||
graph-target lookup; cluster config is read by `cluster` commands and by
|
||||
`--cluster` server boot. Do not make a maintenance command silently consult
|
||||
both and pick a precedence. Either:
|
||||
- `--target` on maintenance means an operator/legacy target whose URI is
|
||||
already direct storage, with remote targets failing loudly; or
|
||||
- add an explicit cluster-root/config resolver for this case, so reading
|
||||
cluster state is an intentional mode.
|
||||
|
||||
**Resolution (accepted):** both — `--target` resolves through operator/legacy
|
||||
config and must be direct storage (remote → loud fail); cluster-managed graphs
|
||||
use the explicit `--cluster <dir> --graph <id>` resolver. See *Authority
|
||||
rule* under Proposed shape.
|
||||
|
||||
2. **`graphs list` conflicts with RFC-009's target shape.** This RFC classifies
|
||||
`graphs list` as remote-only, while RFC-009's answered open question says it
|
||||
becomes Both-capability once the embedded arm enumerates the cluster catalog.
|
||||
Pick one direction here: either this RFC explicitly supersedes that target,
|
||||
or the capability table should show `graphs list` as remote today and
|
||||
embedded-cluster later.
|
||||
|
||||
**Resolution (accepted):** align, don't supersede. The table shows `graphs
|
||||
list` remote-today / embedded-cluster-later. See *Relationship to RFC-009*.
|
||||
|
||||
3. **The capability table should be per subcommand, not per family.** The
|
||||
family-level rows hide the exact cases the table is supposed to make
|
||||
non-accidental. At minimum, call out:
|
||||
- `schema plan` as local/storage-backed today, while `schema show` and
|
||||
`schema apply` route through the graph client;
|
||||
- `queries validate` versus `queries list`, which do not have the same
|
||||
plane shape;
|
||||
- `lint`, `policy`, `embed`, `login`, `logout`, `config`, and `version`, so
|
||||
enumeration/session/tooling commands are intentionally classified instead
|
||||
of falling outside the model.
|
||||
|
||||
**Resolution (accepted):** the capability table is now per-subcommand and
|
||||
classifies every command, including the session/tooling group.
|
||||
|
||||
4. **`init` should be an explicit exception.** Direct-storage `init` is fine.
|
||||
A cluster-managed graph should be created by `cluster apply`, with ledger,
|
||||
recovery, and approval semantics. If a named target resolves to a
|
||||
cluster-managed graph root, `init` should signpost `cluster apply` rather
|
||||
than quietly initializing that root out of band.
|
||||
|
||||
**Resolution (accepted):** promoted from open question to a decision. See
|
||||
*`init` is an explicit exception*.
|
||||
|
||||
Testing notes for the implementation slice:
|
||||
|
||||
- Extend the existing CLI suites rather than adding a new duplicate harness:
|
||||
`parity_matrix.rs` for capability exclusions, `cli_data.rs` for maintenance
|
||||
wrong-plane errors, `cli_schema_config.rs` for `graphs list` / help behavior,
|
||||
and `system_local.rs` for `--server` / operator-targeting edge cases.
|
||||
- Pin the new wrong-plane error strings deliberately. This RFC is intentionally
|
||||
replacing accidental `Omnigraph::open` strings with stable capability errors,
|
||||
and those strings become observable behavior.
|
||||
|
||||
**Resolution (accepted):** captured as the *Test plan* section.
|
||||
|
||||
## Verification comments (Codex, 2026-06-13)
|
||||
|
||||
Follow-up verification against the current CLI/server code found a few
|
||||
remaining current-state nits. These are doc-shape issues, not objections to the
|
||||
proposal:
|
||||
|
||||
1. **Current-state table overstates `graphs list`.** The table under *Current
|
||||
state of affairs* still lists `graphs list` with data verbs that reach the
|
||||
graph by embedded engine or HTTP. Current code routes it through `GraphClient`
|
||||
only to share the resolver, but the embedded arm fails loudly; the later
|
||||
RFC text correctly says remote today / embedded-cluster later. Make the
|
||||
current-state row match that.
|
||||
|
||||
**Resolution (accepted):** the Data row now marks `graphs list` **remote-only
|
||||
today**, with a note that it rides `GraphClient` only to share the resolver.
|
||||
|
||||
2. **Current-state table overstates `init` addressing.** `init` is grouped with
|
||||
maintenance verbs whose addressing surface is positional URI or `--target`.
|
||||
Current `init` only accepts a required positional URI and has no `--target`
|
||||
or config path. The proposal can add that capability, but the current-state
|
||||
table should not describe it as already present.
|
||||
|
||||
**Resolution (accepted):** the Storage row now calls out that `init` takes
|
||||
**only a required positional URI** today (no `--target`); adding `--target` to
|
||||
`init` is part of the proposal, entangled with the `init`→`cluster apply`
|
||||
signpost, not current state.
|
||||
|
||||
3. **`apply_server_flag` call-site count is stale.** The text says data verbs
|
||||
resolve `--server prod --graph knowledge` through `apply_server_flag` at
|
||||
16 call sites. Current code has the fork collapsed: data verbs call
|
||||
`GraphClient::resolve*`, and only the two `GraphClient` factories call
|
||||
`apply_server_flag`. Rephrase the verified fact around `GraphClient`, not
|
||||
the old pre-collapse call-site count.
|
||||
|
||||
**Resolution (accepted):** validated-fact #1 now describes the post-collapse
|
||||
reality (`GraphClient::resolve*`; the two factories call `apply_server_flag`),
|
||||
dropping the stale count.
|
||||
|
||||
4. **`--cluster <dir> --graph <id>` collides with today's global `--graph`
|
||||
semantics.** The target ergonomics section proposes that flag shape for
|
||||
maintenance, but current `--graph` is a global flag that requires
|
||||
`--server` and appends `/graphs/<id>` to a remote server URL. Either choose
|
||||
a separate cluster-maintenance graph flag shape, or call out the clap/global
|
||||
flag migration explicitly as part of the implementation.
|
||||
|
||||
**Resolution (accepted):** the *Authority rule* now carries a flag-shape
|
||||
caveat — the cluster-managed resolver (and its flag shape, e.g.
|
||||
`--cluster-graph` vs a `--graph` migration) is **deferred to a later slice**;
|
||||
the operator/legacy `--target` path lands first. The illustrative
|
||||
`--cluster <dir> --graph <id>` spelling is marked as not-final.
|
||||
756
docs/dev/rfc-011-cli-refactoring.md
Normal file
|
|
@ -0,0 +1,756 @@
|
|||
# RFC-011: CLI refactoring — one addressing & config model
|
||||
|
||||
**Status:** Accepted — implemented (the `omnigraph.yaml` excision landed as
|
||||
#250/#251/#252; D1–D4, D6, D7, D9, D10 shipped). Two items remain: **D11**
|
||||
(server-side maintenance jobs) is gated on the bulk-data-plane RFC #219; **D5**
|
||||
(combined admin scope) stays deferred by design.
|
||||
**Date:** 2026-06-14
|
||||
**Audience:** CLI/server maintainers
|
||||
**Builds on:** [rfc-007-operator-config.md](rfc-007-operator-config.md)
|
||||
(per-operator config, keyed credentials, named servers),
|
||||
[rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md)
|
||||
(the legacy file this RFC finishes removing),
|
||||
[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md)
|
||||
(`GraphClient` — embedded ≡ remote at the execution layer),
|
||||
[rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md)
|
||||
(declared planes + the wrong-plane guard this RFC subsumes).
|
||||
**Sequencing:** lands as / after RFC-008 stage 5 (the `omnigraph.yaml` removal).
|
||||
|
||||
## Summary
|
||||
|
||||
Refactor the CLI around one coherent model once `omnigraph.yaml` is gone. The
|
||||
shape:
|
||||
|
||||
- **One ontology** (store, server, cluster; cluster config vs operator config;
|
||||
catalog; profile; capability) where each term names exactly one concept.
|
||||
- **Addressing = scope + `--graph`, with the access path *derived*.** A command
|
||||
resolves a *scope* (operator defaults, an optional named *profile*, or one
|
||||
explicit primitive address — `--store` / `--server` / `--cluster`), selects a
|
||||
graph inside it with `--graph`, and the **served-vs-direct access path falls out
|
||||
of the scope's bindings × the verb's capability** — it is never a per-command
|
||||
toggle and never inferred from a URI scheme.
|
||||
- **Served is the front door; direct storage is privileged.** The everyday scope
|
||||
is a *server* (a bearer token, no bucket credentials). Reading or writing a
|
||||
remote store/cluster directly is an explicit, credentialed, admin/break-glass
|
||||
act — never the default, never baked into everyday operator config.
|
||||
- **The CLI is stateless per command.** No `current_profile` pointer, no
|
||||
`USE`-style mode; every command is fully determined by its flags + static
|
||||
config. You *select* a graph, you do not *switch into* one.
|
||||
- **Definitions are named; payloads are passed.** Queries (`.gq`) and schema
|
||||
(`.pg`) live in the catalog and are invoked by name; params and bulk data are
|
||||
the only per-call inputs.
|
||||
|
||||
This removes `--target`, `--cluster-graph`, `--uri` scheme-dispatch, and the
|
||||
plane guard's "a `--target` that resolves to a remote URL" special case — and it
|
||||
collapses the four-plane vocabulary, for users, into a single capability rule.
|
||||
|
||||
## Motivation: the legacy file pollutes the taxonomy
|
||||
|
||||
Today the CLI exposes four overlapping addressing forms but the system has only
|
||||
three real entities; the mismatch is the whole problem, and `omnigraph.yaml` is
|
||||
the carrier:
|
||||
|
||||
1. **`--target` straddles kinds.** It resolves through the legacy
|
||||
`omnigraph.yaml` `graphs:` map (`config.rs::resolve_target_uri`), and that
|
||||
`.uri` can be a **storage location** (`file`/`s3`) *or* a **remote server**
|
||||
(`http`). One flag, two access paths with different capability and trust
|
||||
models. The wrong-plane guard's storage-plane remote rejection
|
||||
(`helpers.rs:467`) exists *only* to compensate for this overload.
|
||||
2. **Scheme-inferred transport.** `<URI>`/`--uri` has the same disease a level
|
||||
down: `is_remote_uri` (`helpers.rs:15`) silently picks embedded vs remote from
|
||||
the scheme. Transport is guessed from a string, not declared.
|
||||
3. **No single environment concept.** Defaults are smeared across the deprecated
|
||||
`omnigraph.yaml` (`cli.graph`, `server.graph`) with no clean way to name or
|
||||
switch environments.
|
||||
|
||||
Removing `omnigraph.yaml` is the moment to fix all three at once.
|
||||
|
||||
## Ontology
|
||||
|
||||
Every term is one concept. The rest of this RFC uses them precisely.
|
||||
|
||||
### Entities — the things that exist
|
||||
|
||||
- **Graph** — a typed property graph (node/edge types over Lance); the thing you
|
||||
query and mutate. *Example: the `knowledge` graph.*
|
||||
- **Store** — the storage location of a **single** graph: its Lance datasets at a
|
||||
`file://`/`s3://` URI. Addressed directly with `--store`. *Example:
|
||||
`s3://acme/clusters/brain/graphs/knowledge.omni`.*
|
||||
- **Cluster** — a storage root holding **many** graphs plus the catalog and
|
||||
control-plane state (state ledger, approvals, recovery). Managed as-code by the
|
||||
team. *Example: the `brain` cluster at `s3://acme/clusters/brain`.*
|
||||
- **Server** — an `omnigraph-server` process serving graphs over HTTP with bearer
|
||||
auth and Cedar policy; boots from a bare graph or a cluster. *Example: `prod` at
|
||||
`https://graph.example.com`, serving the `brain` cluster.*
|
||||
|
||||
### Config & catalog — the descriptions
|
||||
|
||||
- **Cluster config** — `cluster.yaml` in the cluster root, declaring the **desired
|
||||
state** (graphs, schemas, stored queries, policies, storage), applied with
|
||||
`cluster apply`. Team-owned; the source of truth for *what the system is*.
|
||||
- **Catalog** — the **applied** registry the cluster owns in storage: the graphs,
|
||||
stored queries, and policies `cluster apply` materialized. What a server serves
|
||||
and what `query <name>` resolves against. *(Cluster config is the spec; the
|
||||
catalog is the applied result.)*
|
||||
- **Operator config** — `~/.omnigraph/config.yaml`, your **personal** file:
|
||||
identity (actor), default graph, named servers/clusters, output prefs, optional
|
||||
profiles. Declares *who I am*, never what the system is.
|
||||
- **Profile** — an optional named bundle of **defaults inside the operator
|
||||
config** (one of {cluster, server, store} + a default graph). Config data,
|
||||
**not state**: selecting one fills in omitted flags for a command; it does not
|
||||
put you "in" a mode. Chosen per command (`--profile <name>`) or per shell
|
||||
(`OMNIGRAPH_PROFILE`).
|
||||
- **Credential** — a bearer token keyed to a **server name**, resolved via
|
||||
`OMNIGRAPH_TOKEN_<NAME>` or `~/.omnigraph/credentials` (`0600`); sent only to
|
||||
the server it is keyed to. (Per RFC-007 — the operator config holds endpoints,
|
||||
never tokens.)
|
||||
|
||||
### What you run — definitions vs payloads
|
||||
|
||||
- **Schema** — the `.pg` type definitions for a graph; authored as a file, applied
|
||||
via `schema apply` (or `cluster apply`).
|
||||
- **Stored query** — a named query in the catalog, the team's reusable contract;
|
||||
invoked by name. *Example: `find_people`.*
|
||||
- **Query file (`.gq`)** — an authoring artifact holding `query <name>`
|
||||
declarations; becomes a stored query when `cluster apply` adopts it. For
|
||||
authoring/ad-hoc, not everyday invocation.
|
||||
- **Payload** — the per-call inputs that vary each run: params (`--params`,
|
||||
positional args) and bulk data (`--data`). Never part of config.
|
||||
|
||||
### How a command resolves
|
||||
|
||||
- **Scope** — the resolved environment a command addresses: operator defaults, a
|
||||
named profile, or one explicit primitive address.
|
||||
- **Access path** — **served** (through a server) or **direct** (open storage
|
||||
in-process). Derived from scope × capability; see "Access path" below.
|
||||
- **Capability** — what a verb requires: `any`, `served`, `direct`, `control`,
|
||||
or `local`.
|
||||
- **Target shape** — whether the verb is **graph-scoped** (selects one graph
|
||||
inside the scope), **scope-scoped** (operates on the whole server/cluster
|
||||
scope), or **local** (does not resolve scope or graph).
|
||||
- **Actor** — the identity a write is attributed to: server-resolved from the
|
||||
bearer token (served), or `--as` ?? `operator.actor` (direct).
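
The actor rule can be read as a small function of the access path. The following is a minimal sketch, not the CLI's code: the type and function names are made up, it assumes a direct command with neither `--as` nor `operator.actor` is an error, and it simply ignores `--as` on the served path (the text here does not say whether the real CLI rejects it).

```rust
/// How the command reaches the graph.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccessPath {
    Served,
    Direct,
}

/// Where the actor for a write comes from.
#[derive(Debug, PartialEq, Eq)]
pub enum ActorSource<'a> {
    /// Served: the server derives the actor from the bearer token.
    ServerResolved,
    /// Direct: the operator declares the actor locally.
    SelfDeclared(&'a str),
}

pub fn resolve_actor<'a>(
    path: AccessPath,
    as_flag: Option<&'a str>,
    operator_actor: Option<&'a str>,
) -> Result<ActorSource<'a>, String> {
    match path {
        AccessPath::Served => Ok(ActorSource::ServerResolved),
        // `--as` ?? `operator.actor`: the flag wins, the config value is the fallback.
        AccessPath::Direct => as_flag
            .or(operator_actor)
            .map(ActorSource::SelfDeclared)
            // Assumption of this sketch: no actor at all is an error.
            .ok_or_else(|| "no actor: pass --as or set operator.actor".to_string()),
    }
}
```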
|
||||
|
||||
### The relationships that prevent confusion
|
||||
|
||||
- **Exactly two config surfaces:** **cluster config** (team) and **operator
|
||||
config** (personal). Nothing else is "a config."
|
||||
- A **profile is not a third config** — it lives *inside* the operator config, and
|
||||
it is **defaults, not state**.
|
||||
- A **catalog is not config** — it is the *applied state* the cluster owns.
|
||||
- A **store is one graph; a cluster is many graphs** + catalog + control state.
|
||||
- A **graph is the logical thing**; store/server/cluster are ways to reach it.
|
||||
- "State" elsewhere is not the profile: *graph state* is committed data in Lance;
|
||||
*cluster state* is the applied control-plane ledger. Neither is operator config.
|
||||
|
||||
## Design
|
||||
|
||||
### First principles
|
||||
|
||||
> Addressing should be 1:1 with the system's real entities; the access path
|
||||
> (served vs direct) should be **derived**, never inferred from a string or
|
||||
> toggled per command; the CLI should be **terse by config and stateless per
|
||||
> command**; and **definitions are named while payloads are passed**.
|
||||
|
||||
Every command answers five orthogonal questions, and this design keeps them orthogonal:
|
||||
|
||||
| Axis | Question | Today | Target |
|---|---|---|---|
| Scope | which environment? | `omnigraph.yaml` defaults / `--target` | operator defaults · `--profile` · one primitive |
| Target shape | whole scope or one graph? | implicit in command family | declared per verb |
| Graph | which graph in it? | tangled into the address | `--graph` only for graph-scoped server/cluster verbs |
| Access path | served or direct? | inferred from scheme / target | **derived** from scope × capability |
| Actor | who am I? | `--as` > `cli.actor` (yaml) > `operator.actor` | `--as`/`operator.actor` (direct) · token (served) |
|
||||
|
||||
### A scope binds one entity — and served is the default
|
||||
|
||||
A scope (a profile, the flat defaults, or one primitive flag) binds **exactly one
|
||||
of** {server, cluster, store}. Server and cluster scopes may contain many graphs
|
||||
and can carry a `default_graph`; a store scope is already one graph and does not
|
||||
accept `--graph`. They differ by privilege, and **the everyday default is a
|
||||
server**:
|
||||
|
||||
- **server** → served (the everyday scope). A bearer token, **no storage
|
||||
credentials**. Data verbs run through it, policy-enforced; maintenance verbs are
|
||||
unavailable from this scope — there is no server route for them, so you must
|
||||
name storage explicitly. This is what a normal operator's config binds.
|
||||
- **cluster** → direct storage to a managed cluster, for **control,
|
||||
maintenance, and graph-backed validation only** (`cluster *`,
|
||||
`optimize`/`repair`/`cleanup`/`schema plan`, graph-backed `lint`, and
|
||||
`queries validate`). Data verbs are **not** run directly against a cluster —
|
||||
they go served, or `--store` for ad-hoc. **Privileged:** requires bucket
|
||||
credentials, so it appears only in a maintainer's config or as an explicit
|
||||
`--cluster` flag — never in an everyday operator's defaults.
|
||||
- **store** → one graph's storage, direct. A **local file** store is ordinary
|
||||
local dev; a **remote `s3://`** store is break-glass. No catalog (named queries
|
||||
do not resolve — the ad-hoc lane).
|
||||
|
||||
A scope names **one** thing, so there is no independent `server`+`cluster` pair
|
||||
that could disagree (the audit's coherence hazard is gone by construction — the
|
||||
default is just a server). The storage root likewise lives only where it must, as the next section lays out.
|
||||
|
||||
### Direct storage access is privileged (the storage-root rule)
|
||||
|
||||
> The storage root (`s3://…`) is **server-and-admin knowledge, never
|
||||
> everyday-operator knowledge.** Everyday operator config binds a server (a bearer
|
||||
> token, no bucket credentials). Direct remote access — opening a cluster root or
|
||||
> an `s3://` store — is always **explicit and privileged**: you name
|
||||
> `--cluster`/`--store`, and only someone with bucket credentials can. The CLI
|
||||
> never opens a remote store from a default scope.
|
||||
|
||||
This is the least-privilege posture — revoke a bearer token, don't rotate bucket
|
||||
keys; only the **server process** and an occasional **maintenance admin** ever
|
||||
hold storage credentials. It makes "use the server, not raw storage"
|
||||
**structural**, not advisory: direct access requires credentials a normal operator
|
||||
does not have *and* a flag they must type. The only storage root in an everyday
|
||||
setup is the one the **server** boots from; operators never see it. (Local *file*
|
||||
stores for dev are unaffected — a local file is not the production bucket.)
|
||||
|
||||
### Access path is derived, not chosen
|
||||
|
||||
The two access paths are genuinely different — not two transports for one thing:
|
||||
|
||||
- **Served** (through a server): the server resolves your actor from a token and
|
||||
enforces Cedar policy at the HTTP boundary. In cluster mode the **catalog and
|
||||
config** (graph set, stored queries, policy bundles) are pinned to the applied
|
||||
serving revision and move only on restart; **graph data** is read through the
|
||||
server's engine handle against the requested branch/snapshot (it is not frozen
|
||||
at boot, though a long-running server will not observe *out-of-band direct
|
||||
writes* to storage until its handle refreshes). No storage credentials needed.
|
||||
- **Direct** (open the Lance storage in-process): a **privileged** path — it needs
|
||||
your own storage credentials, so only an admin/maintainer (or a local-dev file
|
||||
store) takes it. Actor self-declared (`--as` ?? `operator.actor`), reads **live
|
||||
storage HEAD**. There is **no server-side identity/auth gate** — but engine-level
|
||||
Cedar policy *is* still enforced when the graph selection provides a policy
|
||||
(enforcement is engine-wide; embedded `_as` writers call the same `enforce`).
|
||||
"Direct" means "no HTTP boundary," not "unpoliced."
|
||||
|
||||
Because they differ in authority, freshness, and availability, a graph reached via
|
||||
a server and that graph's raw storage are **different things you name
|
||||
differently** — not one identity you flip. Making the access path a per-command
|
||||
toggle (`--via`) is the `--target` mistake in new clothes; it is rejected.
|
||||
|
||||
> **The access path follows from the scope and the verb.** A **server** scope →
|
||||
> served (data/catalog). A **cluster** scope → direct control, maintenance, and
|
||||
> validation. A **store** scope → direct ad-hoc data (no catalog). The verb's
|
||||
> capability picks which applies and rejects the mismatches.
|
||||
|
||||
State the bound plainly: the everyday data path
|
||||
(`query`/`mutate`/`load`/`branch`/`export`/`commit`) against a served graph
|
||||
**never needs direct storage access**, and direct access is legitimate only in
|
||||
bounded places: **bootstrap** (`init`), **storage-native maintenance**
|
||||
(`optimize`/`repair`/`cleanup`/`schema plan`), **graph-backed validation**
|
||||
(`lint`), **catalog validation** (`queries validate`), the **control plane**
|
||||
(`cluster *`), **local dev** with no server, and **break-glass** (recovery, or
|
||||
checking whether a long-running server's handle lags live HEAD). Everything else
|
||||
is served. This is what makes "discourage direct storage" enforceable rather
|
||||
than aspirational.
|
||||
|
||||
This list is expected to **shrink**: Decision 11 moves
|
||||
`optimize`/`cleanup` (and healthy-path `repair`) to server-managed jobs, which
|
||||
would leave direct access to just standalone/local dev, the control plane, and
|
||||
break-glass — and remove the last routine reason an admin needs bucket
|
||||
credentials.
|
||||
|
||||
### Capability semantics
|
||||
|
||||
The CLI validates through verb capability, not plane jargon:
|
||||
|
||||
| Capability | Meaning | Examples |
|---|---|---|
| `any` | graph-scoped data; served via a server scope; direct only against a **store** scope (local dev / break-glass); **errors on a cluster scope** | `query`, `mutate`, `load`, `export`, branch reads, `schema show/apply` |
| `served` | requires an HTTP server; may be graph-scoped or scope-scoped | `graphs list`, `queries list` |
| `direct` | graph-scoped storage-native or graph-backed validation; no server form exists | `init`, `optimize`, `repair`, `cleanup`, `schema plan`, graph-backed `lint` |
| `control` | cluster-scoped catalog/control-plane work; addresses the cluster, not a single raw store | `cluster *`, `queries validate` |
| `local` | does not address a graph or scope | `config`, `profile`, `lint --query ... --schema ...` |
|
||||
|
||||
`any` does **not** mean "the user picks": the resolver picks from the scope.
|
||||
Internally the exhaustive `command_plane` match (`planes.rs`) stays as the drift
|
||||
guard; user-facing errors speak in terms of what the command needs.
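
To show why an exhaustive match works as a drift guard, here is a minimal sketch of a per-verb capability table. It is not the `command_plane` function in `planes.rs`; the `Verb` enum is a deliberately small, hypothetical subset, and `lint` is left out because its capability depends on which flags it is given.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Capability {
    Any,
    Served,
    Direct,
    Control,
    Local,
}

/// A small illustrative subset of verbs, not the CLI's real command enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verb {
    Query,
    Mutate,
    Load,
    Export,
    GraphsList,
    QueriesList,
    Init,
    Optimize,
    Repair,
    Cleanup,
    SchemaPlan,
    ClusterApply,
    QueriesValidate,
    Config,
}

pub fn capability_of(verb: Verb) -> Capability {
    // No wildcard arm: adding a `Verb` variant without classifying it here
    // is a compile error, so the table cannot silently drift.
    match verb {
        Verb::Query | Verb::Mutate | Verb::Load | Verb::Export => Capability::Any,
        Verb::GraphsList | Verb::QueriesList => Capability::Served,
        Verb::Init | Verb::Optimize | Verb::Repair | Verb::Cleanup | Verb::SchemaPlan => {
            Capability::Direct
        }
        Verb::ClusterApply | Verb::QueriesValidate => Capability::Control,
        Verb::Config => Capability::Local,
    }
}
```

The design choice worth noting is the absence of a `_ =>` arm: a catch-all would compile for a new verb and quietly assign it a default capability, which is the kind of accidental classification the table is meant to rule out.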
|
||||
|
||||
### Definitions vs payloads
|
||||
|
||||
Queries and schema are **definitions** — contracts that live in the catalog and
|
||||
are invoked **by name**; params and data are **payloads** passed per call. So the
|
||||
everyday form is `omnigraph query <name> [params]`, not
|
||||
`omnigraph query --file find.gq`. A `.gq` path on a routine query is a smell: the
|
||||
query is not in the catalog yet. Lifecycle: **author a `.gq` → `cluster apply`
|
||||
adopts it → invoke by name thereafter.**
|
||||
|
||||
Named queries resolve through a **server** (which serves the cluster's catalog).
|
||||
`queries list` is therefore a served catalog read. `queries validate` is a
|
||||
control/catalog check against the cluster-owned query definitions. A bare
|
||||
`--store` has **no catalog**, so it is the ad-hoc lane (`-e` / `--file`), and
|
||||
`--cluster` does not invoke stored queries. So named-query invocation is a
|
||||
**served** convenience; direct access (`--store`) is always ad-hoc.
|
||||
|
||||
| Kind | Examples | How it enters a command |
|---|---|---|
| Definition | stored query, schema | named in the catalog; authored as a file, adopted by `cluster apply` |
| Payload | params, bulk data | passed per call (`--params`, positional args, `--data`) |
| Authoring / ad-hoc | a `.gq` you're writing | `-e '…'`, `--file new.gq`, `lint --query new.gq --schema schema.pg`, `schema apply --schema` |
|
||||
|
||||
### Resolution rule
|
||||
|
||||
1. If the verb is `local`, reject graph/scope flags and run without resolving a
|
||||
scope.
|
||||
2. If a primitive address is supplied (`--store`/`--server`/`--cluster`), use it
|
||||
and ignore operator-config scope defaults. *(A **named** primitive — `--server
|
||||
prod`, `--cluster brain` — still resolves through the operator-config registry;
|
||||
a **literal** — `--server https://…`, `--store s3://…` — bypasses it. Per
|
||||
Decision 2: a value containing `://` is a literal, otherwise a config-name
|
||||
lookup.)*
|
||||
3. Else if `--profile <name>` (or `OMNIGRAPH_PROFILE`) selects a profile, use it.
|
||||
4. Else use the operator config's flat defaults. Error only if no scope resolves at all.
|
||||
*(No sticky "current" pointer — each command resolves scope fresh.)*
|
||||
5. Resolve the graph only for **graph-scoped** verbs. Server/cluster scopes:
|
||||
exactly one graph in scope → use it; else `default_graph`; else require
|
||||
`--graph <id>`. Store scopes are already one graph, so `--graph` is rejected.
|
||||
**Scope-scoped** verbs (`graphs list`, `queries list`, `queries validate`,
|
||||
and `cluster *`) do not select a graph unless their own resource argument says
|
||||
otherwise.
|
||||
6. Derive the access path from capability × scope:
|
||||
- `direct` verb → the scope's cluster/store; if the scope is a server, error
|
||||
(name storage explicitly — it is privileged).
|
||||
- `served` verb → the scope's server; if the scope is a cluster/store, error.
|
||||
- `control` verb → the scope's cluster; if the scope is a server/store, error
|
||||
(name a cluster explicitly — it is privileged).
|
||||
- `any` verb → **served** if the scope is a server; **direct** against a
|
||||
**store** scope (ad-hoc); on a **cluster** scope, error — cluster is
|
||||
maintenance-only, so use a server for data or `--store` for ad-hoc.
|
||||
7. Reject mismatches with an error naming the missing axis.
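
Step 6 is a total function of two small enums, so it can be sketched as a single match. The names below are hypothetical and the error wording is illustrative. `local` verbs are omitted because step 1 handles them before any scope is resolved, and the sketch treats `control` work as a direct access path, following the description of a cluster scope as direct storage to a managed cluster.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Capability {
    Any,
    Served,
    Direct,
    Control,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ScopeKind {
    Server,
    Cluster,
    Store,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AccessPath {
    Served,
    Direct,
}

pub fn derive_access_path(
    cap: Capability,
    scope: ScopeKind,
) -> Result<AccessPath, &'static str> {
    match (cap, scope) {
        // `direct` verbs open the scope's cluster or store.
        (Capability::Direct, ScopeKind::Cluster | ScopeKind::Store) => Ok(AccessPath::Direct),
        (Capability::Direct, ScopeKind::Server) => {
            Err("needs direct storage access; name storage with --cluster or --store")
        }
        // `served` verbs need a server.
        (Capability::Served, ScopeKind::Server) => Ok(AccessPath::Served),
        (Capability::Served, ScopeKind::Cluster | ScopeKind::Store) => {
            Err("requires a server scope")
        }
        // `control` verbs address the cluster itself.
        (Capability::Control, ScopeKind::Cluster) => Ok(AccessPath::Direct),
        (Capability::Control, ScopeKind::Server | ScopeKind::Store) => {
            Err("requires a cluster scope; name one with --cluster")
        }
        // `any` verbs: the scope decides, never the user.
        (Capability::Any, ScopeKind::Server) => Ok(AccessPath::Served),
        (Capability::Any, ScopeKind::Store) => Ok(AccessPath::Direct),
        (Capability::Any, ScopeKind::Cluster) => {
            Err("cluster scope is maintenance-only; use a server for data or --store for ad-hoc")
        }
    }
}
```

Because every (capability, scope) pair has an explicit arm, there is no input for which the access path is left to a flag or guessed from a URI scheme.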
|
||||
|
||||
Good errors:
|
||||
|
||||
```text
scope "prod" has 4 graphs; pass --graph <id> or set default_graph
optimize needs direct storage access; scope "prod" is a server — name storage with --cluster s3://… or --store (requires storage credentials)
graphs list enumerates a server scope; do not pass --graph
--store opens raw storage directly, bypassing any server (no HTTP auth gate, live HEAD); for recovery/inspection
```
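
The first message above comes out of step 5 of the resolution rule. A minimal sketch of that graph-selection step follows, under stated assumptions: the `Scope` type and `select_graph` name are hypothetical, the scope's graph list is passed in rather than fetched, an explicit `--graph` is taken to win over the other fallbacks without being checked against the list, and the store-scope error wording is invented.

```rust
pub enum Scope<'a> {
    /// A server or cluster scope, which may hold many graphs.
    Multi {
        name: &'a str,
        graphs: &'a [&'a str],
        default_graph: Option<&'a str>,
    },
    /// A store scope is already exactly one graph.
    Store,
}

/// Returns `Ok(None)` for a store scope (no graph id is needed) and
/// `Ok(Some(id))` for a server or cluster scope.
pub fn select_graph<'a>(
    scope: &Scope<'a>,
    graph_flag: Option<&'a str>,
) -> Result<Option<&'a str>, String> {
    match scope {
        Scope::Store => match graph_flag {
            Some(_) => Err(
                "--graph is not accepted with a store scope; a store is already one graph"
                    .to_string(),
            ),
            None => Ok(None),
        },
        Scope::Multi { name, graphs, default_graph } => {
            // Assumption: an explicit --graph takes priority.
            if let Some(id) = graph_flag {
                return Ok(Some(id));
            }
            // Exactly one graph in scope: use it.
            if graphs.len() == 1 {
                return Ok(Some(graphs[0]));
            }
            // Otherwise fall back to default_graph.
            if let Some(id) = default_graph {
                return Ok(Some(*id));
            }
            Err(format!(
                "scope \"{}\" has {} graphs; pass --graph <id> or set default_graph",
                name,
                graphs.len()
            ))
        }
    }
}
```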
|
||||
|
||||
### Config shape (operator config)
|
||||
|
||||
`~/.omnigraph/config.yaml` — your personal file; the cluster config
|
||||
(`cluster.yaml` + catalog) is the separate, team-owned surface. The default-graph
|
||||
key is `default_graph` everywhere (the per-command flag is `--graph`).
|
||||
|
||||
**Everyday operator — binds a server, holds no storage root:**
|
||||
|
||||
```yaml
defaults:
  server: prod
  default_graph: knowledge
  output: table
servers:
  prod: { url: https://graph.example.com }    # token keyed by name (RFC-007); no creds here
  staging: { url: https://staging.example.com }
profiles:                                     # optional, only for multiple environments
  staging: { server: staging, default_graph: knowledge }
```
|
||||
|
||||
A normal operator never has a storage root or bucket credentials. Their default
|
||||
scope is served; `optimize`/`repair`/`cleanup` error with a pointer to name
|
||||
storage explicitly.
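
The `servers:` registry above is what a named `--server` value resolves through, while a literal URL bypasses it. A minimal sketch of that split, using the "contains `://`" rule the resolution rule attributes to Decision 2, might look like the following. All names are hypothetical, the registry is reduced to a name-to-URL map, and the sketch covers `--server` only: `--store` also accepts bare local paths such as `graph.omni`, which this rule alone would not classify correctly.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum Address<'a> {
    /// Contains "://": used as-is, bypassing the operator-config registry.
    Literal(&'a str),
    /// Anything else: a name to look up in the operator config.
    Named(&'a str),
}

pub fn classify(value: &str) -> Address<'_> {
    if value.contains("://") {
        Address::Literal(value)
    } else {
        Address::Named(value)
    }
}

/// Resolve a `--server` value to a URL. `servers` stands in for the
/// `servers:` block, reduced to name -> url.
pub fn resolve_server_url<'a>(
    value: &'a str,
    servers: &'a HashMap<String, String>,
) -> Result<&'a str, String> {
    match classify(value) {
        Address::Literal(url) => Ok(url),
        Address::Named(name) => servers
            .get(name)
            .map(String::as_str)
            .ok_or_else(|| format!("no server named \"{}\" in operator config", name)),
    }
}
```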
|
||||
|
||||
**Maintainer — opts into a cluster root (and has bucket credentials):**
|
||||
|
||||
```yaml
profiles:
  brain-admin: { cluster: brain, default_graph: knowledge }   # direct; admin/control/maintenance
clusters:
  brain: { root: s3://acme/clusters/brain }                   # the s3:// root lives ONLY here
```
|
||||
|
||||
The `clusters:` block — the only place a storage root appears in operator config —
|
||||
is **admin-only and opt-in**, absent from a normal operator's file. Equivalently,
|
||||
skip config and name it per command:
|
||||
`omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge`. The
|
||||
cluster stays the source of truth for the managed catalog; tokens live in the
|
||||
keyed credential store, never in this file.
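
Taken together with steps 2 to 4 of the resolution rule, this config shape implies a fixed precedence: an explicit primitive address, then a selected profile, then the flat defaults. The sketch below illustrates that order with hypothetical types. It assumes the `--profile` flag and `OMNIGRAPH_PROFILE` have already been merged into one optional name, and that an unknown profile name is an error.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Binding {
    Server(String),
    Cluster(String),
    Store(String),
}

/// A scope binds exactly one entity, plus an optional default graph.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Scope {
    pub binding: Binding,
    pub default_graph: Option<String>,
}

/// Stand-in for the parts of `~/.omnigraph/config.yaml` that matter here.
pub struct OperatorConfig {
    pub defaults: Option<Scope>,
    pub profiles: HashMap<String, Scope>,
}

pub fn resolve_scope(
    primitive: Option<Binding>,
    profile: Option<&str>,
    cfg: &OperatorConfig,
) -> Result<Scope, String> {
    // Step 2: an explicit primitive wins and ignores config scope defaults.
    if let Some(binding) = primitive {
        return Ok(Scope { binding, default_graph: None });
    }
    // Step 3: a selected profile fills in the scope.
    if let Some(name) = profile {
        return cfg
            .profiles
            .get(name)
            .cloned()
            .ok_or_else(|| format!("unknown profile \"{}\"", name));
    }
    // Step 4: flat defaults, else there is no scope to resolve.
    cfg.defaults.clone().ok_or_else(|| {
        "no scope: pass --server/--cluster/--store, select a profile, or set defaults".to_string()
    })
}
```

Nothing is written back after resolution, which mirrors the stateless-per-command rule: the same flags and the same config file always yield the same scope.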
|
||||
|
||||
### Command shape
|
||||
|
||||
Assume the everyday flat defaults: server `prod`, default graph `knowledge`.
|
||||
|
||||
| Intent | Command | Path |
|
||||
|---|---|---|
|
||||
| Run a catalog query | `omnigraph query find_people` | served |
|
||||
| …with params | `omnigraph query find_people --params '{"title":"Eng"}'` | served |
|
||||
| Another graph in scope | `omnigraph query find_people --graph archive` | served |
|
||||
| Write | `omnigraph load --data batch.jsonl --mode append` | served |
|
||||
| A different environment | `omnigraph --profile staging query find_people` | served |
|
||||
| One-off server, no config | `omnigraph query find_people --server https://graph.example.com --graph knowledge` | served |
|
||||
| Maintain (admin, explicit storage) | `omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge` | direct (privileged) |
|
||||
| Maintain (admin, via admin profile) | `omnigraph --profile brain-admin optimize --graph knowledge` | direct (privileged) |
|
||||
| List catalog queries | `omnigraph queries list` | served |
|
||||
| Validate cluster query catalog | `omnigraph queries validate --cluster s3://acme/clusters/brain` | control (privileged) |
|
||||
| Offline query lint | `omnigraph lint --query new.gq --schema schema.pg` | local |
|
||||
| Graph-backed query lint | `omnigraph lint --query new.gq --cluster s3://acme/clusters/brain --graph knowledge` | direct (privileged) |
|
||||
| Local dev, no server | `omnigraph query -e 'match { … } return { … }' --store graph.omni` | direct (local file) |
|
||||
| Break-glass: raw storage of a served graph | `omnigraph query --file find.gq --store s3://acme/clusters/brain/graphs/knowledge.omni` | direct (privileged, rare) |
|
||||
|
||||
Note what the everyday rows are: **all served.** `optimize` does *not* appear in
|
||||
the default-scope rows — from a server scope it errors and points you to name
|
||||
storage (see the resolution rule), so maintenance is always a deliberate,
|
||||
credentialed act. There is no "force served/direct" row — you never toggle the
|
||||
path on a configured graph; the only way to reach raw storage is to *name it*
|
||||
(`--cluster`/`--store`), which makes the privileged bypass unmistakable. Everyday
|
||||
rows invoke a query **by name**; a `.gq` file appears only where there is no
|
||||
catalog (bare store, break-glass) via `-e`/`--file`.
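
As a rough sketch of why there is no "force served/direct" row, the access path can be modelled as a pure lookup on the verb's capability and the resolved scope kind. The table below is reconstructed from the taxonomy appendix and audit notes later in this RFC, so treat the exact pairs as an assumption; `ACCESS`, `ScopeError`, and `access_path` are hypothetical names.

```python
from typing import Optional

# Allowed (capability, scope kind) pairs and the access path each yields.
# Anything absent from this table is an error, not a fallback.
ACCESS = {
    ("any", "server"): "served",
    ("any", "store"): "direct",       # ad-hoc, no catalog
    ("served", "server"): "served",
    ("direct", "cluster"): "direct",  # privileged
    ("direct", "store"): "direct",
    ("control", "cluster"): "control",
}


class ScopeError(Exception):
    """The verb's capability is not available from the resolved scope."""


def access_path(capability: str, scope_kind: Optional[str]) -> str:
    """Derive the access path; the caller never chooses it directly."""
    if capability == "local":
        return "local"  # local verbs never touch a graph
    path = ACCESS.get((capability, scope_kind))
    if path is None:
        raise ScopeError(
            f"a {capability!r} verb cannot run from a {scope_kind} scope; "
            "name a suitable scope explicitly"
        )
    return path
```

Under this model `optimize` (a `direct` verb) from the everyday server scope falls through to `ScopeError`, which matches the "errors and points you to name storage" behaviour described above.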
|
||||
|
||||
## Before / after
|
||||
|
||||
**Before** = best available today (legacy `omnigraph.yaml`, `--target`, `.gq`
|
||||
files, `--cluster-graph`, scheme inference). **After** = this model.
|
||||
|
||||
| Intent | Before | After |
|
||||
|---|---|---|
|
||||
| Run a query | `omnigraph query --target knowledge --query find.gq --name find_people` | `omnigraph query find_people` |
|
||||
| Another graph | `omnigraph query --target archive --query find.gq --name find_people` | `omnigraph query find_people --graph archive` |
|
||||
| Load | `omnigraph load --data b.jsonl --mode append --target knowledge` | `omnigraph load --data b.jsonl --mode append` |
|
||||
| Maintain (admin) | `omnigraph optimize --cluster brain --cluster-graph knowledge` | `omnigraph optimize --cluster s3://acme/clusters/brain --graph knowledge` |
|
||||
| Another environment | edit `omnigraph.yaml`, or re-address with full URIs | `--profile staging …` or `OMNIGRAPH_PROFILE=staging` |
|
||||
| One-off remote | `omnigraph query --uri https://… --query find.gq` *(scheme→remote)* | `omnigraph query find_people --server https://… --graph knowledge` |
|
||||
| Raw storage of a served graph | `omnigraph query s3://…/knowledge.omni --query find.gq` *(looks like a normal query)* | `omnigraph query --file find.gq --store s3://…/knowledge.omni` *(explicit bypass)* |
|
||||
|
||||
**Removed:** `--target`; `--cluster-graph` (`--graph` is the graph selector only
|
||||
for graph-scoped server/cluster verbs); `--uri` http-scheme dispatch; `--via`
|
||||
(never ships); everyday `--query <file>` (definitions are named);
|
||||
`omnigraph.yaml` and its `cli.graph`/`server.graph` defaults.
|
||||
|
||||
## Server-side corollary
|
||||
|
||||
The same ontology applies to `omnigraph-server` boot: with `omnigraph.yaml` gone,
|
||||
a server boots from a single bare graph URI **or** a cluster (`--cluster <dir|s3>`,
|
||||
RFC-005), never a `graphs:` map. The store/server/cluster ontology is then
|
||||
consistent across CLI and server.
|
||||
|
||||
## Migration & compatibility
|
||||
|
||||
Addressing flags and config keys are observable contract (Hyrum's Law); every removal is
|
||||
staged and release-noted.
|
||||
|
||||
- **`config migrate`** (shipped) maps each legacy `graphs:` entry **by what it
|
||||
actually is**: `http(s)` URIs → a `server:` (the recommended everyday shape);
|
||||
`file` URIs → a local `store:`; an `s3://` **graph** URI → an **admin** `store:`
|
||||
(it is a single graph, not a cluster); an `s3://` **cluster root** (one that
|
||||
carries cluster state) → an **admin** `cluster:`. Everyday `s3://` graph usage
|
||||
migrates with a **warning** — prefer serving it via a server rather than
|
||||
re-establishing direct remote access. It reports dropped keys.
|
||||
- **Operators move to a server-default scope.** Where a legacy setup pointed
|
||||
`cli.graph` at an `s3://` graph for everyday use, migration flags it: the
|
||||
recommended shape is a `server:` scope (bearer token, no bucket creds), with the
|
||||
`s3://` root kept only in a maintainer's config — not every operator's.
|
||||
- **`--target`** warns for one release, then errors; **`OMNIGRAPH_NO_LEGACY_CONFIG=1`**
|
||||
(already the strict switch) becomes the default — loading `omnigraph.yaml` is a
|
||||
hard error.
|
||||
- **`--cluster-graph` → `--graph`**: `--cluster-graph` is accepted with a warning
|
||||
for one release, then removed.
|
||||
- **`--graph` meaning change**: today `--graph` is "graph id on a multi-graph
|
||||
server" (paired with `--server`); it generalizes to "select the graph for
|
||||
graph-scoped verbs in server/cluster scopes." Existing `--server --graph`
|
||||
usage keeps working (it is a strict superset); release-note the broadened
|
||||
meaning and the fact that store scopes and scope-scoped verbs reject it.
|
||||
- **`--uri http://…`** warns, then errors with a pointer to `--server`.
|
||||
- **`--as` on served paths**: today global `--as` is accepted (a no-op on remote
|
||||
writes — the server resolves the actor from the token); rejecting it on the
|
||||
served path is staged — warn for one release, then error.
|
||||
- **`--alias`** → the `alias` namespace (`omnigraph alias <name>`, Decision 4);
|
||||
the old `--alias` flag warns for one release, then is removed.
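
The `config migrate` mapping in the first bullet can be sketched as a classification on the legacy entry's URI. This is a hedged illustration, not the shipped migrator: `classify_legacy_graph` and `Migrated` are hypothetical names, and whether an `s3://` root "carries cluster state" is passed in as a flag because the detection mechanism is not described here.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Migrated:
    kind: str                      # "server", "store", or "cluster"
    admin: bool                    # True when the entry needs bucket credentials
    warning: Optional[str] = None  # set when the shape is discouraged


def classify_legacy_graph(uri: str, carries_cluster_state: bool = False) -> Migrated:
    """Map one legacy `graphs:` entry by what it actually is."""
    scheme = urlparse(uri).scheme
    if scheme in ("http", "https"):
        return Migrated("server", admin=False)
    if scheme == "file":
        return Migrated("store", admin=False)
    if scheme == "s3":
        if carries_cluster_state:
            return Migrated("cluster", admin=True)
        # A single s3:// graph is a store, not a cluster; migrate with a warning.
        return Migrated(
            "store",
            admin=True,
            warning="direct s3:// graph access; prefer serving it via a server",
        )
    # Assumption: other shapes are out of scope for this sketch.
    raise ValueError(f"unhandled legacy graph URI: {uri!r}")
```

For example, `classify_legacy_graph("s3://acme/clusters/brain", carries_cluster_state=True)` yields an admin `cluster`, while the same bucket path without cluster state yields an admin `store` plus the warning.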
|
||||
|
||||
## Non-goals
|
||||
|
||||
- **No change to the direct/served capability split.** Maintenance stays
|
||||
storage-direct by design (no server routes for `optimize`/`repair`/`cleanup`);
|
||||
this RFC only makes the split explicit.
|
||||
- **No new transport.** Addressing surface, not protocol.
|
||||
- **No positional sigil grammar** (`@server/graph`, `%cluster/graph`). Considered
|
||||
and rejected: explicit flags are more discoverable; profiles already give
|
||||
brevity. Revisit only on demonstrated expert-terseness demand.
|
||||
|
||||
## Decisions
|
||||
|
||||
The questions this RFC opened are resolved as follows. Two are explicitly
|
||||
deferred (see below); they do not block the model.
|
||||
|
||||
1. **Local-dev path → embedded `--store` scope.** Local dev runs the engine
|
||||
in-process against a `--store <file>` (or a store-scoped profile); `omnigraph
|
||||
serve` stays available but is not required. Consistent with embedded ≡ remote
|
||||
(RFC-009).
|
||||
2. **Primitives are one flag, typed by content.** `--server` and `--cluster`
|
||||
accept either a config name or a literal URI: a value containing `://` is a
|
||||
literal (bypasses the registry); otherwise it is a config-name lookup (error if
|
||||
unknown). `--store` is always a URI. (Replaces the earlier "literal-vs-named"
|
||||
question — no `--server-url`/`--cluster-root` split.)
|
||||
3. **Stored invocation: `query <name>` (read) / `mutate <name>` (write), one
|
||||
catalog namespace.** A name maps to one definition; the verb asserts its kind
|
||||
and the CLI errors on mismatch (`'apply_labels' is a mutation — use
|
||||
omnigraph mutate apply_labels`). No `invoke` verb.
|
||||
4. **Aliases live under an `alias` namespace** — `omnigraph alias <name> [args]`,
|
||||
never bare top-level. An alias can therefore neither shadow nor be shadowed by a
|
||||
built-in (current or future) verb.
|
||||
6. **Profile merge: scope wholesale, prefs layered.** The entity binding +
|
||||
`default_graph` come *wholesale* from the active scope (a profile, or flat
|
||||
defaults if none) — never per-key merged across the entity dimension (that would
|
||||
yield "server *and* cluster"). Only non-scope preferences (`output`, table
|
||||
layout) take flat defaults as a base. Precedence: explicit flag > profile > flat
|
||||
defaults.
|
||||
7. **No default graph → error + list candidates.** A graph-scoped verb with no
|
||||
`--graph`, no `default_graph`, and >1 graph in scope errors and lists candidates
|
||||
(served: `GET /graphs`; cluster-direct: catalog enumeration). If enumeration is
|
||||
policy-gated/unavailable, it says so and asks for `--graph`. Never auto-pick.
|
||||
9. **Diagnostics & safety.** Writes echo the resolved scope + access path to stderr
|
||||
(suppress with `--quiet`). Destructive verbs (`cleanup`, overwrite `load`,
|
||||
`branch delete`) require confirmation when the scope is not local; `--yes` skips
|
||||
it; **no TTY without `--yes` errors** (never silently proceed). `--json`/CI never
|
||||
prompt — destructive without `--yes` errors.
|
||||
10. **Cluster graphs evolve only via `cluster apply`.** `schema apply` (an `any`
|
||||
verb) targets standalone graphs; against a cluster-managed graph it errors and
|
||||
points at `cluster apply` (which records ledger/recovery/approvals — RFC-004).
|
||||
Mirrors `init`'s refusal of a cluster-managed path.
|
||||
11. **Maintenance moves server-side (committed direction).** `optimize`/`cleanup`
|
||||
(and healthy-path `repair`) become server/cluster-managed async jobs —
|
||||
policy-gated, audited, single-coordinator — with `direct` retained only as
|
||||
break-glass (`repair` when the server is down). Runs out-of-band (a worker +
|
||||
async job routes, the `POST …` / `GET …/{id}` shape of the bulk-data-plane RFC
|
||||
(`docs/rfcs/0001-bulk-data-plane.md`, PR #219, not yet merged)), never inline in
|
||||
serving; `schema plan` is
|
||||
excluded (≈ `cluster plan` in cluster mode). The **mechanism** (job routes,
|
||||
worker, scheduling) is a follow-up RFC; until it lands the capability table above
|
||||
stands, and maintenance is `direct`. When it lands, the maintenance verbs'
|
||||
capability becomes "served-job + direct break-glass."
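
Decision 7 can be read as a short selection function. The sketch below is an assumption-laden illustration: `select_graph` and `GraphSelectionError` are hypothetical names, `candidates=None` stands for "enumeration is policy-gated or unavailable", and using the sole graph when exactly one is in scope is an inference from the ">1 graph" wording rather than something the RFC states outright.

```python
from typing import Optional, Sequence


class GraphSelectionError(Exception):
    """No graph could be selected without guessing."""


def select_graph(
    flag_graph: Optional[str],
    default_graph: Optional[str],
    candidates: Optional[Sequence[str]],
) -> str:
    """Pick the graph for a graph-scoped verb; never auto-pick among several."""
    if flag_graph:
        return flag_graph      # explicit --graph wins
    if default_graph:
        return default_graph   # from the active scope
    if candidates is None:
        raise GraphSelectionError(
            "cannot enumerate graphs in this scope; pass --graph"
        )
    if len(candidates) == 1:
        return candidates[0]   # assumption: a sole graph is unambiguous
    listing = ", ".join(sorted(candidates))
    raise GraphSelectionError(
        f"no default graph; pass --graph (candidates: {listing})"
    )
```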
|
||||
|
||||
## Deferred
|
||||
|
||||
Non-blocking; settle when convenient.
|
||||
|
||||
- **D5 — combined admin scope.** A scope binds one entity; admins read via a
|
||||
server scope and maintain via `--cluster`. A `deployments: { … }` object
|
||||
(server + cluster validated coherent, referenced by a profile) is revisited only
|
||||
if admin ergonomics demand it — and Decision 11 largely removes the need.
|
||||
- **D8 — the `profile` command surface.** *Shipped:* `profile list` / `profile
|
||||
show [<name>]` (read-only inspection). The *no sticky `profile use`* constraint
|
||||
holds — it is a design principle, not a command.
|
||||
|
||||
## Safety
|
||||
|
||||
Dropping the sticky `current_profile` pointer removes the main footgun — a
|
||||
destructive command silently inheriting a "current" environment from an earlier
|
||||
session. Because each command resolves scope fresh, what is on the command line is
|
||||
what runs. Two guards remain (a flat default or `OMNIGRAPH_PROFILE` can still point
|
||||
at prod): echo the resolved scope + access path on writes, and require
|
||||
confirmation (or `--yes`) for destructive verbs when the resolved scope is not
|
||||
local (Decision 9). The most dangerous direct writes (`cleanup`, overwrite
|
||||
`load`) are *structurally* rare now — unavailable from the everyday server scope,
|
||||
and gated behind bucket credentials plus an explicit `--cluster`/`--store` — so a
|
||||
normal operator's setup cannot, in practice, issue them by accident.
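
The confirmation guard from Decision 9 reduces to a small decision function. This is a minimal sketch under the rules stated in this RFC; `confirmation_action` and its parameter names are hypothetical, and the three string results stand in for whatever the CLI actually does at each outcome.

```python
def confirmation_action(
    destructive: bool,
    scope_is_local: bool,
    yes: bool,
    is_tty: bool,
    json_mode: bool,
) -> str:
    """Return "proceed", "prompt", or "error" for a command about to run."""
    if not destructive or scope_is_local or yes:
        # Non-destructive verbs, local scopes, and --yes need no confirmation.
        return "proceed"
    if json_mode or not is_tty:
        # --json/CI never prompt, and a missing TTY must not silently proceed.
        return "error"
    return "prompt"
```

The ordering matters: `--yes` is checked before the TTY test, so scripted runs stay possible, while an unattended destructive command without `--yes` always fails closed.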
|
||||
|
||||
## Invariants & deny-list check
|
||||
|
||||
- **§10 query semantics first-class / §11 transport at the boundary:** preserved —
|
||||
addressing resolves CLI-side to a `GraphClient`; no transport concepts leak into
|
||||
engine crates.
|
||||
- **§12 no client-set actor:** strengthened — the served path's actor stays
|
||||
token-resolved and `--as` is rejected there; direct self-declares.
|
||||
- **Least privilege (security posture):** everyday operators hold a revocable
|
||||
bearer token, not bucket credentials; only the server process and maintenance
|
||||
admins hold storage creds. Direct remote access is structural opt-in, not a
|
||||
default — narrowing the blast radius of a leaked operator config.
|
||||
- **§6 strong consistency:** both paths are snapshot-isolated per query; this RFC
|
||||
changes addressing, not isolation.
|
||||
- **Deny-list (no state that drifts):** profiles and aliases are static config
|
||||
sugar that resolve to canonical scopes; they declare nothing the cluster or
|
||||
server doesn't already own. No sticky session state is introduced.
|
||||
- No Hard Invariant is weakened; the change is CLI surface + config removal.
|
||||
|
||||
## Relationship to prior work
|
||||
|
||||
The completion of the config/CLI lineage: RFC-007 added the operator config and
|
||||
keyed credentials; RFC-008 demoted `omnigraph.yaml`; RFC-009 unified execution
|
||||
behind `GraphClient`; RFC-010 declared the planes. This RFC removes the last
|
||||
legacy addressing surface so the plane model becomes a clean function of the three
|
||||
real entities, and folds the planes into a single capability rule. It is adjacent
|
||||
to the public-track bulk-data-plane RFC (`docs/rfcs/0001-bulk-data-plane.md`,
|
||||
PR #219, not yet merged), which canonicalizes `load`/`export` verbs; this RFC
|
||||
canonicalizes how every verb *addresses* a graph.
|
||||
|
||||
## Appendix: target CLI taxonomy (end state)
|
||||
|
||||
The full command set under this model, organized by **capability** (the new
|
||||
classifying axis) instead of plane — the end-state counterpart to the
|
||||
current-taxonomy appendix below. Every command, with its end-state addressing.
|
||||
|
||||
```
|
||||
omnigraph
|
||||
│
|
||||
├─ any — data verbs · served by default (server scope, or --server <url|name>);
|
||||
│ --graph selects the graph in scope; --store forces ad-hoc direct (no catalog)
|
||||
│ ├─ query (alias: read*) invoke a stored query by NAME; -e/--file for ad-hoc
|
||||
│ ├─ mutate (alias: change*) invoke a stored mutation by name; -e/--file for ad-hoc
|
||||
│ ├─ load bulk write — --data, --mode required; --from forks a missing branch
|
||||
│ ├─ export dump graph data (NDJSON / Arrow)
|
||||
│ ├─ snapshot current per-table versions
|
||||
│ ├─ branch { create | list | delete | merge } merge takes --into <target>
|
||||
│ ├─ commit { list | show } inspect the commit graph
|
||||
│ └─ schema { show (alias: get) | apply } cluster graphs evolve via cluster apply (Decision 10)
|
||||
│
|
||||
├─ served — needs a server (errors on a store/cluster scope)
|
||||
│ ├─ graphs list enumerate the graphs a server serves
|
||||
│ └─ queries list list stored queries in the served catalog
|
||||
│
|
||||
├─ direct — storage-native, PRIVILEGED · --cluster <root> | --store <uri> + bucket creds; never a server
|
||||
│ ├─ init bootstrap a graph (--store <uri>); refuses a cluster-managed path
|
||||
│ ├─ optimize compaction; --graph selects
|
||||
│ ├─ repair publish uncovered drift; --confirm / --force
|
||||
│ ├─ cleanup version GC; --keep / --older-than / --confirm
|
||||
│ ├─ schema plan migration preview (reads storage directly)
|
||||
│ └─ lint --query <path> graph-backed query lint (with --graph on cluster scope)
|
||||
│
|
||||
├─ control — cluster/catalog control, PRIVILEGED · --cluster <dir|s3>
|
||||
│ ├─ cluster { validate | plan | apply | approve | status | refresh | import | force-unlock }
|
||||
│ apply/approve take --as <actor>; force-unlock takes <LOCK_ID>
|
||||
│ └─ queries validate validate cluster-owned stored queries against graph schemas
|
||||
│
|
||||
└─ local — no graph
|
||||
├─ policy { validate | test | explain } offline Cedar tooling
|
||||
├─ profile { list | show } read-only; NO mutating `use` (no sticky state)
|
||||
├─ alias <name> [args] personal shortcut; expands to its bound stored-query call (D4)
|
||||
├─ config { migrate } finish the omnigraph.yaml split (RFC-008)
|
||||
├─ login / logout per-server bearer credentials
|
||||
├─ embed offline embedding pipeline
|
||||
├─ lint --query <path> --schema <path> file-only query lint
|
||||
└─ version (-v)
|
||||
```
|
||||
|
||||
`*` `read`/`change` remain as deprecated aliases (warn on use); `ingest` and the
|
||||
`check`→`lint` argv-shim are **removed**. `get` aliases `schema show`.
|
||||
|
||||
### Addressing forms (end state)
|
||||
|
||||
Three scope forms — one per real entity — plus the graph selector. No `--target`,
|
||||
no `--cluster-graph`, no `--uri` scheme-dispatch, no `--via`.
|
||||
|
||||
| Form | Resolves to | Access | Privilege |
|
||||
|---|---|---|---|
|
||||
| **server scope** — operator default, a `--profile`, or `--server <url\|name>` | a served endpoint + keyed token | served | everyday (bearer token) |
|
||||
| **cluster scope** — an admin profile, or `--cluster <root>` | a managed cluster's storage + catalog | direct | privileged (bucket creds) |
|
||||
| **store scope** — `--store <uri>` | one graph's storage (no catalog) | direct | local-dev (file) / break-glass (s3) |
|
||||
| **`--graph <id>`** | selects the graph for graph-scoped verbs in server/cluster scopes; invalid for store scopes and scope-scoped verbs | — | — |
|
||||
|
||||
Resolution: explicit primitive (`--server`/`--cluster`/`--store`) → `--profile` /
|
||||
`OMNIGRAPH_PROFILE` → operator flat defaults. Access path is then derived from the
|
||||
scope kind × the verb's capability (see the Resolution rule); it is never inferred
|
||||
from a URI scheme and never toggled.
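
The resolution order can be sketched against the operator config shape shown earlier. This is an illustrative model only: `resolve_scope` and `_binding` are hypothetical names, config is a plain dict, letting `--profile` win over `OMNIGRAPH_PROFILE` is an assumption, and an explicit primitive is modelled as carrying no `default_graph`.

```python
ENTITY_KINDS = ("server", "cluster", "store")


def _binding(scope: dict, source: str) -> dict:
    """Take the entity binding and default_graph wholesale from one scope."""
    kinds = [k for k in ENTITY_KINDS if scope.get(k)]
    if len(kinds) != 1:
        raise ValueError(f"{source} must bind exactly one of {ENTITY_KINDS}")
    kind = kinds[0]
    return {
        "kind": kind,
        "target": scope[kind],
        "default_graph": scope.get("default_graph"),
        "source": source,
    }


def resolve_scope(flags: dict, env: dict, config: dict) -> dict:
    """Explicit primitive, then profile (flag or env), then flat defaults."""
    explicit = {k: flags[k] for k in ENTITY_KINDS if flags.get(k)}
    if explicit:
        return _binding(explicit, "flag")
    name = flags.get("profile") or env.get("OMNIGRAPH_PROFILE")
    if name:
        profiles = config.get("profiles", {})
        if name not in profiles:
            raise ValueError(f"unknown profile {name!r}")
        return _binding(profiles[name], f"profile:{name}")
    return _binding(config.get("defaults", {}), "defaults")


config = {
    "defaults": {"server": "prod", "default_graph": "knowledge", "output": "table"},
    "profiles": {
        "staging": {"server": "staging", "default_graph": "knowledge"},
        "brain-admin": {"cluster": "brain", "default_graph": "knowledge"},
    },
}
everyday = resolve_scope({}, {}, config)
admin = resolve_scope({"profile": "brain-admin"}, {}, config)
```

Because each tier returns a complete binding, there is no per-key merge across the entity dimension, so a result can never be "server and cluster" at once.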
|
||||
|
||||
### What moved vs today
|
||||
|
||||
| Command(s) | Today (plane) | End state (capability) |
|
||||
|---|---|---|
|
||||
| `query`/`mutate`/`load`/`export`/`snapshot`/`branch`/`commit`/`schema show`/`schema apply` | Data | **`any`** (served-default; `--store` ad-hoc) |
|
||||
| `graphs list` | Data (remote-only) | **`served`** |
|
||||
| `queries list` | Session | **`served`** (catalog read) |
|
||||
| `init`/`optimize`/`repair`/`cleanup`/`schema plan`/graph-backed `lint` | Storage | **`direct`** (privileged) |
|
||||
| `queries validate` | Storage | **`control`** (catalog validation) |
|
||||
| `cluster *` | Control | **`control`** (unchanged) |
|
||||
| `policy *`/`embed`/`login`/`logout`/`config`/`version`/offline `lint --query --schema` | Session | **`local`** |
|
||||
| `ingest`; `--target`; `--cluster-graph`; `--uri http` dispatch | present | **removed** |
|
||||
| — | — | **added:** `profile { list \| show }` (read-only) |
|
||||
|
||||
Cross-capability families: `schema` (`plan` is `direct`, `show`/`apply` are
|
||||
`any`), `queries` (`list` is `served`, `validate` is `control`), and `lint`
|
||||
(offline with `--schema` is `local`, graph-backed is `direct`) split per
|
||||
subcommand/mode, exactly where their authority and data dependencies differ.
|
||||
|
||||
## Appendix: current CLI taxonomy (today)
|
||||
|
||||
The **as-is** command surface this RFC transforms, kept so the RFC is
|
||||
self-contained. The source of truth is the exhaustive `command_plane` match in
|
||||
`crates/omnigraph-cli/src/planes.rs`.
|
||||
Where it disagrees with the design above (four planes, `--target`,
|
||||
`--cluster-graph`, scheme-inferred transport), the design is the *target* and this
|
||||
is *today*.
|
||||
|
||||
### The four planes (today)
|
||||
|
||||
| Plane | What it touches | Addressing accepted |
|
||||
|---|---|---|
|
||||
| **Data** | a graph — embedded **or** via a server | `<URI>` · `--target` · `--server` (+`--graph`) |
|
||||
| **Storage** | direct storage, no server | `<URI>` · `--target` (local/S3 only) · some also `--cluster`+`--cluster-graph` |
|
||||
| **Control** | a cluster *directory* | `--config <dir>` |
|
||||
| **Session** | no graph | — |
|
||||
|
||||
`--server`/`--graph` are gated strictly to the data plane; `guard_addressing`
|
||||
(`planes.rs:128`) rejects them elsewhere (RFC-010 Slice 1).
|
||||
|
||||
### Command tree by plane (today)
|
||||
|
||||
```
|
||||
omnigraph
|
||||
├─ DATA ────────── run against a graph; embedded or --server
|
||||
│ ├─ query (alias: read) · mutate (alias: change) · load · ingest (hidden, deprecated)
|
||||
│ ├─ branch { create | list | delete | merge } · snapshot · export · commit { list | show }
|
||||
│ ├─ graphs { list } (remote-only)
|
||||
│ └─ schema { show (alias: get) | apply } ← show/apply are DATA
|
||||
├─ STORAGE ─────── direct file://|s3:// access; --server rejected
|
||||
│ ├─ init · optimize · repair · cleanup (optimize/repair/cleanup also: --cluster --cluster-graph)
|
||||
│ ├─ lint (check shim) · schema plan ← plan is STORAGE
|
||||
│ └─ queries validate
|
||||
├─ CONTROL ─────── cluster directory via --config <dir>
|
||||
│ └─ cluster { validate | plan | apply | approve | status | refresh | import | force-unlock }
|
||||
└─ SESSION ─────── no graph
|
||||
├─ policy { validate | test | explain } · embed · login / logout
|
||||
├─ config { migrate } · queries list ← list is SESSION
|
||||
└─ version (-v)
|
||||
```
|
||||
|
||||
`read`/`change` are visible clap aliases (deprecated names, warn); `check` is an
|
||||
argv-shim → `lint`; `get` aliases `schema show`; `ingest` is hidden but runs.
|
||||
|
||||
### Cross-plane families (today)
|
||||
|
||||
- **`schema`**: `schema plan` is Storage; `schema show`/`apply` are Data.
|
||||
- **`queries`**: `queries validate` is Storage; `queries list` is Session.
|
||||
|
||||
### Addressing forms (today)
|
||||
|
||||
| Form | Looks up in | Resolves to | Source |
|
||||
|---|---|---|---|
|
||||
| `<URI>` / `--uri` | nothing (explicit) | the literal URI | — |
|
||||
| `--target <name>` | `omnigraph.yaml` `graphs:` | that graph's `uri` (local / S3 / **http**) | `config.rs::resolve_target_uri` |
|
||||
| `--server <name>` (+`--graph`) | `~/.omnigraph/config.yaml` `servers:` | a remote server URL | `helpers.rs::resolve_server_flag` |
|
||||
| `--cluster <dir\|s3> --cluster-graph <id>` | served cluster state | the graph's storage URI | `helpers.rs` (RFC-010 Slice 3) |
|
||||
|
||||
Precedence (`resolve_target_uri`): explicit `<URI>`/`--uri` → `--target` →
|
||||
`cli.graph` default → error. `is_remote_uri` (`helpers.rs:15`) then selects
|
||||
`GraphClient::Remote` vs `Embedded` (`client.rs:86`).
|
||||
|
||||
### Enforcement points (today)
|
||||
|
||||
- **`guard_addressing`** (`planes.rs:128`): `--server`/`--graph` on a non-data verb
|
||||
fails with a declared message.
|
||||
- **Storage-plane remote rejection** (`helpers.rs:467`): a storage verb whose
|
||||
`--target` resolves to `http(s)://` is rejected.
|
||||
- **`init` into a cluster layout** is refused (use `cluster apply`).
|
||||
|
||||
## Audit comments
|
||||
|
||||
Reviewed against the current CLI taxonomy, `planes.rs`, `cli.rs`, `helpers.rs`,
|
||||
`client.rs`, RFC-007/RFC-010, and the user-facing CLI/server docs.
|
||||
|
||||
### Validated
|
||||
|
||||
- The target taxonomy now has a stable classifier: `any`, `served`, `direct`,
|
||||
`control`, and `local` are all declared capabilities.
|
||||
- Cluster scope is coherent: it is privileged direct storage for control,
|
||||
maintenance, and validation, not a direct data path. `any` data verbs are served by
|
||||
default and reject cluster scope.
|
||||
- Graph selection is no longer universal. Graph-scoped verbs select a graph;
|
||||
scope-scoped verbs such as `graphs list`, `queries list`, `queries validate`,
|
||||
and `cluster *` address the whole server/cluster scope.
|
||||
- The current-state appendix still matches the implemented CLI: four planes,
|
||||
`--target`, `--cluster-graph`, scheme-inferred transport, `schema plan` as
|
||||
Storage, and `schema show/apply` as Data.
|
||||
|
||||
Decisions and deferrals are tracked in [Decisions](#decisions) above — not
|
||||
duplicated here.
|
||||
295
docs/dev/rfc-012-embedding-provider-config.md
Normal file
|
|
@ -0,0 +1,295 @@
|
|||
# RFC: Provider-Independent Embedding Configuration
|
||||
|
||||
**Status:** Accepted — Phases 1-5 implemented
|
||||
**Date:** 2026-06-15
|
||||
**Builds on:** the engine embedding client (`crates/omnigraph/src/embedding.rs`), the `@embed` catalog
|
||||
annotation (`omnigraph-compiler/src/catalog`), the cluster `providers.embedding` surface
|
||||
([cluster-config-specs.md](cluster-config-specs.md), [rfc-007-operator-config.md](rfc-007-operator-config.md)
|
||||
for the secret-resolution pattern).
|
||||
**Target release:** staged — NFR floor first, then the provider-independent config core; ingest-time `@embed`
|
||||
execution is a separate later phase.
|
||||
|
||||
## Summary
|
||||
|
||||
OmniGraph's embedding subsystem is **hardwired to a single provider (Google Gemini)** and has no recorded
|
||||
link between the model that produced a stored vector and the model that embeds a query string. Today that
|
||||
happens to be self-consistent (one live client embeds both sides), but it is consistent by accident, not by
|
||||
construction: the provider is hardcoded, the model is a moving `-preview` target, nothing validates that a
|
||||
query vector and a stored vector share a space, and the one configurable knob (key + base URL) cannot change
|
||||
the provider or model.
|
||||
|
||||
This RFC makes embedding **provider-independent**: one resolved `EmbeddingConfig { provider, model, base_url,
|
||||
api_key, dim, normalize }` behind a sealed provider abstraction, resolved once and shared by every embedder.
|
||||
The **primary variant is OpenAI-compatible** — a single request/response shape (`POST {base}/embeddings`,
|
||||
`{model, input, dimensions}`) that covers **OpenRouter** (the recommended default gateway, one key for Gemini,
|
||||
OpenAI, Mistral, BGE, Qwen, sentence-transformers, …), OpenAI direct, and any self-hosted OpenAI-compatible
|
||||
endpoint (vLLM, Ollama, LM Studio, Together). A native **Gemini** (`generativelanguage`) variant is retained
|
||||
for shops that want to hit Google directly with its `RETRIEVAL_QUERY`/`RETRIEVAL_DOCUMENT` task-type
|
||||
asymmetry, plus a deterministic **Mock**. The embedding *identity* (provider + model + dim) is recorded in the
|
||||
schema IR so it travels with the data, and a query whose resolved embedder cannot match the stored vectors'
|
||||
recorded identity is **rejected with a typed error instead of silently ranking across vector spaces.**
|
||||
Provider/endpoint wiring lands on the already-reserved cluster `providers.embedding` field; secrets follow the
|
||||
existing operator-credential pattern; no secret ever enters the schema.
|
||||
|
||||
This RFC supersedes the framing in `docs/user/search/embeddings.md` that described "two embedding clients
|
||||
with different defaults" — one of those clients was dead code with zero callers and has been removed (see
|
||||
Phase 1); the OpenAI request shape returns as a first-class *provider variant* of the one client, not as a
|
||||
second parallel client.
|
||||
|
||||
## Motivation
|
||||
|
||||
This work originated in an external handoff that reported a live cross-provider bug: gemini-3072 stored
|
||||
vectors compared against OpenAI-1536 query vectors, silently. Investigation against the current source showed
|
||||
the reported mechanism is **inaccurate** — the OpenAI client it blamed (`omnigraph-compiler/src/embedding.rs`)
|
||||
was `pub(crate)`, `#![allow(dead_code)]`, and had **zero callers**; the live `nearest("string")` path and the
|
||||
offline `omnigraph embed` CLI both use the engine **Gemini** client; and `@embed` does no ingest-time
|
||||
embedding at all. So the documented happy path is self-consistent. But the investigation surfaced four real
|
||||
problems the handoff's instincts correctly smelled:
|
||||
|
||||
- **P1 — Provider is hardwired.** The one live client builds Google `generativelanguage` requests; only key +
|
||||
base URL are configurable, not the provider or model. A non-Gemini shop cannot use `nearest("string")`
|
||||
without a Gemini key, and cannot make it produce non-Gemini vectors. If they store their own vectors and
|
||||
query with `nearest("string")`, the query is embedded with Gemini → a silent cross-space ranking. This is
|
||||
the handoff's failure, reached by a different cause.
|
||||
- **P2 — A dead, divergent second client + stale docs** invited exactly the misdiagnosis the handoff made.
|
||||
- **P3 — No same-space guarantee recorded with the data.** Nothing stamps which model/dim produced a stored
|
||||
vector, so write-side and read-side embedders can drift with no validation.
|
||||
- **P4 — `@embed` is declarative-in-name-only.** It records a source property for the typechecker but never
|
||||
embeds at ingest; the docs claimed otherwise.
|
||||
|
||||
Per the project's first principle, the lower-liability shape is **one provider-independent client with the
|
||||
identity recorded next to the data**, not N independently-defaulted clients kept in lockstep by discipline.
|
||||
Hardcoding one provider mortgages every future "we need OpenAI / a local model / Vertex" against a rewrite;
|
||||
recording identity once closes the silent-wrong-results class by construction.
|
||||
|
||||
## Current state — which API we actually use
|
||||
|
||||
| | Live engine client (`crates/omnigraph/src/embedding.rs`) | Deleted dead client (was `omnigraph-compiler/src/embedding.rs`) |
|
||||
|---|---|---|
|
||||
| Provider | **Google Gemini Developer API** (`generativelanguage`, *not* Vertex AI) | OpenAI |
|
||||
| Endpoint | `POST {base}/models/{model}:embedContent` | `POST {base}/embeddings` |
|
||||
| Auth | header `x-goog-api-key`, env `GEMINI_API_KEY` | `Authorization: Bearer`, env `OPENAI_API_KEY` |
|
||||
| Model | `gemini-embedding-2-preview` (hardcoded) | `text-embedding-3-small` (env `NANOGRAPH_EMBED_MODEL`) |
|
||||
| Base default | `https://generativelanguage.googleapis.com/v1beta` | `https://api.openai.com/v1` |
|
||||
| Request body | `{model, content:{parts:[{text}]}, taskType, outputDimensionality}` | `{model, input:[…], dimensions}` |
|
||||
| Response | `{embedding:{values:[f32]}}` | `{data:[{index, embedding:[f32]}]}` |
|
||||
| Task types | `RETRIEVAL_QUERY` / `RETRIEVAL_DOCUMENT` | none |
|
||||
| Status | **live** — used by `nearest("string")` and `omnigraph embed` | **removed in Phase 1** (zero callers) |
|
||||
|
||||
Both shapes honour a requested output dimensionality (Gemini `outputDimensionality`, OpenAI `dimensions`)
|
||||
driven by the target column width, so dimension is already schema-driven. The two known shapes are exactly the
|
||||
two initial provider variants this RFC defines — the OpenAI shape returns from git history as a `Provider`
|
||||
variant of the single client.
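
To make the two shapes in the table concrete, the sketch below builds each request and reads each response as plain dicts, with no network calls. It is an illustration rather than the engine client: `build_embed_request` and `parse_embedding` are hypothetical names, and the `models/` prefix on the Gemini `model` field follows Google's public REST documentation rather than anything stated in the table.

```python
def build_embed_request(kind, base_url, model, text, dim,
                        task_type="RETRIEVAL_QUERY"):
    """Return (url, json_body) for one embedding call."""
    base = base_url.rstrip("/")
    if kind == "openai-compatible":
        body = {"model": model, "input": [text], "dimensions": dim}
        return f"{base}/embeddings", body
    if kind == "gemini":
        body = {
            "model": f"models/{model}",  # assumption: prefixed form
            "content": {"parts": [{"text": text}]},
            "taskType": task_type,       # RETRIEVAL_QUERY or RETRIEVAL_DOCUMENT
            "outputDimensionality": dim,
        }
        return f"{base}/models/{model}:embedContent", body
    raise ValueError(f"unknown provider kind: {kind!r}")


def parse_embedding(kind, payload):
    """Extract the vector from a provider response payload."""
    if kind == "openai-compatible":
        # Entries carry an index; take the one for the single input.
        first = min(payload["data"], key=lambda item: item["index"])
        return first["embedding"]
    if kind == "gemini":
        return payload["embedding"]["values"]
    raise ValueError(f"unknown provider kind: {kind!r}")
```

In both branches `dim` comes from the target column width, which is what keeps dimension schema-driven regardless of provider.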
|
||||
|
||||
## Guide-level explanation
|
||||
|
||||
### Configuring a provider (operator view)
|
||||
|
||||
Pick a provider for the graph in `cluster.yaml` (the team-owned surface), referencing a secret by name. The
|
||||
recommended default routes through OpenRouter (OpenAI-compatible, one key for many models):
|
||||
|
||||
```yaml
providers:
  embedding:
    default:
      kind: openai-compatible   # openai-compatible | gemini | mock
      base_url: https://openrouter.ai/api/v1
      model: google/gemini-embedding-2   # or openai/text-embedding-3-large, mistralai/mistral-embed, …
      api_key: ${OPENROUTER_API_KEY}
graphs:
  knowledge:
    schema: knowledge.pg
    embedding_provider: default
```
|
||||
|
||||
The same `openai-compatible` kind points at OpenAI direct (`base_url: https://api.openai.com/v1`,
|
||||
`model: text-embedding-3-large`) or a self-hosted endpoint (vLLM/Ollama/LM Studio) by changing `base_url`. Use
|
||||
`kind: gemini` only to reach Google's `generativelanguage` API directly (it keeps the query/document
|
||||
task-type asymmetry that the OpenAI-compatible shape does not expose). Dimensions are schema-driven by the
|
||||
target `Vector(N)` column, not duplicated in the provider profile.
|
||||
|
||||
The zero-config tier keeps working with env only (`OMNIGRAPH_EMBED_PROVIDER`, `OMNIGRAPH_EMBED_BASE_URL`,
|
||||
`OMNIGRAPH_EMBED_MODEL`, and the provider api-key env — `OPENROUTER_API_KEY` / `OPENAI_API_KEY` /
|
||||
`GEMINI_API_KEY`), so no cluster file is required for a single-graph setup.
|
||||
|
||||
### Recording identity in the schema
|
||||
|
||||
`@embed` grows optional arguments that pin the embedding identity to the vector column:
|
||||
|
||||
```pg
node Doc {
    slug: String @key
    text: String
    v: Vector(3072) @embed("text", model="gemini-embedding-2", dim=3072) @index
}
```
|
||||
|
||||
The single-argument form `@embed("text")` keeps working unchanged. The recorded identity persists in the
|
||||
schema IR (`_schema.ir.json`) and so travels with `schema apply` and `schema show`.
|
||||
|
||||
### What a mismatch looks like
|
||||
|
||||
If the resolved read-side embedder cannot produce the recorded identity (wrong model, wrong dim, wrong
|
||||
provider), `nearest($v, "string")` fails with a typed error naming both sides, instead of returning a
|
||||
plausible-but-meaningless ranking. Changing the recorded identity on an existing column is a loud schema-apply
|
||||
refusal (it is a re-embed, a deliberate migration step), reusing the migration planner's existing
|
||||
annotation-change rejection.
|
||||
|
||||
## Reference-level design
|
||||
|
||||
### One client, sealed provider abstraction
|
||||
|
||||
Replace the two-variant `EmbeddingTransport` with a resolved config plus a sealed provider enum:
|
||||
|
||||
```text
EmbeddingConfig { provider: Provider, model, base_url, api_key, dim, normalize }
enum Provider {
    OpenAiCompatible, // POST {base}/embeddings, Bearer auth, {model, input, dimensions} → {data:[{embedding,index}]}
                      // covers OpenRouter (default gateway), OpenAI direct, vLLM/Ollama/LM Studio/Together
    Gemini,           // POST {base}/models/{model}:embedContent, x-goog-api-key, with RETRIEVAL_QUERY/DOCUMENT task types
    Mock,             // deterministic, offline
}
struct EmbeddingClient { config, http, retry, deadline }
```
|
||||
|
||||
`Provider` owns the per-API differences (endpoint suffix, auth header, request JSON, response JSON, task-type
|
||||
support); the client owns retry/backoff, the deadline, normalization, and tracing — all provider-independent.
|
||||
**OpenRouter is not a distinct variant** — it is `OpenAiCompatible` with `base_url =
|
||||
https://openrouter.ai/api/v1`, which is the point: one OpenAI-compatible shape gives provider-independence
|
||||
across every model OpenRouter fronts, so the gateway does the multi-provider fan-out and OmniGraph carries one
|
||||
request shape. The native `Gemini` variant exists only for direct-to-Google with task-type asymmetry. An enum
|
||||
(not a trait) is the earned complexity for this small, first-party set; if third-party plug-in providers are
|
||||
ever needed, the enum becomes a trait behind the same `EmbeddingConfig` surface without touching callers.
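The split of responsibilities described above can be sketched as a small enum that owns only the per-API differences. This is a minimal illustration, not the engine's actual code: the method names `endpoint`, `auth_header`, and `supports_task_types` are hypothetical, and only the endpoint suffixes and header names are taken from the table in this RFC.

```rust
/// Sealed set of provider shapes, mirroring the variants named in this RFC.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provider {
    OpenAiCompatible,
    Gemini,
    Mock,
}

impl Provider {
    /// Full request URL for a base URL and model (hypothetical helper).
    /// The mock is offline, so it has no endpoint.
    pub fn endpoint(self, base_url: &str, model: &str) -> Option<String> {
        let base = base_url.trim_end_matches('/');
        match self {
            Provider::OpenAiCompatible => Some(format!("{base}/embeddings")),
            Provider::Gemini => Some(format!("{base}/models/{model}:embedContent")),
            Provider::Mock => None,
        }
    }

    /// Auth header as (name, value) for an API key (hypothetical helper).
    pub fn auth_header(self, api_key: &str) -> Option<(&'static str, String)> {
        match self {
            Provider::OpenAiCompatible => {
                Some(("Authorization", format!("Bearer {api_key}")))
            }
            Provider::Gemini => Some(("x-goog-api-key", api_key.to_string())),
            Provider::Mock => None,
        }
    }

    /// Only the native Gemini shape carries query/document task types.
    pub fn supports_task_types(self) -> bool {
        matches!(self, Provider::Gemini)
    }
}
```

Under this sketch OpenRouter needs no variant of its own: it is `Provider::OpenAiCompatible` with a different `base_url`, and everything provider-independent (retry, deadline, normalization, tracing) stays outside the enum.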
|
||||
|
||||
The OpenAI-compatible `input` accepts an **array**, giving batch embedding for free — which the later
|
||||
ingest phase needs for throughput, and which removes the open dependency on Gemini's native
|
||||
`batchEmbedContents`.
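A rough sketch of how batching over the `input` array might look on the client side. It assumes (this RFC does not say so) that response items are not guaranteed to arrive in input order, which is why each item's `index` is used to place its vector. `EmbeddingItem`, `batches`, `reorder`, and the `max_batch` cap are hypothetical names for illustration.

```rust
/// One element of an OpenAI-compatible response's `data` array.
pub struct EmbeddingItem {
    pub index: usize,
    pub embedding: Vec<f32>,
}

/// Split inputs into request-sized batches. `max_batch` is a hypothetical
/// per-request cap; a value of 0 is treated as 1 to avoid a panic in `chunks`.
pub fn batches<'a>(
    inputs: &'a [String],
    max_batch: usize,
) -> impl Iterator<Item = &'a [String]> {
    inputs.chunks(max_batch.max(1))
}

/// Place each returned vector in the slot named by its `index`, and fail
/// loudly on duplicates, out-of-range indices, or missing items.
pub fn reorder(
    items: Vec<EmbeddingItem>,
    expected: usize,
) -> Result<Vec<Vec<f32>>, String> {
    let mut slots: Vec<Option<Vec<f32>>> = vec![None; expected];
    for item in items {
        let index = item.index;
        let slot = slots
            .get_mut(index)
            .ok_or_else(|| format!("index {index} out of range"))?;
        if slot.is_some() {
            return Err(format!("duplicate index {index}"));
        }
        *slot = Some(item.embedding);
    }
    slots
        .into_iter()
        .enumerate()
        .map(|(i, slot)| slot.ok_or_else(|| format!("missing index {i}")))
        .collect()
}
```

Treating a short or duplicated response as an error rather than padding it keeps a partial batch from silently pairing vectors with the wrong rows.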
|
||||
|
||||
### Config resolution (resolved once, shared)
|
||||
|
||||
Precedence for served cluster graphs, highest first: applied cluster `providers.embedding.<name>` profile →
|
||||
env (`OMNIGRAPH_EMBED_*`, provider api-key env) → built-in defaults. The cluster `api_key` value is a
|
||||
`${NAME}` env reference resolved at server boot; plaintext never lives in the schema, state ledger, or any
|
||||
checked-in file. Resolution happens once per graph handle; the resolved client is shared by
|
||||
`nearest("string")`. Direct single-graph serving, embedded callers, and the offline CLI keep the env path
|
||||
unless they inject an `EmbeddingConfig` directly.
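The precedence above can be sketched as a pure function over an optional applied profile and an environment map. Everything here is illustrative: the struct and function names are hypothetical, the built-in defaults are placeholders (the project default model is still an open question in this RFC), and the order in which the API-key variables are tried on the env path is an assumption.

```rust
use std::collections::HashMap;

/// Fields of an applied `providers.embedding.<name>` profile.
pub struct Profile {
    pub kind: String,
    pub base_url: String,
    pub model: String,
    pub api_key: String, // must be a `${NAME}` reference, never plaintext
}

#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedEmbedding {
    pub provider: String,
    pub base_url: String,
    pub model: String,
    pub api_key: Option<String>,
}

/// Resolve a `${NAME}` reference against the environment; reject plaintext.
pub fn resolve_secret_ref(
    value: &str,
    env: &HashMap<String, String>,
) -> Result<String, String> {
    let name = value
        .strip_prefix("${")
        .and_then(|rest| rest.strip_suffix('}'))
        .ok_or_else(|| "api_key must be a ${NAME} reference".to_string())?;
    env.get(name)
        .cloned()
        .ok_or_else(|| format!("env var {name} is not set"))
}

/// Cluster profile first, then env, then built-in defaults.
pub fn resolve(
    profile: Option<&Profile>,
    env: &HashMap<String, String>,
) -> Result<ResolvedEmbedding, String> {
    if let Some(p) = profile {
        return Ok(ResolvedEmbedding {
            provider: p.kind.clone(),
            base_url: p.base_url.clone(),
            model: p.model.clone(),
            api_key: Some(resolve_secret_ref(&p.api_key, env)?),
        });
    }
    let get = |key: &str, default: &str| {
        env.get(key).cloned().unwrap_or_else(|| default.to_string())
    };
    let provider = get("OMNIGRAPH_EMBED_PROVIDER", "openai-compatible");
    // Placeholder defaults and key lookup order: assumptions, not settled values.
    let (base, model, key_vars): (&str, &str, &[&str]) = if provider == "gemini" {
        (
            "https://generativelanguage.googleapis.com/v1beta",
            "gemini-embedding-2",
            &["GEMINI_API_KEY"],
        )
    } else {
        (
            "https://openrouter.ai/api/v1",
            "google/gemini-embedding-2",
            &["OPENROUTER_API_KEY", "OPENAI_API_KEY"],
        )
    };
    Ok(ResolvedEmbedding {
        base_url: get("OMNIGRAPH_EMBED_BASE_URL", base),
        model: get("OMNIGRAPH_EMBED_MODEL", model),
        api_key: key_vars.iter().find_map(|k| env.get(*k).cloned()),
        provider,
    })
}
```

Passing the environment in as a map, rather than reading `std::env` inside the function, keeps the resolution step deterministic and easy to test; a caller would build the map once at server boot.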
|
||||
|
||||
### Identity recorded in the schema IR (not a new store)
|
||||
|
||||
The `@embed` args serialize into `PropertyIR.annotations` → `_schema.ir.json`, which `schema apply` already
|
||||
persists atomically and which the catalog (the one thing `nearest()` reads at query time) is built from. No
|
||||
new metadata store, no manifest column, no extra read on the query path. The migration planner already rejects
|
||||
non-description annotation changes as `UnsupportedChange`, so "recorded identity is immutable without a
|
||||
deliberate re-embed migration" is the default behaviour, not new code. (A second, optional copy in Lance
|
||||
field metadata — co-located with the vectors — is available later by activating the currently no-op
|
||||
`UpdatePropertyMetadata` migration step; out of scope here.)
|
||||
|
||||
### Query-time validation
|
||||
|
||||
`resolve_nearest_query_vec` compares the resolved read-side identity against the column's recorded identity
|
||||
before embedding; on mismatch it returns a typed `OmniError` naming recorded vs resolved (model, dim,
|
||||
provider). This is the only behaviour that closes P3 by construction.
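A minimal sketch of the comparison, under stated assumptions: the real check lives in `resolve_nearest_query_vec` and returns an `OmniError`, whereas the types below (`RecordedIdentity`, `ResolvedIdentity`, `IdentityMismatch`) are hypothetical stand-ins. Matching is plain string equality here; whether a gateway-prefixed name such as `google/gemini-embedding-2` should equal `gemini-embedding-2` is not decided by this sketch.

```rust
use std::fmt;

/// Identity recorded by `@embed(...)`; `None` means "not recorded, no check".
#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct RecordedIdentity {
    pub provider: Option<String>,
    pub model: Option<String>,
    pub dim: Option<u32>,
}

/// Identity the read-side embedder resolved to.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ResolvedIdentity {
    pub provider: String,
    pub model: String,
    pub dim: u32,
}

/// Typed error naming both sides of the first mismatching field.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IdentityMismatch {
    pub field: &'static str,
    pub recorded: String,
    pub resolved: String,
}

impl fmt::Display for IdentityMismatch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "embedding {} mismatch: column recorded `{}`, resolved embedder is `{}`",
            self.field, self.recorded, self.resolved
        )
    }
}

/// Compare before embedding; unrecorded fields are skipped.
pub fn check_identity(
    recorded: &RecordedIdentity,
    resolved: &ResolvedIdentity,
) -> Result<(), IdentityMismatch> {
    let recorded_dim = recorded.dim.map(|d| d.to_string());
    let resolved_dim = resolved.dim.to_string();
    let checks = [
        ("provider", recorded.provider.as_deref(), resolved.provider.as_str()),
        ("model", recorded.model.as_deref(), resolved.model.as_str()),
        ("dim", recorded_dim.as_deref(), resolved_dim.as_str()),
    ];
    for (field, rec, res) in checks {
        if let Some(rec) = rec {
            if rec != res {
                return Err(IdentityMismatch {
                    field,
                    recorded: rec.to_string(),
                    resolved: res.to_string(),
                });
            }
        }
    }
    Ok(())
}
```

Because every recorded field is optional, a column declared with the single-argument `@embed("text")` passes through with no check, matching the additive behaviour this RFC describes.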
|
||||
|
||||
### NFR floor (independent of the provider work)
|
||||
|
||||
- **Deadline:** wrap every embed call (query or document) in a total-operation deadline
|
||||
(`OMNIGRAPH_EMBED_DEADLINE_MS`) so a degraded provider cannot hang the caller for the current ~121 s worst
|
||||
case (4 × 30 s timeout + backoff).
|
||||
- **Observability:** `tracing` span per embed call (provider, model, dim, attempts, outcome, elapsed; `warn!`
|
||||
per retry; token usage when the provider returns it). The subsystem has zero instrumentation today.
|
||||
- **Single normalization:** one `normalize_vector` (the dead client carried a divergent second copy; removed
|
||||
in Phase 1).
|
||||
- **Stable model:** make the model configurable and default to a stable (non-`-preview`) model once the GA
|
||||
name is confirmed.
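Two of the items above, the total-operation deadline and the single normalization, can be sketched together. This is a synchronous illustration with hypothetical names (`call_with_deadline`, `EmbedError`); the live client is presumably async, and L2 normalization that leaves a zero vector unchanged is an assumption about what `normalize_vector` does.

```rust
use std::time::{Duration, Instant};

/// L2-normalize in place; a zero vector is left unchanged (assumption).
pub fn normalize_vector(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

#[derive(Debug, PartialEq)]
pub enum EmbedError {
    DeadlineExceeded { attempts: u32 },
    Provider(String),
}

/// Retry `attempt` with exponential backoff, bounded by one total deadline.
/// Each attempt receives the remaining budget to use as its own timeout.
pub fn call_with_deadline<T>(
    deadline: Duration,
    max_attempts: u32,
    base_backoff: Duration,
    mut attempt: impl FnMut(Duration) -> Result<T, String>,
) -> Result<T, EmbedError> {
    let start = Instant::now();
    let mut last_err = String::from("no attempts made");
    for n in 0..max_attempts {
        let remaining = match deadline.checked_sub(start.elapsed()) {
            Some(r) if !r.is_zero() => r,
            _ => return Err(EmbedError::DeadlineExceeded { attempts: n }),
        };
        match attempt(remaining) {
            Ok(value) => return Ok(value),
            Err(e) => last_err = e,
        }
        if n + 1 < max_attempts {
            // Back off, but never sleep past the overall deadline.
            let backoff = base_backoff.saturating_mul(2u32.saturating_pow(n));
            let left = deadline.saturating_sub(start.elapsed());
            std::thread::sleep(backoff.min(left));
        }
    }
    Err(EmbedError::Provider(last_err))
}
```

With a per-attempt timeout derived from the remaining budget, four attempts can no longer add up to the roughly 121 s worst case; the caller sees `DeadlineExceeded` once the total budget is spent.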
|
||||
|
||||
### Ingest-time `@embed` (later phase, not this RFC's core)
|
||||
|
||||
Making `@embed` embed at ingest is a separate phase with a hard constraint: embedding is a slow, external,
|
||||
**non-idempotent** side effect, so it must run **entirely before staging** — in the pure in-memory phase,
|
||||
before any `stage_*`/Lance HEAD move, alongside the existing constraint validation — so a mid-load provider
|
||||
failure aborts with zero drift. It must never sit inside or after the commit protocol, because the recovery
|
||||
sweep cannot re-run or undo an external embedding. It also needs a content-hash skip (so `load --mode
|
||||
overwrite` does not re-bill every row), batching, and a bounded-concurrency stage. Specified here only to fix
|
||||
the design constraint; deferred to its own RFC/phase.
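To make the constraint concrete, here is a hedged sketch of the "embed entirely before staging" shape with a content-hash skip. All names are hypothetical, batching and bounded concurrency are omitted, and `DefaultHasher` is used only to keep the example dependency-free: its output is not guaranteed stable across Rust releases, so a persisted hash would need a stable digest instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

pub struct Row {
    pub key: String,
    pub text: String,
}

/// Hash of the source text (illustrative; not stable across Rust versions).
fn content_hash(text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    text.hash(&mut hasher);
    hasher.finish()
}

/// Pure in-memory phase: produce a vector for every row, or fail with
/// nothing staged. `previous` maps row key to (content hash, stored vector),
/// so rows whose text is unchanged are not sent to the provider again.
pub fn embed_before_staging(
    rows: &[Row],
    previous: &HashMap<String, (u64, Vec<f32>)>,
    mut embed: impl FnMut(&str) -> Result<Vec<f32>, String>,
) -> Result<Vec<(String, u64, Vec<f32>)>, String> {
    let mut out = Vec::with_capacity(rows.len());
    for row in rows {
        let hash = content_hash(&row.text);
        let vector = match previous.get(&row.key) {
            Some((old_hash, old_vec)) if *old_hash == hash => old_vec.clone(),
            // Any provider failure aborts here, before any staging call.
            _ => embed(&row.text)?,
        };
        out.push((row.key.clone(), hash, vector));
    }
    Ok(out)
}
```

Only after this function returns `Ok` would a loader move on to staging, so a mid-load provider failure leaves no partial write to recover.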
|
||||
|
||||
### Phasing (implementation order)
|
||||
|
||||
| Phase | Scope | Demo |
|---|---|---|
| **1 — NFR floor + dead-client removal** | deadline, observability, single normalize, configurable model, delete dead client + `NANOGRAPH_*` | a hung provider fails at the deadline; embed calls traced; `rg NANOGRAPH_` empty |
| **2 — Provider-independent config** | `EmbeddingConfig` + `Provider` enum (OpenAiCompatible covering OpenRouter/OpenAI/local, Gemini, Mock); env-first resolution; client reuse | point `base_url` at OpenRouter, run `nearest("string")`, get correct neighbours vs OpenRouter-stored vectors; CLI shares the config |
| **3 — Record identity in schema IR** | `@embed` args grammar + catalog + IR persistence | `schema show` reflects recorded model/dim |
| **4 — Query-time validation** | compare resolved vs recorded; typed error; planner refusal on identity change | stored model A vs read model B → loud error, never silent garbage |
| **5 — Cluster provider wiring** | `providers.embedding` resources; `graphs.<id>.embedding_provider`; `${NAME}` resolution at server boot | provider profile resolved from applied cluster state; legacy `omnigraph.yaml` untouched |
| later | ingest-time `@embed` (Shape C) | separate RFC |
|
||||
|
||||
**Status:** Phases 1–5 are implemented (`@embed("…", model="…")` is recorded in the schema IR and validated at
|
||||
query time with a typed same-space error; an unrecorded `@embed` keeps working with no check; cluster-served
|
||||
graphs can bind an applied `providers.embedding` profile). Ingest-time `@embed` remains.
|
||||
|
||||
## Invariants & deny-list check
|
||||
|
||||
- **Invariant 9 (integrity failures are loud):** strengthened — query-time identity mismatch becomes a typed
|
||||
error instead of silent wrong results.
|
||||
- **Invariant 10 (query semantics are first-class IR concepts):** embedding identity becomes IR/catalog data,
|
||||
not an out-of-band env guess.
|
||||
- **Invariant 11 (transport stays at the boundary):** strengthened — Phase 1 removes the HTTP client + async
|
||||
runtime (`reqwest`, `tokio`) from `omnigraph-compiler`, whose own manifest advertises "Zero Lance
|
||||
dependency"; the embedding HTTP client lives only in the engine.
|
||||
- **Invariant 12 / secret handling:** api-keys resolve through the existing credential chain; never in schema
|
||||
or checked-in config.
|
||||
- **Invariant 13 (bounded & observable):** addressed — the deadline bounds latency; tracing makes the
|
||||
subsystem observable.
|
||||
- **Deny-list — "silent fallback / dropped rows":** the cross-space ranking is exactly a silent-wrong-result;
|
||||
this RFC closes it.
|
||||
- **Deny-list — "new write paths that advance Lance HEAD before manifest publish without a recovery
|
||||
sidecar":** the ingest phase (deferred) explicitly keeps embedding *before* staging, so it does not create a
|
||||
new HEAD-advancing write path. No invariant is weakened.
|
||||
|
||||
## Drawbacks & alternatives
|
||||
|
||||
- **Do nothing.** The happy path works today, so the live risk is narrow (P1 + P3). But the provider hardwiring
|
||||
and missing validation are a latent silent-wrong-results class that bites the first non-Gemini user.
|
||||
- **Interim env-only provider switch (no schema record).** Cheaper, but leaves the same-space guarantee to
|
||||
operator discipline (fails P3). Folded in as Phase 2's env-first resolution, with Phases 3–4 adding the
|
||||
record/validate guarantee.
|
||||
- **Trait-based provider plug-ins now.** Rejected as unearned complexity for two first-party providers; the
|
||||
enum upgrades to a trait behind the same surface if needed.
|
||||
- **Stamp identity in the manifest or Lance field metadata instead of the IR.** The manifest is the wrong
|
||||
granularity; field metadata needs net-new wiring and a query-path dataset open. The IR is where `@embed`
|
||||
already lives and is already read at query time (see spike).
|
||||
|
||||
## Reversibility
|
||||
|
||||
Mostly reversible. Phases 1–2 and 5 are code/config (env, CLI, cluster keys) and cheap to undo. Phase 3
|
||||
(recording identity in the schema IR) is **near-permanent** — it changes the on-disk `_schema.ir.json` shape
|
||||
and the schema hash — so it earns the most scrutiny: the single-arg `@embed` form stays byte-compatible, and
|
||||
recorded identity is additive (absent identity = today's behaviour). Provider request/response shapes are
|
||||
external API contracts, not our format, so adding providers is reversible.
|
||||
|
||||
## Gateway tradeoff (OpenRouter)
|
||||
|
||||
Routing through OpenRouter (the default) buys provider-independence with one key and one billing relationship,
|
||||
batch input, and access to the GA `google/gemini-embedding-2`. Costs to accept, all controllable:
|
||||
|
||||
- **Extra network hop** → more query-path latency. The Phase-1 deadline bounds it; the cache mitigates repeats.
|
||||
- **Text transits a third party.** OpenRouter's `provider: { data_collection }` routing preference controls
|
||||
retention; shops with strict residency requirements use `kind: gemini`/`openai-compatible` pointed at the
|
||||
provider (or a self-hosted endpoint) directly instead of the gateway. Provider-independence means this is a
|
||||
config change, not a code change.
|
||||
- **Loses Gemini's task-type asymmetry** when Gemini is reached via the OpenAI-compatible gateway (both sides
|
||||
embed symmetrically). This is a retrieval-quality cost, **not** a same-space correctness cost — both stored
|
||||
and query vectors take the identical path, so they stay in one space by construction. Shops that want the
|
||||
asymmetry use `kind: gemini`.
|
||||
|
||||
## Unresolved questions
|
||||
|
||||
- GA Gemini model name — **resolved:** `google/gemini-embedding-2` (via OpenRouter) / `gemini-embedding-2`
|
||||
(direct), 128–3072 dims (recommended 768/1536/3072). Default flips off `-preview` in Phase 2.
|
||||
- Gemini `batchEmbedContents` availability — **moot** when going through the OpenAI-compatible gateway (its
|
||||
`input` array batches); still relevant only for the direct `kind: gemini` path.
|
||||
- Identity granularity: per-vector-property args vs one graph-level default profile referenced by name.
|
||||
- Whether to backfill recorded identity for existing graphs, or treat absent-identity as "unvalidated, legacy"
|
||||
permanently.
|
||||
- Default model for the zero-config tier: `google/gemini-embedding-2` vs `openai/text-embedding-3-large`
|
||||
(both 3072-capable) — pick the project default.
|
||||
|
|
@ -7,7 +7,7 @@ This file is the always-on map of the test surface. **Consult it before every ta
|
|||
| Crate | Path | Style |
|
||||
|---|---|---|
|
||||
| `omnigraph` (engine) | `crates/omnigraph/tests/` | Integration tests (28 files), fixture-driven, share `tests/helpers/mod.rs` |
|
||||
| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | Per-area suites (post-modularization): `cli_cluster.rs` (cluster command surface + operator-actor cascade), `cli_cluster_e2e.rs` (spawned-binary lifecycle compositions — lost-state re-import recovery, out-of-band drift, graph-root destruction, multi-graph mixed-disposition convergence), `cli_data.rs` (load/read/change/branch/commit/export/snapshot/policy/embed/maintenance + operator format cascade), `cli_schema_config.rs` (init/config, schema plan/apply, RFC-008 deprecation warnings + `config migrate` + strict mode), `cli_queries.rs`, `system_local.rs` (full-cycle cluster lifecycle with a spawned `--cluster` server, applied-policy enforcement over HTTP, keyed-credential auth, operator aliases), `system_remote.rs`; share `tests/support/mod.rs` (hermetic `OMNIGRAPH_HOME` by default) |
|
||||
| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | Per-area suites (post-modularization): `cli_cluster.rs` (cluster command surface + operator-actor cascade), `cli_cluster_e2e.rs` (spawned-binary lifecycle compositions — lost-state re-import recovery, out-of-band drift, graph-root destruction, multi-graph mixed-disposition convergence), `cli_data.rs` (load/read/change/branch/commit/export/snapshot/policy/embed/maintenance + operator format cascade), `cli_schema_config.rs` (init/config, schema plan/apply), `cli_queries.rs`, `parity_matrix.rs` (RFC-009 Phase 1: the embedded-vs-remote referee — every forked verb run against both arms with matched Cedar policy and the same actor, scrubbed-JSON + exit-code equality; divergences are pinned in its `KNOWN_DIVERGENCES` ledger, never silently repaired), `system_local.rs` (full-cycle cluster lifecycle with a spawned `--cluster` server, applied-policy enforcement over HTTP, keyed-credential auth, operator aliases), `system_remote.rs`; share `tests/support/mod.rs` (hermetic `OMNIGRAPH_HOME` by default) |
|
||||
| `omnigraph-cluster` | mostly in-source `#[cfg(test)] mod tests`; `tests/failpoints.rs` (feature-gated); `tests/s3_cluster.rs` (bucket-gated full lifecycle on object storage) | Cluster config parser, local JSON state diff, state CAS/lock handling/recovery, read-only validate/plan/status plus explicit refresh/import graph observations, config-only apply (content-addressed payload publish, disposition gating, composite-digest convergence, idempotent re-apply), catalog payload verification (status read-only, refresh drift + self-heal), failpoint crash-mid-apply / CAS-race coverage, Stage 4A graph creation (create executor, recovery sidecars + sweep rows, create crash windows), Stage 4B schema apply (migration previews in plan, schema executor, schema-apply sweep classification, schema crash windows), Stage 4C gated deletes (digest-bound approvals, delete executor + tombstones, delete sweep rows, delete crash windows), and 5A policy binding metadata (applies_to in the applied revision, binding-change diffing + convergence, pre-5A backfill), and the 5B serving-snapshot read API (converged read, refusal rows) |
|
||||
| `omnigraph-server` | `crates/omnigraph-server/tests/` | Per-area suites (post-modularization): `auth_policy.rs`, `data_routes.rs`, `schema_routes.rs`, `stored_queries.rs`, `multi_graph.rs` (cluster-mode boot — converged serving, policy binding wiring, boot refusals — + the concurrent branch-ops matrix), `boot_settings.rs` (mode inference, PolicySource), `s3.rs` (bucket-gated: single-graph serving + config-free `--cluster s3://` boot), `openapi.rs` (OpenAPI drift / regeneration); share `tests/support/mod.rs` |
|
||||
| `omnigraph-compiler` | mostly in-source `#[cfg(test)] mod tests` | Parser, type-checker, IR lowering, lint |
|
||||
|
|
@ -29,7 +29,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
|
||||
| `changes.rs` | `diff_between` / `diff_commits` |
|
||||
| `consistency.rs` | Cross-table snapshot isolation, atomic publish |
|
||||
| `schema_apply.rs` | Migration plan + apply, schema-apply lock |
|
||||
| `schema_apply.rs` | Migration plan + apply, schema-apply lock; index materialization deferred to the reconciler (iss-848): `apply_schema_defers_vector_index_on_empty_table` (an empty-table Vector `@index` never aborts the apply) and `index_only_constraint_apply_touches_no_table_data` (adding an `@index` is metadata-only — no table-version bump) |
|
||||
| `search.rs` | FTS / vector / hybrid (`bm25`, `nearest`, `rrf`) |
|
||||
| `traversal.rs` | `Expand`, variable-length hops, anti-join (CSR path — `OMNIGRAPH_TRAVERSAL_MODE` unset) |
|
||||
| `traversal_indexed.rs` | BTREE-indexed Expand (`execute_expand_indexed`) forced via `OMNIGRAPH_TRAVERSAL_MODE`, asserted semantically equal to the CSR path; own binary, all `#[serial]` so env writes never race |
|
||||
|
|
@ -42,7 +42,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
|
||||
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
|
||||
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
|
||||
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice |
|
||||
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice; the index reconciler (iss-848): `index_build_tolerates_null_vector_rows` (an untrainable Vector column defers instead of aborting the build, sibling indexes still build) and `optimize_materializes_index_declared_but_unbuilt` (optimize creates a declared-but-deferred index) |
|
||||
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). |
|
||||
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
|
||||
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |
|
||||
|
|
|
|||
|
|
@ -19,8 +19,14 @@ publisher's row-level CAS on `__manifest` is the single fence.
|
|||
`__run__*` branch on an upgraded graph is swept off `__manifest` by the
|
||||
v2→v3 internal-schema migration on first read-write open. (The inert
|
||||
`_graph_runs.lance` bytes remain until a `delete_prefix` primitive lands.)
|
||||
- Cancelled mutation futures leave **no graph-level state** — only orphaned
|
||||
Lance fragments, which the existing `omnigraph cleanup` pipe reclaims.
|
||||
- Cancelled mutation futures leave **no graph-visible state** — the manifest
|
||||
is never advanced. They can leave two kinds of unreferenced residue, both
|
||||
self-healing: orphaned Lance fragments (reclaimed by `omnigraph cleanup`),
|
||||
and — on the *first* write to a table on a branch, which forks it before the
|
||||
publish — a manifest-unreferenced branch ref. The next write to that table
|
||||
reclaims the stale fork and re-forks (`reclaim_orphaned_fork_and_refork`),
|
||||
and `cleanup`'s per-table reconciler is the guaranteed backstop; see the
|
||||
fork-reclaim note in [invariants.md](invariants.md).
|
||||
|
||||
## Read-your-writes within a multi-statement mutation
|
||||
|
||||
|
|
@ -80,10 +86,17 @@ deferred to a follow-up cycle — tracked).
|
|||
Three writers have been migrated onto staged primitives:
|
||||
|
||||
* **`ensure_indices`** (`db/omnigraph/table_ops.rs::build_indices_on_dataset_for_catalog`)
|
||||
— scalar indices (BTree, Inverted) now use `stage_create_*_index` +
|
||||
`commit_staged`. Vector indices stay inline (residual — Lance
|
||||
`build_index_metadata_from_segments` is `pub(crate)` in 6.0.1;
|
||||
companion ticket to lance-format/lance#6658 needed).
|
||||
— scalar indices (BTree, Inverted) use `stage_create_*_index` +
|
||||
`commit_staged`. Which index a `@index`/`@key` property gets is dispatched by
|
||||
type via `node_prop_index_kind` (enum + orderable scalar → BTree, free-text
|
||||
String → Inverted/FTS, Vector → vector). Vector indices stay inline (residual
|
||||
— Lance `build_index_metadata_from_segments` is `pub(crate)` in 6.0.1;
|
||||
companion ticket to lance-format/lance#6658 needed). This build is
|
||||
existence-gated (it creates a *missing* index over current fragments); folding
|
||||
fragments appended afterward into an *existing* index is `optimize`'s
|
||||
`optimize_indices` pass — an inline-commit residual, not a staged write (Lance
|
||||
exposes no uncommitted index-optimize), covered by the optimize recovery
|
||||
sidecar (see [maintenance.md](../user/operations/maintenance.md)).
|
||||
* **`branch_merge::publish_rewritten_merge_table`**
|
||||
(`exec/merge.rs`) — merge_insert now uses `stage_merge_insert` +
|
||||
`commit_staged`. Deletes stay inline (Lance #6658 residual).
|
||||
|
|
@ -305,7 +318,7 @@ success and one failure. The losing writer's error is
|
|||
`ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected,
|
||||
actual }`. The HTTP server maps this to **409 Conflict** with body
|
||||
`{"error": "...", "code": "conflict", "manifest_conflict": { "table_key":
|
||||
"...", "expected": N, "actual": M }}` — see [docs/user/server.md](../user/server.md).
|
||||
"...", "expected": N, "actual": M }}` — see [docs/user/server.md](../user/operations/server.md).
|
||||
|
||||
## Audit
|
||||
|
||||
|
|
|
|||