Commit graph

42 commits

Author SHA1 Message Date
Ragnor Comerford
e62d9166fb
fix: optimize publishes compaction; recovery roll-back converges manifest (#141)
* test(optimize): cover manifest publish + HEAD-drift reconcile

Red against the pre-fix optimize, which ran compact_files without
publishing the compacted version to __manifest:

- maintenance: optimize must publish so the manifest table_version
  tracks the compacted Lance HEAD and a later schema apply succeeds;
  and must reconcile a pre-existing manifest-behind-HEAD drift (forged
  via raw Lance compaction) so strict writes commit again.
- end_to_end + composite_flow: post-optimize query / strict update /
  reopen in the full lifecycle (the canonical flow previously omitted
  post-optimize writes as a documented "known limitation").
- failpoints: a crash between compaction and the manifest publish rolls
  forward on next open.

* fix(optimize): publish compaction to manifest and reconcile HEAD drift

optimize ran Lance compact_files without publishing the new version to
__manifest, so the manifest table_version lagged the Lance HEAD: reads
stayed pinned to the pre-compaction version, and the next schema apply or
strict update/delete failed its HEAD-vs-manifest precondition with
"stale view ... refresh and retry" (open-time recovery rollback inflated
the gap on retry).

optimize now publishes each compacted table's version under the
per-(table, main) write queue, guarded by a manifest CAS and a
SidecarKind::Optimize recovery sidecar (loose-match; roll-forward is safe
because compaction is content-preserving). When a table has nothing left
to compact but its Lance HEAD is already ahead of the manifest pin
(pre-fix drift, or a recovery restore commit), optimize reconciles the
manifest forward to HEAD (metadata-only, no sidecar). Caches and the
CSR/CSC graph index are invalidated after a publish.

Docs updated (maintenance, storage, branches-commits, writes, testing).

* test(recovery): rollback convergence + optimize-defer regressions

Red against the current code, landed before the fix:
- recovery: after the open-time sweep rolls a sidecar back, the manifest
  must track Lance HEAD (no residual drift) so a follow-up schema apply
  succeeds — the original "+1 per retry" loop. Today roll-back restores
  without publishing, so the manifest lags HEAD and the apply fails its
  HEAD-vs-manifest precondition.
- maintenance: optimize must refuse while a recovery sidecar is pending —
  operating on an unrecovered graph could publish a partial write the
  sweep would roll back.

Also removes optimize_reconciles_preexisting_manifest_head_drift: the
ad-hoc drift reconcile it covered is replaced by recovery-side convergence.

* fix(recovery): converge manifest on roll-back; optimize defers on pending recovery

Root of PR #141's review findings and the original "+1 per retry" loop:
a Lance HEAD ahead of the manifest was ambiguous (benign content-preserving
drift vs. a partial write a sidecar will roll back), and optimize's reconcile
guessed it benign. Close the class instead of guessing:

- Recovery roll-back now PUBLISHES the restored version (via a
  push_table_update_at_head helper shared with roll-forward), so the manifest
  tracks the Lance HEAD after recovery — symmetric with roll-forward. This
  fixes the +1 loop (after one roll-back the retry's HEAD-vs-manifest
  precondition passes) and removes the only remaining source of orphaned
  drift. The audit still records the logical rolled-back-to version; the
  manifest is published at the restore commit (identical content).
- optimize drops the ad-hoc drift reconcile and instead REFUSES when a
  __recovery sidecar is pending, so it only ever operates on a recovered
  graph (manifest == HEAD); its compaction publish can no longer commit a
  partial write. With the reconcile gone, the blob-skip-vs-reconcile gap is
  moot.

Updates the rollback recovery-test helper (manifest == HEAD after roll-back),
the failpoints assertions, and the user/dev docs.

* test(recovery): fix rollback assertion for manifest convergence

The roll-back-publishes change makes the manifest version advance after a
SchemaApply roll-back (to the old-schema content), so the
schema_apply_without_schema_staging_rolls_back_on_next_open assertion must
be `version > pre`, not `version == pre`. This update was dropped during
the commit churn and surfaced as a CI Test Workspace failure; the
old-schema-preserved intent stays covered by count_rows + _schema.pg + the
RolledBack convergence invariant.
2026-06-08 02:50:12 +03:00
Ragnor Comerford
d54bccb940
fix(optimize): skip blob-bearing tables to avoid Lance compaction crash (#138)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / Test Windows release binaries (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
* test(optimize): pin Lance blob-column compaction failure as a surface guard

Lance compact_files mis-decodes blob-v2 columns under its forced BlobHandling::AllBinary read ("more fields in the schema than provided column indices"), failing even a pristine uniform-V2_2 multi-fragment blob table; reads use descriptor handling and are unaffected.

Guard 10 reproduces this and is self-retiring: it turns red on the Lance bump that fixes the bug, forcing LANCE_SUPPORTS_BLOB_COMPACTION to flip.

* fix(optimize): skip blob-bearing tables instead of crashing compaction

omnigraph optimize aborted the whole sweep when any node/edge table had a Blob property: Lance compact_files cannot decode blob-v2 columns under AllBinary (the column-index error pinned by the surface guard). Skip blob-bearing tables behind a LANCE_SUPPORTS_BLOB_COMPACTION gate and report them via TableOptimizeStats.skipped / SkipReason (surfaced in the CLI and a tracing::warn) instead of erroring, which also isolates the failure so the other tables still compact.

Reads/writes are unaffected; only fragment/space reclamation on blob tables is deferred until the upstream Lance fix. Adds a maintenance.rs regression test (validated red with the column-index symptom before the fix, green after), a concise v0.6.1 release note, and updates docs (maintenance, cli-reference, AGENTS capability matrix, invariants Known Gaps, lance.md audit, constants).

* refactor(optimize): make TableOptimizeStats and SkipReason non_exhaustive

Both are returned result types, never built by callers, so #[non_exhaustive] makes this the last field/variant addition that can break downstream literal construction and keeps future ones non-breaking (review feedback on the public-field addition). The v0.6.1 Compatibility Notes call out the source-level change.

Also drops the now-stale "RED today / GREEN after the fix lands" narration in the optimize_skips_blob_table_and_reports_skip test (historical regression context now that the fix is in this branch), and folds in the expanded v0.6.1 release note.

* chore(release): bump workspace to v0.6.1

Coherent version bump to accompany the v0.6.1 release note: all five crate manifests + path-dependency constraints, Cargo.lock, the AGENTS.md surveyed-version line, and openapi.json info.version move 0.6.0 -> 0.6.1. Matches the established release pattern (#118 landed the v0.6.0 note + bump together) and resolves the Codex/Devin review flag that a v0.6.1 note without a bump leaves CARGO_PKG_VERSION reporting 0.6.0 and mixed package versions.
2026-06-02 17:12:00 +02:00
Ragnor Comerford
2d5c4b1202
docs: rename runs.md/runs.rs → writes and repoint all references (#131)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / Test Windows release binaries (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
The Run state machine was removed in MR-771 (v0.4.0); `docs/dev/runs.md`
and `crates/omnigraph/tests/runs.rs` have since documented and tested the
direct-publish write path, so the "runs" name was misleading.

- git mv docs/dev/runs.md → docs/dev/writes.md (reframe H1 + intro;
  keep MR-771 history note)
- git mv crates/omnigraph/tests/runs.rs → tests/writes.rs (reframe header)
- repoint every runs.md / runs.rs reference across docs, AGENTS.md, and
  source comments
- fix four pre-existing broken `docs/runs.md` links (the file never lived
  at that path) to `docs/dev/writes.md`
- fix the stale v0.4.0 anchor to the live section

No behavior change: every source edit is a comment. Engine builds and the
renamed test passes 25/25; scripts/check-agents-md.sh passes.

The run-removal cleanup itself (run_registry.rs guard, __run__ prefix) is
deferred to MR-770.
2026-05-30 23:20:56 +02:00
Andrew Altshuler
4e431d28b8
docs: rewrite README opening + add AGENTS.md dev commands (#122)
* docs(agents): add build/test/lint dev-command section

AGENTS.md (CLAUDE.md) covered architecture and invariants but had no
developer command surface — only runtime `omnigraph` CLI usage. Add a
concise "Build, test, lint" section with the non-obvious gotchas:

- crate dir `crates/omnigraph` is package `omnigraph-engine` (the `-p` name)
- canonical CI gate is `cargo test --workspace --locked`
- how to run one file / one fn
- feature-gated suites (`failpoints`, server `aws`)
- S3 tests skip without `OMNIGRAPH_S3_TEST_BUCKET`
- the two non-test CI checks (check-agents-md, OpenAPI drift)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(readme): rewrite opening, dedupe, fix stale references

- New manifesto-style opening (tagline, X-as-code, features, core
  use cases, coordination-layer line); drop the old prose intro,
  Use Cases, and Capabilities sections.
- Remove Capabilities, which restated the new opening line-for-line.
- Harmonize heading case: "## Core Use Cases".
- Dedupe the verbatim Slack invite (kept the Community section) and
  the double-linked cli.md (kept the contextual pointer).
- Fix stale references that no longer match the code:
  - drop "transactional runs" / "and runs" — no run concept remains;
    writes are atomic per-query, multi-query workflows use branches.
  - update the CLI crate command list to canonical query/mutate plus
    commit/lint/optimize/cleanup.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-05-30 00:53:42 +01:00
Ragnor Comerford
e0f13b32c5
(feat): multi-graph server mode (#119)
* mr-668: add GraphId newtype + Cloud-mode forward identity stubs (PR 1/10)

PR 1 of the MR-668 multi-graph server work. Pure types, no runtime
behavior changes yet.

Ships the validated identity vocabulary that the rest of the implementation
will consume:

- `GraphId(String)` — `^[a-zA-Z0-9-]{1,64}$`, leading underscore rejected
  (engine reserves every `_*` filename), reserved route names rejected
  (`policies`, `healthz`, `openapi`, `openapi.json`, `graphs`). Validation
  lives in `try_from` only; serde `Deserialize` re-runs it so JSON payloads
  cannot bypass.
- `TenantId(String)` — same regex shape as GraphId. `None` in Cluster
  mode; reserved for Cloud mode (RFC 0003) where it carries the OAuth
  `org_id` claim.
- `GraphKey { tenant_id: Option<TenantId>, graph_id }` — the registry
  HashMap key. `cluster()` constructor for the Cluster-mode default.
- `Scope` enum with `Full` variant — Cluster mode default; RFC 0004 will
  extend with OAuth scopes (`graph:read`/`write`/`admin`/`*`).
- `AuthSource` enum with `Static` variant — Cluster mode default; RFC
  0001 step 1 will add `Oidc`.
- `ResolvedActor { actor_id, tenant_id, scopes, source }` — replaces the
  upcoming refactor of `AuthenticatedActor(Arc<str>)` in PR 4a.

Per MR-668 design decision 13: ship the Cloud-mode forward type shapes
now (no `TokenVerifier` trait yet — that's RFC 0001 step 1) so handler
signatures stay stable across the Cluster → Cloud trajectory. `Scope`
and `AuthSource` use `#[non_exhaustive]` so future variants don't break
caller matches.

Tests: 26 new (15 graph_id + 11 identity), all passing. No regression
in the existing 36 server library tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: Omnigraph::init error-path cleanup + three failpoints (PR 2a/10)

PR 2a of the MR-668 multi-graph server work. Bug fix: a partially-failed
`Omnigraph::init` previously left orphan schema files at the graph URI,
making the URI unusable for a retry (the next `init` would refuse because
`_schema.pg` already exists).

Changes:

1. `init_with_storage` now wraps the I/O phase. On any error from
   `init_storage_phase`, calls `best_effort_cleanup_init_artifacts` to
   remove the three schema files before returning the original error:
   - `_schema.pg`
   - `_schema.ir.json`
   - `__schema_state.json`
   Cleanup is best-effort: a failure to delete is logged via
   `tracing::warn` but does NOT mask the init error.

2. Three failpoints added at the init phase boundaries:
   - `init.after_schema_pg_written`
   - `init.after_schema_contract_written`
   - `init.after_coordinator_init`

3. Four new failpoint tests in `tests/failpoints.rs` pin the cleanup
   behavior at each boundary plus the "original error wins over cleanup
   error" contract. All 23 failpoint tests pass.

Coverage gap (documented in code comments):
Lance per-type datasets and `__manifest/` directory created by
`GraphCoordinator::init` are NOT cleaned up after a coordinator-init-phase
failure. Recursive directory deletion requires `StorageAdapter::delete_prefix`,
which was deferred along with `DELETE /graphs/{id}` (originally PR 2b). When
that primitive lands, the third failpoint test can be tightened to assert
the graph root is fully empty.

Tests: 4 new (init_failpoint_*), all 23 failpoint tests green. No
regression in the 105 engine library tests or 64 end_to_end tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: add GraphHandle + GraphRegistry data structure (PR 3/10)

PR 3 of the MR-668 multi-graph server work. Pure data structure — no
routing changes yet (that's PR 4a).

New file: `crates/omnigraph-server/src/registry.rs`

- `GraphHandle { key: GraphKey, uri: String, engine: Arc<Omnigraph>,
  policy: Option<Arc<PolicyEngine>> }` — the per-graph state that the
  routing middleware (PR 4a) will inject as a request extension.
- `RegistrySnapshot { graphs: HashMap<GraphKey, Arc<GraphHandle>> }` —
  immutable snapshot; replaced atomically via `ArcSwap`.
- `GraphRegistry { snapshot: ArcSwap<_>, mutate: Mutex<()> }` — lock-free
  reads, mutex-serialized mutations.
- `RegistryLookup { Ready(Arc<GraphHandle>) | Gone }` — two-valued, no
  `Tombstoned` variant since DELETE is deferred in v0.7.0 scope.
- `InsertError { DuplicateKey | DuplicateUri }` — both rejection cases
  for create-graph (maps to HTTP 409 in PR 7).
- Methods: `new`, `from_handles` (bulk startup-time init), `get`, `list`,
  `len`, `insert`.

Race semantics pinned by three multi-thread tests:
- `concurrent_insert_same_key_exactly_one_succeeds` — N=8 spawned
  inserts with the same key; exactly 1 returns Ok, 7 return DuplicateKey.
- `concurrent_insert_distinct_keys_all_succeed` — N=8 spawned inserts
  with distinct keys; all succeed.
- `concurrent_reads_during_inserts_see_consistent_snapshots` — reader
  loop concurrent with sequential writes; every listed handle's key
  resolves via `get()` (no torn state).

Why no tombstones field: `DELETE /graphs/{id}` is deferred to bound
the scope of v0.7.0. Without a delete endpoint, there's no use for
tombstones — every key in the registry is `Ready`, and every key
not in the registry is `Gone`. When DELETE lands later, the
`Tombstoned` variant + `tombstones: HashSet<GraphKey>` slot in
additively without breaking caller signatures (the `Gone` variant
remains the "not currently active" case).

Why `tokio::sync::Mutex`: insert is async because PR 7's flow holds
this mutex across the atomic YAML rewrite step (file I/O). std::Mutex
would footgun across .await.

Dependency additions: `arc-swap = { workspace = true }`,
`thiserror = { workspace = true }` (used by InsertError).

Tests: 12 new (12 passing). 74 server lib tests total green
(62 from PR 1 + 12 new). Clippy clean on server crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: router restructure + handler refactor for multi-graph (PR 4a/10)

PR 4a of the MR-668 multi-graph server work. The heaviest single PR —
rewires every handler to extract `Arc<GraphHandle>` from a routing
middleware, replaces `AuthenticatedActor(Arc<str>)` with `ResolvedActor`
everywhere, and adds the `ServerMode` discriminator.

Behavior changes:
- **Single mode** (legacy `omnigraph-server <URI>`): flat routes
  (`/snapshot`, `/read`, `/branches`, …) continue to work exactly as
  v0.6.0. Internally, the registry holds a single handle keyed by the
  sentinel `SINGLE_GRAPH_KEY_ID = "default"`; routing middleware injects
  that handle on every request. No HTTP-visible change.
- **Multi mode** (new): routes nest under `/graphs/{graph_id}/...`.
  Routing middleware extracts the graph id from the path, looks it up
  in the registry, and injects the handle. 404 if not found.
  (Multi-mode startup itself lands in PR 5; this PR provides the
  router-side wiring.)

AppState refactor:
- `engine: Arc<Omnigraph>` and `policy_engine: Option<Arc<PolicyEngine>>`
  fields removed — both now live inside `GraphHandle` in the registry.
- `mode: ServerMode { Single { uri } | Multi { config_path } }` added.
- `registry: Arc<GraphRegistry>` added.
- `server_policy: Option<Arc<PolicyEngine>>` added (placeholder for
  management endpoints in PR 6b; unused today).
- Existing constructors (`new`, `new_with_bearer_token{s,_and_policy}`,
  `new_with_workload`, `open*`) build a single-mode AppState
  internally and remain source-compatible. Tests that constructed
  AppState via these constructors continue to work.
- `with_policy_engine` post-construction setter — rebuilds the
  single-mode handle with the policy attached. Engine-layer
  enforcement is NOT reinstalled (matches the old single-field
  semantics; `open_with_bearer_tokens_and_policy` is the path that
  installs both layers).
- `new_multi` constructor added for PR 5's startup loop.
- `uri()` now returns `Option<&str>` (Some in single, None in multi).

Routing middleware:
- `resolve_graph_handle` injects `Arc<GraphHandle>` as a request
  extension. Mode-aware: single returns the only handle; multi parses
  `/graphs/{graph_id}/...` from the URI. Returns 404 in multi mode
  when the graph id is unregistered. Records `graph_id` on the
  current tracing span.
- `require_bearer_auth` updated to insert `ResolvedActor` (was
  `AuthenticatedActor`).

Handler refactor — every protected handler:
- Gains `Extension(handle): Extension<Arc<GraphHandle>>` param.
- Replaces `state.engine` → `handle.engine`.
- Replaces `state.policy_engine()` → `handle.policy.as_deref()`.
- Replaces `state.uri()` → `handle.uri.as_str()` (or `.clone()`
  where String is needed).
- Replaces `Arc::clone(&state.engine)` → `Arc::clone(&handle.engine)`
  (the spawn-and-clone pattern in `server_export` — proof that a
  long-running export survives the registry being mutated later).

authorize_request signature:
- Was: `(state: &AppState, actor: Option<&AuthenticatedActor>, request: PolicyRequest)`.
- Now: `(actor: Option<&ResolvedActor>, policy: Option<&PolicyEngine>, request: PolicyRequest)`.
- Per-graph callers pass `handle.policy.as_deref()`. The (future PR 6b)
  management endpoints will pass `state.server_policy.as_deref()`.

MR-731 invariant preserved:
- The single chokepoint `request.actor_id = actor.actor_id.as_ref().to_string()`
  inside `authorize_request` still overwrites any client-supplied
  actor identity. Regression test
  `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers`
  at `tests/server.rs:1114-1216` passes unchanged.

Tests: 0 new (the registry race tests in PR 3 already cover the
data structure; this PR exercises them indirectly via the existing
test suite). 74 lib + 57 server integration + 60 openapi = 191 tests
green. Clippy clean.

LOC: +397 insertions, -153 deletions in `crates/omnigraph-server/src/lib.rs`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: OpenAPI multi-mode cluster filter (PR 4b/10)

PR 4b of the MR-668 multi-graph server work. In multi mode, the served
`/openapi.json` reports cluster routes (`/graphs/{graph_id}/...`) instead
of the legacy flat protected paths — matching what `build_app` actually
mounts (PR 4a's `Router::nest`). Single mode is unchanged.

Implementation:
- New `server_openapi` branch: when `state.mode()` is `Multi`, call
  `nest_paths_under_cluster_prefix(&mut doc)` after `ApiDoc::openapi()`.
- The rewrite consumes `doc.paths.paths`, then for every path-item:
  - If the path is in `ALWAYS_FLAT_PATHS` (`/healthz` for now), keep
    it flat.
  - Otherwise, prefix every operation_id with `cluster_` and reinsert
    the item at `/graphs/{graph_id}<original_path>`.
- Single mode hits no extra work — the path map is untouched.
- The static `ApiDoc::openapi()` still emits the flat surface, so
  in-process callers (the existing `openapi_json()` helper in tests)
  see the unmodified spec.

Why cluster_ prefix on operation IDs: OpenAPI specs require unique
operation_ids across the document. With both flat (single-mode) and
cluster (multi-mode) surfaces ever co-existing in a generated SDK,
the prefix prevents collision. The current served doc only carries
one surface, so the prefix is forward-compat with potential future
dual-surface generation.

Tests: 6 new in `tests/openapi.rs`, all via the `/openapi.json` route
(not the static `ApiDoc::openapi()` helper):
- `multi_mode_openapi_lists_cluster_paths` — every protected path
  appears as a cluster variant.
- `multi_mode_openapi_drops_flat_protected_paths` — flat protected
  paths are absent.
- `multi_mode_openapi_keeps_healthz_flat` — `/healthz` survives.
- `multi_mode_openapi_prefixes_operation_ids_with_cluster` — every
  cluster operation_id starts with `cluster_`.
- `multi_mode_operation_ids_are_unique` — no operation_id collisions.
- `single_mode_openapi_unchanged_by_cluster_filter` — single mode
  still emits the legacy flat surface (regression).

New test helper `app_for_multi_mode(graph_ids)` exercises the new
`AppState::new_multi` constructor from PR 4a — first user of multi-mode
construction outside of unit tests.

Result: 66 openapi tests + 57 server integration tests + 74 lib tests
= 197 green. No regression in the existing OpenAPI drift check
(`openapi_spec_is_up_to_date` still validates the static flat surface
matches the committed openapi.json).

LOC: +67 in lib.rs (rewrite logic), +219 in tests/openapi.rs (test
suite + helper).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: multi-graph startup + mode inference (PR 5/10)

PR 5 of the MR-668 multi-graph server work. This is the first PR that
makes multi mode actually usable end-to-end: operators invoking
`omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:`
map and no single-mode selector now get a running multi-graph server.

Mode inference (MR-668 decision 2, four-rule matrix in
`load_server_settings`):
  1. CLI `<URI>` positional        → Single
  2. CLI `--target <name>`         → Single (URI from graphs.<name>)
  3. `server.graph` in config      → Single (URI from graphs.<name>)
  4. `--config` + non-empty `graphs:` + no single-mode selector
                                   → Multi (all entries in `graphs:`)
  5. otherwise                     → error with migration hint

Rule 5's error message names every escape hatch so operators can fix
their invocation without grepping docs.

Config schema extensions:
  - `TargetConfig.policy: PolicySettings` (per-graph Cedar policy file).
    `#[serde(default)]` so existing single-graph YAMLs keep parsing.
  - `ServerDefaults.policy: PolicySettings` (server-level Cedar policy
    for management endpoints — loaded in PR 5, wired into `GET /graphs`
    in PR 6b).
  - `OmnigraphConfig::resolve_target_policy_file(name)` and
    `resolve_server_policy_file()` helpers — both resolve relative to
    the config file's `base_dir`.

Public types added to `omnigraph-server`:
  - `ServerConfigMode { Single { uri, policy_file } | Multi { graphs,
    config_path, server_policy_file } }`.
  - `GraphStartupConfig { graph_id, uri, policy_file }` — one entry
    per graph in multi mode.

`ServerConfig` shape change:
  - WAS: `{ uri: String, bind, policy_file, allow_unauthenticated }`.
  - NOW: `{ mode: ServerConfigMode, bind, allow_unauthenticated }`.
  - Breaking for any code that constructs `ServerConfig` directly.
    `main.rs` is unaffected (uses `load_server_settings`).

`serve()` now forks on `ServerConfig.mode`:
  - Single: existing flow via `AppState::open_with_bearer_tokens_and_policy`.
  - Multi: parallel open via `futures::stream::iter(graphs)
    .map(open_single_graph).buffer_unordered(4).collect()`. Bound 4 is
    a rule-of-thumb for I/O-bound work — at N≤10 this trades startup
    latency for a small amount of concurrent S3/Lance open pressure.
    Fail-fast: first open error aborts startup; in-flight opens drop
    their engine via Arc (Lance datasets close cleanly).

New helper `open_single_graph(GraphStartupConfig)`:
  - Validates `GraphId` per the regex in PR 1.
  - `Omnigraph::open(uri).await` with descriptive error context.
  - Loads per-graph policy file and re-applies it via
    `Omnigraph::with_policy` (engine-layer enforcement, MR-722).
  - Returns `Arc<GraphHandle>` ready for the registry.

Routing middleware bug fix:
  - `Router::nest("/graphs/{graph_id}", inner)` rewrites
    `request.uri().path()` to the inner suffix (e.g. `/snapshot`).
    The previous middleware tried to parse `{graph_id}` from
    `request.uri().path()` and got 400 instead of 200. Fixed by reading
    from `axum::extract::OriginalUri` request extension, which preserves
    the pre-rewrite URI.
  - Caught by the two new tests
    `cluster_routes_dispatch_per_graph_handle` and
    `cluster_route_for_unknown_graph_returns_404`.

Tests (14 new, all passing):
  - Four-rule matrix: one test per branch + the joint case
    `mode_inference_cli_uri_overrides_graphs_map` + the empty-graphs-map
    error case.
  - Per-graph + server-level policy file path resolution.
  - Reserved `GraphId` rejection at startup.
  - End-to-end multi-graph routing: two graphs side by side, each
    cluster route hits the right engine.
  - Unknown graph id under cluster prefix → 404.
  - Flat routes 404 in multi mode.

Inline `ServerConfig` test (`serve_refuses_to_start_in_state_1_without_unauthenticated`)
and three `server_settings_*` tests updated to the new `mode` shape.

Result: 211 server tests green (74 lib + 71 integration + 66 openapi),
MR-731 regression test still pinned and passing.

LOC: +45 config.rs, +281 lib.rs (net), +395 tests/server.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: Cedar resource-model refactor (PR 6a/10)

PR 6a of the MR-668 multi-graph server work. Policy-crate-only refactor —
no HTTP handler changes, no operator-supplied policy.yaml changes. Sets
up the chassis that PR 6b's `GET /graphs` consumes.

Two new `PolicyAction` variants:
  - `GraphCreate` — gates `POST /graphs` (deferred behavioral PR).
  - `GraphList`   — gates `GET /graphs` (lands in PR 6b).

Note: `GraphDelete` is intentionally NOT added in this PR. `DELETE
/graphs/{id}` is deferred from MR-668's v0.7.0 scope to bound complexity
(no `delete_prefix`, no tombstone, no `RegistryLookup::Tombstoned`).
Adding the Cedar action without a consumer would be the same kind of
"dead vocabulary" trap the `Admin` variant already documents.

New `PolicyResourceKind { Graph, Server }` enum, plus a
`PolicyAction::resource_kind()` method that classifies every action.
Per-graph actions (Read, Change, BranchCreate, …) bind to
`Omnigraph::Graph::"<graph_label>"`; server-scoped actions
(GraphCreate, GraphList) bind to the singleton
`Omnigraph::Server::"root"`. `Admin` stays classified as per-graph for
now — MR-724 will pick the final shape when the first consumer surface
ships.

Cedar schema string additions:
  - `entity Server;`
  - `action "graph_create" appliesTo { principal: Actor, resource: Server, ... }`
  - `action "graph_list"   appliesTo { principal: Actor, resource: Server, ... }`

Compiler updates:
  - `compile_policy_source` picks the resource literal based on the
    action's `resource_kind`. Existing graph-only policies generate
    the same Cedar source as before — pinned by
    `per_graph_rules_continue_to_work_alongside_server_rules`.
  - `compile_entities` includes the `Server::"root"` entity only when
    a rule references a server-scoped action. Keeps test assertions
    for graph-only policies tight.
  - `PolicyEngine::authorize` builds the right resource UID at
    request time based on `request.action.resource_kind()`.

Validation rules added to `PolicyConfig::validate`:
  - A rule may not mix server-scoped and per-graph actions (different
    resource kinds need different `permit` clauses).
  - Server-scoped actions cannot have `branch_scope` or
    `target_branch_scope` — there's no branch context at the server
    level.

Operator impact: zero. The Cedar schema `Omnigraph::Server` entity is
internally referenced by `compile_policy_source`; operator policy.yaml
files only declare actions in `rules[].allow.actions` and never
reference the resource entity directly. Decision 6's "internal rename
only; operator policies unaffected" contract is preserved and pinned
by `per_graph_rules_continue_to_work_alongside_server_rules`.

Tests: 5 new (11 policy tests total, up from 6):
  - `graph_list_action_authorizes_against_server_resource`
  - `graph_create_action_authorizes_against_server_resource`
  - `server_scoped_rule_cannot_use_branch_scope`
  - `rule_mixing_server_and_per_graph_actions_is_rejected`
  - `per_graph_rules_continue_to_work_alongside_server_rules`

No regression: 145 server tests (74 lib + 71 integration) still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: GET /graphs endpoint + per-graph policy wire-up (PR 6b/10)

PR 6b of the MR-668 multi-graph server work. First management endpoint —
`GET /graphs` lists every graph registered with the server, gated by the
server-level Cedar policy from PR 6a.

New API shapes (in `omnigraph-server::api`):
  - `GraphInfo { graph_id, uri }` — one entry per registered graph.
  - `GraphListResponse { graphs: Vec<GraphInfo> }` — sorted alphabetically
    by `graph_id` for deterministic output.

Handler `server_graphs_list`:
  - Mounted at `GET /graphs` in both modes.
  - Single mode: returns 405 (resource exists in the API surface, just
    not operational without a `graphs:` map). 405 chosen over 404 so
    clients see "resource exists, wrong context" rather than "no such
    resource".
  - Multi mode: requires bearer auth (when configured); Cedar-gated by
    `PolicyAction::GraphList` against `Omnigraph::Server::"root"`
    (PR 6a's chassis). Returns the sorted registry list.

Cedar gate composition:
  - When no `server.policy.file` is configured, the MR-723 default-deny
    falls through: `GraphList` is not `Read`, so an authenticated actor
    without a server policy gets 403. This is the right default — don't
    expose the registry until the operator explicitly authorizes it.
  - When a server policy is configured, Cedar evaluates the rule. The
    test `get_graphs_with_server_policy_authorizes_per_cedar` pins the
    admin-allow / viewer-deny split.

Routing:
  - New `management` sub-router holding `/graphs` (auth-required, no
    `resolve_graph_handle` middleware — operates on the registry, not
    a single graph).
  - Single mode merges flat protected routes + management.
  - Multi mode merges nested `/graphs/{graph_id}/...` + management.

OpenAPI:
  - `server_graphs_list` registered in `ApiDoc::paths(...)`.
  - `EXPECTED_PATHS` in `tests/openapi.rs` gains `/graphs`.
  - `openapi.json` regenerated (auto-tracked by
    `openapi_spec_is_up_to_date` in CI).

Tests: 4 new in `tests/server.rs::multi_graph_startup`:
  - `get_graphs_lists_registered_graphs_in_multi_mode`
  - `get_graphs_returns_405_in_single_mode`
  - `get_graphs_requires_bearer_auth_when_configured`
  - `get_graphs_with_server_policy_authorizes_per_cedar`

What's NOT in this PR (deferred):
  - Per-graph policy enforcement is wired through `handle.policy`
    (PR 4a already did this); PR 6b doesn't add new per-graph
    behavior beyond making sure the server policy lookup composes
    cleanly alongside it.
  - `POST /graphs` (PR 7) and `DELETE /graphs/{id}` (out of scope
    for v0.7.0).
  - CLI `omnigraph graphs list` (PR 8 will add).

Result: 215 server tests green (74 lib + 66 openapi + 75 integration),
11 policy tests green. MR-731 spoof regression preserved across all
this work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: POST /graphs runtime create endpoint (PR 7/10)

PR 7 of the MR-668 multi-graph server work. Operators can now add a
graph to a running multi-graph server without restarting:

  curl -X POST http://server/graphs \
    -H "Content-Type: application/json" \
    -d '{
          "graph_id": "beta",
          "uri": "/data/beta.omni",
          "schema": { "source": "node Person { name: String @key }\n" },
          "policy": { "file": "./policies/beta.yaml" }
        }'

DELETE remains deferred (out of v0.7.0 scope per the trimmed plan —
no `delete_prefix`, no tombstones).

Body shape (decision 7):
  - Nested `schema: { source: "..." }` (mirrors the `policy: { file }`
    pattern; leaves room for future fields without breakage).
  - Optional nested `policy: { file: "..." }` for per-graph Cedar.
  - 32 MiB body limit (reuses `INGEST_REQUEST_BODY_LIMIT_BYTES`).
  - Asymmetric with `SchemaApplyRequest` which keeps flat
    `schema_source: String` — documented in api.rs.

Atomic YAML rewrite + drift detection:
  - New `config::rewrite_atomic(path, new_config, expected_hash)`:
    flock → re-read + hash check → serialize → write `.tmp` → fsync
    → rename → fsync parent dir. Returns the new hash for the caller
    to update its in-memory baseline.
  - New `config::hash_config_file(path)` — SHA-256 of the on-disk
    bytes, used at startup and after each rewrite.
  - New `RewriteAtomicError { Drift | Io | Serialize }` enum.
  - `AppState.config_hash: Option<Arc<Mutex<[u8;32]>>>` carries the
    in-memory baseline. Updated after every successful rewrite so
    subsequent POSTs don't false-trigger drift.
  - The mutex is `std::sync::Mutex` (brief critical section, no .await
    inside). The flock itself serializes file access process-wide
    AND across multiple server instances (defense in depth).
  - All sync I/O runs inside `tokio::task::spawn_blocking` — flock
    is sync.

Handler ordering (the load-bearing sequence):
  1. Mode check: 405 in single mode.
  2. Cedar authorize: `GraphCreate` against `Omnigraph::Server::"root"`.
  3. Validate body: `GraphId::try_from` (regex + reserved-name), empty
     schema/uri checks, per-graph policy file parse.
  4. Pre-check registry for duplicate graph_id / duplicate uri (409).
  5. `Omnigraph::init` the new engine.
  6. Atomic YAML rewrite (drift detection inside).
  7. Publish in registry (atomic re-check via `GraphRegistry::insert`).

Failure modes (documented in handler rustdoc):
  - Init fails → orphan storage at `req.uri` (PR 2a cleans up schema
    files; Lance datasets remain orphans until `delete_prefix` lands).
  - YAML rewrite fails (drift, IO) → orphan storage; YAML unchanged.
  - Registry insert fails (race) → YAML has entry but registry doesn't;
    next restart opens it cleanly.

New dependency: `fs2 = "0.4"` (workspace + omnigraph-server). POSIX-only
file locking. Linux/macOS deployment supported; Windows out of scope.

Tests (10 new in `tests/server.rs::multi_graph_startup`):
  - `post_graphs_creates_a_new_graph_end_to_end` — happy path, includes
    YAML inspection to confirm the rewrite landed.
  - `post_graphs_baseline_hash_updates_between_rewrites` — two POSTs in
    a row both succeed (drift baseline updates correctly).
  - `post_graphs_duplicate_graph_id_returns_409`
  - `post_graphs_duplicate_uri_returns_409`
  - `post_graphs_invalid_graph_id_returns_400` (reserved name)
  - `post_graphs_empty_schema_source_returns_400`
  - `post_graphs_returns_405_in_single_mode`
  - `post_graphs_yaml_drift_detection_returns_503` — operator hand-edits
    omnigraph.yaml; server refuses to clobber.
  - `hash_config_file_is_deterministic_and_detects_changes`
  - `rewrite_atomic_refuses_when_hash_drifts`

OpenAPI: `server_graphs_create` registered in `ApiDoc::paths(...)`;
openapi.json regenerated.

Result: 225 server tests green (74 lib + 66 openapi + 85 integration),
all MR-731 regressions still pinned.

LOC: ~580 lib.rs net (handler + helpers), ~120 config.rs (rewrite
machinery), +71 api.rs (request/response shapes), +332 tests/server.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: CLI omnigraph graphs list/create (PR 8/10)

PR 8 of the MR-668 multi-graph server work. CLI parity for the
v0.7.0 management surface: operators can now manage graphs from
the command line against a running multi-graph server.

  omnigraph graphs list --target dev --json
  omnigraph graphs create \
    --target dev \
    --graph-id beta \
    --graph-uri /data/beta.omni \
    --schema schema.pg

DELETE is intentionally absent — server-side DELETE was deferred from
v0.7.0 scope, and shipping a client subcommand for a server endpoint
that doesn't exist would be dead vocabulary. The help output, the
subcommand enum, and the test that pins it (`graphs_subcommand_help_
lists_list_and_create`) all agree.

CLI architecture (modeled on `BranchCommand`):
  - New `Command::Graphs { command: GraphsCommand }` top-level variant.
  - `GraphsCommand { List, Create }` enum.
  - List: GET `<base>/graphs`. Stdout is `<graph_id>\t<uri>` per line,
    or JSON via `--json`.
  - Create: reads `--schema <path>` from local disk, inlines as
    `schema: { source: <file> }` in the POST body (nested per
    MR-668 decision 7). Optional `--policy-file <path>` becomes
    `policy: { file: <path> }`. Returns 201 → "created graph X at Y"
    or JSON via `--json`.
  - Both subcommands reject local URI targets with a clear
    "remote multi-graph server URL" error.

New API type imports in the CLI: `GraphCreateRequest`,
`GraphCreateResponse`, `GraphListResponse`, `GraphSchemaSpec`,
`GraphPolicySpec` — all from `omnigraph-server::api`.

Tests:
  - cli.rs (4 new, non-network):
      * `graphs_subcommand_help_lists_list_and_create` — pins the
        deferral of `delete` (catches scope creep).
      * `graphs_list_against_local_uri_errors_with_remote_only_message`
      * `graphs_create_against_local_uri_errors_with_remote_only_message`
      * `graphs_create_with_missing_schema_file_errors` — pins the
        IO context in the schema-read error path.
  - system_remote.rs (1 new, `#[ignore]` like its peers):
      * `graphs_list_and_create_against_multi_graph_server` — spawns a
        multi-mode server, calls `graphs list` (sees `alpha`),
        `graphs create` (adds `beta`), `graphs list` again (sees both),
        and confirms the new graph is reachable via its cluster route.

CLI suite: 62 tests green (58 existing + 4 new). The new ignored
end-to-end test runs locally with `cargo test --ignored`.

LOC: +159 main.rs (enum + handlers), +88 cli.rs (unit tests),
+131 system_remote.rs (integration test).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10)

PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.

Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
  - `multi_graph_lifecycle_post_query_restart_persistence` — POST a
    graph, query it via cluster route, reload the config from disk
    and confirm `load_server_settings` sees the rewritten YAML.
    Validates the "restart resolves orphans" failure-mode story.
  - `per_graph_policy_enforced_on_post_created_graph` — POST a graph
    with a per-graph policy attached, then send authenticated read
    and change requests. Per-graph Cedar enforcement fires correctly
    on a POST-created graph (engine-layer policy reinstalled via
    `Omnigraph::with_policy` inside the create flow).
  - `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
    POSTs with distinct graph_ids all return 201. Caught a real
    race in `rewrite_atomic` (see below).

Race fix — `rewrite_atomic_with_modify`:

The first composite test surfaced a real bug. The old
`rewrite_atomic(path, new_config, expected_hash)` captured the
baseline hash OUTSIDE the flock, then called rewrite_atomic which
re-acquired it inside. Under concurrent writers:

  - POST A: captures baseline H0, calls rewrite_atomic.
  - POST B: captures baseline H0 too (before A's update lands).
  - A: acquires flock, on-disk == H0, writes H1, releases.
  - A: updates baseline H0 → H1.
  - B: tries to acquire flock — waits.
  - B: acquires flock. On-disk is now H1. Expected (captured
       before A finished) is H0. MISMATCH → spurious Drift error.

Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.

The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
  1. Read on-disk bytes, hash, compare to baseline.
  2. Parse on-disk YAML.
  3. Call `modify(parsed)` to produce the new config — receives
     fresh on-disk state, returns the modification.
  4. Serialize + write + fsync + rename + update baseline.

Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.

The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.

Version bump v0.6.0 → v0.7.0:
  - All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
    plus their inter-crate `path` dep version constraints.
  - `Cargo.lock` regenerated by `cargo build --workspace`.
  - `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
    row updated to mention multi-graph + cluster routes + atomic YAML
    rewrite.
  - `openapi.json` regenerated.

Docs:
  - `docs/releases/v0.7.0.md` (new) — release notes with breaking
    changes, new features, deferred items (DELETE, `delete_prefix`,
    actor forwarding), and the single→multi migration recipe.
  - `docs/user/server.md` — substantial section additions for the
    two modes, mode inference, cluster endpoint table, management
    endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
    body shape + status codes.
  - `docs/user/cli.md` — `omnigraph graphs list/create` section,
    deferred-DELETE note.
  - `docs/user/policy.md` — server-scoped Cedar actions
    (`graph_create`, `graph_list`), per-graph vs server-level policy
    composition, example server-level policy.

Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.

This commit closes MR-668. v0.7.0 is ready for tagging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt)

The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:

  - flock-on-renamed-inode race: the YAML flock is taken on
    omnigraph.yaml itself, then a temp file is renamed over it.
    Cross-process writers end up locking different inodes — both
    believing they hold exclusive access.
  - duplicate-check outside the file lock: precheck runs against
    the in-memory registry only; the locked closure does
    config.graphs.insert(...) unconditionally. Concurrent same-id
    POSTs can persist the loser in YAML while the in-memory registry
    keeps the winner — they disagree after restart.
  - best_effort_cleanup_init_artifacts deletes _schema.pg /
    _schema.ir.json / __schema_state.json on any init failure. An
    accidental re-init against an existing graph's URI destroys its
    schema; subsequent open() fails at read_text(_schema.pg).

The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.

For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.

Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
  multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
  staging_path, hash_config_file, AppState::config_hash field +
  threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
  + action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
  GraphPolicySpec API types (only the POST handler / CLI imported them)

Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
  per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
  that covers operator-authored YAML + restart — the path that
  survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
  (still reachable from CLI `omnigraph init`; preflight fix deferred
  as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
  callers gone, but the method is the natural seam for the future
  cluster-catalog work

Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
  advertises the management route correctly (was previously rewritten
  to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
  multi_mode_openapi_keeps_management_paths_flat, asserts both
  /healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
  /graphs in addition to /healthz

Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
  but --target is a config-graph-name lookup; corrected to --uri.
  Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
  ownership", and "POST /graphs body shape" sections. Added a
  paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
  "Configuration" line to clarify that server-scoped rules (graph_list)
  take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
  mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
  reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.

Diff: 17 files, +123 −1525 LOC.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: comment cleanup and policy format style

Strip "PR Na/Nb" sub-PR references throughout MR-668 surfaces — they
were useful during the 10-PR delivery sequence but rot now that the
work is in the tree. Keep the MR-668 umbrella references.

Also:
- Add explicit `when = when` and `resource_literal = resource_literal`
  named args in `compile_policy_source`'s outer `format!` to match the
  surrounding crate style (already explicit for `group` and `action`).
- Rename the best-effort cleanup tracing target from
  "omnigraph::init" to "omnigraph::init::cleanup" so operators can
  filter init-failure cleanup events separately from init's other
  log lines.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: drop actor_id from PolicyRequest; pass actor as separate arg

The MR-731 "server-authoritative actor identity" invariant was enforced
by an in-function chokepoint (`request.actor_id = actor.actor_id...`
overwrite inside `authorize_request`). That worked but relied on every
caller passing in a `PolicyRequest` and trusting the overwrite — a
comment-enforced invariant.

Move the invariant into the type system:

* `PolicyRequest` no longer carries `actor_id`. The struct now models
  what a caller wants to do, not who they are.
* `PolicyEngine::authorize(actor_id: &str, request: &PolicyRequest)`
  and `validate_request(actor_id, request)` take identity as a
  separate argument. The same shape `PolicyChecker::check` already had
  for the engine layer.
* `authorize_request` in the HTTP layer extracts `actor_id` from the
  bearer-resolved `ResolvedActor` and passes it positionally — no
  overwrite step that could be skipped.
* CLI `omnigraph policy explain` updated (the only other consumer
  that built a `PolicyRequest`).

Public API break for the `omnigraph-policy` crate. Worth it: handlers
can no longer accidentally populate `actor_id` from a request body
field, and external consumers are forced by the compiler to source
actor identity from a trusted path.

The MR-731 chokepoint test
`actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers`
still passes — the bearer-resolved actor is what reaches the engine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: consolidate AppState single-mode constructors; delete with_policy_engine

The prior `with_policy_engine` constructor reused the engine `Arc`
from the existing handle (`engine: Arc::clone(&existing.engine)`)
without re-applying `Omnigraph::with_policy`. Combined with
`new_with_workload`, the documented composition pattern was
`AppState::new_with_workload(...).with_policy_engine(p)` — which
produced an `AppState` whose HTTP layer enforced Cedar but whose
underlying engine had no `PolicyChecker` installed. Any caller
reaching the engine via `state.registry().list()[i].engine` could
bypass policy entirely. The doc comment named this gap; the type
system didn't.

Make composition impossible to get wrong:

* Add `AppState::new_single(uri, db, tokens, Option<PolicyEngine>,
  WorkloadController)` — canonical single-mode constructor that
  takes every option together and routes through `build_single_mode`
  (which applies `db.with_policy(checker)` to the engine itself).
* `new`, `new_with_bearer_token`, `new_with_bearer_tokens`,
  `new_with_bearer_tokens_and_policy`, `new_with_workload` all become
  thin wrappers around `new_single`.
* Delete `with_policy_engine`. There is no post-construction policy
  install path any more; the single linear construction forces
  HTTP-layer and engine-layer policy to install together or not at all.

Regression test `engine_layer_policy_fires_via_direct_arc_omnigraph_from_new_single`
constructs an `AppState::new_single` with a deny-all policy, pulls
the `Arc<Omnigraph>` from the registry handle (the same path an
embedded SDK consumer would take), and asserts a direct `mutate_as`
call returns `OmniError::Policy`. Pre-fix this test would have
succeeded the mutation.

Test caller in `ingest_per_actor_admission_cap_returns_429`
migrates from `.with_policy_engine(...)` to `new_single(...,
Some(policy_engine), workload)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: derive any_per_graph_policy on RegistrySnapshot; simplify dup check

`AppState::requires_bearer_auth` walked the entire registry per
request (cloning Arcs into a `Vec`, then `.iter().any(|h| h.policy
.is_some())`) to decide whether the auth middleware should challenge.
The walk is unnecessary — the answer only changes when the registry
mutates, which is exactly the moment a new snapshot is constructed.

Move the flag onto the snapshot itself:

* `RegistrySnapshot { graphs, any_per_graph_policy: bool }`.
* `RegistrySnapshot::new(graphs)` is the only construction path —
  it derives the flag from `graphs.values().any(|h| h.policy
  .is_some())` so the cached value can't drift from the source data.
* `Default` delegates to `new(HashMap::new())`.
* `GraphRegistry::from_handles` and `insert` build snapshots via
  `RegistrySnapshot::new(...)`.
* `GraphRegistry::snapshot_ref()` exposes the current snapshot
  through an `arc_swap::Guard`; callers that need cached derived
  state go through this accessor (callers that only want `graphs`
  still use `list` / `get`).

`requires_bearer_auth` becomes one `ArcSwap::load` + bool read.

Also (drive-by, same file, same hunk): replace the dead
`if let Some(other) = seen_uris.get(...)` + `let _ = other;` pattern
in `from_handles` with a plain `seen_uris.contains_key(...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: fail-fast multi-graph startup with try_collect

The `open_multi_graph_state` doc comment claims "Fail-fast — the
first open error aborts startup; other in-flight opens are dropped"
but the code did

    .buffer_unordered(4)
    .collect::<Vec<_>>()
    .await
    .into_iter()
    .collect::<Result<Vec<_>>>()?;

which drains every future in the stream before propagating the first
`Err`. With N S3-backed graphs and graph #2 failing fast, the caller
still waits for #1, #3, #4, … to either succeed or fail before
seeing the error.

Replace the four-line dance with `futures::TryStreamExt::try_collect`,
which short-circuits on the first `Err` and drops the rest. The
doc comment now matches behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: drop unused State extractor from 7 read-only handlers

After the routing-middleware refactor moved the engine into the per-graph
`GraphHandle` (extracted via `Extension<Arc<GraphHandle>>`), seven
read-only handlers — `server_snapshot`, `server_read`, `server_export`,
`server_schema_get`, `server_branch_list`, `server_commit_list`,
`server_commit_show` — kept an unused `State(_state): State<AppState>`
extractor. Drop it. Each request avoids one `FromRequestParts` clone
of `AppState`'s Arcs.

Handlers that actually use state (workload admission for write paths,
`server_policy` for management endpoints) keep theirs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: emit info! for graph routing decision

`tracing::Span::current().record("graph_id", ...)` in the routing
middleware silently no-ops here: no upstream `#[tracing::instrument]`
on the handlers declares a `graph_id` field, and `TraceLayer::new_for_http`
doesn't either. The recorded value never lands anywhere visible.

Replace with an explicit `info!(graph_id = %handle.key.graph_id,
"graph routed")` event so operators can grep logs and correlate
requests with the active graph. In single mode the value is the
sentinel `"default"`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: align GET /graphs 405 body code with HTTP status

The single-mode `GET /graphs` handler returned an `ApiError` built
via struct literal with `status: METHOD_NOT_ALLOWED, code: BadRequest`.
The body code disagreed with the HTTP status — clients deserializing
on `code` saw `bad_request`, clients deserializing on `status` saw
405. Same bug class as the earlier 503+Conflict mismatch on the
removed YAML drift path.

Close the class for this one remaining instance:

* Add `ErrorCode::MethodNotAllowed` to the API enum.
* Add `ApiError::method_not_allowed(msg)` — pairs the 405 status
  with the matching code.
* Replace the struct literal in `server_graphs_list` with the
  constructor.
* Regenerate `openapi.json` (adds `method_not_allowed` to the
  ErrorCode schema enum).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: drop unused axum::handler::Handler import

The import landed in earlier work but no current call site uses it.
Emitted an `unused_imports` warning on every server build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: drop unused fs2 workspace dependency

`fs2 = "0.4"` lingered in [workspace.dependencies] after the
POST /graphs flock-on-rename design was pulled. `cargo tree -i fs2`
reports no consumers in the workspace and the dep is not in
Cargo.lock. Removing the declaration closes the "phantom dep" class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: AGENTS.md Cedar row no longer hardcodes action count

The "8 actions" claim drifted as soon as MR-668 added `graph_list`.
Bumping the count would just push the drift one PR forward; the
correct-by-design fix is to defer to the canonical list in
docs/user/policy.md and stop maintaining a duplicate count.

Closes the "doc hardcodes a count that drifts from the enum" class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: cfg(test)-gate GraphRegistry::insert and its mutex

`insert` and the `mutate: Mutex<()>` that serializes it had no
runtime consumer in v0.7.0 — the only insertion path at startup
is `from_handles`, and runtime add/remove is deferred until a
managed cluster catalog ships. Leaving both `pub` and live made
them a "looks like API, isn't" footgun: a future change could
build on `insert` without re-establishing the concurrency contract
with an actual consumer in scope.

Gate both together (`#[cfg(test)]` on the method, the field, and
the `tokio::sync::Mutex` import) so the race-pinning tests still
compile but production cannot reach them. When a real consumer
ships, ungate both — they're a unit. Closes the "public API with
no runtime consumer drifts toward incorrect" class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: drop vestigial PolicyEngine surface

* `validate_request` had zero callsites — pure surface for nothing.
* `deny`'s `_actor_id` and `_request` parameters were both unused
  (the underscore prefix gave it away); the message is built by the
  caller before `deny` ever sees the request. Trim both.

Closes the "public API that the type system can't justify" class
for the policy engine. No behavior change; every existing test
stays green because the deletions never had a runtime effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: regression test for init re-init footgun (red)

A second `Omnigraph::init` against an existing graph URI today
destroys the existing graph's schema artifacts. `init_storage_phase`
overwrites `_schema.pg` before any preflight, and on the inner
`GraphCoordinator::init` failure that follows,
`best_effort_cleanup_init_artifacts` deletes all three schema files.
The existing Lance datasets and `__manifest/` survive but the
schema metadata is gone — unrecoverable without operator surgery.

This test exercises that path and currently fails with
"_schema.pg must not be deleted by a failed re-init", confirming
the destructive cleanup branch fires. The fix in the next commit
makes the test pass by preflighting with `storage.exists()` and
returning a typed error before any write touches disk.

Per AGENTS.md rule 12, the test commit lands just before the fix
commit so the red → green pair is visible in `git log` and a
reviewer can check out this commit alone to reproduce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: close init re-init footgun via InitOptions preflight (green)

`Omnigraph::init` is "create a new graph"; existing graphs need
an explicit overwrite. Today's behavior — silently overwrite
schema files, then on inner failure delete them via best-effort
cleanup — is destructive against an existing graph regardless of
which branch fires.

Correct-by-design fix:

* New `InitOptions { force: bool }` struct (default `force: false`).
* New `Omnigraph::init_with_options(uri, schema, options)`. The
  old `Omnigraph::init(uri, schema)` is a thin shortcut that
  passes `InitOptions::default()`.
* `init_with_storage` runs a `storage.exists()` preflight on the
  three schema URIs BEFORE any parse, write, or coordinator call.
  Any hit → typed `OmniError::AlreadyInitialized { uri }`. The
  destructive code paths (the `write_text` overwrite and the
  best-effort cleanup) are now unreachable in strict mode against
  an existing graph.
* `force: true` skips the preflight; existing operators who
  actually mean to overwrite opt in explicitly.
* CLI: `omnigraph init --force` maps to `InitOptions { force: true }`.
* HTTP: `OmniError::AlreadyInitialized` maps to 409 via
  `ApiError::from_omni`. Not currently HTTP-reachable (POST /graphs
  was pulled), but the wiring lands here so a future runtime
  create endpoint has one canonical translation.

Closes the "init is destructive against existing state" class.
The regression test added in the previous commit
(`init_on_existing_graph_uri_does_not_destroy_existing_schema`)
turns green: the original schema files now survive a second
init attempt byte-for-byte, and the call errors cleanly with
`AlreadyInitialized`. The four existing
`init_failpoint_after_*_cleans_up_*` tests stay green — strict
mode's preflight passes on a fresh tempdir, and cleanup still
runs as before when a failpoint fires mid-write.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: split PolicyEngine::load into kind-typed loaders

Pre-fix, every caller of `PolicyEngine::load(path, graph_id)`
passed *some* `graph_id` argument — even when the policy was
server-scoped and Cedar's resolution would never touch a Graph
entity. The server-level loader at lib.rs passed the meaningless
sentinel `"server"`. A graph policy file containing a `graph_list`
rule compiled fine; a server policy file containing a `read` rule
compiled fine. Both silently no-op'd at request time because the
engine kind and the rule's resource kind disagreed.

Correct-by-design fix: replace `load` with two kind-typed loaders.

* `PolicyEngine::load_graph(path, graph_id)` — for per-graph
  policy files. Rejects any rule whose action `resource_kind()`
  is `Server`.
* `PolicyEngine::load_server(path)` — for server-level policy
  files. Takes no `graph_id`: server-scoped actions resolve against
  the singleton `Omnigraph::Server::"root"` entity, never a Graph.
  Rejects any rule whose action `resource_kind()` is `Graph`.

The old `load` is hard-deleted in the same commit because every
in-tree consumer migrates here (no semver promise on the workspace
crate, no external pinners). New `PolicyEngineKind` enum types
the loader's intent; `validate_kind_alignment` is the load-time
check that closes the "wrong action, wrong file, silent no-op"
class — operators get a load-time error instead of confused-and-
silent behavior at request time.

Callsites migrated:
* server lib.rs:374 (single-mode per-graph)   → load_graph
* server lib.rs:1065 (multi-mode server)      → load_server
* server lib.rs:1103 (multi-mode per-graph)   → load_graph
* CLI main.rs:732 (resolve_policy_engine)     → load_graph
* tests/server.rs ×5 (4 graph, 1 server)      → load_graph/load_server
* policy_engine_chassis.rs                    → load_graph

Four new in-source tests pin the contract: both rejection paths
and both positive paths.

Closes the "operator puts an action in the wrong file and the
rule silently never matches" class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: introduce GraphRouting, retire single_mode_handle

Pre-fix, `AppState` always carried `Arc<GraphRegistry>` even when
serving one graph. Single mode populated the registry with one
handle keyed by the `SINGLE_GRAPH_KEY_ID = "default"` sentinel;
`single_mode_handle` walked the registry, asserted `len == 1`,
and returned the single element with a 500-class "programmer
error" branch on mismatch. Three smells in a row — magic key,
walk-and-assert, programmer-error guard — all because the
single-mode runtime was forced through a multi-mode abstraction.

Correct-by-design fix: type the routing.

* New `pub enum GraphRouting { Single { handle }, Multi { registry,
  config_path } }` on `AppState`. The `Single` arm carries the handle
  directly — no registry, no key, no walk.
* `resolve_graph_handle` middleware matches on `routing`. Single mode
  returns the handle in O(1); multi mode does the same path-extract +
  registry lookup as before. The 500-class programmer-error branch
  is gone — the type system now makes the violated invariant
  ("single mode has exactly one handle") unrepresentable.
* `requires_bearer_auth` reads `handle.policy.is_some()` directly
  in the Single arm; Multi arm still uses the cached
  `any_per_graph_policy` flag.

`ServerMode` and the legacy `registry` field on `AppState` are still
populated for now — C-3 removes both once every reader is migrated.
The `SINGLE_GRAPH_KEY_ID` sentinel and `ServerMode` will also go
away in C-3.

Closes the "single mode forced through a multi-mode abstraction"
class. All 76 server integration tests stay green: handlers still
extract `Extension<Arc<GraphHandle>>` from the request, so the
middleware's internal change is invisible to them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: remove ServerMode, registry field, and the SINGLE_GRAPH sentinel

C-1/C-2 introduced `GraphRouting` and pointed the middleware at it.
This commit removes the legacy shape that's now dead:

* `ServerMode` enum — deleted. Single mode's `uri` lives on
  `handle.uri`; multi mode's `config_path` lives on the
  `GraphRouting::Multi` arm.
* `AppState.mode: ServerMode` field — deleted.
* `AppState.registry: Arc<GraphRegistry>` field — deleted. Multi
  mode's registry is on `GraphRouting::Multi { registry, .. }`;
  single mode has no registry at all.
* `AppState::mode()`, `AppState::uri()`, `AppState::registry()`
  accessors — deleted. New `AppState::routing() -> &GraphRouting`
  is the single public entry point.
* `SINGLE_GRAPH_KEY_ID` constant — deleted. `GraphHandle.key` is
  still required by the struct, but in single mode the key is now
  only a tracing label (`"default"`, inlined with a comment naming
  its sole remaining purpose). Single-mode flat routes never carry
  a `{graph_id}` parameter, so the key is never compared against
  user input, and there is no registry where it could be a map
  key. C-1/C-2 already removed the registry walk that the sentinel
  was named for.

Callers migrated:
* `build_app` (lib.rs:944) — matches on `state.routing()` instead
  of `state.mode()`.
* `server_graphs_list` (lib.rs:1162) — destructures the Multi arm
  to get the registry; Single arm short-circuits to 405.
* `server_openapi` (lib.rs:1217) — matches the Multi arm for the
  cluster-prefix rewrite.
* `tests/server.rs:3735` — the B2 footgun regression test now
  matches on `state.routing()` to extract the single-mode handle
  (the test's earlier `state.registry().list().next()` shape was
  the closest pre-fix analog to "embedded consumer reaches the
  engine"; the new shape is more direct).

Closes the entire "single mode forced through a multi-mode
abstraction" class. After this commit:
* No magic sentinel as a routing key.
* No `single_mode_handle` walk-and-assert helper.
* No 500-class "programmer error" branch in the middleware.
* No two-field discriminant on `AppState` where one would do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: regression test for nested-route path extraction (red)

`server_branch_delete` and `server_commit_show` use bare
`Path<String>` extractors. In single-mode flat routes
(`/branches/{branch}`, `/commits/{commit_id}`) this works — one
capture, one value. In multi-graph cluster routes
(`/graphs/{graph_id}/branches/{branch}`,
`/graphs/{graph_id}/commits/{commit_id}`) axum 0.8 propagates the
outer `{graph_id}` capture into the inner handler, so the
extractor sees two captures and 500s with
"Wrong number of path arguments. Expected 1 but got 2."

`cluster_routes_dispatch_per_graph_handle` only exercises
`/snapshot` (no Path extractor), so the regression slipped through.
This test closes that gap structurally: every cluster route with
an inner path param gets exercised here.

Currently fails with the exact symptom above. Fix in the next
commit makes it pass.

Per AGENTS.md rule 12, the red test commit lands just before the
fix so the pair is visible in `git log` and a reviewer can check
out this commit alone to reproduce.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: named-field path-param structs for nested cluster routes (green)

`Path<String>` deserializes one path-param value positionally.
Single-mode flat routes (`/branches/{branch}`,
`/commits/{commit_id}`) have one capture; multi-mode nested routes
(`/graphs/{graph_id}/branches/{branch}`,
`/graphs/{graph_id}/commits/{commit_id}`) have two — axum 0.8
propagates the outer capture into nested handlers. Same handler,
two different shapes; the multi-mode shape 500s with
"Wrong number of path arguments. Expected 1 but got 2."

Symptomatic fix: change to `Path<(String, String)>` and ignore the
first element. Breaks again the moment we add another nest layer
(e.g. tenant in Cloud mode).

Correct-by-design fix: named-field structs deserialized by name
from axum's path-param map. Each handler picks only the fields it
needs. Stable across single / multi / future-cloud nest depths
because deserialization is by field name, not position.

* New `BranchPath { branch: String }` (file-local to lib.rs)
* New `CommitPath { commit_id: String }`
* `server_branch_delete` extractor → `Path<BranchPath>`
* `server_commit_show` extractor → `Path<CommitPath>`

Closes the "handler path-extractor type is positional and breaks
when route nesting changes" class. Red test from the previous
commit turns green. All 77 server tests pass (single-mode branch
delete + commit show, plus new multi-mode coverage).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: centralize policy-requires-tokens check in the runtime classifier

Single-mode `open_with_bearer_tokens_and_policy` bailed at lib.rs:380
when policy was installed and no tokens. Multi-mode
`open_multi_graph_state` had no equivalent: the server started, every
request 401'd because no token could ever match, and the operator
spent time debugging a misconfiguration the single-mode path would
have caught at startup.

The doc/code contradiction made the gap easy to miss: the
`ServerRuntimeState::PolicyEnabled` docstring said tokens-or-not
was "unusual but valid — every request fails 401 without a bearer,
which is effectively 'locked'." The single-mode bail contradicted
that. In practice, silent-401-on-every-request is bug-shaped, not
feature-shaped (operators wanting deny-all should configure tokens
plus a deny-all Cedar rule to get meaningful 403s with
policy-decision logging).

Symptomatic fix: add a copy of the bail to multi-mode. Two copies
that can drift again the next time a startup path is added.

Correct-by-design fix: hoist the check into
`classify_server_runtime_state` so both modes get the same
enforcement from one source of truth. The classifier becomes the
single source of truth for "should we start?" and adding a startup
invariant there is now the natural extension point for any future
mode.

Classifier matrix is now complete:

| has_tokens | has_policy | allow_unauthenticated | Result |
|---|---|---|---|
| F | F | F | bail (existing) |
| F | F | T | Open (existing) |
| T | F | * | DefaultDeny (existing) |
| F | T | * | bail (NEW — closes the gap) |
| T | T | * | PolicyEnabled (existing) |

Changes:

* `classify_server_runtime_state` (lib.rs:870-890) gains the
  `(false, true, _) => bail!(…)` arm with a clear message naming
  the failure mode and the two valid resolutions.
* `open_with_bearer_tokens_and_policy` (lib.rs:369+) drops its
  redundant local bail — the classifier rejected the invalid case
  before construction was reached.
* `ServerRuntimeState::PolicyEnabled` docstring is rewritten:
  drops the "(unusual but valid)" carve-out and states plainly
  that PolicyEnabled requires tokens. Names the explicit
  alternative (tokens + deny-all Cedar rule) for operators who
  want the all-requests-denied behavior.
* `classify_policy_enabled_always_wins` test is renamed to
  `classify_policy_enabled_requires_tokens` and the now-invalid
  `(false, true, _)` assertion is removed (covered by the new
  rejection test).
* New `classify_policy_without_tokens_is_rejected` test covers the
  new arm.
* New `serve_refuses_to_start_with_policy_but_no_tokens_multi_mode`
  integration test pins the multi-mode propagation path —
  symmetric with the existing single-mode
  `serve_refuses_to_start_in_state_1_without_unauthenticated`.

Closes the "single mode and multi mode startup branches can drift
on safety invariants" class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: close coverage gaps surfaced by the test-coverage audit

The bot-review pass and the subsequent coverage audit surfaced two
material gaps in PR #119's test surface — both easy to close, both
worth closing before merge.

* **Gap 1 — cluster-route sweep.** The Bug-1 path-extractor
  regression slipped through because
  `cluster_routes_dispatch_per_graph_handle` only exercised
  `/snapshot`. The other six protected cluster routes (`/read`,
  `/change`, `/export`, `/schema`, `/schema/apply`, `/ingest`,
  `/branches/merge`) were implicitly trusted to work without any
  multi-mode integration test.

  Add `all_protected_cluster_routes_resolve_to_their_handler`
  (`tests/server.rs`) that hits each protected cluster route with
  a minimal request and asserts the response is consistent with
  the handler being reached — no 404 (router didn't match), no 500
  with "Wrong number of path arguments" (Bug-1 class), no 500 with
  "missing extension" (routing middleware didn't inject the handle).
  Status code is a negative assertion because each handler's
  happy-path inputs differ; what matters is "the request reached
  the handler," not "the handler returned 200" — that's already
  pinned by the single-mode tests.

* **Gap 2 — `--force` happy path.** The strict re-init regression
  test (`init_on_existing_graph_uri_does_not_destroy_existing_schema`)
  pins the error path; nothing pinned the `force: true` escape
  hatch actually doing what its docstring claims.

  Add `init_with_force_recovers_from_orphan_schema_files`
  (`tests/lifecycle.rs`). Writes a bare `_schema.pg` to simulate
  orphan files from a failed prior init, confirms strict mode
  bails as expected, then confirms `init_with_options(force: true)`
  succeeds and produces a functional graph.

  Note: the test follows the documented semantics — force skips
  the preflight only, it does NOT purge existing Lance state. An
  earlier draft of the test (against full overwrite of an existing
  populated graph) failed because `GraphCoordinator::init` errored
  on the existing `__manifest`, which is exactly the limitation
  the `InitOptions::force` docstring already calls out. Recursive
  purge needs `StorageAdapter::delete_prefix` (tracked separately).

Coverage is now fully aligned with the PR's claims.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: regression test for GraphList open-mode bypass (red)

Cursor bot's review at commit 4120448 surfaced that
`server_graphs_list` returns 200 in Open mode (`--unauthenticated`,
no tokens, no policy), exposing the full graph registry — graph
IDs and URIs that may contain S3 bucket paths or internal
hostnames — to any unauthenticated caller.

Root cause: `authorize_request`'s no-policy fallback only denies
when `actor.is_some()`. In Open mode `actor: None`, so the
denial branch never fires and the call returns `Ok(())`. The
docstring on `server_graphs_list` claims the endpoint is
"Cedar-gated" and that we "don't leak the registry until the
operator explicitly authorizes it" — but Open mode has no Cedar
at all, so the docstring intent and the code disagree.

This commit renames the existing
`get_graphs_lists_registered_graphs_in_multi_mode` test to
`get_graphs_denied_in_open_mode_without_server_policy` and flips
the assertion from 200 → 403. Today this fails (server returns
200) — exactly the symptom the bot named. The fix in the next
commit tightens the no-policy fallback to deny server-scoped
actions unconditionally, regardless of mode.

Per AGENTS.md rule 12, the red test commit lands just before
the fix so the red → green pair is visible in `git log` and a
reviewer can check out this commit alone to reproduce.

Sort-order coverage that previously lived in the renamed test
moves to `get_graphs_with_server_policy_authorizes_per_cedar`
in the next commit, where the admin-200 response is operator-
authorized and a non-empty body is asserted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: server-scoped actions always require explicit policy (green)

`server_graphs_list` returned 200 in Open mode (`--unauthenticated`,
no tokens, no policy) because `authorize_request`'s no-policy
fallback only denied when `actor.is_some()` AND action != Read.
In Open mode `actor: None`, so the denial branch never fired and
the call returned `Ok(())` — leaking the registry (graph IDs +
URIs that may contain S3 bucket paths or internal hostnames) to
any unauthenticated caller. The docstring on `server_graphs_list`
claimed it was "Cedar-gated" and that the server should "not leak
the registry until the operator explicitly authorizes it" —
docstring intent and code disagreed.

Symptomatic fix: special-case GraphList. Breaks the moment
another server-scoped action (`graph_create`, `graph_delete`) is
added.

Correct-by-design fix: tie authorization to the action's
`resource_kind()`. Server-scoped actions
(`PolicyResourceKind::Server`) always require explicit policy
authorization — there is no runtime state where they're served
by default. Per-graph actions keep the existing default-deny
logic (DefaultDeny denies non-Read for authenticated actors;
Open mode allows everything per the operator's `--unauthenticated`
opt-in for graph DATA, but not for server topology).

The fix uses the existing `PolicyResourceKind` enum that #119
already added — no new abstraction. Future server-scoped actions
(runtime `graph_create`/`graph_delete` when the cluster catalog
ships) automatically pick up the same enforcement without any
per-action handler change.

Changes:

* `crates/omnigraph-server/src/lib.rs:51` — re-export
  `PolicyResourceKind` (the kind discriminator was already public
  on the omnigraph-policy crate; needed in scope here).
* `crates/omnigraph-server/src/lib.rs:1457` — `authorize_request`'s
  no-policy fallback gains a server-scoped-action check that fires
  before the actor-based default-deny logic. Error message names
  the failure mode and points at `server.policy.file`.
* `crates/omnigraph-server/tests/server.rs:5037` —
  `get_graphs_with_server_policy_authorizes_per_cedar` extended
  to register two graphs in non-alphabetical order and assert
  the admin-200 response is sorted alphabetically. Restores the
  sort-order coverage that lived in
  `get_graphs_lists_registered_graphs_in_multi_mode` before the
  red commit renamed it to assert denial.

Also bundles a small adjacent cleanup that the bot-review flagged:

* `crates/omnigraph-server/src/graph_id.rs:124` — drop the
  unreachable `"openapi.json"` entry from `is_reserved`. The
  regex `^[a-zA-Z0-9-]{1,64}$` rejects every dot-containing name
  before `is_reserved` can run, so dotted entries in this list
  were dead code that misled readers into thinking the list
  needed to cover them. Comment now names the structural
  exclusion. The `rejects_reserved_route_names` test loses its
  `openapi.json` row (covered by `rejects_dots` via the regex).

Closes the "server-scoped management actions silently leak in
Open mode" class. Red test from the previous commit
(`get_graphs_denied_in_open_mode_without_server_policy`) turns
green; all 78 server integration tests + 76 lib tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: fold multi-graph work into v0.6.0 (no separate v0.7.0 release)

The branch had bumped workspace versions to 0.7.0 and added a
dedicated `docs/releases/v0.7.0.md` for the multi-graph work.
Per scope decision: ship the graph-rename and the multi-graph
mode in one v0.6.0 release.

Changes:

* Workspace versions bumped 0.7.0 → 0.6.0 in every crate manifest
  (`omnigraph`, `omnigraph-compiler`, `omnigraph-policy`,
  `omnigraph-server`, `omnigraph-cli`) and their internal
  `path = ..., version = "..."` dependency constraints.
* `docs/releases/v0.7.0.md` content merged into
  `docs/releases/v0.6.0.md`, retargeted to a single coherent
  v0.6.0 release note covering both the graph terminology rename
  and the multi-graph server mode. The original v0.7.0.md is
  deleted.
* All `v0.7.0` / `0.7.0` doc and comment references throughout
  `crates/`, `docs/`, `AGENTS.md`, and `openapi.json` retargeted
  to `v0.6.0` / `0.6.0`. `Cargo.lock` regenerated to match.
* OpenAPI spec regenerated via `OMNIGRAPH_UPDATE_OPENAPI=1
  cargo test -p omnigraph-server --test openapi
  openapi_spec_is_up_to_date` — `"version": "0.6.0"` now.

Verification:

* `cargo build --workspace` — clean (6 pre-existing engine
  warnings only).
* `cargo test --workspace --locked` — zero failures across all
  39 test result groups.
* `bash scripts/check-agents-md.sh` — passes (34 links / 33 docs).
* `grep -rn "0\.7\.0\|v0\.7\.0" --include='*.rs' --include='*.md'
  --include='*.json' --include='*.toml' .` returns no workspace
  hits. The three remaining `0.7.0` strings in `Cargo.lock`
  belong to unrelated 3rd-party crates (`pem-rfc7468`, `radium`,
  `rand_xoshiro`).

The git tag and crates.io publish happen later — this commit
just consolidates the surface so the eventual release is one
coherent v0.6.0 covering all the work since v0.5.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: sanitize internal refs from v0.6.0 release notes

cubic-dev-ai P2 comments flagged that the release notes carried
internal Linear ticket and RFC references (MR-668, MR-731,
MR-723, RFC 0003, RFC 0004). Per AGENTS.md maintenance rule 5,
"Release docs are public project history. Describe capabilities,
behavior changes, breaking changes, upgrade notes, and user
impact; do not reference private ticket systems, internal
codenames, or planning shorthand that an outside contributor
cannot inspect." The bot's comments are correct against our own
published contract — they were a docs-quality regression
introduced when I drafted these notes.

Replaced each internal reference with the public-facing concept
it stood for. The substantive content (capabilities, behavior,
guarantees) was already present alongside the refs; sanitization
just trimmed the bracketed ticket labels:

* Line 6: dropped `(MR-668)` from the multi-graph mode summary —
  the descriptive name was already self-sufficient.
* Line 24: `MR-731 spoof defense` → `the bearer-derived-actor-
  identity guarantee`; `Forward-compat for Cloud mode (RFC 0003)
  and OAuth provider (RFC 0004)` → "forward-compat seams for
  future multi-tenant and OAuth deployments; they're inert in
  this release" — describes what the operator sees instead of
  pointing at planning docs.
* Line 26: `MR-731's server-authoritative-actor invariant` →
  "the server-authoritative-actor invariant: actor identity is
  always sourced from the bearer-token match resolved at the
  auth boundary" — the public-facing statement of the guarantee.
* Line 36: `(MR-723 default-deny otherwise rejects …)` →
  "without a server policy the default-deny posture rejects …"
  — same content, no ticket label.
* Line 121: `MR-731 spoof regression test` → "The bearer-auth-
  derived-actor-identity regression test (client-supplied
  identity headers are ignored; the server-resolved actor is the
  only identity Cedar sees)" — describes what the test guards
  instead of naming the originating ticket.

Verified: `grep -E 'MR-\d+|RFC[ -]?\d+' docs/releases/v0.6.0.md`
returns no matches; the rest of `docs/releases/` is also clean.
`scripts/check-agents-md.sh` passes.

Note: cubic-dev-ai also flagged `crates/omnigraph-cli/src/main.rs:276`
("doc comment incorrectly references v0.6.0 for a command that
only exists in v0.7.0"). That comment is based on a stale model
of the release surface — after folding v0.7.0 into v0.6.0 in
the previous commit, the multi-graph CLI surface IS in v0.6.0
and the comment is correct as written. No change needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: close validated init and multi-graph gaps

* chore: address review cleanup comments

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 16:19:31 +02:00
Ragnor Comerford
cc2412dc65
Rename repo terminology to graph (#118)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
2026-05-24 16:46:00 +01:00
Andrew Altshuler
bb1fe57640
release: v0.5.0 (#115)
* gitignore: exclude docs/internal/ from publication

Mirrors the existing "Local-only working files (not for the public
repo)" pattern. Working notes filed under docs/internal/ stay on the
contributor's machine instead of cluttering the published doc tree
or tripping the AGENTS.md / docs-index cross-link check
(scripts/check-agents-md.sh enumerates every docs/*.md and requires
each one to be linked from an audience index — internal notes don't
have an audience index by definition).

Incidental to the v0.5.0 release; lands separately from the version
bump commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci: skip docs/internal/ in agents-md cross-link check

Matches the .gitignore exclusion. Mirrors the existing 'docs/releases/'
exclusion pattern: notes under docs/internal/ aren't part of the
published doc tree and don't need to be linked from an audience index.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* release: v0.5.0 — Lance 6 substrate, Cedar policy engine, schema-lint v1

Bumps the workspace from 0.4.2 to 0.5.0. Release notes at
docs/releases/v0.5.0.md.

Three user-visible pillars motivate the minor bump:
  1. Lance 6.0.1 substrate (DataFusion 52→53, Arrow 57→58)
  2. Engine-wide Cedar policy enforcement on every _as writer; server
     defaults to deny-all; signed-token-claim-only actor identity
  3. Schema-lint v1 chassis: OG-XXX-NNN codes, soft drops, and
     `--allow-data-loss` (Hard mode) for destructive migrations

Plus structured DataFusion Expr filter pushdown (unblocks
CompOp::Contains via array_has), HTTP allow_data_loss parity, inline
.gq sources on CLI/HTTP, optional CORS layer, and bug fixes
(merge-insert dup-rowid, branch-merge coordinator restore on error,
blob columns in branch merge).

Sites bumped:
  - 5 crate [package].version lines (omnigraph, omnigraph-cli,
    omnigraph-compiler, omnigraph-policy, omnigraph-server)
  - 10 internal path-dep `version = "..."` constraints across the
    four manifests that depend on sister crates (engine, server, cli,
    plus engine's dev-dep on the compiler)
  - Cargo.lock (regenerated via cargo update --workspace)
  - AGENTS.md "Version surveyed:"
  - openapi.json `info.version` (regenerated via
    OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test
    openapi)

Verification:
  - cargo test --workspace --locked: 907/907 green
  - cargo test -p omnigraph-engine --test failpoints --features
    failpoints: 19/19 green
  - cargo test -p omnigraph-engine --test lance_surface_guards: 3/3
  - scripts/check-agents-md.sh: clean

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 13:59:42 +01:00
Andrew Altshuler
3551e0d40e
chore(lance): bump 4.0.0 → 6.0.1 (DataFusion 52→53, Arrow 57→58) (#111)
* tests: add lance_surface_guards pre-flight pins for the v6 bump

Land 8 named guards in a new test file that pin Lance API surfaces
OmniGraph relies on. Each guard turns a silent-break risk (variant
rename, struct restructure, async-flip) into a red CI bar instead of
runtime drift.

Guards (mapped to the silent-break inventory from the v6 migration plan):

  Runtime (#[tokio::test]):
  1. lance_error_too_much_write_contention_variant_exists — pins the
     variant referenced by db/manifest/publisher.rs::map_lance_publish_error.
  2. manifest_location_field_shape — pins .path/.size/.e_tag/.naming_scheme
     types and ManifestLocation accessor returning &Self (the access
     pattern at db/manifest/metadata.rs:84-88).
  6. write_params_default_does_not_set_storage_version — confirms our
     explicit V2_2 pin remains load-bearing (blob v2 requirement).

  Compile-only async fns (#[allow(...)] + unimplemented!() placeholders;
  never run, but cargo build --tests enforces the API shape):
  3. checkout_version + restore chain — pins the recovery rollback hammer
     at db/manifest/recovery.rs:505-522.
  4. DatasetBuilder::from_namespace().with_branch().with_version().load()
     — pins the namespace builder chain at db/manifest/namespace.rs:162-174.
  5. MergeInsertBuilder fluent chain — pins the manifest CAS at
     db/manifest/publisher.rs:370-391, including the return shape
     (Arc<Dataset>, MergeStats).
  7. compact_files(&mut ds, CompactionOptions, None) — pins
     db/omnigraph/optimize.rs:107.
  8. DeleteResult { new_dataset, num_deleted_rows } — pins the inline
     delete result shape (MR-A will repurpose this guard to the staged
     two-phase variant once Lance #6658 migration lands).

This is commit 1 of the chore/lance-6.0.1 migration. Cargo bump
follows in commit 2 (will trigger the guards under v6 if any surface
drifted).

Per the migration plan at ~/.claude/plans/shimmering-percolating-duckling.md
(written this session). Two guards from the plan deferred to follow-up:
  - manifest_cas_returns_row_level_contention_variant (full publisher
    race integration test — needs harness scaffolding)
  - table_version_metadata_byte_compatible_with_v4 (TableVersionMetadata
    is pub(crate); requires test reach extension).

Verified on v4: cargo test -p omnigraph-engine --test lance_surface_guards
passes 3/3 runtime tests; cargo build -p omnigraph-engine --tests
compiles all 5 compile-only guards clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(deps): bump Lance 4.0.0 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58

The Cargo bump itself. Source is intentionally untouched — this commit
will not compile. The compile errors are the work-list for subsequent
commits on this branch.

Lance updates: lance + 7 sub-crates 4.0.0 → 6.0.1. Transitive churn:
  + lance-tokenizer v6.0.1 (vendored tokenizer per Lance PR #6512)
  + object_store 0.13.x (Lance 6 brings it transitively; our explicit
    pin stays at 0.12.5 for now — revisit in stages if diamond bites)
  - tantivy* crates (replaced by lance-tokenizer)

Compile error landscape on this commit (11 errors):
  • 1× E0432: `lance_index::DatasetIndexExt` import (Lance PR #6280
    moved it to lance::index). Sites: table_store.rs:20,
    db/manifest.rs:37 (the second site was missed by the pre-flight
    inventory).
  • 8× E0599: `create_index_builder` / `load_indices` missing on
    `lance::Dataset` — all downstream of the DatasetIndexExt move.
    Once the import is corrected on table_store.rs and db/manifest.rs,
    these resolve automatically.
  • 2× E0063: missing field `is_only_declared` in `DescribeTableResponse`
    initializer at db/manifest/namespace.rs:221, 364. New Lance
    namespace field per the v5 namespace restructure (PR #6186).

Surface guards (lance_surface_guards.rs, commit d571fa8) all still
compile + the 3 runtime ones pass on v6 — none of the silent-break
surfaces drifted. That's the load-bearing observation: the publisher
CAS chain, ManifestLocation field shape, checkout_version/restore,
DatasetBuilder fluent chain, MergeInsertBuilder return shape,
WriteParams::default, compact_files signature, and DeleteResult
fields are all v6-stable.

Next commits address the 11 errors per the migration plan stages
3-8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* imports: move DatasetIndexExt to lance::index (Lance PR #6280)

Lance 5.0 (PR #6280) moved `DatasetIndexExt` out of `lance-index` into
`lance::index`. `is_system_index` and `IndexType` stayed in `lance-index`.

Mechanical update of 6 import sites:
  crates/omnigraph/src/table_store.rs:20 — split into two `use` lines
  crates/omnigraph-server/tests/server.rs:10 — was traits::DatasetIndexExt
  crates/omnigraph/tests/search.rs:6
  crates/omnigraph/tests/branching.rs:7
  crates/omnigraph/tests/failpoints.rs:467
  crates/omnigraph-cli/tests/cli.rs:3 — was traits::DatasetIndexExt

All 9 E0599 cascading errors on .create_index_builder / .load_indices
resolve once the trait is back in scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* namespace: add is_only_declared field to DescribeTableResponse

Lance namespace 6.0.0 added `is_only_declared: Option<bool>` to
`DescribeTableResponse` (lance-namespace-reqwest-client 0.7+ via the
v5.0 namespace API restructure, Lance PR #6186). Set to `Some(false)`
because every table BranchManifestNamespace returns from describe_table
is materialized — the manifest snapshot only includes entries for
tables we've already opened via Dataset::open.

Two sites in db/manifest/namespace.rs (BranchManifestNamespace +
StagedTableNamespace impls of LanceNamespace::describe_table).

Closes the last two compile errors from the v6 bump in the engine lib.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* cargo: add lance to omnigraph-cli + omnigraph-server dev-deps

Stage 3 moved DatasetIndexExt imports from `lance-index` to `lance::index`
in the cli and server test crates. Both crates only had `lance-index`
in their dev-dependencies; add `lance` alongside so the new path
resolves.

This is the last compile-error fix from the v6 bump — `cargo build
--workspace --tests` is now green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: refresh Lance alignment audit for v6.0.1; bump surveyed version

Per CLAUDE.md maintenance rule 2 (same-PR docs):

- docs/dev/lance.md: replace the v4.0.1 alignment audit stanza with
  the v6.0.1 audit. Captures every v5/v6 finding from this PR (the
  DatasetIndexExt move, DescribeTableResponse.is_only_declared,
  MergeInsertBuilder return shape, ManifestLocation field shape,
  LanceFileVersion::default flip, file-reader async, tokenizer
  vendor, Lance #6658/#6666/#6877 status). Cross-references each
  guard in tests/lance_surface_guards.rs.

- AGENTS.md: bump "Storage substrate: Lance 4.x" → "Lance 6.x".
  Note: surveyed crate version stays at 0.4.2 — substrate version
  bumps are independent of OmniGraph's release version.

- crates/omnigraph/src/storage_layer.rs: update the trait module-level
  doc-comment to reflect that Lance #6658 closed 2026-05-14 and
  delete_where two-phase migration is MR-A (the next follow-up).
  #6666 stays open; create_vector_index inline residual stays.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* tests: silence clippy::diverging_sub_expression on compile-only guards

The five `_compile_*` async fns in lance_surface_guards.rs use
`let ds: Dataset = unimplemented!()` as a placeholder so type inference
can chase the method chain we want to pin, without ever running the
function. Clippy's `diverging_sub_expression` lint flags this pattern
because the RHS diverges; that's the entire point. Added to the
per-fn `#[allow(...)]` list, alongside dead_code / unreachable_code /
unused_variables / unused_mut already there.

No behavior change. cargo test -p omnigraph-engine --test
lance_surface_guards still 3/3 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: correct #6658 status — closed but API ships in Lance v7.x, not v6.0.1

The audit stanza in docs/dev/lance.md and the storage_layer.rs trait
doc-comment both implied the public DeleteBuilder::execute_uncommitted
API shipped with Lance 6.0.1. It did not. Issue #6658 closed
2026-05-14, but binary search across the release stream confirms:

  v6.0.1             no pub async fn execute_uncommitted on DeleteBuilder
  v6.1.0-rc.1       
  v7.0.0-beta.5     
  v7.0.0-beta.10     first appearance
  v7.0.0-rc.1       

So MR-A (delete two-phase migration) is gated on the Lance v7.x bump,
not on this PR. v7.0.0-rc.1 dropped 2026-05-21; GA likely within a
week.

No behavior change. Doc-only correction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* ci(lib): bump recursion_limit to 256 — Lance 6 trait depth on Linux

Lance 6's heavier trait surface around futures/streams in storage_layer.rs's
staged-write API pushes the rustc trait-resolution recursion limit past
the default 128 on Linux builds. CI on PR #111 surfaced this in both
`Test Workspace` and `Test omnigraph-server --features aws`:

  error: queries overflow the depth limit!
    = help: consider increasing the recursion limit by adding a
      `#![recursion_limit = "256"]` attribute to your crate (`omnigraph`)
    = note: query depth increased by 130 when computing layout of
      `{async block@crates/omnigraph/src/storage_layer.rs:697:5: 697:10}`

(The async block is `stage_create_btree_index`'s body — its return type
is several layers of `impl Future<Output=Result<StagedHandle>>` deep on
top of Lance's own builder return types.)

Local macOS builds happened to short-circuit before tripping the limit,
which is why this didn't surface during the v6 bump sequence. The fix
rustc itself suggests is one line at the crate root.

No behavior change. Revisit if a future Lance bump stops needing it.

Verified: `cargo build --locked -p omnigraph-server --features aws`
compiles clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-23 00:42:29 +01:00
Andrew Altshuler
da42beec41
policy: chassis fan-out — _as variants on the remaining 6 writers (MR-722) (#103)
PR #102 wired apply_schema_as. This PR completes the chassis-side
coverage so every public mutating engine entry point hits the same
Omnigraph::enforce(action, scope, actor) gate regardless of transport:

- mutate_as → enforce(Change, Branch(branch), actor)
- load_as → enforce(Change, Branch(branch), actor)
- ingest_as → enforce(Change, Branch(branch), actor); also threads
  actor through the implicit branch_create_from_as so fresh-branch
  ingest correctly hits BranchCreate too
- branch_create_as → enforce(BranchCreate, TargetBranch(name), actor)
- branch_create_from_as → enforce(BranchCreate,
  BranchTransition { source, target }, actor)
- branch_delete_as → enforce(BranchDelete, TargetBranch(name), actor)
- branch_merge_as → enforce(BranchMerge,
  BranchTransition { source, target }, actor)

Three new _as variants for branch ops (create, create_from, delete)
that had no actor surface before; existing actor-less variants delegate
with actor=None so the no-policy path is a strict no-op.

HTTP handlers updated to thread the resolved actor into the new _as
variants for branch_create and branch_delete (was previously dropped).

14 new SDK chassis tests (one allow + one deny pair per wired writer);
the existing 4 apply_schema_as tests stay. All 18 pass.

docs/user/policy.md updated to describe engine-wide enforcement and the
coarse-vs-fine layer split (engine = action gate, query layer per-row =
MR-725 future). AGENTS.md capability matrix updated to match.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 03:38:18 +03:00
Andrew Altshuler
0de5f69d86
docs: drop npx mdrip; use curl | pandoc for full-page fetches (#97)
The previous "fetch the full page" recommendation in AGENTS.md and
docs/dev/lance.md pointed at an unknown-author npm CLI that, on consent,
wrote agent-targeted content into AGENTS.md and modified .gitignore /
tsconfig.json. Source audit was clean of malicious code but the
self-perpetuating prompt-injection pattern combined with a single
maintainer and ~21 downloads/day made it not worth the risk. Switched
to the curl + pandoc command already documented as the no-tool option.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 16:06:24 +03:00
Andrew Altshuler
60eee78465
docs: split user and developer docs (#93) 2026-05-15 03:45:22 +03:00
Andrew Altshuler
6bad829ed0
branch-protection: declarative policy + apply script (#89)
Branch protection on main, declared as code rather than as opaque
GitHub UI state. Pairs with the CODEOWNERS chassis (#88): once this
PR lands and an admin runs the apply script, every PR to main must
satisfy code-owner review and the listed required checks.

Components:

- .github/branch-protection.json — the policy. Edit this to change
  required checks, review counts, etc. Includes a _comment field for
  human readers; the apply script strips it before PUT.
- scripts/apply-branch-protection.sh — idempotent apply via `gh api`.
  Reads back current state for verification. Supports DRY_RUN=1.
- docs/branch-protection.md — explains the policy, how to apply, how
  to change, why declared as code.
- AGENTS.md topic-index row.

Policy summary:

- Required status checks (strict): Classify Changes, Check AGENTS.md
  Links, Test Workspace, Test omnigraph-server --features aws,
  CODEOWNERS / drift, CODEOWNERS / noedit.
- Required approving reviews: 1, must be a code owner.
- Dismiss stale reviews on new commits.
- Required linear history (squash or rebase merges only).
- No force pushes, no deletions, no admin bypasses.
- Required conversation resolution.

What's NOT in this PR:

- Required signed commits — not yet; maintainers must enroll GPG/SSH
  signing first or merges will block.
- Tag protection for v* tags — separate PR.
- Additional required checks (cargo deny, audit, fmt, clippy, CodeQL,
  schema-lint MR-946) — separate PRs as each lands.
- The script is NOT run by CI. Branch-protection changes are admin
  actions; CI-driven auto-apply would defeat the purpose. Manual
  invocation is the audit point.

How to apply after merge:

  ./scripts/apply-branch-protection.sh

Requires gh-CLI auth with repo-admin permissions.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:38:20 +03:00
Andrew Altshuler
730712b73f
codeowners: yml source of truth + generator + drift CI (#88)
* codeowners: generator + drift CI + initial roles

Source-of-truth approach to CODEOWNERS: yml is hand-edited, CODEOWNERS
is generated and CI-enforced. Every role change is a reviewable PR
with a permanent in-repo audit trail. No GitHub UI clicks, no shadow
state.

Initial roles:

  engineering  @aaltshuler            owns crates/** + default (.github/,
                                       scripts/, Cargo.*, openapi.json,
                                       everything else not docs)

  docs         @aaltshuler @ragnorc   owns docs/**, README.md, AGENTS.md,
                                       CLAUDE.md, SECURITY.md

Per GitHub semantics, multiple owners on a CODEOWNERS line means "any
one satisfies the review" — for docs, either named member can approve.
Strict "N distinct approvers" would need a CI workaround (not wired
today; tracked for future hardening).

Components:

- .github/codeowners-roles.yml — source of truth. Edit this.
- .github/scripts/render-codeowners.py — generator (PyYAML; ~100 LoC).
- .github/CODEOWNERS — generated. CI rejects hand-edits.
- .github/workflows/codeowners.yml — two checks:
  * drift: re-render and assert CODEOWNERS matches.
  * noedit: reject PRs that edit CODEOWNERS without editing the yml.
- docs/codeowners.md — explains the source-of-truth pattern, how to
  change roles, how to add new roles.
- AGENTS.md topic-index row.

What's NOT in this PR:

- Branch protection on main (separate PR; needs `gh api` call against
  the org).
- Required-reviewer enforcement (depends on branch protection landing).
- Required CI status checks (depends on branch protection landing).
- Scheduled rotation (the schedule: block in the yml + a weekly
  workflow). Today's roles are stable; rotation isn't needed yet.
- Linear-as-source-of-truth integration (Approach 4 from the design
  discussion; deferred).

Verified:
- Generator output is deterministic (idempotent re-runs).
- scripts/check-agents-md.sh OK (28 links, 28 docs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* codeowners: fix catch-all ordering (Devin review #88)

Devin caught a real bug: GitHub CODEOWNERS uses "last match wins"
semantics, but the generator emitted the catch-all `*` AFTER specific
patterns. Net effect: `*` won for every file, silently nullifying the
docs role and never routing reviews to @ragnorc.

Fix is one-line — emit the default `*` line before iterating the
specific paths. Also:

- Added a regression assertion in the generator: after rendering, the
  first non-comment line must start with `*` if a default is
  configured. Generator exits non-zero otherwise. Catches the same
  class of mistake in any future refactor.
- Rewrote the yml header comment, which incorrectly stated "keep
  more-specific paths after broader patterns" (correct for GitHub
  semantics but the generator was doing the opposite — so the comment
  read as a description of behavior when it was actually a contradicted
  intention).

Verified by re-rendering: `*` is now line 12, `crates/**` is line 14,
`docs/**` is line 15, etc. README.md matches both `*` and `README.md`;
`README.md` is later → wins → @aaltshuler + @ragnorc both assigned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:26:06 +03:00
Andrew Altshuler
c142dafdf3
schema-lint chassis v0: code-tagged diagnostics (MR-694) (#87)
First slice of the schema-lint chassis. Adds stable `OG-XXX-NNN`
codes to schema-migration rejections so operators can suppress, look
up, and filter on identifiers rather than free-text prose. Atlas-style
chassis adapted to omnigraph's typed-IR substrate (no SQL injection
vector, no per-engine locks, native edge/vector/embedding types).

What's in v0:

- New `omnigraph-compiler/src/lint/` module with:
  - `diagnostic.rs` — Family / SafetyTier / Severity enums covering ten
    families (DS, MF, CD, BC, NM, OW, NL, VE, ED, LK). Only DS and MF
    are populated in this PR.
  - `codes.rs` — 8 DiagnosticCode constants (OG-DS-101..105,
    OG-MF-103, OG-MF-104, OG-MF-106). Five of the eight are wired to
    real emission sites; the other three are reserved.
  - Unit tests for catalog invariants: codes unique, prefix matches
    family, suffixes are 3-digit, destructive defaults to error,
    lookup() works, EMITTED_IN_V0 codes exist in ALL_CODES.

- `SchemaMigrationStep::UnsupportedChange` gains an optional
  `code: Option<String>` field. New `unsupported_error_message()`
  helper prefixes the message with `[code]` when present.

- 5 of 17 existing rejection paths now carry codes:
  - `removing node type` → OG-DS-102
  - `removing edge type` → OG-DS-103
  - `removing property` → OG-DS-104
  - `adding required property without backfill` → OG-MF-103
  - `changing property type` → OG-MF-106
  Remaining 12 paths carry `code: None` and are tagged as future work.

- `schema_apply` surfaces the formatted error (with `[code]` prefix);
  CLI `omnigraph schema plan` renders the code on the
  `unsupported change on <entity>` line.

- PR #62 destructive-rejection tests in `tests/schema_apply.rs` now
  assert on the stable code (`msg.contains("OG-DS-104")`) instead of
  the error-message substring. 11/11 tests pass.

- New `docs/schema-lint.md` documents the v0 catalog + the 10 families
  + Atlas prior art. AGENTS.md index updated.

What's explicitly NOT in v0 (subsequent PRs):

- No severity config in `omnigraph.yaml` (MR-694 §2).
- No `@allow(OG-XXX-NNN, "rationale")` suppression directive (§3).
- No `--allow-data-loss` flag or destructive-tier enforcement.
- No new `SchemaMigrationStep` variants (soft/hard drops, default,
  widen/narrow). MR-700, MR-697 land those.
- No pre-migration checks (MR-941).
- No CD / VE / LK / NM family rules (MR-942..945).
- No CI integration (MR-946).

Tests: 235 compiler tests, 11 schema_apply integration tests, 14
lint module tests, 55 CLI tests — all green.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 17:08:18 +03:00
Ragnor Comerford
24c0558180
docs: lead AGENTS.md first principle with integrated-over-time framing
Reframes the first-principle section to lead with Winters' "engineering
is programming integrated over time" as the lens, keeping "minimize
ongoing liability" as the operative directive and folding in "complexity
should be earned." Adds a new Tiebreakers subsection with two rules
that the prior section lacked clean appeals for:

- correctness > simplicity > performance (lexicographic)
- reversibility shapes evidence demand (reversible → prod metrics over
  napkin math over RFCs; irreversible → RFC up-front)

Adds a Hyrum's-Law deny-list entry in both AGENTS.md and
docs/invariants.md §IX: shipping observable behavior is shipping a
contract, even when undocumented.

Net always-on context cost: ~7 lines. No renumbering of §I–VIII
invariants; Hyrum's Law lands in the deny-list to avoid breaking
back-references.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:27:24 -07:00
Ragnor Comerford
3bd072c917
docs: add docs/transactions.md — branch-as-transaction explainer (#69)
The architectural rule "no cross-query BEGIN/COMMIT; branches fill that
role" lives in docs/invariants.md §VI.23 but is not surfaced anywhere
user-facing. New users coming from Postgres/MySQL hit the gap when they
realize multiple queries on main are independently atomic, not jointly
atomic.

This page explains the model with worked examples:
* Single-query multi-statement (atomic by default)
* Two separate queries on main (NOT atomic — common surprise)
* Many queries via a branch (atomic at merge)
* Coordinating multiple agents via branch-per-agent

Plus a comparison table to BEGIN/COMMIT, failure-mode rundown, and
"when to use what" decision matrix.

Linked from AGENTS.md "Where to find each topic" between
branches-commits.md and runs.md.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 22:35:57 +03:00
Devin AI
7a338a8223 agents: keep guide short for context 2026-05-10 14:41:02 +00:00
Devin AI
a42d178119 release: prepare omnigraph 0.4.2 2026-05-10 14:02:28 +00:00
Ragnor Comerford
6bd9f1c085
agents: rules 8 and 9 — test-first for bug fixes; correct by design
Add two durable engineering rules to the maintenance contract so they
load into context on every session:

- Rule 8: write a regression test that reproduces the bug first, confirm
  it fails, land it just before the fix commit so the red→green pair is
  visible in git log. A reviewer can check out the test commit alone and
  reproduce the failure.
- Rule 9: when a bug surfaces, identify the root cause and make the fix
  correct by construction. Don't patch the symptom. If the design admits
  the bug class, close the class — don't add a guard around the latest
  instance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:26:34 +02:00
Ragnor Comerford
fcb47620d3
mr-686: bundle PR 0/1a/1b foundation + PR 2 catalog/schema_source ArcSwap
Bundles the working-tree state from the prior session (PR 0 bench harness,
PR 1a audit_actor_id removal, PR 1b WriteQueueManager + writer integration)
together with the first half of PR 2's interior-mutability foundation
(catalog and schema_source wrapped in Arc<ArcSwap<...>>). The two streams
intermix in 7 of the same files, so splitting via git add -p was
impractical. Subsequent PR 2 steps land as separate atomic commits.

PR 0 — server-level concurrent /change bench harness
  - crates/omnigraph-server/examples/bench_concurrent_http.rs (new)
  - .context/bench-results/{baseline-main,after-pr1}/ (gitignored)

PR 1a — drop the audit_actor_id field, thread per-call
  - removed Omnigraph::audit_actor_id and the swap-restore patterns in
    mutation.rs, merge.rs, loader/mod.rs
  - actor_id: Option<&str> threaded through MutationStaging::finalize,
    mutate_with_current_actor, ingest_with_current_actor,
    branch_merge_impl, branch_merge_on_current_target,
    commit_prepared_updates*, record_merge_commit,
    commit_updates_on_branch_with_expected
  - apply_schema and ensure_indices_for_branch pass None (system-attributed)

PR 1b — per-(table_key, branch) write queue + revalidation + sidecar
  - new crates/omnigraph/src/db/write_queue.rs with WriteQueueManager,
    acquire/acquire_many, sorted+deduped acquisition; 6 unit tests
  - Arc<WriteQueueManager> field on Omnigraph + db.write_queue() accessor
  - MutationStaging::finalize split into stage_all (Phase A, no queue)
    and StagedMutation::commit_all (Phase B, acquire_many + revalidate
    pins + sidecar + commit_staged); guards held across publisher
  - delete-only mutations now emit recovery sidecars; revalidation
    extended to inline_committed tables
  - branch_merge_on_current_target, apply_schema_with_lock, and
    ensure_indices_for_branch acquire per-table queues for their
    touched tables

PR 2 Step B (partial) — catalog and schema_source via ArcSwap
  - catalog: Catalog -> Arc<ArcSwap<Catalog>>
  - schema_source: String -> Arc<ArcSwap<String>>
  - public accessors return Arc<Catalog> / Arc<String>; readers bind
    locally where the borrow has to outlive an expression
  - new pub(crate) store_catalog / store_schema_source helpers replace
    the field assignments in apply_schema and reload_schema_if_source_changed
  - 117 tests across lifecycle/end_to_end/branching/runs pass; engine
    lib + workspace compile clean

Coordinator wrap (Mutex) and the &mut self -> &self engine API
conversion follow in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:22:38 +02:00
Ragnor Comerford
05e52f2ee0
recovery: rename composite test, strip ticket references, address review
Three bundled changes:

1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
   the test function). OmniGraph is consumed by both humans and agents
   - naming the test after one audience misframes the library.

2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
   review-round labels from source, tests, and docs added by this
   branch. Internal traceability belongs in commit messages and PR
   descriptions, not in checked-in artifacts. Upstream
   lance-format/lance issue refs and pre-existing MR-XXX refs in docs
   not touched by this branch are left alone.

3. Two outstanding review findings addressed:
   - `needs_index_work_node` / `needs_index_work_edge`: propagate
     `count_rows` errors instead of `unwrap_or(0)`. Silently treating
     transient I/O failures as "0 rows" risked skipping a table from
     the recovery sidecar pin set that was actually about to be
     modified.
   - `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
     strengthen the assertion to fail when sidecar B classifies under
     a stale snapshot. The new assertion checks post-recovery Lance
     HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
     deleted + audit rows present" pair passed in both the bug and
     fix paths because both delete the sidecar and write an audit
     row; the differentiator is the post-recovery HEAD. Strengthening
     the assertion exposed an additional nuance: in this overlapping-
     sidecar scenario sidecar B's audit kind is RolledBack (no-op)
     rather than RolledForward, since sidecar A's roll-forward
     publishes Lance HEAD as the new manifest pin (absorbing B's
     work). The docstring now explains why this is correct given
     current `roll_forward_all` semantics.

All workspace tests pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:56:36 +02:00
Ragnor Comerford
932334ba01
recovery: document MR-847 ship across all reference docs (Phase 10)
Update the doc surface to reflect MR-847 having shipped end to end —
sidecar protocol, classifier, all-or-nothing decision tree, roll-forward
via ManifestBatchPublisher, roll-back via Dataset::restore with
fragment-set short-circuit, audit trail in
_graph_commit_recoveries.lance, OpenMode::{ReadWrite, ReadOnly}, and
the four migrated writers all carrying sidecars across Phase B → Phase C.

- docs/invariants.md §VI.23: change from "upheld at the writer-trait
  surface for inserts/updates/etc., per-table commit_staged → manifest
  publish window remains" to "upheld at the writer-trait surface AND
  across process boundaries". The MR-847 sweep closes the residual on
  the next Omnigraph::open. The "continuous in-process" property
  (no ExpectedVersionMismatch surfacing to subsequent writers between
  Phase B failure and process restart) is honest follow-up at MR-856.

- docs/runs.md: replace "Finalize → publisher residual" section with
  "Open-time recovery sweep (MR-847)" — describes the sidecar protocol
  lifecycle (Phases A-D), the sweep's classifier + decision dispatch,
  the audit trail, and the operator-facing query
  (omnigraph commit list --filter actor=omnigraph:recovery).

- AGENTS.md capability matrix "Atomic single-dataset commits" row:
  drop the "Layer (3) is not yet shipped — tracked in MR-847" caveat;
  describe the three layers as all shipping; reference MR-856 for the
  background-reconciler follow-up.

- docs/storage.md: add _graph_commit_recoveries.lance and
  __recovery/{ulid}.json to the on-disk layout (mermaid + prose).

- docs/branches-commits.md: new "Recovery audit trail (MR-847)"
  subsection describing the join from
  _graph_commits.lance:actor_id="omnigraph:recovery" to
  _graph_commit_recoveries.lance:graph_commit_id for operator
  post-mortem.

- docs/maintenance.md: note the MR-847 recovery floor on cleanup —
  --keep < 3 may garbage-collect Lance versions the recovery sweep
  needs as a rollback target. Default --keep 10 is safe.

- docs/testing.md: add tests/recovery.rs to the engine integration-test
  table; expand the failpoints.rs row to mention the four MR-847
  per-writer Phase B → recovery integration tests.

- .context/mr-847-design.md: prepend a "Status: DONE" stanza listing
  every commit hash + scope across phases 1-10.

AGENTS.md ↔ docs/ cross-link check passes (26 links, 26 docs).
Full workspace test sweep passes with --features failpoints (361 tests
across 20 binaries).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
8726ffe0a3
release: bump version to 0.4.1 2026-05-02 23:20:50 +02:00
Ragnor Comerford
5afde54d69
agents: stop overclaiming atomic multi-table publish — describe the three layers honestly
External reviewer flagged that the capability matrix's "Atomic
multi-dataset publish" cell implied Lance gives us a single primitive
for cross-table atomicity. It doesn't. The real contract is three
layers stacked:

  (1) per-table Lance `commit_staged` for the data write
  (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for
      cross-table ordering
  (3) recovery-on-open reconciler for the residual gap between (1)
      and (2) — NOT YET SHIPPED, tracked in MR-847.

Until MR-847 lands, a failure between per-table `commit_staged` and
the manifest publish leaves drift on the partially-committed tables
(the "Phase B → Phase C residual" documented in `docs/runs.md`).

Also enumerate the legacy inline-commit residuals (`append_batch`,
`merge_insert_batches`, `overwrite_batch`, `create_*_index`) alongside
`delete_where` and `create_vector_index` — they remain on the trait
pending Phase 1b call-site conversion + Phase 9 demotion.

End the row with an explicit DO NOT: future agents reading the
capability matrix should not describe atomicity as "fully upheld"
until MR-847 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:35:34 +02:00
Ragnor Comerford
9b0920b5da
address PR #70 bot review (Cubic + Cursor): 7 inline + failpoint test + invariants notes
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
  / `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
  `Dataset::add_columns` / `update_columns` / `drop_columns` /
  `truncate_table` / `restore` and the bare `.merge_insert(` /
  `.add_columns(` / `.update_columns(` / `.drop_columns(` /
  `.truncate_table(` method patterns. Deliberately avoid `.append(` /
  `.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
  arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
  `graph_coordinator.rs` — they're manifest-layer infra that legitimately
  uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
  `None`) to `open_dataset_head_for_write` for consistency with the
  sibling `indexed_tables` block. Schema apply rejects non-main
  branches at the lock-acquire step today, so behavior is unchanged;
  this is a defensive consistency fix that survives a future relaxation
  of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
  claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
  doc: softened the "stage_* + commit_staged are the only paths" /
  "trait funnels every write" overclaim. Inline-commit residuals
  (`delete_where`, `create_vector_index`) remain on the trait pending
  upstream Lance work (#6658, #6666); legacy `append_batch` etc.
  remain pending Phase 1b / Phase 9. Module doc now describes the
  current transitional state honestly.

Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
  but returned only `DeleteState`, dropping the post-delete dataset.
  Future callers migrating from the inherent `&mut Dataset` API would
  lose the post-delete dataset state needed for indexing /
  `table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
  matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
  the unused `Scanner` import (the marker existed only to suppress an
  unused-import warning — fixing the import is the cleaner answer).

Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
  with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
  between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
  triggers the failpoint via a schema-apply that adds a new node type;
  proves that existing tables are unaffected (Person mutation succeeds
  after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
  drift on tables outside the failed `added_tables` iteration.

`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
  writer-trait surface for inserts / updates / scalar-index builds /
  merge_insert / overwrite after MR-793 PR #70. Per-table
  commit_staged → manifest publish window remains; closing requires
  MR-847's recovery-on-open reconciler. `delete_where` and
  `create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
  primitives are the building blocks; the reconciler task itself is
  MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
  impl on `TableStore` with opaque-handle signatures; no test impl
  yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:47:07 +02:00
Ragnor Comerford
b87be5e9f0
agents: read every Lance page even slightly relevant, not just the obvious match
Behavior is interlocked across Lance pages — transactions reference
index lifecycle, index lifecycle references compaction, compaction
references row-id lineage. Skipping a "slightly relevant" page is how
alignment misses happen. The index alone is not a substitute for
reading the pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:44:41 +02:00
Ragnor Comerford
17bf978d0e
MR-793 follow-up: lance docs alignment audit + mandate full-page fetch via mdrip
* AGENTS.md / docs/lance.md: agents must use `npx mdrip` (not summarizing
  WebFetch) when consulting Lance docs. WebFetch routinely drops
  load-bearing details — `pub(crate)` blockers, sub-specs behind nav hubs,
  default flags. Lesson learned during the MR-793 alignment audit.
* docs/lance.md: add "Last alignment audit: 2026-05-02" stanza
  documenting MemWAL gap, lance#6666 companion ticket, stable-row-ID
  status (experimental, may unblock MR-848), FRI as documented
  compaction-friendly alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:41:32 +02:00
Ragnor Comerford
3135ff5d19
MR-793 phases 1-6: TableStorage trait + staged-write surface for engine writers
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.

Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
  sealed (only in-tree types can impl), single impl on `TableStore`
  delegating to existing inherent methods; `Omnigraph::storage()`
  accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
  `stage_create_btree_index`, `stage_create_inverted_index` —
  implementing the simple branch of Lance's `CreateIndexBuilder::execute`
  (scalar indices only; vector indices stay inline because
  `build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
  Six new tests in `tests/staged_writes.rs` pin both the new primitives
  and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
  walks engine source, fails on direct lance::* inline-commit API use
  outside `table_store.rs` / `db/manifest/`. Skips comment lines and
  honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
  `stage_create_*_index` + `commit_staged` instead of
  `create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
  named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
  merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
  delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
  use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
  inline because `InsertBuilder::execute_uncommitted` rejects empty
  data. The narrow inline window is bounded by `__schema_apply_lock__`.

Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
  (incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
  — 5 tests, all green.
* `cargo test --workspace` — green.

Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
  (mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
  residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
  removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
  (depends on Phase 1b).

Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
  vector-index builds can be staged outside the lance crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:03:15 +02:00
Ragnor Comerford
a61e82f47a
MR-794 step 2: docs — runs/invariants/architecture/execution + cleanup
Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.

MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
  section; document the in-memory accumulator + D₂ rule + the
  LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
  upheld for inserts/updates. Within-query read-your-writes is now
  load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
  accumulator (MR-794)" subsection with per-op flow; refresh the
  engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
  for the staged-write path; updated the LoadMode table to call
  out per-mode commit semantics; rewrote load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
  contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
  test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
  description and the L2 capability matrix row.

Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
  from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
  remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
  legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
  with omnigraph commit list/show.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:43:19 +02:00
Ragnor Comerford
35be20cb05
MR-771: demote Run to direct-publish via expected_table_versions CAS
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 08:52:50 +02:00
Claude
63babede4a
AGENTS.md: reframe 'minimize ongoing liability' as a general decision lens
The previous bullets read like a migration pattern (centralized
dispatcher, one match arm, no shape forks). That's one application, not
the principle. Reframe it as a bidirectional decision lens: ask "which
option has lower ongoing cost over time?" and let the answer be more
code, less code, DRYing, duplication, removal, addition, a new
abstraction, or flattening one — whichever shape converges over the
expected change horizon.

Add explicit examples of cases where the lower-liability option is
*more* code (dispatcher, migration framework, typed error variants) and
where it's *less* (premature abstractions, "just in case" paths,
helpers that wedge independently-evolving callers together) so readers
don't collapse the principle into "minimize code".
2026-04-29 12:32:30 +00:00
Claude
b6a596670a
AGENTS.md: add 'minimize ongoing liability' as a first principle
The always-on rules are concretizations of a broader engineering posture
that wasn't stated explicitly. Add a short section that frames the spirit
behind those rules:

- One centralized detection point, not many heal hooks.
- One dispatcher, not branch-on-shape in every consumer.
- One canonical shape after migration, not forks on "old vs new".
- Three similar lines beats a premature abstraction.
- Delete dead paths when their last caller leaves.

Plus a forward-looking review prompt ("what do these paths look like after
5 more changes like this?") so the principle bites at design time, not
just at review time.

The internal-schema-version mechanism we just shipped is a concrete
application: one stamp + one dispatcher + one match arm per change, no
heal hooks scattered across the engine. Codify the pattern so future
work doesn't drift back to ad-hoc.
2026-04-29 11:48:47 +00:00
Ragnor Comerford
6f25c4f9f8
Address reviewer feedback (Cursor + cubic) on PR #60
All eight comments verified against source and applied:

- AGENTS.md: pull @docs/{invariants,lance,testing}.md imports out of
  the markdown blockquote. Claude Code's @-import parser expects @ at
  column 0; the leading "> " of a blockquote silently broke
  recognition, so the claimed auto-include did nothing. (Cursor,
  Medium severity.)
- docs/cli-reference.md: command-family count 13 → 17. The current
  enum Command in crates/omnigraph-cli/src/main.rs has 17 top-level
  variants. (cubic P2.)
- docs/ci.md: Homebrew tap update is a regular `git push`, not a
  force-push (release.yml:117 is `git push origin HEAD:main`). (cubic
  P2.)
- docs/errors.md: add the Storage variant to the NanoError list — it
  exists at error.rs:88-89 but the doc enumerated only 10 of 11.
  (cubic P2.)
- docs/storage.md: clarify tombstone semantics. There is no
  tombstone_version column; state.rs:180 reads the tombstone version
  from the table_version column on rows where object_type =
  table_tombstone. (cubic P2.)
- docs/branches-commits.md: split the GraphCommit pseudo-struct from
  the underlying storage. actor_id is joined in-memory from
  _graph_commit_actors.lance, not a column on _graph_commits.lance.
  (cubic P2.)
- docs/schema-language.md: rename IR_VERSION to SCHEMA_IR_VERSION to
  match the actual constant name in catalog/schema_ir.rs:11.
  (cubic P3.)
- docs/testing.md: engine integration test count 16 → 15 (matches
  `ls crates/omnigraph/tests/*.rs`). (cubic P3.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-29 00:09:06 +02:00
Ragnor Comerford
ada58ccd7b
Make "check existing coverage first" a top-level testing principle
The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.

Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.

Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-29 00:03:50 +02:00
Ragnor Comerford
8be0e6a067
Add docs/testing.md as required-read every session
Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:55:21 +02:00
Ragnor Comerford
a06d8bcf82
Promote docs/lance.md to required-read every session
Adds @docs/lance.md alongside @docs/invariants.md so the Lance index
loads on every turn (Claude Code @-import; explicit-open instruction
for other agents). Reframes the directive from "when you hit a
Lance-shaped problem" to "consult before every task to identify which
upstream pages are relevant." The Lance docs are the authoritative
source for substrate behavior, so reasoning about them should start
every change rather than be triggered conditionally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:50:42 +02:00
Ragnor Comerford
b6440d6b17
Add docs/lance.md — task-organized index of Lance upstream docs
Curates the Lance documentation site (lance.org) into a problem-domain
index so agents fetch the right page when working on Lance-touching
code instead of guessing or grepping our codebase. Organized by topic:
storage format & file layout, branching/tags/time travel, indexes
(scalar + system + vector), reads/writes, schema evolution, object
store, data types, performance, compaction, DataFusion integration,
SDK reference, plus quick-starts and the upstream AGENTS.md.

Skips ~200 irrelevant URLs from the upstream sitemap (Namespace REST
API model surface, Spark/Trino/Databricks/etc. integrations,
Python/Ray/HuggingFace docs, community pages) since omnigraph is
Rust-only and doesn't run a Lance Namespace catalog.

AGENTS.md surfaces it in the topic index and adds a directive: "when
you hit a Lance-shaped problem, consult docs/lance.md and fetch the
upstream URL before guessing."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:48:28 +02:00
Ragnor Comerford
43724b9f18
Make docs/invariants.md required reading every session
Adds a top-of-file directive plus a Claude-Code @-import so the full
invariants document is loaded into context on every turn, not only
when an agent follows a pointer. Other agents are instructed to open
it explicitly at session start. The §IX deny-list and §X review
checklist apply to every change, so they should be in scope by
default rather than gated on the agent remembering to look.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:39:09 +02:00
Ragnor Comerford
1e7334275a
Trim always-on rules to architectural-level invariants
Drops four rules whose phrasing leaned on implementation specifics
(nearest LIMIT, __run__<id> branches, __schema_apply_lock__,
branch_list filter convention) — those are real constraints, but they
live at the implementation layer and would go stale if internals are
renamed or refactored. The architectural intent is captured by the
remaining six rules and by the per-area docs.

Reframes the kept rules at the survives-a-rename level: "multi-dataset
publish is atomic across the whole graph" rather than naming the
manifest table or the publisher type, etc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:38:24 +02:00
Ragnor Comerford
c924e121d2
Add architectural invariants & deny-list as docs/invariants.md
A standing reference for invariants that hold across storage, engine,
server, schema, indexing, observability, and the OSS/Cloud split. Used
to check RFCs and PRs against the substrate boundaries (don't rebuild
what Lance gives us), layering rules (one trait boundary per layer),
distributability constraints (Send+Sync, location-neutral IR), honesty
expectations (estimate-vs-actual, bounded failure modes), unified
patterns (reconciler, Union polymorphism, SIP, factorize), the §IX
deny-list, and the §X review checklist.

§IV (additivity / migration) and §VIII (OSS/Cloud kernel-product split)
are referenced but not yet drafted — flagged as placeholders pending
upstream fill-in.

AGENTS.md surfaces it from the topic index, the always-on rules
section, and the maintenance contract; the deny-list is also inlined
there as a fast-pass review filter so it stays in scope every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:34:44 +02:00
Ragnor Comerford
a335d98854
Refactor AGENTS.md from encyclopedia to map; move spec into docs/
Splits the 990-line AGENTS.md into a 184-line map (architecture,
where-to-find index, always-on invariants, capability matrix,
maintenance contract) plus 18 new docs/*.md files holding the deep
content per topic (storage, schema and query languages, indexes,
embeddings, branches/commits, runs, merge, changes, execution, policy,
server, CLI reference, audit, errors, CI, constants, v0.3.1 notes).

Adds scripts/check-agents-md.sh and a check_agents_md CI job that
verifies every docs/ link in AGENTS.md resolves and every doc in the
canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:31:08 +02:00
Ragnor Comerford
cfea41e942
Add AGENTS.md as canonical agent guide; symlink CLAUDE.md to it
Captures the v0.3.1 feature spec (storage, schema/query languages, IR,
indexes, embeddings, branches/commits/runs, merge, server, CLI, policy,
deployment) and adds a §26 maintenance contract instructing agents to
keep this file current alongside any user-visible change. CLAUDE.md is
a symlink so there's one source of truth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:10:09 +02:00