omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-15 01:55:13 +02:00

Author	SHA1	Message	Date
Ragnor Comerford	67baf615d9	build(deps): bump Lance 6.0.1 → 7.0.0 (correct-by-design substrate alignment) (#229 ) * build(deps): bump Lance 6.0.1 → 7.0.0 (object_store 0.13.2, roaring 0.11.4) Arrow stays 58 and DataFusion stays 53 (no change). The only transitive bump is object_store 0.12.5 → 0.13.2. 141 upstream commits reviewed; no fixes lost (the 6.0.x release-branch backports are all forward-ported into 7.0.0). - object_store 0.13 moved get/put/head/rename/delete behind a new ObjectStoreExt trait (list/list_with_delimiter/put_opts stay on the core trait). Add `use object_store::ObjectStoreExt` in storage.rs and db/manifest/namespace.rs; no call-site changes. Mirrors Lance's own migration in PR #6672. - roaring pinned to 0.11.4 (cargo update -p roaring --precise 0.11.4). Lance 7.0.0's UpdatedFragmentOffsets newtype (lance#6650) derives Eq over HashMap<u64, RoaringBitmap>, which needs RoaringBitmap: Eq, added in roaring 0.11.4; the loose `roaring = "0.11"` constraint otherwise resolves 0.11.3 and lance itself fails to compile. - lance#6774: merge-insert INSERT rows now stamp _row_created_at_version with the commit version (was a fallback of 1). Flip the lance_version_columns assertion to `== v2` and correct the changes/mod.rs rationale comment. Production change-detection keys on _row_last_updated_at_version + ID membership, so its logic is unaffected. Refs lance#6650, lance#6774, lance#6672. * fix(storage): pin WriteParams::auto_cleanup = None (lance#6755 default flip) lance#6755 flipped the WriteParams::auto_cleanup default from on (a full cleanup pass every 20th commit) to None. On 6.0.1 the on-by-default hook could silently GC versions that __manifest pins for snapshots/time-travel. OmniGraph owns cleanup explicitly (optimize.rs::cleanup_all_tables) and never set auto_cleanup, so it was relying on a default that is both wrong for our snapshot model and now changed upstream. Pin auto_cleanup: None explicitly at all 11 production WriteParams sites (table_store ×6, commit_graph ×2, recovery_audit ×1, manifest/graph ×2 — the __manifest + sub-table Create paths). Removes the dependency on a default-flag value and locks in the snapshot-safe behavior regardless of future upstream re-flips. Refs lance#6755. * test(lance): pin BTREE range-boundary correctness (lance#6796) lance#6796 (issue #6792) fixed a BTREE scalar-index range-query bound inclusiveness bug: `x <= hi AND x > lo` returned the wrong boundary row. Add lance_surface_guards.rs::btree_range_query_boundary_is_correct, which reproduces the exact #6792 shape (5 rows + an explicit BTREE drives the index path even on tiny data) and pins the corrected inclusive-<= / exclusive-> semantics. It turns red if a future Lance regression reintroduces the bug. OmniGraph today builds BTREE only on string @key columns and queries them by equality/IN, so its current patterns do not hit this; the guard protects any future BTREE-range path (BTREE-on-properties, range-on-key). Refs lance#6796. * docs(dev): align Lance docs + invariants to 7.0.0 - docs/dev/lance.md: new 2026-06-14 alignment stanza for the 6.0.1 → 7.0.0 bump (object_store ObjectStoreExt move, roaring 0.11.4, #6774/#6796/#6755 behavior, #6658 shipped → MR-A unblocked but separate, #6666 + blob compaction still open); prior 6.0.1 stanza demoted to historical. - AGENTS.md: storage substrate 6.x → 7.x (line + architecture diagram). - docs/dev/invariants.md: deletes/vector known gap updated — the staged two-phase delete API (lance#6658) now exists and MR-A is unblocked, but delete_where stays inline and D2 stays in place until the migration lands; create_vector_index still gated on lance#6666. * fix(storage): skip Lance auto-cleanup on commit paths for legacy datasets Addresses PR #229 review (Codex P1). `WriteParams::auto_cleanup` is create-time config with no effect on existing datasets (Lance write.rs docs), so the previous `auto_cleanup: None` change alone did NOT protect graphs created before the v7 bump: 6.0.1 defaulted auto_cleanup ON, leaving `lance.auto_cleanup.` config on those datasets, and Lance's per-commit hook (io/commit.rs: `if !commit_config.skip_auto_cleanup`) fires off that stored config — so omnigraph's own writes would GC versions the __manifest pins for snapshots/time-travel. Skip the hook on every commit path, covering new and legacy datasets alike: - commit_staged: CommitBuilder::with_skip_auto_cleanup(true) — the staged data path. - __manifest publisher: MergeInsertBuilder::skip_auto_cleanup(true). - all 11 WriteParams: skip_auto_cleanup: true (direct Dataset::write/append paths; auto_cleanup: None retained so new datasets store no cleanup config at all). Tests: - lance_surface_guards::skip_auto_cleanup_suppresses_version_gc — substrate: negative control (config GCs v1 without skip) + with-skip survival. - staged_writes::commit_staged_skips_auto_cleanup_so_pinned_versions_survive — omnigraph usage: commit_staged on a legacy-config dataset preserves the pinned create version. Refs lance#6755. test(lance): assert created_at-preserved + updated_at-bumped on merge_insert UPDATE Addresses PR #229 review follow-up. `lance_merge_insert_update_preserves_created_at_version` documented (in a comment) that a merge_insert UPDATE preserves created_at and bumps updated_at, but only asserted the value change — leaving the change-feed invariant unguarded. Add the two missing assertions: - bob created_at == v1 (preserved across UPDATE; what the test name promises; lance#6774 only changed INSERT-row stamping). - bob updated_at == v2 (bumped to the commit version) — the invariant OmniGraph's insert/update classification relies on (changes/mod.rs keys on _row_last_updated_at_version). A regression here would silently drop updates from the diff/change feed.	2026-06-14 20:42:24 +02:00
Ragnor Comerford	1bed998052	fix(engine): scalar index coverage + filter literal coercion (query latency) (#216 ) * fix(engine): lower date/datetime filter literals as typed Arrow scalars `literal_to_expr` lowered `Date`/`DateTime` query literals as Utf8 strings, relying on DataFusion implicit casts. Against a physical `Date32`/`Date64` column that can coerce the column side (`CAST(col AS Utf8)`), which defeats a scalar BTREE and degrades the scan to a full filtered read. Lower to typed `Date32`/`Date64` scalars instead (reusing the loader's `parse_date32_literal`/`parse_date64_literal`, already used by the in-memory comparison arm), so the predicate stays a direct column comparison and the index is used. Malformed literals fall back to the Utf8 string so pushdown behavior never regresses. Tests: unit goldens asserting the lowered literal is typed (red before, green after) + inline-binding pushdown equality in literal_filters confirming the epoch conversion selects the right rows. * fix(engine): build scalar BTREE for enum and orderable-scalar @index columns `build_indices_on_dataset_for_catalog` only handled `String` (-> FTS) and `Vector` (-> vector). Enums are physically `String`, so an enum `@index` column (e.g. `status`) got an FTS inverted index, which Lance never consults for `=`; and `DateTime`/`Date`/numeric/`Bool` `@index` columns fell through and built nothing. Both meant equality/range filters degraded to full scans with `indices_loaded=0`. Dispatch index kind by property type via a shared `node_prop_index_kind`: enum + orderable scalar -> BTREE, free-text String -> FTS, Vector -> vector, list/Blob -> none. The helper is shared by the builder and `needs_index_work_node` so they cannot drift — the latter decides recovery- sidecar pinning, and under-reporting would leave a HEAD-advancing index build uncovered (invariant 5). Tests: scalar_indexes.rs asserts enum/DateTime/numeric @index columns report `IndexCoverage::Indexed` while free-text String/un-annotated columns stay `Degraded` (negative control). Docs: docs/user/indexes.md. * feat(engine): reindex in optimize to keep index coverage current A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the build (e.g. `ingest --mode merge`, whose commit does not rebuild an existing index) are scanned unindexed, and `compact_files` rewrites fragments out of coverage. Nothing folded them back in, so coverage decayed as the graph grew — even the id/src/dst BTREEs that power traversal. `optimize_one_table` now runs Lance `optimize_indices` after `compact_files` (incremental merge, not retrain — the same compact->optimize_indices sequence LanceDB's `optimize()` uses) and enters the publish path on compaction work OR stale index coverage (new `TableStore::has_unindexed_fragments`, reusing the fragment_bitmap logic). `optimize_indices` is a committing call with no uncommitted variant in lance-6.0.1, so it is an inline-commit residual covered by the existing `SidecarKind::Optimize` recovery sidecar spanning both ops. Blob-bearing tables are still skipped (the Lance blob-compaction bug is compaction-specific; reindex-for-blob deferred as a noted follow-up). Tests: maintenance.rs asserts an appended fragment is uncovered before and covered after optimize, and idempotency holds (second pass is a no-op). lance_surface_guards pins the `optimize_indices` signature and its incremental- coverage behavior. The existing optimize Phase-B recovery failpoint now also exercises a crash after reindex. Docs: maintenance.md, writes.md, invariants.md, lance.md, AGENTS.md. * fix(engine): coerce pushdown filter literals to the column type Filter literals were pushed to Lance in their natural Arrow type (every integer Int64, every float Float64). Against a narrower indexed column DataFusion widens to the literal's type and casts the COLUMN (`CAST(n32 AS Int64)`), which defeats the scalar BTREE and degrades to a full filtered read. A physical-plan probe confirms it: an Int32 column filtered by an i32 literal uses `ScalarIndexQuery`; by an i64 literal it does not. Thread the scan's `arrow_schema` through `build_lance_filter_expr` -> `ir_filter_to_expr` and coerce each literal operand to the opposite column's exact Arrow type, reusing `projection::literal_to_array` + `arrow_cast` (the same path the in-memory arm uses, so the two arms agree). Coercion never demotes a filter to None: on failure it falls back to the natural literal, because a node scan has no in-memory fallback for inline filters. Supersedes the date-specific change in `e4ef67b` (PR1): the probe shows dates were never index-defeated — temporal coercion casts the LITERAL, not the column — so PR1's index-use rationale was wrong though harmless. The generic coercion subsumes it; `literal_to_expr`'s date arms revert to the natural Utf8 fallback, and its unit tests now assert the live coerced path. Tests: surface guard `scalar_index_use_requires_matched_literal_type` pins the substrate behavior (matched -> index, widened -> column-cast full scan); unit tests cover Int32/UInt32/Float32 coercion, range op, reversed operand order, and the natural fallback; `literal_filters` adds an I32 column with equality + range and an F32 pushdown case. * fix(engine): only coerce filter literals when the cast is lossless The literal coercion in `f064121` narrowed unconditionally. typecheck permits numeric cross-type comparisons (`types_compatible`), so an out-of-domain literal reaches `literal_to_typed_expr` and casts lossily: a fractional float vs an integer column truncates (`{ count: 2.7 }` -> `count = 2`, wrongly matching the count=2 row) and an out-of-range integer overflows to null (`count < 3e9` on I32 -> `count < NULL` -> empty). Both silently change results, and a node scan has no in-memory fallback for inline filters. Add a lossless guard for integer targets: round-trip the cast back to the natural type and, on mismatch, return None so the caller keeps the natural literal (correct via DataFusion coercion; the index is just unused for that out-of-domain predicate). Float targets stay coerced -- narrowing F64 -> F32 is the column's own precision domain, not a value error. Resolves the two valid review findings on PR #216 (Codex float truncation, Greptile out-of-range). Tests: unit cases for fractional/out-of-range fallback vs whole-float/in-range coerce vs F32 exemption; e2e `{ count: 2.7 }` returns no rows.	2026-06-14 16:31:19 +02:00
Andrew Altshuler	612741b387	docs(user): split language/branching pages + add front-door pages (Phase 2) (#225 ) Content build-out on top of the Phase 1 topic move. No behavior changes. Splits (existing content relocated, cross-linked): - queries/index.md → mutations/index.md (insert/update/delete + the inserts-vs-deletes rule) and search/index.md (the multi-modal search functions + a hybrid-ranking overview tying nearest/bm25/rrf together). queries/index.md now covers the read shape and points at both. - branching/index.md → branching/time-travel.md (snapshots/time travel) and branching/merge.md (three-way merge + the 7 conflict kinds, verified against error.rs MergeConflictKind). New pages (written from the code, user-facing): - quickstart.md — init → load → query → branch, with verified CLI flags. - concepts/index.md — what OmniGraph is + the L1/L2 (Lance/OmniGraph) framing. Expanded operations/audit.md from a 7-line struct dump into a real actor-tracking page (server token-resolved vs CLI --as chain; reading the trail; the omnigraph:recovery reserved actor). Index wiring: docs/user/index.md and AGENTS.md's topic table link every new page; also normalized AGENTS.md's docs/user link display text to match the Phase 1 retargeted paths. Verified: zero broken .md links; check-agents-md.sh green (57 links, 54 docs). Deferred to Phase 3: de-dev polish (grammar paths, IR internals still in queries/branching), guides/, and a possible reference/config.md split (the config schema is already coherent in cli/reference.md). Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 13:53:46 +03:00
Andrew Altshuler	d46e50dd6d	docs(user): restructure user docs into topic sections (Phase 1) (#223 ) Move the 23 flat docs/user/*.md files into topic subdirectories so the user guide is organized by area (schema, queries, search, branching, cli, operations, clusters, concepts, reference) instead of a flat list. This is a pure structural move — whole files relocated, every cross-doc link recomputed, no prose rewrites or content splits (those follow in Phase 2). - 19 `git mv`s (install.md, deployment.md stay top-level); history preserved (renames detected at 92–100% similarity). - All intra-doc links, AGENTS.md's topic table (52 pointers), and the docs/dev + docs/releases back-links recomputed via relpath from each file's new location. - docs/user/index.md rewritten as a sectioned nav hub. - Fixed 5 doc-path references in Rust (comments + two user-facing server settings error strings) to point at the new locations. Verified: zero broken .md links across tracked docs; check-agents-md.sh green (with the untracked scratch docs set aside); touched crates build. Note: the public site (omnigraph-web) imports docs/ via a flat-only script; its import-docs.mjs needs a subdir-aware update before the next re-sync. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 13:52:14 +03:00
aaltshuler	3e2502c35e	docs: omnigraph-api-types in the crate list; RFC-009 Phase 2 outcome Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-13 17:10:00 +03:00
Ragnor Comerford	446b46d548	Recovery liveness, storage fault-injection matrix, and one storage implementation over object_store (#203 ) * test(engine): pin the long-lived-handle heal contract for sidecar-covered drift A Phase B -> Phase C failure (commit_staged advanced Lance HEAD, manifest publish did not land, recovery sidecar persists) currently wedges every subsequent staged write on the same engine handle: the commit-time drift guard rejects with 'run omnigraph repair', but repair itself refuses while a recovery sidecar is pending, so a long-lived server can only recover by restart. The documented contract (writes.md 'Long-running servers', invariants.md invariant 5) says refresh-time roll-forward closes this residual without restart -- but no write path runs it. Two red tests pin the intended contract at the write entry points: a follow-up load (the POST /ingest shape: shared handle, no reopen) and a follow-up mutation must heal roll-forward-eligible sidecars in-process and then succeed. Currently failing with: table 'node:Company' has Lance HEAD version 2 ahead of manifest version 1; run `omnigraph repair` before writing The fix lands in the next commit. * fix(engine): heal pending recovery sidecars at the staged-write entry points Close the long-lived-process gap in the recovery protocol: a Phase B -> Phase C residual (per-table commit_staged landed, manifest publish did not, sidecar persists) previously recovered only at the next ReadWrite open or via an explicit refresh() that no production write path called, so a long-lived server wedged every subsequent write on the commit-time drift guard until restart. New recovery::heal_pending_sidecars_roll_forward: - one list_dir of __recovery/ at write entry (empty -> immediate return, the steady state), so the per-write cost is one storage list; - per sidecar, acquires the same per-(table_key, table_branch) write queues every sidecar writer holds from before write_sidecar until after delete_sidecar, then re-checks sidecar existence -- this serializes the heal against live writers instead of rolling an in-flight sidecar forward from under its writer (which would fail that writer's publish CAS spuriously). Lock order queues -> coordinator matches every writer's commit->publish path. This is the queue-acquisition design recovery.rs and write_queue.rs already documented for in-process recovery; - processes in RollForwardOnly mode: the common residual rolls forward in-process; rollback-eligible sidecars still defer to the next ReadWrite open (Dataset::restore is unsafe under concurrency). Wire it into load_as and mutate_as (before the inline delete path can advance any HEAD), and rebase Omnigraph::refresh onto the same helper so refresh stops racing live writers' sidecars. The maintenance entry points (apply_schema_as, branch_merge_as, ensure_indices) intentionally keep their strict fail-loud preconditions for now; wiring the same heal there is a follow-up with its own tests. Turns the previous commit's two red tests green. * fix(engine): name the right recovery path in the commit-time drift guard The drift guard's 'run omnigraph repair before writing' advice is a dead end when the drift is covered by a pending recovery sidecar: repair refuses while a sidecar is pending. With the write-entry heal in place, reaching this guard with sidecar-covered drift means the heal deferred it (rollback-eligible), and the actual recovery path is a read-write reopen. Distinguish the two classes on the error path only (one sidecar list, after the conflict is already certain); a listing failure falls back to the uncovered-drift wording rather than masking the conflict. Pinned by extending refresh_defers_rollback_eligible_sidecar_to_next_open with a write attempt against the deferred sidecar. * docs: write-entry in-process sidecar heal — contract and coverage Update the recovery contract docs to match the previous two commits: invariant 5 now states that the staged-write entry points and refresh run in-process roll-forward recovery (long-lived processes converge on the next write, not at restart); writes.md 'Long-running servers' describes the heal's queue-acquisition concurrency contract, the improved drift-guard error, and the entry points that intentionally do not heal yet; testing.md indexes the new failpoint tests; AGENTS.md capability matrix drops the claim that in-process recovery is entirely future work (only the rollback path remains with the background reconciler). * test(engine): pin the entry heal contract for schema apply and branch merge Without the write-entry heal, the two maintenance writers do worse than wedge on sidecar-covered drift -- they proceed and decide its fate implicitly: - schema apply re-plans table rewrites from the manifest pin, orphaning the drifted Phase-B commit (its rows silently vanish from the rewritten table) while the stale sidecar lingers to misclassify against the post-apply pins; - branch merge publishes over the drift, making the failed writer's commit visible as an unattributed side effect (no recovery audit row), and leaves the stale sidecar behind. Two red tests pin the intended contract: both entry points heal the sidecar first (attributed roll-forward), then run on the converged state. Currently failing on the stale-sidecar / dropped-rows assertions; the fix lands in the next commit. * fix(engine): heal pending recovery sidecars at the schema-apply and branch-merge entries Extend the write-entry heal to the remaining two write entry points. Unlike load/mutate (which wedge on the drift guard), these proceeded over sidecar-covered drift and decided its fate implicitly: - schema apply re-planned table rewrites from the manifest pin, orphaning the drifted Phase-B commit -- its rows silently vanished from the rewritten table -- while the stale sidecar lingered to misclassify against the post-apply pins; - branch merge published over the drift, making the failed writer's commit visible without a recovery audit row, and left the stale sidecar behind. Both now run the same queue-serialized roll-forward heal at entry, before their own sidecar exists, so recovery is attributed (audit row) and deterministic. ensure_indices stays heal-free: it runs inside the load / schema-apply flows after their entry heal. Turns the previous commit's two red tests green. Docs updated in the same change (invariant 5, writes.md, testing.md, AGENTS.md). * test(engine): pin Phase A sidecar-write failure semantics Storage fault-injection matrix, row 1: a sidecar PUT failure (S3 PutObject / fs write) in Phase A. New failpoint recovery.sidecar_write at the top of write_sidecar -- the single choke point all five sidecar writers go through -- models the storage error backend-generically. Also adds the other three storage-fault failpoints used by the following commits (recovery.sidecar_delete, recovery.sidecar_list, recovery.record_audit); each is a no-op without the failpoints feature. Pinned contract: every writer writes its sidecar BEFORE its first HEAD-advancing commit, so a put failure aborts with zero drift (no sidecar, Lance HEAD == manifest pin, no rows) and a transient fault never wedges the graph -- the same handle writes/merges normally once it clears. Covered for load (the staging writer) and branch_merge (the multi-table writer, forced onto the RewriteMerged path by diverging both sides). * test(engine): pin Phase D delete, list, and audit-append storage-fault semantics Storage fault-injection matrix, rows 2/3/5, plus the real-backend run: - recovery.sidecar_delete: a Phase D delete failure (S3 DeleteObject) must NOT fail the user's write -- the manifest publish already landed, so the caller's data is durable. The swallowed failure leaves a stale sidecar; the next write's entry heal consumes it via the stale-sidecar audit-recovery path (RolledForward, attributed). - recovery.sidecar_list: a __recovery/ list failure (S3 ListObjectsV2) is loud at every consumer -- the write-entry heal fails the write and the open-time sweep fails the open. Silently skipping recovery over a pending sidecar would be consumer tolerance of drift. Once the fault clears, open recovers the pending sidecar normally. - recovery.record_audit: an audit write failure after the roll-forward's manifest publish aborts that recovery attempt and keeps the sidecar; re-entry detects the already-published manifest, records exactly ONE RolledForward audit row, and converges -- the retry tolerance documented on record_audit, exercised end-to-end. - s3_load_recovers_after_publisher_failure_without_reopen: the same-handle heal scenario on a real bucket (gated on OMNIGRAPH_S3_TEST_BUCKET, skips locally), exercising sidecar put/list/delete through S3StorageAdapter instead of the local-FS adapter. CI wiring lands in a follow-up commit. * test(engine): refuse corrupt recovery sidecars loudly Storage fault-injection matrix, row 4 (no failpoint needed -- the corrupt file is written by hand, sibling to the unknown-schema-version refusal test): a truncated/garbage __recovery/{ulid}.json must be refused loudly by both the write-entry heal (the write fails naming the parse error) and the open-time sweep (ReadWrite open fails naming the file), with the file left on disk for operator inspection. Read-only opens still work -- the sweep is skipped there. * test(engine): run the S3 sidecar-lifecycle coverage in CI + document the fault matrix - ci.yml rustfs_integration: new step running the bucket-gated failpoints tests (name filter s3_) against the RustFS container, so sidecar put/list/delete are exercised through S3StorageAdapter on every storage-affecting PR. - writes.md: sidecar I/O failure semantics -- Phase A put failure aborts with zero drift; Phase D delete failure is swallowed (write already durable) and healed by the next write; list failures are loud at heal and open; corrupt sidecars are refused with the file kept for inspection; audit-append failures are retried to exactly one audit row. - testing.md: index the storage-fault matrix in the failpoints.rs row and the new RustFS CI line. * test(engine): pin read-visibility of acknowledged local if-absent writes The cluster lib test import_missing_state_creates_state_with_graph_- observation flakes at ~50% under full-workspace load ('EOF while parsing a value' reading back the state.json its own import just acknowledged). Root cause is in the engine's local storage adapter: write_text_if_absent writes through a buffered tokio::fs::File and returns when write_all resolves -- which, per tokio's documented File semantics, means the bytes reached tokio's internal buffer, not the file. The actual write completes in a background blocking task after drop, so a caller that acknowledges success and reads the object back can see an empty or partial file. Under load the window widens; the red run fails at iteration 0 with 0 of 8192 bytes on disk. The regression test pins the contract at the adapter boundary: when write_text_if_absent resolves, the full contents are visible to any reader; a losing second claim leaves the winner's object untouched. The fix lands in the next commit. * fix(engine): publish local storage writes with atomic visibility Close the class, not the instance. The local adapter admitted three ways for a reader to observe a write that was acknowledged or visible before its bytes were complete: 1. write_text_if_absent acknowledged success when the buffered tokio::fs::File write_all resolved -- i.e. when the bytes reached tokio's internal buffer, not the file. A caller reading back its own acknowledged write could see an empty object (the ~50% cluster import flake under full-workspace load; the regression test failed at iteration 0 with 0 of 8192 bytes visible). 2. The same call published its CLAIM (create_new) before its CONTENT, so concurrent readers saw an empty claimed file in the window. 3. write_text (plain tokio::fs::write) exposed truncated content mid-replace -- silently falsifying write_sidecar's 'readers either see the complete sidecar or none' contract on local FS (true on S3, where PutObject is atomic). A flush in write_text_if_absent would have fixed only (1). Instead, both local write paths now publish complete temp files atomically: rename for replace (write_text -- the idiom write_text_if_match already used) and hard_link for no-replace (write_text_if_absent -- link fails AlreadyExists, so exactly one of N concurrent claimants wins and the winner's object is fully readable at the instant it becomes visible). The local adapter now honors the same object-level atomic-visibility contract as the S3 adapter, which is what every caller (recovery sidecar protocol, cluster state CAS) was written against. Crash-orphaned .tmp. files are inert: the sidecar sweep filters to .json, and cluster state reads address state.json by name. fsync/durability policy is unchanged (no fsync before, none now); this fix is about visibility ordering, not power-loss durability. Pre-existing on main (landed with the multi-graph server mode change, PR #119); surfaced by this branch's heal work only because one extra list_dir per write shifted test timing. Cluster lib suite: 12/25 failures before, 0/25 after. Turns the previous commit's red test green. * refactor(engine): one storage implementation over object_store for every backend Collapse LocalStorageAdapter (hand-rolled tokio::fs) and S3StorageAdapter into a single ObjectStorageAdapter backed by Arc<dyn object_store::ObjectStore> -- LocalFileSystem for local URIs, the existing AmazonS3 build for s3://, plus a pub in_memory() constructor (full contract including TRUE conditional updates; the in-memory test backend testing.md asked for at the adapter level). Why: the acknowledged-before-visible bug showed the two-impl shape has no referee -- one prose contract, two independent answers. Upstream LocalFileSystem::put_opts is byte-for-byte the staged-temp+rename/ hard_link idiom that fix converged on, and Lance's own commit protocol is built on the same primitives (put-if-not-exists / rename-if-not- exists), so the substrate-aligned move is to stop hand-rolling it. The per-backend residue shrinks to a UriCodec (URI <-> object path) and one capability flag. Semantics preserved by construction, with three deliberate deltas: - exists() is now object-store-semantics everywhere (head + non-empty prefix fallback): an EMPTY local directory no longer 'exists'. The only dir-shaped caller (_graph_commits.lance probes) self-heals via ensure_commit_graph_initialized where it previously wedged loudly. - A directory at an object path reads as NotFound, not as an IO error ('only objects exist'). The cluster unreadable-payload test used a same-named directory as a portable non-NotFound trigger; it now uses chmod 000, which still models genuine transient IO. - write_text_if_match keeps content-token semantics on local (PutMode::Update is NotImplemented upstream for LocalFileSystem in 0.12.5 and 0.13.2); the capability flag gates the token SOURCE in read_text_versioned too -- an ETag token with content-compare writes would lose every CAS. delete_prefix keeps a local remove_dir_all branch: directories are a local-FS concept, and list+delete would leave empty skeletons that cluster graph_root_exists (raw Path::exists) reports as still present. LocalStorageAdapter remains as a delegating shim so the pinned contract tests gate this swap textually unchanged; the shim and the test parameterization over local + in-memory land next. Cargo gains the explicit 'fs' feature (already transitively enabled by lance). * test(engine): one executable storage contract, run against every backend Remove the LocalStorageAdapter delegation shim and migrate its construction sites to ObjectStorageAdapter::local(). Replace the per-backend duplicated tests with a single contract_suite asserting the trait's promises (atomic replace, exists incl. the dataset-root prefix probe, one-winner if_absent, versioned CAS with loud CAS-lost, rename, list round-trip with no sibling-prefix bleed, idempotent delete/delete_prefix), run against the local backend and the new in-memory backend -- which implements true conditional updates, so the strong-CAS path is exercised without a bucket. The bucket-gated S3 variant already exists (s3_adapter_conditional_writes_contract). New local-specific pins for the deliberate semantic edges of the collapse: empty directories are not objects (exists=false; the Lance dataset-root probe shape is the non-empty case), file://-anchored and spaces-in-path list output round-trips byte-identically into read_text, dot-segment paths are lexically absolutized (the CLI's ./graph.omni shape), and upstream rename creating missing destination parents. The acknowledged-write visibility regression test stays, now documenting that the cross-API std::fs read-back is the point. * refactor(cluster): drop put_json's per-backend atomicity branch The local temp+rename dance predates the storage adapter guaranteeing atomic visibility; now that write_text publishes via a staged temp + rename on the filesystem (and a single atomic PUT on object stores) by contract, the branch duplicated upstream behavior. One call, both backends. * docs: storage adapter collapse — contract, in-memory backend, local CAS gap - testing.md: the 'no MemStorage backend' note is half-closed — ObjectStorageAdapter::in_memory() covers the text-object layer with the full contract (true conditional updates); Lance datasets bypass the adapter, so the engine substrate ask stays open. - invariants.md: truth-matrix Tests row updated; new Known Gap for local write_text_if_match (upstream PutMode::Update is unimplemented for LocalFileSystem; content-token emulation is safe only under the cluster lock protocol — close before admitting a lock-free caller). - writes.md: backend notes for the unified adapter (name#N staging residue invisible to the sweep, backend-wrapped error text with exists()-probing for missing-vs-error, loud permission failures). * docs: finish renaming the storage adapters in user docs and test comments storage.md's URI-scheme table and the S3 failpoint test's doc comment still named the deleted LocalStorageAdapter/S3StorageAdapter; both now describe the unified ObjectStorageAdapter over object_store, including the relative-path absolutization note for local URIs. * test(engine): pin branch-awareness of the drift guard's recovery advice A pending sidecar on ANOTHER branch does not cover this branch's drift: with a deferred feature-branch sidecar on disk and genuinely uncovered drift on main, the main write's error must still point at omnigraph repair -- a read-write reopen recovers the sidecar but cannot repair main's uncovered drift. Currently red: the guard matches sidecar pins by table_key only, so the feature sidecar flips main's advice to the reopen path. Fix in the next commit. Surfaced by external review of the drift-guard change. * fix(engine): branch-aware sidecar matching in the drift guard's advice The commit-time drift guard's sidecar-covered check matched pins by table_key alone, so a pending sidecar on another branch flipped this branch's uncovered-drift advice from 'run omnigraph repair' to the reopen path -- and a reopen recovers that sidecar but cannot repair this branch's drift. Compare the pin's table_branch too. Turns the previous commit's red test green. Surfaced by external review of the drift-guard change. * test(engine): pin heal non-interference with a live schema apply The write-entry heal's schema-staging reconcile runs before any queue acquisition, so a load on the same handle, overlapping a schema apply parked between its staging write and manifest commit, promotes the apply's staging files (new catalog live against the old manifest), classifies the LIVE apply's sidecar, and publishes its registrations out from under it. The resumed apply then collides with its own stolen commit. Currently red with: Lance("Concurrent modification: table version 3 already exists for node:Tag") The fix (per-sidecar reconcile under the sidecar's write-queue guards, plus a serialization key the schema-apply writer and the heal both acquire) lands in the next commit. Surfaced by external review of the write-entry heal. * fix(engine): serialize the heal's schema-staging reconcile with live schema applies The write-entry heal ran recover_schema_state_files up front, before acquiring any queue guards. Overlapping a live schema apply parked between its staging write and manifest commit, the heal promoted the apply's staging files (new catalog live against the old manifest), classified the LIVE apply's sidecar, and published its registrations — the resumed apply then collided with its own stolen commit. Correct by construction: - New schema-apply serialization queue key, acquired by the schema- apply writer (alongside its per-table keys) from before write_sidecar until after delete_sidecar. Per-table keys alone don't cover a registration-only migration, which pins no existing tables but has a sidecar and staging files on disk. - The heal reconciles schema staging lazily, PER SchemaApply sidecar, after acquiring that sidecar's guards (including the serialization key) and re-confirming the sidecar exists — a sidecar that survives the queue wait belongs to a dead writer, so the reconcile can no longer race a live apply. Recomputing per sidecar also removes the staleness of one up-front result across a multi-sidecar pass. - Omnigraph::refresh drops its up-front reconcile-and-pass-through (same race, and a pre-promoted result would make the heal's guarded reconcile see clean staging and wrongly defer the sidecar): it now reconciles standalone only when NO sidecar exists — which cannot race a live apply, whose sidecar always precedes its staging files — and otherwise defers entirely to the heal. The open-time sweep keeps its precomputed reconcile: open has no concurrent writers. Turns the previous commit's red test green. Surfaced by external review of the write-entry heal. Self-audit addendum folded in: refresh's no-sidecar gate had a TOCTOU (a live apply could write its sidecar + staging between the empty check and the reconcile) — the standalone reconcile now holds the serialization key across the list-then-reconcile pair. The remaining residual is cross-process only (in-process queues cannot serialize against a writer in another process; the open-time sweep has the same pre-existing exposure) and is now an explicit Known Gap in invariants.md rather than an implicit one. * test(engine): pin catalog reload after the heal recovers a schema apply When the write-entry heal rolls a crashed apply's SchemaApply sidecar forward on the same handle, disk and manifest move to the new schema (staging promoted, registrations published) but the handle's in-memory schema_source/catalog do not. Subsequent writes then validate against the stale catalog and reject rows of types the graph already has. Currently red with: record 1: unknown node type 'Tag' refresh() reloads after its heal; the write entry points must too. Fix in the next commit. Surfaced by external review of the write-entry heal. * fix(engine): reload the in-memory catalog after the heal recovers a schema apply heal_pending_recovery_sidecars refreshed the coordinator and invalidated the runtime cache after processing sidecars, but never reloaded schema_source/catalog — so a write whose entry heal rolled a crashed SchemaApply sidecar forward proceeded to validate against the OLD schema while disk and manifest were already on the new one. reload_schema_if_source_changed is the same post-heal step refresh() already runs; it no-ops on the (overwhelmingly common) non-schema heal because the on-disk source is unchanged. Turns the previous commit's red test green. Surfaced by external review of the write-entry heal. * test(engine): pin that a deleted-branch sidecar cannot wedge the graph A rollback-eligible sidecar pinned to a branch is deferred by every roll-forward-only pass; if the branch is then deleted, the sidecar survives, referencing a branch with no manifest tree. The heal (every write entry) and the open-time sweep (every ReadWrite open) both fail opening the dead branch, and repair refuses while a sidecar is pending -- a terminal read-only state with manual sidecar surgery as the only exit. Currently red with: Lance("Not found: .../__manifest/tree/feature/_versions") The branch's tree and forks are already reclaimed, so the pinned drift is unreachable and the sidecar is provably moot; the fix classifies it as an orphaned-branch terminal state (audit + discard) in both passes. Surfaced by review (P1, verified by repro). * fix(engine): classify deleted-branch sidecars as orphaned instead of wedging A deferred (rollback-eligible) sidecar pinned to a branch survives branch_delete; both the write-entry heal and the open-time sweep then failed unconditionally opening the dead branch -- every write and every ReadWrite open errored, and repair refuses while a sidecar pends. Terminal state, manual sidecar surgery the only exit. The branch's tree and per-table forks are already reclaimed at delete, so the drift the sidecar pins is unreachable and the sidecar is provably moot. Both passes now check the sidecar's branch against the manifest's branch list (the authority -- deliberately NOT inferred from a Not-found on open, which could be a transient storage error masking real recovery intent) and discard orphans with an OrphanedBranchDiscarded audit row, commit appended on main since the sidecar's own branch no longer has a commit graph. The open-time half is pre-existing; the write-entry heal made it hot. Turns the previous commit's red test green. Surfaced by review (P1, verified by repro). * chore: harden review nits — vacuous CI filter, root-runner skip, liveness note - ci.yml: the RustFS sidecar-lifecycle step now fails loudly if the 's3_' name filter matches zero tests (cargo passes vacuously on an empty filter; the step exists specifically to prove S3 sidecar I/O coverage). The pre-existing CLI smoke step has the same shape and is left for a follow-up. - cluster unreadable-payload test: cfg(unix) + a skip-with-log when running as root (mode 000 is still readable to root, common in container dev runners), so the test degrades instead of failing. - refresh: document the one-pass-late convergence for legacy staging residue while non-SchemaApply sidecars pend, so nobody 'fixes' it by re-running the reconcile unserialized — the exact race the serialization key closes. * test(engine): pin orphan-discard idempotency across a delete fault discard_orphaned_branch_sidecar writes its audit row and main commit before deleting the sidecar; a Phase D delete fault leaves the sidecar on disk with the audit already durable, and the retry repeated the whole path -- a second OrphanedBranchDiscarded audit row (and commit) for the same operation. Currently red: 2 rows after one fault + retry. The retry must only finish the delete. Fix next. Also promotes the recovery-audit kinds reader into the shared test helpers (it was recovery.rs-local). Surfaced by external review of the orphan-discard fix. * fix(engine): orphan-discard idempotency + heal reports acted-vs-deferred Two review findings on the recovery surface: - discard_orphaned_branch_sidecar now checks the audit table for an existing (operation_id, OrphanedBranchDiscarded) row before appending the commit + audit pair, so a Phase D delete fault retries ONLY the delete instead of duplicating audit rows and commit-graph entries. Cold path: the list scan runs only when an orphaned sidecar exists. Turns the previous commit's red test green (exactly one audit row across fault + retry). - process_sidecar returns whether durable state changed; the heal sets processed_any only for sidecars that were actually rolled forward / rolled back / audit-recovered (orphan discards count). Deferred sidecars (rollback-eligible, invariant-violating, unpromoted SchemaApply) no longer trigger a per-write schema reload + full runtime-cache invalidation while they pend -- the cache is snapshot-keyed so this was waste, not corruption, but it was paid on every write until reopen. Acted-paths' processed=true remains pinned by load_after_schema_apply_phase_b_failure_uses_recovered_catalog (the reload depends on it). Surfaced by external review. * test(engine): pin the orphan-discard audit-append fault leg as documented tolerance The orphan discard's commit append and audit append are two writes; a failure between them leaves a recovery commit with no audit row, and the retry (keyed on the audit row, the operator-facing record) appends a second commit before the audit lands. This is the same not-atomic-pair-write tolerance record_audit documents and the manifest->commit-graph Known Gap covers for every publish: bounded commit-graph noise, audit row exactly-once under clean failures. Keying idempotency on commit rows instead would need an operation_id column on _graph_commits, and audit-before-commit would dangle the graph_commit_id join -- both worse than the documented residual. Make the tolerance explicit instead of implicit: docstring names the window, a failpoint sits inside it, and the new test pins convergence across the fault (sidecar consumed, exactly one audit row), completing the orphan-discard fault matrix alongside the delete-fault leg. Surfaced by external review of the orphan-discard idempotency. * test(engine): pin honest drift-guard advice when sidecar listing fails The guard's unwrap_or(false) conflated 'classified as uncovered' with 'could not classify': a transient list fault on the guard's second list (the entry heal's first list having succeeded) confidently routed the operator to omnigraph repair even when the heal had just deferred a rollback-eligible sidecar -- and repair refuses while a sidecar is pending. Currently red: the error says 'run omnigraph repair' with no mention of the reopen path. The fix names both paths plus the failure cause when classification is impossible. Surfaced by external review of the drift-guard fallback. * fix(engine): admit ambiguity in the drift guard when sidecar listing fails Replace the unwrap_or(false) fallback with a tri-state: covered -> reopen advice; uncovered -> repair advice; listing FAILED -> say the drift could not be classified, name the cause, and give both paths in order ('run repair, or reopen read-write if repair reports a pending sidecar'). The old fallback confidently routed a transient list fault to repair, which refuses while a sidecar is pending -- a self- correcting but pointless detour. The conflict itself is still always raised; only the advice degrades honestly. Turns the previous commit's red test green. Surfaced by external review of the drift-guard fallback.	2026-06-13 11:20:08 +02:00
aaltshuler	dedd647cde	release: bump workspace to 0.7.0 All six crate manifests + their path-dependency constraints, Cargo.lock, the regenerated openapi.json version metadata, AGENTS.md's surveyed version, and the v0.7.0 release notes (object-storage clusters, config-free --cluster serving, the operator config surface, keyed credentials, operator targeting/aliases, and the omnigraph.yaml deprecation stages). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-12 14:12:33 +03:00
aaltshuler	fa6af775c1	feat(cli)!: unified load command; deprecate ingest as an alias omnigraph load is now the single data-write command: - works against remote graphs (POSTs the server's /ingest endpoint with the same bearer/actor resolution as other remote commands) — previously load was the only data command forced to open Lance storage directly - --from <base> opts into fork-if-missing for --branch (the former ingest semantics); without --from a missing branch is an error, never a fork - --mode is now required: overwrite is destructive, so there is no implicit default (the old silent default was overwrite) - output gains base_branch/branch_created (and table sums on remote loads) omnigraph ingest stays as a deprecated alias (defaults preserved: --from main --mode merge) that prints a one-line warning to stderr, matching the read/change deprecation convention; removal in a later release. Docs updated in the same change: cli.md, cli-reference.md, policy.md, audit.md, execution.md (unified load section), AGENTS.md quick-flow, README.md. BREAKING CHANGE: scripts running omnigraph load without --mode must now pass it explicitly (previously defaulted to the destructive overwrite). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 04:18:00 +03:00
aaltshuler	97eb65e921	docs(cluster): operator how-to guide for deploying and managing clusters New docs/user/cluster.md — the practical companion to cluster-config.md's reference: zero-to-served walkthrough (validate/import/plan/apply, derived roots, data loading, --cluster serving), the day-2 edit->plan->apply->restart loop with a per-change-kind table, drift observation and convergence, the approval gate for destructive changes, crash/lock/lost-ledger recovery, the boot-refusal table with remedies, deployment patterns (replicas, backup unit, CI gating), and the explicit not-yet list (hot reload, S3-hosted cluster dirs, per-query exposure, pipelines). Linked from the user index, the agent guide's topic map, and cross-linked from the reference. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:10:19 +03:00
devin-ai-integration[bot]	2c578a60b2	(feat) convert engine call sites to &dyn TableStorage; demote legacy TableStore methods to pub(crate) (#86 ) * MR-854: convert engine call sites to &dyn TableStorage; demote legacy methods Phase 1b: every db.table_store.X(...) call site converts to db.storage().X(...), reaching the storage layer through the sealed TableStorage trait (returns &dyn TableStorage). Opaque SnapshotHandle and StagedHandle replace bare lance::Dataset and Transaction in the threaded values. Phase 9: the inherent inline-commit methods on TableStore (append_batch, merge_insert_batch{,es}, overwrite_batch, create_btree_index, create_inverted_index) demote from pub to pub(crate). Their only remaining direct users are table_store.rs itself and the bulk loader's LoadMode::{Append, Overwrite, Merge} concurrent fast-paths in loader::write_batch_to_dataset (no two-phase shape in Lance 4.0.0 — closes after lance#6658 and #6666). Docs: - invariants.md \u00a7VI.23: drop "at the writer-trait surface" qualifier; staged primitives are now the only engine surface. - runs.md: residual matrix shrinks to delete_where and create_vector_index (the two upstream-blocked residuals). - forbidden_apis.rs: replace transitional language with the current allow-list shape (table_store.rs + loader concurrent fast-path only). Files touched: - changes/mod.rs, db/omnigraph.rs (+export/optimize/schema_apply/ table_ops.rs), exec/{merge,mod,mutation,staging}.rs, loader/mod.rs, storage_layer.rs, table_store.rs, tests/forbidden_apis.rs, docs/{invariants,runs}.md. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * MR-854: replace test-only inline-commit append callers with local Lance helpers After demoting TableStore::append_batch from pub to pub(crate), the integration tests in tests/recovery.rs and tests/staged_writes.rs that previously called store.append_batch(...) directly to simulate HEAD-ahead-of-manifest drift can no longer access the inherent method. Replace those calls with small in-test helpers that do a raw Dataset::append (the same body the inherent method runs). - tests/helpers/mod.rs gains lance_append_inline (shared helper). - tests/staged_writes.rs gets a file-local lance_append_inline_local (staged_writes.rs does not import helpers::). - tests/recovery.rs drops the unused TableStore import in the one function whose store binding became unused after the conversion. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * MR-854: retrigger CI for flaky Test Workspace job Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * MR-854: convert remaining table_store call sites in export.rs / read_blob Two leftover `self.table_store.X` / `db.table_store.X` call sites were missed in the initial sweep — flagged by Devin Review on PR #86. Both now go through the trait surface: - `entity_from_snapshot` (db/omnigraph/export.rs): switch from `db.table_store.open_snapshot_table` + `db.table_store.scan` to `db.storage().open_snapshot_at_table` + `db.storage().scan`. - `read_blob` (db/omnigraph.rs): replace `snapshot.open(table_key)` + `self.table_store.first_row_id_for_filter` with `self.storage().open_snapshot_at_table` + `self.storage().first_row_id_for_filter`. The follow-up `take_blobs` call still needs an `Arc<Dataset>` (it's a Lance blob accessor not surfaced through the trait), so we hand off via `SnapshotHandle::into_arc()` with a comment. After this commit, no engine code outside `table_store.rs` reaches the inherent `TableStore` API — the docs/runs.md and docs/invariants.md claim is now uniformly true. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * MR-854: post-rebase doc fixes (Lance 6.0.1, MR-A framing, into_dataset note) Reviewer feedback on the rebased PR: * docs/dev/writes.md residuals matrix: drop demoted methods from the trait-surface table (now `pub(crate)`); keep only the two genuine trait-surface residuals (`delete_where`, `create_vector_index`); reframe under MR-A (Lance v7.x bump) per docs/dev/lance.md. * tests/forbidden_apis.rs: update transitional allow-list header to (a) drop the truncate_table mislabel (truncate_table is a Lance Dataset method, not a TableStore method — overwrite_batch's internal call), (b) reframe trait-surface residuals under MR-A / Lance #6666. * crates/omnigraph/src/storage_layer.rs::SnapshotHandle::{into_arc, into_dataset}: add single-ref invariant doc — both consume Arc via try_unwrap-or-clone; sibling SnapshotHandle clones across an await point force a deep Dataset clone. * Replace lance-4.0.0 version refs with lance-6.0.1 in active source/test/dev-doc comments (storage_layer.rs, table_store.rs, table_ops.rs, schema_apply.rs, merge.rs, recovery.rs, staged_writes.rs, consistency.rs, docs/dev/execution.md, docs/user/query-language.md). Historical refs in docs/releases/v0.4.1.md and the canonical "Lance 4.0.0 → 6.0.1 migration" line in docs/dev/lance.md left intact. No engine code changes. * MR-854: update docs/dev/invariants.md Storage trait row + gap entry Reviewer feedback: the docs reorg landed; the invariant row now lives in docs/dev/invariants.md with stable headings (no more numbered §VI.23). Update two pieces to reflect MR-854 completion: * Status table 'Storage trait' row: was 'full call-site migration ... incomplete'; now 'engine call sites all route through db.storage() (MR-854); inline-commit inherent methods are pub(crate)-demoted; capability/stat surfaces are roadmap'. * 'Known Gaps' 'Storage abstraction' entry: was 'older inherent TableStore call sites and inline residuals remain'; now names the closed scope (MR-854 — call sites migrated, methods demoted, loader fast-paths) and the remaining trait-surface residuals under MR-A (Lance v7.x bump) and Lance #6666. Cross-links to docs/dev/lance.md and docs/dev/writes.md so the framing stays co-located with the canonical Lance surface tracking. * MR-854: remove dead inline-commit methods from the storage surface The loader concurrent fast-path (write_batch_to_dataset) is only reached for LoadMode::Overwrite — Append/Merge route through MutationStaging — so its Append/Merge arms were unreachable. Collapse it to overwrite-only and drop the now-unused mode params, which removes the only callers of: - TableStorage::append_batch + TableStorage::merge_insert_batches (trait) - TableStore::merge_insert_batch + merge_insert_batches (inherent) create_btree_index / create_inverted_index had zero callers anywhere (scalar index builds use the stage_* primitives). Remove both from the trait and the inherent impl. Inherent append_batch stays pub(crate): overwrite_batch and recovery tests use it. Migrate the one trait-append_batch test caller (seed_person_row) to stage_append + commit_staged. The merge_insert FirstSeen-workaround rationale moves from the deleted merge_insert_batch into stage_merge_insert (now the sole merge path). No behavior change. Also corrects the inaccurate loader residual comment (the prior text blamed Lance #6658/#6666, which are the delete and vector-index issues, for keeping overwrite inline; a stage_overwrite primitive already exists and schema_apply uses it). * MR-854: seal db.storage() to staged-only; move residuals to InlineCommitResidual Split the three remaining inline-commit writes (overwrite_batch, delete_where, create_vector_index) off the TableStorage trait onto a new sealed InlineCommitResidual trait, reachable only via the explicit Omnigraph::storage_inline_residual() accessor. db.storage() now exposes only staged primitives + reads, so engine code cannot couple a write with a Lance HEAD advance through the default surface — MR-793 acceptance §1 ("no public method commits as a side effect of writing") now holds by construction, not by review + naming. Call sites moved to storage_inline_residual(): loader overwrite fast-path, the three mutation delete_where paths, the branch-merge delete, and the vector-index build. Impl bodies are unchanged (same delegation to the pub(crate) inherent methods); this is a pure surface reshape with no behavior change. The residual trait holds two genuinely upstream-blocked methods (delete_where -> Lance #6658/v7.x, create_vector_index -> Lance #6666) plus overwrite_batch, kept for the loader's cross-table bulk-overwrite concurrency until its staged migration lands (tracked follow-up). * MR-854 docs: describe the staged-only seal; fix stale Lance index URLs - writes.md / invariants.md / AGENTS.md: the inline-commit residuals now live on InlineCommitResidual behind db.storage_inline_residual(), so acceptance §1 holds by construction rather than 'option (b)' per-method enumeration. Drop the inaccurate 'until Lance exposes Operation::Overwrite { fragments }' claim (that op exists; stage_overwrite already builds it) and reframe overwrite_batch as a removable legacy residual gated on the loader's bulk-overwrite concurrency. - forbidden_apis.rs: rewrite the allow-list doc for the split surface. - lance.md: the index spec pages moved from /format/table/index/ to /format/index/ in Lance 6.x (the old paths 404). Fix all 13 URLs. * MR-854: fix stale lance-4.0.0 comment refs flagged in review Addresses greptile (exec/merge.rs) and aaltshuler's stale-version blocker: update lance-4.0.0 -> 6.0.1 in the comment/doc refs within this PR's footprint (exec/merge.rs, exec/mutation.rs, docs/dev/writes.md). Also corrects exec/merge.rs to cite lance#6666 (not #6658) for build_index_metadata_from_segments — that is the vector-index segment-commit API; #6658 is the two-phase delete. (Pre-existing 4.0.0 refs in untouched files like architecture.md/storage.md are main's incomplete migration cleanup, left out of scope.) * fix(storage): stage loader overwrites * fix(storage): stage empty schema rewrites --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com> Co-authored-by: Ragnor Comerford <hello@ragnor.co>	2026-06-09 23:03:08 +02:00
aaltshuler	b046515e1c	Merge origin/main into cluster-config-docs	2026-06-09 18:11:12 +03:00
Ragnor Comerford	131b78705d	release: v0.6.2	2026-06-09 15:59:59 +02:00
Ragnor Comerford	d0e39e677e	fix(maintenance): route uncovered drift through repair (#156 ) * docs(invariants): note the non-atomic manifest->commit-graph publish gap Every graph publish commits __manifest then appends _graph_commits as two separate writes; a crash between them leaves the manifest ahead of the commit DAG. Live reads + durability are unaffected (reads resolve via the manifest) and recovery does not repair it; impact is bounded to commit history / time-travel by commit id / merge-base completeness. Pre-existing across all publishes, not the optimize reconcile specifically. Documented as a Known Gap; the fix is a commit-graph reconcilable from the manifest, not a recovery sidecar. * fix(maintenance): route uncovered drift through repair * fix(maintenance): harden repair review feedback	2026-06-09 14:42:54 +02:00
aaltshuler	043b02e617	feat(cluster): add read-only validate and plan	2026-06-08 20:07:39 +03:00
Ragnor Comerford	e62d9166fb	fix: optimize publishes compaction; recovery roll-back converges manifest (#141 ) * test(optimize): cover manifest publish + HEAD-drift reconcile Red against the pre-fix optimize, which ran compact_files without publishing the compacted version to __manifest: - maintenance: optimize must publish so the manifest table_version tracks the compacted Lance HEAD and a later schema apply succeeds; and must reconcile a pre-existing manifest-behind-HEAD drift (forged via raw Lance compaction) so strict writes commit again. - end_to_end + composite_flow: post-optimize query / strict update / reopen in the full lifecycle (the canonical flow previously omitted post-optimize writes as a documented "known limitation"). - failpoints: a crash between compaction and the manifest publish rolls forward on next open. * fix(optimize): publish compaction to manifest and reconcile HEAD drift optimize ran Lance compact_files without publishing the new version to __manifest, so the manifest table_version lagged the Lance HEAD: reads stayed pinned to the pre-compaction version, and the next schema apply or strict update/delete failed its HEAD-vs-manifest precondition with "stale view ... refresh and retry" (open-time recovery rollback inflated the gap on retry). optimize now publishes each compacted table's version under the per-(table, main) write queue, guarded by a manifest CAS and a SidecarKind::Optimize recovery sidecar (loose-match; roll-forward is safe because compaction is content-preserving). When a table has nothing left to compact but its Lance HEAD is already ahead of the manifest pin (pre-fix drift, or a recovery restore commit), optimize reconciles the manifest forward to HEAD (metadata-only, no sidecar). Caches and the CSR/CSC graph index are invalidated after a publish. Docs updated (maintenance, storage, branches-commits, writes, testing). * test(recovery): rollback convergence + optimize-defer regressions Red against the current code, landed before the fix: - recovery: after the open-time sweep rolls a sidecar back, the manifest must track Lance HEAD (no residual drift) so a follow-up schema apply succeeds — the original "+1 per retry" loop. Today roll-back restores without publishing, so the manifest lags HEAD and the apply fails its HEAD-vs-manifest precondition. - maintenance: optimize must refuse while a recovery sidecar is pending — operating on an unrecovered graph could publish a partial write the sweep would roll back. Also removes optimize_reconciles_preexisting_manifest_head_drift: the ad-hoc drift reconcile it covered is replaced by recovery-side convergence. * fix(recovery): converge manifest on roll-back; optimize defers on pending recovery Root of PR #141's review findings and the original "+1 per retry" loop: a Lance HEAD ahead of the manifest was ambiguous (benign content-preserving drift vs. a partial write a sidecar will roll back), and optimize's reconcile guessed it benign. Close the class instead of guessing: - Recovery roll-back now PUBLISHES the restored version (via a push_table_update_at_head helper shared with roll-forward), so the manifest tracks the Lance HEAD after recovery — symmetric with roll-forward. This fixes the +1 loop (after one roll-back the retry's HEAD-vs-manifest precondition passes) and removes the only remaining source of orphaned drift. The audit still records the logical rolled-back-to version; the manifest is published at the restore commit (identical content). - optimize drops the ad-hoc drift reconcile and instead REFUSES when a __recovery sidecar is pending, so it only ever operates on a recovered graph (manifest == HEAD); its compaction publish can no longer commit a partial write. With the reconcile gone, the blob-skip-vs-reconcile gap is moot. Updates the rollback recovery-test helper (manifest == HEAD after roll-back), the failpoints assertions, and the user/dev docs. * test(recovery): fix rollback assertion for manifest convergence The roll-back-publishes change makes the manifest version advance after a SchemaApply roll-back (to the old-schema content), so the schema_apply_without_schema_staging_rolls_back_on_next_open assertion must be `version > pre`, not `version == pre`. This update was dropped during the commit churn and surfaced as a CI Test Workspace failure; the old-schema-preserved intent stays covered by count_rows + _schema.pg + the RolledBack convergence invariant.	2026-06-08 02:50:12 +03:00
Ragnor Comerford	d54bccb940	fix(optimize): skip blob-bearing tables to avoid Lance compaction crash (#138 ) Some checks failed CI / Classify Changes (push) Has been cancelled Details CI / Check AGENTS.md Links (push) Has been cancelled Details CI / Container Entrypoint (push) Has been cancelled Details Release Edge / Prepare edge release (push) Has been cancelled Details CI / Test Workspace (push) Has been cancelled Details CI / Test omnigraph-server --features aws (push) Has been cancelled Details CI / Test Windows release binaries (push) Has been cancelled Details CI / RustFS S3 Integration (push) Has been cancelled Details Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled Details Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled Details Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled Details Release Edge / Smoke Windows installer (push) Has been cancelled Details * test(optimize): pin Lance blob-column compaction failure as a surface guard Lance compact_files mis-decodes blob-v2 columns under its forced BlobHandling::AllBinary read ("more fields in the schema than provided column indices"), failing even a pristine uniform-V2_2 multi-fragment blob table; reads use descriptor handling and are unaffected. Guard 10 reproduces this and is self-retiring: it turns red on the Lance bump that fixes the bug, forcing LANCE_SUPPORTS_BLOB_COMPACTION to flip. * fix(optimize): skip blob-bearing tables instead of crashing compaction omnigraph optimize aborted the whole sweep when any node/edge table had a Blob property: Lance compact_files cannot decode blob-v2 columns under AllBinary (the column-index error pinned by the surface guard). Skip blob-bearing tables behind a LANCE_SUPPORTS_BLOB_COMPACTION gate and report them via TableOptimizeStats.skipped / SkipReason (surfaced in the CLI and a tracing::warn) instead of erroring, which also isolates the failure so the other tables still compact. Reads/writes are unaffected; only fragment/space reclamation on blob tables is deferred until the upstream Lance fix. Adds a maintenance.rs regression test (validated red with the column-index symptom before the fix, green after), a concise v0.6.1 release note, and updates docs (maintenance, cli-reference, AGENTS capability matrix, invariants Known Gaps, lance.md audit, constants). * refactor(optimize): make TableOptimizeStats and SkipReason non_exhaustive Both are returned result types, never built by callers, so #[non_exhaustive] makes this the last field/variant addition that can break downstream literal construction and keeps future ones non-breaking (review feedback on the public-field addition). The v0.6.1 Compatibility Notes call out the source-level change. Also drops the now-stale "RED today / GREEN after the fix lands" narration in the optimize_skips_blob_table_and_reports_skip test (historical regression context now that the fix is in this branch), and folds in the expanded v0.6.1 release note. * chore(release): bump workspace to v0.6.1 Coherent version bump to accompany the v0.6.1 release note: all five crate manifests + path-dependency constraints, Cargo.lock, the AGENTS.md surveyed-version line, and openapi.json info.version move 0.6.0 -> 0.6.1. Matches the established release pattern (#118 landed the v0.6.0 note + bump together) and resolves the Codex/Devin review flag that a v0.6.1 note without a bump leaves CARGO_PKG_VERSION reporting 0.6.0 and mixed package versions.	2026-06-02 17:12:00 +02:00
Ragnor Comerford	2d5c4b1202	docs: rename runs.md/runs.rs → writes and repoint all references (#131 ) Some checks failed CI / Classify Changes (push) Has been cancelled Details CI / Check AGENTS.md Links (push) Has been cancelled Details CI / Container Entrypoint (push) Has been cancelled Details Release Edge / Prepare edge release (push) Has been cancelled Details CI / Test Workspace (push) Has been cancelled Details CI / Test omnigraph-server --features aws (push) Has been cancelled Details CI / Test Windows release binaries (push) Has been cancelled Details CI / RustFS S3 Integration (push) Has been cancelled Details Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled Details Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled Details Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled Details Release Edge / Smoke Windows installer (push) Has been cancelled Details The Run state machine was removed in MR-771 (v0.4.0); `docs/dev/runs.md` and `crates/omnigraph/tests/runs.rs` have since documented and tested the direct-publish write path, so the "runs" name was misleading. - git mv docs/dev/runs.md → docs/dev/writes.md (reframe H1 + intro; keep MR-771 history note) - git mv crates/omnigraph/tests/runs.rs → tests/writes.rs (reframe header) - repoint every runs.md / runs.rs reference across docs, AGENTS.md, and source comments - fix four pre-existing broken `docs/runs.md` links (the file never lived at that path) to `docs/dev/writes.md` - fix the stale v0.4.0 anchor to the live section No behavior change: every source edit is a comment. Engine builds and the renamed test passes 25/25; scripts/check-agents-md.sh passes. The run-removal cleanup itself (run_registry.rs guard, __run__ prefix) is deferred to MR-770.	2026-05-30 23:20:56 +02:00
Andrew Altshuler	4e431d28b8	docs: rewrite README opening + add AGENTS.md dev commands (#122 ) * docs(agents): add build/test/lint dev-command section AGENTS.md (CLAUDE.md) covered architecture and invariants but had no developer command surface — only runtime `omnigraph` CLI usage. Add a concise "Build, test, lint" section with the non-obvious gotchas: - crate dir `crates/omnigraph` is package `omnigraph-engine` (the `-p` name) - canonical CI gate is `cargo test --workspace --locked` - how to run one file / one fn - feature-gated suites (`failpoints`, server `aws`) - S3 tests skip without `OMNIGRAPH_S3_TEST_BUCKET` - the two non-test CI checks (check-agents-md, OpenAPI drift) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(readme): rewrite opening, dedupe, fix stale references - New manifesto-style opening (tagline, X-as-code, features, core use cases, coordination-layer line); drop the old prose intro, Use Cases, and Capabilities sections. - Remove Capabilities, which restated the new opening line-for-line. - Harmonize heading case: "## Core Use Cases". - Dedupe the verbatim Slack invite (kept the Community section) and the double-linked cli.md (kept the contextual pointer). - Fix stale references that no longer match the code: - drop "transactional runs" / "and runs" — no run concept remains; writes are atomic per-query, multi-query workflows use branches. - update the CLI crate command list to canonical query/mutate plus commit/lint/optimize/cleanup. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 00:53:42 +01:00
Ragnor Comerford	e0f13b32c5	(feat): multi-graph server mode (#119 ) * mr-668: add GraphId newtype + Cloud-mode forward identity stubs (PR 1/10) PR 1 of the MR-668 multi-graph server work. Pure types, no runtime behavior changes yet. Ships the validated identity vocabulary that the rest of the implementation will consume: - `GraphId(String)` — `^[a-zA-Z0-9-]{1,64}$`, leading underscore rejected (engine reserves every `_` filename), reserved route names rejected (`policies`, `healthz`, `openapi`, `openapi.json`, `graphs`). Validation lives in `try_from` only; serde `Deserialize` re-runs it so JSON payloads cannot bypass. - `TenantId(String)` — same regex shape as GraphId. `None` in Cluster mode; reserved for Cloud mode (RFC 0003) where it carries the OAuth `org_id` claim. - `GraphKey { tenant_id: Option<TenantId>, graph_id }` — the registry HashMap key. `cluster()` constructor for the Cluster-mode default. - `Scope` enum with `Full` variant — Cluster mode default; RFC 0004 will extend with OAuth scopes (`graph:read`/`write`/`admin`/``). - `AuthSource` enum with `Static` variant — Cluster mode default; RFC 0001 step 1 will add `Oidc`. - `ResolvedActor { actor_id, tenant_id, scopes, source }` — replaces the upcoming refactor of `AuthenticatedActor(Arc<str>)` in PR 4a. Per MR-668 design decision 13: ship the Cloud-mode forward type shapes now (no `TokenVerifier` trait yet — that's RFC 0001 step 1) so handler signatures stay stable across the Cluster → Cloud trajectory. `Scope` and `AuthSource` use `#[non_exhaustive]` so future variants don't break caller matches. Tests: 26 new (15 graph_id + 11 identity), all passing. No regression in the existing 36 server library tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: Omnigraph::init error-path cleanup + three failpoints (PR 2a/10) PR 2a of the MR-668 multi-graph server work. Bug fix: a partially-failed `Omnigraph::init` previously left orphan schema files at the graph URI, making the URI unusable for a retry (the next `init` would refuse because `_schema.pg` already exists). Changes: 1. `init_with_storage` now wraps the I/O phase. On any error from `init_storage_phase`, calls `best_effort_cleanup_init_artifacts` to remove the three schema files before returning the original error: - `_schema.pg` - `_schema.ir.json` - `__schema_state.json` Cleanup is best-effort: a failure to delete is logged via `tracing::warn` but does NOT mask the init error. 2. Three failpoints added at the init phase boundaries: - `init.after_schema_pg_written` - `init.after_schema_contract_written` - `init.after_coordinator_init` 3. Four new failpoint tests in `tests/failpoints.rs` pin the cleanup behavior at each boundary plus the "original error wins over cleanup error" contract. All 23 failpoint tests pass. Coverage gap (documented in code comments): Lance per-type datasets and `__manifest/` directory created by `GraphCoordinator::init` are NOT cleaned up after a coordinator-init-phase failure. Recursive directory deletion requires `StorageAdapter::delete_prefix`, which was deferred along with `DELETE /graphs/{id}` (originally PR 2b). When that primitive lands, the third failpoint test can be tightened to assert the graph root is fully empty. Tests: 4 new (init_failpoint_), all 23 failpoint tests green. No regression in the 105 engine library tests or 64 end_to_end tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> mr-668: add GraphHandle + GraphRegistry data structure (PR 3/10) PR 3 of the MR-668 multi-graph server work. Pure data structure — no routing changes yet (that's PR 4a). New file: `crates/omnigraph-server/src/registry.rs` - `GraphHandle { key: GraphKey, uri: String, engine: Arc<Omnigraph>, policy: Option<Arc<PolicyEngine>> }` — the per-graph state that the routing middleware (PR 4a) will inject as a request extension. - `RegistrySnapshot { graphs: HashMap<GraphKey, Arc<GraphHandle>> }` — immutable snapshot; replaced atomically via `ArcSwap`. - `GraphRegistry { snapshot: ArcSwap<_>, mutate: Mutex<()> }` — lock-free reads, mutex-serialized mutations. - `RegistryLookup { Ready(Arc<GraphHandle>) \| Gone }` — two-valued, no `Tombstoned` variant since DELETE is deferred in v0.7.0 scope. - `InsertError { DuplicateKey \| DuplicateUri }` — both rejection cases for create-graph (maps to HTTP 409 in PR 7). - Methods: `new`, `from_handles` (bulk startup-time init), `get`, `list`, `len`, `insert`. Race semantics pinned by three multi-thread tests: - `concurrent_insert_same_key_exactly_one_succeeds` — N=8 spawned inserts with the same key; exactly 1 returns Ok, 7 return DuplicateKey. - `concurrent_insert_distinct_keys_all_succeed` — N=8 spawned inserts with distinct keys; all succeed. - `concurrent_reads_during_inserts_see_consistent_snapshots` — reader loop concurrent with sequential writes; every listed handle's key resolves via `get()` (no torn state). Why no tombstones field: `DELETE /graphs/{id}` is deferred to bound the scope of v0.7.0. Without a delete endpoint, there's no use for tombstones — every key in the registry is `Ready`, and every key not in the registry is `Gone`. When DELETE lands later, the `Tombstoned` variant + `tombstones: HashSet<GraphKey>` slot in additively without breaking caller signatures (the `Gone` variant remains the "not currently active" case). Why `tokio::sync::Mutex`: insert is async because PR 7's flow holds this mutex across the atomic YAML rewrite step (file I/O). std::Mutex would footgun across .await. Dependency additions: `arc-swap = { workspace = true }`, `thiserror = { workspace = true }` (used by InsertError). Tests: 12 new (12 passing). 74 server lib tests total green (62 from PR 1 + 12 new). Clippy clean on server crate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: router restructure + handler refactor for multi-graph (PR 4a/10) PR 4a of the MR-668 multi-graph server work. The heaviest single PR — rewires every handler to extract `Arc<GraphHandle>` from a routing middleware, replaces `AuthenticatedActor(Arc<str>)` with `ResolvedActor` everywhere, and adds the `ServerMode` discriminator. Behavior changes: - Single mode (legacy `omnigraph-server <URI>`): flat routes (`/snapshot`, `/read`, `/branches`, …) continue to work exactly as v0.6.0. Internally, the registry holds a single handle keyed by the sentinel `SINGLE_GRAPH_KEY_ID = "default"`; routing middleware injects that handle on every request. No HTTP-visible change. - Multi mode (new): routes nest under `/graphs/{graph_id}/...`. Routing middleware extracts the graph id from the path, looks it up in the registry, and injects the handle. 404 if not found. (Multi-mode startup itself lands in PR 5; this PR provides the router-side wiring.) AppState refactor: - `engine: Arc<Omnigraph>` and `policy_engine: Option<Arc<PolicyEngine>>` fields removed — both now live inside `GraphHandle` in the registry. - `mode: ServerMode { Single { uri } \| Multi { config_path } }` added. - `registry: Arc<GraphRegistry>` added. - `server_policy: Option<Arc<PolicyEngine>>` added (placeholder for management endpoints in PR 6b; unused today). - Existing constructors (`new`, `new_with_bearer_token{s,_and_policy}`, `new_with_workload`, `open`) build a single-mode AppState internally and remain source-compatible. Tests that constructed AppState via these constructors continue to work. - `with_policy_engine` post-construction setter — rebuilds the single-mode handle with the policy attached. Engine-layer enforcement is NOT reinstalled (matches the old single-field semantics; `open_with_bearer_tokens_and_policy` is the path that installs both layers). - `new_multi` constructor added for PR 5's startup loop. - `uri()` now returns `Option<&str>` (Some in single, None in multi). Routing middleware: - `resolve_graph_handle` injects `Arc<GraphHandle>` as a request extension. Mode-aware: single returns the only handle; multi parses `/graphs/{graph_id}/...` from the URI. Returns 404 in multi mode when the graph id is unregistered. Records `graph_id` on the current tracing span. - `require_bearer_auth` updated to insert `ResolvedActor` (was `AuthenticatedActor`). Handler refactor — every protected handler: - Gains `Extension(handle): Extension<Arc<GraphHandle>>` param. - Replaces `state.engine` → `handle.engine`. - Replaces `state.policy_engine()` → `handle.policy.as_deref()`. - Replaces `state.uri()` → `handle.uri.as_str()` (or `.clone()` where String is needed). - Replaces `Arc::clone(&state.engine)` → `Arc::clone(&handle.engine)` (the spawn-and-clone pattern in `server_export` — proof that a long-running export survives the registry being mutated later). authorize_request signature: - Was: `(state: &AppState, actor: Option<&AuthenticatedActor>, request: PolicyRequest)`. - Now: `(actor: Option<&ResolvedActor>, policy: Option<&PolicyEngine>, request: PolicyRequest)`. - Per-graph callers pass `handle.policy.as_deref()`. The (future PR 6b) management endpoints will pass `state.server_policy.as_deref()`. MR-731 invariant preserved: - The single chokepoint `request.actor_id = actor.actor_id.as_ref().to_string()` inside `authorize_request` still overwrites any client-supplied actor identity. Regression test `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` at `tests/server.rs:1114-1216` passes unchanged. Tests: 0 new (the registry race tests in PR 3 already cover the data structure; this PR exercises them indirectly via the existing test suite). 74 lib + 57 server integration + 60 openapi = 191 tests green. Clippy clean. LOC: +397 insertions, -153 deletions in `crates/omnigraph-server/src/lib.rs`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> mr-668: OpenAPI multi-mode cluster filter (PR 4b/10) PR 4b of the MR-668 multi-graph server work. In multi mode, the served `/openapi.json` reports cluster routes (`/graphs/{graph_id}/...`) instead of the legacy flat protected paths — matching what `build_app` actually mounts (PR 4a's `Router::nest`). Single mode is unchanged. Implementation: - New `server_openapi` branch: when `state.mode()` is `Multi`, call `nest_paths_under_cluster_prefix(&mut doc)` after `ApiDoc::openapi()`. - The rewrite consumes `doc.paths.paths`, then for every path-item: - If the path is in `ALWAYS_FLAT_PATHS` (`/healthz` for now), keep it flat. - Otherwise, prefix every operation_id with `cluster_` and reinsert the item at `/graphs/{graph_id}<original_path>`. - Single mode hits no extra work — the path map is untouched. - The static `ApiDoc::openapi()` still emits the flat surface, so in-process callers (the existing `openapi_json()` helper in tests) see the unmodified spec. Why cluster_ prefix on operation IDs: OpenAPI specs require unique operation_ids across the document. With both flat (single-mode) and cluster (multi-mode) surfaces ever co-existing in a generated SDK, the prefix prevents collision. The current served doc only carries one surface, so the prefix is forward-compat with potential future dual-surface generation. Tests: 6 new in `tests/openapi.rs`, all via the `/openapi.json` route (not the static `ApiDoc::openapi()` helper): - `multi_mode_openapi_lists_cluster_paths` — every protected path appears as a cluster variant. - `multi_mode_openapi_drops_flat_protected_paths` — flat protected paths are absent. - `multi_mode_openapi_keeps_healthz_flat` — `/healthz` survives. - `multi_mode_openapi_prefixes_operation_ids_with_cluster` — every cluster operation_id starts with `cluster_`. - `multi_mode_operation_ids_are_unique` — no operation_id collisions. - `single_mode_openapi_unchanged_by_cluster_filter` — single mode still emits the legacy flat surface (regression). New test helper `app_for_multi_mode(graph_ids)` exercises the new `AppState::new_multi` constructor from PR 4a — first user of multi-mode construction outside of unit tests. Result: 66 openapi tests + 57 server integration tests + 74 lib tests = 197 green. No regression in the existing OpenAPI drift check (`openapi_spec_is_up_to_date` still validates the static flat surface matches the committed openapi.json). LOC: +67 in lib.rs (rewrite logic), +219 in tests/openapi.rs (test suite + helper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: multi-graph startup + mode inference (PR 5/10) PR 5 of the MR-668 multi-graph server work. This is the first PR that makes multi mode actually usable end-to-end: operators invoking `omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:` map and no single-mode selector now get a running multi-graph server. Mode inference (MR-668 decision 2, four-rule matrix in `load_server_settings`): 1. CLI `<URI>` positional → Single 2. CLI `--target <name>` → Single (URI from graphs.<name>) 3. `server.graph` in config → Single (URI from graphs.<name>) 4. `--config` + non-empty `graphs:` + no single-mode selector → Multi (all entries in `graphs:`) 5. otherwise → error with migration hint Rule 5's error message names every escape hatch so operators can fix their invocation without grepping docs. Config schema extensions: - `TargetConfig.policy: PolicySettings` (per-graph Cedar policy file). `#[serde(default)]` so existing single-graph YAMLs keep parsing. - `ServerDefaults.policy: PolicySettings` (server-level Cedar policy for management endpoints — loaded in PR 5, wired into `GET /graphs` in PR 6b). - `OmnigraphConfig::resolve_target_policy_file(name)` and `resolve_server_policy_file()` helpers — both resolve relative to the config file's `base_dir`. Public types added to `omnigraph-server`: - `ServerConfigMode { Single { uri, policy_file } \| Multi { graphs, config_path, server_policy_file } }`. - `GraphStartupConfig { graph_id, uri, policy_file }` — one entry per graph in multi mode. `ServerConfig` shape change: - WAS: `{ uri: String, bind, policy_file, allow_unauthenticated }`. - NOW: `{ mode: ServerConfigMode, bind, allow_unauthenticated }`. - Breaking for any code that constructs `ServerConfig` directly. `main.rs` is unaffected (uses `load_server_settings`). `serve()` now forks on `ServerConfig.mode`: - Single: existing flow via `AppState::open_with_bearer_tokens_and_policy`. - Multi: parallel open via `futures::stream::iter(graphs) .map(open_single_graph).buffer_unordered(4).collect()`. Bound 4 is a rule-of-thumb for I/O-bound work — at N≤10 this trades startup latency for a small amount of concurrent S3/Lance open pressure. Fail-fast: first open error aborts startup; in-flight opens drop their engine via Arc (Lance datasets close cleanly). New helper `open_single_graph(GraphStartupConfig)`: - Validates `GraphId` per the regex in PR 1. - `Omnigraph::open(uri).await` with descriptive error context. - Loads per-graph policy file and re-applies it via `Omnigraph::with_policy` (engine-layer enforcement, MR-722). - Returns `Arc<GraphHandle>` ready for the registry. Routing middleware bug fix: - `Router::nest("/graphs/{graph_id}", inner)` rewrites `request.uri().path()` to the inner suffix (e.g. `/snapshot`). The previous middleware tried to parse `{graph_id}` from `request.uri().path()` and got 400 instead of 200. Fixed by reading from `axum::extract::OriginalUri` request extension, which preserves the pre-rewrite URI. - Caught by the two new tests `cluster_routes_dispatch_per_graph_handle` and `cluster_route_for_unknown_graph_returns_404`. Tests (14 new, all passing): - Four-rule matrix: one test per branch + the joint case `mode_inference_cli_uri_overrides_graphs_map` + the empty-graphs-map error case. - Per-graph + server-level policy file path resolution. - Reserved `GraphId` rejection at startup. - End-to-end multi-graph routing: two graphs side by side, each cluster route hits the right engine. - Unknown graph id under cluster prefix → 404. - Flat routes 404 in multi mode. Inline `ServerConfig` test (`serve_refuses_to_start_in_state_1_without_unauthenticated`) and three `server_settings_` tests updated to the new `mode` shape. Result: 211 server tests green (74 lib + 71 integration + 66 openapi), MR-731 regression test still pinned and passing. LOC: +45 config.rs, +281 lib.rs (net), +395 tests/server.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> mr-668: Cedar resource-model refactor (PR 6a/10) PR 6a of the MR-668 multi-graph server work. Policy-crate-only refactor — no HTTP handler changes, no operator-supplied policy.yaml changes. Sets up the chassis that PR 6b's `GET /graphs` consumes. Two new `PolicyAction` variants: - `GraphCreate` — gates `POST /graphs` (deferred behavioral PR). - `GraphList` — gates `GET /graphs` (lands in PR 6b). Note: `GraphDelete` is intentionally NOT added in this PR. `DELETE /graphs/{id}` is deferred from MR-668's v0.7.0 scope to bound complexity (no `delete_prefix`, no tombstone, no `RegistryLookup::Tombstoned`). Adding the Cedar action without a consumer would be the same kind of "dead vocabulary" trap the `Admin` variant already documents. New `PolicyResourceKind { Graph, Server }` enum, plus a `PolicyAction::resource_kind()` method that classifies every action. Per-graph actions (Read, Change, BranchCreate, …) bind to `Omnigraph::Graph::"<graph_label>"`; server-scoped actions (GraphCreate, GraphList) bind to the singleton `Omnigraph::Server::"root"`. `Admin` stays classified as per-graph for now — MR-724 will pick the final shape when the first consumer surface ships. Cedar schema string additions: - `entity Server;` - `action "graph_create" appliesTo { principal: Actor, resource: Server, ... }` - `action "graph_list" appliesTo { principal: Actor, resource: Server, ... }` Compiler updates: - `compile_policy_source` picks the resource literal based on the action's `resource_kind`. Existing graph-only policies generate the same Cedar source as before — pinned by `per_graph_rules_continue_to_work_alongside_server_rules`. - `compile_entities` includes the `Server::"root"` entity only when a rule references a server-scoped action. Keeps test assertions for graph-only policies tight. - `PolicyEngine::authorize` builds the right resource UID at request time based on `request.action.resource_kind()`. Validation rules added to `PolicyConfig::validate`: - A rule may not mix server-scoped and per-graph actions (different resource kinds need different `permit` clauses). - Server-scoped actions cannot have `branch_scope` or `target_branch_scope` — there's no branch context at the server level. Operator impact: zero. The Cedar schema `Omnigraph::Server` entity is internally referenced by `compile_policy_source`; operator policy.yaml files only declare actions in `rules[].allow.actions` and never reference the resource entity directly. Decision 6's "internal rename only; operator policies unaffected" contract is preserved and pinned by `per_graph_rules_continue_to_work_alongside_server_rules`. Tests: 5 new (11 policy tests total, up from 6): - `graph_list_action_authorizes_against_server_resource` - `graph_create_action_authorizes_against_server_resource` - `server_scoped_rule_cannot_use_branch_scope` - `rule_mixing_server_and_per_graph_actions_is_rejected` - `per_graph_rules_continue_to_work_alongside_server_rules` No regression: 145 server tests (74 lib + 71 integration) still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: GET /graphs endpoint + per-graph policy wire-up (PR 6b/10) PR 6b of the MR-668 multi-graph server work. First management endpoint — `GET /graphs` lists every graph registered with the server, gated by the server-level Cedar policy from PR 6a. New API shapes (in `omnigraph-server::api`): - `GraphInfo { graph_id, uri }` — one entry per registered graph. - `GraphListResponse { graphs: Vec<GraphInfo> }` — sorted alphabetically by `graph_id` for deterministic output. Handler `server_graphs_list`: - Mounted at `GET /graphs` in both modes. - Single mode: returns 405 (resource exists in the API surface, just not operational without a `graphs:` map). 405 chosen over 404 so clients see "resource exists, wrong context" rather than "no such resource". - Multi mode: requires bearer auth (when configured); Cedar-gated by `PolicyAction::GraphList` against `Omnigraph::Server::"root"` (PR 6a's chassis). Returns the sorted registry list. Cedar gate composition: - When no `server.policy.file` is configured, the MR-723 default-deny falls through: `GraphList` is not `Read`, so an authenticated actor without a server policy gets 403. This is the right default — don't expose the registry until the operator explicitly authorizes it. - When a server policy is configured, Cedar evaluates the rule. The test `get_graphs_with_server_policy_authorizes_per_cedar` pins the admin-allow / viewer-deny split. Routing: - New `management` sub-router holding `/graphs` (auth-required, no `resolve_graph_handle` middleware — operates on the registry, not a single graph). - Single mode merges flat protected routes + management. - Multi mode merges nested `/graphs/{graph_id}/...` + management. OpenAPI: - `server_graphs_list` registered in `ApiDoc::paths(...)`. - `EXPECTED_PATHS` in `tests/openapi.rs` gains `/graphs`. - `openapi.json` regenerated (auto-tracked by `openapi_spec_is_up_to_date` in CI). Tests: 4 new in `tests/server.rs::multi_graph_startup`: - `get_graphs_lists_registered_graphs_in_multi_mode` - `get_graphs_returns_405_in_single_mode` - `get_graphs_requires_bearer_auth_when_configured` - `get_graphs_with_server_policy_authorizes_per_cedar` What's NOT in this PR (deferred): - Per-graph policy enforcement is wired through `handle.policy` (PR 4a already did this); PR 6b doesn't add new per-graph behavior beyond making sure the server policy lookup composes cleanly alongside it. - `POST /graphs` (PR 7) and `DELETE /graphs/{id}` (out of scope for v0.7.0). - CLI `omnigraph graphs list` (PR 8 will add). Result: 215 server tests green (74 lib + 66 openapi + 75 integration), 11 policy tests green. MR-731 spoof regression preserved across all this work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: POST /graphs runtime create endpoint (PR 7/10) PR 7 of the MR-668 multi-graph server work. Operators can now add a graph to a running multi-graph server without restarting: curl -X POST http://server/graphs \ -H "Content-Type: application/json" \ -d '{ "graph_id": "beta", "uri": "/data/beta.omni", "schema": { "source": "node Person { name: String @key }\n" }, "policy": { "file": "./policies/beta.yaml" } }' DELETE remains deferred (out of v0.7.0 scope per the trimmed plan — no `delete_prefix`, no tombstones). Body shape (decision 7): - Nested `schema: { source: "..." }` (mirrors the `policy: { file }` pattern; leaves room for future fields without breakage). - Optional nested `policy: { file: "..." }` for per-graph Cedar. - 32 MiB body limit (reuses `INGEST_REQUEST_BODY_LIMIT_BYTES`). - Asymmetric with `SchemaApplyRequest` which keeps flat `schema_source: String` — documented in api.rs. Atomic YAML rewrite + drift detection: - New `config::rewrite_atomic(path, new_config, expected_hash)`: flock → re-read + hash check → serialize → write `.tmp` → fsync → rename → fsync parent dir. Returns the new hash for the caller to update its in-memory baseline. - New `config::hash_config_file(path)` — SHA-256 of the on-disk bytes, used at startup and after each rewrite. - New `RewriteAtomicError { Drift \| Io \| Serialize }` enum. - `AppState.config_hash: Option<Arc<Mutex<[u8;32]>>>` carries the in-memory baseline. Updated after every successful rewrite so subsequent POSTs don't false-trigger drift. - The mutex is `std::sync::Mutex` (brief critical section, no .await inside). The flock itself serializes file access process-wide AND across multiple server instances (defense in depth). - All sync I/O runs inside `tokio::task::spawn_blocking` — flock is sync. Handler ordering (the load-bearing sequence): 1. Mode check: 405 in single mode. 2. Cedar authorize: `GraphCreate` against `Omnigraph::Server::"root"`. 3. Validate body: `GraphId::try_from` (regex + reserved-name), empty schema/uri checks, per-graph policy file parse. 4. Pre-check registry for duplicate graph_id / duplicate uri (409). 5. `Omnigraph::init` the new engine. 6. Atomic YAML rewrite (drift detection inside). 7. Publish in registry (atomic re-check via `GraphRegistry::insert`). Failure modes (documented in handler rustdoc): - Init fails → orphan storage at `req.uri` (PR 2a cleans up schema files; Lance datasets remain orphans until `delete_prefix` lands). - YAML rewrite fails (drift, IO) → orphan storage; YAML unchanged. - Registry insert fails (race) → YAML has entry but registry doesn't; next restart opens it cleanly. New dependency: `fs2 = "0.4"` (workspace + omnigraph-server). POSIX-only file locking. Linux/macOS deployment supported; Windows out of scope. Tests (10 new in `tests/server.rs::multi_graph_startup`): - `post_graphs_creates_a_new_graph_end_to_end` — happy path, includes YAML inspection to confirm the rewrite landed. - `post_graphs_baseline_hash_updates_between_rewrites` — two POSTs in a row both succeed (drift baseline updates correctly). - `post_graphs_duplicate_graph_id_returns_409` - `post_graphs_duplicate_uri_returns_409` - `post_graphs_invalid_graph_id_returns_400` (reserved name) - `post_graphs_empty_schema_source_returns_400` - `post_graphs_returns_405_in_single_mode` - `post_graphs_yaml_drift_detection_returns_503` — operator hand-edits omnigraph.yaml; server refuses to clobber. - `hash_config_file_is_deterministic_and_detects_changes` - `rewrite_atomic_refuses_when_hash_drifts` OpenAPI: `server_graphs_create` registered in `ApiDoc::paths(...)`; openapi.json regenerated. Result: 225 server tests green (74 lib + 66 openapi + 85 integration), all MR-731 regressions still pinned. LOC: ~580 lib.rs net (handler + helpers), ~120 config.rs (rewrite machinery), +71 api.rs (request/response shapes), +332 tests/server.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: CLI omnigraph graphs list/create (PR 8/10) PR 8 of the MR-668 multi-graph server work. CLI parity for the v0.7.0 management surface: operators can now manage graphs from the command line against a running multi-graph server. omnigraph graphs list --target dev --json omnigraph graphs create \ --target dev \ --graph-id beta \ --graph-uri /data/beta.omni \ --schema schema.pg DELETE is intentionally absent — server-side DELETE was deferred from v0.7.0 scope, and shipping a client subcommand for a server endpoint that doesn't exist would be dead vocabulary. The help output, the subcommand enum, and the test that pins it (`graphs_subcommand_help_ lists_list_and_create`) all agree. CLI architecture (modeled on `BranchCommand`): - New `Command::Graphs { command: GraphsCommand }` top-level variant. - `GraphsCommand { List, Create }` enum. - List: GET `<base>/graphs`. Stdout is `<graph_id>\t<uri>` per line, or JSON via `--json`. - Create: reads `--schema <path>` from local disk, inlines as `schema: { source: <file> }` in the POST body (nested per MR-668 decision 7). Optional `--policy-file <path>` becomes `policy: { file: <path> }`. Returns 201 → "created graph X at Y" or JSON via `--json`. - Both subcommands reject local URI targets with a clear "remote multi-graph server URL" error. New API type imports in the CLI: `GraphCreateRequest`, `GraphCreateResponse`, `GraphListResponse`, `GraphSchemaSpec`, `GraphPolicySpec` — all from `omnigraph-server::api`. Tests: - cli.rs (4 new, non-network): * `graphs_subcommand_help_lists_list_and_create` — pins the deferral of `delete` (catches scope creep). * `graphs_list_against_local_uri_errors_with_remote_only_message` * `graphs_create_against_local_uri_errors_with_remote_only_message` * `graphs_create_with_missing_schema_file_errors` — pins the IO context in the schema-read error path. - system_remote.rs (1 new, `#[ignore]` like its peers): * `graphs_list_and_create_against_multi_graph_server` — spawns a multi-mode server, calls `graphs list` (sees `alpha`), `graphs create` (adds `beta`), `graphs list` again (sees both), and confirms the new graph is reachable via its cluster route. CLI suite: 62 tests green (58 existing + 4 new). The new ignored end-to-end test runs locally with `cargo test --ignored`. LOC: +159 main.rs (enum + handlers), +88 cli.rs (unit tests), +131 system_remote.rs (integration test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10) PR 9 — the final integration PR for MR-668 multi-graph server work. Closes the v0.7.0 release. Composite lifecycle tests (closes gaps flagged in PR 7's coverage review): - `multi_graph_lifecycle_post_query_restart_persistence` — POST a graph, query it via cluster route, reload the config from disk and confirm `load_server_settings` sees the rewritten YAML. Validates the "restart resolves orphans" failure-mode story. - `per_graph_policy_enforced_on_post_created_graph` — POST a graph with a per-graph policy attached, then send authenticated read and change requests. Per-graph Cedar enforcement fires correctly on a POST-created graph (engine-layer policy reinstalled via `Omnigraph::with_policy` inside the create flow). - `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent POSTs with distinct graph_ids all return 201. Caught a real race in `rewrite_atomic` (see below). Race fix — `rewrite_atomic_with_modify`: The first composite test surfaced a real bug. The old `rewrite_atomic(path, new_config, expected_hash)` captured the baseline hash OUTSIDE the flock, then called rewrite_atomic which re-acquired it inside. Under concurrent writers: - POST A: captures baseline H0, calls rewrite_atomic. - POST B: captures baseline H0 too (before A's update lands). - A: acquires flock, on-disk == H0, writes H1, releases. - A: updates baseline H0 → H1. - B: tries to acquire flock — waits. - B: acquires flock. On-disk is now H1. Expected (captured before A finished) is H0. MISMATCH → spurious Drift error. Worse: even if the timing happens to align, B's `updated` config was constructed from BYTES read before the flock. B writes a config that doesn't include A's new graph — silent data loss. The fix: new `config::rewrite_atomic_with_modify(path, baseline, modify)` takes a closure. Inside the flock + baseline mutex: 1. Read on-disk bytes, hash, compare to baseline. 2. Parse on-disk YAML. 3. Call `modify(parsed)` to produce the new config — receives fresh on-disk state, returns the modification. 4. Serialize + write + fsync + rename + update baseline. Everything is read-modify-write under the same critical section. Concurrent writers serialize cleanly. Test confirmed this is no longer a race. The old `rewrite_atomic(path, new_config, expected_hash)` API stays for tests that don't need the read-modify-write shape; the POST handler switches to the new shape. Version bump v0.6.0 → v0.7.0: - All 5 `crates//Cargo.toml` (compiler, engine, policy, cli, server) plus their inter-crate `path` dep version constraints. - `Cargo.lock` regenerated by `cargo build --workspace`. - `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server row updated to mention multi-graph + cluster routes + atomic YAML rewrite. - `openapi.json` regenerated. Docs: - `docs/releases/v0.7.0.md` (new) — release notes with breaking changes, new features, deferred items (DELETE, `delete_prefix`, actor forwarding), and the single→multi migration recipe. - `docs/user/server.md` — substantial section additions for the two modes, mode inference, cluster endpoint table, management endpoints, `omnigraph.yaml` ownership contract, `POST /graphs` body shape + status codes. - `docs/user/cli.md` — `omnigraph graphs list/create` section, deferred-DELETE note. - `docs/user/policy.md` — server-scoped Cedar actions (`graph_create`, `graph_list`), per-graph vs server-level policy composition, example server-level policy. Workspace test pass: 573 tests green across all crates. Zero failures. MR-731 spoof regression still pinned and passing across the entire 10-PR series. This commit closes MR-668. v0.7.0 is ready for tagging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt) The POST /graphs runtime-create endpoint shipped in PR 7/10 has three unresolved high-severity bugs: - flock-on-renamed-inode race: the YAML flock is taken on omnigraph.yaml itself, then a temp file is renamed over it. Cross-process writers end up locking different inodes — both believing they hold exclusive access. - duplicate-check outside the file lock: precheck runs against the in-memory registry only; the locked closure does config.graphs.insert(...) unconditionally. Concurrent same-id POSTs can persist the loser in YAML while the in-memory registry keeps the winner — they disagree after restart. - best_effort_cleanup_init_artifacts deletes _schema.pg / _schema.ir.json / __schema_state.json on any init failure. An accidental re-init against an existing graph's URI destroys its schema; subsequent open() fails at read_text(_schema.pg). The correct fix is a Lance-style cluster catalog (reserve → init → publish with recovery sidecars), parallel to the engine's existing __manifest discipline. That work is out of scope for v0.7.0. For now, disable runtime add/remove from the network and CLI surface. Operators add graphs by editing omnigraph.yaml and restarting. The GET /graphs read-only enumeration stays. Removed: - POST /graphs handler + router fragment + utoipa registration - 13 post_graphs_* server tests + 3 composite POST tests + multi_mode_app_with_real_config / post_graph helpers - CLI omnigraph graphs create subcommand + its handler + cli.rs tests - system_remote.rs combined list+create test trimmed to list-only - YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError, staging_path, hash_config_file, AppState::config_hash field + threading through new_multi and open_multi_graph_state - fs2 dependency (verified absent from cargo tree) - sha2/fs2 imports in config.rs (only the rewrite path used them) - Cedar PolicyAction::GraphCreate variant + "graph_create" match arms + action def in Cedar schema + graph_create_action_authorizes_against_server_resource test - GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec / GraphPolicySpec API types (only the POST handler / CLI imported them) Kept: - GET /graphs (read-only enumeration) and graph_list Cedar action - omnigraph graphs list CLI subcommand - All multi-graph startup, mode inference, cluster routes, per-graph + server-level Cedar policies - server_settings_drive_multi_graph_startup_end_to_end (the test that covers operator-authored YAML + restart — the path that survives) - best_effort_cleanup_init_artifacts and the three init failpoints (still reachable from CLI `omnigraph init`; preflight fix deferred as a follow-up) - GraphRegistry::insert and its concurrency tests — production callers gone, but the method is the natural seam for the future cluster-catalog work Also fixed (transcript issue 4): - ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI advertises the management route correctly (was previously rewritten to /graphs/{graph_id}/graphs) - multi_mode_openapi_keeps_healthz_flat → renamed to multi_mode_openapi_keeps_management_paths_flat, asserts both /healthz and /graphs stay flat - multi_mode_openapi_prefixes_operation_ids_with_cluster skips /graphs in addition to /healthz Doc fixes: - docs/user/cli.md: graphs list example was --target http://..., but --target is a config-graph-name lookup; corrected to --uri. Removed the graphs create example. - docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml ownership", and "POST /graphs body shape" sections. Added a paragraph stating runtime add/remove is not exposed in v0.7.0. - docs/user/policy.md: dropped graph_create action; reworded the "Configuration" line to clarify that server-scoped rules (graph_list) take neither branch_scope nor target_branch_scope. - docs/releases/v0.7.0.md: rewrote release narrative — multi-graph mode ships; runtime add/remove deferred. - AGENTS.md: HTTP server bullet and capability matrix row updated to reflect read-only GET /graphs and the operator-edit workflow. - openapi.json regenerated; /graphs has only .get, no .post. Diff: 17 files, +123 −1525 LOC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: comment cleanup and policy format style Strip "PR Na/Nb" sub-PR references throughout MR-668 surfaces — they were useful during the 10-PR delivery sequence but rot now that the work is in the tree. Keep the MR-668 umbrella references. Also: - Add explicit `when = when` and `resource_literal = resource_literal` named args in `compile_policy_source`'s outer `format!` to match the surrounding crate style (already explicit for `group` and `action`). - Rename the best-effort cleanup tracing target from "omnigraph::init" to "omnigraph::init::cleanup" so operators can filter init-failure cleanup events separately from init's other log lines. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop actor_id from PolicyRequest; pass actor as separate arg The MR-731 "server-authoritative actor identity" invariant was enforced by an in-function chokepoint (`request.actor_id = actor.actor_id...` overwrite inside `authorize_request`). That worked but relied on every caller passing in a `PolicyRequest` and trusting the overwrite — a comment-enforced invariant. Move the invariant into the type system: * `PolicyRequest` no longer carries `actor_id`. The struct now models what a caller wants to do, not who they are. * `PolicyEngine::authorize(actor_id: &str, request: &PolicyRequest)` and `validate_request(actor_id, request)` take identity as a separate argument. The same shape `PolicyChecker::check` already had for the engine layer. * `authorize_request` in the HTTP layer extracts `actor_id` from the bearer-resolved `ResolvedActor` and passes it positionally — no overwrite step that could be skipped. * CLI `omnigraph policy explain` updated (the only other consumer that built a `PolicyRequest`). Public API break for the `omnigraph-policy` crate. Worth it: handlers can no longer accidentally populate `actor_id` from a request body field, and external consumers are forced by the compiler to source actor identity from a trusted path. The MR-731 chokepoint test `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` still passes — the bearer-resolved actor is what reaches the engine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: consolidate AppState single-mode constructors; delete with_policy_engine The prior `with_policy_engine` constructor reused the engine `Arc` from the existing handle (`engine: Arc::clone(&existing.engine)`) without re-applying `Omnigraph::with_policy`. Combined with `new_with_workload`, the documented composition pattern was `AppState::new_with_workload(...).with_policy_engine(p)` — which produced an `AppState` whose HTTP layer enforced Cedar but whose underlying engine had no `PolicyChecker` installed. Any caller reaching the engine via `state.registry().list()[i].engine` could bypass policy entirely. The doc comment named this gap; the type system didn't. Make composition impossible to get wrong: * Add `AppState::new_single(uri, db, tokens, Option<PolicyEngine>, WorkloadController)` — canonical single-mode constructor that takes every option together and routes through `build_single_mode` (which applies `db.with_policy(checker)` to the engine itself). * `new`, `new_with_bearer_token`, `new_with_bearer_tokens`, `new_with_bearer_tokens_and_policy`, `new_with_workload` all become thin wrappers around `new_single`. * Delete `with_policy_engine`. There is no post-construction policy install path any more; the single linear construction forces HTTP-layer and engine-layer policy to install together or not at all. Regression test `engine_layer_policy_fires_via_direct_arc_omnigraph_from_new_single` constructs an `AppState::new_single` with a deny-all policy, pulls the `Arc<Omnigraph>` from the registry handle (the same path an embedded SDK consumer would take), and asserts a direct `mutate_as` call returns `OmniError::Policy`. Pre-fix this test would have succeeded the mutation. Test caller in `ingest_per_actor_admission_cap_returns_429` migrates from `.with_policy_engine(...)` to `new_single(..., Some(policy_engine), workload)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: derive any_per_graph_policy on RegistrySnapshot; simplify dup check `AppState::requires_bearer_auth` walked the entire registry per request (cloning Arcs into a `Vec`, then `.iter().any(\|h\| h.policy .is_some())`) to decide whether the auth middleware should challenge. The walk is unnecessary — the answer only changes when the registry mutates, which is exactly the moment a new snapshot is constructed. Move the flag onto the snapshot itself: * `RegistrySnapshot { graphs, any_per_graph_policy: bool }`. * `RegistrySnapshot::new(graphs)` is the only construction path — it derives the flag from `graphs.values().any(\|h\| h.policy .is_some())` so the cached value can't drift from the source data. * `Default` delegates to `new(HashMap::new())`. * `GraphRegistry::from_handles` and `insert` build snapshots via `RegistrySnapshot::new(...)`. * `GraphRegistry::snapshot_ref()` exposes the current snapshot through an `arc_swap::Guard`; callers that need cached derived state go through this accessor (callers that only want `graphs` still use `list` / `get`). `requires_bearer_auth` becomes one `ArcSwap::load` + bool read. Also (drive-by, same file, same hunk): replace the dead `if let Some(other) = seen_uris.get(...)` + `let _ = other;` pattern in `from_handles` with a plain `seen_uris.contains_key(...)`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: fail-fast multi-graph startup with try_collect The `open_multi_graph_state` doc comment claims "Fail-fast — the first open error aborts startup; other in-flight opens are dropped" but the code did .buffer_unordered(4) .collect::<Vec<_>>() .await .into_iter() .collect::<Result<Vec<_>>>()?; which drains every future in the stream before propagating the first `Err`. With N S3-backed graphs and graph #2 failing fast, the caller still waits for #1, #3, #4, … to either succeed or fail before seeing the error. Replace the four-line dance with `futures::TryStreamExt::try_collect`, which short-circuits on the first `Err` and drops the rest. The doc comment now matches behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop unused State extractor from 7 read-only handlers After the routing-middleware refactor moved the engine into the per-graph `GraphHandle` (extracted via `Extension<Arc<GraphHandle>>`), seven read-only handlers — `server_snapshot`, `server_read`, `server_export`, `server_schema_get`, `server_branch_list`, `server_commit_list`, `server_commit_show` — kept an unused `State(_state): State<AppState>` extractor. Drop it. Each request avoids one `FromRequestParts` clone of `AppState`'s Arcs. Handlers that actually use state (workload admission for write paths, `server_policy` for management endpoints) keep theirs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: emit info! for graph routing decision `tracing::Span::current().record("graph_id", ...)` in the routing middleware silently no-ops here: no upstream `#[tracing::instrument]` on the handlers declares a `graph_id` field, and `TraceLayer::new_for_http` doesn't either. The recorded value never lands anywhere visible. Replace with an explicit `info!(graph_id = %handle.key.graph_id, "graph routed")` event so operators can grep logs and correlate requests with the active graph. In single mode the value is the sentinel `"default"`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: align GET /graphs 405 body code with HTTP status The single-mode `GET /graphs` handler returned an `ApiError` built via struct literal with `status: METHOD_NOT_ALLOWED, code: BadRequest`. The body code disagreed with the HTTP status — clients deserializing on `code` saw `bad_request`, clients deserializing on `status` saw 405. Same bug class as the earlier 503+Conflict mismatch on the removed YAML drift path. Close the class for this one remaining instance: * Add `ErrorCode::MethodNotAllowed` to the API enum. * Add `ApiError::method_not_allowed(msg)` — pairs the 405 status with the matching code. * Replace the struct literal in `server_graphs_list` with the constructor. * Regenerate `openapi.json` (adds `method_not_allowed` to the ErrorCode schema enum). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop unused axum::handler::Handler import The import landed in earlier work but no current call site uses it. Emitted an `unused_imports` warning on every server build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop unused fs2 workspace dependency `fs2 = "0.4"` lingered in [workspace.dependencies] after the POST /graphs flock-on-rename design was pulled. `cargo tree -i fs2` reports no consumers in the workspace and the dep is not in Cargo.lock. Removing the declaration closes the "phantom dep" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: AGENTS.md Cedar row no longer hardcodes action count The "8 actions" claim drifted as soon as MR-668 added `graph_list`. Bumping the count would just push the drift one PR forward; the correct-by-design fix is to defer to the canonical list in docs/user/policy.md and stop maintaining a duplicate count. Closes the "doc hardcodes a count that drifts from the enum" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: cfg(test)-gate GraphRegistry::insert and its mutex `insert` and the `mutate: Mutex<()>` that serializes it had no runtime consumer in v0.7.0 — the only insertion path at startup is `from_handles`, and runtime add/remove is deferred until a managed cluster catalog ships. Leaving both `pub` and live made them a "looks like API, isn't" footgun: a future change could build on `insert` without re-establishing the concurrency contract with an actual consumer in scope. Gate both together (`#[cfg(test)]` on the method, the field, and the `tokio::sync::Mutex` import) so the race-pinning tests still compile but production cannot reach them. When a real consumer ships, ungate both — they're a unit. Closes the "public API with no runtime consumer drifts toward incorrect" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop vestigial PolicyEngine surface * `validate_request` had zero callsites — pure surface for nothing. * `deny`'s `_actor_id` and `_request` parameters were both unused (the underscore prefix gave it away); the message is built by the caller before `deny` ever sees the request. Trim both. Closes the "public API that the type system can't justify" class for the policy engine. No behavior change; every existing test stays green because the deletions never had a runtime effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: regression test for init re-init footgun (red) A second `Omnigraph::init` against an existing graph URI today destroys the existing graph's schema artifacts. `init_storage_phase` overwrites `_schema.pg` before any preflight, and on the inner `GraphCoordinator::init` failure that follows, `best_effort_cleanup_init_artifacts` deletes all three schema files. The existing Lance datasets and `__manifest/` survive but the schema metadata is gone — unrecoverable without operator surgery. This test exercises that path and currently fails with "_schema.pg must not be deleted by a failed re-init", confirming the destructive cleanup branch fires. The fix in the next commit makes the test pass by preflighting with `storage.exists()` and returning a typed error before any write touches disk. Per AGENTS.md rule 12, the test commit lands just before the fix commit so the red → green pair is visible in `git log` and a reviewer can check out this commit alone to reproduce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: close init re-init footgun via InitOptions preflight (green) `Omnigraph::init` is "create a new graph"; existing graphs need an explicit overwrite. Today's behavior — silently overwrite schema files, then on inner failure delete them via best-effort cleanup — is destructive against an existing graph regardless of which branch fires. Correct-by-design fix: * New `InitOptions { force: bool }` struct (default `force: false`). * New `Omnigraph::init_with_options(uri, schema, options)`. The old `Omnigraph::init(uri, schema)` is a thin shortcut that passes `InitOptions::default()`. * `init_with_storage` runs a `storage.exists()` preflight on the three schema URIs BEFORE any parse, write, or coordinator call. Any hit → typed `OmniError::AlreadyInitialized { uri }`. The destructive code paths (the `write_text` overwrite and the best-effort cleanup) are now unreachable in strict mode against an existing graph. * `force: true` skips the preflight; existing operators who actually mean to overwrite opt in explicitly. * CLI: `omnigraph init --force` maps to `InitOptions { force: true }`. * HTTP: `OmniError::AlreadyInitialized` maps to 409 via `ApiError::from_omni`. Not currently HTTP-reachable (POST /graphs was pulled), but the wiring lands here so a future runtime create endpoint has one canonical translation. Closes the "init is destructive against existing state" class. The regression test added in the previous commit (`init_on_existing_graph_uri_does_not_destroy_existing_schema`) turns green: the original schema files now survive a second init attempt byte-for-byte, and the call errors cleanly with `AlreadyInitialized`. The four existing `init_failpoint_after__cleans_up_` tests stay green — strict mode's preflight passes on a fresh tempdir, and cleanup still runs as before when a failpoint fires mid-write. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: split PolicyEngine::load into kind-typed loaders Pre-fix, every caller of `PolicyEngine::load(path, graph_id)` passed some `graph_id` argument — even when the policy was server-scoped and Cedar's resolution would never touch a Graph entity. The server-level loader at lib.rs passed the meaningless sentinel `"server"`. A graph policy file containing a `graph_list` rule compiled fine; a server policy file containing a `read` rule compiled fine. Both silently no-op'd at request time because the engine kind and the rule's resource kind disagreed. Correct-by-design fix: replace `load` with two kind-typed loaders. * `PolicyEngine::load_graph(path, graph_id)` — for per-graph policy files. Rejects any rule whose action `resource_kind()` is `Server`. * `PolicyEngine::load_server(path)` — for server-level policy files. Takes no `graph_id`: server-scoped actions resolve against the singleton `Omnigraph::Server::"root"` entity, never a Graph. Rejects any rule whose action `resource_kind()` is `Graph`. The old `load` is hard-deleted in the same commit because every in-tree consumer migrates here (no semver promise on the workspace crate, no external pinners). New `PolicyEngineKind` enum types the loader's intent; `validate_kind_alignment` is the load-time check that closes the "wrong action, wrong file, silent no-op" class — operators get a load-time error instead of confused-and- silent behavior at request time. Callsites migrated: * server lib.rs:374 (single-mode per-graph) → load_graph * server lib.rs:1065 (multi-mode server) → load_server * server lib.rs:1103 (multi-mode per-graph) → load_graph * CLI main.rs:732 (resolve_policy_engine) → load_graph * tests/server.rs ×5 (4 graph, 1 server) → load_graph/load_server * policy_engine_chassis.rs → load_graph Four new in-source tests pin the contract: both rejection paths and both positive paths. Closes the "operator puts an action in the wrong file and the rule silently never matches" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: introduce GraphRouting, retire single_mode_handle Pre-fix, `AppState` always carried `Arc<GraphRegistry>` even when serving one graph. Single mode populated the registry with one handle keyed by the `SINGLE_GRAPH_KEY_ID = "default"` sentinel; `single_mode_handle` walked the registry, asserted `len == 1`, and returned the single element with a 500-class "programmer error" branch on mismatch. Three smells in a row — magic key, walk-and-assert, programmer-error guard — all because the single-mode runtime was forced through a multi-mode abstraction. Correct-by-design fix: type the routing. * New `pub enum GraphRouting { Single { handle }, Multi { registry, config_path } }` on `AppState`. The `Single` arm carries the handle directly — no registry, no key, no walk. * `resolve_graph_handle` middleware matches on `routing`. Single mode returns the handle in O(1); multi mode does the same path-extract + registry lookup as before. The 500-class programmer-error branch is gone — the type system now makes the violated invariant ("single mode has exactly one handle") unrepresentable. * `requires_bearer_auth` reads `handle.policy.is_some()` directly in the Single arm; Multi arm still uses the cached `any_per_graph_policy` flag. `ServerMode` and the legacy `registry` field on `AppState` are still populated for now — C-3 removes both once every reader is migrated. The `SINGLE_GRAPH_KEY_ID` sentinel and `ServerMode` will also go away in C-3. Closes the "single mode forced through a multi-mode abstraction" class. All 76 server integration tests stay green: handlers still extract `Extension<Arc<GraphHandle>>` from the request, so the middleware's internal change is invisible to them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: remove ServerMode, registry field, and the SINGLE_GRAPH sentinel C-1/C-2 introduced `GraphRouting` and pointed the middleware at it. This commit removes the legacy shape that's now dead: * `ServerMode` enum — deleted. Single mode's `uri` lives on `handle.uri`; multi mode's `config_path` lives on the `GraphRouting::Multi` arm. * `AppState.mode: ServerMode` field — deleted. * `AppState.registry: Arc<GraphRegistry>` field — deleted. Multi mode's registry is on `GraphRouting::Multi { registry, .. }`; single mode has no registry at all. * `AppState::mode()`, `AppState::uri()`, `AppState::registry()` accessors — deleted. New `AppState::routing() -> &GraphRouting` is the single public entry point. * `SINGLE_GRAPH_KEY_ID` constant — deleted. `GraphHandle.key` is still required by the struct, but in single mode the key is now only a tracing label (`"default"`, inlined with a comment naming its sole remaining purpose). Single-mode flat routes never carry a `{graph_id}` parameter, so the key is never compared against user input, and there is no registry where it could be a map key. C-1/C-2 already removed the registry walk that the sentinel was named for. Callers migrated: * `build_app` (lib.rs:944) — matches on `state.routing()` instead of `state.mode()`. * `server_graphs_list` (lib.rs:1162) — destructures the Multi arm to get the registry; Single arm short-circuits to 405. * `server_openapi` (lib.rs:1217) — matches the Multi arm for the cluster-prefix rewrite. * `tests/server.rs:3735` — the B2 footgun regression test now matches on `state.routing()` to extract the single-mode handle (the test's earlier `state.registry().list().next()` shape was the closest pre-fix analog to "embedded consumer reaches the engine"; the new shape is more direct). Closes the entire "single mode forced through a multi-mode abstraction" class. After this commit: * No magic sentinel as a routing key. * No `single_mode_handle` walk-and-assert helper. * No 500-class "programmer error" branch in the middleware. * No two-field discriminant on `AppState` where one would do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: regression test for nested-route path extraction (red) `server_branch_delete` and `server_commit_show` use bare `Path<String>` extractors. In single-mode flat routes (`/branches/{branch}`, `/commits/{commit_id}`) this works — one capture, one value. In multi-graph cluster routes (`/graphs/{graph_id}/branches/{branch}`, `/graphs/{graph_id}/commits/{commit_id}`) axum 0.8 propagates the outer `{graph_id}` capture into the inner handler, so the extractor sees two captures and 500s with "Wrong number of path arguments. Expected 1 but got 2." `cluster_routes_dispatch_per_graph_handle` only exercises `/snapshot` (no Path extractor), so the regression slipped through. This test closes that gap structurally: every cluster route with an inner path param gets exercised here. Currently fails with the exact symptom above. Fix in the next commit makes it pass. Per AGENTS.md rule 12, the red test commit lands just before the fix so the pair is visible in `git log` and a reviewer can check out this commit alone to reproduce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: named-field path-param structs for nested cluster routes (green) `Path<String>` deserializes one path-param value positionally. Single-mode flat routes (`/branches/{branch}`, `/commits/{commit_id}`) have one capture; multi-mode nested routes (`/graphs/{graph_id}/branches/{branch}`, `/graphs/{graph_id}/commits/{commit_id}`) have two — axum 0.8 propagates the outer capture into nested handlers. Same handler, two different shapes; the multi-mode shape 500s with "Wrong number of path arguments. Expected 1 but got 2." Symptomatic fix: change to `Path<(String, String)>` and ignore the first element. Breaks again the moment we add another nest layer (e.g. tenant in Cloud mode). Correct-by-design fix: named-field structs deserialized by name from axum's path-param map. Each handler picks only the fields it needs. Stable across single / multi / future-cloud nest depths because deserialization is by field name, not position. * New `BranchPath { branch: String }` (file-local to lib.rs) * New `CommitPath { commit_id: String }` * `server_branch_delete` extractor → `Path<BranchPath>` * `server_commit_show` extractor → `Path<CommitPath>` Closes the "handler path-extractor type is positional and breaks when route nesting changes" class. Red test from the previous commit turns green. All 77 server tests pass (single-mode branch delete + commit show, plus new multi-mode coverage). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: centralize policy-requires-tokens check in the runtime classifier Single-mode `open_with_bearer_tokens_and_policy` bailed at lib.rs:380 when policy was installed and no tokens. Multi-mode `open_multi_graph_state` had no equivalent: the server started, every request 401'd because no token could ever match, and the operator spent time debugging a misconfiguration the single-mode path would have caught at startup. The doc/code contradiction made the gap easy to miss: the `ServerRuntimeState::PolicyEnabled` docstring said tokens-or-not was "unusual but valid — every request fails 401 without a bearer, which is effectively 'locked'." The single-mode bail contradicted that. In practice, silent-401-on-every-request is bug-shaped, not feature-shaped (operators wanting deny-all should configure tokens plus a deny-all Cedar rule to get meaningful 403s with policy-decision logging). Symptomatic fix: add a copy of the bail to multi-mode. Two copies that can drift again the next time a startup path is added. Correct-by-design fix: hoist the check into `classify_server_runtime_state` so both modes get the same enforcement from one source of truth. The classifier becomes the single source of truth for "should we start?" and adding a startup invariant there is now the natural extension point for any future mode. Classifier matrix is now complete: \| has_tokens \| has_policy \| allow_unauthenticated \| Result \| \|---\|---\|---\|---\| \| F \| F \| F \| bail (existing) \| \| F \| F \| T \| Open (existing) \| \| T \| F \| * \| DefaultDeny (existing) \| \| F \| T \| * \| bail (NEW — closes the gap) \| \| T \| T \| * \| PolicyEnabled (existing) \| Changes: * `classify_server_runtime_state` (lib.rs:870-890) gains the `(false, true, _) => bail!(…)` arm with a clear message naming the failure mode and the two valid resolutions. * `open_with_bearer_tokens_and_policy` (lib.rs:369+) drops its redundant local bail — the classifier rejected the invalid case before construction was reached. * `ServerRuntimeState::PolicyEnabled` docstring is rewritten: drops the "(unusual but valid)" carve-out and states plainly that PolicyEnabled requires tokens. Names the explicit alternative (tokens + deny-all Cedar rule) for operators who want the all-requests-denied behavior. * `classify_policy_enabled_always_wins` test is renamed to `classify_policy_enabled_requires_tokens` and the now-invalid `(false, true, _)` assertion is removed (covered by the new rejection test). * New `classify_policy_without_tokens_is_rejected` test covers the new arm. * New `serve_refuses_to_start_with_policy_but_no_tokens_multi_mode` integration test pins the multi-mode propagation path — symmetric with the existing single-mode `serve_refuses_to_start_in_state_1_without_unauthenticated`. Closes the "single mode and multi mode startup branches can drift on safety invariants" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: close coverage gaps surfaced by the test-coverage audit The bot-review pass and the subsequent coverage audit surfaced two material gaps in PR #119's test surface — both easy to close, both worth closing before merge. * Gap 1 — cluster-route sweep. The Bug-1 path-extractor regression slipped through because `cluster_routes_dispatch_per_graph_handle` only exercised `/snapshot`. The other six protected cluster routes (`/read`, `/change`, `/export`, `/schema`, `/schema/apply`, `/ingest`, `/branches/merge`) were implicitly trusted to work without any multi-mode integration test. Add `all_protected_cluster_routes_resolve_to_their_handler` (`tests/server.rs`) that hits each protected cluster route with a minimal request and asserts the response is consistent with the handler being reached — no 404 (router didn't match), no 500 with "Wrong number of path arguments" (Bug-1 class), no 500 with "missing extension" (routing middleware didn't inject the handle). Status code is a negative assertion because each handler's happy-path inputs differ; what matters is "the request reached the handler," not "the handler returned 200" — that's already pinned by the single-mode tests. * Gap 2 — `--force` happy path. The strict re-init regression test (`init_on_existing_graph_uri_does_not_destroy_existing_schema`) pins the error path; nothing pinned the `force: true` escape hatch actually doing what its docstring claims. Add `init_with_force_recovers_from_orphan_schema_files` (`tests/lifecycle.rs`). Writes a bare `_schema.pg` to simulate orphan files from a failed prior init, confirms strict mode bails as expected, then confirms `init_with_options(force: true)` succeeds and produces a functional graph. Note: the test follows the documented semantics — force skips the preflight only, it does NOT purge existing Lance state. An earlier draft of the test (against full overwrite of an existing populated graph) failed because `GraphCoordinator::init` errored on the existing `__manifest`, which is exactly the limitation the `InitOptions::force` docstring already calls out. Recursive purge needs `StorageAdapter::delete_prefix` (tracked separately). Coverage is now fully aligned with the PR's claims. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: regression test for GraphList open-mode bypass (red) Cursor bot's review at commit `4120448` surfaced that `server_graphs_list` returns 200 in Open mode (`--unauthenticated`, no tokens, no policy), exposing the full graph registry — graph IDs and URIs that may contain S3 bucket paths or internal hostnames — to any unauthenticated caller. Root cause: `authorize_request`'s no-policy fallback only denies when `actor.is_some()`. In Open mode `actor: None`, so the denial branch never fires and the call returns `Ok(())`. The docstring on `server_graphs_list` claims the endpoint is "Cedar-gated" and that we "don't leak the registry until the operator explicitly authorizes it" — but Open mode has no Cedar at all, so the docstring intent and the code disagree. This commit renames the existing `get_graphs_lists_registered_graphs_in_multi_mode` test to `get_graphs_denied_in_open_mode_without_server_policy` and flips the assertion from 200 → 403. Today this fails (server returns 200) — exactly the symptom the bot named. The fix in the next commit tightens the no-policy fallback to deny server-scoped actions unconditionally, regardless of mode. Per AGENTS.md rule 12, the red test commit lands just before the fix so the red → green pair is visible in `git log` and a reviewer can check out this commit alone to reproduce. Sort-order coverage that previously lived in the renamed test moves to `get_graphs_with_server_policy_authorizes_per_cedar` in the next commit, where the admin-200 response is operator- authorized and a non-empty body is asserted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: server-scoped actions always require explicit policy (green) `server_graphs_list` returned 200 in Open mode (`--unauthenticated`, no tokens, no policy) because `authorize_request`'s no-policy fallback only denied when `actor.is_some()` AND action != Read. In Open mode `actor: None`, so the denial branch never fired and the call returned `Ok(())` — leaking the registry (graph IDs + URIs that may contain S3 bucket paths or internal hostnames) to any unauthenticated caller. The docstring on `server_graphs_list` claimed it was "Cedar-gated" and that the server should "not leak the registry until the operator explicitly authorizes it" — docstring intent and code disagreed. Symptomatic fix: special-case GraphList. Breaks the moment another server-scoped action (`graph_create`, `graph_delete`) is added. Correct-by-design fix: tie authorization to the action's `resource_kind()`. Server-scoped actions (`PolicyResourceKind::Server`) always require explicit policy authorization — there is no runtime state where they're served by default. Per-graph actions keep the existing default-deny logic (DefaultDeny denies non-Read for authenticated actors; Open mode allows everything per the operator's `--unauthenticated` opt-in for graph DATA, but not for server topology). The fix uses the existing `PolicyResourceKind` enum that #119 already added — no new abstraction. Future server-scoped actions (runtime `graph_create`/`graph_delete` when the cluster catalog ships) automatically pick up the same enforcement without any per-action handler change. Changes: * `crates/omnigraph-server/src/lib.rs:51` — re-export `PolicyResourceKind` (the kind discriminator was already public on the omnigraph-policy crate; needed in scope here). * `crates/omnigraph-server/src/lib.rs:1457` — `authorize_request`'s no-policy fallback gains a server-scoped-action check that fires before the actor-based default-deny logic. Error message names the failure mode and points at `server.policy.file`. * `crates/omnigraph-server/tests/server.rs:5037` — `get_graphs_with_server_policy_authorizes_per_cedar` extended to register two graphs in non-alphabetical order and assert the admin-200 response is sorted alphabetically. Restores the sort-order coverage that lived in `get_graphs_lists_registered_graphs_in_multi_mode` before the red commit renamed it to assert denial. Also bundles a small adjacent cleanup that the bot-review flagged: * `crates/omnigraph-server/src/graph_id.rs:124` — drop the unreachable `"openapi.json"` entry from `is_reserved`. The regex `^[a-zA-Z0-9-]{1,64}$` rejects every dot-containing name before `is_reserved` can run, so dotted entries in this list were dead code that misled readers into thinking the list needed to cover them. Comment now names the structural exclusion. The `rejects_reserved_route_names` test loses its `openapi.json` row (covered by `rejects_dots` via the regex). Closes the "server-scoped management actions silently leak in Open mode" class. Red test from the previous commit (`get_graphs_denied_in_open_mode_without_server_policy`) turns green; all 78 server integration tests + 76 lib tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: fold multi-graph work into v0.6.0 (no separate v0.7.0 release) The branch had bumped workspace versions to 0.7.0 and added a dedicated `docs/releases/v0.7.0.md` for the multi-graph work. Per scope decision: ship the graph-rename and the multi-graph mode in one v0.6.0 release. Changes: * Workspace versions bumped 0.7.0 → 0.6.0 in every crate manifest (`omnigraph`, `omnigraph-compiler`, `omnigraph-policy`, `omnigraph-server`, `omnigraph-cli`) and their internal `path = ..., version = "..."` dependency constraints. * `docs/releases/v0.7.0.md` content merged into `docs/releases/v0.6.0.md`, retargeted to a single coherent v0.6.0 release note covering both the graph terminology rename and the multi-graph server mode. The original v0.7.0.md is deleted. * All `v0.7.0` / `0.7.0` doc and comment references throughout `crates/`, `docs/`, `AGENTS.md`, and `openapi.json` retargeted to `v0.6.0` / `0.6.0`. `Cargo.lock` regenerated to match. * OpenAPI spec regenerated via `OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi openapi_spec_is_up_to_date` — `"version": "0.6.0"` now. Verification: * `cargo build --workspace` — clean (6 pre-existing engine warnings only). * `cargo test --workspace --locked` — zero failures across all 39 test result groups. * `bash scripts/check-agents-md.sh` — passes (34 links / 33 docs). * `grep -rn "0\.7\.0\\|v0\.7\.0" --include='.rs' --include='.md' --include='.json' --include='.toml' .` returns no workspace hits. The three remaining `0.7.0` strings in `Cargo.lock` belong to unrelated 3rd-party crates (`pem-rfc7468`, `radium`, `rand_xoshiro`). The git tag and crates.io publish happen later — this commit just consolidates the surface so the eventual release is one coherent v0.6.0 covering all the work since v0.5.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: sanitize internal refs from v0.6.0 release notes cubic-dev-ai P2 comments flagged that the release notes carried internal Linear ticket and RFC references (MR-668, MR-731, MR-723, RFC 0003, RFC 0004). Per AGENTS.md maintenance rule 5, "Release docs are public project history. Describe capabilities, behavior changes, breaking changes, upgrade notes, and user impact; do not reference private ticket systems, internal codenames, or planning shorthand that an outside contributor cannot inspect." The bot's comments are correct against our own published contract — they were a docs-quality regression introduced when I drafted these notes. Replaced each internal reference with the public-facing concept it stood for. The substantive content (capabilities, behavior, guarantees) was already present alongside the refs; sanitization just trimmed the bracketed ticket labels: * Line 6: dropped `(MR-668)` from the multi-graph mode summary — the descriptive name was already self-sufficient. * Line 24: `MR-731 spoof defense` → `the bearer-derived-actor- identity guarantee`; `Forward-compat for Cloud mode (RFC 0003) and OAuth provider (RFC 0004)` → "forward-compat seams for future multi-tenant and OAuth deployments; they're inert in this release" — describes what the operator sees instead of pointing at planning docs. * Line 26: `MR-731's server-authoritative-actor invariant` → "the server-authoritative-actor invariant: actor identity is always sourced from the bearer-token match resolved at the auth boundary" — the public-facing statement of the guarantee. * Line 36: `(MR-723 default-deny otherwise rejects …)` → "without a server policy the default-deny posture rejects …" — same content, no ticket label. * Line 121: `MR-731 spoof regression test` → "The bearer-auth- derived-actor-identity regression test (client-supplied identity headers are ignored; the server-resolved actor is the only identity Cedar sees)" — describes what the test guards instead of naming the originating ticket. Verified: `grep -E 'MR-\d+\|RFC[ -]?\d+' docs/releases/v0.6.0.md` returns no matches; the rest of `docs/releases/` is also clean. `scripts/check-agents-md.sh` passes. Note: cubic-dev-ai also flagged `crates/omnigraph-cli/src/main.rs:276` ("doc comment incorrectly references v0.6.0 for a command that only exists in v0.7.0"). That comment is based on a stale model of the release surface — after folding v0.7.0 into v0.6.0 in the previous commit, the multi-graph CLI surface IS in v0.6.0 and the comment is correct as written. No change needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: close validated init and multi-graph gaps * chore: address review cleanup comments --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 16:19:31 +02:00
Ragnor Comerford	cc2412dc65	Rename repo terminology to graph (#118 ) Some checks failed CI / Classify Changes (push) Has been cancelled Details CI / Check AGENTS.md Links (push) Has been cancelled Details Release Edge / Prepare edge release (push) Has been cancelled Details CI / Test Workspace (push) Has been cancelled Details CI / Test omnigraph-server --features aws (push) Has been cancelled Details CI / RustFS S3 Integration (push) Has been cancelled Details Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled Details Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled Details	2026-05-24 16:46:00 +01:00
Andrew Altshuler	bb1fe57640	release: v0.5.0 (#115 ) * gitignore: exclude docs/internal/ from publication Mirrors the existing "Local-only working files (not for the public repo)" pattern. Working notes filed under docs/internal/ stay on the contributor's machine instead of cluttering the published doc tree or tripping the AGENTS.md / docs-index cross-link check (scripts/check-agents-md.sh enumerates every docs/.md and requires each one to be linked from an audience index — internal notes don't have an audience index by definition). Incidental to the v0.5.0 release; lands separately from the version bump commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> ci: skip docs/internal/ in agents-md cross-link check Matches the .gitignore exclusion. Mirrors the existing 'docs/releases/' exclusion pattern: notes under docs/internal/ aren't part of the published doc tree and don't need to be linked from an audience index. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * release: v0.5.0 — Lance 6 substrate, Cedar policy engine, schema-lint v1 Bumps the workspace from 0.4.2 to 0.5.0. Release notes at docs/releases/v0.5.0.md. Three user-visible pillars motivate the minor bump: 1. Lance 6.0.1 substrate (DataFusion 52→53, Arrow 57→58) 2. Engine-wide Cedar policy enforcement on every _as writer; server defaults to deny-all; signed-token-claim-only actor identity 3. Schema-lint v1 chassis: OG-XXX-NNN codes, soft drops, and `--allow-data-loss` (Hard mode) for destructive migrations Plus structured DataFusion Expr filter pushdown (unblocks CompOp::Contains via array_has), HTTP allow_data_loss parity, inline .gq sources on CLI/HTTP, optional CORS layer, and bug fixes (merge-insert dup-rowid, branch-merge coordinator restore on error, blob columns in branch merge). Sites bumped: - 5 crate [package].version lines (omnigraph, omnigraph-cli, omnigraph-compiler, omnigraph-policy, omnigraph-server) - 10 internal path-dep `version = "..."` constraints across the four manifests that depend on sister crates (engine, server, cli, plus engine's dev-dep on the compiler) - Cargo.lock (regenerated via cargo update --workspace) - AGENTS.md "Version surveyed:" - openapi.json `info.version` (regenerated via OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi) Verification: - cargo test --workspace --locked: 907/907 green - cargo test -p omnigraph-engine --test failpoints --features failpoints: 19/19 green - cargo test -p omnigraph-engine --test lance_surface_guards: 3/3 - scripts/check-agents-md.sh: clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 13:59:42 +01:00
Andrew Altshuler	3551e0d40e	chore(lance): bump 4.0.0 → 6.0.1 (DataFusion 52→53, Arrow 57→58) (#111 ) * tests: add lance_surface_guards pre-flight pins for the v6 bump Land 8 named guards in a new test file that pin Lance API surfaces OmniGraph relies on. Each guard turns a silent-break risk (variant rename, struct restructure, async-flip) into a red CI bar instead of runtime drift. Guards (mapped to the silent-break inventory from the v6 migration plan): Runtime (#[tokio::test]): 1. lance_error_too_much_write_contention_variant_exists — pins the variant referenced by db/manifest/publisher.rs::map_lance_publish_error. 2. manifest_location_field_shape — pins .path/.size/.e_tag/.naming_scheme types and ManifestLocation accessor returning &Self (the access pattern at db/manifest/metadata.rs:84-88). 6. write_params_default_does_not_set_storage_version — confirms our explicit V2_2 pin remains load-bearing (blob v2 requirement). Compile-only async fns (#[allow(...)] + unimplemented!() placeholders; never run, but cargo build --tests enforces the API shape): 3. checkout_version + restore chain — pins the recovery rollback hammer at db/manifest/recovery.rs:505-522. 4. DatasetBuilder::from_namespace().with_branch().with_version().load() — pins the namespace builder chain at db/manifest/namespace.rs:162-174. 5. MergeInsertBuilder fluent chain — pins the manifest CAS at db/manifest/publisher.rs:370-391, including the return shape (Arc<Dataset>, MergeStats). 7. compact_files(&mut ds, CompactionOptions, None) — pins db/omnigraph/optimize.rs:107. 8. DeleteResult { new_dataset, num_deleted_rows } — pins the inline delete result shape (MR-A will repurpose this guard to the staged two-phase variant once Lance #6658 migration lands). This is commit 1 of the chore/lance-6.0.1 migration. Cargo bump follows in commit 2 (will trigger the guards under v6 if any surface drifted). Per the migration plan at ~/.claude/plans/shimmering-percolating-duckling.md (written this session). Two guards from the plan deferred to follow-up: - manifest_cas_returns_row_level_contention_variant (full publisher race integration test — needs harness scaffolding) - table_version_metadata_byte_compatible_with_v4 (TableVersionMetadata is pub(crate); requires test reach extension). Verified on v4: cargo test -p omnigraph-engine --test lance_surface_guards passes 3/3 runtime tests; cargo build -p omnigraph-engine --tests compiles all 5 compile-only guards clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(deps): bump Lance 4.0.0 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58 The Cargo bump itself. Source is intentionally untouched — this commit will not compile. The compile errors are the work-list for subsequent commits on this branch. Lance updates: lance + 7 sub-crates 4.0.0 → 6.0.1. Transitive churn: + lance-tokenizer v6.0.1 (vendored tokenizer per Lance PR #6512) + object_store 0.13.x (Lance 6 brings it transitively; our explicit pin stays at 0.12.5 for now — revisit in stages if diamond bites) - tantivy* crates (replaced by lance-tokenizer) Compile error landscape on this commit (11 errors): • 1× E0432: `lance_index::DatasetIndexExt` import (Lance PR #6280 moved it to lance::index). Sites: table_store.rs:20, db/manifest.rs:37 (the second site was missed by the pre-flight inventory). • 8× E0599: `create_index_builder` / `load_indices` missing on `lance::Dataset` — all downstream of the DatasetIndexExt move. Once the import is corrected on table_store.rs and db/manifest.rs, these resolve automatically. • 2× E0063: missing field `is_only_declared` in `DescribeTableResponse` initializer at db/manifest/namespace.rs:221, 364. New Lance namespace field per the v5 namespace restructure (PR #6186). Surface guards (lance_surface_guards.rs, commit `d571fa8`) all still compile + the 3 runtime ones pass on v6 — none of the silent-break surfaces drifted. That's the load-bearing observation: the publisher CAS chain, ManifestLocation field shape, checkout_version/restore, DatasetBuilder fluent chain, MergeInsertBuilder return shape, WriteParams::default, compact_files signature, and DeleteResult fields are all v6-stable. Next commits address the 11 errors per the migration plan stages 3-8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * imports: move DatasetIndexExt to lance::index (Lance PR #6280) Lance 5.0 (PR #6280) moved `DatasetIndexExt` out of `lance-index` into `lance::index`. `is_system_index` and `IndexType` stayed in `lance-index`. Mechanical update of 6 import sites: crates/omnigraph/src/table_store.rs:20 — split into two `use` lines crates/omnigraph-server/tests/server.rs:10 — was traits::DatasetIndexExt crates/omnigraph/tests/search.rs:6 crates/omnigraph/tests/branching.rs:7 crates/omnigraph/tests/failpoints.rs:467 crates/omnigraph-cli/tests/cli.rs:3 — was traits::DatasetIndexExt All 9 E0599 cascading errors on .create_index_builder / .load_indices resolve once the trait is back in scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * namespace: add is_only_declared field to DescribeTableResponse Lance namespace 6.0.0 added `is_only_declared: Option<bool>` to `DescribeTableResponse` (lance-namespace-reqwest-client 0.7+ via the v5.0 namespace API restructure, Lance PR #6186). Set to `Some(false)` because every table BranchManifestNamespace returns from describe_table is materialized — the manifest snapshot only includes entries for tables we've already opened via Dataset::open. Two sites in db/manifest/namespace.rs (BranchManifestNamespace + StagedTableNamespace impls of LanceNamespace::describe_table). Closes the last two compile errors from the v6 bump in the engine lib. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cargo: add lance to omnigraph-cli + omnigraph-server dev-deps Stage 3 moved DatasetIndexExt imports from `lance-index` to `lance::index` in the cli and server test crates. Both crates only had `lance-index` in their dev-dependencies; add `lance` alongside so the new path resolves. This is the last compile-error fix from the v6 bump — `cargo build --workspace --tests` is now green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: refresh Lance alignment audit for v6.0.1; bump surveyed version Per CLAUDE.md maintenance rule 2 (same-PR docs): - docs/dev/lance.md: replace the v4.0.1 alignment audit stanza with the v6.0.1 audit. Captures every v5/v6 finding from this PR (the DatasetIndexExt move, DescribeTableResponse.is_only_declared, MergeInsertBuilder return shape, ManifestLocation field shape, LanceFileVersion::default flip, file-reader async, tokenizer vendor, Lance #6658/#6666/#6877 status). Cross-references each guard in tests/lance_surface_guards.rs. - AGENTS.md: bump "Storage substrate: Lance 4.x" → "Lance 6.x". Note: surveyed crate version stays at 0.4.2 — substrate version bumps are independent of OmniGraph's release version. - crates/omnigraph/src/storage_layer.rs: update the trait module-level doc-comment to reflect that Lance #6658 closed 2026-05-14 and delete_where two-phase migration is MR-A (the next follow-up). #6666 stays open; create_vector_index inline residual stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tests: silence clippy::diverging_sub_expression on compile-only guards The five `_compile_` async fns in lance_surface_guards.rs use `let ds: Dataset = unimplemented!()` as a placeholder so type inference can chase the method chain we want to pin, without ever running the function. Clippy's `diverging_sub_expression` lint flags this pattern because the RHS diverges; that's the entire point. Added to the per-fn `#[allow(...)]` list, alongside dead_code / unreachable_code / unused_variables / unused_mut already there. No behavior change. cargo test -p omnigraph-engine --test lance_surface_guards still 3/3 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> docs: correct #6658 status — closed but API ships in Lance v7.x, not v6.0.1 The audit stanza in docs/dev/lance.md and the storage_layer.rs trait doc-comment both implied the public DeleteBuilder::execute_uncommitted API shipped with Lance 6.0.1. It did not. Issue #6658 closed 2026-05-14, but binary search across the release stream confirms: v6.0.1 ❌ no pub async fn execute_uncommitted on DeleteBuilder v6.1.0-rc.1 ❌ v7.0.0-beta.5 ❌ v7.0.0-beta.10 ✅ first appearance v7.0.0-rc.1 ✅ So MR-A (delete two-phase migration) is gated on the Lance v7.x bump, not on this PR. v7.0.0-rc.1 dropped 2026-05-21; GA likely within a week. No behavior change. Doc-only correction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(lib): bump recursion_limit to 256 — Lance 6 trait depth on Linux Lance 6's heavier trait surface around futures/streams in storage_layer.rs's staged-write API pushes the rustc trait-resolution recursion limit past the default 128 on Linux builds. CI on PR #111 surfaced this in both `Test Workspace` and `Test omnigraph-server --features aws`: error: queries overflow the depth limit! = help: consider increasing the recursion limit by adding a `#![recursion_limit = "256"]` attribute to your crate (`omnigraph`) = note: query depth increased by 130 when computing layout of `{async block@crates/omnigraph/src/storage_layer.rs:697:5: 697:10}` (The async block is `stage_create_btree_index`'s body — its return type is several layers of `impl Future<Output=Result<StagedHandle>>` deep on top of Lance's own builder return types.) Local macOS builds happened to short-circuit before tripping the limit, which is why this didn't surface during the v6 bump sequence. The fix rustc itself suggests is one line at the crate root. No behavior change. Revisit if a future Lance bump stops needing it. Verified: `cargo build --locked -p omnigraph-server --features aws` compiles clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 00:42:29 +01:00
Andrew Altshuler	da42beec41	policy: chassis fan-out — _as variants on the remaining 6 writers (MR-722) (#103 ) PR #102 wired apply_schema_as. This PR completes the chassis-side coverage so every public mutating engine entry point hits the same Omnigraph::enforce(action, scope, actor) gate regardless of transport: - mutate_as → enforce(Change, Branch(branch), actor) - load_as → enforce(Change, Branch(branch), actor) - ingest_as → enforce(Change, Branch(branch), actor); also threads actor through the implicit branch_create_from_as so fresh-branch ingest correctly hits BranchCreate too - branch_create_as → enforce(BranchCreate, TargetBranch(name), actor) - branch_create_from_as → enforce(BranchCreate, BranchTransition { source, target }, actor) - branch_delete_as → enforce(BranchDelete, TargetBranch(name), actor) - branch_merge_as → enforce(BranchMerge, BranchTransition { source, target }, actor) Three new _as variants for branch ops (create, create_from, delete) that had no actor surface before; existing actor-less variants delegate with actor=None so the no-policy path is a strict no-op. HTTP handlers updated to thread the resolved actor into the new _as variants for branch_create and branch_delete (was previously dropped). 14 new SDK chassis tests (one allow + one deny pair per wired writer); the existing 4 apply_schema_as tests stay. All 18 pass. docs/user/policy.md updated to describe engine-wide enforcement and the coarse-vs-fine layer split (engine = action gate, query layer per-row = MR-725 future). AGENTS.md capability matrix updated to match. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 03:38:18 +03:00
Andrew Altshuler	0de5f69d86	docs: drop npx mdrip; use curl \| pandoc for full-page fetches (#97 ) The previous "fetch the full page" recommendation in AGENTS.md and docs/dev/lance.md pointed at an unknown-author npm CLI that, on consent, wrote agent-targeted content into AGENTS.md and modified .gitignore / tsconfig.json. Source audit was clean of malicious code but the self-perpetuating prompt-injection pattern combined with a single maintainer and ~21 downloads/day made it not worth the risk. Switched to the curl + pandoc command already documented as the no-tool option. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:06:24 +03:00
Andrew Altshuler	60eee78465	docs: split user and developer docs (#93 )	2026-05-15 03:45:22 +03:00
Andrew Altshuler	6bad829ed0	branch-protection: declarative policy + apply script (#89 ) Branch protection on main, declared as code rather than as opaque GitHub UI state. Pairs with the CODEOWNERS chassis (#88): once this PR lands and an admin runs the apply script, every PR to main must satisfy code-owner review and the listed required checks. Components: - .github/branch-protection.json — the policy. Edit this to change required checks, review counts, etc. Includes a _comment field for human readers; the apply script strips it before PUT. - scripts/apply-branch-protection.sh — idempotent apply via `gh api`. Reads back current state for verification. Supports DRY_RUN=1. - docs/branch-protection.md — explains the policy, how to apply, how to change, why declared as code. - AGENTS.md topic-index row. Policy summary: - Required status checks (strict): Classify Changes, Check AGENTS.md Links, Test Workspace, Test omnigraph-server --features aws, CODEOWNERS / drift, CODEOWNERS / noedit. - Required approving reviews: 1, must be a code owner. - Dismiss stale reviews on new commits. - Required linear history (squash or rebase merges only). - No force pushes, no deletions, no admin bypasses. - Required conversation resolution. What's NOT in this PR: - Required signed commits — not yet; maintainers must enroll GPG/SSH signing first or merges will block. - Tag protection for v* tags — separate PR. - Additional required checks (cargo deny, audit, fmt, clippy, CodeQL, schema-lint MR-946) — separate PRs as each lands. - The script is NOT run by CI. Branch-protection changes are admin actions; CI-driven auto-apply would defeat the purpose. Manual invocation is the audit point. How to apply after merge: ./scripts/apply-branch-protection.sh Requires gh-CLI auth with repo-admin permissions. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:38:20 +03:00
Andrew Altshuler	730712b73f	codeowners: yml source of truth + generator + drift CI (#88 ) * codeowners: generator + drift CI + initial roles Source-of-truth approach to CODEOWNERS: yml is hand-edited, CODEOWNERS is generated and CI-enforced. Every role change is a reviewable PR with a permanent in-repo audit trail. No GitHub UI clicks, no shadow state. Initial roles: engineering @aaltshuler owns crates/** + default (.github/, scripts/, Cargo., openapi.json, everything else not docs) docs @aaltshuler @ragnorc owns docs/, README.md, AGENTS.md, CLAUDE.md, SECURITY.md Per GitHub semantics, multiple owners on a CODEOWNERS line means "any one satisfies the review" — for docs, either named member can approve. Strict "N distinct approvers" would need a CI workaround (not wired today; tracked for future hardening). Components: - .github/codeowners-roles.yml — source of truth. Edit this. - .github/scripts/render-codeowners.py — generator (PyYAML; ~100 LoC). - .github/CODEOWNERS — generated. CI rejects hand-edits. - .github/workflows/codeowners.yml — two checks: drift: re-render and assert CODEOWNERS matches. * noedit: reject PRs that edit CODEOWNERS without editing the yml. - docs/codeowners.md — explains the source-of-truth pattern, how to change roles, how to add new roles. - AGENTS.md topic-index row. What's NOT in this PR: - Branch protection on main (separate PR; needs `gh api` call against the org). - Required-reviewer enforcement (depends on branch protection landing). - Required CI status checks (depends on branch protection landing). - Scheduled rotation (the schedule: block in the yml + a weekly workflow). Today's roles are stable; rotation isn't needed yet. - Linear-as-source-of-truth integration (Approach 4 from the design discussion; deferred). Verified: - Generator output is deterministic (idempotent re-runs). - scripts/check-agents-md.sh OK (28 links, 28 docs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * codeowners: fix catch-all ordering (Devin review #88) Devin caught a real bug: GitHub CODEOWNERS uses "last match wins" semantics, but the generator emitted the catch-all `` AFTER specific patterns. Net effect: `` won for every file, silently nullifying the docs role and never routing reviews to @ragnorc. Fix is one-line — emit the default `` line before iterating the specific paths. Also: - Added a regression assertion in the generator: after rendering, the first non-comment line must start with `` if a default is configured. Generator exits non-zero otherwise. Catches the same class of mistake in any future refactor. - Rewrote the yml header comment, which incorrectly stated "keep more-specific paths after broader patterns" (correct for GitHub semantics but the generator was doing the opposite — so the comment read as a description of behavior when it was actually a contradicted intention). Verified by re-rendering: `` is now line 12, `crates/` is line 14, `docs/` is line 15, etc. README.md matches both `` and `README.md`; `README.md` is later → wins → @aaltshuler + @ragnorc both assigned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:26:06 +03:00
Andrew Altshuler	c142dafdf3	schema-lint chassis v0: code-tagged diagnostics (MR-694) (#87 ) First slice of the schema-lint chassis. Adds stable `OG-XXX-NNN` codes to schema-migration rejections so operators can suppress, look up, and filter on identifiers rather than free-text prose. Atlas-style chassis adapted to omnigraph's typed-IR substrate (no SQL injection vector, no per-engine locks, native edge/vector/embedding types). What's in v0: - New `omnigraph-compiler/src/lint/` module with: - `diagnostic.rs` — Family / SafetyTier / Severity enums covering ten families (DS, MF, CD, BC, NM, OW, NL, VE, ED, LK). Only DS and MF are populated in this PR. - `codes.rs` — 8 DiagnosticCode constants (OG-DS-101..105, OG-MF-103, OG-MF-104, OG-MF-106). Five of the eight are wired to real emission sites; the other three are reserved. - Unit tests for catalog invariants: codes unique, prefix matches family, suffixes are 3-digit, destructive defaults to error, lookup() works, EMITTED_IN_V0 codes exist in ALL_CODES. - `SchemaMigrationStep::UnsupportedChange` gains an optional `code: Option<String>` field. New `unsupported_error_message()` helper prefixes the message with `[code]` when present. - 5 of 17 existing rejection paths now carry codes: - `removing node type` → OG-DS-102 - `removing edge type` → OG-DS-103 - `removing property` → OG-DS-104 - `adding required property without backfill` → OG-MF-103 - `changing property type` → OG-MF-106 Remaining 12 paths carry `code: None` and are tagged as future work. - `schema_apply` surfaces the formatted error (with `[code]` prefix); CLI `omnigraph schema plan` renders the code on the `unsupported change on <entity>` line. - PR #62 destructive-rejection tests in `tests/schema_apply.rs` now assert on the stable code (`msg.contains("OG-DS-104")`) instead of the error-message substring. 11/11 tests pass. - New `docs/schema-lint.md` documents the v0 catalog + the 10 families + Atlas prior art. AGENTS.md index updated. What's explicitly NOT in v0 (subsequent PRs): - No severity config in `omnigraph.yaml` (MR-694 §2). - No `@allow(OG-XXX-NNN, "rationale")` suppression directive (§3). - No `--allow-data-loss` flag or destructive-tier enforcement. - No new `SchemaMigrationStep` variants (soft/hard drops, default, widen/narrow). MR-700, MR-697 land those. - No pre-migration checks (MR-941). - No CD / VE / LK / NM family rules (MR-942..945). - No CI integration (MR-946). Tests: 235 compiler tests, 11 schema_apply integration tests, 14 lint module tests, 55 CLI tests — all green. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:08:18 +03:00
Ragnor Comerford	24c0558180	docs: lead AGENTS.md first principle with integrated-over-time framing Reframes the first-principle section to lead with Winters' "engineering is programming integrated over time" as the lens, keeping "minimize ongoing liability" as the operative directive and folding in "complexity should be earned." Adds a new Tiebreakers subsection with two rules that the prior section lacked clean appeals for: - correctness > simplicity > performance (lexicographic) - reversibility shapes evidence demand (reversible → prod metrics over napkin math over RFCs; irreversible → RFC up-front) Adds a Hyrum's-Law deny-list entry in both AGENTS.md and docs/invariants.md §IX: shipping observable behavior is shipping a contract, even when undocumented. Net always-on context cost: ~7 lines. No renumbering of §I–VIII invariants; Hyrum's Law lands in the deny-list to avoid breaking back-references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:27:24 -07:00
Ragnor Comerford	3bd072c917	docs: add docs/transactions.md — branch-as-transaction explainer (#69 ) The architectural rule "no cross-query BEGIN/COMMIT; branches fill that role" lives in docs/invariants.md §VI.23 but is not surfaced anywhere user-facing. New users coming from Postgres/MySQL hit the gap when they realize multiple queries on main are independently atomic, not jointly atomic. This page explains the model with worked examples: * Single-query multi-statement (atomic by default) * Two separate queries on main (NOT atomic — common surprise) * Many queries via a branch (atomic at merge) * Coordinating multiple agents via branch-per-agent Plus a comparison table to BEGIN/COMMIT, failure-mode rundown, and "when to use what" decision matrix. Linked from AGENTS.md "Where to find each topic" between branches-commits.md and runs.md. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:35:57 +03:00
Devin AI	7a338a8223	agents: keep guide short for context	2026-05-10 14:41:02 +00:00
Devin AI	a42d178119	release: prepare omnigraph 0.4.2	2026-05-10 14:02:28 +00:00
Ragnor Comerford	6bd9f1c085	agents: rules 8 and 9 — test-first for bug fixes; correct by design Add two durable engineering rules to the maintenance contract so they load into context on every session: - Rule 8: write a regression test that reproduces the bug first, confirm it fails, land it just before the fix commit so the red→green pair is visible in git log. A reviewer can check out the test commit alone and reproduce the failure. - Rule 9: when a bug surfaces, identify the root cause and make the fix correct by construction. Don't patch the symptom. If the design admits the bug class, close the class — don't add a guard around the latest instance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:26:34 +02:00
Ragnor Comerford	fcb47620d3	mr-686: bundle PR 0/1a/1b foundation + PR 2 catalog/schema_source ArcSwap Bundles the working-tree state from the prior session (PR 0 bench harness, PR 1a audit_actor_id removal, PR 1b WriteQueueManager + writer integration) together with the first half of PR 2's interior-mutability foundation (catalog and schema_source wrapped in Arc<ArcSwap<...>>). The two streams intermix in 7 of the same files, so splitting via git add -p was impractical. Subsequent PR 2 steps land as separate atomic commits. PR 0 — server-level concurrent /change bench harness - crates/omnigraph-server/examples/bench_concurrent_http.rs (new) - .context/bench-results/{baseline-main,after-pr1}/ (gitignored) PR 1a — drop the audit_actor_id field, thread per-call - removed Omnigraph::audit_actor_id and the swap-restore patterns in mutation.rs, merge.rs, loader/mod.rs - actor_id: Option<&str> threaded through MutationStaging::finalize, mutate_with_current_actor, ingest_with_current_actor, branch_merge_impl, branch_merge_on_current_target, commit_prepared_updates*, record_merge_commit, commit_updates_on_branch_with_expected - apply_schema and ensure_indices_for_branch pass None (system-attributed) PR 1b — per-(table_key, branch) write queue + revalidation + sidecar - new crates/omnigraph/src/db/write_queue.rs with WriteQueueManager, acquire/acquire_many, sorted+deduped acquisition; 6 unit tests - Arc<WriteQueueManager> field on Omnigraph + db.write_queue() accessor - MutationStaging::finalize split into stage_all (Phase A, no queue) and StagedMutation::commit_all (Phase B, acquire_many + revalidate pins + sidecar + commit_staged); guards held across publisher - delete-only mutations now emit recovery sidecars; revalidation extended to inline_committed tables - branch_merge_on_current_target, apply_schema_with_lock, and ensure_indices_for_branch acquire per-table queues for their touched tables PR 2 Step B (partial) — catalog and schema_source via ArcSwap - catalog: Catalog -> Arc<ArcSwap<Catalog>> - schema_source: String -> Arc<ArcSwap<String>> - public accessors return Arc<Catalog> / Arc<String>; readers bind locally where the borrow has to outlive an expression - new pub(crate) store_catalog / store_schema_source helpers replace the field assignments in apply_schema and reload_schema_if_source_changed - 117 tests across lifecycle/end_to_end/branching/runs pass; engine lib + workspace compile clean Coordinator wrap (Mutex) and the &mut self -> &self engine API conversion follow in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:22:38 +02:00
Ragnor Comerford	05e52f2ee0	recovery: rename composite test, strip ticket references, address review Three bundled changes: 1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and the test function). OmniGraph is consumed by both humans and agents - naming the test after one audience misframes the library. 2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and review-round labels from source, tests, and docs added by this branch. Internal traceability belongs in commit messages and PR descriptions, not in checked-in artifacts. Upstream lance-format/lance issue refs and pre-existing MR-XXX refs in docs not touched by this branch are left alone. 3. Two outstanding review findings addressed: - `needs_index_work_node` / `needs_index_work_edge`: propagate `count_rows` errors instead of `unwrap_or(0)`. Silently treating transient I/O failures as "0 rows" risked skipping a table from the recovery sidecar pin set that was actually about to be modified. - `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`: strengthen the assertion to fail when sidecar B classifies under a stale snapshot. The new assertion checks post-recovery Lance HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar deleted + audit rows present" pair passed in both the bug and fix paths because both delete the sidecar and write an audit row; the differentiator is the post-recovery HEAD. Strengthening the assertion exposed an additional nuance: in this overlapping- sidecar scenario sidecar B's audit kind is RolledBack (no-op) rather than RolledForward, since sidecar A's roll-forward publishes Lance HEAD as the new manifest pin (absorbing B's work). The docstring now explains why this is correct given current `roll_forward_all` semantics. All workspace tests pass with --features failpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 13:56:36 +02:00
Ragnor Comerford	932334ba01	recovery: document MR-847 ship across all reference docs (Phase 10) Update the doc surface to reflect MR-847 having shipped end to end — sidecar protocol, classifier, all-or-nothing decision tree, roll-forward via ManifestBatchPublisher, roll-back via Dataset::restore with fragment-set short-circuit, audit trail in _graph_commit_recoveries.lance, OpenMode::{ReadWrite, ReadOnly}, and the four migrated writers all carrying sidecars across Phase B → Phase C. - docs/invariants.md §VI.23: change from "upheld at the writer-trait surface for inserts/updates/etc., per-table commit_staged → manifest publish window remains" to "upheld at the writer-trait surface AND across process boundaries". The MR-847 sweep closes the residual on the next Omnigraph::open. The "continuous in-process" property (no ExpectedVersionMismatch surfacing to subsequent writers between Phase B failure and process restart) is honest follow-up at MR-856. - docs/runs.md: replace "Finalize → publisher residual" section with "Open-time recovery sweep (MR-847)" — describes the sidecar protocol lifecycle (Phases A-D), the sweep's classifier + decision dispatch, the audit trail, and the operator-facing query (omnigraph commit list --filter actor=omnigraph:recovery). - AGENTS.md capability matrix "Atomic single-dataset commits" row: drop the "Layer (3) is not yet shipped — tracked in MR-847" caveat; describe the three layers as all shipping; reference MR-856 for the background-reconciler follow-up. - docs/storage.md: add _graph_commit_recoveries.lance and __recovery/{ulid}.json to the on-disk layout (mermaid + prose). - docs/branches-commits.md: new "Recovery audit trail (MR-847)" subsection describing the join from _graph_commits.lance:actor_id="omnigraph:recovery" to _graph_commit_recoveries.lance:graph_commit_id for operator post-mortem. - docs/maintenance.md: note the MR-847 recovery floor on cleanup — --keep < 3 may garbage-collect Lance versions the recovery sweep needs as a rollback target. Default --keep 10 is safe. - docs/testing.md: add tests/recovery.rs to the engine integration-test table; expand the failpoints.rs row to mention the four MR-847 per-writer Phase B → recovery integration tests. - .context/mr-847-design.md: prepend a "Status: DONE" stanza listing every commit hash + scope across phases 1-10. AGENTS.md ↔ docs/ cross-link check passes (26 links, 26 docs). Full workspace test sweep passes with --features failpoints (361 tests across 20 binaries). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 00:46:24 +02:00
Ragnor Comerford	8726ffe0a3	release: bump version to 0.4.1	2026-05-02 23:20:50 +02:00
Ragnor Comerford	5afde54d69	agents: stop overclaiming atomic multi-table publish — describe the three layers honestly External reviewer flagged that the capability matrix's "Atomic multi-dataset publish" cell implied Lance gives us a single primitive for cross-table atomicity. It doesn't. The real contract is three layers stacked: (1) per-table Lance `commit_staged` for the data write (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering (3) recovery-on-open reconciler for the residual gap between (1) and (2) — NOT YET SHIPPED, tracked in MR-847. Until MR-847 lands, a failure between per-table `commit_staged` and the manifest publish leaves drift on the partially-committed tables (the "Phase B → Phase C residual" documented in `docs/runs.md`). Also enumerate the legacy inline-commit residuals (`append_batch`, `merge_insert_batches`, `overwrite_batch`, `create_*_index`) alongside `delete_where` and `create_vector_index` — they remain on the trait pending Phase 1b call-site conversion + Phase 9 demotion. End the row with an explicit DO NOT: future agents reading the capability matrix should not describe atomicity as "fully upheld" until MR-847 ships. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 19:35:34 +02:00
Ragnor Comerford	9b0920b5da	address PR #70 bot review (Cubic + Cursor): 7 inline + failpoint test + invariants notes Cubic findings: * `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write` / `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` / `Dataset::add_columns` / `update_columns` / `drop_columns` / `truncate_table` / `restore` and the bare `.merge_insert(` / `.add_columns(` / `.update_columns(` / `.drop_columns(` / `.truncate_table(` method patterns. Deliberately avoid `.append(` / `.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`, arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and `graph_coordinator.rs` — they're manifest-layer infra that legitimately uses `Dataset::write` for system tables. * `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not `None`) to `open_dataset_head_for_write` for consistency with the sibling `indexed_tables` block. Schema apply rejects non-main branches at the lock-acquire step today, so behavior is unchanged; this is a defensive consistency fix that survives a future relaxation of the lock check. * `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime claim; actually returns `Vec<StagedWrite>` (cloned). Fixed. * `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module doc: softened the "stage_* + commit_staged are the only paths" / "trait funnels every write" overclaim. Inline-commit residuals (`delete_where`, `create_vector_index`) remain on the trait pending upstream Lance work (#6658, #6666); legacy `append_batch` etc. remain pending Phase 1b / Phase 9. Module doc now describes the current transitional state honestly. Cursor Bugbot findings: * `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle` but returned only `DeleteState`, dropping the post-delete dataset. Future callers migrating from the inherent `&mut Dataset` API would lose the post-delete dataset state needed for indexing / `table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)` matching `append_batch` / `overwrite_batch` shape. * `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and the unused `Scanner` import (the marker existed only to suppress an unused-import warning — fixing the import is the cleaner answer). Engine-level Phase A failpoint test (closes the partial-criterion flagged in Cubic's acceptance-criteria checklist): * `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")` between `stage_create_btree_index` and `commit_staged`. * `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`: triggers the failpoint via a schema-apply that adds a new node type; proves that existing tables are unaffected (Person mutation succeeds after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD drift on tables outside the failed `added_tables` iteration. `docs/invariants.md` transitional notes: * §VI.23 (atomicity per query): annotated as upheld at the writer-trait surface for inserts / updates / scalar-index builds / merge_insert / overwrite after MR-793 PR #70. Per-table commit_staged → manifest publish window remains; closing requires MR-847's recovery-on-open reconciler. `delete_where` and `create_vector_index` remain inline pending lance#6658 / #6666. * §VII.35 (reconciler pattern): annotated as partial — staged primitives are the building blocks; the reconciler task itself is MR-848. * §VIII.45 (reference impl per trait): `TableStorage` has its primary impl on `TableStore` with opaque-handle signatures; no test impl yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 18:47:07 +02:00
Ragnor Comerford	b87be5e9f0	agents: read every Lance page even slightly relevant, not just the obvious match Behavior is interlocked across Lance pages — transactions reference index lifecycle, index lifecycle references compaction, compaction references row-id lineage. Skipping a "slightly relevant" page is how alignment misses happen. The index alone is not a substitute for reading the pages. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 17:44:41 +02:00
Ragnor Comerford	17bf978d0e	MR-793 follow-up: lance docs alignment audit + mandate full-page fetch via mdrip * AGENTS.md / docs/lance.md: agents must use `npx mdrip` (not summarizing WebFetch) when consulting Lance docs. WebFetch routinely drops load-bearing details — `pub(crate)` blockers, sub-specs behind nav hubs, default flags. Lesson learned during the MR-793 alignment audit. * docs/lance.md: add "Last alignment audit: 2026-05-02" stanza documenting MemWAL gap, lance#6666 companion ticket, stable-row-ID status (experimental, may unblock MR-848), FRI as documented compaction-friendly alternative. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 17:41:32 +02:00
Ragnor Comerford	3135ff5d19	MR-793 phases 1-6: TableStorage trait + staged-write surface for engine writers Hoists Lance's stage+commit two-phase write pattern from "discipline at each writer" to a sealed trait surface (`TableStorage`). New engine code that needs to advance Lance HEAD MUST go through `stage_` + `commit_staged`; the trait's opaque `SnapshotHandle` / `StagedHandle` types keep `lance::Dataset` and `lance::Transaction` out of trait signatures. Phases landed (see .context/mr-793-design.md for the full plan): 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait, sealed (only in-tree types can impl), single impl on `TableStore` delegating to existing inherent methods; `Omnigraph::storage()` accessor returns `&dyn TableStorage`. * 2: three new staged primitives — `stage_overwrite`, `stage_create_btree_index`, `stage_create_inverted_index` — implementing the simple branch of Lance's `CreateIndexBuilder::execute` (scalar indices only; vector indices stay inline because `build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0). Six new tests in `tests/staged_writes.rs` pin both the new primitives and the inline residuals (`delete_where`, `create_vector_index`). * 3: `tests/forbidden_apis.rs` — defense-in-depth integration test walks engine source, fails on direct lance::* inline-commit API use outside `table_store.rs` / `db/manifest/`. Skips comment lines and honors `// forbidden-api-allow:` sentinels. * 4: `ensure_indices` migration — scalar index builds now route through `stage_create__index` + `commit_staged` instead of `create__index(&mut Dataset)`. Vector indices stay inline (residual, named honestly at the call site). * 5: `branch_merge::publish_rewritten_merge_table` migration — the merge_insert phase now uses `stage_merge_insert` + `commit_staged`; delete phase stays inline (Lance #6658 residual, named honestly). * 6: `schema_apply` rewritten_tables migration — non-empty rewrites use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay inline because `InsertBuilder::execute_uncommitted` rejects empty data. The narrow inline window is bounded by `__schema_apply_lock__`. Verified-green test surface: * `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests (incl. 6 new staged_writes tests + the new forbidden_apis test). * `cargo test -p omnigraph-engine --features failpoints --test failpoints` — 5 tests, all green. * `cargo test --workspace` — green. Deferred to follow-up sessions (see design doc §17 split): * Phase 1b — convert remaining engine call sites to `&dyn TableStorage` (mostly READS that don't touch the staged-write invariant). * Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C residual across process restarts; new subsystem). * Phase 8 — index-coverage reconciler (full §VII.35 compliance — removes synchronous index work from the publish path). * Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)` (depends on Phase 1b). Lance upstream blockers documented: * lance-format/lance#6658 — two-phase delete API (open, no PRs). * Companion: `build_index_metadata_from_segments` should be `pub` so vector-index builds can be staged outside the lance crate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-02 11:03:15 +02:00
Ragnor Comerford	a61e82f47a	MR-794 step 2: docs — runs/invariants/architecture/execution + cleanup Refresh user-facing and agent-facing docs for the staged-write rewire and clean up stale Run-state-machine references that survived MR-771. MR-794-specific updates: * docs/runs.md — remove "Known limitation: mid-query partial failure" section; document the in-memory accumulator + D₂ rule + the LoadMode::Overwrite residual. * docs/invariants.md §VI.25 — flip from aspirational/open to upheld for inserts/updates. Within-query read-your-writes is now load-bearing for the publisher CAS contract. * docs/architecture.md — add "Mutation atomicity — in-memory accumulator (MR-794)" subsection with per-op flow; refresh the engine + state diagrams to drop RunRegistry and add MutationStaging. * docs/execution.md — rewrite the mutation flow sequence diagram for the staged-write path; updated the LoadMode table to call out per-mode commit semantics; rewrote load vs ingest. * docs/query-language.md — document the D₂ parse-time rule. * docs/errors.md — add the D₂ BadRequest rejection path. * docs/testing.md — extend the runs.rs row to cover the new MR-794 contract tests; add the staged_writes.rs row. * docs/releases/v0.4.1.md (new) — release note covering the rewire, test additions, residuals, and files changed. * AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query description and the L2 capability matrix row. Stale-reference cleanup (MR-771 leftovers): * docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance from the layout diagram and prose; mark legacy. * docs/branches-commits.md — move __run__<id> to a legacy note; remove publish_run from the publish-trigger list. * docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as); legacy RunRecord.actor_id moved to a historical note. * docs/constants.md — mark run registry / branch-prefix rows as legacy. * docs/cli.md — replace the legacy omnigraph run * quickstart block with omnigraph commit list/show. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:43:19 +02:00
Ragnor Comerford	35be20cb05	MR-771: demote Run to direct-publish via expected_table_versions CAS mutate_as and load now write directly to target tables and call the publisher once at the end with per-table expected versions; the Run state machine, _graph_runs.lance writers, __run__ staging branches, and server /runs/* endpoints are removed. Multi-statement mutations remain atomic at the manifest level via an in-memory MutationStaging accumulator that gives read-your-writes within a query and a single publish at the end. Concurrent-writer conflicts surface as ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the old DivergentUpdate merge shape. Documents one known limitation in docs/runs.md: a multi-statement mid-query failure where op-N writes a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the manifest until a follow-up introduces per-table Lance branches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-30 08:52:50 +02:00
Claude	63babede4a	AGENTS.md: reframe 'minimize ongoing liability' as a general decision lens The previous bullets read like a migration pattern (centralized dispatcher, one match arm, no shape forks). That's one application, not the principle. Reframe it as a bidirectional decision lens: ask "which option has lower ongoing cost over time?" and let the answer be more code, less code, DRYing, duplication, removal, addition, a new abstraction, or flattening one — whichever shape converges over the expected change horizon. Add explicit examples of cases where the lower-liability option is more code (dispatcher, migration framework, typed error variants) and where it's less (premature abstractions, "just in case" paths, helpers that wedge independently-evolving callers together) so readers don't collapse the principle into "minimize code".	2026-04-29 12:32:30 +00:00
Claude	b6a596670a	AGENTS.md: add 'minimize ongoing liability' as a first principle The always-on rules are concretizations of a broader engineering posture that wasn't stated explicitly. Add a short section that frames the spirit behind those rules: - One centralized detection point, not many heal hooks. - One dispatcher, not branch-on-shape in every consumer. - One canonical shape after migration, not forks on "old vs new". - Three similar lines beats a premature abstraction. - Delete dead paths when their last caller leaves. Plus a forward-looking review prompt ("what do these paths look like after 5 more changes like this?") so the principle bites at design time, not just at review time. The internal-schema-version mechanism we just shipped is a concrete application: one stamp + one dispatcher + one match arm per change, no heal hooks scattered across the engine. Codify the pattern so future work doesn't drift back to ad-hoc.	2026-04-29 11:48:47 +00:00
Ragnor Comerford	6f25c4f9f8	Address reviewer feedback (Cursor + cubic) on PR #60 All eight comments verified against source and applied: - AGENTS.md: pull @docs/{invariants,lance,testing}.md imports out of the markdown blockquote. Claude Code's @-import parser expects @ at column 0; the leading "> " of a blockquote silently broke recognition, so the claimed auto-include did nothing. (Cursor, Medium severity.) - docs/cli-reference.md: command-family count 13 → 17. The current enum Command in crates/omnigraph-cli/src/main.rs has 17 top-level variants. (cubic P2.) - docs/ci.md: Homebrew tap update is a regular `git push`, not a force-push (release.yml:117 is `git push origin HEAD:main`). (cubic P2.) - docs/errors.md: add the Storage variant to the NanoError list — it exists at error.rs:88-89 but the doc enumerated only 10 of 11. (cubic P2.) - docs/storage.md: clarify tombstone semantics. There is no tombstone_version column; state.rs:180 reads the tombstone version from the table_version column on rows where object_type = table_tombstone. (cubic P2.) - docs/branches-commits.md: split the GraphCommit pseudo-struct from the underlying storage. actor_id is joined in-memory from _graph_commit_actors.lance, not a column on _graph_commits.lance. (cubic P2.) - docs/schema-language.md: rename IR_VERSION to SCHEMA_IR_VERSION to match the actual constant name in catalog/schema_ir.rs:11. (cubic P3.) - docs/testing.md: engine integration test count 16 → 15 (matches `ls crates/omnigraph/tests/*.rs`). (cubic P3.) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-29 00:09:06 +02:00
Ragnor Comerford	ada58ccd7b	Make "check existing coverage first" a top-level testing principle The original docs/testing.md mentioned finding existing tests as step 1 of the checklist but never explicitly said "if existing coverage already addresses your case, extend it; don't duplicate." Adds a prominent "First principle" section that names extend-vs-new as the preferred outcome and lists three duplicated init_and_load blocks as the most common form of test rot. Adds an extra checklist item: verify your change makes an existing test fail before it makes a new one pass — if you can break the code without breaking a test, that coverage gap is the bug to fix first. Strengthens the AGENTS.md callout so the principle ("always check what already covers it") is in scope from the top of every session. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-29 00:03:50 +02:00
Ragnor Comerford	8be0e6a067	Add docs/testing.md as required-read every session Maps the test surface (engine integration tests by area, CLI/server tests, helpers harness, fixtures, failpoints feature, RustFS S3 integration, OpenAPI drift) and gives a before-every-task checklist: find existing tests for the area, run them as a clean baseline, plan the new test up front, reuse helpers, mind the layer boundary per invariants §VII.33. Notes that there's no coverage tooling today — coverage knowledge comes from reading and running the relevant integration tests, not a tarpaulin/codecov report. Threaded into AGENTS.md as the third required-reading file alongside invariants.md and lance.md, with a Claude-Code @-import so agents load it on every turn. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 23:55:21 +02:00
Ragnor Comerford	a06d8bcf82	Promote docs/lance.md to required-read every session Adds @docs/lance.md alongside @docs/invariants.md so the Lance index loads on every turn (Claude Code @-import; explicit-open instruction for other agents). Reframes the directive from "when you hit a Lance-shaped problem" to "consult before every task to identify which upstream pages are relevant." The Lance docs are the authoritative source for substrate behavior, so reasoning about them should start every change rather than be triggered conditionally. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-28 23:50:42 +02:00

1 2

56 commits