omnigraph/docs/user/concepts/storage.md

116 lines
11 KiB
Markdown
Raw Normal View History

# Storage
## L1 — Lance dataset (per node/edge type)
Every node type and every edge type is its own Lance dataset:
- **Columnar Arrow storage**: each property is a column; nullable per Arrow schema.
- **Fragments**: data is partitioned into fragments; new writes create new fragments.
- **Manifest versioning**: every commit produces a new dataset version; old versions remain readable.
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
- **Stable row IDs**: stable row IDs are enabled on every Lance dataset OmniGraph creates — node and edge data tables, `__manifest`, the commit-graph datasets, and any future system tables. This is an architectural invariant: the flag is one-way at dataset create, so a future change that introduces a Lance dataset must preserve it. Consequences: `_row_created_at_version` and `_row_last_updated_at_version` are available on every dataset (load-bearing for change-feed validators); indices survive `omnigraph optimize`. Pre-0.4.x graphs created before this code path settled may have datasets without the flag and cannot be retrofitted in place — the supported path is dump-and-reload. The rewrite path used by `schema_apply` preserves the flag.
- **Append / delete / `merge_insert`**: native Lance write modes.
- **Per-dataset branches** (Lance native): copy-on-write at the dataset level.
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
- **Object-store agnostic**: file://, s3://, gs://, az://, http (read-only via Lance) — OmniGraph wires file:// and s3://.
## L2 — Multi-dataset coordination via `__manifest`
OmniGraph is **not** a single Lance dataset; it is a *graph* of datasets coordinated through one append-only manifest table.
- **Manifest table**: `__manifest/` Lance dataset.
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
- **Layout**:
- `nodes/{fnv1a64-hex(type_name)}` — one Lance dataset per node type
- `edges/{fnv1a64-hex(edge_type_name)}` — one Lance dataset per edge type
feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299) * docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1) 3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener unification (pinned-base opener, shared Session, write-local handle cache, strict-op conflict-timing move) to step 5. The greptile comments on the #295 spec were moot for #298 (none of those constructs were built) but are load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not RefCell); (2) the strict-op timing move needs an explicit retry contract — txn discarded after any commit, retry re-opens a fresh base — which is the SAME contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test must advance HEAD externally then assert pinned-base, not the trivial HEAD==base. * feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7) Graph lineage no longer lives in a second write to _graph_commits.lance. Each commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest merge-insert as the table-version rows (one atomic version), and CommitGraph reads its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is no longer written commit rows (it remains only as a Lance branch-ref carrier). Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor, merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside the publisher retry loop the parent is resolved per attempt from the just-loaded branch-scoped manifest (the should_replace_head winner over the visible graph_commit rows — branch-correct by Lance branch isolation; the graph_head row is written for forward-compat + the §7.1 contention point but is not the parent source, so a freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry re-reads the advanced head → correct new parent; the commit_id is stable across retries. Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race): - manifest→commit-graph atomicity (no crash window between manifest + lineage), - commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write commit_graph.refresh() is gone). Recovery, branch-merge, and genesis route their lineage through the same CAS (merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write). The dead _graph_commits write helpers (append_commit/_merge/_actor) are #[allow(dead_code)] (the actor sidecar table is still enumerated by optimize). Verified (sequential): build clean; the new lineage_projection gate (manifest-only — _graph_commits/_actors have 0 rows; full lineage reconstructs via the projection); branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/ changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra write). invariants.md marks both gaps CLOSED. PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage 5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 + migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full doc-sync of storage.md/architecture.md/writes.md. * feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4) The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in __manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade seamlessly and not break reads. - Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the 3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors, emits one graph_commit row per commit + exactly one graph_head:<branch> for the head (should_replace_head winner, deterministic id-sort — no hash-map-order in migration output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per branch: migrate_on_open backfills main; load_publish_state backfills each branch on its first write (root_uri/branch threaded through migrate_internal_schema). - v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY open of an un-migrated graph reads correct history with no write. Correctness catch: the legacy _graph_commit_actors.lance was never branched, so the fallback reads it FLAT (no branch checkout) while checking out the branch only on the commits dataset. - Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the same upgrade error (future-proofing the next format bump; the write path already refused via migrate_internal_schema). - Docs: storage/architecture/writes/invariants/constants updated to manifest-stored lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break, data preserved, upgrade writers first). 6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite + failpoints (59) + cargo test --workspace --locked green; check-agents-md passes. * test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7) Two (or N) writers committing disjoint tables on one branch still share the mutable `graph_head:<branch>` manifest row, so the only row-level CAS contention is that row. The contract — exactly one writer wins each CAS round; the loser retries inside the publisher, re-resolves its parent off the freshly-advanced head, and re-commits, so every writer lands and the graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance test. This adds it. - concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits share a parent, the head is the unique non-parent). - n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint writers each with an app-level retry loop (the publisher's internal budget can be exhausted under contention); all converge to one linear chain of 8. - concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real object store (true conditional-put CAS), bucket-gated. Cites both tests from the §7.1 contention note in invariants.md. Test-only; no production change. * perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2) Each lineage publish scanned `__manifest` twice: `load_publish_state` read table state via one scan, then `resolve_lineage_rows` did a second full `read_graph_lineage` scan only to find the parent commit. Fold the `graph_commit` extraction into the existing scan. - `read_manifest_scan` gains a `collect_lineage` flag. The publish path (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the table-state hot path leaves them in the forward-compat skip arm, so it never pays the O(commits) lineage JSON decode (it also skips reading the `object_id` column entirely). One shared `decode_graph_commit_row` serves both the folded path and the standalone `read_graph_lineage`, so the two cannot drift. - `resolve_lineage_rows` is now sync and takes the already-parsed rows; the per-attempt re-read is preserved because `load_publish_state` runs once per CAS attempt, so a retry still re-parents off the advanced head. - `load_publish_state` returns a named `LoadedPublishState` instead of a four-tuple; the thin `read_registered_table_locations` / `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes `#[cfg(test)]`: the fold removes its last production caller, leaving only the test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`), so gating it keeps it from becoming dead code in non-test builds. Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads 54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips if a publish-path scan is re-added, and its calibration comment is corrected. * test(engine): red — transient legacy-open failure silently completes the v3→v4 migration A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in `_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into `__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY `Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or corrupt open during the one-time migration backfills nothing and still stamps v4 — orphaning the real lineage permanently (the migration runs once; the v3 fallback is then disabled). Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found Lance error at the legacy open, and a fault-injection regression test in the `failpoints` binary. Against the current swallow the migration completes anyway, so the test fails on its "migration must abort" assertion — the predicted symptom. The fix follows in the next commit. Test support reachable from the `failpoints` integration binary (it compiles the crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release builds. Failpoint tests stay in the integration binary because the fail registry is process-global. * fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration `read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache (`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The v3→v4 internal-schema migration reads this once before stamping internal-schema v4; a transient or corrupt open therefore backfilled nothing and stamped v4 anyway, orphaning the graph's real lineage permanently (the migration runs once, and the stamp-gated v3 fallback is disabled at v4). This is the "no silent failures" deny-list violation, and realistic on object storage. Both opens now match the not-found variants — Lance maps an object-store NotFound to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and propagate anything else as a loud error. The two arms share the variant contract but carry different rationale (commits-absent is the legitimate empty signal; actor-sidecar-absent is benign, but a corrupt actor open silently wiping authorship before stamping v4 is the same loss hole), commented at each site. Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant` guard (turns red if a Lance bump changes the absence variant) and greens the fault-injection regression test from the previous commit. * test(engine): cover the per-branch v3→v4 migration against a real Lance branch `seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one) to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the production per-branch migration path — `read_legacy_commit_cache` checking out a real Lance branch, and a branch-scoped `__manifest` — was never exercised. Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits main's stripped v3 state), and a test that migrates the BRANCH and asserts the branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit, `graph_head:feature` → branch commit, parents + actors intact) with main's `__manifest` untouched. This empirically resolves the open question behind the merge robustness work: the fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but `__manifest` is Lance-branched per graph-branch, so a branch reads only its own lineage — the test confirms migrating one branch does not leak into another. No branch filter is needed. * refactor(engine): type the lineage-backfill merge conflict via the publisher classifier `state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest` merge-insert) stringified its `execute_reader` error, discarding the Lance variant. Route it through the publisher's `map_lance_publish_error` (now `pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's own retry consumes — one vocabulary, no raw-Lance matching in the migration. Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that classifier also matches `CommitConflict`/`RetryableCommitConflict` from the compaction commit path, which a row-level merge-insert never emits. Cross-linked with a comment at both sites. Behavior-preserving: the only path that changes is the error TYPE on a CAS loss (previously an opaque `Lance` string, now a typed conflict); no success/failure outcome changes. The bounded re-open retry that consumes the new type lands next. * test(engine): red — concurrent v3→v4 migrations error instead of converging `migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it. `merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no app-level retry, so when two processes open the same legacy graph at once the backfill's row-level CAS loser errors the whole open instead of converging. The test opens two `__manifest` handles at the same pre-migration (v3, empty-lineage) HEAD and runs both `migrate_internal_schema` calls under `tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the current code the loser fails with `RowLevelCasContention` ("Attempted 0 retries.") — the predicted symptom — so the "both must converge" assertion panics. The bounded re-open retry that makes both converge lands next. * fix(engine): make the v3→v4 lineage backfill converge under concurrent runners `migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes (or open-for-write handles) opening the same legacy graph at once both reach the backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the row-level CAS loser error the whole open instead of converging. Two contention points, both now handled all-or-nothing: 1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss re-opens the manifest past the winner's (atomic) commit and re-loops; the fast-path re-read then sees the winner's lineage and stamps. On budget exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's OUTER retry loop completes it. The retry decision reuses the publisher's `is_retryable_publish_conflict` so the two stay in lockstep. 2. The terminal stamp bump. Making the merge loser converge newly lets BOTH runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the merge loop doesn't catch it). This surfaced only under the concurrent full-suite run, not the isolated test. Both write the SAME value, so the conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp already reached the target, succeeds; else re-applies (bounded). Greens the race test from the previous commit (3x isolated, 5x full-suite, no flake). The new `IncompatibleTransaction` match is pinned by `lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`. * fix(engine): refuse a future internal-schema stamp on the branch read path `load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp — `< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection — but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be misread by the projection rather than rejected. Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`) right after the branch stamp is read, mirroring the main read path's `refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole: migrations run main-first (main migrates on open; each branch on its first write), so main's stamp is always >= every branch's and the main path refuses first. The guard closes the gap if that ordering invariant is ever weakened. Tested by force-stamping a real branch past CURRENT and asserting the branch read refuses with the upgrade error (the test misreads via the projection — returns Ok — without the guard, confirmed by removing it). * docs(rfc-013): record the v3→v4 migration robustness fixes invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded re-open retry on the merge CAS + idempotent stamp bump), and that the branch read path refuses a `> CURRENT` stamp. lance.md: note the two new surface guards the migration depends on (`dataset_open_missing_returns_not_found_variant`, `lance_error_incompatible_transaction_variant_exists`). testing.md: note the migration fault-injection test in the failpoints row. * refactor: remove dead code and silence warnings across engine + cluster Dead-code sweep follow-up to the RFC-013 stack. No behavior change. - engine: delete the orphaned `validate_edge_cardinality` — the load path uses `validate_edge_cardinality_with_pending_loader` for every mode (including Overwrite, which it treats as the replacement table image), so the old standalone validator had no caller — and correct its sibling's now-stale doc reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline- commit residual kept only for recovery test setup, with no non-test caller. - cluster: drop unused imports in `lib.rs`, delete the unused `ClusterStore::payload_display`, and raise `LiveGraphObservation` / `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions that return them. Both lib crates now build warning-free. * fix(engine): match Lance's typed DatasetAlreadyExists, not the message string The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and `db/recovery_audit.rs` classified the "already exists" race by `err.to_string().contains("Dataset already exists")` — a Lance display string, not an API contract. A wording change upstream would silently break the fallback (a re-create would error instead of opening the existing table). Match the typed `lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline as the v3→v4 migration's not-found classifier — pinned by the new `lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists` guard so a Lance rename turns red instead of silently regressing. * refactor(engine): consolidate now_micros into one crate::db helper Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit, graph_coordinator, manifest/graph) had already drifted: three mapped the clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest` variant), and repoint the eight call sites at `crate::db::now_micros()`. No test asserts on the failure message, so unifying the variant is behavior-safe; the timestamp-mapping contract can no longer fork across the rows it stamps. * refactor(engine): drop the dead snapshot param from roll_back_sidecar `roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with `let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a recovery-commit lineage row), so the snapshot is never read to decide whether to skip a publish. Remove the parameter, the two call-site arguments, and the suppressor. A signature must not advertise inputs it does not consume. The `Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and `record_audit_recovery_rollforward` still take it. * test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the `_graph_commits.lance` branch ref is a derived artifact. An interrupted fork-reclaim or a `cleanup` race can drop that derived ref while the manifest lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail a logical read of the lineage. This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in `__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then asserts the branch reads (`open_at_branch` / list-commits / `merge_base`) succeed from `__manifest` while a write that needs the derived ref (`create_branch`) fails loudly with the typed actionable error. Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on the missing ref errors `OmniError::Lance` (Lance "Not found: _graph_commits.lance/tree/feature/_versions"), wedging the logical read. * fix(engine): read manifest lineage independent of the derived _graph_commits ref `CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the `_graph_commits.lance` branch ref before reading lineage — so a missing derived ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's commit-list / merge-base / snapshot resolution even though the lineage is readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a derived/physical artifact failing a logical read — invariants 7 and 15. Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`). `open_at_branch` and `refresh` check out the derived ref best-effort: a typed not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read re-syncs from `__manifest`; any other open error still propagates. The manifest existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`, so a truly absent branch still fails loudly at the manifest. `create_branch` (the only writer that forks a ref) and the folded-in version lookup return a loud, actionable error on `None`, deferring repair to `cleanup`'s existing orphan reconciler rather than inlining a write on a read-side refresh. Reads (`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle. Greens the wedge regression from the preceding commit. * fix(engine): v3→v4 retry loops return retryable contention on exhaustion `commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET` (6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST iteration's `IncompatibleTransaction` fell through to `Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the intended `RowLevelCasContention`, and the post-loop contention return was dead code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`, so under sustained concurrent v3→v4 migration the one-time stamp bump could fail instead of converging, defeating the idempotency the migration is supposed to add. Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the retryable variant is always handled inside the loop (re-open + same-value check + retry), so it can never reach the stringifying catch-all, and the post-loop is the SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)` arm now catches only genuine non-contention errors. Apply the same range alignment to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its `Err(err)` returns the already-typed contention — but it carried the identical off-by-one structure the stamp loop was copied from; aligning both stops the next copy from re-introducing it). Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent winner stamps the same value, so the re-read returns Ok on the first retry — so a new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to lose, driving exhaustion deterministically. Against the pre-fix loop the new `v4_stamp_exhaustion_returns_retryable_contention` test goes red with `Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it asserts the typed `RowLevelCasContention`. Found by automated review on #299. * feat(engine): minimum-supported internal-schema floor + retirement tripwire The internal-schema migration chain (`migrate_internal_schema`) had a too-new ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers it needs stay forever — the pile grows by one migration + readers + tests every schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as the oldest stamp this binary opens; raising it is how the chain sheds old code. Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported` checking both bounds, so the floor lands at all three stamp-enforcement sites — the write-path migrate dispatcher, the read-only open guard, and the branch lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A hand-wired floor twin would have had to touch each site, and the branch-read path is easy to miss; one combined guard cannot half-enforce. Rename the read-only wrapper `refuse_if_internal_schema_unsupported` to match. A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE == MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the now-dead migration arm (or vice versa) — stronger than a runtime test, impossible to skip, and it doubles as the use that keeps the mirror const live. Tests: a sub-floor graph is refused in both open modes (twin of `future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly [MIN, CURRENT]. No behavior change for any real graph. The retirement runbook lives on the `MIN_SUPPORTED` doc-comment + invariants.md. * fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit Three review-surfaced fixes on the RFC-013 Phase 7 path. Publisher retry vs migration contention: `publish()` propagated a `load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish instead of being retried — only `merge_rows` conflicts hit the retry. This contradicted the migration's own design, which returns that typed error EXPECTING the publisher to re-run the load (by which point a concurrent winner has usually finished the migration, so the next scan is a no-op). Route a retryable load error through the same retry path as a retryable `merge_rows` conflict. Regression test (failpoints): a one-shot retryable contention injected into `load_publish_state` now commits via the retry; red without the fix (the write fails with the injected contention). Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can advance `graph_head` past the recovery commit between the winner's publish and this read — attributing the audit to a later, wrong commit. Use the latest `RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is the recovery commit by construction. The audit's actor was already correct (it comes from `sidecar.actor_id`, not the commit). Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward` (removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to delete the sidecar.
2026-06-25 13:55:34 +02:00
- `__manifest/` — the catalog of all sub-tables and their published versions, **and** the graph commit lineage (RFC-013 Phase 7)
- `_graph_commits.lance` / `_graph_commit_actors.lance` — legacy / branch-ref carriers. Since RFC-013 Phase 7 the graph lineage lives in `__manifest` (`graph_commit` / `graph_head` rows, written in the publish CAS); `_graph_commits.lance` no longer receives commit rows, but is retained to carry the Lance branch refs that `create_branch` / `list_branches` / the `cleanup` orphan reconciler operate on. A graph created before Phase 7 (internal schema v3) keeps its lineage here until its first read-write open, which migrates it into `__manifest` via `migrate_v3_to_v4`.
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
- (legacy `_graph_runs.lance` / `_graph_run_actors.lance` from pre-v0.4.0 graphs are inert; the run state machine was removed. The internal schema migration sweeps stale `__run__*` branches on first write-open; the inert dataset bytes themselves remain until a prefix-delete storage primitive lands)
- **Manifest row schema** (`object_id, object_type, location, metadata, base_objects, table_key, table_version, table_branch, row_count`):
feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299) * docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1) 3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener unification (pinned-base opener, shared Session, write-local handle cache, strict-op conflict-timing move) to step 5. The greptile comments on the #295 spec were moot for #298 (none of those constructs were built) but are load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not RefCell); (2) the strict-op timing move needs an explicit retry contract — txn discarded after any commit, retry re-opens a fresh base — which is the SAME contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test must advance HEAD externally then assert pinned-base, not the trivial HEAD==base. * feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7) Graph lineage no longer lives in a second write to _graph_commits.lance. Each commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest merge-insert as the table-version rows (one atomic version), and CommitGraph reads its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is no longer written commit rows (it remains only as a Lance branch-ref carrier). Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor, merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside the publisher retry loop the parent is resolved per attempt from the just-loaded branch-scoped manifest (the should_replace_head winner over the visible graph_commit rows — branch-correct by Lance branch isolation; the graph_head row is written for forward-compat + the §7.1 contention point but is not the parent source, so a freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry re-reads the advanced head → correct new parent; the commit_id is stable across retries. Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race): - manifest→commit-graph atomicity (no crash window between manifest + lineage), - commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write commit_graph.refresh() is gone). Recovery, branch-merge, and genesis route their lineage through the same CAS (merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write). The dead _graph_commits write helpers (append_commit/_merge/_actor) are #[allow(dead_code)] (the actor sidecar table is still enumerated by optimize). Verified (sequential): build clean; the new lineage_projection gate (manifest-only — _graph_commits/_actors have 0 rows; full lineage reconstructs via the projection); branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/ changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra write). invariants.md marks both gaps CLOSED. PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage 5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 + migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full doc-sync of storage.md/architecture.md/writes.md. * feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4) The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in __manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade seamlessly and not break reads. - Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the 3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors, emits one graph_commit row per commit + exactly one graph_head:<branch> for the head (should_replace_head winner, deterministic id-sort — no hash-map-order in migration output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per branch: migrate_on_open backfills main; load_publish_state backfills each branch on its first write (root_uri/branch threaded through migrate_internal_schema). - v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY open of an un-migrated graph reads correct history with no write. Correctness catch: the legacy _graph_commit_actors.lance was never branched, so the fallback reads it FLAT (no branch checkout) while checking out the branch only on the commits dataset. - Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the same upgrade error (future-proofing the next format bump; the write path already refused via migrate_internal_schema). - Docs: storage/architecture/writes/invariants/constants updated to manifest-stored lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break, data preserved, upgrade writers first). 6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite + failpoints (59) + cargo test --workspace --locked green; check-agents-md passes. * test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7) Two (or N) writers committing disjoint tables on one branch still share the mutable `graph_head:<branch>` manifest row, so the only row-level CAS contention is that row. The contract — exactly one writer wins each CAS round; the loser retries inside the publisher, re-resolves its parent off the freshly-advanced head, and re-commits, so every writer lands and the graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance test. This adds it. - concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits share a parent, the head is the unique non-parent). - n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint writers each with an app-level retry loop (the publisher's internal budget can be exhausted under contention); all converge to one linear chain of 8. - concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real object store (true conditional-put CAS), bucket-gated. Cites both tests from the §7.1 contention note in invariants.md. Test-only; no production change. * perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2) Each lineage publish scanned `__manifest` twice: `load_publish_state` read table state via one scan, then `resolve_lineage_rows` did a second full `read_graph_lineage` scan only to find the parent commit. Fold the `graph_commit` extraction into the existing scan. - `read_manifest_scan` gains a `collect_lineage` flag. The publish path (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the table-state hot path leaves them in the forward-compat skip arm, so it never pays the O(commits) lineage JSON decode (it also skips reading the `object_id` column entirely). One shared `decode_graph_commit_row` serves both the folded path and the standalone `read_graph_lineage`, so the two cannot drift. - `resolve_lineage_rows` is now sync and takes the already-parsed rows; the per-attempt re-read is preserved because `load_publish_state` runs once per CAS attempt, so a retry still re-parents off the advanced head. - `load_publish_state` returns a named `LoadedPublishState` instead of a four-tuple; the thin `read_registered_table_locations` / `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes `#[cfg(test)]`: the fold removes its last production caller, leaving only the test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`), so gating it keeps it from becoming dead code in non-test builds. Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads 54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips if a publish-path scan is re-added, and its calibration comment is corrected. * test(engine): red — transient legacy-open failure silently completes the v3→v4 migration A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in `_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into `__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY `Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or corrupt open during the one-time migration backfills nothing and still stamps v4 — orphaning the real lineage permanently (the migration runs once; the v3 fallback is then disabled). Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found Lance error at the legacy open, and a fault-injection regression test in the `failpoints` binary. Against the current swallow the migration completes anyway, so the test fails on its "migration must abort" assertion — the predicted symptom. The fix follows in the next commit. Test support reachable from the `failpoints` integration binary (it compiles the crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release builds. Failpoint tests stay in the integration binary because the fail registry is process-global. * fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration `read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache (`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The v3→v4 internal-schema migration reads this once before stamping internal-schema v4; a transient or corrupt open therefore backfilled nothing and stamped v4 anyway, orphaning the graph's real lineage permanently (the migration runs once, and the stamp-gated v3 fallback is disabled at v4). This is the "no silent failures" deny-list violation, and realistic on object storage. Both opens now match the not-found variants — Lance maps an object-store NotFound to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and propagate anything else as a loud error. The two arms share the variant contract but carry different rationale (commits-absent is the legitimate empty signal; actor-sidecar-absent is benign, but a corrupt actor open silently wiping authorship before stamping v4 is the same loss hole), commented at each site. Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant` guard (turns red if a Lance bump changes the absence variant) and greens the fault-injection regression test from the previous commit. * test(engine): cover the per-branch v3→v4 migration against a real Lance branch `seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one) to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the production per-branch migration path — `read_legacy_commit_cache` checking out a real Lance branch, and a branch-scoped `__manifest` — was never exercised. Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits main's stripped v3 state), and a test that migrates the BRANCH and asserts the branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit, `graph_head:feature` → branch commit, parents + actors intact) with main's `__manifest` untouched. This empirically resolves the open question behind the merge robustness work: the fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but `__manifest` is Lance-branched per graph-branch, so a branch reads only its own lineage — the test confirms migrating one branch does not leak into another. No branch filter is needed. * refactor(engine): type the lineage-backfill merge conflict via the publisher classifier `state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest` merge-insert) stringified its `execute_reader` error, discarding the Lance variant. Route it through the publisher's `map_lance_publish_error` (now `pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's own retry consumes — one vocabulary, no raw-Lance matching in the migration. Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that classifier also matches `CommitConflict`/`RetryableCommitConflict` from the compaction commit path, which a row-level merge-insert never emits. Cross-linked with a comment at both sites. Behavior-preserving: the only path that changes is the error TYPE on a CAS loss (previously an opaque `Lance` string, now a typed conflict); no success/failure outcome changes. The bounded re-open retry that consumes the new type lands next. * test(engine): red — concurrent v3→v4 migrations error instead of converging `migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it. `merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no app-level retry, so when two processes open the same legacy graph at once the backfill's row-level CAS loser errors the whole open instead of converging. The test opens two `__manifest` handles at the same pre-migration (v3, empty-lineage) HEAD and runs both `migrate_internal_schema` calls under `tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the current code the loser fails with `RowLevelCasContention` ("Attempted 0 retries.") — the predicted symptom — so the "both must converge" assertion panics. The bounded re-open retry that makes both converge lands next. * fix(engine): make the v3→v4 lineage backfill converge under concurrent runners `migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes (or open-for-write handles) opening the same legacy graph at once both reach the backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the row-level CAS loser error the whole open instead of converging. Two contention points, both now handled all-or-nothing: 1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss re-opens the manifest past the winner's (atomic) commit and re-loops; the fast-path re-read then sees the winner's lineage and stamps. On budget exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's OUTER retry loop completes it. The retry decision reuses the publisher's `is_retryable_publish_conflict` so the two stay in lockstep. 2. The terminal stamp bump. Making the merge loser converge newly lets BOTH runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the merge loop doesn't catch it). This surfaced only under the concurrent full-suite run, not the isolated test. Both write the SAME value, so the conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp already reached the target, succeeds; else re-applies (bounded). Greens the race test from the previous commit (3x isolated, 5x full-suite, no flake). The new `IncompatibleTransaction` match is pinned by `lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`. * fix(engine): refuse a future internal-schema stamp on the branch read path `load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp — `< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection — but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be misread by the projection rather than rejected. Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`) right after the branch stamp is read, mirroring the main read path's `refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole: migrations run main-first (main migrates on open; each branch on its first write), so main's stamp is always >= every branch's and the main path refuses first. The guard closes the gap if that ordering invariant is ever weakened. Tested by force-stamping a real branch past CURRENT and asserting the branch read refuses with the upgrade error (the test misreads via the projection — returns Ok — without the guard, confirmed by removing it). * docs(rfc-013): record the v3→v4 migration robustness fixes invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded re-open retry on the merge CAS + idempotent stamp bump), and that the branch read path refuses a `> CURRENT` stamp. lance.md: note the two new surface guards the migration depends on (`dataset_open_missing_returns_not_found_variant`, `lance_error_incompatible_transaction_variant_exists`). testing.md: note the migration fault-injection test in the failpoints row. * refactor: remove dead code and silence warnings across engine + cluster Dead-code sweep follow-up to the RFC-013 stack. No behavior change. - engine: delete the orphaned `validate_edge_cardinality` — the load path uses `validate_edge_cardinality_with_pending_loader` for every mode (including Overwrite, which it treats as the replacement table image), so the old standalone validator had no caller — and correct its sibling's now-stale doc reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline- commit residual kept only for recovery test setup, with no non-test caller. - cluster: drop unused imports in `lib.rs`, delete the unused `ClusterStore::payload_display`, and raise `LiveGraphObservation` / `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions that return them. Both lib crates now build warning-free. * fix(engine): match Lance's typed DatasetAlreadyExists, not the message string The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and `db/recovery_audit.rs` classified the "already exists" race by `err.to_string().contains("Dataset already exists")` — a Lance display string, not an API contract. A wording change upstream would silently break the fallback (a re-create would error instead of opening the existing table). Match the typed `lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline as the v3→v4 migration's not-found classifier — pinned by the new `lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists` guard so a Lance rename turns red instead of silently regressing. * refactor(engine): consolidate now_micros into one crate::db helper Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit, graph_coordinator, manifest/graph) had already drifted: three mapped the clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest` variant), and repoint the eight call sites at `crate::db::now_micros()`. No test asserts on the failure message, so unifying the variant is behavior-safe; the timestamp-mapping contract can no longer fork across the rows it stamps. * refactor(engine): drop the dead snapshot param from roll_back_sidecar `roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with `let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a recovery-commit lineage row), so the snapshot is never read to decide whether to skip a publish. Remove the parameter, the two call-site arguments, and the suppressor. A signature must not advertise inputs it does not consume. The `Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and `record_audit_recovery_rollforward` still take it. * test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the `_graph_commits.lance` branch ref is a derived artifact. An interrupted fork-reclaim or a `cleanup` race can drop that derived ref while the manifest lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail a logical read of the lineage. This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in `__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then asserts the branch reads (`open_at_branch` / list-commits / `merge_base`) succeed from `__manifest` while a write that needs the derived ref (`create_branch`) fails loudly with the typed actionable error. Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on the missing ref errors `OmniError::Lance` (Lance "Not found: _graph_commits.lance/tree/feature/_versions"), wedging the logical read. * fix(engine): read manifest lineage independent of the derived _graph_commits ref `CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the `_graph_commits.lance` branch ref before reading lineage — so a missing derived ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's commit-list / merge-base / snapshot resolution even though the lineage is readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a derived/physical artifact failing a logical read — invariants 7 and 15. Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`). `open_at_branch` and `refresh` check out the derived ref best-effort: a typed not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read re-syncs from `__manifest`; any other open error still propagates. The manifest existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`, so a truly absent branch still fails loudly at the manifest. `create_branch` (the only writer that forks a ref) and the folded-in version lookup return a loud, actionable error on `None`, deferring repair to `cleanup`'s existing orphan reconciler rather than inlining a write on a read-side refresh. Reads (`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle. Greens the wedge regression from the preceding commit. * fix(engine): v3→v4 retry loops return retryable contention on exhaustion `commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET` (6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST iteration's `IncompatibleTransaction` fell through to `Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the intended `RowLevelCasContention`, and the post-loop contention return was dead code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`, so under sustained concurrent v3→v4 migration the one-time stamp bump could fail instead of converging, defeating the idempotency the migration is supposed to add. Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the retryable variant is always handled inside the loop (re-open + same-value check + retry), so it can never reach the stringifying catch-all, and the post-loop is the SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)` arm now catches only genuine non-contention errors. Apply the same range alignment to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its `Err(err)` returns the already-typed contention — but it carried the identical off-by-one structure the stamp loop was copied from; aligning both stops the next copy from re-introducing it). Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent winner stamps the same value, so the re-read returns Ok on the first retry — so a new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to lose, driving exhaustion deterministically. Against the pre-fix loop the new `v4_stamp_exhaustion_returns_retryable_contention` test goes red with `Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it asserts the typed `RowLevelCasContention`. Found by automated review on #299. * feat(engine): minimum-supported internal-schema floor + retirement tripwire The internal-schema migration chain (`migrate_internal_schema`) had a too-new ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers it needs stay forever — the pile grows by one migration + readers + tests every schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as the oldest stamp this binary opens; raising it is how the chain sheds old code. Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported` checking both bounds, so the floor lands at all three stamp-enforcement sites — the write-path migrate dispatcher, the read-only open guard, and the branch lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A hand-wired floor twin would have had to touch each site, and the branch-read path is easy to miss; one combined guard cannot half-enforce. Rename the read-only wrapper `refuse_if_internal_schema_unsupported` to match. A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE == MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the now-dead migration arm (or vice versa) — stronger than a runtime test, impossible to skip, and it doubles as the use that keeps the mirror const live. Tests: a sub-floor graph is refused in both open modes (twin of `future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly [MIN, CURRENT]. No behavior change for any real graph. The retirement runbook lives on the `MIN_SUPPORTED` doc-comment + invariants.md. * fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit Three review-surfaced fixes on the RFC-013 Phase 7 path. Publisher retry vs migration contention: `publish()` propagated a `load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish instead of being retried — only `merge_rows` conflicts hit the retry. This contradicted the migration's own design, which returns that typed error EXPECTING the publisher to re-run the load (by which point a concurrent winner has usually finished the migration, so the next scan is a no-op). Route a retryable load error through the same retry path as a retryable `merge_rows` conflict. Regression test (failpoints): a one-shot retryable contention injected into `load_publish_state` now commits via the retry; red without the fix (the write fails with the injected contention). Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can advance `graph_head` past the recovery commit between the winner's publish and this read — attributing the audit to a later, wrong commit. Use the latest `RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is the recovery commit by construction. The audit's actor was already correct (it comes from `sidecar.actor_id`, not the commit). Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward` (removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to delete the sidecar.
2026-06-25 13:55:34 +02:00
- `object_type``table | table_version | table_tombstone | graph_commit | graph_head`
- `table_key``node:<TypeName> | edge:<EdgeName>` (empty for `graph_commit` / `graph_head` lineage rows)
- `table_branch` is `null` for the main lineage and the branch name otherwise
feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299) * docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1) 3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener unification (pinned-base opener, shared Session, write-local handle cache, strict-op conflict-timing move) to step 5. The greptile comments on the #295 spec were moot for #298 (none of those constructs were built) but are load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not RefCell); (2) the strict-op timing move needs an explicit retry contract — txn discarded after any commit, retry re-opens a fresh base — which is the SAME contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test must advance HEAD externally then assert pinned-base, not the trivial HEAD==base. * feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7) Graph lineage no longer lives in a second write to _graph_commits.lance. Each commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest merge-insert as the table-version rows (one atomic version), and CommitGraph reads its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is no longer written commit rows (it remains only as a Lance branch-ref carrier). Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor, merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside the publisher retry loop the parent is resolved per attempt from the just-loaded branch-scoped manifest (the should_replace_head winner over the visible graph_commit rows — branch-correct by Lance branch isolation; the graph_head row is written for forward-compat + the §7.1 contention point but is not the parent source, so a freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry re-reads the advanced head → correct new parent; the commit_id is stable across retries. Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race): - manifest→commit-graph atomicity (no crash window between manifest + lineage), - commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write commit_graph.refresh() is gone). Recovery, branch-merge, and genesis route their lineage through the same CAS (merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write). The dead _graph_commits write helpers (append_commit/_merge/_actor) are #[allow(dead_code)] (the actor sidecar table is still enumerated by optimize). Verified (sequential): build clean; the new lineage_projection gate (manifest-only — _graph_commits/_actors have 0 rows; full lineage reconstructs via the projection); branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/ changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra write). invariants.md marks both gaps CLOSED. PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage 5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 + migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full doc-sync of storage.md/architecture.md/writes.md. * feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4) The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in __manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade seamlessly and not break reads. - Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the 3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors, emits one graph_commit row per commit + exactly one graph_head:<branch> for the head (should_replace_head winner, deterministic id-sort — no hash-map-order in migration output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per branch: migrate_on_open backfills main; load_publish_state backfills each branch on its first write (root_uri/branch threaded through migrate_internal_schema). - v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY open of an un-migrated graph reads correct history with no write. Correctness catch: the legacy _graph_commit_actors.lance was never branched, so the fallback reads it FLAT (no branch checkout) while checking out the branch only on the commits dataset. - Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the same upgrade error (future-proofing the next format bump; the write path already refused via migrate_internal_schema). - Docs: storage/architecture/writes/invariants/constants updated to manifest-stored lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break, data preserved, upgrade writers first). 6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite + failpoints (59) + cargo test --workspace --locked green; check-agents-md passes. * test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7) Two (or N) writers committing disjoint tables on one branch still share the mutable `graph_head:<branch>` manifest row, so the only row-level CAS contention is that row. The contract — exactly one writer wins each CAS round; the loser retries inside the publisher, re-resolves its parent off the freshly-advanced head, and re-commits, so every writer lands and the graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance test. This adds it. - concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits share a parent, the head is the unique non-parent). - n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint writers each with an app-level retry loop (the publisher's internal budget can be exhausted under contention); all converge to one linear chain of 8. - concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real object store (true conditional-put CAS), bucket-gated. Cites both tests from the §7.1 contention note in invariants.md. Test-only; no production change. * perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2) Each lineage publish scanned `__manifest` twice: `load_publish_state` read table state via one scan, then `resolve_lineage_rows` did a second full `read_graph_lineage` scan only to find the parent commit. Fold the `graph_commit` extraction into the existing scan. - `read_manifest_scan` gains a `collect_lineage` flag. The publish path (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the table-state hot path leaves them in the forward-compat skip arm, so it never pays the O(commits) lineage JSON decode (it also skips reading the `object_id` column entirely). One shared `decode_graph_commit_row` serves both the folded path and the standalone `read_graph_lineage`, so the two cannot drift. - `resolve_lineage_rows` is now sync and takes the already-parsed rows; the per-attempt re-read is preserved because `load_publish_state` runs once per CAS attempt, so a retry still re-parents off the advanced head. - `load_publish_state` returns a named `LoadedPublishState` instead of a four-tuple; the thin `read_registered_table_locations` / `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes `#[cfg(test)]`: the fold removes its last production caller, leaving only the test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`), so gating it keeps it from becoming dead code in non-test builds. Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads 54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips if a publish-path scan is re-added, and its calibration comment is corrected. * test(engine): red — transient legacy-open failure silently completes the v3→v4 migration A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in `_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into `__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY `Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or corrupt open during the one-time migration backfills nothing and still stamps v4 — orphaning the real lineage permanently (the migration runs once; the v3 fallback is then disabled). Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found Lance error at the legacy open, and a fault-injection regression test in the `failpoints` binary. Against the current swallow the migration completes anyway, so the test fails on its "migration must abort" assertion — the predicted symptom. The fix follows in the next commit. Test support reachable from the `failpoints` integration binary (it compiles the crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release builds. Failpoint tests stay in the integration binary because the fail registry is process-global. * fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration `read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache (`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The v3→v4 internal-schema migration reads this once before stamping internal-schema v4; a transient or corrupt open therefore backfilled nothing and stamped v4 anyway, orphaning the graph's real lineage permanently (the migration runs once, and the stamp-gated v3 fallback is disabled at v4). This is the "no silent failures" deny-list violation, and realistic on object storage. Both opens now match the not-found variants — Lance maps an object-store NotFound to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and propagate anything else as a loud error. The two arms share the variant contract but carry different rationale (commits-absent is the legitimate empty signal; actor-sidecar-absent is benign, but a corrupt actor open silently wiping authorship before stamping v4 is the same loss hole), commented at each site. Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant` guard (turns red if a Lance bump changes the absence variant) and greens the fault-injection regression test from the previous commit. * test(engine): cover the per-branch v3→v4 migration against a real Lance branch `seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one) to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the production per-branch migration path — `read_legacy_commit_cache` checking out a real Lance branch, and a branch-scoped `__manifest` — was never exercised. Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits main's stripped v3 state), and a test that migrates the BRANCH and asserts the branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit, `graph_head:feature` → branch commit, parents + actors intact) with main's `__manifest` untouched. This empirically resolves the open question behind the merge robustness work: the fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but `__manifest` is Lance-branched per graph-branch, so a branch reads only its own lineage — the test confirms migrating one branch does not leak into another. No branch filter is needed. * refactor(engine): type the lineage-backfill merge conflict via the publisher classifier `state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest` merge-insert) stringified its `execute_reader` error, discarding the Lance variant. Route it through the publisher's `map_lance_publish_error` (now `pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's own retry consumes — one vocabulary, no raw-Lance matching in the migration. Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that classifier also matches `CommitConflict`/`RetryableCommitConflict` from the compaction commit path, which a row-level merge-insert never emits. Cross-linked with a comment at both sites. Behavior-preserving: the only path that changes is the error TYPE on a CAS loss (previously an opaque `Lance` string, now a typed conflict); no success/failure outcome changes. The bounded re-open retry that consumes the new type lands next. * test(engine): red — concurrent v3→v4 migrations error instead of converging `migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it. `merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no app-level retry, so when two processes open the same legacy graph at once the backfill's row-level CAS loser errors the whole open instead of converging. The test opens two `__manifest` handles at the same pre-migration (v3, empty-lineage) HEAD and runs both `migrate_internal_schema` calls under `tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the current code the loser fails with `RowLevelCasContention` ("Attempted 0 retries.") — the predicted symptom — so the "both must converge" assertion panics. The bounded re-open retry that makes both converge lands next. * fix(engine): make the v3→v4 lineage backfill converge under concurrent runners `migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes (or open-for-write handles) opening the same legacy graph at once both reach the backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the row-level CAS loser error the whole open instead of converging. Two contention points, both now handled all-or-nothing: 1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss re-opens the manifest past the winner's (atomic) commit and re-loops; the fast-path re-read then sees the winner's lineage and stamps. On budget exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's OUTER retry loop completes it. The retry decision reuses the publisher's `is_retryable_publish_conflict` so the two stay in lockstep. 2. The terminal stamp bump. Making the merge loser converge newly lets BOTH runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the merge loop doesn't catch it). This surfaced only under the concurrent full-suite run, not the isolated test. Both write the SAME value, so the conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp already reached the target, succeeds; else re-applies (bounded). Greens the race test from the previous commit (3x isolated, 5x full-suite, no flake). The new `IncompatibleTransaction` match is pinned by `lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`. * fix(engine): refuse a future internal-schema stamp on the branch read path `load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp — `< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection — but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be misread by the projection rather than rejected. Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`) right after the branch stamp is read, mirroring the main read path's `refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole: migrations run main-first (main migrates on open; each branch on its first write), so main's stamp is always >= every branch's and the main path refuses first. The guard closes the gap if that ordering invariant is ever weakened. Tested by force-stamping a real branch past CURRENT and asserting the branch read refuses with the upgrade error (the test misreads via the projection — returns Ok — without the guard, confirmed by removing it). * docs(rfc-013): record the v3→v4 migration robustness fixes invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded re-open retry on the merge CAS + idempotent stamp bump), and that the branch read path refuses a `> CURRENT` stamp. lance.md: note the two new surface guards the migration depends on (`dataset_open_missing_returns_not_found_variant`, `lance_error_incompatible_transaction_variant_exists`). testing.md: note the migration fault-injection test in the failpoints row. * refactor: remove dead code and silence warnings across engine + cluster Dead-code sweep follow-up to the RFC-013 stack. No behavior change. - engine: delete the orphaned `validate_edge_cardinality` — the load path uses `validate_edge_cardinality_with_pending_loader` for every mode (including Overwrite, which it treats as the replacement table image), so the old standalone validator had no caller — and correct its sibling's now-stale doc reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline- commit residual kept only for recovery test setup, with no non-test caller. - cluster: drop unused imports in `lib.rs`, delete the unused `ClusterStore::payload_display`, and raise `LiveGraphObservation` / `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions that return them. Both lib crates now build warning-free. * fix(engine): match Lance's typed DatasetAlreadyExists, not the message string The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and `db/recovery_audit.rs` classified the "already exists" race by `err.to_string().contains("Dataset already exists")` — a Lance display string, not an API contract. A wording change upstream would silently break the fallback (a re-create would error instead of opening the existing table). Match the typed `lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline as the v3→v4 migration's not-found classifier — pinned by the new `lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists` guard so a Lance rename turns red instead of silently regressing. * refactor(engine): consolidate now_micros into one crate::db helper Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit, graph_coordinator, manifest/graph) had already drifted: three mapped the clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest` variant), and repoint the eight call sites at `crate::db::now_micros()`. No test asserts on the failure message, so unifying the variant is behavior-safe; the timestamp-mapping contract can no longer fork across the rows it stamps. * refactor(engine): drop the dead snapshot param from roll_back_sidecar `roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with `let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a recovery-commit lineage row), so the snapshot is never read to decide whether to skip a publish. Remove the parameter, the two call-site arguments, and the suppressor. A signature must not advertise inputs it does not consume. The `Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and `record_audit_recovery_rollforward` still take it. * test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the `_graph_commits.lance` branch ref is a derived artifact. An interrupted fork-reclaim or a `cleanup` race can drop that derived ref while the manifest lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail a logical read of the lineage. This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in `__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then asserts the branch reads (`open_at_branch` / list-commits / `merge_base`) succeed from `__manifest` while a write that needs the derived ref (`create_branch`) fails loudly with the typed actionable error. Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on the missing ref errors `OmniError::Lance` (Lance "Not found: _graph_commits.lance/tree/feature/_versions"), wedging the logical read. * fix(engine): read manifest lineage independent of the derived _graph_commits ref `CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the `_graph_commits.lance` branch ref before reading lineage — so a missing derived ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's commit-list / merge-base / snapshot resolution even though the lineage is readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a derived/physical artifact failing a logical read — invariants 7 and 15. Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`). `open_at_branch` and `refresh` check out the derived ref best-effort: a typed not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read re-syncs from `__manifest`; any other open error still propagates. The manifest existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`, so a truly absent branch still fails loudly at the manifest. `create_branch` (the only writer that forks a ref) and the folded-in version lookup return a loud, actionable error on `None`, deferring repair to `cleanup`'s existing orphan reconciler rather than inlining a write on a read-side refresh. Reads (`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle. Greens the wedge regression from the preceding commit. * fix(engine): v3→v4 retry loops return retryable contention on exhaustion `commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET` (6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST iteration's `IncompatibleTransaction` fell through to `Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the intended `RowLevelCasContention`, and the post-loop contention return was dead code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`, so under sustained concurrent v3→v4 migration the one-time stamp bump could fail instead of converging, defeating the idempotency the migration is supposed to add. Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the retryable variant is always handled inside the loop (re-open + same-value check + retry), so it can never reach the stringifying catch-all, and the post-loop is the SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)` arm now catches only genuine non-contention errors. Apply the same range alignment to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its `Err(err)` returns the already-typed contention — but it carried the identical off-by-one structure the stamp loop was copied from; aligning both stops the next copy from re-introducing it). Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent winner stamps the same value, so the re-read returns Ok on the first retry — so a new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to lose, driving exhaustion deterministically. Against the pre-fix loop the new `v4_stamp_exhaustion_returns_retryable_contention` test goes red with `Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it asserts the typed `RowLevelCasContention`. Found by automated review on #299. * feat(engine): minimum-supported internal-schema floor + retirement tripwire The internal-schema migration chain (`migrate_internal_schema`) had a too-new ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers it needs stay forever — the pile grows by one migration + readers + tests every schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as the oldest stamp this binary opens; raising it is how the chain sheds old code. Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported` checking both bounds, so the floor lands at all three stamp-enforcement sites — the write-path migrate dispatcher, the read-only open guard, and the branch lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A hand-wired floor twin would have had to touch each site, and the branch-read path is easy to miss; one combined guard cannot half-enforce. Rename the read-only wrapper `refuse_if_internal_schema_unsupported` to match. A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE == MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the now-dead migration arm (or vice versa) — stronger than a runtime test, impossible to skip, and it doubles as the use that keeps the mirror const live. Tests: a sub-floor graph is refused in both open modes (twin of `future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly [MIN, CURRENT]. No behavior change for any real graph. The retirement runbook lives on the `MIN_SUPPORTED` doc-comment + invariants.md. * fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit Three review-surfaced fixes on the RFC-013 Phase 7 path. Publisher retry vs migration contention: `publish()` propagated a `load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish instead of being retried — only `merge_rows` conflicts hit the retry. This contradicted the migration's own design, which returns that typed error EXPECTING the publisher to re-run the load (by which point a concurrent winner has usually finished the migration, so the next scan is a no-op). Route a retryable load error through the same retry path as a retryable `merge_rows` conflict. Regression test (failpoints): a one-shot retryable contention injected into `load_publish_state` now commits via the retry; red without the fix (the write fails with the injected contention). Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can advance `graph_head` past the recovery commit between the winner's publish and this read — attributing the audit to a later, wrong commit. Use the latest `RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is the recovery commit by construction. The audit's actor was already correct (it comes from `sidecar.actor_id`, not the commit). Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward` (removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to delete the sidecar.
2026-06-25 13:55:34 +02:00
- **Graph lineage rows** (RFC-013 Phase 7): one immutable `graph_commit` row per commit (`object_id` = the commit ULID; `metadata` JSON carries parent / merged-parent / actor / timestamp) plus one mutable `graph_head:<branch>` pointer per branch (`graph_head:main` for main). The in-memory commit DAG is a projection of these rows.
Address reviewer feedback (Cursor + cubic) on PR #60 All eight comments verified against source and applied: - AGENTS.md: pull @docs/{invariants,lance,testing}.md imports out of the markdown blockquote. Claude Code's @-import parser expects @ at column 0; the leading "> " of a blockquote silently broke recognition, so the claimed auto-include did nothing. (Cursor, Medium severity.) - docs/cli-reference.md: command-family count 13 → 17. The current enum Command in crates/omnigraph-cli/src/main.rs has 17 top-level variants. (cubic P2.) - docs/ci.md: Homebrew tap update is a regular `git push`, not a force-push (release.yml:117 is `git push origin HEAD:main`). (cubic P2.) - docs/errors.md: add the Storage variant to the NanoError list — it exists at error.rs:88-89 but the doc enumerated only 10 of 11. (cubic P2.) - docs/storage.md: clarify tombstone semantics. There is no tombstone_version column; state.rs:180 reads the tombstone version from the table_version column on rows where object_type = table_tombstone. (cubic P2.) - docs/branches-commits.md: split the GraphCommit pseudo-struct from the underlying storage. actor_id is joined in-memory from _graph_commit_actors.lance, not a column on _graph_commits.lance. (cubic P2.) - docs/schema-language.md: rename IR_VERSION to SCHEMA_IR_VERSION to match the actual constant name in catalog/schema_ir.rs:11. (cubic P3.) - docs/testing.md: engine integration test count 16 → 15 (matches `ls crates/omnigraph/tests/*.rs`). (cubic P3.) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-29 00:09:06 +02:00
- **Snapshot reconstruction**: latest visible `table_version` per `(table_key, table_branch)` minus tombstones — rows where `object_type = table_tombstone`, whose own `table_version` (acting as the tombstone version) is `>= the entry's table_version`.
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
- **Atomic publish**: multi-dataset commits publish so that a single write to `__manifest` flips all the new sub-table versions visible at once.
- **Row-level CAS on the merge-insert join key**: `object_id` carries an unenforced-primary-key annotation so Lance's bloom-filter conflict resolver rejects two concurrent commits that land the same `object_id` row. Without this annotation, Lance's transparent rebase would admit silent duplicates from racing publishers.
- **Optimistic concurrency control on publish**: a publish asserts the manifest's current latest non-tombstoned version for each touched table is exactly what the caller observed; mismatches surface as an `ExpectedVersionMismatch` manifest conflict naming the table and the expected/actual versions. Concurrent advances surface as a conflict rather than being silently rebased through.
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
### Internal schema versioning
Add internal-schema versioning + auto-migration for __manifest The on-disk shape of `__manifest` is reconciled with the binary via a single stamp + dispatcher in `db/manifest/migrations.rs`: - `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes. - The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest dataset's schema-level metadata (Lance `update_schema_metadata`). - `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from the on-disk stamp until it matches the binary, then returns. Idempotent. - `init_manifest_repo` stamps the current version at creation; the publisher's open-for-write path runs pending migrations before reading state. Reads stay side-effect-free. - Forward-version protection: a stamp higher than the binary's known version triggers a clear "upgrade omnigraph first" error so an old binary cannot clobber a newer schema. Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step: the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id` that engages Lance's row-level CAS at commit time. New repos created via `init` are stamped at v2 immediately and don't need migration. Adding a future on-disk shape change is one constant bump, one match arm in `migrate_internal_schema`, and one test — no new branches in unrelated code paths. Code outside the migration module never inspects the stamp. New tests in `manifest/tests.rs`: - `test_init_stamps_internal_schema_version` - `test_publish_migrates_pre_stamp_manifest_to_current_version` - `test_publish_rejects_manifest_stamped_at_future_version` Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated per the AGENTS.md maintenance contract.
2026-04-29 11:44:14 +00:00
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
The on-disk shape of `__manifest` is reconciled with the binary via a single version stamp held in the manifest dataset's schema-level metadata.
Add internal-schema versioning + auto-migration for __manifest The on-disk shape of `__manifest` is reconciled with the binary via a single stamp + dispatcher in `db/manifest/migrations.rs`: - `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes. - The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest dataset's schema-level metadata (Lance `update_schema_metadata`). - `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from the on-disk stamp until it matches the binary, then returns. Idempotent. - `init_manifest_repo` stamps the current version at creation; the publisher's open-for-write path runs pending migrations before reading state. Reads stay side-effect-free. - Forward-version protection: a stamp higher than the binary's known version triggers a clear "upgrade omnigraph first" error so an old binary cannot clobber a newer schema. Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step: the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id` that engages Lance's row-level CAS at commit time. New repos created via `init` are stamped at v2 immediately and don't need migration. Adding a future on-disk shape change is one constant bump, one match arm in `migrate_internal_schema`, and one test — no new branches in unrelated code paths. Code outside the migration module never inspects the stamp. New tests in `manifest/tests.rs`: - `test_init_stamps_internal_schema_version` - `test_publish_migrates_pre_stamp_manifest_to_current_version` - `test_publish_rejects_manifest_stamped_at_future_version` Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated per the AGENTS.md maintenance contract.
2026-04-29 11:44:14 +00:00
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
- **Graph creation** stamps the current version, so newly initialized graphs never need migration.
- **The open-for-write path** migrates the on-disk stamp before reading state. When the stamp matches the binary, this is a single metadata read with no writes; otherwise the migration walks steps forward (1→2, 2→3, …) until the stamp matches, then proceeds with the publish. Reads stay side-effect-free.
Add internal-schema versioning + auto-migration for __manifest The on-disk shape of `__manifest` is reconciled with the binary via a single stamp + dispatcher in `db/manifest/migrations.rs`: - `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes. - The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest dataset's schema-level metadata (Lance `update_schema_metadata`). - `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from the on-disk stamp until it matches the binary, then returns. Idempotent. - `init_manifest_repo` stamps the current version at creation; the publisher's open-for-write path runs pending migrations before reading state. Reads stay side-effect-free. - Forward-version protection: a stamp higher than the binary's known version triggers a clear "upgrade omnigraph first" error so an old binary cannot clobber a newer schema. Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step: the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id` that engages Lance's row-level CAS at commit time. New repos created via `init` are stamped at v2 immediately and don't need migration. Adding a future on-disk shape change is one constant bump, one match arm in `migrate_internal_schema`, and one test — no new branches in unrelated code paths. Code outside the migration module never inspects the stamp. New tests in `manifest/tests.rs`: - `test_init_stamps_internal_schema_version` - `test_publish_migrates_pre_stamp_manifest_to_current_version` - `test_publish_rejects_manifest_stamped_at_future_version` Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated per the AGENTS.md maintenance contract.
2026-04-29 11:44:14 +00:00
- **Forward-version protection**: a stamp *higher* than the binary's known version triggers a clear "upgrade omnigraph first" error. An old binary cannot clobber a newer schema by silently treating "unknown stamp" as "missing stamp".
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
- **Idempotency**: each migration step is safe to re-run. A crash between two metadata updates inside a single step leaves the partial state; the next open re-runs the step and the second update lands.
Add internal-schema versioning + auto-migration for __manifest The on-disk shape of `__manifest` is reconciled with the binary via a single stamp + dispatcher in `db/manifest/migrations.rs`: - `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes. - The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest dataset's schema-level metadata (Lance `update_schema_metadata`). - `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from the on-disk stamp until it matches the binary, then returns. Idempotent. - `init_manifest_repo` stamps the current version at creation; the publisher's open-for-write path runs pending migrations before reading state. Reads stay side-effect-free. - Forward-version protection: a stamp higher than the binary's known version triggers a clear "upgrade omnigraph first" error so an old binary cannot clobber a newer schema. Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step: the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id` that engages Lance's row-level CAS at commit time. New repos created via `init` are stamped at v2 immediately and don't need migration. Adding a future on-disk shape change is one constant bump, one match arm in `migrate_internal_schema`, and one test — no new branches in unrelated code paths. Code outside the migration module never inspects the stamp. New tests in `manifest/tests.rs`: - `test_init_stamps_internal_schema_version` - `test_publish_migrates_pre_stamp_manifest_to_current_version` - `test_publish_rejects_manifest_stamped_at_future_version` Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated per the AGENTS.md maintenance contract.
2026-04-29 11:44:14 +00:00
| Stamp | Shape change |
|---|---|
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
| v1 (implicit, pre-stamp) | `__manifest.object_id` had no PK annotation; no row-level CAS protection. |
| v2 | `__manifest.object_id` carries an unenforced-primary-key annotation; row-level CAS engaged. |
| v3 | One-time sweep of legacy `__run__*` staging branches (pre-v0.4.0 Run state machine, removed) off `__manifest`. Runs at read-write open and on publish. |
Add internal-schema versioning + auto-migration for __manifest The on-disk shape of `__manifest` is reconciled with the binary via a single stamp + dispatcher in `db/manifest/migrations.rs`: - `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes. - The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest dataset's schema-level metadata (Lance `update_schema_metadata`). - `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from the on-disk stamp until it matches the binary, then returns. Idempotent. - `init_manifest_repo` stamps the current version at creation; the publisher's open-for-write path runs pending migrations before reading state. Reads stay side-effect-free. - Forward-version protection: a stamp higher than the binary's known version triggers a clear "upgrade omnigraph first" error so an old binary cannot clobber a newer schema. Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step: the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id` that engages Lance's row-level CAS at commit time. New repos created via `init` are stamped at v2 immediately and don't need migration. Adding a future on-disk shape change is one constant bump, one match arm in `migrate_internal_schema`, and one test — no new branches in unrelated code paths. Code outside the migration module never inspects the stamp. New tests in `manifest/tests.rs`: - `test_init_stamps_internal_schema_version` - `test_publish_migrates_pre_stamp_manifest_to_current_version` - `test_publish_rejects_manifest_stamped_at_future_version` Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated per the AGENTS.md maintenance contract.
2026-04-29 11:44:14 +00:00
docs: add Mermaid architecture diagrams across architecture / storage / execution Replace the single ASCII stack in docs/architecture.md with a hierarchy of Mermaid diagrams that show the system from external context down to the component level. Add an on-disk layout diagram in docs/storage.md and two sequence diagrams (read query, mutation) in docs/execution.md so readers can navigate from "what is OmniGraph" to "how does a query run" without opening source. Static structure (docs/architecture.md): - System context — agents/clients, embedding providers, Cedar, object store. - Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling via classDef, replacing the pre-existing ASCII art. - Component zoom-ins — compiler, engine, storage trait, index lifecycle, server/CLI. Each zoom-in cites file:line entry points. Aspirational shapes (storage trait, full reconciler) are visually marked and pointed at the relevant invariants.md section so readers see the intended seam without thinking it's already implemented. On-disk layout (docs/storage.md): - Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance, _graph_runs.lance, _refs/branches/ down into Lance's per-dataset internals (_versions/, data/, _indices/, _refs/, _transactions/). - Annotated with the actual filenames so readers can `ls` the same paths. - Slots in below the existing __manifest CAS / OCC / migration prose; does not move or rewrite that content. Runtime flows (docs/execution.md): - Read flow sequence: client → Omnigraph::query → typecheck → lower → execute_query → table_store → Lance scanner → RecordBatch stream. - Mutation flow sequence: Omnigraph::mutate → resolve literals → Lance write op (Append / merge_insert) → ManifestRepo::commit → __manifest upsert. - Both diagrams are followed by a "Code paths" block with verified file:line citations so readers can navigate from diagram element to source in one step. Conventions established (this is the first Mermaid in the repo): - L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed. - Diagram size cap ~9 elements; more detail goes in a sub-diagram. - Diagrams paired with prose; code-path citations follow each diagram. - Consistent vocabulary across diagrams: frontend / compiler / engine / storage trait / Lance / object store. No accidental synonyms. Subsequent PRs will add flow diagrams for schema apply, branch + merge, run isolation, index reconcile, and the embedding pipeline in the same conventions.
2026-04-29 16:58:56 +02:00
## On-disk layout
A graph on disk is a directory tree of Lance datasets. Each dataset follows the standard Lance layout (`_versions/`, `data/`, `_indices/`, `_refs/`); OmniGraph adds the multi-dataset coordination by keeping `__manifest/` alongside the per-type datasets.
docs: add Mermaid architecture diagrams across architecture / storage / execution Replace the single ASCII stack in docs/architecture.md with a hierarchy of Mermaid diagrams that show the system from external context down to the component level. Add an on-disk layout diagram in docs/storage.md and two sequence diagrams (read query, mutation) in docs/execution.md so readers can navigate from "what is OmniGraph" to "how does a query run" without opening source. Static structure (docs/architecture.md): - System context — agents/clients, embedding providers, Cedar, object store. - Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling via classDef, replacing the pre-existing ASCII art. - Component zoom-ins — compiler, engine, storage trait, index lifecycle, server/CLI. Each zoom-in cites file:line entry points. Aspirational shapes (storage trait, full reconciler) are visually marked and pointed at the relevant invariants.md section so readers see the intended seam without thinking it's already implemented. On-disk layout (docs/storage.md): - Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance, _graph_runs.lance, _refs/branches/ down into Lance's per-dataset internals (_versions/, data/, _indices/, _refs/, _transactions/). - Annotated with the actual filenames so readers can `ls` the same paths. - Slots in below the existing __manifest CAS / OCC / migration prose; does not move or rewrite that content. Runtime flows (docs/execution.md): - Read flow sequence: client → Omnigraph::query → typecheck → lower → execute_query → table_store → Lance scanner → RecordBatch stream. - Mutation flow sequence: Omnigraph::mutate → resolve literals → Lance write op (Append / merge_insert) → ManifestRepo::commit → __manifest upsert. - Both diagrams are followed by a "Code paths" block with verified file:line citations so readers can navigate from diagram element to source in one step. Conventions established (this is the first Mermaid in the repo): - L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed. - Diagram size cap ~9 elements; more detail goes in a sub-diagram. - Diagrams paired with prose; code-path citations follow each diagram. - Consistent vocabulary across diagrams: frontend / compiler / engine / storage trait / Lance / object store. No accidental synonyms. Subsequent PRs will add flow diagrams for schema apply, branch + merge, run isolation, index reconcile, and the embedding pipeline in the same conventions.
2026-04-29 16:58:56 +02:00
```mermaid
flowchart TB
classDef l1 fill:#fef3e8,stroke:#c46900,color:#000
classDef l2 fill:#e8f4fd,stroke:#1e6aa8,color:#000
graph["graph URI<br/>file:// or s3://bucket/prefix"]:::l2
docs: add Mermaid architecture diagrams across architecture / storage / execution Replace the single ASCII stack in docs/architecture.md with a hierarchy of Mermaid diagrams that show the system from external context down to the component level. Add an on-disk layout diagram in docs/storage.md and two sequence diagrams (read query, mutation) in docs/execution.md so readers can navigate from "what is OmniGraph" to "how does a query run" without opening source. Static structure (docs/architecture.md): - System context — agents/clients, embedding providers, Cedar, object store. - Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling via classDef, replacing the pre-existing ASCII art. - Component zoom-ins — compiler, engine, storage trait, index lifecycle, server/CLI. Each zoom-in cites file:line entry points. Aspirational shapes (storage trait, full reconciler) are visually marked and pointed at the relevant invariants.md section so readers see the intended seam without thinking it's already implemented. On-disk layout (docs/storage.md): - Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance, _graph_runs.lance, _refs/branches/ down into Lance's per-dataset internals (_versions/, data/, _indices/, _refs/, _transactions/). - Annotated with the actual filenames so readers can `ls` the same paths. - Slots in below the existing __manifest CAS / OCC / migration prose; does not move or rewrite that content. Runtime flows (docs/execution.md): - Read flow sequence: client → Omnigraph::query → typecheck → lower → execute_query → table_store → Lance scanner → RecordBatch stream. - Mutation flow sequence: Omnigraph::mutate → resolve literals → Lance write op (Append / merge_insert) → ManifestRepo::commit → __manifest upsert. - Both diagrams are followed by a "Code paths" block with verified file:line citations so readers can navigate from diagram element to source in one step. Conventions established (this is the first Mermaid in the repo): - L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed. - Diagram size cap ~9 elements; more detail goes in a sub-diagram. - Diagrams paired with prose; code-path citations follow each diagram. - Consistent vocabulary across diagrams: frontend / compiler / engine / storage trait / Lance / object store. No accidental synonyms. Subsequent PRs will add flow diagrams for schema apply, branch + merge, run isolation, index reconcile, and the embedding pipeline in the same conventions.
2026-04-29 16:58:56 +02:00
manifest["__manifest/<br/>L2 catalog of sub-tables"]:::l2
nodes["nodes/{fnv1a64-hex}/<br/>one dataset per node type"]:::l2
edges["edges/{fnv1a64-hex}/<br/>one dataset per edge type"]:::l2
recovery: document MR-847 ship across all reference docs (Phase 10) Update the doc surface to reflect MR-847 having shipped end to end — sidecar protocol, classifier, all-or-nothing decision tree, roll-forward via ManifestBatchPublisher, roll-back via Dataset::restore with fragment-set short-circuit, audit trail in _graph_commit_recoveries.lance, OpenMode::{ReadWrite, ReadOnly}, and the four migrated writers all carrying sidecars across Phase B → Phase C. - docs/invariants.md §VI.23: change from "upheld at the writer-trait surface for inserts/updates/etc., per-table commit_staged → manifest publish window remains" to "upheld at the writer-trait surface AND across process boundaries". The MR-847 sweep closes the residual on the next Omnigraph::open. The "continuous in-process" property (no ExpectedVersionMismatch surfacing to subsequent writers between Phase B failure and process restart) is honest follow-up at MR-856. - docs/runs.md: replace "Finalize → publisher residual" section with "Open-time recovery sweep (MR-847)" — describes the sidecar protocol lifecycle (Phases A-D), the sweep's classifier + decision dispatch, the audit trail, and the operator-facing query (omnigraph commit list --filter actor=omnigraph:recovery). - AGENTS.md capability matrix "Atomic single-dataset commits" row: drop the "Layer (3) is not yet shipped — tracked in MR-847" caveat; describe the three layers as all shipping; reference MR-856 for the background-reconciler follow-up. - docs/storage.md: add _graph_commit_recoveries.lance and __recovery/{ulid}.json to the on-disk layout (mermaid + prose). - docs/branches-commits.md: new "Recovery audit trail (MR-847)" subsection describing the join from _graph_commits.lance:actor_id="omnigraph:recovery" to _graph_commit_recoveries.lance:graph_commit_id for operator post-mortem. - docs/maintenance.md: note the MR-847 recovery floor on cleanup — --keep < 3 may garbage-collect Lance versions the recovery sweep needs as a rollback target. Default --keep 10 is safe. - docs/testing.md: add tests/recovery.rs to the engine integration-test table; expand the failpoints.rs row to mention the four MR-847 per-writer Phase B → recovery integration tests. - .context/mr-847-design.md: prepend a "Status: DONE" stanza listing every commit hash + scope across phases 1-10. AGENTS.md ↔ docs/ cross-link check passes (26 links, 26 docs). Full workspace test sweep passes with --features failpoints (361 tests across 20 binaries). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:03 +02:00
cgraph["_graph_commits.lance/<br/>_graph_commit_actors.lance/<br/>_graph_commit_recoveries.lance/"]:::l2
recovery: rename composite test, strip ticket references, address review Three bundled changes: 1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and the test function). OmniGraph is consumed by both humans and agents - naming the test after one audience misframes the library. 2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and review-round labels from source, tests, and docs added by this branch. Internal traceability belongs in commit messages and PR descriptions, not in checked-in artifacts. Upstream lance-format/lance issue refs and pre-existing MR-XXX refs in docs not touched by this branch are left alone. 3. Two outstanding review findings addressed: - `needs_index_work_node` / `needs_index_work_edge`: propagate `count_rows` errors instead of `unwrap_or(0)`. Silently treating transient I/O failures as "0 rows" risked skipping a table from the recovery sidecar pin set that was actually about to be modified. - `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`: strengthen the assertion to fail when sidecar B classifies under a stale snapshot. The new assertion checks post-recovery Lance HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar deleted + audit rows present" pair passed in both the bug and fix paths because both delete the sidecar and write an audit row; the differentiator is the post-recovery HEAD. Strengthening the assertion exposed an additional nuance: in this overlapping- sidecar scenario sidecar B's audit kind is RolledBack (no-op) rather than RolledForward, since sidecar A's roll-forward publishes Lance HEAD as the new manifest pin (absorbing B's work). The docstring now explains why this is correct given current `roll_forward_all` semantics. All workspace tests pass with --features failpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:56:36 +02:00
recovery["__recovery/{ulid}.json<br/>recovery sidecars (transient)"]:::l2
docs: add Mermaid architecture diagrams across architecture / storage / execution Replace the single ASCII stack in docs/architecture.md with a hierarchy of Mermaid diagrams that show the system from external context down to the component level. Add an on-disk layout diagram in docs/storage.md and two sequence diagrams (read query, mutation) in docs/execution.md so readers can navigate from "what is OmniGraph" to "how does a query run" without opening source. Static structure (docs/architecture.md): - System context — agents/clients, embedding providers, Cedar, object store. - Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling via classDef, replacing the pre-existing ASCII art. - Component zoom-ins — compiler, engine, storage trait, index lifecycle, server/CLI. Each zoom-in cites file:line entry points. Aspirational shapes (storage trait, full reconciler) are visually marked and pointed at the relevant invariants.md section so readers see the intended seam without thinking it's already implemented. On-disk layout (docs/storage.md): - Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance, _graph_runs.lance, _refs/branches/ down into Lance's per-dataset internals (_versions/, data/, _indices/, _refs/, _transactions/). - Annotated with the actual filenames so readers can `ls` the same paths. - Slots in below the existing __manifest CAS / OCC / migration prose; does not move or rewrite that content. Runtime flows (docs/execution.md): - Read flow sequence: client → Omnigraph::query → typecheck → lower → execute_query → table_store → Lance scanner → RecordBatch stream. - Mutation flow sequence: Omnigraph::mutate → resolve literals → Lance write op (Append / merge_insert) → ManifestRepo::commit → __manifest upsert. - Both diagrams are followed by a "Code paths" block with verified file:line citations so readers can navigate from diagram element to source in one step. Conventions established (this is the first Mermaid in the repo): - L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed. - Diagram size cap ~9 elements; more detail goes in a sub-diagram. - Diagrams paired with prose; code-path citations follow each diagram. - Consistent vocabulary across diagrams: frontend / compiler / engine / storage trait / Lance / object store. No accidental synonyms. Subsequent PRs will add flow diagrams for schema apply, branch + merge, run isolation, index reconcile, and the embedding pipeline in the same conventions.
2026-04-29 16:58:56 +02:00
refs["_refs/branches/{name}.json<br/>graph-level branches"]:::l2
graph --> manifest
graph --> nodes
graph --> edges
graph --> cgraph
graph --> recovery
graph --> refs
docs: add Mermaid architecture diagrams across architecture / storage / execution Replace the single ASCII stack in docs/architecture.md with a hierarchy of Mermaid diagrams that show the system from external context down to the component level. Add an on-disk layout diagram in docs/storage.md and two sequence diagrams (read query, mutation) in docs/execution.md so readers can navigate from "what is OmniGraph" to "how does a query run" without opening source. Static structure (docs/architecture.md): - System context — agents/clients, embedding providers, Cedar, object store. - Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling via classDef, replacing the pre-existing ASCII art. - Component zoom-ins — compiler, engine, storage trait, index lifecycle, server/CLI. Each zoom-in cites file:line entry points. Aspirational shapes (storage trait, full reconciler) are visually marked and pointed at the relevant invariants.md section so readers see the intended seam without thinking it's already implemented. On-disk layout (docs/storage.md): - Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance, _graph_runs.lance, _refs/branches/ down into Lance's per-dataset internals (_versions/, data/, _indices/, _refs/, _transactions/). - Annotated with the actual filenames so readers can `ls` the same paths. - Slots in below the existing __manifest CAS / OCC / migration prose; does not move or rewrite that content. Runtime flows (docs/execution.md): - Read flow sequence: client → Omnigraph::query → typecheck → lower → execute_query → table_store → Lance scanner → RecordBatch stream. - Mutation flow sequence: Omnigraph::mutate → resolve literals → Lance write op (Append / merge_insert) → ManifestRepo::commit → __manifest upsert. - Both diagrams are followed by a "Code paths" block with verified file:line citations so readers can navigate from diagram element to source in one step. Conventions established (this is the first Mermaid in the repo): - L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed. - Diagram size cap ~9 elements; more detail goes in a sub-diagram. - Diagrams paired with prose; code-path citations follow each diagram. - Consistent vocabulary across diagrams: frontend / compiler / engine / storage trait / Lance / object store. No accidental synonyms. Subsequent PRs will add flow diagrams for schema apply, branch + merge, run isolation, index reconcile, and the embedding pipeline in the same conventions.
2026-04-29 16:58:56 +02:00
subgraph dataset[Inside each Lance dataset — L1]
ds_v["_versions/{n}.manifest<br/>per-dataset versions"]:::l1
ds_data["data/<br/>fragment files (Arrow IPC)"]:::l1
ds_idx["_indices/{uuid}/<br/>BTREE · Inverted FTS · IVF/HNSW"]:::l1
ds_refs["_refs/<br/>per-dataset Lance branches/tags"]:::l1
ds_tx["_transactions/<br/>commit transaction logs"]:::l1
end
nodes -.-> dataset
edges -.-> dataset
manifest -.-> dataset
```
**What's where:**
- **Graph root** is one directory (or S3 prefix). Everything below is part of one OmniGraph graph.
docs: add Mermaid architecture diagrams across architecture / storage / execution Replace the single ASCII stack in docs/architecture.md with a hierarchy of Mermaid diagrams that show the system from external context down to the component level. Add an on-disk layout diagram in docs/storage.md and two sequence diagrams (read query, mutation) in docs/execution.md so readers can navigate from "what is OmniGraph" to "how does a query run" without opening source. Static structure (docs/architecture.md): - System context — agents/clients, embedding providers, Cedar, object store. - Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling via classDef, replacing the pre-existing ASCII art. - Component zoom-ins — compiler, engine, storage trait, index lifecycle, server/CLI. Each zoom-in cites file:line entry points. Aspirational shapes (storage trait, full reconciler) are visually marked and pointed at the relevant invariants.md section so readers see the intended seam without thinking it's already implemented. On-disk layout (docs/storage.md): - Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance, _graph_runs.lance, _refs/branches/ down into Lance's per-dataset internals (_versions/, data/, _indices/, _refs/, _transactions/). - Annotated with the actual filenames so readers can `ls` the same paths. - Slots in below the existing __manifest CAS / OCC / migration prose; does not move or rewrite that content. Runtime flows (docs/execution.md): - Read flow sequence: client → Omnigraph::query → typecheck → lower → execute_query → table_store → Lance scanner → RecordBatch stream. - Mutation flow sequence: Omnigraph::mutate → resolve literals → Lance write op (Append / merge_insert) → ManifestRepo::commit → __manifest upsert. - Both diagrams are followed by a "Code paths" block with verified file:line citations so readers can navigate from diagram element to source in one step. Conventions established (this is the first Mermaid in the repo): - L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed. - Diagram size cap ~9 elements; more detail goes in a sub-diagram. - Diagrams paired with prose; code-path citations follow each diagram. - Consistent vocabulary across diagrams: frontend / compiler / engine / storage trait / Lance / object store. No accidental synonyms. Subsequent PRs will add flow diagrams for schema apply, branch + merge, run isolation, index reconcile, and the embedding pipeline in the same conventions.
2026-04-29 16:58:56 +02:00
- **`__manifest/`** is a Lance dataset whose rows describe which sub-table version is published at which graph-branch. Reading a snapshot starts here.
- **`nodes/`** and **`edges/`** are sibling directories holding one Lance dataset per declared type. Names are `fnv1a64-hex` of the type name to keep paths fixed-length and case-safe.
feat(engine): graph lineage in __manifest — single-source fold, v3→v4 migration, schema-version floor (#299) * docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1) 3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener unification (pinned-base opener, shared Session, write-local handle cache, strict-op conflict-timing move) to step 5. The greptile comments on the #295 spec were moot for #298 (none of those constructs were built) but are load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not RefCell); (2) the strict-op timing move needs an explicit retry contract — txn discarded after any commit, retry re-opens a fresh base — which is the SAME contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test must advance HEAD externally then assert pinned-base, not the trivial HEAD==base. * feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7) Graph lineage no longer lives in a second write to _graph_commits.lance. Each commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest merge-insert as the table-version rows (one atomic version), and CommitGraph reads its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is no longer written commit rows (it remains only as a Lance branch-ref carrier). Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor, merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside the publisher retry loop the parent is resolved per attempt from the just-loaded branch-scoped manifest (the should_replace_head winner over the visible graph_commit rows — branch-correct by Lance branch isolation; the graph_head row is written for forward-compat + the §7.1 contention point but is not the parent source, so a freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry re-reads the advanced head → correct new parent; the commit_id is stable across retries. Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race): - manifest→commit-graph atomicity (no crash window between manifest + lineage), - commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write commit_graph.refresh() is gone). Recovery, branch-merge, and genesis route their lineage through the same CAS (merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write). The dead _graph_commits write helpers (append_commit/_merge/_actor) are #[allow(dead_code)] (the actor sidecar table is still enumerated by optimize). Verified (sequential): build clean; the new lineage_projection gate (manifest-only — _graph_commits/_actors have 0 rows; full lineage reconstructs via the projection); branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/ changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra write). invariants.md marks both gaps CLOSED. PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage 5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 + migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full doc-sync of storage.md/architecture.md/writes.md. * feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4) The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in __manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade seamlessly and not break reads. - Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the 3 => migrate_v3_to_v4 arm. The migration reads this branch's _graph_commits/_actors, emits one graph_commit row per commit + exactly one graph_head:<branch> for the head (should_replace_head winner, deterministic id-sort — no hash-map-order in migration output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per branch: migrate_on_open backfills main; load_publish_state backfills each branch on its first write (root_uri/branch threaded through migrate_internal_schema). - v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY open of an un-migrated graph reads correct history with no write. Correctness catch: the legacy _graph_commit_actors.lance was never branched, so the fallback reads it FLAT (no branch checkout) while checking out the branch only on the commits dataset. - Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the same upgrade error (future-proofing the next format bump; the write path already refused via migrate_internal_schema). - Docs: storage/architecture/writes/invariants/constants updated to manifest-stored lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break, data preserved, upgrade writers first). 6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite + failpoints (59) + cargo test --workspace --locked green; check-agents-md passes. * test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7) Two (or N) writers committing disjoint tables on one branch still share the mutable `graph_head:<branch>` manifest row, so the only row-level CAS contention is that row. The contract — exactly one writer wins each CAS round; the loser retries inside the publisher, re-resolves its parent off the freshly-advanced head, and re-commits, so every writer lands and the graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance test. This adds it. - concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits share a parent, the head is the unique non-parent). - n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint writers each with an app-level retry loop (the publisher's internal budget can be exhausted under contention); all converge to one linear chain of 8. - concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real object store (true conditional-put CAS), bucket-gated. Cites both tests from the §7.1 contention note in invariants.md. Test-only; no production change. * perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2) Each lineage publish scanned `__manifest` twice: `load_publish_state` read table state via one scan, then `resolve_lineage_rows` did a second full `read_graph_lineage` scan only to find the parent commit. Fold the `graph_commit` extraction into the existing scan. - `read_manifest_scan` gains a `collect_lineage` flag. The publish path (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the table-state hot path leaves them in the forward-compat skip arm, so it never pays the O(commits) lineage JSON decode (it also skips reading the `object_id` column entirely). One shared `decode_graph_commit_row` serves both the folded path and the standalone `read_graph_lineage`, so the two cannot drift. - `resolve_lineage_rows` is now sync and takes the already-parsed rows; the per-attempt re-read is preserved because `load_publish_state` runs once per CAS attempt, so a retry still re-parents off the advanced head. - `load_publish_state` returns a named `LoadedPublishState` instead of a four-tuple; the thin `read_registered_table_locations` / `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes `#[cfg(test)]`: the fold removes its last production caller, leaving only the test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`), so gating it keeps it from becoming dead code in non-test builds. Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads 54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips if a publish-path scan is re-added, and its calibration comment is corrected. * test(engine): red — transient legacy-open failure silently completes the v3→v4 migration A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in `_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into `__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY `Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or corrupt open during the one-time migration backfills nothing and still stamps v4 — orphaning the real lineage permanently (the migration runs once; the v3 fallback is then disabled). Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found Lance error at the legacy open, and a fault-injection regression test in the `failpoints` binary. Against the current swallow the migration completes anyway, so the test fails on its "migration must abort" assertion — the predicted symptom. The fix follows in the next commit. Test support reachable from the `failpoints` integration binary (it compiles the crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release builds. Failpoint tests stay in the integration binary because the fail registry is process-global. * fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration `read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache (`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The v3→v4 internal-schema migration reads this once before stamping internal-schema v4; a transient or corrupt open therefore backfilled nothing and stamped v4 anyway, orphaning the graph's real lineage permanently (the migration runs once, and the stamp-gated v3 fallback is disabled at v4). This is the "no silent failures" deny-list violation, and realistic on object storage. Both opens now match the not-found variants — Lance maps an object-store NotFound to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and propagate anything else as a loud error. The two arms share the variant contract but carry different rationale (commits-absent is the legitimate empty signal; actor-sidecar-absent is benign, but a corrupt actor open silently wiping authorship before stamping v4 is the same loss hole), commented at each site. Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant` guard (turns red if a Lance bump changes the absence variant) and greens the fault-injection regression test from the previous commit. * test(engine): cover the per-branch v3→v4 migration against a real Lance branch `seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one) to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the production per-branch migration path — `read_legacy_commit_cache` checking out a real Lance branch, and a branch-scoped `__manifest` — was never exercised. Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits main's stripped v3 state), and a test that migrates the BRANCH and asserts the branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit, `graph_head:feature` → branch commit, parents + actors intact) with main's `__manifest` untouched. This empirically resolves the open question behind the merge robustness work: the fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but `__manifest` is Lance-branched per graph-branch, so a branch reads only its own lineage — the test confirms migrating one branch does not leak into another. No branch filter is needed. * refactor(engine): type the lineage-backfill merge conflict via the publisher classifier `state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest` merge-insert) stringified its `execute_reader` error, discarding the Lance variant. Route it through the publisher's `map_lance_publish_error` (now `pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's own retry consumes — one vocabulary, no raw-Lance matching in the migration. Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that classifier also matches `CommitConflict`/`RetryableCommitConflict` from the compaction commit path, which a row-level merge-insert never emits. Cross-linked with a comment at both sites. Behavior-preserving: the only path that changes is the error TYPE on a CAS loss (previously an opaque `Lance` string, now a typed conflict); no success/failure outcome changes. The bounded re-open retry that consumes the new type lands next. * test(engine): red — concurrent v3→v4 migrations error instead of converging `migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it. `merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no app-level retry, so when two processes open the same legacy graph at once the backfill's row-level CAS loser errors the whole open instead of converging. The test opens two `__manifest` handles at the same pre-migration (v3, empty-lineage) HEAD and runs both `migrate_internal_schema` calls under `tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the current code the loser fails with `RowLevelCasContention` ("Attempted 0 retries.") — the predicted symptom — so the "both must converge" assertion panics. The bounded re-open retry that makes both converge lands next. * fix(engine): make the v3→v4 lineage backfill converge under concurrent runners `migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes (or open-for-write handles) opening the same legacy graph at once both reach the backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the row-level CAS loser error the whole open instead of converging. Two contention points, both now handled all-or-nothing: 1. The backfill merge on `graph_head:<branch>`. Wrap (fast-path re-read → read legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss re-opens the manifest past the winner's (atomic) commit and re-loops; the fast-path re-read then sees the winner's lineage and stamps. On budget exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's OUTER retry loop completes it. The retry decision reuses the publisher's `is_retryable_publish_conflict` so the two stay in lockstep. 2. The terminal stamp bump. Making the merge loser converge newly lets BOTH runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the merge loop doesn't catch it). This surfaced only under the concurrent full-suite run, not the isolated test. Both write the SAME value, so the conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp already reached the target, succeeds; else re-applies (bounded). Greens the race test from the previous commit (3x isolated, 5x full-suite, no flake). The new `IncompatibleTransaction` match is pinned by `lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`. * fix(engine): refuse a future internal-schema stamp on the branch read path `load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp — `< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection — but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be misread by the projection rather than rejected. Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`) right after the branch stamp is read, mirroring the main read path's `refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole: migrations run main-first (main migrates on open; each branch on its first write), so main's stamp is always >= every branch's and the main path refuses first. The guard closes the gap if that ordering invariant is ever weakened. Tested by force-stamping a real branch past CURRENT and asserting the branch read refuses with the upgrade error (the test misreads via the projection — returns Ok — without the guard, confirmed by removing it). * docs(rfc-013): record the v3→v4 migration robustness fixes invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded re-open retry on the merge CAS + idempotent stamp bump), and that the branch read path refuses a `> CURRENT` stamp. lance.md: note the two new surface guards the migration depends on (`dataset_open_missing_returns_not_found_variant`, `lance_error_incompatible_transaction_variant_exists`). testing.md: note the migration fault-injection test in the failpoints row. * refactor: remove dead code and silence warnings across engine + cluster Dead-code sweep follow-up to the RFC-013 stack. No behavior change. - engine: delete the orphaned `validate_edge_cardinality` — the load path uses `validate_edge_cardinality_with_pending_loader` for every mode (including Overwrite, which it treats as the replacement table image), so the old standalone validator had no caller — and correct its sibling's now-stale doc reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline- commit residual kept only for recovery test setup, with no non-test caller. - cluster: drop unused imports in `lib.rs`, delete the unused `ClusterStore::payload_display`, and raise `LiveGraphObservation` / `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions that return them. Both lib crates now build warning-free. * fix(engine): match Lance's typed DatasetAlreadyExists, not the message string The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and `db/recovery_audit.rs` classified the "already exists" race by `err.to_string().contains("Dataset already exists")` — a Lance display string, not an API contract. A wording change upstream would silently break the fallback (a re-create would error instead of opening the existing table). Match the typed `lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline as the v3→v4 migration's not-found classifier — pinned by the new `lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists` guard so a Lance rename turns red instead of silently regressing. * refactor(engine): consolidate now_micros into one crate::db helper Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit, graph_coordinator, manifest/graph) had already drifted: three mapped the clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest` variant), and repoint the eight call sites at `crate::db::now_micros()`. No test asserts on the failure message, so unifying the variant is behavior-safe; the timestamp-mapping contract can no longer fork across the rows it stamps. * refactor(engine): drop the dead snapshot param from roll_back_sidecar `roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with `let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a recovery-commit lineage row), so the snapshot is never read to decide whether to skip a publish. Remove the parameter, the two call-site arguments, and the suppressor. A signature must not advertise inputs it does not consume. The `Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and `record_audit_recovery_rollforward` still take it. * test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the `_graph_commits.lance` branch ref is a derived artifact. An interrupted fork-reclaim or a `cleanup` race can drop that derived ref while the manifest lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail a logical read of the lineage. This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in `__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then asserts the branch reads (`open_at_branch` / list-commits / `merge_base`) succeed from `__manifest` while a write that needs the derived ref (`create_branch`) fails loudly with the typed actionable error. Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on the missing ref errors `OmniError::Lance` (Lance "Not found: _graph_commits.lance/tree/feature/_versions"), wedging the logical read. * fix(engine): read manifest lineage independent of the derived _graph_commits ref `CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the `_graph_commits.lance` branch ref before reading lineage — so a missing derived ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's commit-list / merge-base / snapshot resolution even though the lineage is readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a derived/physical artifact failing a logical read — invariants 7 and 15. Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`). `open_at_branch` and `refresh` check out the derived ref best-effort: a typed not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read re-syncs from `__manifest`; any other open error still propagates. The manifest existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`, so a truly absent branch still fails loudly at the manifest. `create_branch` (the only writer that forks a ref) and the folded-in version lookup return a loud, actionable error on `None`, deferring repair to `cleanup`'s existing orphan reconciler rather than inlining a write on a read-side refresh. Reads (`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle. Greens the wedge regression from the preceding commit. * fix(engine): v3→v4 retry loops return retryable contention on exhaustion `commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET` (6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST iteration's `IncompatibleTransaction` fell through to `Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the intended `RowLevelCasContention`, and the post-loop contention return was dead code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`, so under sustained concurrent v3→v4 migration the one-time stamp bump could fail instead of converging, defeating the idempotency the migration is supposed to add. Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the retryable variant is always handled inside the loop (re-open + same-value check + retry), so it can never reach the stringifying catch-all, and the post-loop is the SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)` arm now catches only genuine non-contention errors. Apply the same range alignment to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its `Err(err)` returns the already-typed contention — but it carried the identical off-by-one structure the stamp loop was copied from; aligning both stops the next copy from re-introducing it). Test-first. The exhaustion path is otherwise near-unreachable — a real concurrent winner stamps the same value, so the re-read returns Ok on the first retry — so a new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to lose, driving exhaustion deterministically. Against the pre-fix loop the new `v4_stamp_exhaustion_returns_retryable_contention` test goes red with `Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it asserts the typed `RowLevelCasContention`. Found by automated review on #299. * feat(engine): minimum-supported internal-schema floor + retirement tripwire The internal-schema migration chain (`migrate_internal_schema`) had a too-new ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers it needs stay forever — the pile grows by one migration + readers + tests every schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as the oldest stamp this binary opens; raising it is how the chain sheds old code. Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported` checking both bounds, so the floor lands at all three stamp-enforcement sites — the write-path migrate dispatcher, the read-only open guard, and the branch lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A hand-wired floor twin would have had to touch each site, and the branch-read path is easy to miss; one combined guard cannot half-enforce. Rename the read-only wrapper `refuse_if_internal_schema_unsupported` to match. A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE == MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the now-dead migration arm (or vice versa) — stronger than a runtime test, impossible to skip, and it doubles as the use that keeps the mirror const live. Tests: a sub-floor graph is refused in both open modes (twin of `future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly [MIN, CURRENT]. No behavior change for any real graph. The retirement runbook lives on the `MIN_SUPPORTED` doc-comment + invariants.md. * fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit Three review-surfaced fixes on the RFC-013 Phase 7 path. Publisher retry vs migration contention: `publish()` propagated a `load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish instead of being retried — only `merge_rows` conflicts hit the retry. This contradicted the migration's own design, which returns that typed error EXPECTING the publisher to re-run the load (by which point a concurrent winner has usually finished the migration, so the next scan is a no-op). Route a retryable load error through the same retry path as a retryable `merge_rows` conflict. Regression test (failpoints): a one-shot retryable contention injected into `load_publish_state` now commits via the retry; red without the fix (the write fails with the injected contention). Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can advance `graph_head` past the recovery commit between the winner's publish and this read — attributing the audit to a later, wrong commit. Use the latest `RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is the recovery commit by construction. The audit's actor was already correct (it comes from `sidecar.actor_id`, not the commit). Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward` (removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to delete the sidecar.
2026-06-25 13:55:34 +02:00
- **`_graph_commits.lance`** is an L2 dataset retained only as a branch-ref carrier (and, on a pre-Phase-7 graph, the migration source). Since RFC-013 Phase 7 the graph commit DAG lives in `__manifest` as `graph_commit` / `graph_head` rows written in the publish CAS — `_graph_commits.lance` and its paired `_graph_commit_actors.lance` no longer receive commit rows. A graph created before Phase 7 (internal schema v3) backfills its lineage into `__manifest` on its first read-write open (`migrate_v3_to_v4`). (Pre-v0.4.0 graphs also have inert `_graph_runs.lance` / `_graph_run_actors.lance` from the removed Run state machine; the internal schema migration sweeps their stale `__run__*` branches, and the dataset bytes are reclaimed once a prefix-delete primitive lands.)
- **`_graph_commit_recoveries.lance`** — one row per crash-recovery action. Joined by `graph_commit_id` to the graph commit lineage (the `graph_commit` rows in `__manifest` since RFC-013 Phase 7); the linked commit carries `actor_id=omnigraph:recovery`. Operators correlate recoveries with the original mutations they rolled forward / back via this join.
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
- **`__recovery/{ulid}.json`** — transient sidecar files written by a writer before it advances the underlying dataset, deleted once the matching manifest publish succeeds. A sidecar persisting after process exit means the writer crashed mid-commit; the next read-write open processes it. Steady-state directory is empty.
docs: add Mermaid architecture diagrams across architecture / storage / execution Replace the single ASCII stack in docs/architecture.md with a hierarchy of Mermaid diagrams that show the system from external context down to the component level. Add an on-disk layout diagram in docs/storage.md and two sequence diagrams (read query, mutation) in docs/execution.md so readers can navigate from "what is OmniGraph" to "how does a query run" without opening source. Static structure (docs/architecture.md): - System context — agents/clients, embedding providers, Cedar, object store. - Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling via classDef, replacing the pre-existing ASCII art. - Component zoom-ins — compiler, engine, storage trait, index lifecycle, server/CLI. Each zoom-in cites file:line entry points. Aspirational shapes (storage trait, full reconciler) are visually marked and pointed at the relevant invariants.md section so readers see the intended seam without thinking it's already implemented. On-disk layout (docs/storage.md): - Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance, _graph_runs.lance, _refs/branches/ down into Lance's per-dataset internals (_versions/, data/, _indices/, _refs/, _transactions/). - Annotated with the actual filenames so readers can `ls` the same paths. - Slots in below the existing __manifest CAS / OCC / migration prose; does not move or rewrite that content. Runtime flows (docs/execution.md): - Read flow sequence: client → Omnigraph::query → typecheck → lower → execute_query → table_store → Lance scanner → RecordBatch stream. - Mutation flow sequence: Omnigraph::mutate → resolve literals → Lance write op (Append / merge_insert) → ManifestRepo::commit → __manifest upsert. - Both diagrams are followed by a "Code paths" block with verified file:line citations so readers can navigate from diagram element to source in one step. Conventions established (this is the first Mermaid in the repo): - L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed. - Diagram size cap ~9 elements; more detail goes in a sub-diagram. - Diagrams paired with prose; code-path citations follow each diagram. - Consistent vocabulary across diagrams: frontend / compiler / engine / storage trait / Lance / object store. No accidental synonyms. Subsequent PRs will add flow diagrams for schema apply, branch + merge, run isolation, index reconcile, and the embedding pipeline in the same conventions.
2026-04-29 16:58:56 +02:00
- **`_refs/branches/{name}.json`** is graph-level branch metadata — pointers from a branch name to the manifest version it heads.
- **Inside each Lance dataset** (orange): the standard Lance directory layout. `_versions/{n}.manifest` records every commit; `data/` holds the actual Arrow fragments; `_indices/{uuid}/` holds index segments with their own `fragment_bitmap` for partial coverage; `_refs/` holds Lance-native per-dataset branches and tags.
The split — L2 owns the cross-dataset catalog; L1 owns the per-dataset internals — means that schema work (which adds or removes datasets) updates `__manifest`, while data work (which adds fragments) updates `_versions/` inside the affected dataset and then bumps `__manifest`.
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
## URI scheme support
| Scheme | Backend | Notes |
|---|---|---|
docs(user): de-dev polish — strip internal scaffolding from user docs (Phase 3a) (#226) Remove developer-only scaffolding that leaked into the public user/operator docs, while preserving every user-facing behavior, command, flag, endpoint, constant, and env var. No behavior changes. Removed across 18 files: - internal ticket / sequencing refs (MR-NNN, RFC-NNN, "Phase N"); - source-code paths (crates/**/*.rs, *.pest) and internal struct/function dumps (e.g. the QueryIR / GraphCommit / SchemaMigrationPlan Rust types, internal fn names like fork_branch_from_state, optimize_all_tables); - Lance-internal blocker prose (upstream issue numbers, blob-decode cause, sidecar Phase-B/C mechanics) — keeping the user-visible behavior (e.g. "optimize skips Blob-column tables; reads/writes unaffected"); - pre-v0.4.0 Run-state-machine archaeology. Internal IR/lowering/recovery-internals sections were either trimmed to a brief user-facing note (e.g. "Traversal execution", "interrupted writes recover automatically; recovery commits are recorded under actor omnigraph:recovery") or removed. Kept: all language syntax, lint codes, Cedar actions/scopes, endpoints, error taxonomy, every constant and env var (verified none dropped from the constants cheat-sheet), and the operator-facing explanations of on-disk artifacts. Residual "legacy" mentions are all user-facing (the deprecated omnigraph.yaml, the legacy token chain, old command names). Verified: zero internal-scaffolding leaks (MR/RFC/Phase/.rs/.pest = 0) across docs/user; zero broken links; check-agents-md.sh green. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-14 14:39:25 +03:00
| local path / `file://` | local filesystem | Normalized to absolute paths; relative and dot-segment paths are lexically absolutized |
| `s3://bucket/prefix` | S3 object store | Honors `AWS_ENDPOINT_URL_S3`, `AWS_ALLOW_HTTP`, `AWS_S3_FORCE_PATH_STYLE` |
| `http(s)://host:port` | HTTP client to `omnigraph-server` | Used by CLI as a target, not a storage backend |
## Object-store env vars (S3-compatible)
- `AWS_REGION`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`
- `AWS_ENDPOINT_URL`, `AWS_ENDPOINT_URL_S3` — for MinIO / RustFS / GCS-via-XML
- `AWS_S3_FORCE_PATH_STYLE=true` — path-style URLs
- `AWS_ALLOW_HTTP=true` — allow plain HTTP (local dev)