mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-27 02:39:38 +02:00
* docs(rfc-013): bank the #295 spec-review comments as step-5 constraints (§5.1) 3b shipped a minimal WriteTxn{branch,base} and deferred the full §4.1 opener unification (pinned-base opener, shared Session, write-local handle cache, strict-op conflict-timing move) to step 5. The greptile comments on the #295 spec were moot for #298 (none of those constructs were built) but are load-bearing for step 5: (1) the handle cache must be Send+Sync (Mutex, not RefCell); (2) the strict-op timing move needs an explicit retry contract — txn discarded after any commit, retry re-opens a fresh base — which is the SAME contract as the stale-view false-fail (§1d.2); (3) the opener-equivalence test must advance HEAD externally then assert pinned-base, not the trivial HEAD==base. * feat(engine): fold graph lineage into the __manifest publish CAS (RFC-013 Phase 7) Graph lineage no longer lives in a second write to _graph_commits.lance. Each commit's graph_commit + graph_head:<branch> rows now ride the SAME __manifest merge-insert as the table-version rows (one atomic version), and CommitGraph reads its cache from the manifest projection (read_graph_lineage). _graph_commits.lance is no longer written commit rows (it remains only as a Lance branch-ref carrier). Mechanism: a LineageIntent { graph_commit_id (ULID, minted once), branch, actor, merged_parent, created_at } threads through ManifestBatchPublisher::publish. Inside the publisher retry loop the parent is resolved per attempt from the just-loaded branch-scoped manifest (the should_replace_head winner over the visible graph_commit rows — branch-correct by Lance branch isolation; the graph_head row is written for forward-compat + the §7.1 contention point but is not the parent source, so a freshly-forked branch resolves the right fork-point parent). A CAS-conflict retry re-reads the advanced head → correct new parent; the commit_id is stable across retries. 
Closes two known gaps BY CONSTRUCTION (one write, no second step to fail/ race): - manifest→commit-graph atomicity (no crash window between manifest + lineage), - commit-graph parent under concurrency (no refresh→append TOCTOU; the per-write commit_graph.refresh() is gone). Recovery, branch-merge, and genesis route their lineage through the same CAS (merge: one commit_merge_with_actor; recovery: publish_recovery_commit folds the recovery commit, actor=omnigraph:recovery; genesis rides the init __manifest write). The dead _graph_commits write helpers (append_commit/_merge/_actor) are #[allow(dead_code)] (the actor sidecar table is still enumerated by optimize). Verified (sequential): build clean; the new lineage_projection gate (manifest-only — _graph_commits/_actors have 0 rows; full lineage reconstructs via the projection); branching/merge_truth_table (exhaustive, branch-aware)/composite_flow/point_in_time/ changes/consistency/recovery; failpoints (59, incl. recovery lifecycle + the now-closed atomicity gap); full --workspace. Cost tests REVERT to their pre-fold values (writes +1, write_cost ceiling 80) — the proof of true single-CAS (no extra write). invariants.md marks both gaps CLOSED. PENDING (next stages, this PR): the §7.1 concurrent graph_head one-winner gate (stage 5 — two concurrent same-branch commits, exactly one wins); the stamp bump v4 + migrate_v3_to_v4 backfill + read-only refuse for EXISTING graphs (stage 4); full doc-sync of storage.md/architecture.md/writes.md. * feat(engine): migrate existing v3 graphs to manifest lineage (RFC-013 Phase 7 stage 4) The Phase-7 fold made CommitGraph read lineage from the __manifest projection, so a pre-Phase-7 (internal-schema v3) graph — lineage in _graph_commits.lance, none in __manifest — would read an empty commit DAG. Stage 4 makes existing graphs upgrade seamlessly and not break reads. - Stamp 3 -> 4 + migrate_v3_to_v4: bumps INTERNAL_MANIFEST_SCHEMA_VERSION and adds the 3 => migrate_v3_to_v4 arm. 
The migration reads this branch's _graph_commits/_actors, emits one graph_commit row per commit + exactly one graph_head:<branch> for the head (should_replace_head winner, deterministic id-sort — no hash-map-order in migration output), merge-inserts into __manifest, then set_stamp(4) LAST. Idempotency guard first (read_graph_lineage non-empty -> just stamp); crash before set_stamp re-enters at v3 and the guard completes it. Does NOT touch the unenforced-PK metadata. Runs per branch: migrate_on_open backfills main; load_publish_state backfills each branch on its first write (root_uri/branch threaded through migrate_internal_schema). - v3-read fallback: CommitGraph version-gates the lineage source — stamp < 4 reads the (re-activated) _graph_commits.lance; >= 4 uses the manifest projection. So a READ-ONLY open of an un-migrated graph reads correct history with no write. Correctness catch: the legacy _graph_commit_actors.lance was never branched, so the fallback reads it FLAT (no branch checkout) while checking out the branch only on the commits dataset. - Read-only stamp-refuse: a ReadOnly open of a FUTURE-stamped graph now refuses with the same upgrade error (future-proofing the next format bump; the write path already refused via migrate_internal_schema). - Docs: storage/architecture/writes/invariants/constants updated to manifest-stored lineage; release note docs/releases/v0.8.0.md (format v4, old writers clean-break, data preserved, upgrade writers first). 6 new tests (v3 backfill, idempotent, v3 read-only fallback, future-stamp refuse in both modes, crash-before-stamp completes, legacy branch+flat-actor read). Full engine suite + failpoints (59) + cargo test --workspace --locked green; check-agents-md passes. 
* test(engine): graph_head concurrency gate — disjoint same-branch writers form a linear commit DAG (RFC-013 Phase 7) Two (or N) writers committing disjoint tables on one branch still share the mutable `graph_head:<branch>` manifest row, so the only row-level CAS contention is that row. The contract — exactly one writer wins each CAS round; the loser retries inside the publisher, re-resolves its parent off the freshly-advanced head, and re-commits, so every writer lands and the graph_commit DAG stays a single LINEAR chain (no fork) — had no acceptance test. This adds it. - concurrent_disjoint_writes_share_head_and_form_linear_chain: two disjoint writers + distinct LineageIntent, tokio::join!; both commit; the on-disk DAG is genesis -> c -> c' (asserted linear: exactly one genesis, no two commits share a parent, the head is the unique non-parent). - n_concurrent_disjoint_writers_converge_to_one_linear_chain: N=8 disjoint writers each with an app-level retry loop (the publisher's internal budget can be exhausted under contention); all converge to one linear chain of 8. - concurrent_disjoint_writes_form_linear_chain_on_s3: the same race on a real object store (true conditional-put CAS), bucket-gated. Cites both tests from the §7.1 contention note in invariants.md. Test-only; no production change. * perf(engine): fold the lineage parent scan into the publish path's single __manifest scan (RFC-013 P2) Each lineage publish scanned `__manifest` twice: `load_publish_state` read table state via one scan, then `resolve_lineage_rows` did a second full `read_graph_lineage` scan only to find the parent commit. Fold the `graph_commit` extraction into the existing scan. - `read_manifest_scan` gains a `collect_lineage` flag. 
The publish path (`read_publish_scan`) collects the `graph_commit` rows in the same pass; the table-state hot path leaves them in the forward-compat skip arm, so it never pays the O(commits) lineage JSON decode (it also skips reading the `object_id` column entirely). One shared `decode_graph_commit_row` serves both the folded path and the standalone `read_graph_lineage`, so the two cannot drift. - `resolve_lineage_rows` is now sync and takes the already-parsed rows; the per-attempt re-read is preserved because `load_publish_state` runs once per CAS attempt, so a retry still re-parents off the advanced head. - `load_publish_state` returns a named `LoadedPublishState` instead of a four-tuple; the thin `read_registered_table_locations` / `read_tombstone_versions` accessors fold away. `read_manifest_entries` becomes `#[cfg(test)]`: the fold removes its last production caller, leaving only the test-only namespace module (`db/manifest.rs`: `#[cfg(test)] mod namespace`), so gating it keeps it from becoming dead code in non-test builds. Measured at depth ~5: per-write `__manifest` reads drop 44 -> 26 (total reads 54 -> 36). write_cost.rs gains a `manifest_reads <= 34` sub-ceiling that trips if a publish-path scan is re-added, and its calibration comment is corrected. * test(engine): red — transient legacy-open failure silently completes the v3→v4 migration A pre-Phase-7 (internal schema v3) graph keeps its graph lineage in `_graph_commits.lance`; the v3→v4 internal-schema migration backfills it into `__manifest` and stamps v4. `read_legacy_commit_cache` currently maps EVERY `Dataset::open` error to "no legacy data" (`Err(_) => empty`), so a transient or corrupt open during the one-time migration backfills nothing and still stamps v4 — orphaning the real lineage permanently (the migration runs once; the v3 fallback is then disabled). 
Add a `migration.v3_to_v4.legacy_open` failpoint that injects a non-not-found Lance error at the legacy open, and a fault-injection regression test in the `failpoints` binary. Against the current swallow the migration completes anyway, so the test fails on its "migration must abort" assertion — the predicted symptom. The fix follows in the next commit. Test support reachable from the `failpoints` integration binary (it compiles the crate without `cfg(test)`): the v3-fixture helpers and a stamp/row-count reader are gated `cfg(any(test, feature = "failpoints"))`, still excluded from release builds. Failpoint tests stay in the integration binary because the fail registry is process-global. * fix(engine): propagate non-not-found legacy-open errors in the v3→v4 migration `read_legacy_commit_cache` mapped EVERY `Dataset::open` error to an empty cache (`Err(_) => empty`) on both the legacy commits dataset and its actor sidecar. The v3→v4 internal-schema migration reads this once before stamping internal-schema v4; a transient or corrupt open therefore backfilled nothing and stamped v4 anyway, orphaning the graph's real lineage permanently (the migration runs once, and the stamp-gated v3 fallback is disabled at v4). This is the "no silent failures" deny-list violation, and realistic on object storage. Both opens now match the not-found variants — Lance maps an object-store NotFound to `DatasetNotFound` — as the benign "no legacy data" / "no authors" signal, and propagate anything else as a loud error. The two arms share the variant contract but carry different rationale (commits-absent is the legitimate empty signal; actor-sidecar-absent is benign, but a corrupt actor open silently wiping authorship before stamping v4 is the same loss hole), commented at each site. 
Pinned by the `lance_surface_guards.rs::dataset_open_missing_returns_not_found_variant` guard (turns red if a Lance bump changes the absence variant) and greens the fault-injection regression test from the previous commit. * test(engine): cover the per-branch v3→v4 migration against a real Lance branch `seed_legacy_v3_lineage` writes every commit (including the "feature"-tagged one) to MAIN's `_graph_commits.lance` with `manifest_branch` as a mere field, so the production per-branch migration path — `read_legacy_commit_cache` checking out a real Lance branch, and a branch-scoped `__manifest` — was never exercised. Add `seed_legacy_v3_lineage_with_branch`, which forks a real `feature` Lance branch on BOTH `_graph_commits.lance` and `__manifest` (the branch inherits main's stripped v3 state), and a test that migrates the BRANCH and asserts the branch's lineage lands in the BRANCH's `__manifest` (genesis + A + branch commit, `graph_head:feature` → branch commit, parents + actors intact) with main's `__manifest` untouched. This empirically resolves the open question behind the merge robustness work: the fast-path `read_graph_lineage(dataset)` has no `manifest_branch` filter, but `__manifest` is Lance-branched per graph-branch, so a branch reads only its own lineage — the test confirms migrating one branch does not leak into another. No branch filter is needed. * refactor(engine): type the lineage-backfill merge conflict via the publisher classifier `state::merge_lineage_rows` (the v3→v4 lineage backfill's standalone `__manifest` merge-insert) stringified its `execute_reader` error, discarding the Lance variant. Route it through the publisher's `map_lance_publish_error` (now `pub(crate)`) so a concurrent first-open's row-level CAS loss surfaces as the SAME typed `OmniError::Manifest{ details: RowLevelCasContention }` the publisher's own retry consumes — one vocabulary, no raw-Lance matching in the migration. 
Deliberately NOT unified with `optimize::is_retryable_lance_conflict`: that classifier also matches `CommitConflict`/`RetryableCommitConflict` from the compaction commit path, which a row-level merge-insert never emits. Cross-linked with a comment at both sites. Behavior-preserving: the only path that changes is the error TYPE on a CAS loss (previously an opaque `Lance` string, now a typed conflict); no success/failure outcome changes. The bounded re-open retry that consumes the new type lands next. * test(engine): red — concurrent v3→v4 migrations error instead of converging `migrate_v2_to_v3` is concurrent-runner idempotent by design; v3→v4 regressed it. `merge_lineage_rows` uses `conflict_retries(0)` and `migrate_v3_to_v4` has no app-level retry, so when two processes open the same legacy graph at once the backfill's row-level CAS loser errors the whole open instead of converging. The test opens two `__manifest` handles at the same pre-migration (v3, empty-lineage) HEAD and runs both `migrate_internal_schema` calls under `tokio::join!`, forcing the `graph_head:main` CAS to fire every run. Against the current code the loser fails with `RowLevelCasContention` ("Attempted 0 retries.") — the predicted symptom — so the "both must converge" assertion panics. The bounded re-open retry that makes both converge lands next. * fix(engine): make the v3→v4 lineage backfill converge under concurrent runners `migrate_v2_to_v3` is concurrent-runner idempotent; v3→v4 was not. Two processes (or open-for-write handles) opening the same legacy graph at once both reach the backfill merge, and `merge_lineage_rows`'s `conflict_retries(0)` made the row-level CAS loser error the whole open instead of converging. Two contention points, both now handled all-or-nothing: 1. The backfill merge on `graph_head:<branch>`. 
Wrap (fast-path re-read → read legacy → merge) in a bounded re-open retry loop: a `RowLevelCasContention` loss re-opens the manifest past the winner's (atomic) commit and re-loops; the fast-path re-read then sees the winner's lineage and stamps. On budget exhaustion it returns a `RowLevelCasContention`-typed error so the publisher's OUTER retry loop completes it. The retry decision reuses the publisher's `is_retryable_publish_conflict` so the two stay in lockstep. 2. The terminal stamp bump. Making the merge loser converge newly lets BOTH runners reach `set_stamp(4)` — an `UpdateConfig` commit on the same key — so the loser gets `lance::Error::IncompatibleTransaction` (NOT a row-level CAS, so the merge loop doesn't catch it). This surfaced only under the concurrent full-suite run, not the isolated test. Both write the SAME value, so the conflict is benign: `commit_v4_stamp_idempotently` re-opens and, if the stamp already reached the target, succeeds; else re-applies (bounded). Greens the race test from the previous commit (3x isolated, 5x full-suite, no flake). The new `IncompatibleTransaction` match is pinned by `lance_surface_guards.rs::lance_error_incompatible_transaction_variant_exists`. * fix(engine): refuse a future internal-schema stamp on the branch read path `load_commit_cache_for_branch` dispatched on the branch's internal-schema stamp — `< CURRENT` to the v3 legacy fallback, `>= CURRENT` to the manifest projection — but never refused a `> CURRENT` branch stamp, so a newer-binary shape would be misread by the projection rather than rejected. Add `refuse_if_stamp_too_new(stamp)` (re-exported `pub(crate)` from `migrations`) right after the branch stamp is read, mirroring the main read path's `refuse_if_internal_schema_too_new`. This is defense-in-depth, not a live hole: migrations run main-first (main migrates on open; each branch on its first write), so main's stamp is always >= every branch's and the main path refuses first. 
The guard closes the gap if that ordering invariant is ever weakened. Tested by force-stamping a real branch past CURRENT and asserting the branch read refuses with the upgrade error (the test misreads via the projection — returns Ok — without the guard, confirmed by removing it). * docs(rfc-013): record the v3→v4 migration robustness fixes invariants.md Known Gaps: the `migrate_v3_to_v4` entry now states the migration is loud on non-not-found legacy-open errors and concurrent-runner idempotent (bounded re-open retry on the merge CAS + idempotent stamp bump), and that the branch read path refuses a `> CURRENT` stamp. lance.md: note the two new surface guards the migration depends on (`dataset_open_missing_returns_not_found_variant`, `lance_error_incompatible_transaction_variant_exists`). testing.md: note the migration fault-injection test in the failpoints row. * refactor: remove dead code and silence warnings across engine + cluster Dead-code sweep follow-up to the RFC-013 stack. No behavior change. - engine: delete the orphaned `validate_edge_cardinality` — the load path uses `validate_edge_cardinality_with_pending_loader` for every mode (including Overwrite, which it treats as the replacement table image), so the old standalone validator had no caller — and correct its sibling's now-stale doc reference. Gate `TableStore::append_batch` `#[cfg(test)]`: it is the inline- commit residual kept only for recovery test setup, with no non-test caller. - cluster: drop unused imports in `lib.rs`, delete the unused `ClusterStore::payload_display`, and raise `LiveGraphObservation` / `GraphObservationJson` / `PolicyTarget` to `pub(crate)` to match the functions that return them. Both lib crates now build warning-free. 
* fix(engine): match Lance's typed DatasetAlreadyExists, not the message string The internal create-or-open idempotency fallbacks in `db/commit_graph.rs` and `db/recovery_audit.rs` classified the "already exists" race by `err.to_string().contains("Dataset already exists")` — a Lance display string, not an API contract. A wording change upstream would silently break the fallback (a re-create would error instead of opening the existing table). Match the typed `lance::Error::DatasetAlreadyExists { .. }` variant instead — the same discipline as the v3→v4 migration's not-found classifier — pinned by the new `lance_surface_guards.rs::lance_error_dataset_already_exists_variant_exists` guard so a Lance rename turns red instead of silently regressing. * refactor(engine): consolidate now_micros into one crate::db helper Four `fn now_micros() -> Result<i64>` copies (commit_graph, recovery_audit, graph_coordinator, manifest/graph) had already drifted: three mapped the clock error to `OmniError::manifest("...UNIX_EPOCH...")` while recovery_audit used `OmniError::manifest_internal("...unix epoch...")`. Replace all four with one `pub(crate) fn now_micros()` in `db/mod.rs` (the majority `manifest` variant), and repoint the eight call sites at `crate::db::now_micros()`. No test asserts on the failure message, so unifying the variant is behavior-safe; the timestamp-mapping contract can no longer fork across the rows it stamps. * refactor(engine): drop the dead snapshot param from roll_back_sidecar `roll_back_sidecar` took `snapshot: &Snapshot` only to discard it with `let _ = snapshot;` — rollbacks now always publish (the restored HEAD plus a recovery-commit lineage row), so the snapshot is never read to decide whether to skip a publish. Remove the parameter, the two call-site arguments, and the suppressor. A signature must not advertise inputs it does not consume. 
The `Snapshot` import stays — `process_sidecar`, `roll_forward_all`, and `record_audit_recovery_rollforward` still take it. * test(engine): red — open_at_branch wedges a branch on a missing commit-graph ref A v4 graph keeps its graph lineage in `__manifest` (RFC-013 Phase 7); the `_graph_commits.lance` branch ref is a derived artifact. An interrupted fork-reclaim or a `cleanup` race can drop that derived ref while the manifest lineage stays intact. Per invariants 7 + 15 a missing derived ref must not fail a logical read of the lineage. This wedge builds a real v4 `feature` branch (its `graph_head:feature` row in `__manifest`), force-deletes ONLY the `_graph_commits.lance` `feature` ref, then asserts the branch reads (`open_at_branch` / list-commits / `merge_base`) succeed from `__manifest` while a write that needs the derived ref (`create_branch`) fails loudly with the typed actionable error. Red against current code: `open_at_branch`'s hard `checkout_branch(branch)?` on the missing ref errors `OmniError::Lance` (Lance "Not found: _graph_commits.lance/tree/feature/_versions"), wedging the logical read. * fix(engine): read manifest lineage independent of the derived _graph_commits ref `CommitGraph::open_at_branch` did a hard `checkout_branch(branch)?` on the `_graph_commits.lance` branch ref before reading lineage — so a missing derived ref (an interrupted fork-reclaim, or a `cleanup` race) wedged the branch's commit-list / merge-base / snapshot resolution even though the lineage is readable from the authoritative `__manifest` (RFC-013 Phase 7). That is a derived/physical artifact failing a logical read — invariants 7 and 15. Make the held commits handle `Option<Dataset>` (mirroring `actor_dataset`). `open_at_branch` and `refresh` check out the derived ref best-effort: a typed not-found (`RefNotFound`/`NotFound`) yields a `None` handle while the read re-syncs from `__manifest`; any other open error still propagates. 
The manifest existence gate is unchanged — `load_commit_cache_for_branch` keeps its hard `?`, so a truly absent branch still fails loudly at the manifest. `create_branch` (the only writer that forks a ref) and the folded-in version lookup return a loud, actionable error on `None`, deferring repair to `cleanup`'s existing orphan reconciler rather than inlining a write on a read-side refresh. Reads (`head_commit`/`load_commits`/`get_commit`/`merge_base`) never touch the handle. Greens the wedge regression from the preceding commit. * fix(engine): v3→v4 retry loops return retryable contention on exhaustion `commit_v4_stamp_idempotently`'s retry loop used `0..=STAMP_RETRY_BUDGET` (6 iterations) with an `attempt < STAMP_RETRY_BUDGET` guard, so the LAST iteration's `IncompatibleTransaction` fell through to `Err(e) => OmniError::Lance(...)` — stringified, non-retryable — instead of the intended `RowLevelCasContention`, and the post-loop contention return was dead code. The publisher's outer retry only re-runs `is_retryable_publish_conflict`, so under sustained concurrent v3→v4 migration the one-time stamp bump could fail instead of converging, defeating the idempotency the migration is supposed to add. Fix the loop to `0..BUDGET` with an UNGUARDED `IncompatibleTransaction` arm: the retryable variant is always handled inside the loop (re-open + same-value check + retry), so it can never reach the stringifying catch-all, and the post-loop is the SINGLE reachable exhaustion path — the typed `RowLevelCasContention`. The `Err(e)` arm now catches only genuine non-contention errors. Apply the same range alignment to the sibling merge loop in `migrate_v3_to_v4` (behaviorally correct today — its `Err(err)` returns the already-typed contention — but it carried the identical off-by-one structure the stamp loop was copied from; aligning both stops the next copy from re-introducing it). Test-first. 
The exhaustion path is otherwise near-unreachable — a real concurrent winner stamps the same value, so the re-read returns Ok on the first retry — so a new `migration.v4_stamp.force_incompatible` failpoint forces every stamp attempt to lose, driving exhaustion deterministically. Against the pre-fix loop the new `v4_stamp_exhaustion_returns_retryable_contention` test goes red with `Lance("Incompatible transaction: injected failpoint triggered…")`; with the fix it asserts the typed `RowLevelCasContention`. Found by automated review on #299. * feat(engine): minimum-supported internal-schema floor + retirement tripwire The internal-schema migration chain (`migrate_internal_schema`) had a too-new ceiling but no floor, so every old `migrate_vN_…` arm and the v3 legacy readers it needs stay forever — the pile grows by one migration + readers + tests every schema version. Add `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (1 today, a pure no-op: `read_stamp` floors an absent stamp at 1 and no real graph carries 0) as the oldest stamp this binary opens; raising it is how the chain sheds old code. Collapse the one-sided `refuse_if_stamp_too_new` into `refuse_if_stamp_unsupported` checking both bounds, so the floor lands at all three stamp-enforcement sites — the write-path migrate dispatcher, the read-only open guard, and the branch lineage-read path (`commit_graph.rs`) — via one compiler-enforced rename. A hand-wired floor twin would have had to touch each site, and the branch-read path is easy to miss; one combined guard cannot half-enforce. Rename the read-only wrapper `refuse_if_internal_schema_unsupported` to match. A compile-time tripwire (`const _: () = assert!(LOWEST_REGISTERED_MIGRATION_SOURCE == MIN_SUPPORTED…)`) fails the build if a future floor bump forgets to delete the now-dead migration arm (or vice versa) — stronger than a runtime test, impossible to skip, and it doubles as the use that keeps the mirror const live. 
Tests: a sub-floor graph is refused in both open modes (twin of `future_stamp_is_refused_in_both_open_modes`); the guard accepts exactly [MIN, CURRENT]. No behavior change for any real graph. The retirement runbook lives on the `MIN_SUPPORTED` doc-comment + invariants.md. * fix(engine): compose migration contention with publisher retry; precise recovery-converge audit commit Three review-surfaced fixes on the RFC-013 Phase 7 path. Publisher retry vs migration contention: `publish()` propagated a `load_publish_state` error fatally via `?`, so a `RowLevelCasContention` surfaced by the v3->v4 migration's exhausted merge/stamp budgets aborted the publish instead of being retried — only `merge_rows` conflicts hit the retry. This contradicted the migration's own design, which returns that typed error EXPECTING the publisher to re-run the load (by which point a concurrent winner has usually finished the migration, so the next scan is a no-op). Route a retryable load error through the same retry path as a retryable `merge_rows` conflict. Regression test (failpoints): a one-shot retryable contention injected into `load_publish_state` now commits via the retry; red without the fix (the write fails with the injected contention). Recovery-converge audit commit id: `converge_or_defer_roll_forward` recorded the branch HEAD as the audit row's `graph_commit_id`, but a concurrent user write can advance `graph_head` past the recovery commit between the winner's publish and this read — attributing the audit to a later, wrong commit. Use the latest `RECOVERY_ACTOR`-authored commit (what `publish_recovery_commit` mints), which is the recovery commit by construction. The audit's actor was already correct (it comes from `sidecar.actor_id`, not the commit). Dead param: drop the unused `snapshot` from `record_audit_recovery_rollforward` (removing the `let _ = snapshot;` suppressor). `storage` stays — it is used to delete the sidecar.
422 lines
29 KiB
Markdown
# Architectural Invariants

**Type:** standing review checklist
**Status:** living document
**Audience:** anyone proposing, reviewing, or implementing an OmniGraph change

This file is intentionally short. It records the rules that should be in
working memory for every non-trivial change. Detailed mechanics live in the
area docs linked below.

Use it this way:

- Review the change against **Hard Invariants** and the **Deny-list**.
- If code and docs disagree, either fix the code or add/update a **Known Gap**.
- Keep implementation ledgers, roadmap detail, and historical MR notes in the
  per-area docs. This file is the filter, not the encyclopedia.

## Governing principle: logical contract over physical state

The hard invariants below are instances of one rule. Keep it in view whenever
a change touches the boundary between what the graph *means* and how it is
physically stored.

> **Logical state is the contract. Physical state — index coverage, fragment
> layout, compaction versions, staged writes — is derived, rebuildable, and may
> be produced asynchronously. A physical operation must never fail a logical
> one. Preconditions are checked against logical state; physical reconciliation
> is idempotent and may lag or retry. Genuine logical conflicts still fail
> loudly: the licence to lag covers physical convergence, not correctness.**

Invariants that instantiate it: **2** (manifest-atomic visibility) and **5**
(recovery is part of the commit protocol) — a partially-written physical layer
never changes what a graph commit means; **7** (indexes are derived state) — a
query is correct under partial index coverage, and expensive index work
converges from manifest state instead of gating the write path; **13** (failures
bounded and observable) — the licence to lag is not a licence to drop, so a
physical step that cannot make progress is surfaced, not swallowed. Deny-list
items that enforce it: synchronous inline vector/FTS index rebuilds on the
commit path; state that drifts from Lance or the manifest when it can be
derived; job queues for manifest-derivable state where a reconciler fits.

The failure shape it rules out: a legitimate background operation on the
physical layer (compaction, an index build, an interrupted staged write) is
allowed to break a logical operation (a query's correctness, a migration's
success, a branch's writability). The smell to watch for is a logical operation
whose precondition is a *physical* fact — a cached file version, an index's
existence, a fragment count. Make the precondition logical and let a reconciler
converge the physical state.

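
The rule above can be illustrated with a minimal sketch. Everything in it is hypothetical and exists only for illustration: `LogicalManifest`, `PhysicalIndex`, `publish`, and `reconcile` are not OmniGraph or Lance APIs, and the integer "versions" stand in for whatever the real manifest tracks. The sketch assumes a write checks its precondition only against logical state (the expected table version), while index coverage is converged separately by an idempotent reconciler that is allowed to lag.

```rust
use std::collections::BTreeMap;

/// Hypothetical logical state: table name -> committed version.
#[derive(Debug, Default)]
struct LogicalManifest {
    table_versions: BTreeMap<String, u64>,
}

/// Hypothetical physical (derived) state: table name -> version the index covers.
#[derive(Debug, Default)]
struct PhysicalIndex {
    indexed_version: BTreeMap<String, u64>,
}

#[derive(Debug, PartialEq, Eq)]
enum WriteError {
    /// A genuine logical conflict: the writer's base is stale. Fails loudly.
    StaleBase { table: String, expected: u64, actual: u64 },
}

/// Publish a new version of `table`. The only precondition is logical:
/// the manifest must still be at `expected`. Index coverage is never consulted,
/// so a lagging index cannot fail the write.
fn publish(
    manifest: &mut LogicalManifest,
    table: &str,
    expected: u64,
) -> Result<u64, WriteError> {
    let actual = manifest.table_versions.get(table).copied().unwrap_or(0);
    if actual != expected {
        return Err(WriteError::StaleBase {
            table: table.to_string(),
            expected,
            actual,
        });
    }
    let next = actual + 1;
    manifest.table_versions.insert(table.to_string(), next);
    Ok(next)
}

/// Converge index coverage toward the manifest. Idempotent: a second run with
/// no new commits advances nothing. Returns how many tables were advanced.
fn reconcile(manifest: &LogicalManifest, index: &mut PhysicalIndex) -> usize {
    let mut advanced = 0;
    for (table, &version) in &manifest.table_versions {
        let covered = index.indexed_version.entry(table.clone()).or_insert(0);
        if *covered < version {
            *covered = version;
            advanced += 1;
        }
    }
    advanced
}
```

In this sketch two successive `publish` calls succeed while the index still covers nothing, a third call with a stale `expected` fails with `StaleBase`, and `reconcile` can run at any later point (and any number of times) to catch the index up. The inverse design, where `publish` refuses because the index is behind, is the smell the section describes.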
## Hard Invariants

1. **Respect the substrate.** Lance owns columnar storage, per-dataset
   versioning, fragments, branches, compaction, cleanup, and index primitives.
   DataFusion should own relational execution where it fits. Do not add custom
   WALs, transaction managers, buffer pools, page formats, or local clones of
   substrate behavior. Read [lance.md](lance.md) before guessing. Respecting the
   substrate also means *using* it idiomatically, not only refraining from
   rebuilding it: reuse long-lived handles instead of re-opening per call,
   resolve latest state through the substrate's cheap primitive instead of
   re-scanning, and share its caches/session. Re-deriving per call what the
   substrate keeps warm is a substrate violation even when no code is
   reimplemented.

2. **Graph visibility is manifest-atomic.** Lance commits are per dataset.
|
|
OmniGraph's graph-level atomicity comes from publishing one manifest update
|
|
for the whole graph, guarded by expected table versions and sidecar recovery.
|
|
No write path may make a subset of touched node/edge tables visible as a
|
|
graph commit.
|
|
|
|
3. **A query reads one snapshot.** Query execution captures a manifest snapshot
|
|
for its lifetime. Do not re-read branch head mid-query to discover newer
|
|
table versions.
|
|
|
|
4. **Mutations publish at one boundary.** A `mutate_as` or `load` operation
   accumulates constructive writes, commits each touched table at the end, then
   publishes one manifest update. Do not commit per statement. Delete-only
   queries are the documented inline residual; the parse-time D2 rule prevents
   mixing deletes with insert/update until the delete path adopts Lance's
   two-phase delete (available since Lance 7.0.0; the migration is still
   pending, see Known Gaps). Read [writes.md](writes.md) and
   [execution.md](execution.md).

5. **Recovery is part of the commit protocol.** Writers that can advance Lance
   HEAD before manifest publish must write `__recovery/{ulid}.json` sidecars.
   `Omnigraph::open` in read-write mode runs the all-or-nothing sweep; the
   write entry points (`load_as`, `mutate_as`, `apply_schema_as`,
   `branch_merge_as`) and `refresh` run roll-forward-only recovery in-process,
   so a long-lived process converges on its next write rather than at restart.
   Do not add a new writer kind without sidecar coverage or an explicit proof
   that no Lance HEAD can move before manifest publish.

6. **Strong consistency is the default.** Reads are snapshot-isolated, writes
   are durable before acknowledgement, and branch reads observe the current
   committed graph state. Any eventual-consistency mode must be explicit,
   read-only, auditable, and non-default.

7. **Indexes are derived state.** Reads must see the correct result for the
   branch they read even when index coverage is partial. Expensive index work
   should converge from manifest state instead of extending the critical write
   path. Scalar staged index builds and vector inline residuals are documented
   in [writes.md](writes.md) and [indexes.md](../user/search/indexes.md).

8. **Schema identity survives renames.** Accepted schema identity must remain
   stable across type and property renames. Rename support belongs in migration
   planning, not in "drop and recreate" behavior. See the known gap below.

9. **Schema/data integrity failures are loud.** Type errors, required-field
   misses, invalid edge endpoints, cardinality violations, and unsupported
   mixed mutation modes fail before a graph commit is published. The system must
   not invent placeholder nodes or silently weaken integrity.

10. **Query semantics are first-class IR concepts.** Search modes, mutations,
    polymorphism, traversal, retrieval scores, imports, and policy predicates
    belong in typed AST/IR/planner structures. Do not smuggle semantics through
    strings, side tables, global state, or transport-specific flags.

11. **Transport/auth stay at the boundary.** Kernel crates should not depend on
    HTTP, OpenAPI, bearer-token parsing, or future transport protocols. The
    server resolves bearer tokens to actors; clients cannot set actor identity
    directly.

12. **Bearer-token plaintext is not retained.** Server startup hashes bearer
    tokens, authentication uses constant-time comparison, and request handling
    carries only the resolved actor identity and hash-derived match state.

13. **Operational failures are bounded and observable.** Timeout, memory, OOM,
    partial result, recovery, and conflict paths must fail loudly or degrade in
    a documented way. If a metric affects plan choice or operator behavior, it
    must be exposed through the relevant trait or observability surface.

14. **Tests match the boundary being changed.** Prefer extending the existing
    test that owns the area. Planner changes need planner-level coverage,
    storage changes need storage/recovery coverage, and end-to-end tests are not
    a substitute for missing lower-level assertions. Read [testing.md](testing.md)
    before adding tests.

15. **One source of truth, cheaply derived.** Lance and the manifest are the
    source of truth. Everything the engine needs at runtime is a derived view of
    them: read or projected on demand, held warm, refreshed by a cheap probe. Two
    failure modes are forbidden. A *parallel copy* the engine maintains can drift
    from the source, and that divergence compounds over time. *Cold
    re-derivation* rebuilds the view from the full source on every call, so its
    cost grows with history. Invariants 1 and 7, and the deny-list "state that
    drifts" and "manifest-derivable reconciler" items, are instances; so is
    bounding a read's cost to its working set rather than the commit count. This
    is the structural face of "engineering is programming integrated over time":
    both failure modes are liabilities that compound as the system grows.

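Invariant 15's "held warm, refreshed by a cheap probe" shape can be illustrated with a small model. The sketch below is an assumption-laden toy, not the engine's coordinator: `Source`, `WarmView`, `latest_version`, and `project` are hypothetical names, and the `full_scans` counter exists only to make the cost visible.

```rust
use std::collections::BTreeMap;

/// Hypothetical source of truth: an append-only history plus a version counter.
struct Source {
    version: u64,
    entries: Vec<(String, u64)>, // (table name, table version), in commit order
    full_scans: u32,             // counts expensive full re-derivations
}

impl Source {
    fn new() -> Self {
        Source { version: 0, entries: Vec::new(), full_scans: 0 }
    }

    fn commit(&mut self, table: &str, table_version: u64) {
        self.entries.push((table.to_string(), table_version));
        self.version += 1;
    }

    /// Cheap probe: O(1), no scan of the history.
    fn latest_version(&self) -> u64 {
        self.version
    }

    /// Expensive derivation: walks the whole history.
    fn project(&mut self) -> BTreeMap<String, u64> {
        self.full_scans += 1;
        let mut view = BTreeMap::new();
        for (table, version) in &self.entries {
            view.insert(table.clone(), *version);
        }
        view
    }
}

/// Derived view held between calls; rebuilt only when the probe moves.
struct WarmView {
    seen_version: Option<u64>,
    view: BTreeMap<String, u64>,
}

impl WarmView {
    fn new() -> Self {
        WarmView { seen_version: None, view: BTreeMap::new() }
    }

    fn get(&mut self, source: &mut Source) -> &BTreeMap<String, u64> {
        let latest = source.latest_version();
        if self.seen_version != Some(latest) {
            self.view = source.project();
            self.seen_version = Some(latest);
        }
        &self.view
    }
}
```

Repeated reads at an unchanged version pay only for the probe. The sketch still rebuilds the whole view when the version moves; an incremental refresh would be the natural next step, and it is left out here to keep the probe gate itself in focus.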
## Current Truth Matrix

| Area | Current state | Source |
|---|---|---|
| Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) |
| Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) |
| Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/queries/index.md), [writes.md](writes.md) |
| Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branching/index.md), [maintenance.md](../user/operations/maintenance.md) |
| Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema/index.md), [execution.md](execution.md) |
| Unique constraints | Intra-batch and write-path checks exist; intake and branch-merge derive the composite key through one shared function (`loader::composite_unique_key`, a separator-free `Vec<String>` tuple) and fail loudly on an un-keyable column type rather than silently exempting it; full cross-version uniqueness against already-committed rows is still a gap | [schema-language.md](../user/schema/index.md) |
| Storage trait | `TableStorage` (via `db.storage()`) is staged-only; the inline-commit residuals (`delete_where`, `create_vector_index`) are split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so §1 holds by construction; capability/stat surfaces are roadmap | [writes.md](writes.md), [architecture.md](architecture.md) |
| Index lifecycle | `@index`/`@key` declares *intent*; the physical index is derived state and never fails a logical op. `schema apply` builds no indexes (records intent only; index-only changes touch no table data). `load`/`mutate` build inline through one chokepoint (`build_indices_on_dataset_for_catalog`, type-dispatched by `node_prop_index_kind`: enum + orderable scalar → BTREE, free-text String → FTS, Vector → vector) that fault-isolates an untrainable Vector column into a *pending* index instead of aborting. `optimize`/`ensure_indices` is the reconciler: it creates declared-but-missing indexes and folds appended/rewritten fragments into existing ones (`optimize_indices`), reporting still-pending columns. Explicit maintenance call, not yet a background loop | [indexes.md](../user/search/indexes.md), [maintenance.md](../user/operations/maintenance.md) |
| Traversal IDs | Runtime still builds `TypeIndex`; Lance stable row-id based graph IDs are roadmap | [architecture.md](architecture.md), [query-language.md](../user/queries/index.md) |
| Auth | Bearer token hashing and server-side actor resolution are implemented at the HTTP boundary | [server.md](../user/operations/server.md), [policy.md](../user/operations/policy.md) |
| Tests | Tempdir-backed Lance tests are the current substrate; the storage adapter has an in-memory backend for adapter-level contract tests, but Lance datasets bypass it | [testing.md](testing.md) |

The branch-delete reconciler is authority-derived: it reclaims orphaned forks
today and degrades to a no-op if Lance ships an atomic multi-dataset branch
operation, so the design composes with that future rather than blocking it. This
is the same shape as invariant 7 (indexes are derived state); prefer it over a
recovery-sidecar-style approach for any new multi-dataset metadata operation,
since the sidecar would be scaffolding to remove once the substrate closes the gap.

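The authority-derived shape described above (flip the manifest first, then reclaim whatever it no longer references) can be sketched as a set difference. This is a simplified model under stated assumptions: `Graph`, `delete_branch`, and `reconcile_orphans` are hypothetical names, and plain in-memory sets stand in for the manifest and for per-table Lance branch refs.

```rust
use std::collections::BTreeSet;

/// Hypothetical model: the manifest is the single authority on which
/// branches exist; per-table branch refs are derived state.
struct Graph {
    manifest_branches: BTreeSet<String>,
    table_refs: BTreeSet<(String, String)>, // (table, branch)
}

impl Graph {
    /// Step 1: flip the authority. Once this returns, the branch is
    /// logically gone even if physical refs still exist.
    fn delete_branch(&mut self, branch: &str) -> bool {
        self.manifest_branches.remove(branch)
    }

    /// Step 2 (may lag, may be retried): reclaim every ref whose branch the
    /// manifest no longer lists. Returns the refs reclaimed on this pass.
    fn reconcile_orphans(&mut self) -> Vec<(String, String)> {
        let live = &self.manifest_branches;
        let orphans: Vec<(String, String)> = self
            .table_refs
            .iter()
            .filter(|(_, branch)| !live.contains(branch))
            .cloned()
            .collect();
        for orphan in &orphans {
            self.table_refs.remove(orphan);
        }
        orphans
    }
}
```

Because the reconciler derives its work list from the authority on every pass, a second run finds nothing to do, and it would likewise find nothing if the substrate had already removed the refs atomically.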
## Known Gaps

Do not hide these behind invariant wording. Either move them forward or keep
them explicit.

- **Rename-stable schema identity:** the invariant is that accepted IDs survive
  renames. The current compiler still derives type IDs from `kind:name`; this
  must be fixed before relying on renamed IDs across accepted schemas.
- **Storage abstraction:** `TableStorage` is present, sealed, and canonical for
  staged writes. MR-854 sealed it: `db.storage()` exposes only staged primitives
  plus reads, and the inline-commit residuals are split onto a separate sealed
  `InlineCommitResidual` trait reached via `db.storage_inline_residual()`, so a
  new writer cannot couple a write with a HEAD advance through the default
  surface. The dead legacy methods (`append_batch` on the trait,
  `merge_insert_batch{,es}`, `create_{btree,inverted}_index`) were removed. The
  remaining residuals are `delete_where` and `create_vector_index`. The Lance
  6.0.1 → 7.0.0 bump landed, so the staged two-phase delete API
  (`DeleteBuilder::execute_uncommitted`, Lance #6658) is now available and MR-A
  is unblocked — but the migration itself is still pending, so `delete_where`
  stays inline for now. `create_vector_index` remains gated on Lance #6666
  (still open). See [lance.md](lance.md) and [writes.md](writes.md). New write
  paths should use the staged shape unless a documented Lance blocker applies.
- **Deletes and vector indexes:** `delete_where` and vector index creation still
  advance Lance HEAD inline. The public delete two-phase API now exists (Lance
  #6658 shipped in 7.0.0), so the delete residual is unblocked pending the MR-A
  migration; vector index creation is still blocked (Lance #6666 open). Keep D2
  and recovery coverage in place until those residuals are removed.
- **Blob-column compaction:** Lance `compact_files` mis-decodes blob-v2 columns
  under its forced `BlobHandling::AllBinary` read ("more fields in the schema
  than provided column indices"), so `optimize` skips any table with a `Blob`
  property — reporting `SkipReason::BlobColumnsUnsupportedByLance` (loud, not a
  silent drop) behind the `LANCE_SUPPORTS_BLOB_COMPACTION` gate. Reads and writes
  are unaffected; only space/fragment reclamation on blob tables is deferred.
  Remove the skip when the upstream Lance fix lands — the
  `lance_surface_guards.rs::compact_files_still_fails_on_blob_columns` guard
  turns red on that bump to force it.
- **Recovery is serialized against live writers in-process only:** the
  write-entry heal (and `refresh`) serialize against a live writer's sidecar
  lifetime via the per-`(table, branch)` write queues plus the schema-apply
  serialization key — all in-process primitives. A recovery pass in one
  process cannot serialize against a live writer in another (the open-time
  sweep has the same exposure, and always has): it may roll a live foreign
  writer's sidecar forward, which degrades to publisher-CAS contention for
  data writes but can race the schema-staging promotion for a foreign live
  schema apply. The roll-**forward** CAS contention is now
  convergence-idempotent: when the publish loses the CAS to a concurrent
  writer that already reached the sidecar's goal, the sweep treats it as
  convergence (record the `RolledForward` audit + delete) rather than a fatal
  `ExpectedVersionMismatch`, and defers when the manifest is only partway
  (`converge_or_defer_roll_forward` in `db/manifest/recovery.rs`;
  iss-schema-apply-reopen-recovery-race). So a concurrent advance no longer
  fails the open. The schema-staging promotion race and the destructive
  roll-**back** path (Lance `Restore` "trumps" a concurrent commit, so it
  cannot be made idempotent — iss-recovery-sweep-live-writer-rollback) still
  need the cross-process primitive. Multi-process writers on one graph are
  already documented one-winner-CAS territory; closing this fully needs a
  cross-process serialization primitive (e.g. lease-based use of the
  schema-apply lock branch) — design it before promoting multi-process write
  topologies.
- **Fork reclaim is in-process-safe only:** the first write to a table on a
  branch forks it (a Lance `create_branch` that advances state before the
  manifest publish). An interrupted fork (crash, or a cancelled request
  future) leaves a manifest-unreferenced branch ref. The next write self-heals
  it — `reclaim_orphaned_fork_and_refork` (`force_delete_branch` + re-fork)
  — but reclaim is only safe because the writer holds the per-`(table,
  branch)` write queue from before the fork through the publish AND re-checks
  the live manifest under it, so no *in-process* writer can be mid-fork. A
  reclaim cannot serialize against a foreign-*process* in-flight fork: it may
  force-delete a peer's just-created ref, which makes that peer's commit fail
  and retry — the same one-winner-CAS exposure as above, not corruption. The
  reclaim never fires unless in-process-queue + manifest authority both prove
  the ref is manifest-unreferenced. `cleanup`'s per-table reconciler
  (`reconcile_orphaned_branches`) is the guaranteed backstop for any fork the
  write path never revisits. Both degrade to a no-op if Lance ships an atomic
  multi-dataset branch op.
- **Local `write_text_if_match` is not a cross-process CAS:** object-store
  backends use a true conditional put (ETag If-Match; the in-memory test
  backend too), but upstream `object_store` leaves `PutMode::Update`
  unimplemented for `LocalFileSystem`, so the local path emulates CAS with
  a content-token compare followed by an atomic replace — a check-then-act
  gap plus content-token ABA. Every current caller goes through the cluster
  lock protocol first, which makes this safe. A lock-free caller would get
  S3-correct but local-racy behavior — the same divergence shape as the
  acknowledged-before-visible bug this branch fixed. Close it (local CAS
  primitive, or a trait-level lock requirement) before admitting any
  lock-free `if_match` caller.
- **Manifest→commit-graph publish atomicity — CLOSED (RFC-013 Phase 7):** graph
  lineage now lives ONLY in `__manifest`, as `graph_commit` + `graph_head:<branch>`
  rows written in the SAME `MergeInsertBuilder` commit as the table-version rows
  (`commit_changes_with_lineage` → `GraphNamespacePublisher::publish` with a
  `LineageIntent`). There is no second write to fail between — a graph commit and
  its lineage land at one manifest version atomically, so a crash after the publish
  leaves no gap. The commit-graph cache is a derived projection of those manifest
  rows; nothing writes `_graph_commits.lance` (it persists only to carry branch
  refs). The prior two-write gap (manifest at N with no `_graph_commits` row for N)
  is gone by construction. A graph created before Phase 7 (internal schema v3)
  carries its lineage only in `_graph_commits.lance`; the `migrate_v3_to_v4`
  internal-schema step (`db/manifest/migrations.rs`) backfills it into `__manifest`
  per-branch on the first read-write open (idempotent, crash-safe, data-preserving),
  and a read-only open of an un-migrated v3 graph sources the DAG from
  `_graph_commits.lance` via a stamp-gated transitional fallback so reads stay
  correct until the first write migrates it. An old binary refuses a v4-stamped
  graph (read-write and read-only) with the standard upgrade error. The migration
  is **loud on failure and concurrent-runner idempotent**: the legacy-open read
  (`read_legacy_commit_cache`) treats only a genuine not-found as "no legacy data"
  and propagates any other open error (so a transient/corrupt open can never stamp
  v4 over an empty backfill — orphaning lineage permanently), and the backfill
  converges all-or-nothing when two runners open the same legacy graph at once — a
  bounded re-open retry on the `graph_head:<branch>` row-level CAS plus an
  idempotent terminal stamp bump (both runners write the same value, so a concurrent
  `UpdateConfig`/`IncompatibleTransaction` loss re-opens and no-ops if the stamp
  already landed). The branch read path (`load_commit_cache_for_branch`) also
  refuses an out-of-range branch stamp (`> CURRENT` or `< MIN_SUPPORTED`;
  defense-in-depth; not a live hole because migrations run main-first, so main
  refuses first). The migration chain is **floor-bounded**:
  `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` (migrations.rs; 1 today, a pure no-op) is
  the oldest stamp this binary opens, enforced symmetrically with the ceiling by the
  single `refuse_if_stamp_unsupported` guard at all three stamp-read sites
  (write-path migrate, read-only open, branch lineage-read). Raising MIN sheds the
  now-dead `migrate_vN_…` arms and (at MIN ≥ 4) the `commit_graph_legacy_v3` legacy
  readers; a compile-time tripwire (`LOWEST_REGISTERED_MIGRATION_SOURCE`) fails the
  build if the floor and the lowest registered arm drift. Retirement runbook lives on
  the `MIN_SUPPORTED_INTERNAL_SCHEMA_VERSION` doc-comment.
- **Planner capability/stat surfaces:** cost-aware planning, complete
  capability advertisement, and explain-with-cost are roadmap. Do not describe
  them as implemented.
- **Traversal execution:** current multi-hop execution still uses `TypeIndex`,
  ad-hoc ID filtering, and eager materialization in places. Stable row IDs, SIP,
  and factorization are target patterns, not current fact.
- **Retrieval ranks:** hybrid search works, but rank/score are not yet carried
  everywhere as ordinary columns through the plan.
- **Policy pushdown and `Source`:** Cedar enforcement is at the HTTP boundary
  today, and imports are still loader-shaped. Planner predicates and a unified
  `Source` operator are roadmap.
- **Resource bounds:** some operations still lack enforced per-query memory or
  time budgets. New long-running work should add explicit bounds rather than
  widening the gap.
- **Read-path re-derivation (largely closed by the query-latency work):**
  snapshot resolution used to re-open a fresh coordinator per read (a full
  `__manifest` re-scan plus two commit-graph scans), open each table through the
  namespace (two more `__manifest` scans per table), validate the schema twice,
  and share no Lance `Session`. That was an O(commits) cost that never warmed up.
  Fix 1 (warm coordinator reuse behind a `latest_version_id` probe), Fix 2 (open
  tables by location+version), finding A (validate once), and Fix 3 (a held
  `Dataset`-handle cache keyed by `(table, branch, version, e_tag when Lance
  exposes it)` plus one shared `Session` per graph) remove that tax: a warm
  same-branch read does one probe, one schema read, and zero opens on a repeat.
  Non-main branch freshness compares the manifest incarnation (`version` plus
  manifest-location e_tag when available, otherwise Lance manifest timestamp),
  because Lance branch names can be deleted/recreated at the same version number;
  the manifest e_tag is carried into synthetic snapshot ids when available, and
  a detected same-branch manifest refresh clears read caches as the fallback for
  e_tag-less table locations/topology. Remaining: `optimize` now compacts the
  internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2),
  so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat
  in history; but they are not yet brought into `cleanup` (version GC), so the
  `_versions/` chain still grows until an explicit cleanup (the cleanup half is
  deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
  graph IS now reconcilable from the manifest (RFC-013 Phase 7 — it is a pure
  projection of the `graph_commit`/`graph_head` rows); the traversal id-map is
  still rebuilt.
- **Commit-graph parent under concurrency — CLOSED (RFC-013 Phase 7):** the graph
  commit is now recorded in the manifest publish CAS, and the publisher resolves
  the new commit's parent INSIDE its retry loop, per attempt, from the just-loaded
  `__manifest` (the `should_replace_head` winner over the visible `graph_commit`
  rows). A CAS-conflict retry re-reads the advanced head and parents correctly, so
  the refresh-then-append TOCTOU is gone. Two processes writing disjoint tables on
  the same branch now also contend on the shared `graph_head:<branch>` row (one
  `object_id`, `WhenMatched::UpdateAll`): one wins, the other retries and re-parents
  — so the cross-process disjoint-table fork is closed too. This is the intended
  §7.1 contention point, pinned by
  `manifest::tests::concurrent_disjoint_writes_share_head_and_form_linear_chain`
  (two disjoint writers → both commit, single linear chain) and
  `manifest::tests::n_concurrent_disjoint_writers_converge_to_one_linear_chain`
  (N=8 disjoint writers with app-level retry → one linear chain of 8, no fork).

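The last item's mechanism, resolving the parent inside the retry loop from the just-loaded manifest while keeping the commit id stable, can be shown in miniature. The following is a sketch under stated assumptions, not the real publisher: `Manifest`, `CasConflict`, and `publish_with_retry` are hypothetical, a single version counter stands in for the Lance merge-insert CAS, and the `interfere` callback exists only to simulate a concurrent writer between load and publish.

```rust
/// Hypothetical in-memory stand-in for a branch-scoped `__manifest`.
#[derive(Clone, Default)]
struct Manifest {
    version: u64,
    head: Option<String>,                   // id of the current head commit
    commits: Vec<(String, Option<String>)>, // (commit id, parent id)
}

#[derive(Debug, PartialEq)]
struct CasConflict;

impl Manifest {
    /// Compare-and-swap publish: succeeds only if the caller loaded the
    /// current version. The commit row and the head move land together.
    fn publish(
        &mut self,
        loaded_version: u64,
        commit_id: &str,
        parent: Option<String>,
    ) -> Result<(), CasConflict> {
        if loaded_version != self.version {
            return Err(CasConflict);
        }
        self.commits.push((commit_id.to_string(), parent));
        self.head = Some(commit_id.to_string());
        self.version += 1;
        Ok(())
    }
}

/// Retry loop: `commit_id` is minted once by the caller and never changes;
/// the parent is re-resolved from a fresh load on every attempt.
/// Returns the attempt number that succeeded.
fn publish_with_retry(
    manifest: &mut Manifest,
    commit_id: &str,
    max_attempts: u32,
    mut interfere: impl FnMut(u32, &mut Manifest),
) -> Result<u32, CasConflict> {
    for attempt in 1..=max_attempts {
        let loaded = manifest.clone();    // load the manifest for this attempt
        let parent = loaded.head.clone(); // resolve the parent from that load
        interfere(attempt, manifest);     // a concurrent writer may win here
        if manifest.publish(loaded.version, commit_id, parent).is_ok() {
            return Ok(attempt);
        }
    }
    Err(CasConflict)
}
```

If the parent were resolved once before the loop, a lost CAS would republish with a stale parent and fork the chain; resolving it per attempt is what keeps the history linear after a conflict.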
## Deny-list

If a proposal fits one of these, the burden is on the proposer to prove why the
case is exceptional.

- Custom WAL, transaction manager, buffer pool, page format, or storage engine.
- Per-table graph publishing outside the manifest publisher.
- Re-reading current branch head during a query instead of using the captured
  snapshot.
- New write paths that can advance Lance HEAD before manifest publish without a
  recovery sidecar.
- Cross-query `BEGIN`/`COMMIT` transactions in the OSS engine. Use branches and
  merges for multi-query workflows.
- Acknowledging writes before durable Lance and manifest persistence.
- Silent fallback to eventual consistency, partial results, or dropped rows.
- State that drifts from Lance or the manifest when it can be derived.
- Job queues for manifest-derivable state where a reconciler is the right shape.
- Synchronous inline vector/FTS index rebuilds on the query commit path, except
  for documented Lance API residuals.
- Side-channels for query semantics: hidden globals, magic strings, transport
  flags, or out-of-band metadata.
- Cost-blind plan choice when statistics are available or required.
- Hidden statistics for behavior that affects planning or operator choice.
- Hash-map iteration order in result ordering, plan choice, or migration output.
- Cold re-derivation on the hot path: rebuilding from the full source what could
  be held warm and refreshed cheaply, so cost scales with history rather than the
  working set (the cost face of invariant 15; "state that drifts" above is its
  shadow-copy face).
- String-flattened SQL/filter generation when a structured pushdown API is
  available.
- Eager multi-hop cross-product materialization when factorization fits.
- Ad-hoc `IN`-list filtering where SIP or another structured selectivity path
  fits.
- Discarding retrieval score/rank before fusion or projection decisions.
- Auto-creating placeholder nodes for orphan edges.
- Raw filesystem I/O for cluster-stored state (ledger, lock, sidecars,
  approvals, catalog) outside the cluster crate's storage module — every
  stored byte goes through the engine `StorageAdapter` so `file://` and
  `s3://` stay one code path.
- Wire-protocol-specific code in compiler or engine crates.
- Cloud-only correctness fixes or forks of the OSS engine for correctness.
- Mutating immutable substrate state in place, including Lance fragments or
  index segments.
- Shipping observable behavior as if it were not part of the contract. Output
  ordering, error text, timestamp precision, defaults, and latency profiles all
  become dependencies once exposed.

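The hash-map iteration-order item is easy to violate by accident, so a small illustration may help. The sketch below is hypothetical (`label_counts` is not an engine function): it accumulates in a `HashMap`, whose iteration order Rust leaves unspecified, and then imposes an explicit order before anything reaches the caller.

```rust
use std::collections::HashMap;

/// Counts rows per label. A `HashMap` is fine for accumulation, but its
/// iteration order is unspecified, so it must not leak into the output.
fn label_counts(labels: &[&str]) -> Vec<(String, usize)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for label in labels {
        *counts.entry(*label).or_insert(0) += 1;
    }

    let mut out: Vec<(String, usize)> = counts
        .into_iter()
        .map(|(label, n)| (label.to_string(), n))
        .collect();

    // Explicit, documented order: count descending, then label ascending.
    out.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    out
}
```

With the sort in place the output depends only on the input multiset, not on insertion order or the hasher's per-process seed. Accumulating in a `BTreeMap` would be an alternative when key order alone is the desired output order.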
## Review Checklist

Use this as yes/no/NA for any non-trivial design or PR:

- Does it respect Lance/DataFusion instead of rebuilding them?
- Does it preserve manifest-atomic graph visibility?
- Does every query keep one snapshot for its lifetime?
- Do mutations publish once at the commit boundary?
- Can every Lance-HEAD-before-manifest gap recover all-or-nothing?
- Are schema and edge integrity checks strict by default?
- Are query semantics represented in AST/IR/planner structures?
- Are transport, auth, and policy boundaries preserved?
- Are failures bounded, typed, and observable?
- Are result ordering and plan choices deterministic within a snapshot?
- Are stats/capabilities exposed when behavior depends on them?
- Are existing known gaps left no worse and documented if touched?
- Does the test live at the same boundary as the change?
- Is this operation's cost bounded with respect to history and scale, rather
  than re-deriving warm state from cold storage per call?
- Does the change avoid every deny-list pattern, or justify the exception?

## Maintenance Policy

Update this file when an invariant changes, a known gap opens or closes, or a
new review anti-pattern deserves deny-list treatment. Prefer stable headings
over numbered sections so other docs can link here without churn.

Removing or relaxing a hard invariant requires the same review process as code.
Adding a known gap is acceptable when it makes reality explicit; leaving stale
claims is not.