omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-27 02:39:38 +02:00

Author	SHA1	Message	Date
Ragnor Comerford	7d3a52d674	feat(engine): `WriteTxn` - validate schema + open each data table once per write (#298) * docs(rfc-013): step-3b handoff + §4.1 corrections (validated) Add the RFC-013 write-path handoff doc, and correct §4.1's WriteTxn sketch from the 4-subagent validation against current code: - HandleCache → handle-threading (forward the commit-return handle; a version-keyed cache misses because HEAD walks N→N+1→N+2 across staging + index-build commits). - "re-resolution unrepresentable" softened to "pinned base for the pre-commit phase + named fresh re-reads at the commit/fork boundary" — three reads (commit-time OCC, the live-HEAD drift probe, fork authority) are irreducible correctness machinery. - WriteParams DOES carry a session field; the real constraint is "stage off an open Dataset," so attach the Session by opening read-style then staging off it. * test(engine): RED step-3b capture-once fitness asserts + open_count probe Two write-path cost gates, RED today, GREEN after the WriteTxn lands: - write_validates_schema_contract_once: a write must validate the schema contract once (3 read_text + 2 exists). 
Today re-validates at every resolve point — measured 12 read_text / 9 exists (~4 validations) via CountingStorageAdapter (zero production change; the write twin of the read-path schema-once test). - keyed_insert_opens_table_at_most_once: a keyed single-table write must open its table <=1x. Today measured 10 opens. Adds an exact open-CALL probe: open_count + record_open() on QueryIoProbes (mirroring probe_count/record_probe), called at both open chokepoints; surfaced as IoCounts.open_count. forbidden_apis guarantees every write open routes through them. * feat(engine): WriteTxn carrier + open_write_txn (3b scaffolding) The capture-once write transaction (RFC-013 step 3b): WriteTxn{branch, base: Snapshot, session} + Omnigraph::open_write_txn, which validates the schema contract once and pins the base snapshot + the shared per-graph Session. Landed as reviewed scaffolding (gated #[allow(dead_code)]); the next pass threads Option<&WriteTxn> through open_for_mutation_on_branch / staging on the non-strict bound-branch path — opening the base once from the pinned entry with the warm session (a session-aware pinned opener returning a SnapshotHandle) and skipping the per-table schema re-validation — to turn the two RED cost gates green. Strict ops / fork / the commit-time OCC re-read keep their fresh reads. * test(engine): scope write-path open_count to data tables (RFC-013 step 3b) The keyed_insert_opens_table_at_most_once gate asserted open_count <= 1, but open_count was a single unclassified counter: record_open() fires in both open chokepoints, and open_dataset_tracked also opens the internal/system tables (__manifest via layout.rs, _graph_commits/_graph_commit_actors via commit_graph.rs). So the count conflated data-table opens with the publisher CAS + commit-graph append opens — making the gate measure the wrong quantity and unreachable by threading alone (the manifest publish keeps it >1 regardless). 
Scope it by table class, mirroring the read-side counters (which already split by URI prefix via separate wrappers): record_open(uri) classifies the open's last path segment and feeds data_open_count vs internal_open_count. IoCounts exposes both; the gate now asserts data_open_count <= 1. Re-baselined: a single keyed insert is data_open_count=4 / internal_open_count=6 (sum 10, the old conflated value). The RED target for the WriteTxn threading is now the real data-table-open count (4 -> 1), with internal opens correctly out of scope. Pure test-harness/instrumentation; no production behavior change (classification runs only inside the probe closure, skipped when no probes are installed). Also marks #297 (optimize-vs-write race) as landed in the step-3b handoff — this branch is already stacked on origin/main after it merged. * feat(engine): validate the schema contract once per write (RFC-013 step 3b) A single mutate/load re-validated the schema contract ~4 times: at the entry (ensure_schema_state_valid), per-table in open_for_mutation_on_branch (resolved_branch_target), at the commit-time OCC re-read (fresh_snapshot_for_branch), and in the publisher's index-build snapshot (snapshot_for_branch). Each validation is 3 read_text + 2 exists on the storage adapter — O(touched resolve-points) of redundant contract I/O on every write. Thread the already-landed WriteTxn carrier through the write path: capture `txn = open_write_txn(branch)` once at the mutate/load entry (the single validation), then source the per-table entry and the commit/publish snapshots from `txn.base` instead of re-resolving. When `txn` is None (branch merge, schema apply, tests) every function is byte-identical to before. - mutate_with_current_actor / load_jsonl_reader capture txn once (replacing the entry-point ensure_schema_state_valid) and thread Some(&txn) through execute_/open_table_for_mutation, commit_all, and commit_updates_on_branch_with_expected. 
- open_for_mutation_on_branch sources (snapshot, branch) from txn.base/txn.branch when present — skipping resolved_branch_target's re-validation. The OPEN itself is unchanged (still HEAD via open_dataset_head_for_write), and strict ops keep ensure_expected_version. Schema-once applies to strict and non-strict alike; the data-open collapse is a separate change. - commit_all uses fresh_snapshot_for_branch_unchecked (the OCC manifest re-read minus the schema re-validation) when txn is present; the drift guard is unchanged. - prepare_updates_for_commit uses txn.base for the publisher index-build snapshot. fresh_snapshot_for_branch{,_unchecked} now read the manifest directly via ManifestCoordinator instead of resolve_target. The OCC re-read consumes only the Snapshot (per-table location + version), which ManifestCoordinator::open().snapshot() produces identically — but resolve_target additionally opened the commit graph (a spurious _graph_commits.lance exists probe the OCC read never consults). Dropping that load is a pure read-cost reduction for every fresh-snapshot caller (commit_all's None arm, optimize, repair, fork reclaim); the returned Snapshot is unchanged and the read is a fresher cold manifest re-read, so the OCC freshness guarantee is preserved. Greens write_validates_schema_contract_once (3 read_text / 2 exists, was 12/9). keyed_insert_opens_table_at_most_once stays red (data_open_count=4) — the open collapse lands next. Full engine suite green otherwise. feat(engine): open each data table once per write (RFC-013 step 3b) A single keyed-node mutate opened its data table 4 times: accumulation (to read .version()), staging (the real write base), the commit-time drift guard (to read live HEAD), and the publisher's index build (reopen at the just-committed version). Collapse three of the four — using the WriteTxn carrier threaded for schema-once — so a write opens each touched data table at most once. 
- #1 accumulation: open_for_mutation_on_branch now returns (Option<SnapshotHandle>, expected_version, full_path, table_branch). On the txn's own branch, a non-strict (Insert/Merge) op needs no open — the only thing the caller reads is .version() (the CAS fence), which is exactly the pinned base version (entry.table_version). So skip open_dataset_head_for_write and source the version from txn.base. The node insert path already discarded that handle; the edge path resolves a pinned read only when non-default cardinality needs it. STRICT ops and any write that must fork still open live HEAD + ensure_expected_version. - #3 commit drift guard: commit_all reads live HEAD via entry.dataset.dataset().latest_version_id() — a cheap manifest-pointer probe off the already-open staging handle (the same primitive ManifestCoordinator::probe_latest_version uses) instead of a fresh open_dataset_head_for_write. The head<current / head>current drift classification is byte-identical. - #4 index build: commit_all now returns the per-table post-commit_staged SnapshotHandle map; commit_updates_on_branch_with_expected threads it into prepare_updates_for_commit, which builds indices on the threaded handle instead of reopening at the same just-committed version. Absent a handle (other writers, inline/delete tables) the reopen path is byte-identical. When txn is None (branch merge, schema apply, tests) every function opens and checks exactly as before. Greens keyed_insert_opens_table_at_most_once (data_open_count 4->1). Schema-once gate stays 3/2. Full engine suite + failpoints (recovery sidecar lifecycle) green. * refactor(engine): name the write-path open/commit returns (RFC-013 step 3b) The open collapse left two positional returns that are easy to mis-thread and carry an unwritten contract: open_for_mutation_on_branch's (Option<SnapshotHandle>, u64, String, Option<String>) and commit_all's 5-tuple (updates, expected_versions, sidecar_handle, guards, committed_handles). 
Replace both with named structs so each field reads at the call site and the Option's contract is documented, not folklore. - OpenedForMutation { handle, expected_version, full_path, table_branch } with a require_handle(ctx) helper for the callers that must have a handle (strict ops, the fork path, every no-txn caller — branch merge, the seed test). The handle is None only on the non-strict-txn open-skip path (collapse #1); require_handle panics with a named context if that contract is ever broken. - CommittedMutation { updates, expected_versions, sidecar_handle, guards, committed_handles } for commit_all; consumers destructure into the same local bindings they already used, so the publish/sidecar/guard-hold logic is unchanged. - A debug_assert in open_table_for_mutation pins the skip contract: a missing handle is legal only on the non-strict txn path, so a future strict arm returning None trips in debug builds instead of handing None to a require_handle consumer. Pure refactor — no behavior change. Both cost gates stay green (schema 3/2, data_open_count=1), full engine suite + lib (162) green. * refactor(engine): drop the unearned session field from WriteTxn (RFC-013 step 3b) The open collapse greens data_open_count<=1 by SKIPPING the accumulation open, PROBING live HEAD with latest_version_id, and REUSING the commit_staged handle — none of which consume a session. The captured WriteTxn.session was therefore dead (`#[allow(dead_code)]`): unearned surface a reviewer rightly flags. Remove it. The carrier is now {branch, base} — exactly what schema-once + the open collapse use. Step 5 (PublishPlan unification) makes WriteTxn the non-optional publish carrier and is the right home for session-aware base opens, where the warm-session benefit on the single remaining open — an object-store (S3) phenomenon, invisible on local FS — can be earned by its own cost gate rather than carried dead through this PR. 
No behavior change; both cost gates stay green (schema 3/2, data_open_count=1). * docs(rfc-013): mark step 3b DONE — schema-once + open-collapse shipped, session deferred to step 5 * docs(rfc-013): capture the write-base-staleness convergence (§1d) Three findings this cycle share one root — the write base is a stale, un-probed, un-classified pin (the read path probes; the write path returns the warm coordinator snapshot): - #298 edge-@card stale-read regression (cursor High / codex P1, VALID): collapse #1 made the cardinality scan read txn.base instead of live HEAD, so a concurrent edge is uncounted and a max can be exceeded. Fix on #298: restore the live-HEAD read + deterministic test + correct the single-writer doc comment. - The structural liability underneath: no unified write-validation read-set — endpoint/cardinality/uniqueness each pick freshness ad hoc (warm/pinned/live), the same cardinality check forks mutation-vs-loader, none re-validated at commit. - The served-strict-write stale-view false-fail (validated on prod + a #[ignore] repro): a strict update/delete false-fails ExpectedVersionMismatch after an external optimize advance — the write-side mirror of #297/§6.6. The naive blanket probe is proven wrong (breaks the cross-process lost-update OCC contract). All three converge on Design A (step 5): open_txn's warm probe makes the base fresh, the op-class-aware precondition (derive maintenance vs logical from Lance per-version transaction metadata — no parallel marker) fast-forwards maintenance and fails logical, and §7.1's read-set-in-CAS unifies + re-validates the validation read-set. §8 records the #298 follow-up, the widened §7.1 scope, and the step-5 two-test acceptance contract. 
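The stale-read finding above can be illustrated with a minimal sketch. It assumes a toy in-memory edge table in which every version is a full copy of the edge list; `EdgeTable`, `commit_edge`, and `check_card_max` are hypothetical names for illustration, not omnigraph APIs. The point is only that a cardinality scan against a pinned base version misses an edge committed after the pin, while a scan against live HEAD sees it.

```rust
/// Hypothetical in-memory edge table: each version is the full list of
/// (src, dst) edges, so any past version can be scanned.
pub struct EdgeTable {
    versions: Vec<Vec<(String, String)>>,
}

impl EdgeTable {
    /// Starts at version 0 with no edges.
    pub fn new() -> Self {
        Self { versions: vec![Vec::new()] }
    }

    /// The live HEAD version.
    pub fn head_version(&self) -> usize {
        self.versions.len() - 1
    }

    /// Commits one edge on top of HEAD and returns the new HEAD version.
    pub fn commit_edge(&mut self, src: &str, dst: &str) -> usize {
        let mut next = self.versions[self.head_version()].clone();
        next.push((src.to_string(), dst.to_string()));
        self.versions.push(next);
        self.head_version()
    }

    /// Checks a `@card(0..max)` bound for ONE new edge out of `src`,
    /// counting existing edges as of `scan_version`.
    pub fn check_card_max(&self, scan_version: usize, src: &str, max: usize) -> Result<(), String> {
        let existing = self.versions[scan_version]
            .iter()
            .filter(|(s, _)| s == src)
            .count();
        if existing + 1 > max {
            Err(format!("{src} would have {} edges, max is {max}", existing + 1))
        } else {
            Ok(())
        }
    }
}
```

Scanning at the pinned version admits a second edge that a scan at `head_version()` rejects, which is the shape of the regression. The sketch does not model the residual validate-to-commit window that the message assigns to §7.1.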
* test(engine): RED — edge @card must scan live HEAD, not stale txn.base (#298) Regression guard for the cursor-High/codex-P1 finding on #298: 3b's collapse #1 made the non-strict edge-insert cardinality scan read the pinned txn.base instead of live HEAD (edge_cardinality_read_handle), so a concurrent edge committed after txn capture is uncounted and a @card max is silently exceeded (invariant 9). Deterministic two-handle test (no failpoint): handle A commits WorksAt(Alice->Acme) to the @card(0..1) max; stale handle B (never read since) inserts a second WorksAt for Alice. B's coordinator is stale by construction (the write path doesn't probe), so B scans txn.base (Alice has 0) and wrongly commits the 2nd edge. RED: the insert that must be rejected currently succeeds (panics at unwrap_err). Goes green when the scan reads live HEAD. * fix(engine): scan live HEAD for edge @card, not the pinned txn.base (#298) 3b's collapse #1 skips the non-strict edge accumulation open, so edge_cardinality_read_handle reopened the edge table at the pinned txn.base for the @card scan. Since cardinality is validated once (never rechecked at commit), a concurrent edge committed after txn capture was uncounted and a @card max could be silently exceeded (invariant 9) — the cursor-High/codex-P1 regression on #298. Pre-3b the scan read live HEAD (the mutation's own open_dataset_head_for_write handle). Restore the live-HEAD read: take the table LOCATION from the pinned entry (stable across versions) and open the dataset at its current HEAD via open_dataset_head_for_write. Gate-safe — the data_open_count / merge-insert-only gates are node inserts; the edge cardinality path (non-default @card only) is untouched by them, and the extra live-HEAD open is exactly the pre-3b shape. Also drops the dead None-fallback's schema re-validation (greptile P2, auto-resolved). The residual validate->commit TOCTOU is the pre-existing §7.1 gap (RFC-013 step 4), recorded in handoff §1d/§8. 
Turns cardinality_rejected_for_stale_handle_after_concurrent_edge_commit green; validators / write_cost / writes / consistency / end_to_end / branching all green. * docs(dev): link handoff docs from index * docs(engine): tighten 3b claims to match the code (#298 review) Review caught several comments/docs overclaiming what the code does (the session drop + the #298 cardinality fix left stale/too-strong wording). No logic change. - open_write_txn doc: drop the stale "shared per-graph Session" (WriteTxn no longer carries one); scope "once" to the table-touch hot path and note edge/load RI validation still re-resolves (→ step 4 §7.1) + the session-aware open is step 5. - edge cardinality call-site comment: it said the scan uses a "pinned txn.base" — it now opens LIVE HEAD (#298); corrected. - write_cost.rs: "opens the base once (with the shared Session)" → session-aware base open is deferred to step 5. - data_open_count completeness (instrumentation.rs + write_cost.rs): forbidden_apis only keeps engine code OUTSIDE the storage layer on the chokepoints; table_store.rs is allow-listed and holds direct Dataset::opens for branch-management ops (not the keyed-write hot path the gate measures). Narrowed the claim accordingly. - handoff §4: "schema once / open once" is the node hot path (the two gates); edge endpoint + loader RI/cardinality still re-validate and read warm — #298 un-regresses cardinality only, it does NOT close write-validation freshness (that's step 4 §1d/§7.1). build clean; write_cost / validators / forbidden_apis green.	2026-06-23 21:27:31 +02:00
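The schema-once idea in commit 7d3a52d674 (validate at the write entry, then source later resolve points from the pinned base) can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: `WriteTxn`, `open_write_txn`, `branch`, and `base` are names taken from the commit message, but their bodies here are invented, and `CountingStorage`, `resolve_snapshot`, and `simulate_write` are hypothetical. Each validation is modeled as exactly 3 `read_text` + 2 `exists`, so the legacy path comes out as 12/8 rather than the measured 12/9.

```rust
use std::cell::Cell;

/// Hypothetical storage adapter that only counts schema-contract I/O.
#[derive(Default)]
pub struct CountingStorage {
    read_text: Cell<u32>,
    exists: Cell<u32>,
}

impl CountingStorage {
    /// One validation is modeled as 3 read_text + 2 exists, the
    /// per-validation cost quoted in the commit message.
    fn validate_schema_contract(&self) {
        self.read_text.set(self.read_text.get() + 3);
        self.exists.set(self.exists.get() + 2);
    }

    /// Returns (read_text, exists) totals so far.
    pub fn counts(&self) -> (u32, u32) {
        (self.read_text.get(), self.exists.get())
    }
}

#[derive(Clone, Debug, PartialEq)]
pub struct Snapshot {
    pub table_version: u64,
}

/// Carrier with the two fields the message says survive: {branch, base}.
pub struct WriteTxn {
    pub branch: String,
    pub base: Snapshot,
}

/// Validates the contract once and pins the base snapshot.
pub fn open_write_txn(storage: &CountingStorage, branch: &str, head: u64) -> WriteTxn {
    storage.validate_schema_contract();
    WriteTxn { branch: branch.to_string(), base: Snapshot { table_version: head } }
}

/// A resolve point: reuses the pinned base when a txn is threaded in,
/// and falls back to validate-and-resolve when it is not.
pub fn resolve_snapshot(storage: &CountingStorage, txn: Option<&WriteTxn>, head: u64) -> Snapshot {
    match txn {
        Some(txn) => txn.base.clone(),
        None => {
            storage.validate_schema_contract();
            Snapshot { table_version: head }
        }
    }
}

/// Runs one write with `resolve_points` downstream resolves.
pub fn simulate_write(storage: &CountingStorage, capture_once: bool, resolve_points: usize) {
    let txn = if capture_once {
        Some(open_write_txn(storage, "main", 7))
    } else {
        storage.validate_schema_contract(); // legacy entry-point validation
        None
    };
    for _ in 0..resolve_points {
        let _snapshot = resolve_snapshot(storage, txn.as_ref(), 7);
    }
}
```

Threading `Option<&WriteTxn>` keeps the no-txn callers on their previous path, which mirrors the message's claim that functions are unchanged when `txn` is `None`.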
Ragnor Comerford	6d4606a830	fix(engine): optimize survives a cross-process write race on the same table (#297) * test(engine): cross-process optimize-vs-write race — RED Two regression tests for the prod bug: a direct `optimize` process racing a served write on the same table fails, because the in-process write queue does not serialize across processes and the data-table optimize path has no retry. - optimize_survives_concurrent_insert_advancing_manifest: a concurrent insert advances the manifest while optimize is paused between compact and publish; optimize's equality-CAS publish then fails "expected X but current Y". - optimize_survives_concurrent_delete_before_compaction: a concurrent delete commits before optimize compacts; Lance rebases the compaction past it cleanly, so optimize again fails the publish CAS (the genuine Lance Rewrite-vs-Rewrite overlap is rarer and shares the internal path's retry). Both fail today with ExpectedVersionMismatch. Adds the `optimize.before_compact` failpoint seam + a wait_for_sidecar helper; serializes the optimize failpoint tests (shared failpoint name). The fix lands next. * fix(engine): optimize survives a cross-process write race on the same table The data-table optimize path trusted the in-process write queue and skipped a retry, so a CLI `optimize` racing a served write (separate processes = separate queues) failed: either the Lance Rewrite lost ("preempted by concurrent Update") or the manifest publish lost the strict equality CAS ("expected X but current Y"). Unify both compaction paths on the internal path's reopen+replan shape, with a two-level retry that matches the two failure points: - Outer loop (reopen+replan): a genuine Lance Rewrite-vs-Update/Delete same-fragment conflict means our compaction did not commit — reopen at the new HEAD and re-plan. Lance rebases the common disjoint case (a concurrent insert/delete on other fragments) for free, so this fires only on real overlap. 
- Inner loop (Phase C, monotonic publish): the manifest advanced between our compaction and our publish. The compaction is already committed at Lance HEAD N, so we must NOT reopen (that trips the HEAD>manifest drift guard on our own work). Re-read the current manifest version C: if C >= N the manifest already includes our compaction (versions are linear) — no-op; else fast-forward to N. Monotonic, not the strict equality CAS that manufactured the conflict. The Phase-A sidecar is written once and reused across reopen attempts (every Phase-B commit is content-preserving, so recovery rolls the observed HEAD forward or safely rolls the compaction back). The in-process queue is kept — it is now an in-process contention reducer, not the cross-process correctness guard. Shares the COMPACTION_RETRY_BUDGET constant + is_retryable_lance_conflict with the internal path; adds is_retryable_manifest_conflict for the publish loop. No writer_epoch. Turns the prior commit's two race tests green. * docs(rfc-013): two-op-class principle + the found+fixed optimize-vs-write race §6.6 records the maintenance vs logical op-class distinction (maintenance commutes → Lance rebase + reopen/replan + monotonic manifest fast-forward, no writer_epoch; logical → strict cross-process OCC + epoch) and the prod optimize-vs-served-write race that motivated it, now landed. Adds the matching mechanic row to §4.2. * fix(engine): retry must not misclassify optimize's own HEAD drift Review catch on the cross-process optimize fix: the outer retry loop re-ran the `lance_head > manifest` drift guard every iteration. After a partial Phase-B commit (the auto_cleanup strip or compaction commits, then a later op hits a retryable conflict), the reopened attempt saw HEAD ahead of the manifest — from OUR own sidecar-covered work, not an external writer — and deleted the sidecar + returned `skipped_for_drift`, stranding uncovered drift that then needs `repair`. 
Track `head_advanced` (did one of our Phase-B ops already commit). The drift guard now fires only when `!head_advanced` (genuine pre-existing external drift); once we have advanced HEAD, a reopened HEAD>manifest is our work that the monotonic publish fast-forwards. The no-op early-return likewise publishes prior committed work instead of dropping it when `head_advanced`. Regression test `optimize_retry_does_not_misclassify_own_head_drift` injects one retryable reindex conflict after the compaction commits (new `optimize.inject_reindex_conflict` seam); red→green verified by negative control (reverting the gate reproduces `skipped_for_drift: Some(DriftNeedsRepair)`). Also de-flake `optimize_survives_concurrent_insert_advancing_manifest`: pause at `before_compact` (not post-compact) so the concurrent insert lands while HEAD==manifest — otherwise it could race optimize's committed-but-unpublished compaction and hit the write-path "HEAD ahead of manifest" guard. * fix(engine): optimize publish converges on retry-budget exhaustion Review catch (greptile): the monotonic Phase-C publish loop returned an error on its final iteration's retryable manifest conflict, even though that conflict can itself mean a concurrent writer published a version that already includes our (content-preserving) compaction — i.e. the postcondition ("the manifest reflects our compaction") is already met. Recovery covered it (no data loss), but the operator saw a spurious error and had to re-run. Restructure the loop to re-read `current` on every retryable conflict and, on budget exhaustion, do a final `current >= state.version` convergence check before surfacing the error — the §6.6 "postcondition is the state, not winning the CAS" principle. Factor the repeated current-version read into `current_manifest_version`.	2026-06-22 13:05:28 +02:00
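The monotonic Phase-C publish described in commit 6d4606a830 (no-op if the manifest is already at or past our committed version, otherwise fast-forward, with a final convergence check when the retry budget runs out) can be sketched roughly as below. Everything here is an assumption for illustration: the `Manifest` trait, `publish_monotonic`, `RacyManifest`, and the error type are hypothetical and do not reflect the engine's actual signatures.

```rust
/// Retryable manifest conflict (hypothetical stand-in).
#[derive(Debug, PartialEq)]
pub struct Conflict;

#[derive(Debug, PartialEq)]
pub enum PublishOutcome {
    /// The manifest was already at or past our committed version.
    AlreadyIncluded,
    /// We advanced the manifest to our committed version.
    FastForwarded,
}

#[derive(Debug, PartialEq)]
pub enum PublishError {
    RetryBudgetExhausted { current: u64, wanted: u64 },
}

/// Hypothetical manifest with a compare-and-set style advance.
pub trait Manifest {
    fn current_version(&self) -> u64;
    fn try_advance(&mut self, expected: u64, to: u64) -> Result<(), Conflict>;
}

/// Monotonic publish: the postcondition is "manifest >= committed",
/// not "we won the CAS".
pub fn publish_monotonic<M: Manifest>(
    manifest: &mut M,
    committed: u64,
    retry_budget: usize,
) -> Result<PublishOutcome, PublishError> {
    for _ in 0..retry_budget {
        // Re-read the current version on every attempt.
        let current = manifest.current_version();
        if current >= committed {
            return Ok(PublishOutcome::AlreadyIncluded);
        }
        if manifest.try_advance(current, committed).is_ok() {
            return Ok(PublishOutcome::FastForwarded);
        }
        // Retryable conflict: loop and re-read.
    }
    // Budget exhausted: check convergence once more before erroring.
    let current = manifest.current_version();
    if current >= committed {
        Ok(PublishOutcome::AlreadyIncluded)
    } else {
        Err(PublishError::RetryBudgetExhausted { current, wanted: committed })
    }
}

/// Test double: before each advance, a queued concurrent writer may move
/// the version, making our `expected` stale.
pub struct RacyManifest {
    pub version: u64,
    pub concurrent_bumps: Vec<u64>,
}

impl Manifest for RacyManifest {
    fn current_version(&self) -> u64 {
        self.version
    }

    fn try_advance(&mut self, expected: u64, to: u64) -> Result<(), Conflict> {
        if let Some(bumped) = self.concurrent_bumps.pop() {
            self.version = bumped;
        }
        if self.version != expected {
            return Err(Conflict);
        }
        self.version = to;
        Ok(())
    }
}
```

With a budget of one attempt and a concurrent writer that jumps past the committed version, the final check returns `AlreadyIncluded` instead of an error, which is the behavior the last fix in the message argues for.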
Ragnor Comerford	5cfae9acc1	docs(rfc-013): latency = (serial_hops + ops/concurrency)·RTT — concurrency-cap correction + Lance-metadata comparison (#292) * feat(engine): compact the internal __manifest/_graph_commits tables in optimize `optimize` iterated node/edge catalog tables only, so the two internal system tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and were never compacted -- making every write's metadata scan O(fragments), which grows forever on a long-lived graph (RFC-013 step 2). `optimize_all_tables` now also compacts both internal tables via a new `compact_internal_table`. They are not catalog-tracked (readers open them at their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`: compact in place, no manifest publish (nothing to publish to), no recovery sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no optimize_indices (they carry no Lance index, only object_id's unenforced-PK metadata). No application lock: Lance's compact_files auto-retries its Rewrite against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a coordinator refresh afterwards makes the warm handle observe the compacted HEAD. Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits into __manifest -- a one-call throwaway for the full interim win; __manifest compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup (version GC) of the internal tables is deliberately NOT included here: it needs the Q8 cleanup-resurrection watermark first (deferred). maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar, graph coherent for reads + strict writes after). 
* test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance) `internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance gate staged in PR #288. With internal-table compaction landed, a write's __manifest/_graph_commits scan is flat in commit-history depth on a compacted graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before measuring and runs green every-PR. * docs: RFC-013 step 2 internal-table compaction landed - invariants.md: close the compaction half of the read-path-rederivation known gap (optimize now compacts the internal tables; cleanup half still deferred). - maintenance.md: optimize covers __manifest/_graph_commits (no publish, no sidecar); not yet in cleanup. - rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8 watermark, deferred — debated; MTT-overlap + hot-path liability). - testing.md: the internal-table LOCK is now green every-PR. * fix(engine): guard absent _graph_commits + always compact internal tables Addresses PR #291 review findings: - Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction, but a graph can validly have none (the coordinator opens it as `Option`, gated on `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the absent table errored and failed the whole optimize. Guard the `_graph_commits` compaction with the same `storage_adapter().exists()` check the coordinator uses; `__manifest` always exists so it stays unguarded. Regression test `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish recreates the table before the guard). - Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table compaction for a schema with no node/edge types. Removed it so the internal tables are compacted regardless of the data-table set. 
- Codex (auto-cleanup, P1): documented — `compact_files` commits with a default `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override, so on a graph storing an on auto_cleanup config the commit would fire version GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are safe; the only exposure is pre-fix upgraded graphs, identical to the existing data-table optimize path, with step 2b's watermark as the comprehensive guard. Added a comment in `compact_internal_table` recording this. * docs(rfc-013): serial-hop correction — wall-clock is the ~110-hop backbone, not op count Latency-slope measurement on the deployed edge binary (`f6d2cc03`, steps 1+3a landed; rustfs + per-op latency proxy, depth 1..85) shows wall-clock is set by a ~110-hop SERIAL backbone that is depth-invariant. Total ops grow +~7/depth but PARALLELIZE (parallelism 1->6), so the depth term adds little wall-clock. - New §0(c): the serial-hop vs total-op finding + branch-op backbones (create ~77, delete ~87, branch-write ~258/1777-ops/21s floor = fork-on-first-write). - §2.4: correct the '1720->198 ops => 258s->30s' op-count->wall-clock conversion. - §5.1: promote serial-hop/num_stages to the PRIMARY latency LOCK; op-count flatness demoted to a cost/compute-floor gate. - §9 step 2: reprioritized as Phase-7 prerequisite + compute-floor/space, NOT the wall-clock fix; step 3b (parallel capture-once WriteTxn) is the headline latency lever; branch-write moved under step 3b + fork seam. - Summary: serial-backbone correction up front. Vindicates the §3/§4.1 design; corrects the op-count latency framing. * docs(rfc-013): concurrency-cap correction + Lance-metadata comparison Fold in two measured findings from the deployed edge binary (`f6d2cc03`) on rustfs behind a latency+concurrency proxy: - §0(d): concurrency-cap A/B. 
Under unlimited concurrency the internal-table scan parallelizes (backbone ~110); under an R2-realistic cap (8) it serializes and an UNCOMPACTED graph runs away (per-write ops 1273->3505, wall 6->16s), while #291's internal compaction cuts it ~6x and bounds it (137->1 frag). The latency model is (serial_hops + ops/effective_concurrency)*RTT + compute. - Reframe step 2 across Summary/§2.4/§9: NOT de-ranked — on R2 (capped) it is a primary latency lever + the anti-runaway fix + Phase-7 prereq. The earlier 'step 2 is parallel, irrelevant to latency' was an unlimited-concurrency artifact. Deployed `f6d2cc03` optimize is node/edge-only; #291 (undeployed) is the prod win. - §5.1: the cost-gate ThrottledStore must cap concurrency AND inject latency; assert serial_hops flat AND ops flat in history. - §2.3 + §8: Lance/LanceDB comparison from 7.0.0 source — Lance metadata is a single-file per-version manifest read O(1) (latest_version_hint), pruned by default; omnigraph's __manifest-as-Lance-dataset scan is self-inflicted by the cross-table-atomicity choice. Adds explicit defense of Lance-dataset __manifest (MTT seam) vs a flat-file CAS'd manifest (cheaper, off the MTT path). Design (§3/§4.1) unchanged and vindicated; corrections are measurement framing, step sizing, and one design-choice that was implicit.	2026-06-21 21:54:59 +02:00
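The latency model stated in commit 5cfae9acc1, (serial_hops + ops/effective_concurrency)·RTT + compute, can be written as a one-line function. The function name, parameter names, and the numbers in the usage note are illustrative assumptions, not measurements from the RFC.

```rust
/// Modeled wall-clock latency in milliseconds:
/// (serial_hops + ops / effective_concurrency) * rtt_ms + compute_ms.
/// `serial_hops` is the dependent-request backbone; `ops` is the remaining
/// parallelizable request count, spread over `effective_concurrency` lanes.
pub fn modeled_latency_ms(
    serial_hops: f64,
    ops: f64,
    effective_concurrency: f64,
    rtt_ms: f64,
    compute_ms: f64,
) -> f64 {
    (serial_hops + ops / effective_concurrency) * rtt_ms + compute_ms
}
```

As a made-up example, 110 serial hops and 800 parallelizable ops at a 10 ms RTT give 2100 ms under a concurrency cap of 8, but only 1100 ms when concurrency is effectively unlimited. This is why the message treats op count as a latency lever only once concurrency is capped.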
Ragnor Comerford	f2b792e0ae	(feat): compact the internal manifest/commit-graph tables in optimize (#291) * feat(engine): compact the internal __manifest/_graph_commits tables in optimize `optimize` iterated node/edge catalog tables only, so the two internal system tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and were never compacted -- making every write's metadata scan O(fragments), which grows forever on a long-lived graph (RFC-013 step 2). `optimize_all_tables` now also compacts both internal tables via a new `compact_internal_table`. They are not catalog-tracked (readers open them at their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`: compact in place, no manifest publish (nothing to publish to), no recovery sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no optimize_indices (they carry no Lance index, only object_id's unenforced-PK metadata). No application lock: Lance's compact_files auto-retries its Rewrite against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a coordinator refresh afterwards makes the warm handle observe the compacted HEAD. Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits into __manifest -- a one-call throwaway for the full interim win; __manifest compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup (version GC) of the internal tables is deliberately NOT included here: it needs the Q8 cleanup-resurrection watermark first (deferred). maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar, graph coherent for reads + strict writes after). 
* test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance) `internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance gate staged in PR #288. With internal-table compaction landed, a write's __manifest/_graph_commits scan is flat in commit-history depth on a compacted graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before measuring and runs green every-PR. * docs: RFC-013 step 2 internal-table compaction landed - invariants.md: close the compaction half of the read-path-rederivation known gap (optimize now compacts the internal tables; cleanup half still deferred). - maintenance.md: optimize covers __manifest/_graph_commits (no publish, no sidecar); not yet in cleanup. - rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8 watermark, deferred — debated; MTT-overlap + hot-path liability). - testing.md: the internal-table LOCK is now green every-PR. * fix(engine): guard absent _graph_commits + always compact internal tables Addresses PR #291 review findings: - Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction, but a graph can validly have none (the coordinator opens it as `Option`, gated on `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the absent table errored and failed the whole optimize. Guard the `_graph_commits` compaction with the same `storage_adapter().exists()` check the coordinator uses; `__manifest` always exists so it stays unguarded. Regression test `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish recreates the table before the guard). - Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table compaction for a schema with no node/edge types. Removed it so the internal tables are compacted regardless of the data-table set. 
- Codex (auto-cleanup, P1): documented — `compact_files` commits with a default `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override, so on a graph storing an on auto_cleanup config the commit would fire version GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are safe; the only exposure is pre-fix upgraded graphs, identical to the existing data-table optimize path, with step 2b's watermark as the comprehensive guard. Added a comment in `compact_internal_table` recording this. * fix(engine): retry publish on RetryableCommitConflict (compaction vs publish) Step 2 compacts `__manifest` with no app-level lock (Lance OCC arbitrates, validated against LanceDB + the lance-7.0.0 conflict resolver). compact_files' `Operation::Rewrite` auto-retries 20x (CommitConfig default num_retries=20), so a live publish usually wins the race and the compaction rebases. But the publish runs its merge-insert with conflict_retries(0) = one rebase attempt; if the compaction commits first AND the merge touched a fragment the Rewrite rewrote, Lance preempts the publish with `Error::RetryableCommitConflict` — a DIFFERENT variant from the row-level `TooMuchWriteContention` the publisher already retries. Left unhandled, that surfaces a transient error to the caller, i.e. a maintenance compaction (physical op) failing a live write (logical op) — invariant 7. Map `LanceError::RetryableCommitConflict` to a new `ManifestConflictDetails::RetryableCommitConflict` and treat it as retryable in the publisher's outer loop (reload fresh state + re-merge), alongside RowLevelCasContention. `ExpectedVersionMismatch` still propagates (a genuine expectation break must not be blindly retried). This also hardens multi-process concurrent writers generally, not just compaction. 
Normal publishes are insert-only (new object_ids -> new fragments, disjoint from rewritten old ones), so the conflict is rare; the guard covers the same-fragment-update edge and multi-process writers. Unit tests in publisher.rs pin the mapping + the retry-predicate contract. * revert: publisher RetryableCommitConflict handling (it was the wrong side) Reverts `d138902e`. Validated against lance-7.0.0: the publisher's merge-insert runs with conflict_retries(0), and execute_with_retry converts an exhausted retryable commit conflict to TooMuchWriteContention before the caller sees it (write/retry.rs ~95-130). So map_lance_publish_error NEVER receives RetryableCommitConflict from merge_rows — it receives TooMuchWriteContention, which the publisher already maps to RowLevelCasContention and retries. The reverted mapping was therefore dead on the real path and its unit test was synthetic. The actual exposure is the compaction side: compact_files -> commit_compaction -> apply_commit directly (no execute_with_retry), so a Rewrite-vs-Merge check_txn conflict propagates raw and optimize can fail on a live graph. That is fixed app-side in compact_internal_table in the following commit. * fix(engine): make internal-table compaction correct by construction Address three findings from review of the step-2 internal-table compaction: - Non-destructive by construction: before compacting an internal table, strip any stored `lance.auto_cleanup.` config off it. `compact_files` commits with a default `CommitConfig` (skip_auto_cleanup=false) and `CompactionOptions` exposes no override, so on a graph created by an older binary (on-by-default GC hook) the compaction commit would fire Lance's auto-cleanup and silently prune `__manifest`-pinned versions. Current binaries store no such config; the strip is the upgrade-path safety net so `optimize` can never GC versions. 
- App-level compaction retry: `compact_files` does NOT auto-retry a semantic conflict against a concurrent live writer (Rewrite vs Update/Merge/Delete propagates raw from apply_commit; Lance prescribes app-rerun). Wrap the internal-table compaction in a bounded retry loop that reopens fresh and replans on a retryable Lance conflict, so a maintenance compaction (a physical op) never fails a live write (a logical op) — invariant 7. - Compact all three internal tables, not two: `_graph_commit_actors` grows one fragment per commit on the authenticated write path, the same O(depth) scan as `__manifest`/`_graph_commits`. Drive the sweep from one source-of-truth list with per-table existence guards (the two commit-graph tables are optional). Make `graph_commit_actors_uri` pub(crate). Tests: the `internal_table_scans_are_flat_in_history` LOCK now runs the authenticated (actorful) write path so it covers `_graph_commit_actors` via the shared commit-graph IO wrapper (new `commit_many_as`/`measure_insert_as` helpers); `optimize_clears_stale_auto_cleanup_and_preserves_versions` pins the non-destructive guarantee (config cleared + no version GC); a unit test pins the retryable-conflict classifier; the empty-graph stats count is 7 (the actor table is created at init). docs: internal-table compaction covers all 3 tables, non-destructive, retried Sync the RFC-013 step-2a section and the maintenance guide with the correctness-by-design refinements: - optimize compacts `__manifest`, `_graph_commits`, AND `_graph_commit_actors` (the actor table grows on the authenticated write path). - optimize is non-destructive by construction — it never GCs versions, and strips stale `lance.auto_cleanup.` config so an upgraded graph's commit-time GC hook cannot fire during compaction. - internal-table compaction rebases and retries against concurrent live writers rather than failing the operator's optimize or the live write. - the cost LOCK is the authenticated-path acceptance test. 
fix(engine): refresh coordinator after a config-strip with no compaction work `compact_internal_table` returns early when `plan_compaction` finds no work, but `clear_stale_auto_cleanup_config` may have already committed a config-strip that advanced Lance HEAD. The early return skipped the coordinator refresh that the successful-compaction path performs, leaving warm `__manifest`/commit-graph handles pinned to the pre-strip version until the next read's version probe healed them. No correctness bug (the probe self-heals, and a stale-handle write would retry via publisher CAS), but the refresh makes coherence deterministic rather than probe-dependent. Refresh iff the config-strip actually committed. * docs(engine): correct compact_internal_table doc — compact_files does not auto-retry The function doc claimed "Lance's compact_files auto-retries its Operation::Rewrite against any concurrent writer" — wrong, and contradicting the is_retryable_lance_conflict doc just below it and the explicit retry loop that exists precisely because compact_files does NOT auto-retry semantic conflicts (Rewrite vs Update/Merge/Delete propagates raw through apply_commit). Also move the orphaned description from above the retry-budget const onto the function, and include the third internal table. * test(engine): optimize must clear stale auto_cleanup on DATA tables too (red) Regression test for a destructive bug on the data-table optimize path: on an upgraded graph whose node/edge table still carries pre-v7 lance.auto_cleanup.* config, `optimize`'s compact_files/optimize_indices commits fire Lance's version GC and prune __manifest-pinned data-table versions. Mirrors the internal-table auto_cleanup test on a Person table (force-repair realigns the config-induced drift so optimize doesn't skip the table). Red against the current code: the data-table path does not strip the config. The fix lands in the next commit. 
* fix(engine): clear stale auto_cleanup on the data-table optimize path too The auto_cleanup scrub previously only protected the internal tables; the data-table path (optimize_one_table) ran compact_files/optimize_indices with a default CommitConfig (skip_auto_cleanup=false) and no override, so on an upgraded graph those commits could fire Lance's version-GC hook and prune __manifest-pinned node/edge versions — making the "non-destructive" contract false for data tables. Strip the config before the HEAD-advancing commits, capturing version_before first so the strip's own commit still triggers the Phase-C manifest publish (no uncovered drift). No retry loop needed: the data-table path holds the per-table write queue. Covered by the existing Optimize recovery sidecar. Turns the prior commit's test green. Also: switch clear_stale_auto_cleanup_config off the deprecated delete_config_keys to update_config(None values), and correct two now-inaccurate doc comments — compaction is "one or more content-preserving commits" (compact_files can emit a ReserveFragments before the Rewrite), not "a single atomic commit"; the sidecar-free property rests on content-preservation + read-at-HEAD, not single-commit atomicity. * docs: optimize is non-destructive on all tables; correct atomicity/retry claims - non-destructive guarantee now spans data + internal tables (the auto_cleanup strip runs on both paths), not just the internal ones. - "single atomic Lance commit" was inaccurate: compaction can emit a ReserveFragments commit before the Rewrite; the no-sidecar property rests on content-preservation + read-at-HEAD, not single-commit atomicity. - "retries rather than failing" softened to the truth: a bounded retry on the internal path; sustained contention surfaces a loud conflict error (bounded + observable, not an infinite loop). The data path holds the per-table queue and never contends.	2026-06-21 16:38:20 +02:00
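The bounded retry described for the internal-table compaction path can be sketched roughly as follows. This is a minimal illustration, not the code in `compact_internal_table`: the `CompactError` type, the `MAX_ATTEMPTS` budget, and the `compact_with_retry` helper are hypothetical names, and the closure stands in for "reopen the table fresh, replan, and commit the compaction".

```rust
/// Hypothetical error type for this sketch; it is not Lance's or the engine's real error.
#[derive(Debug, PartialEq)]
enum CompactError {
    /// A concurrent live writer won the commit race; rerunning the attempt may succeed.
    RetryableConflict,
    /// Anything else: propagate immediately without retrying.
    Fatal(String),
}

/// Hypothetical retry budget; the real constant and its value are not stated in the log.
const MAX_ATTEMPTS: usize = 5;

/// Runs `attempt` until it succeeds, hits a non-retryable error, or the budget runs out.
/// Each call to `attempt` is assumed to reopen the table at its current HEAD and replan,
/// so a retry never reuses a stale compaction plan.
/// Returns the number of attempts that were needed on success.
fn compact_with_retry<F>(mut attempt: F) -> Result<usize, CompactError>
where
    F: FnMut() -> Result<(), CompactError>,
{
    for n in 1..=MAX_ATTEMPTS {
        match attempt() {
            Ok(()) => return Ok(n),
            // Retryable: loop again with a fresh open + replan.
            Err(CompactError::RetryableConflict) => continue,
            // Non-retryable: surface it unchanged.
            Err(e) => return Err(e),
        }
    }
    // Sustained contention: fail loudly instead of looping forever.
    Err(CompactError::Fatal(format!(
        "compaction still conflicting after {} attempts",
        MAX_ATTEMPTS
    )))
}
```

The point of the bound is the one the log states: sustained contention becomes an observable error for the operator, while a transient conflict is absorbed without affecting the live write.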
Ragnor Comerford	f6d2cc03e3	write-path cost gate + opener bypass (#288 ) * docs(rfc): RFC-013 write-path latency design + index link * perf(engine): open write-path tables directly, bypassing the namespace builder Write opens routed through DatasetBuilder::from_namespace, whose describe_table opened the whole dataset just to return a location and then re-resolved the latest version — an O(commit-depth) double latest-resolution per table open that missed Lance's O(1) version-hint fast path. On an object store this dominated write latency (~70%, RFC-013 section 2.4). TableStore::open_dataset_head_for_write now delegates to the direct opener (open_dataset_head: Dataset::open by URI + checkout_branch, routed through the tracked opener so cost tests can count it; a no-op in production). The manifest already holds every sub-table's location, so the namespace catalog lookup was redundant; ensure_expected_version still validates head == pinned for strict ops. This completes PR #268's open-by-location migration on the write side. With both reads (PR #268) and now writes bypassing it, nothing in production routes through the per-table Lance namespace. The dead open chain (load_table_from_namespace, open_table_head_for_write) is deleted and the StagedTableNamespace contract apparatus is gated #[cfg(test)], mirroring the already-test-only read namespace; __manifest commit coordination (GraphNamespacePublisher) is a separate component and is unaffected. See docs/dev/rfc-013-write-path-latency.md sections 2.4 and 9 (step 3a). * test(engine): write-path cost-budget gate on a shared harness Adds tests/helpers/cost.rs, a store-agnostic cost harness (IoCounts/StagedCounts, measure/measure_with_staged, assert_flat, local_graph/s3_graph) that the read-side warm_read_cost.rs, write_cost.rs, and write_cost_s3.rs share, so the IOTracker / task-local plumbing lives in exactly one place instead of duplicated per test. 
write_cost.rs (local, every-PR) gates the internal-table scan term flat in commit-history depth (a RED #[ignore]'d LOCK, the acceptance for bringing the internal tables into compaction) plus green guards: a single insert's data writes are bounded, a per-write read-op ceiling fails the moment a round-trip is added, and a keyed insert routes through stage_merge_insert once with no stage_append or vector-index build. write_cost_s3.rs (bucket-gated, rustfs CI) gates the data-table opener term flat across depth — the object-store-RPC phenomenon local FS cannot reproduce, and the red->green proof of the opener bypass. Wired into the rustfs_integration CI job and its path filter. Guards the "hot-path cost is bounded by work, not history" invariant on writes. See docs/dev/rfc-013-write-path-latency.md section 5.1, docs/dev/testing.md. * docs(rfc): RFC-013 step 3a landed; write-skew coupling; cost-gate test map - Section 9: mark step 1 (gate + harness) and step 3a (opener bypass) landed; record the per-table namespace retirement to test-only and the corrected measurement note (the opener win is S3-only; the local data-table growth is the merge-insert/RI fragment scan, a compaction term, not the opener). - Sections 7.1/6/11/5.5/10: correct the cross-table write-skew analysis after a prototype proved the scoped expected-set fix is a no-op against the per-object_id manifest (disjoint writers never share a row, so Lance never conflicts, the publisher never retries, and the expected check is a non-atomic pre-check evaluated once against stale state). The fix needs a shared contention row (Phase-7 graph_head / a minimal head row / commit-time re-validation), so it is coupled to that row, not standalone; that contention is load-bearing for correctness, not a drawback. Split the concurrent face (read-set + head) from the sequential face (inbound-RI validation on node removal) -- two different fixes. 
- testing.md: add write_cost.rs / helpers/cost.rs / write_cost_s3.rs to the test map; document the local-vs-S3 backend split; extend the cost-budget checklist item to the write/open path and point at the shared harness. * test(engine): isolate the opener in the S3 cost gate; fail loud on S3 setup errors Addresses two PR review findings on the bucket-gated write_cost_s3 gate: - The data-table opener was not isolated: `data_reads` also counts the merge-insert/RI scan, which reads O(fragment-count) and so grows with history for a different reason (compaction's domain, not the opener) -- the same term that made the local data-table count grow. The flat assertion would false-RED or misattribute scan growth to the opener on rustfs. Fix: compact (db.optimize) before each measurement so the table holds ~1 fragment, bounding the scan and leaving the opener's latest-version resolution as the only history-varying term. Compaction preserves version history, so the opener still faces a deep _versions/ chain -- the thing under test. - s3_graph used `.ok()?`, so when OMNIGRAPH_S3_TEST_BUCKET was set but the store was down/misconfigured, init/seed failures collapsed to None and the gate skipped + passed vacuously. Fix: skip only when the bucket env var is absent; once it is set, init/seed failures panic (mirrors tests/s3_storage.rs). * test(engine): isolate the S3 opener with a per-prefix IO probe (correct-by-design) Replaces the fixture-bounded isolation (compact-before-measure) from the prior commit with the root fix: a path-classifying ObjectStore wrapper (PrefixCounter) that attributes each data-table read to the opener term (_versions/.manifest) vs the scan term (data/.lance). IoCounts now exposes data_opener_reads / data_scan_reads, so write_cost_s3 asserts the opener flat directly -- no compaction or fixture massaging, and the assertion measures the opener, not the conflated total. 
Closes the "harness conflates two IO terms" class: any cost test (read or write) can now isolate the opener. PrefixCounter implements only the object_store 0.13 core ObjectStore methods; the convenience surface (get/put/head/...) routes through get_opts/put_opts via ObjectStoreExt's blanket impl, so every read/write is still counted. Validated locally (every-PR) by write_cost::data_table_reads_split_into_flat_opener_and_growing_scan: opener stays flat (7 -> 3) while scan grows (11 -> 91) and opener + scan == data_total exactly -- proving the classifier and confirming the local data-table growth is the fragment scan, not the opener. warm_read_cost (12 tests) stays green under the shared-harness change. * refactor(tests): remove cost-harness duplication and namespace cfg(test) noise Branch self-review (no behavior change) -- pay down three liabilities the write-path work left: - warm_read_cost.rs kept its own probes() (three IOTrackers + a QueryIoProbes + a probe counter) and read raw .stats().read_iops -- duplicating the shared helpers::cost harness this branch introduced. Migrated all 12 tests onto measure()/IoCounts; deleted the local probes(). (This also makes IoCounts' version_probes field used rather than dead.) - insert_cost was copy-pasted verbatim into write_cost.rs and write_cost_s3.rs. Hoisted to helpers::cost::measure_insert so the measured write is defined once. - The per-table Lance namespace (namespace.rs) became entirely test-only after step 3a, but was gated with ~22 per-item #[cfg(test)] attributes. Collapsed to a single `#[cfg(test)] mod namespace;` and stripped the per-item attributes; merged the import groups the gating had split. Verified: lib in-source 162 passed; write_cost 4 + warm_read_cost 12 passed; forbidden_apis passed.	2026-06-20 13:31:15 +02:00
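The path-classifying idea behind the per-prefix probe can be illustrated with a small standalone sketch. It does not implement the `object_store` trait and is not the repository's `PrefixCounter`; `ReadTerm`, `classify`, and `ReadCounts` are hypothetical names, and the matching rules (a `/_versions/` segment with a `.manifest` suffix for the opener, a `/data/` segment with a `.lance` suffix for the scan) are assumptions about the table layout made for illustration.

```rust
/// Which cost term a single object-store read is attributed to (hypothetical for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
enum ReadTerm {
    /// Version-manifest reads: the latest-version resolution done when opening a table.
    Opener,
    /// Data-file reads: the fragment scan, which grows with fragment count.
    Scan,
    /// Anything else (index files, transaction files, ...).
    Other,
}

/// Classifies a read by its object path. The patterns are assumptions, not the real rules.
fn classify(path: &str) -> ReadTerm {
    if path.contains("/_versions/") && path.ends_with(".manifest") {
        ReadTerm::Opener
    } else if path.contains("/data/") && path.ends_with(".lance") {
        ReadTerm::Scan
    } else {
        ReadTerm::Other
    }
}

/// Per-term read counters, plus an independently maintained total for cross-checking.
#[derive(Debug, Default)]
struct ReadCounts {
    total: u64,
    opener: u64,
    scan: u64,
    other: u64,
}

impl ReadCounts {
    /// Records one read against exactly one term, so the terms always sum to `total`.
    fn record(&mut self, path: &str) {
        self.total += 1;
        match classify(path) {
            ReadTerm::Opener => self.opener += 1,
            ReadTerm::Scan => self.scan += 1,
            ReadTerm::Other => self.other += 1,
        }
    }
}
```

Because every read lands in exactly one bucket, a cost test can assert on the opener count alone and still check that the buckets add up to the total, which is the same shape as the `opener + scan == data_total` check quoted in the log.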

5 commits