perf(engine): halve per-write __manifest scans (#307)

* test(write_cost): served-regime __manifest scan tripwire Adds `internal_table_scans_grow_without_compaction`, the served-regime twin of `internal_table_scans_are_flat_in_history`. The flat gate `optimize()`s before every measured write, so it only proves the *compacted* invariant and stays green even when a served graph's per-write `__manifest` scan amplifies without bound. This tripwire measures the uncompacted regime and asserts the scan grows — green today, and it flips RED once the amplification is bounded (write-path warm-reuse + version-GC), at which point it inverts to a permanent `assert_flat` gate. RFC-013. * perf(engine): halve per-write __manifest scans (RFC-013 PR2) Cuts a same-branch write from ~4 to ~2 `__manifest` scans (measured 50->25 at depth 10, 410->205 at depth 100) with the OCC contract and snapshot isolation preserved: - #1a probe-gate the OCC re-capture in `commit_all` via `occ_snapshot_for_branch` (mirrors the read path's `resolve_target_inner`): reuse the warm coordinator when a cheap incarnation probe proves it current, fall through to a cold read on mismatch. - #1b fold the post-publish `known_state` in-memory from `existing_versions` plus the committed rows instead of an O(fragments) re-scan; extracted the shared `assemble_manifest_state` reduction so the fold is byte-identical to a scan, proven by the new `post_publish_fold_matches_fresh_reopen` test. - #1c project `read_manifest_scan` to the columns it reads (drop `base_objects` always, `object_id` on the table-state path). The two remaining publish scans (`load_publish_state` and the `use_index(false)` merge-insert join) stay O(fragments), bounded by compaction/version-GC (RFC-013 PR1, not in this change). * test(manifest): reproduce owner-branch handoff fold desync The PR #307 post-publish fold appends pending table_version rows after existing_versions, and assemble_manifest_state keeps the first equal-version entry. A same-version owner-branch handoff updates a table_version row in place at the same Lance version with a new table_branch (merge-insert UpdateAll on the deterministic version_object_id), so the warm coordinator keeps the stale fork while a fresh re-scan reflects the handoff. This test commits a handoff through the coordinator commit path (exercising the fold) and asserts the warm snapshot equals a fresh reopen. It is red against the current fold; the following commit turns it green. Flagged by Cursor Bugbot (High) and ChatGPT Codex (P2) on PR #307. * fix(engine): fold table_version rows by (table_key, version) identity fold_inputs now keys version entries by (table_key, table_version), the manifest row identity carried by the deterministic version_object_id that the merge-insert CAS uses. A pending row at the same identity replaces the pre-publish entry, mirroring merge-insert UpdateAll on disk. Previously the fold appended pending rows after existing_versions, so an owner-branch handoff left two equal-version entries and assemble_manifest_state retained the stale one. The fold input now carries the same one-row-per-(table_key, version) uniqueness a fresh scan produces, so both feed assemble_manifest_state equivalent inputs and the warm known_state stays byte-identical to read_manifest_state. This corrects the derivation's identity model structurally and applies to any same-version in-place update. Closes the PR #307 review finding. * test(cost): enable lance-io test-util for IO request diagnostics Gives IoStats.requests + assert_io_eq!, used by the cost harness to record the __manifest read log (method + path) for failure diagnostics. Dev-dependency only, so production builds (which exclude dev-deps) never compile it. * test(cost): rebuild IO harness on GraphIoMeter + incremental_stats Consolidate the per-op ProbeHandles into OpProbes plus a persistent GraphIoMeter, and read per-op deltas via lance's incremental_stats() (get-and-reset) instead of cumulative stats() -- the upstream per-request idiom (rust/lance/src/dataset/tests/dataset_io.rs). Add cost_harness(body): it installs one __manifest tracker for a whole test body, so the graph opens under it and every coordinator handle (init plus each post-publish reassignment) carries the same tracker. measure reuses that ambient tracker when present, making manifest_reads ground truth (warm probe plus cold scans, handle-age-irrelevant); outside cost_harness it falls back to a fresh per-op tracker (today's behavior). The body future is boxed so wrapping a whole test body does not overflow the test thread's stack. Also stash each op's __manifest read log on the meter for assert_io_eq!-style failure diagnostics (last_manifest_reads). Behavior-preserving: no test wraps its body in cost_harness yet, so measure takes the fallback path and every cost number is unchanged. write_cost and warm_read_cost stay green. * test(write_cost): ground-truth __manifest counting via cost_harness Wrap the three __manifest-asserting tests (flat, grow, ceiling) in cost_harness so manifest_reads is ground truth -- the warm-coordinator freshness probe rides a long-lived handle a per-op tracker installed at measure time cannot see. The flat/grow gates are depth-difference assertions, so the constant per-write probe offset cancels and they pass unchanged; the absolute ceiling is retightened from 34 to 24 (~18 measured = ~15 publish-path scans + ~3 probe RPCs) with the read log dumped on a breach. Add manifest_reads_capture_warm_probe: it measures the same warm write fresh-only and under cost_harness and asserts ground truth strictly exceeds fresh-only by the probe's RPCs (11 vs 14). Reverting the ground-truth wiring makes the two equal, so this guards that a write's warm-handle probe (3 object-store RPCs that were counted as a single version_probe) cannot silently escape manifest_reads again. * test(warm_read_cost): ground-truth __manifest counting via cost_harness Wrap the warm (== 0) manifest gates in cost_harness so manifest_reads is ground truth. A read's freshness probe is served from Lance's cached manifest at 0 object-store reads (unlike a write's probe, which re-reads after its commit), so the == 0 assertions hold with no re-baseline -- and now also catch any future warm-handle scan a per-op tracker would miss. The stale (> 0) tests are unaffected either way and stay on the fresh fallback. * docs(testing): document ground-truth cost harness (GraphIoMeter) The cost harness now reads incremental_stats() deltas and, under cost_harness, installs one __manifest tracker before the graph opens so manifest_reads is ground truth (handle-age-irrelevant). Note that version_probes is the probe call count and that ground truth reveals a write's probe does ~3 object-store RPCs. * docs(rfc-013): bring write-path handoff current (Thread B + Phase 7 landed) Prepend a current-state section (§A) for the __manifest scan-amplification / version-chain thread: the problem, what landed on main (step 2a, Phase 7 #299), what is in flight on this branch / PR #307 (PR2 scan-halving, the owner-branch handoff fold fix, the PR2.1 ground-truth cost harness), the accurate measurement (per-write __manifest ops ~50->410 pre-PR2 vs 28->208 ground truth; the hidden 3-RPC freshness probe), the remaining roadmap (PR1a manual cleanup, PR3-scoping, deferred PR1b/PR4), critical files, and gotchas. Staleness fixes: Phase 7 was listed as a future "step 4" but landed as #299, so mark it LANDED in the TL;DR landed list and in the remaining-steps section. * docs(rfc-013): refresh PR307 handoff state
2026-06-30 02:49:39 +02:00 · 2026-06-27 13:18:04 +02:00 · 2026-06-27 13:18:04 +02:00 · a7d4cba53d
commit a7d4cba53d
parent 1c5cb8741e
13 changed files with 842 additions and 58 deletions
--- a/docs/dev/testing.md
+++ b/docs/dev/testing.md
@ -27,7 +27,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 | `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
 | `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
 | `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
-| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
+| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives; reads per-op deltas via lance's `incremental_stats()`, the upstream per-request idiom from `rust/lance/src/dataset/tests/dataset_io.rs`), `cost_harness`/`GraphIoMeter` (installs ONE `__manifest` `IOTracker` for a whole test body so the graph opens **under** it and `manifest_reads` is **ground truth** — every read regardless of handle age, the warm-coordinator freshness probe included — closing the blind spot where a per-op tracker installed at measure time cannot see a long-lived handle's reads; outside `cost_harness`, `measure` falls back to fresh per-op tracking, so `write_cost_s3.rs` is unaffected), `last_manifest_reads()` (the manifest read log for `assert_io_eq!`-style failure diagnostics), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
 | `lifecycle.rs` | Graph lifecycle, schema state |
 | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
 | `changes.rs` | `diff_between` / `diff_commits` |
@ -140,6 +140,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg
 - **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
 - **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
 - **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
+- **Count on the handle that does the reads, not just the one a measured op opens.** Lance's IO-counted tests attach the `IOTracker` to the (warm, cached) dataset and read `incremental_stats()` per request — the tracker MUST be on the handle performing the reads, or warm-handle reads escape. A per-op tracker installed at measure time cannot see reads on a long-lived handle opened earlier (the warm coordinator's `__manifest` handle, reused across writes), so such reads were silently undercounted. Wrap a depth-swept body in `cost_harness` so the manifest tracker is installed before the graph opens and `manifest_reads` is **ground truth** (handle-age-irrelevant). The `version_probes` counter is the freshness-probe *call* count; ground truth additionally reveals that a write's probe does ~3 object-store RPCs (a read's probe is a 0-IO cache hit). `manifest_reads_capture_warm_probe` is the guard that this stays true.
 - This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).

 When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.