perf(engine): halve per-write __manifest scans (#307)

* test(write_cost): served-regime __manifest scan tripwire

Adds `internal_table_scans_grow_without_compaction`, the served-regime twin of
`internal_table_scans_are_flat_in_history`. The flat gate `optimize()`s before
every measured write, so it only proves the *compacted* invariant and stays
green even when a served graph's per-write `__manifest` scan amplifies without
bound. This tripwire measures the uncompacted regime and asserts the scan
grows — green today, and it flips RED once the amplification is bounded
(write-path warm-reuse + version-GC), at which point it inverts to a permanent
`assert_flat` gate. RFC-013.

* perf(engine): halve per-write __manifest scans (RFC-013 PR2)

Cuts a same-branch write from ~4 to ~2 `__manifest` scans (measured 50->25 at
depth 10, 410->205 at depth 100) with the OCC contract and snapshot isolation
preserved:

- #1a probe-gate the OCC re-capture in `commit_all` via `occ_snapshot_for_branch`
  (mirrors the read path's `resolve_target_inner`): reuse the warm coordinator
  when a cheap incarnation probe proves it current, fall through to a cold read
  on mismatch.
- #1b fold the post-publish `known_state` in-memory from `existing_versions` plus
  the committed rows instead of an O(fragments) re-scan; extracted the shared
  `assemble_manifest_state` reduction so the fold is byte-identical to a scan,
  proven by the new `post_publish_fold_matches_fresh_reopen` test.
- #1c project `read_manifest_scan` to the columns it reads (drop `base_objects`
  always, `object_id` on the table-state path).

The two remaining publish scans (`load_publish_state` and the `use_index(false)`
merge-insert join) stay O(fragments), bounded by compaction/version-GC (RFC-013
PR1, not in this change).

* test(manifest): reproduce owner-branch handoff fold desync

The PR #307 post-publish fold appends pending table_version rows after
existing_versions, and assemble_manifest_state keeps the first equal-version
entry. A same-version owner-branch handoff updates a table_version row in place
at the same Lance version with a new table_branch (merge-insert UpdateAll on the
deterministic version_object_id), so the warm coordinator keeps the stale fork
while a fresh re-scan reflects the handoff.

This test commits a handoff through the coordinator commit path (exercising the
fold) and asserts the warm snapshot equals a fresh reopen. It is red against the
current fold; the following commit turns it green. Flagged by Cursor Bugbot
(High) and ChatGPT Codex (P2) on PR #307.

* fix(engine): fold table_version rows by (table_key, version) identity

fold_inputs now keys version entries by (table_key, table_version), the manifest
row identity carried by the deterministic version_object_id that the merge-insert
CAS uses. A pending row at the same identity replaces the pre-publish entry,
mirroring merge-insert UpdateAll on disk. Previously the fold appended pending
rows after existing_versions, so an owner-branch handoff left two equal-version
entries and assemble_manifest_state retained the stale one.

The fold input now carries the same one-row-per-(table_key, version) uniqueness a
fresh scan produces, so both feed assemble_manifest_state equivalent inputs and
the warm known_state stays byte-identical to read_manifest_state. This corrects
the derivation's identity model structurally and applies to any same-version
in-place update. Closes the PR #307 review finding.

* test(cost): enable lance-io test-util for IO request diagnostics

Gives IoStats.requests + assert_io_eq!, used by the cost harness to record the
__manifest read log (method + path) for failure diagnostics. Dev-dependency only,
so production builds (which exclude dev-deps) never compile it.

* test(cost): rebuild IO harness on GraphIoMeter + incremental_stats

Consolidate the per-op ProbeHandles into OpProbes plus a persistent GraphIoMeter,
and read per-op deltas via lance's incremental_stats() (get-and-reset) instead of
cumulative stats() -- the upstream per-request idiom
(rust/lance/src/dataset/tests/dataset_io.rs). Add cost_harness(body): it installs
one __manifest tracker for a whole test body, so the graph opens under it and
every coordinator handle (init plus each post-publish reassignment) carries the
same tracker. measure reuses that ambient tracker when present, making
manifest_reads ground truth (warm probe plus cold scans, handle-age-irrelevant);
outside cost_harness it falls back to a fresh per-op tracker (today's behavior).
The body future is boxed so wrapping a whole test body does not overflow the test
thread's stack.

Also stash each op's __manifest read log on the meter for assert_io_eq!-style
failure diagnostics (last_manifest_reads).

Behavior-preserving: no test wraps its body in cost_harness yet, so measure takes
the fallback path and every cost number is unchanged. write_cost and
warm_read_cost stay green.

* test(write_cost): ground-truth __manifest counting via cost_harness

Wrap the three __manifest-asserting tests (flat, grow, ceiling) in cost_harness so
manifest_reads is ground truth -- the warm-coordinator freshness probe rides a
long-lived handle a per-op tracker installed at measure time cannot see. The
flat/grow gates are depth-difference assertions, so the constant per-write probe
offset cancels and they pass unchanged; the absolute ceiling is retightened from
34 to 24 (~18 measured = ~15 publish-path scans + ~3 probe RPCs) with the read log
dumped on a breach.

Add manifest_reads_capture_warm_probe: it measures the same warm write fresh-only
and under cost_harness and asserts ground truth strictly exceeds fresh-only by the
probe's RPCs (11 vs 14). Reverting the ground-truth wiring makes the two equal, so
this guards that a write's warm-handle probe (3 object-store RPCs that were counted
as a single version_probe) cannot silently escape manifest_reads again.

* test(warm_read_cost): ground-truth __manifest counting via cost_harness

Wrap the warm (== 0) manifest gates in cost_harness so manifest_reads is ground
truth. A read's freshness probe is served from Lance's cached manifest at 0
object-store reads (unlike a write's probe, which re-reads after its commit), so
the == 0 assertions hold with no re-baseline -- and now also catch any future
warm-handle scan a per-op tracker would miss. The stale (> 0) tests are unaffected
either way and stay on the fresh fallback.

* docs(testing): document ground-truth cost harness (GraphIoMeter)

The cost harness now reads incremental_stats() deltas and, under cost_harness,
installs one __manifest tracker before the graph opens so manifest_reads is ground
truth (handle-age-irrelevant). Note that version_probes is the probe call count and
that ground truth reveals a write's probe does ~3 object-store RPCs.

* docs(rfc-013): bring write-path handoff current (Thread B + Phase 7 landed)

Prepend a current-state section (§A) for the __manifest scan-amplification /
version-chain thread: the problem, what landed on main (step 2a, Phase 7 #299),
what is in flight on this branch / PR #307 (PR2 scan-halving, the owner-branch
handoff fold fix, the PR2.1 ground-truth cost harness), the accurate measurement
(per-write __manifest ops ~50->410 pre-PR2 vs 28->208 ground truth; the hidden
3-RPC freshness probe), the remaining roadmap (PR1a manual cleanup, PR3-scoping,
deferred PR1b/PR4), critical files, and gotchas.

Staleness fixes: Phase 7 was listed as a future "step 4" but landed as #299, so
mark it LANDED in the TL;DR landed list and in the remaining-steps section.

* docs(rfc-013): refresh PR307 handoff state
This commit is contained in:
Ragnor Comerford 2026-06-27 13:18:04 +02:00 committed by GitHub
parent 1c5cb8741e
commit a7d4cba53d
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
13 changed files with 842 additions and 58 deletions

View file

@ -27,7 +27,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives; reads per-op deltas via lance's `incremental_stats()`, the upstream per-request idiom from `rust/lance/src/dataset/tests/dataset_io.rs`), `cost_harness`/`GraphIoMeter` (installs ONE `__manifest` `IOTracker` for a whole test body so the graph opens **under** it and `manifest_reads` is **ground truth** — every read regardless of handle age, the warm-coordinator freshness probe included — closing the blind spot where a per-op tracker installed at measure time cannot see a long-lived handle's reads; outside `cost_harness`, `measure` falls back to fresh per-op tracking, so `write_cost_s3.rs` is unaffected), `last_manifest_reads()` (the manifest read log for `assert_io_eq!`-style failure diagnostics), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
| `lifecycle.rs` | Graph lifecycle, schema state |
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
| `changes.rs` | `diff_between` / `diff_commits` |
@ -140,6 +140,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg
- **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
- **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
- **Count on the handle that does the reads, not just the one a measured op opens.** Lance's IO-counted tests attach the `IOTracker` to the (warm, cached) dataset and read `incremental_stats()` per request — the tracker MUST be on the handle performing the reads, or warm-handle reads escape. A per-op tracker installed at measure time cannot see reads on a long-lived handle opened earlier (the warm coordinator's `__manifest` handle, reused across writes), so such reads were silently undercounted. Wrap a depth-swept body in `cost_harness` so the manifest tracker is installed before the graph opens and `manifest_reads` is **ground truth** (handle-age-irrelevant). The `version_probes` counter is the freshness-probe *call* count; ground truth additionally reveals that a write's probe does ~3 object-store RPCs (a read's probe is a 0-IO cache hit). `manifest_reads_capture_warm_probe` is the guard that this stays true.
- This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).
When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.