perf(engine): remove the per-query metadata re-derivation tax on warm reads (#268)

* test(engine): add read-path IO instrumentation seam for warm-read cost tests

Prerequisite seam for the query-latency fixes. Adds
crates/omnigraph/src/instrumentation.rs:

- CountingStorageAdapter: a StorageAdapter decorator counting per-method
  reads (read_text/exists/read_text_versioned/list_dir), for the
  schema-contract reads on the query path.
- A per-query task-local (QueryIoProbes) carrying Lance WrappingObjectStore
  wrappers per open category plus a probe counter, delivered via
  with_query_io_probes. open_dataset_tracked attaches the wrapper so the
  open itself is counted (ObjectStoreParams.object_store_wrapper).

Wires the wrappers into the manifest open (open_manifest_dataset) and the
commit-graph opens (CommitGraph::open/open_at_branch). Production leaves
the task-local unset, so nothing attaches.

Makes Omnigraph::open_with_storage public so tests can inject the counting
adapter. lance-io is a dev-dependency (IOTracker named only in tests). No
runtime behavior change.

* test(engine): warm same-branch read should reuse the coordinator (red)

Cost-budget test using Lance IOTracker at the object-store boundary (the
LanceDB IO-counted-test pattern). On a 20-commit-deep graph, a warm
same-branch query re-opens a fresh coordinator, which opens both the commit
graph and __manifest. Asserts the read opens the commit graph zero times
and performs exactly one cheap version probe; today it does neither (it
scans the commit graph on re-open and never probes). The freshness guard
already passes. Adds the commit_many helper for history-depth fixtures.

Red half of the Fix 1 red->green pair; turns green with the next commit.

* perf(engine): same-branch reads reuse the warm coordinator (Fix 1)

query()/resolved_target re-opened a fresh GraphCoordinator from storage on
every read (full __manifest scan + two commit-graph scans), so a warm
read's cost grew with commit history (invariant 15) though the data was
unchanged.

resolved_target now serves same-branch reads from the warm in-memory
coordinator, gated by a cheap version probe (latest_version_id, one
object-store op) instead of a full re-open:
- fresh (probe == cached version): return the in-memory snapshot under the
  read lock, with a synthetic (branch, version) id and no commit-graph
  access (reads pin the snapshot by manifest version, not the commit DAG;
  invariant 2).
- stale: take the write lock, re-probe (double-checked; tokio RwLock has no
  read->write upgrade), then refresh_manifest_only (no commit-graph scan),
  preserving strong consistency for external writers (invariant 6).
Cross-branch and snapshot targets keep the existing cold-resolve path.

Adds ManifestCoordinator/GraphCoordinator::probe_latest_version and
GraphCoordinator::refresh_manifest_only. Nothing on the read path needs a
real commit ULID (only RuntimeCache keys on the id, where synthetic is
consistent), per a caller audit.

A warm same-branch read on a 20-commit graph now does zero commit-graph
opens and exactly one probe (down from a deep commit-graph scan) and still
observes external commits. The residual per-table __manifest scans are
removed later by Fix 2.

* test(engine): warm query should validate the schema contract once (red)

ensure_schema_state_valid runs twice per query (query()/run_query_at AND
resolved_target/snapshot_at_version), each reading 3 contract files + 2
existence probes. A warm query thus does 6 read_text + 4 exists where one
validation (3 + 2) suffices, measured via CountingStorageAdapter. Adds a
drift guard (schema_source_drift_is_caught_on_read) that already passes.

Red half of the finding-A red->green pair.

* perf(engine): validate the schema contract once per query (finding A)

ensure_schema_state_valid ran on every query AND again inside
resolved_target / snapshot_at_version, so each query validated the schema
contract twice (~10 storage ops). Removes the redundant query()/
run_query_at() calls; the validation inside resolved_target /
snapshot_at_version still runs, so drift is detected exactly as before.

A source-only fast path was rejected: a long-lived handle must detect
external drift of the schema source, IR, OR state on its next operation
(lifecycle::long_lived_handle_rejects_schema_*), which a source-only
compare would miss. So the only safe latency win is not validating twice.

A warm query now does one validation (3 read_text + 2 exists) instead of
two (6 + 4).

* test(engine): warm + multi-table reads should do zero manifest scans (red)

After Fix 1 a warm same-branch read still scans __manifest ~44 times at
20-commit depth: not from resolution (Fix 1 removed that) but from the
per-table open path, which routes through the Lance namespace and full-scans
__manifest twice per touched table (describe_table + describe_table_version).
Tightens the warm test to assert manifest read_iops == 0 and adds a
multi-table (traversal) test asserting the same, pinning the "2 tables = 2x"
tax. Red half of the Fix 2 red->green pair.

* perf(engine): open touched tables by location+version, not via the namespace (Fix 2)

SubTableEntry::open routed every read-path table open through
DatasetBuilder::from_namespace(BranchManifestNamespace), whose describe_table
full-scans __manifest and, with managed_versioning, makes Lance scan again
(describe_table_version) -- two full __manifest scans per touched table. That
was the residual that made warm-read manifest IO grow with history and the
'2 tables = 2x' multi-table tax.

The resolved Snapshot already holds each table's path/version/branch, so open
directly: from_uri(table_uri_for_path(root, path, branch)).with_version(v).
The branch-qualified location is the dataset that physically holds the version
(main: {path}; branch: {path}/tree/{branch}, Lance native-branch storage), and
with_version resolves it within THAT dataset's _versions. 0 namespace calls +
1 HEAD via the native ConditionalPutCommitHandler.

The read namespace (BranchManifestNamespace) is now unused in production
(writes use StagedTableNamespace), so it, its constructor, and the helpers only
it used (to_namespace_version, publish_requests, their imports) are gated
#[cfg(test)] -- retained to validate the namespace contract in unit tests.
Removes the dead open_table_at_version_from_manifest.

Warm same-branch + multi-table reads now scan __manifest zero times; branch +
time-travel reads stay correct (branching.rs, point_in_time.rs, 2 lib
regression tests); production-lib warnings unchanged (baseline).

* test(engine): cost-budget coverage for branch-warm and stale-refresh reads (matrix)

Extends the read-path cost-budget tests across more of the morphological matrix:
- warm_branch_read_does_no_manifest_scans: a warm read on a non-main branch
  (handle synced to it) scans __manifest zero times, exercising Fix 2's
  branch-owned-table open (tree/{branch} + with_version) on Fix 1's warm path --
  the cell that regressed when the open used with_branch against the base.
- stale_read_refreshes_manifest_only: an external commit makes the next read
  take the stale path, which re-reads the manifest (read_iops > 0) but never
  scans the commit graph (refresh_manifest_only), pinning Fix 1's manifest-only
  refresh.

Cold paths (cross-branch, time-travel) stay behavior-covered (branching.rs,
point_in_time.rs) and are cold by design (Fix 1 warm-paths only same-branch), so
there is no manifest==0 contract to assert there.

* test(engine): same-branch write after external commit must not fork the commit DAG (red)

* fix(engine): refresh commit-graph head before append to prevent same-branch DAG fork

A same-branch write that follows an external commit committed a fresh manifest
version (commit_all rebases the pin from a fresh coordinator) but appended off
the coordinator's stale in-memory commit-graph head, forking the commit DAG (the
new commit and the external commit shared a parent). Pre-existing for non-strict
inserts; widened to strict ops by Fix 1's refresh_manifest_only freshening the
read-time pin. record_graph_commit now refreshes the commit-graph head from
storage before append_commit, so the parent is the true current head.
record_merge_commit is unaffected (it passes explicit parents).

* perf(engine): hold open Dataset handles + share one Session per graph (Fix 3)

A warm same-branch read still re-opened every touched table per query (the
"never warms up" residual after Fix 1+2). A per-graph held-handle cache keyed by
(table_path, branch, version) now serves repeat reads with zero table opens, and
one shared lance::Session per graph warms metadata/index caches across opens.

Validated against LanceDB upstream (rust/lancedb/src/table/dataset.rs
DatasetConsistencyWrapper): hold an Arc<Dataset> and reuse it for 0-IO warm
reads; one Session per connection threaded into opens; writers never serve from
the read cache; time-travel bypasses. One adaptation: omnigraph keys by version
(snapshot-pins-version model) where LanceDB keys per-table+HEAD, reusing the
in-repo GraphIndexCache LRU template.

- ReadCaches (session + TableHandleCache) injected onto live-Branch-read
  snapshots in resolved_target; Snapshot::open serves from the cache or opens
  once with the session on a miss (via the instrumented open_table_dataset).
- Writes (resolved_branch_target -> open HEAD) and time-travel / Snapshot-id
  reads bypass the cache. Version-in-key makes a write a new key (old handle ages
  out via LRU); invalidate_all at branch-switch/refresh is hygiene only.
- Cost tests: a 2nd identical warm read does 0 table opens; a write re-opens only
  the changed table at its new version.

Full engine suite green.

* test(engine): forbid raw data opens in the read/exec layer (P2 guard)

Extend the forbidden-API guard with Dataset::open / DatasetBuilder::from_uri /
from_namespace so the read/exec layer (exec/, loader/, changes/, db/omnigraph/)
cannot bypass Snapshot::open and the held-handle cache (Fix 3). The instrumented
opener (instrumentation.rs) is allow-listed; two legitimate non-read opens (a
test editing __manifest, Hard-drop version GC) carry sentinels. The
storage/manifest layers stay allow-listed.

Lean P2 scope, per LanceDB-upstream + minimize-liability: the data-read boundary
already exists (SubTableEntry::open); this guard pins it so a future read cannot
open around the cache. Centralizing all internal opens behind one opener is
deferred.

* docs(dev): invariant 15 (one source of truth, cheaply derived) + cost-budget testing

Records the principle behind the query-latency work: Lance and the manifest are
the source of truth, everything else a derived view held warm and refreshed by a
cheap probe; the two failure modes (a drifting parallel copy, and cold
re-derivation whose cost grows with history) are deny-listed. Adds the
cost-budget testing discipline (assert a warm read's open/IO count is flat at
commit-history depth, the LanceDB IO-counted pattern) and the warm_read_cost.rs
row. Updates the read-path-re-derivation known gap to reflect what Fix 1/2/3 +
finding A close, and adds the commit-graph-parent-under-concurrency gap.

* fix(engine): branch-incarnation identity + unified invalidation + shared LruMap (PR #268 review)

Phase 6 A-D, correct-by-design responses to the Codex/Greptile P2 review comments. A: warm-read freshness and the table-handle cache key use the manifest incarnation (e_tag, manifest-timestamp fallback, then version), so a deleted+recreated non-main branch reusing a version number cannot be served stale; main stays version-cheap, non-main loads latest_manifest; a detected stale refresh also invalidates read caches; two regression tests force the version collision. B: unify the two cache invalidations into Omnigraph::invalidate_read_caches() at the four sites. C: assert the stale path's probe count. D: shared LruMap behind both caches with unconditional eviction, plus a unit test. Full engine suite green; multi-process lineage fork and O(history) write refresh remain known gaps for Phase 6E/7.
This commit is contained in:
Ragnor Comerford 2026-06-17 13:25:20 +02:00 committed by GitHub
parent d0e06a6ff6
commit 5243c048aa
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
26 changed files with 1918 additions and 107 deletions

View file

@ -53,7 +53,13 @@ converge the physical state.
versioning, fragments, branches, compaction, cleanup, and index primitives.
DataFusion should own relational execution where it fits. Do not add custom
WALs, transaction managers, buffer pools, page formats, or local clones of
substrate behavior. Read [lance.md](lance.md) before guessing.
substrate behavior. Read [lance.md](lance.md) before guessing. Respecting the
substrate also means *using* it idiomatically, not only refraining from
rebuilding it: reuse long-lived handles instead of re-opening per call,
resolve latest state through the substrate's cheap primitive instead of
re-scanning, and share its caches/session. Re-deriving per call what the
substrate keeps warm is a substrate violation even when no code is
reimplemented.
2. **Graph visibility is manifest-atomic.** Lance commits are per dataset.
OmniGraph's graph-level atomicity comes from publishing one manifest update
@ -126,6 +132,18 @@ converge the physical state.
a substitute for missing lower-level assertions. Read [testing.md](testing.md)
before adding tests.
15. **One source of truth, cheaply derived.** Lance and the manifest are the
source of truth. Everything the engine needs at runtime is a derived view of
them: read or projected on demand, held warm, refreshed by a cheap probe. Two
failure modes are forbidden. A *parallel copy* the engine maintains can drift
from the source, and that divergence compounds over time. *Cold
re-derivation* rebuilds the view from the full source on every call, so its
cost grows with history. Invariants 1 and 7, and the deny-list "state that
drifts" and "manifest-derivable reconciler" items, are instances; so is
bounding a read's cost to its working set rather than the commit count. This
is the structural face of "engineering is programming integrated over time":
both failure modes are liabilities that compound as the system grows.
## Current Truth Matrix
| Area | Current state | Source |
@ -252,6 +270,37 @@ them explicit.
- **Resource bounds:** some operations still lack enforced per-query memory or
time budgets. New long-running work should add explicit bounds rather than
widening the gap.
- **Read-path re-derivation (largely closed by the query-latency work):**
snapshot resolution used to re-open a fresh coordinator per read (a full
`__manifest` re-scan plus two commit-graph scans), open each table through the
namespace (two more `__manifest` scans per table), validate the schema twice,
and share no Lance `Session`. That was an O(commits) cost that never warmed up.
Fix 1 (warm coordinator reuse behind a `latest_version_id` probe), Fix 2 (open
tables by location+version), finding A (validate once), and Fix 3 (a held
`Dataset`-handle cache keyed by `(table, branch, version, e_tag when Lance
exposes it)` plus one shared `Session` per graph) remove that tax: a warm
same-branch read does one probe, one schema read, and zero opens on a repeat.
Non-main branch freshness compares the manifest incarnation (`version` plus
manifest-location e_tag when available, otherwise Lance manifest timestamp),
because Lance branch names can be deleted/recreated at the same version number;
the manifest e_tag is carried into synthetic snapshot ids when available, and
a detected same-branch manifest refresh clears read caches as the fallback for
e_tag-less table locations/topology. Remaining: the internal metadata tables
(`__manifest`, `_graph_commits`) are still not compacted, so the probe and
refresh cost still grows with fragment count on a long-lived graph (the
`optimize`-covers-internal-tables follow-up); the commit graph is not yet
reconcilable from the manifest; and the traversal id-map is still rebuilt.
- **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
the commit-graph head from storage before appending, so a same-branch write
after an external commit no longer forks the commit DAG by parenting off a
stale cached head (the single-process fork, pre-existing for non-strict
inserts and widened to strict ops by Fix 1's `refresh_manifest_only`, is now
closed). Residual: two processes writing disjoint tables can still pass their
per-table manifest CAS and append off the same parent (a refresh-then-append
TOCTOU). The convergent fix is reconcile-from-manifest (parent = the commit at
the manifest version the publisher CAS'd against; `manifest_version` is on
every commit row), composing with the manifest-to-commit-graph atomicity gap;
it needs commit-graph append ordering or a Lance append-CAS to fully close.
## Deny-list
@ -277,6 +326,10 @@ case is exceptional.
- Cost-blind plan choice when statistics are available or required.
- Hidden statistics for behavior that affects planning or operator choice.
- Hash-map iteration order in result ordering, plan choice, or migration output.
- Cold re-derivation on the hot path: rebuilding from the full source what could
be held warm and refreshed cheaply, so cost scales with history rather than the
working set (the cost face of invariant 15; "state that drifts" above is its
shadow-copy face).
- String-flattened SQL/filter generation when a structured pushdown API is
available.
- Eager multi-hop cross-product materialization when factorization fits.
@ -313,6 +366,8 @@ Use this as yes/no/NA for any non-trivial design or PR:
- Are stats/capabilities exposed when behavior depends on them?
- Are existing known gaps left no worse and documented if touched?
- Does the test live at the same boundary as the change?
- Is this operation's cost bounded with respect to history and scale, or does it
re-derive warm state from cold storage per call?
- Does the change avoid every deny-list pattern, or justify the exception?
## Maintenance Policy

View file

@ -23,8 +23,9 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9×9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. |
| `writes.rs` | Direct-publish writes: cancellation, non-strict insert/merge rebase under the per-table queue, strict stale-write conflicts, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
| `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) — primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead |
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
| `lifecycle.rs` | Graph lifecycle, schema state |
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
| `changes.rs` | `diff_between` / `diff_commits` |
@ -125,5 +126,14 @@ When you pick up any change, walk through this:
6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.
9. **Bound hot-path cost at history depth.** If the change touches a read or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a fixed object-op count) against a fixture with realistic *commit-history depth*, not just realistic row counts. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.
## Cost-budget tests: bound hot-path cost at history depth
Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and degrade silently in production. The engine read path historically had no cost assertion, and fixtures carry shallow commit history, so an O(commits)-per-query cost stayed green in CI and only surfaced on a long-lived graph (read snapshot resolution re-scanned the internal manifest and commit-graph tables on every query, and those tables were never compacted). Guard against the class:
- **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
- **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
- This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).
When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.