omnigraph/docs/dev
Ragnor Comerford 5243c048aa
perf(engine): remove the per-query metadata re-derivation tax on warm reads (#268)
* test(engine): add read-path IO instrumentation seam for warm-read cost tests

Prerequisite seam for the query-latency fixes. Adds
crates/omnigraph/src/instrumentation.rs:

- CountingStorageAdapter: a StorageAdapter decorator counting per-method
  reads (read_text/exists/read_text_versioned/list_dir), for the
  schema-contract reads on the query path.
- A per-query task-local (QueryIoProbes) carrying Lance WrappingObjectStore
  wrappers per open category plus a probe counter, delivered via
  with_query_io_probes. open_dataset_tracked attaches the wrapper so the
  open itself is counted (ObjectStoreParams.object_store_wrapper).

Wires the wrappers into the manifest open (open_manifest_dataset) and the
commit-graph opens (CommitGraph::open/open_at_branch). Production leaves
the task-local unset, so nothing attaches.

Makes Omnigraph::open_with_storage public so tests can inject the counting
adapter. lance-io is a dev-dependency (IOTracker named only in tests). No
runtime behavior change.
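
The counting decorator can be pictured with a minimal sketch. Everything here is an assumption for illustration: the trait is a synchronous, two-method stand-in for the engine's StorageAdapter, and MemStorage is a toy backend that exists only to exercise the wrapper.

```rust
use std::collections::HashMap;
use std::io;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical, synchronous stand-in for the engine's StorageAdapter trait.
pub trait StorageAdapter {
    fn read_text(&self, path: &str) -> io::Result<String>;
    fn exists(&self, path: &str) -> io::Result<bool>;
}

/// Decorator: forwards every call to `inner` and counts it per method.
pub struct CountingStorageAdapter<A> {
    inner: A,
    read_text_calls: AtomicUsize,
    exists_calls: AtomicUsize,
}

impl<A: StorageAdapter> CountingStorageAdapter<A> {
    pub fn new(inner: A) -> Self {
        Self {
            inner,
            read_text_calls: AtomicUsize::new(0),
            exists_calls: AtomicUsize::new(0),
        }
    }

    pub fn read_text_calls(&self) -> usize {
        self.read_text_calls.load(Ordering::SeqCst)
    }

    pub fn exists_calls(&self) -> usize {
        self.exists_calls.load(Ordering::SeqCst)
    }
}

impl<A: StorageAdapter> StorageAdapter for CountingStorageAdapter<A> {
    fn read_text(&self, path: &str) -> io::Result<String> {
        // Count the attempt, whether or not the read succeeds.
        self.read_text_calls.fetch_add(1, Ordering::SeqCst);
        self.inner.read_text(path)
    }

    fn exists(&self, path: &str) -> io::Result<bool> {
        self.exists_calls.fetch_add(1, Ordering::SeqCst);
        self.inner.exists(path)
    }
}

/// Toy in-memory backend (hypothetical), used only to drive the decorator.
pub struct MemStorage(pub HashMap<String, String>);

impl StorageAdapter for MemStorage {
    fn read_text(&self, path: &str) -> io::Result<String> {
        self.0
            .get(path)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, path.to_string()))
    }

    fn exists(&self, path: &str) -> io::Result<bool> {
        Ok(self.0.contains_key(path))
    }
}
```

A test would wrap its backend in the decorator, run a query, and assert on the counters. Because the wrapper implements the same trait, the code under test does not need to know it is being counted.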

* test(engine): warm same-branch read should reuse the coordinator (red)

Cost-budget test using Lance IOTracker at the object-store boundary (the
LanceDB IO-counted-test pattern). On a 20-commit-deep graph, a warm
same-branch query re-opens a fresh coordinator, which opens both the commit
graph and __manifest. Asserts the read opens the commit graph zero times
and performs exactly one cheap version probe; today it does neither (it
scans the commit graph on re-open and never probes). The freshness guard
already passes. Adds the commit_many helper for history-depth fixtures.

Red half of the Fix 1 red->green pair; turns green with the next commit.

* perf(engine): same-branch reads reuse the warm coordinator (Fix 1)

query()/resolved_target re-opened a fresh GraphCoordinator from storage on
every read (full __manifest scan + two commit-graph scans), so a warm
read's cost grew with commit history (invariant 15) though the data was
unchanged.

resolved_target now serves same-branch reads from the warm in-memory
coordinator, gated by a cheap version probe (latest_version_id, one
object-store op) instead of a full re-open:
- fresh (probe == cached version): return the in-memory snapshot under the
  read lock, with a synthetic (branch, version) id and no commit-graph
  access (reads pin the snapshot by manifest version, not the commit DAG;
  invariant 2).
- stale: take the write lock, re-probe (double-checked; tokio RwLock has no
  read->write upgrade), then refresh_manifest_only (no commit-graph scan),
  preserving strong consistency for external writers (invariant 6).
Cross-branch and snapshot targets keep the existing cold-resolve path.
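
The fresh/stale split above can be sketched as a double-checked refresh. This is a simplified model, not the engine's code: it uses std::sync::RwLock in place of the tokio RwLock, and Store, Coordinator, load_manifest, and resolved_version are hypothetical names. The shape is the same either way, since neither lock supports a read->write upgrade.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::RwLock;

/// Hypothetical stand-in for object storage: holds the latest manifest
/// version and counts cheap probes versus full manifest re-reads.
pub struct Store {
    latest: AtomicU64,
    probes: AtomicUsize,
    refreshes: AtomicUsize,
}

impl Store {
    pub fn new(version: u64) -> Self {
        Store {
            latest: AtomicU64::new(version),
            probes: AtomicUsize::new(0),
            refreshes: AtomicUsize::new(0),
        }
    }

    /// Cheap version probe.
    pub fn probe_latest_version(&self) -> u64 {
        self.probes.fetch_add(1, Ordering::SeqCst);
        self.latest.load(Ordering::SeqCst)
    }

    /// Expensive manifest re-read (the manifest-only refresh).
    pub fn load_manifest(&self) -> u64 {
        self.refreshes.fetch_add(1, Ordering::SeqCst);
        self.latest.load(Ordering::SeqCst)
    }

    /// Simulates a commit by an external writer.
    pub fn external_commit(&self) {
        self.latest.fetch_add(1, Ordering::SeqCst);
    }

    pub fn probe_count(&self) -> usize {
        self.probes.load(Ordering::SeqCst)
    }

    pub fn refresh_count(&self) -> usize {
        self.refreshes.load(Ordering::SeqCst)
    }
}

/// Warm coordinator holding the version of its in-memory snapshot.
pub struct Coordinator {
    cached_version: RwLock<u64>,
}

impl Coordinator {
    pub fn new(version: u64) -> Self {
        Coordinator { cached_version: RwLock::new(version) }
    }

    pub fn resolved_version(&self, store: &Store) -> u64 {
        // Fresh path: probe under the read lock and serve the cached snapshot.
        {
            let cached = self.cached_version.read().unwrap();
            if store.probe_latest_version() == *cached {
                return *cached;
            }
        } // The read guard is dropped here; there is no read->write upgrade.

        // Stale path: take the write lock and re-probe, because another
        // reader may have refreshed between the two lock acquisitions.
        let mut cached = self.cached_version.write().unwrap();
        if store.probe_latest_version() != *cached {
            *cached = store.load_manifest();
        }
        *cached
    }
}
```

In this model a fresh read costs one probe and no refresh, while a read after an external commit costs two probes and one refresh. The real stale-path probe count is whatever the engine's tests pin; the numbers here describe only the sketch.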

Adds ManifestCoordinator/GraphCoordinator::probe_latest_version and
GraphCoordinator::refresh_manifest_only. Nothing on the read path needs a
real commit ULID (only RuntimeCache keys on the id, where synthetic is
consistent), per a caller audit.

A warm same-branch read on a 20-commit graph now does zero commit-graph
opens and exactly one probe (down from a deep commit-graph scan) and still
observes external commits. The residual per-table __manifest scans are
removed later by Fix 2.

* test(engine): warm query should validate the schema contract once (red)

ensure_schema_state_valid runs twice per query (query()/run_query_at AND
resolved_target/snapshot_at_version), each reading 3 contract files + 2
existence probes. A warm query thus does 6 read_text + 4 exists where one
validation (3 + 2) suffices, measured via CountingStorageAdapter. Adds a
drift guard (schema_source_drift_is_caught_on_read) that already passes.

Red half of the finding-A red->green pair.

* perf(engine): validate the schema contract once per query (finding A)

ensure_schema_state_valid ran on every query AND again inside
resolved_target / snapshot_at_version, so each query validated the schema
contract twice (6 read_text + 4 exists, 10 storage ops). Removes the redundant query()/
run_query_at() calls; the validation inside resolved_target /
snapshot_at_version still runs, so drift is detected exactly as before.

A source-only fast path was rejected: a long-lived handle must detect
external drift of the schema source, IR, OR state on its next operation
(lifecycle::long_lived_handle_rejects_schema_*), which a source-only
compare would miss. So the only safe latency win is not validating twice.

A warm query now does one validation (3 read_text + 2 exists) instead of
two (6 + 4).

* test(engine): warm + multi-table reads should do zero manifest scans (red)

After Fix 1 a warm same-branch read still scans __manifest ~44 times at
20-commit depth: not from resolution (Fix 1 removed that) but from the
per-table open path, which routes through the Lance namespace and full-scans
__manifest twice per touched table (describe_table + describe_table_version).
Tightens the warm test to assert manifest read_iops == 0 and adds a
multi-table (traversal) test asserting the same, pinning the "2 tables = 2x"
tax. Red half of the Fix 2 red->green pair.

* perf(engine): open touched tables by location+version, not via the namespace (Fix 2)

SubTableEntry::open routed every read-path table open through
DatasetBuilder::from_namespace(BranchManifestNamespace), whose describe_table
full-scans __manifest and, with managed_versioning, makes Lance scan again
(describe_table_version) -- two full __manifest scans per touched table. That
was the residual that made warm-read manifest IO grow with history and the
'2 tables = 2x' multi-table tax.

The resolved Snapshot already holds each table's path/version/branch, so open
directly: from_uri(table_uri_for_path(root, path, branch)).with_version(v).
The branch-qualified location is the dataset that physically holds the version
(main: {path}; branch: {path}/tree/{branch}, Lance native-branch storage), and
with_version resolves it within THAT dataset's _versions. 0 namespace calls +
1 HEAD via the native ConditionalPutCommitHandler.
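
The branch-qualified location rule can be illustrated with a small sketch. The function name and argument order come from the text above, but the body is a guess at the rule it describes (main: {path}; branch: {path}/tree/{branch}). Modelling the branch as Option<&str>, treating "main" and None alike, and the example table paths are all assumptions.

```rust
/// Hypothetical reconstruction of the location rule described above:
/// main -> {root}/{path}; any other branch -> {root}/{path}/tree/{branch}.
pub fn table_uri_for_path(root: &str, path: &str, branch: Option<&str>) -> String {
    // Avoid a double slash when the root already ends with one.
    let root = root.trim_end_matches('/');
    match branch {
        // Assumption: an absent branch and "main" both mean the base dataset.
        None | Some("main") => format!("{root}/{path}"),
        Some(b) => format!("{root}/{path}/tree/{b}"),
    }
}
```

The resulting URI is what would be handed to the dataset builder before pinning the version, so the version lookup happens inside the dataset that physically holds it.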

The read namespace (BranchManifestNamespace) is now unused in production
(writes use StagedTableNamespace), so it, its constructor, and the helpers only
it used (to_namespace_version, publish_requests, their imports) are gated
#[cfg(test)] -- retained to validate the namespace contract in unit tests.
Removes the dead open_table_at_version_from_manifest.

Warm same-branch + multi-table reads now scan __manifest zero times; branch +
time-travel reads stay correct (branching.rs, point_in_time.rs, 2 lib
regression tests); production-lib warnings unchanged (baseline).

* test(engine): cost-budget coverage for branch-warm and stale-refresh reads (matrix)

Extends the read-path cost-budget tests across more of the morphological matrix:
- warm_branch_read_does_no_manifest_scans: a warm read on a non-main branch
  (handle synced to it) scans __manifest zero times, exercising Fix 2's
  branch-owned-table open (tree/{branch} + with_version) on Fix 1's warm path --
  the cell that regressed when the open used with_branch against the base.
- stale_read_refreshes_manifest_only: an external commit makes the next read
  take the stale path, which re-reads the manifest (read_iops > 0) but never
  scans the commit graph (refresh_manifest_only), pinning Fix 1's manifest-only
  refresh.

Cold paths (cross-branch, time-travel) stay behavior-covered (branching.rs,
point_in_time.rs) and are cold by design (Fix 1 warms only same-branch reads),
so there is no manifest==0 contract to assert there.

* test(engine): same-branch write after external commit must not fork the commit DAG (red)

* fix(engine): refresh commit-graph head before append to prevent same-branch DAG fork

A same-branch write that follows an external commit committed a fresh manifest
version (commit_all rebases the pin from a fresh coordinator) but appended off
the coordinator's stale in-memory commit-graph head, forking the commit DAG (the
new commit and the external commit shared a parent). The bug pre-dates this PR
for non-strict inserts; Fix 1 widened it to strict ops, because
refresh_manifest_only freshens the read-time pin. record_graph_commit now
refreshes the commit-graph head from
storage before append_commit, so the parent is the true current head.
record_merge_commit is unaffected (it passes explicit parents).
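
A toy model makes the fork and the fix concrete. All types and method names below are hypothetical: CommitStore stands in for the persisted commit graph, and Handle for a coordinator holding an in-memory head.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the persisted commit graph: id -> parent id.
#[derive(Default)]
pub struct CommitStore {
    parents: HashMap<u32, Option<u32>>,
    head: Option<u32>,
    next_id: u32,
}

impl CommitStore {
    /// Appends a commit with the given parent and makes it the new head.
    pub fn append_commit(&mut self, parent: Option<u32>) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.parents.insert(id, parent);
        self.head = Some(id);
        id
    }

    pub fn head(&self) -> Option<u32> {
        self.head
    }

    pub fn parent_of(&self, id: u32) -> Option<u32> {
        self.parents.get(&id).copied().flatten()
    }
}

/// A long-lived handle with an in-memory copy of the commit-graph head.
pub struct Handle {
    cached_head: Option<u32>,
}

impl Handle {
    pub fn open(store: &CommitStore) -> Self {
        Handle { cached_head: store.head() }
    }

    /// Buggy shape: appends off the cached head, which may be stale.
    pub fn record_commit_stale(&mut self, store: &mut CommitStore) -> u32 {
        let id = store.append_commit(self.cached_head);
        self.cached_head = Some(id);
        id
    }

    /// Fixed shape: refresh the head from storage, then append.
    pub fn record_commit_refreshed(&mut self, store: &mut CommitStore) -> u32 {
        self.cached_head = store.head();
        let id = store.append_commit(self.cached_head);
        self.cached_head = Some(id);
        id
    }
}
```

With the stale variant, a commit recorded after an external commit ends up as its sibling; with the refreshed variant it becomes its child. The sketch is single-threaded and says nothing about writers racing between the refresh and the append.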

* perf(engine): hold open Dataset handles + share one Session per graph (Fix 3)

A warm same-branch read still re-opened every touched table per query (the
"never warms up" residual after Fix 1+2). A per-graph held-handle cache keyed by
(table_path, branch, version) now serves repeat reads with zero table opens, and
one shared lance::Session per graph warms metadata/index caches across opens.

Validated against LanceDB upstream (rust/lancedb/src/table/dataset.rs
DatasetConsistencyWrapper): hold an Arc<Dataset> and reuse it for 0-IO warm
reads; one Session per connection threaded into opens; writers never serve from
the read cache; time-travel bypasses. One adaptation: omnigraph keys by version
(snapshot-pins-version model) where LanceDB keys per-table+HEAD, reusing the
in-repo GraphIndexCache LRU template.

- ReadCaches (session + TableHandleCache) injected onto live-Branch-read
  snapshots in resolved_target; Snapshot::open serves from the cache or opens
  once with the session on a miss (via the instrumented open_table_dataset).
- Writes (resolved_branch_target -> open HEAD) and time-travel / Snapshot-id
  reads bypass the cache. Version-in-key makes a write a new key (old handle ages
  out via LRU); invalidate_all at branch-switch/refresh is hygiene only.
- Cost tests: a 2nd identical warm read does 0 table opens; a write re-opens only
  the changed table at its new version.
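
The version-in-key behaviour can be sketched with a small LRU cache. This is an illustration under assumptions, not the engine's TableHandleCache: TableHandle stands in for an opened Dataset, get_or_open and the opens counter are hypothetical, and the recency list is a plain VecDeque chosen for brevity rather than speed.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Cache key from the text above: (table_path, branch, version).
pub type TableKey = (String, String, u64);

/// Hypothetical stand-in for an opened dataset handle.
#[derive(Debug)]
pub struct TableHandle {
    pub key: TableKey,
}

pub struct TableHandleCache {
    capacity: usize,
    entries: HashMap<TableKey, Arc<TableHandle>>,
    /// Front = least recently used, back = most recently used.
    recency: VecDeque<TableKey>,
    /// Number of simulated table opens (cache misses).
    pub opens: usize,
}

impl TableHandleCache {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        TableHandleCache {
            capacity,
            entries: HashMap::new(),
            recency: VecDeque::new(),
            opens: 0,
        }
    }

    pub fn get_or_open(&mut self, path: &str, branch: &str, version: u64) -> Arc<TableHandle> {
        let key: TableKey = (path.to_string(), branch.to_string(), version);
        if let Some(handle) = self.entries.get(&key).cloned() {
            // Hit: no open, just mark the key as most recently used.
            self.touch(&key);
            return handle;
        }
        // Miss: "open" the table once, evicting the oldest entry if full.
        self.opens += 1;
        let handle = Arc::new(TableHandle { key: key.clone() });
        if self.entries.len() == self.capacity {
            if let Some(oldest) = self.recency.pop_front() {
                self.entries.remove(&oldest);
            }
        }
        self.entries.insert(key.clone(), Arc::clone(&handle));
        self.recency.push_back(key);
        handle
    }

    fn touch(&mut self, key: &TableKey) {
        if let Some(pos) = self.recency.iter().position(|k| k == key) {
            if let Some(k) = self.recency.remove(pos) {
                self.recency.push_back(k);
            }
        }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

Because the version is part of the key, a write that bumps a table's version simply misses and opens the new version; the old handle is never served for the new snapshot and leaves the cache only through eviction.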

Full engine suite green.

* test(engine): forbid raw data opens in the read/exec layer (P2 guard)

Extend the forbidden-API guard with Dataset::open / DatasetBuilder::from_uri /
from_namespace so the read/exec layer (exec/, loader/, changes/, db/omnigraph/)
cannot bypass Snapshot::open and the held-handle cache (Fix 3). The instrumented
opener (instrumentation.rs) is allow-listed; two legitimate non-read opens (a
test editing __manifest, Hard-drop version GC) carry sentinels. The
storage/manifest layers stay allow-listed.

Lean P2 scope, per LanceDB-upstream + minimize-liability: the data-read boundary
already exists (SubTableEntry::open); this guard pins it so a future read cannot
open around the cache. Centralizing all internal opens behind one opener is
deferred.
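
A guard of this kind can be approximated by a plain text scan. The sketch below is not the repository's guard: the three patterns are the ones named above, but the sentinel string, the Violation type, and scan_source are hypothetical, and a real guard would also walk the listed directories and apply the allow-list.

```rust
/// API names the guard forbids in the read/exec layer (from the text above).
const FORBIDDEN: [&str; 3] = [
    "Dataset::open",
    "DatasetBuilder::from_uri",
    "from_namespace",
];

/// Hypothetical sentinel marking a deliberate, reviewed raw open.
const SENTINEL: &str = "allow-raw-open";

#[derive(Debug, PartialEq)]
pub struct Violation {
    /// 1-based line number within the scanned source.
    pub line: usize,
    pub pattern: &'static str,
}

/// Returns every forbidden call in `source` that carries no sentinel.
pub fn scan_source(source: &str) -> Vec<Violation> {
    let mut violations = Vec::new();
    for (idx, line) in source.lines().enumerate() {
        if line.contains(SENTINEL) {
            continue; // Explicitly exempted on this line.
        }
        for pattern in FORBIDDEN {
            if line.contains(pattern) {
                violations.push(Violation { line: idx + 1, pattern });
            }
        }
    }
    violations
}
```

A test over the guarded directories would then assert that the combined violation list is empty, which makes any new raw open fail CI unless it is exempted on purpose.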

* docs(dev): invariant 15 (one source of truth, cheaply derived) + cost-budget testing

Records the principle behind the query-latency work: Lance and the manifest are
the source of truth, everything else a derived view held warm and refreshed by a
cheap probe; the two failure modes (a drifting parallel copy, and cold
re-derivation whose cost grows with history) are deny-listed. Adds the
cost-budget testing discipline (assert a warm read's open/IO count is flat at
commit-history depth, the LanceDB IO-counted pattern) and the warm_read_cost.rs
row. Updates the read-path-re-derivation known gap to reflect what Fix 1/2/3 +
finding A close, and adds the commit-graph-parent-under-concurrency gap.

* fix(engine): branch-incarnation identity + unified invalidation + shared LruMap (PR #268 review)

Phase 6 A-D, correct-by-design responses to the Codex/Greptile P2 review
comments:
- A: warm-read freshness and the table-handle cache key use the manifest
  incarnation (e_tag, manifest-timestamp fallback, then version), so a
  deleted+recreated non-main branch reusing a version number cannot be served
  stale. main stays version-cheap; non-main loads latest_manifest. A detected
  stale refresh also invalidates read caches. Two regression tests force the
  version collision.
- B: unify the two cache invalidations into Omnigraph::invalidate_read_caches()
  at the four sites.
- C: assert the stale path's probe count.
- D: shared LruMap behind both caches with unconditional eviction, plus a unit
  test.

Full engine suite green; multi-process lineage fork and O(history) write
refresh remain known gaps for Phase 6E/7.
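
One way to read the incarnation identity is as a fallback chain: prefer the e_tag, fall back to the manifest timestamp, and use the bare version only when neither is available. The sketch below encodes that reading. It is an assumption about the intent, and the ManifestMeta fields, the millisecond unit, and incarnation_of are hypothetical.

```rust
/// Identity of one concrete manifest, distinguishing a recreated branch
/// from the deleted one even when both sit at the same version number.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Incarnation {
    ETag(String),
    /// Hypothetical unit: milliseconds since the Unix epoch.
    Timestamp(u64),
    Version(u64),
}

/// Hypothetical view of what the store reports about a manifest.
pub struct ManifestMeta {
    pub e_tag: Option<String>,
    pub timestamp_ms: Option<u64>,
    pub version: u64,
}

/// Picks the strongest available identity: e_tag, then timestamp, then version.
pub fn incarnation_of(meta: &ManifestMeta) -> Incarnation {
    if let Some(tag) = &meta.e_tag {
        return Incarnation::ETag(tag.clone());
    }
    if let Some(ts) = meta.timestamp_ms {
        return Incarnation::Timestamp(ts);
    }
    Incarnation::Version(meta.version)
}
```

Under this reading, two manifests that share a version number but differ in e_tag compare unequal, so a cache keyed on the incarnation cannot serve the deleted branch's handle for the recreated one. Only when the store exposes neither an e_tag nor a timestamp does the key degrade to the version alone.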
2026-06-17 13:25:20 +02:00
architecture.md feat!: delete the legacy OmnigraphConfig + config migrate; finish the omnigraph.yaml docs sweep (#252) 2026-06-15 22:31:29 +03:00
branch-protection.md ci: run Test Workspace only on main, not on pull requests (#212) 2026-06-13 19:23:41 +03:00
ci.md ci: run Test Workspace only on main, not on pull requests (#212) 2026-06-13 19:23:41 +03:00
cluster-axioms.md docs(cluster): axiom 15 — single ownership, mode-switch migration, per-operator layer (#164) 2026-06-10 00:44:51 +03:00
cluster-config-implementation-spec.md docs(cluster): RFC-005 — server boots from cluster state (Phase 5 design) (#174) 2026-06-10 15:22:12 +03:00
cluster-config-specs.md docs(user): restructure user docs into topic sections (Phase 1) (#223) 2026-06-14 13:52:14 +03:00
codeowners.md ci(codeowners): restore ragnorc to engineering and docs roles 2026-06-12 13:45:33 +03:00
execution.md fix(embedding): address PR review feedback (RFC-012 Phase 2) 2026-06-15 18:37:34 +02:00
index.md docs(rfc): add RFC-012 provider-independent embedding configuration 2026-06-15 15:07:38 +02:00
invariants.md perf(engine): remove the per-query metadata re-derivation tax on warm reads (#268) 2026-06-17 13:25:20 +02:00
lance.md test(engine): pin Lance 7 immutable-PK behavior + sharpen native-namespace alignment notes (#240) 2026-06-15 11:33:25 +02:00
merge.md docs: split user and developer docs (#93) 2026-05-15 03:45:22 +03:00
rfc-001-queries-envelope-mcp.md docs(user): restructure user docs into topic sections (Phase 1) (#223) 2026-06-14 13:52:14 +03:00
rfc-002-config-cli-architecture.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-003-mcp-server-surface.md Stored-query registry foundation + config/CLI RFC-002 (#128) 2026-06-01 22:50:31 +02:00
rfc-004-cluster-graph-schema-apply.md docs(cluster): document Stage 4C — Phase 4 complete 2026-06-10 14:44:12 +03:00
rfc-005-server-cluster-boot.md docs(cluster,server): the Phase 5 mode switch; retire applied-not-serving caveats 2026-06-10 17:56:54 +03:00
rfc-007-operator-config.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-008-deprecate-omnigraph-yaml.md docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus 2026-06-12 17:33:11 +03:00
rfc-009-unify-access-paths.md feat: canonical POST /load, deprecate /ingest (RFC-009 Phase 5) (#222) 2026-06-14 03:32:16 +03:00
rfc-010-cli-planes-restructure.md docs(rfc): RFC-010 — apply verification-comment current-state fixups (#215) 2026-06-13 22:24:09 +03:00
rfc-011-cli-refactoring.md feat(cli): add read-only profile list / profile show (RFC-011 D8) (#255) 2026-06-15 23:33:01 +03:00
rfc-012-embedding-provider-config.md Wire cluster embedding providers 2026-06-16 04:02:08 +03:00
schema-lint-v1-plan.md schema-lint chassis v1.0: DropProperty Soft + code-tagged diagnostics (MR-694) (#90) 2026-05-16 16:30:03 +03:00
testing.md perf(engine): remove the per-query metadata re-derivation tax on warm reads (#268) 2026-06-17 13:25:20 +02:00
writes.md fix: self-heal manifest-unreferenced branch forks (stop wedged branches) (#231) 2026-06-15 22:17:25 +02:00