mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-18 02:24:27 +02:00
* test(engine): add read-path IO instrumentation seam for warm-read cost tests
Prerequisite seam for the query-latency fixes. Adds
crates/omnigraph/src/instrumentation.rs:
- CountingStorageAdapter: a StorageAdapter decorator counting per-method
reads (read_text/exists/read_text_versioned/list_dir), for the
schema-contract reads on the query path.
- A per-query task-local (QueryIoProbes) carrying Lance WrappingObjectStore
wrappers per open category plus a probe counter, delivered via
with_query_io_probes. open_dataset_tracked attaches the wrapper so the
open itself is counted (ObjectStoreParams.object_store_wrapper).
Wires the wrappers into the manifest open (open_manifest_dataset) and the
commit-graph opens (CommitGraph::open/open_at_branch). Production leaves
the task-local unset, so nothing attaches.
Makes Omnigraph::open_with_storage public so tests can inject the counting
adapter. lance-io is a dev-dependency (IOTracker named only in tests). No
runtime behavior change.
* test(engine): warm same-branch read should reuse the coordinator (red)
Cost-budget test using Lance IOTracker at the object-store boundary (the
LanceDB IO-counted-test pattern). On a 20-commit-deep graph, a warm
same-branch query re-opens a fresh coordinator, which opens both the commit
graph and __manifest. Asserts the read opens the commit graph zero times
and performs exactly one cheap version probe; today it does neither (it
scans the commit graph on re-open and never probes). The freshness guard
already passes. Adds the commit_many helper for history-depth fixtures.
Red half of the Fix 1 red->green pair; turns green with the next commit.
* perf(engine): same-branch reads reuse the warm coordinator (Fix 1)
query()/resolved_target re-opened a fresh GraphCoordinator from storage on
every read (full __manifest scan + two commit-graph scans), so a warm
read's cost grew with commit history (invariant 15) though the data was
unchanged.
resolved_target now serves same-branch reads from the warm in-memory
coordinator, gated by a cheap version probe (latest_version_id, one
object-store op) instead of a full re-open:
- fresh (probe == cached version): return the in-memory snapshot under the
read lock, with a synthetic (branch, version) id and no commit-graph
access (reads pin the snapshot by manifest version, not the commit DAG;
invariant 2).
- stale: take the write lock, re-probe (double-checked; tokio RwLock has no
read->write upgrade), then refresh_manifest_only (no commit-graph scan),
preserving strong consistency for external writers (invariant 6).
Cross-branch and snapshot targets keep the existing cold-resolve path.
Adds ManifestCoordinator/GraphCoordinator::probe_latest_version and
GraphCoordinator::refresh_manifest_only. Nothing on the read path needs a
real commit ULID (only RuntimeCache keys on the id, where synthetic is
consistent), per a caller audit.
A warm same-branch read on a 20-commit graph now does zero commit-graph
opens and exactly one probe (down from a deep commit-graph scan) and still
observes external commits. The residual per-table __manifest scans are
removed later by Fix 2.
* test(engine): warm query should validate the schema contract once (red)
ensure_schema_state_valid runs twice per query (query()/run_query_at AND
resolved_target/snapshot_at_version), each reading 3 contract files + 2
existence probes. A warm query thus does 6 read_text + 4 exists where one
validation (3 + 2) suffices, measured via CountingStorageAdapter. Adds a
drift guard (schema_source_drift_is_caught_on_read) that already passes.
Red half of the finding-A red->green pair.
* perf(engine): validate the schema contract once per query (finding A)
ensure_schema_state_valid ran on every query AND again inside
resolved_target / snapshot_at_version, so each query validated the schema
contract twice (~10 storage ops). Removes the redundant query()/
run_query_at() calls; the validation inside resolved_target /
snapshot_at_version still runs, so drift is detected exactly as before.
A source-only fast path was rejected: a long-lived handle must detect
external drift of the schema source, IR, OR state on its next operation
(lifecycle::long_lived_handle_rejects_schema_*), which a source-only
compare would miss. So the only safe latency win is not validating twice.
A warm query now does one validation (3 read_text + 2 exists) instead of
two (6 + 4).
* test(engine): warm + multi-table reads should do zero manifest scans (red)
After Fix 1 a warm same-branch read still scans __manifest ~44 times at
20-commit depth: not from resolution (Fix 1 removed that) but from the
per-table open path, which routes through the Lance namespace and full-scans
__manifest twice per touched table (describe_table + describe_table_version).
Tightens the warm test to assert manifest read_iops == 0 and adds a
multi-table (traversal) test asserting the same, pinning the "2 tables = 2x"
tax. Red half of the Fix 2 red->green pair.
* perf(engine): open touched tables by location+version, not via the namespace (Fix 2)
SubTableEntry::open routed every read-path table open through
DatasetBuilder::from_namespace(BranchManifestNamespace), whose describe_table
full-scans __manifest and, with managed_versioning, makes Lance scan again
(describe_table_version) -- two full __manifest scans per touched table. That
was the residual that made warm-read manifest IO grow with history and the
'2 tables = 2x' multi-table tax.
The resolved Snapshot already holds each table's path/version/branch, so open
directly: from_uri(table_uri_for_path(root, path, branch)).with_version(v).
The branch-qualified location is the dataset that physically holds the version
(main: {path}; branch: {path}/tree/{branch}, Lance native-branch storage), and
with_version resolves it within THAT dataset's _versions. 0 namespace calls +
1 HEAD via the native ConditionalPutCommitHandler.
The read namespace (BranchManifestNamespace) is now unused in production
(writes use StagedTableNamespace), so it, its constructor, and the helpers only
it used (to_namespace_version, publish_requests, their imports) are gated
#[cfg(test)] -- retained to validate the namespace contract in unit tests.
Removes the dead open_table_at_version_from_manifest.
Warm same-branch + multi-table reads now scan __manifest zero times; branch +
time-travel reads stay correct (branching.rs, point_in_time.rs, 2 lib
regression tests); production-lib warnings unchanged (baseline).
* test(engine): cost-budget coverage for branch-warm and stale-refresh reads (matrix)
Extends the read-path cost-budget tests across more of the morphological matrix:
- warm_branch_read_does_no_manifest_scans: a warm read on a non-main branch
(handle synced to it) scans __manifest zero times, exercising Fix 2's
branch-owned-table open (tree/{branch} + with_version) on Fix 1's warm path --
the cell that regressed when the open used with_branch against the base.
- stale_read_refreshes_manifest_only: an external commit makes the next read
take the stale path, which re-reads the manifest (read_iops > 0) but never
scans the commit graph (refresh_manifest_only), pinning Fix 1's manifest-only
refresh.
Cold paths (cross-branch, time-travel) stay behavior-covered (branching.rs,
point_in_time.rs) and are cold by design (Fix 1 warm-paths only same-branch), so
there is no manifest==0 contract to assert there.
* test(engine): same-branch write after external commit must not fork the commit DAG (red)
* fix(engine): refresh commit-graph head before append to prevent same-branch DAG fork
A same-branch write that follows an external commit committed a fresh manifest
version (commit_all rebases the pin from a fresh coordinator) but appended off
the coordinator's stale in-memory commit-graph head, forking the commit DAG (the
new commit and the external commit shared a parent). Pre-existing for non-strict
inserts; widened to strict ops by Fix 1's refresh_manifest_only freshening the
read-time pin. record_graph_commit now refreshes the commit-graph head from
storage before append_commit, so the parent is the true current head.
record_merge_commit is unaffected (it passes explicit parents).
* perf(engine): hold open Dataset handles + share one Session per graph (Fix 3)
A warm same-branch read still re-opened every touched table per query (the
"never warms up" residual after Fix 1+2). A per-graph held-handle cache keyed by
(table_path, branch, version) now serves repeat reads with zero table opens, and
one shared lance::Session per graph warms metadata/index caches across opens.
Validated against LanceDB upstream (rust/lancedb/src/table/dataset.rs
DatasetConsistencyWrapper): hold an Arc<Dataset> and reuse it for 0-IO warm
reads; one Session per connection threaded into opens; writers never serve from
the read cache; time-travel bypasses. One adaptation: omnigraph keys by version
(snapshot-pins-version model) where LanceDB keys per-table+HEAD, reusing the
in-repo GraphIndexCache LRU template.
- ReadCaches (session + TableHandleCache) injected onto live-Branch-read
snapshots in resolved_target; Snapshot::open serves from the cache or opens
once with the session on a miss (via the instrumented open_table_dataset).
- Writes (resolved_branch_target -> open HEAD) and time-travel / Snapshot-id
reads bypass the cache. Version-in-key makes a write a new key (old handle ages
out via LRU); invalidate_all at branch-switch/refresh is hygiene only.
- Cost tests: a 2nd identical warm read does 0 table opens; a write re-opens only
the changed table at its new version.
Full engine suite green.
* test(engine): forbid raw data opens in the read/exec layer (P2 guard)
Extend the forbidden-API guard with Dataset::open / DatasetBuilder::from_uri /
from_namespace so the read/exec layer (exec/, loader/, changes/, db/omnigraph/)
cannot bypass Snapshot::open and the held-handle cache (Fix 3). The instrumented
opener (instrumentation.rs) is allow-listed; two legitimate non-read opens (a
test editing __manifest, Hard-drop version GC) carry sentinels. The
storage/manifest layers stay allow-listed.
Lean P2 scope, per LanceDB-upstream + minimize-liability: the data-read boundary
already exists (SubTableEntry::open); this guard pins it so a future read cannot
open around the cache. Centralizing all internal opens behind one opener is
deferred.
* docs(dev): invariant 15 (one source of truth, cheaply derived) + cost-budget testing
Records the principle behind the query-latency work: Lance and the manifest are
the source of truth, everything else a derived view held warm and refreshed by a
cheap probe; the two failure modes (a drifting parallel copy, and cold
re-derivation whose cost grows with history) are deny-listed. Adds the
cost-budget testing discipline (assert a warm read's open/IO count is flat at
commit-history depth, the LanceDB IO-counted pattern) and the warm_read_cost.rs
row. Updates the read-path-re-derivation known gap to reflect what Fix 1/2/3 +
finding A close, and adds the commit-graph-parent-under-concurrency gap.
* fix(engine): branch-incarnation identity + unified invalidation + shared LruMap (PR #268 review)
Phase 6 A-D, correct-by-design responses to the Codex/Greptile P2 review comments. A: warm-read freshness and the table-handle cache key use the manifest incarnation (e_tag, manifest-timestamp fallback, then version), so a deleted+recreated non-main branch reusing a version number cannot be served stale; main stays version-cheap, non-main loads latest_manifest; a detected stale refresh also invalidates read caches; two regression tests force the version collision. B: unify the two cache invalidations into Omnigraph::invalidate_read_caches() at the four sites. C: assert the stale path's probe count. D: shared LruMap behind both caches with unconditional eviction, plus a unit test. Full engine suite green; multi-process lineage fork and O(history) write refresh remain known gaps for Phase 6E/7.
|
||
|---|---|---|
| .. | ||
| architecture.md | ||
| branch-protection.md | ||
| ci.md | ||
| cluster-axioms.md | ||
| cluster-config-implementation-spec.md | ||
| cluster-config-specs.md | ||
| codeowners.md | ||
| execution.md | ||
| index.md | ||
| invariants.md | ||
| lance.md | ||
| merge.md | ||
| rfc-001-queries-envelope-mcp.md | ||
| rfc-002-config-cli-architecture.md | ||
| rfc-003-mcp-server-surface.md | ||
| rfc-004-cluster-graph-schema-apply.md | ||
| rfc-005-server-cluster-boot.md | ||
| rfc-007-operator-config.md | ||
| rfc-008-deprecate-omnigraph-yaml.md | ||
| rfc-009-unify-access-paths.md | ||
| rfc-010-cli-planes-restructure.md | ||
| rfc-011-cli-refactoring.md | ||
| rfc-012-embedding-provider-config.md | ||
| schema-lint-v1-plan.md | ||
| testing.md | ||
| writes.md | ||