Mirror of https://github.com/ModernRelay/omnigraph.git (synced 2026-06-24 02:38:06 +02:00)
* docs(rfc): RFC-013 write-path latency design + index link

* perf(engine): open write-path tables directly, bypassing the namespace builder

  Write opens routed through DatasetBuilder::from_namespace, whose describe_table opened the whole dataset just to return a location and then re-resolved the latest version -- an O(commit-depth) double latest-resolution per table open that missed Lance's O(1) version-hint fast path. On an object store this dominated write latency (~70%, RFC-013 section 2.4).

  TableStore::open_dataset_head_for_write now delegates to the direct opener (open_dataset_head: Dataset::open by URI + checkout_branch, routed through the tracked opener so cost tests can count it; a no-op in production). The manifest already holds every sub-table's location, so the namespace catalog lookup was redundant; ensure_expected_version still validates head == pinned for strict ops.

  This completes PR #268's open-by-location migration on the write side. With both reads (PR #268) and now writes bypassing it, nothing in production routes through the per-table Lance namespace. The dead open chain (load_table_from_namespace, open_table_head_for_write) is deleted and the StagedTableNamespace contract apparatus is gated #[cfg(test)], mirroring the already-test-only read namespace; __manifest commit coordination (GraphNamespacePublisher) is a separate component and is unaffected.

  See docs/dev/rfc-013-write-path-latency.md sections 2.4 and 9 (step 3a).

* test(engine): write-path cost-budget gate on a shared harness

  Adds tests/helpers/cost.rs, a store-agnostic cost harness (IoCounts/StagedCounts, measure/measure_with_staged, assert_flat, local_graph/s3_graph) that the read-side warm_read_cost.rs, write_cost.rs, and write_cost_s3.rs share, so the IOTracker / task-local plumbing lives in exactly one place instead of duplicated per test.

  write_cost.rs (local, every-PR) gates the internal-table scan term flat in commit-history depth (a RED #[ignore]'d LOCK, the acceptance for bringing the internal tables into compaction) plus green guards: a single insert's data writes are bounded, a per-write read-op ceiling fails the moment a round-trip is added, and a keyed insert routes through stage_merge_insert once with no stage_append or vector-index build.

  write_cost_s3.rs (bucket-gated, rustfs CI) gates the data-table opener term flat across depth -- the object-store-RPC phenomenon local FS cannot reproduce, and the red->green proof of the opener bypass. Wired into the rustfs_integration CI job and its path filter.

  Guards the "hot-path cost is bounded by work, not history" invariant on writes. See docs/dev/rfc-013-write-path-latency.md section 5.1, docs/dev/testing.md.

* docs(rfc): RFC-013 step 3a landed; write-skew coupling; cost-gate test map

  - Section 9: mark step 1 (gate + harness) and step 3a (opener bypass) landed; record the per-table namespace retirement to test-only and the corrected measurement note (the opener win is S3-only; the local data-table growth is the merge-insert/RI fragment scan, a compaction term, not the opener).
  - Sections 7.1/6/11/5.5/10: correct the cross-table write-skew analysis after a prototype proved the scoped expected-set fix is a no-op against the per-object_id manifest (disjoint writers never share a row, so Lance never conflicts, the publisher never retries, and the expected check is a non-atomic pre-check evaluated once against stale state). The fix needs a shared contention row (Phase-7 graph_head / a minimal head row / commit-time re-validation), so it is coupled to that row, not standalone; that contention is load-bearing for correctness, not a drawback. Split the concurrent face (read-set + head) from the sequential face (inbound-RI validation on node removal) -- two different fixes.
  - testing.md: add write_cost.rs / helpers/cost.rs / write_cost_s3.rs to the test map; document the local-vs-S3 backend split; extend the cost-budget checklist item to the write/open path and point at the shared harness.

* test(engine): isolate the opener in the S3 cost gate; fail loud on S3 setup errors

  Addresses two PR review findings on the bucket-gated write_cost_s3 gate:

  - The data-table opener was not isolated: `data_reads` also counts the merge-insert/RI scan, which reads O(fragment-count) and so grows with history for a different reason (compaction's domain, not the opener) -- the same term that made the local data-table count grow. The flat assertion would false-RED or misattribute scan growth to the opener on rustfs. Fix: compact (db.optimize) before each measurement so the table holds ~1 fragment, bounding the scan and leaving the opener's latest-version resolution as the only history-varying term. Compaction preserves version history, so the opener still faces a deep _versions/ chain -- the thing under test.
  - s3_graph used `.ok()?`, so when OMNIGRAPH_S3_TEST_BUCKET was set but the store was down/misconfigured, init/seed failures collapsed to None and the gate skipped + passed vacuously. Fix: skip only when the bucket env var is absent; once it is set, init/seed failures panic (mirrors tests/s3_storage.rs).

* test(engine): isolate the S3 opener with a per-prefix IO probe (correct-by-design)

  Replaces the fixture-bounded isolation (compact-before-measure) from the prior commit with the root fix: a path-classifying ObjectStore wrapper (PrefixCounter) that attributes each data-table read to the opener term (_versions/.manifest) vs the scan term (data/*.lance). IoCounts now exposes data_opener_reads / data_scan_reads, so write_cost_s3 asserts the opener flat *directly* -- no compaction or fixture massaging, and the assertion measures the opener, not the conflated total.

  Closes the "harness conflates two IO terms" class: any cost test (read or write) can now isolate the opener. PrefixCounter implements only the object_store 0.13 core ObjectStore methods; the convenience surface (get/put/head/...) routes through get_opts/put_opts via ObjectStoreExt's blanket impl, so every read/write is still counted.

  Validated locally (every-PR) by write_cost::data_table_reads_split_into_flat_opener_and_growing_scan: opener stays flat (7 -> 3) while scan grows (11 -> 91) and opener + scan == data_total exactly -- proving the classifier and confirming the local data-table growth is the fragment scan, not the opener. warm_read_cost (12 tests) stays green under the shared-harness change.

* refactor(tests): remove cost-harness duplication and namespace cfg(test) noise

  Branch self-review (no behavior change) -- pay down three liabilities the write-path work left:

  - warm_read_cost.rs kept its own probes() (three IOTrackers + a QueryIoProbes + a probe counter) and read raw .stats().read_iops -- duplicating the shared helpers::cost harness this branch introduced. Migrated all 12 tests onto measure()/IoCounts; deleted the local probes(). (This also makes IoCounts' version_probes field used rather than dead.)
  - insert_cost was copy-pasted verbatim into write_cost.rs and write_cost_s3.rs. Hoisted to helpers::cost::measure_insert so the measured write is defined once.
  - The per-table Lance namespace (namespace.rs) became entirely test-only after step 3a, but was gated with ~22 per-item #[cfg(test)] attributes. Collapsed to a single `#[cfg(test)] mod namespace;` and stripped the per-item attributes; merged the import groups the gating had split.

  Verified: lib in-source 162 passed; write_cost 4 + warm_read_cost 12 passed; forbidden_apis passed.
# Developer Docs

**Audience:** contributors, maintainers, and coding agents

This is the contributor-facing entry point. These docs explain architecture,
invariants, implementation contracts, test ownership, and upstream Lance
constraints. User-facing behavior should still be documented through
[docs/user/index.md](../user/index.md) and the relevant public reference docs.

## Required For Every Non-Trivial Change

| Need | Read |
|---|---|
| Architectural rules, known gaps, deny-list | [invariants.md](invariants.md) |
| Upstream Lance source-of-truth index | [lance.md](lance.md) |
| Existing test coverage and test placement | [testing.md](testing.md) |

## Architecture And Storage

| Area | Read |
|---|---|
| System structure, L1/L2 framing, component diagrams | [architecture.md](architecture.md) |
| On-disk layout, manifest schema, URI behavior | [storage.md](../user/concepts/storage.md) |
| Direct-publish writes, D2, staged writes, recovery sidecars | [writes.md](writes.md) |
| Query execution, mutation execution, loader flow | [execution.md](execution.md) |
| Index lifecycle and graph topology indexes | [indexes.md](../user/search/indexes.md) |
| Branch and commit internals | [branches-commits.md](../user/branching/index.md) |
| Three-way merge implementation and conflicts | [merge.md](merge.md) |
| Diff/change-feed implementation | [changes.md](../user/branching/changes.md) |
| Branch protection policy | [branch-protection.md](branch-protection.md) |

## Language, Runtime, And Boundaries

| Area | Read |
|---|---|
| Schema grammar, catalog, migration planner | [schema-language.md](../user/schema/index.md) |
| Query grammar, IR, lints, mutation restrictions | [query-language.md](../user/queries/index.md) |
| Embedding client and `@embed` integration | [embeddings.md](../user/search/embeddings.md) |
| Cedar policy surface and server gating | [policy.md](../user/operations/policy.md) |
| Server auth, OpenAPI, endpoint handlers | [server.md](../user/operations/server.md) |
| Error taxonomy and serialization | [errors.md](../user/operations/errors.md) |
| Constants and tunables | [constants.md](../user/reference/constants.md) |
| Transaction model public contract | [transactions.md](../user/branching/transactions.md) |

## Project Operations

| Area | Read |
|---|---|
| CI and release workflows | [ci.md](ci.md) |
| Install and deployment packaging | [install.md](../user/install.md), [deployment.md](../user/deployment.md) |
| Release history | [releases/](../releases/) |

## Contribution & Governance

| Area | Read |
|---|---|
| How to contribute (external) | [CONTRIBUTING.md](../../CONTRIBUTING.md) |
| Governance model, roles, decision authority | [GOVERNANCE.md](../../GOVERNANCE.md) |
| Public contribution RFC track | [rfcs/](../rfcs/) |

The `docs/rfcs/` track is the **public, externally-authorable** RFC process. The
maintainer/internal RFCs below (`rfc-00N-*.md`) are a separate, team-owned
track; don't conflate the two.

## Case Studies

Worked write-ups of specific bugs: root cause, fix, and the reasoning that
ruled out the tempting-but-wrong alternatives. Read these for the debugging
pattern, not just the outcome.

| Area | Read |
|---|---|
| camelCase property filters lowercased at runtime (#283): two engine→Lance boundaries, two different fixes | [bug-case-fix.md](bug-case-fix.md) |

## Active Implementation Plans

Working documents for in-flight feature work. Removed when the work lands.

| Area | Read |
|---|---|
| Schema-lint chassis v1 (MR-694): `--allow-data-loss`, soft/hard drops | [schema-lint-v1-plan.md](schema-lint-v1-plan.md) |
| Inline + stored queries, request/response envelope, MCP (MR-656 / MR-976 / MR-969) | [rfc-001-queries-envelope-mcp.md](rfc-001-queries-envelope-mcp.md) |
| Config & CLI architecture: layered config, client targeting, file naming (MR-973 / MR-974 / MR-981) | [rfc-002-config-cli-architecture.md](rfc-002-config-cli-architecture.md) |
| MCP server surface: full tool parity, stored queries, modular auth (MR-969 / MR-956 / MR-974) | [rfc-003-mcp-server-surface.md](rfc-003-mcp-server-surface.md) |
| Future cluster control plane: declarative as-code config, JSON state ledger, reconciler | [cluster-config-specs.md](cluster-config-specs.md), [cluster-axioms.md](cluster-axioms.md), [cluster-config-implementation-spec.md](cluster-config-implementation-spec.md) |
| Cluster graph & schema apply: Phase 4 sidecars, roll-forward recovery, approval artifacts | [rfc-004-cluster-graph-schema-apply.md](rfc-004-cluster-graph-schema-apply.md) |
| Server boots from cluster state: Phase 5 mode switch, applied-revision serving | [rfc-005-server-cluster-boot.md](rfc-005-server-cluster-boot.md) |
| Per-operator config: `~/.omnigraph/` identity, keyed credentials, named servers (the operator slice of RFC-002) | [rfc-007-operator-config.md](rfc-007-operator-config.md) |
| Deprecate `omnigraph.yaml`: one concern per config surface; key-by-key migration map and staged retirement | [rfc-008-deprecate-omnigraph-yaml.md](rfc-008-deprecate-omnigraph-yaml.md) |
| Unify CLI embedded/remote access paths: parity referee, shared wire-DTO crate, `GraphClient` trait, declared plane capabilities | [rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) |
| Restructure the CLI around explicit planes: one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | [rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) |
| CLI refactoring: one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
| Provider-independent embedding configuration: one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
| Write-path latency: capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |

## Boundary

Developer docs may mention implementation details, stale gaps, upstream Lance
blockers, and review rules. User docs should not require that context unless
the detail changes the public contract.