Merge branch 'main' into ragnorc/omnigraph-mcp-crate

Folds in v0.7.1 (release #290 + optimize/write-path/internal-table-compaction
fixes #288/#291/#297) under the MCP branch.

Conflict resolutions (5 files):
- crates/omnigraph-server/Cargo.toml: take main's 0.7.1 path-dep constraints;
  keep our omnigraph-mcp dep (bumped to 0.7.1) + http dep.
- crates/omnigraph-server/src/handlers.rs: keep our server_list_queries
  doc-comment (exposed @mcp(expose) subset, invoke_query-gated) — it supersedes
  main's pre-@mcp(expose) text, since this branch adds the per-query expose flag.
- docs/user/operations/server.md: keep our GET /queries description
  (invoke_query gate + @mcp(expose) exposure) over main's read-gated/list-all text.
- docs/dev/index.md: keep both in-flight RFC rows; renumber this branch's tenancy
  RFC 013 -> 014 (rfc-014-tenancy-cells.md) since main now owns RFC-013
  (rfc-013-write-path-latency.md). Title + index link updated; link-check green.
- openapi.json: regenerated from merged source (OMNIGRAPH_UPDATE_OPENAPI=1) — now
  info.version 0.7.1 with our invoke_query/@mcp schema.

Coherence: omnigraph-mcp bumped 0.7.0 -> 0.7.1 to match the workspace; Cargo.lock
updated. cargo build --workspace green; server/mcp/api-types/compiler suites green
(schema_routes.rs reopen-after-apply flakes under parallel IO on a near-full disk,
passes single-threaded — a pre-existing main test, unchanged by the merge).
This commit is contained in:
Ragnor Comerford 2026-06-23 18:26:45 +02:00
commit adc36adf32
No known key found for this signature in database
44 changed files with 3595 additions and 528 deletions

70
docs/dev/docs-issues.md Normal file
View file

@ -0,0 +1,70 @@
# User Docs Coherence Ledger
**Last review:** 2026-06-20 (against 0.7.1)
**Status:** all open findings resolved — living ledger for future audits.
This page tracks stale or incoherent user-doc claims found during broad docs
reviews. Findings are validated against current **code/behavior**, not just
cross-doc consistency. Record new findings as they surface; mark them resolved
(with the fixing commit) once the public pages are corrected.
## Resolved — 2026-06-20 docs/user coherence sweep
Every finding from the 2026-06-20 review was validated (all reproduced) and
fixed. Branch `docs/user-coherence-0-7-1`.
| Pri | Finding | Resolution |
|---|---|---|
| P1 | `cluster apply` documented as catalog-only / "Stage 3A" with graph+schema deferred — in both `cli/reference.md` and the shipped CLI help (`cli.rs`) | Rewrote both to describe the real converge behavior (creates graphs, applies schema with soft drops, writes catalog, executes approved deletes in one ordered run); `deferred` now means the genuinely-unsupported case (standalone schema delete). |
| P1 | Stored-query exposure had two contracts: `server.md` documented a per-query `mcp:{expose:false}` knob; cluster docs said all queries are listed | Confirmed in code: cluster registry has no expose field (`QueryConfig`), boot bridge hardcodes `expose: true` (`omnigraph-server` settings), no GQ-level annotation. Removed the knob from `server.md`; documented "every applied query is listed; per-query exposure may become a Cedar-policy decision later". |
| P1 | The same stale "`mcp.expose == true` subset" contract lived in the **OpenAPI surface**: utoipa annotations (`handlers.rs:1029,1037`, `omnigraph-api-types/src/lib.rs:404`) drove `openapi.json` (Greptile catch on #293) | Updated the three Rust doc-comment/annotation strings to "every stored query" and regenerated `openapi.json` (`OMNIGRAPH_UPDATE_OPENAPI=1`); drift test green. Same-change per AGENTS.md rule 4. |
| P2 | `schema/index.md` claimed `allow_data_loss` honored "uniformly across transports" incl. HTTP `POST /schema/apply` | Scoped to the direct/embedded path; added that cluster-managed graphs evolve via `cluster apply` (soft drops only) and the HTTP route is 409-disabled for cluster serving. |
| P2 | `/load` missing from admission / body-limit / rate-limit / manifest-conflict prose (named `/ingest` only); constants called it "Ingest body limit" | Documented `/load` as canonical everywhere with `/ingest` as the deprecated alias; renamed the constant to "Load (bulk-write) body limit". |
| P2 | CLI "Bearer token resolution" section listed removed `omnigraph.yaml` keys (`graphs.<name>.bearer_token_env`, `auth.env_file`) | Replaced with a pointer to the keyed-credential model (`OMNIGRAPH_TOKEN_<NAME>``~/.omnigraph/credentials``OMNIGRAPH_BEARER_TOKEN`); no plaintext-in-config path. |
| P2 | Flat route names in a cluster-only server (`POST /query`, `POST /mutate`, `GET /queries`, `POST /queries/{name}`) | Added a one-line note that the per-graph subsections use shorthand under `/graphs/{id}/…`; the endpoint table is already fully qualified. |
| — | `version` printed `omnigraph 0.3.x` | → `0.7.x`. |
| — | `search/indexes.md` used deprecated `ingest --mode merge` | → `load --mode merge`. |
| — | `config.md` `deferred` disposition described as "graph/schema change, later phase" | → "an unsupported change (e.g. standalone schema delete)". |
| — | Stale stage labels (`Stage 3A`, `Stage 2C`, `Stage 1`) in active reference docs | Removed / reworded to plain language; release notes keep history. |
## Open — surfaced 2026-06-20, not yet fixed
- **Stale "config-only apply" / "Stage 3A" comments in `omnigraph-cluster`
source** (internal rustdoc, not user docs — out of scope for the docs sweep
above): `src/types.rs:147` ("Applied changes execute (config-only query/policy
catalog writes)"), `src/types.rs:265` ("Output of config-only cluster apply"),
`src/diff.rs:256`, and `src/tests.rs:1129` ("config-only apply (Stage 3A)").
Apply now also runs graph creates, schema applies, and approved deletes
(`diff.rs:411` `GraphCreate` / `SchemaApply`; the Stage-4 create/schema/delete
executors + tests `apply_creates_graph_and_unblocks_dependents`,
`apply_schema_update_and_dependent_query_in_one_run`,
`apply_blocks_graph_delete_without_approval`). Update these comments in a
cluster-crate change.
- **Cross-repo drift from this sweep** (separate repos):
- `omnigraph-ts` SDK — its generated `spec/openapi.json` +
`packages/sdk/src/generated/types.gen.ts` still describe the `GET /queries`
catalog as the `mcp.expose` subset. **No hand-fix:** the SDK's
`scripts/sync-spec.ts` pulls openapi.json from a *tagged* omnigraph release
(`/omnigraph/v{version}/openapi.json`), and the catalog fix landed on main
*after* the v0.7.1 tag — so it is in no tag yet and a hand-edit would be
overwritten on the next sync. It flows in automatically when the SDK bumps
to a tag containing the fix (v0.7.2+). Tracked, not actioned.
- `omnigraph-cookbooks/docs/best-practices.md` `bearer_token_env` chain —
**RESOLVED** by omnigraph-cookbooks PR #26 (2026-06-21), which deleted
`docs/best-practices.md` as part of the 0.7 restructure; the stale chain
survives nowhere on `main`.
## Verification checklist (re-run on the next docs audit)
```bash
rg -n "Stage [0-9]|graph/schema changes are deferred|reserved for later stages" docs/user crates/omnigraph-cli/src/cli.rs
rg -n "POST /query|POST /mutate|GET /queries|POST /queries/\{name\}|POST /schema/apply" docs/user
rg -n "ingest --mode|Ingest body limit|/ingest" docs/user
rg -n "0\.3\.x|bearer_token_env|auth\.env_file" docs/user
rg -n "expose: false|mcp\.expose" docs/user
```
Expected: active user docs have no matches for stale phrases, or the remaining
matches are explicitly marked as deprecated aliases, "no longer exist" notes, or
route shorthand disclaimed relative to `/graphs/{id}`. Release notes are allowed
to preserve historical behavior.

View file

@ -41,6 +41,7 @@ constraints. User-facing behavior should still be documented through
| Error taxonomy and serialization | [errors.md](../user/operations/errors.md) |
| Constants and tunables | [constants.md](../user/reference/constants.md) |
| Transaction model public contract | [transactions.md](../user/branching/transactions.md) |
| User-doc coherence cleanup ledger | [docs-issues.md](docs-issues.md) |
## Project Operations
@ -91,7 +92,8 @@ Working documents for in-flight feature work. Removed when the work lands.
| Restructure the CLI around explicit planes — one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | [rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) |
| CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
| Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
| Tenancy model — cluster-as-tenant cells (silo the data, pool the compute): `CellRuntime` lifts the per-cluster runtime, one server hosts N cells resolved by host before auth, WorkOS org→cell 1:1 with per-cell audience, tiered dedicated/pooled/on-prem on one binary | [rfc-013-tenancy-cells.md](rfc-013-tenancy-cells.md) |
| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
| Tenancy model — cluster-as-tenant cells (silo the data, pool the compute): `CellRuntime` lifts the per-cluster runtime, one server hosts N cells resolved by host before auth, WorkOS org→cell 1:1 with per-cell audience, tiered dedicated/pooled/on-prem on one binary | [rfc-014-tenancy-cells.md](rfc-014-tenancy-cells.md) |
## Boundary

View file

@ -285,11 +285,14 @@ them explicit.
because Lance branch names can be deleted/recreated at the same version number;
the manifest e_tag is carried into synthetic snapshot ids when available, and
a detected same-branch manifest refresh clears read caches as the fallback for
e_tag-less table locations/topology. Remaining: the internal metadata tables
(`__manifest`, `_graph_commits`) are still not compacted, so the probe and
refresh cost still grows with fragment count on a long-lived graph (the
`optimize`-covers-internal-tables follow-up); the commit graph is not yet
reconcilable from the manifest; and the traversal id-map is still rebuilt.
e_tag-less table locations/topology. Remaining: `optimize` now compacts the
internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2),
so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat
in history; but they are not yet brought into `cleanup` (version GC), so the
`_versions/` chain still grows until an explicit cleanup (the cleanup half is
deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
graph is not yet reconcilable from the manifest; and the traversal id-map is
still rebuilt.
- **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
the commit-graph head from storage before appending, so a same-branch write
after an external commit no longer forks the commit DAG by parenting off a

File diff suppressed because it is too large Load diff

View file

@ -1,4 +1,4 @@
# RFC-013: Tenancy model — cluster-as-tenant cells, pooled compute
# RFC-014: Tenancy model — cluster-as-tenant cells, pooled compute
**Status:** Proposed — general architecture (server topology, identity, deployment).
**Date:** 2026-06-16

View file

@ -27,6 +27,8 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
| `lifecycle.rs` | Graph lifecycle, schema state |
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
| `changes.rs` | `diff_between` / `diff_commits` |
@ -70,9 +72,10 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
## RustFS / S3 integration
CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml``rustfs_integration` job):
CI runs these S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml``rustfs_integration` job):
- `cargo test -p omnigraph-engine --test s3_storage`
- `cargo test -p omnigraph-engine --test write_cost_s3` (RFC-013 step 3a's data-table opener cost gate — flat across commit depth on S3; the term local FS can't reproduce)
- `cargo test -p omnigraph-server --test s3` (single-graph serving + config-free `--cluster s3://` boot)
- `cargo test -p omnigraph-cluster --test s3_cluster` (full control-plane lifecycle on the bucket)
- `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
@ -127,7 +130,7 @@ When you pick up any change, walk through this:
6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.
9. **Bound hot-path cost at history depth.** If the change touches a read or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a fixed object-op count) against a fixture with realistic *commit-history depth*, not just realistic row counts. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.
9. **Bound hot-path cost at history depth.** If the change touches a read, **write**, or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a per-write read-op count flat across commit depth) against a fixture with realistic *commit-history depth*, not just realistic row counts. Reuse the shared `helpers::cost` harness (`measure`/`IoCounts`/`assert_flat`) — don't hand-roll `IOTracker` wiring. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.
## Cost-budget tests: bound hot-path cost at history depth
@ -135,6 +138,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg
- **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
- **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
- This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).
When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.