mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-24 02:38:06 +02:00
Merge branch 'main' into ragnorc/omnigraph-mcp-crate
Folds in v0.7.1 (release #290 + optimize/write-path/internal-table-compaction fixes #288/#291/#297) under the MCP branch.

Conflict resolutions (5 files):
- crates/omnigraph-server/Cargo.toml: take main's 0.7.1 path-dep constraints; keep our omnigraph-mcp dep (bumped to 0.7.1) + http dep.
- crates/omnigraph-server/src/handlers.rs: keep our server_list_queries doc-comment (exposed @mcp(expose) subset, invoke_query-gated) — it supersedes main's pre-@mcp(expose) text, since this branch adds the per-query expose flag.
- docs/user/operations/server.md: keep our GET /queries description (invoke_query gate + @mcp(expose) exposure) over main's read-gated/list-all text.
- docs/dev/index.md: keep both in-flight RFC rows; renumber this branch's tenancy RFC 013 -> 014 (rfc-014-tenancy-cells.md) since main now owns RFC-013 (rfc-013-write-path-latency.md). Title + index link updated; link-check green.
- openapi.json: regenerated from merged source (OMNIGRAPH_UPDATE_OPENAPI=1) — now info.version 0.7.1 with our invoke_query/@mcp schema.

Coherence: omnigraph-mcp bumped 0.7.0 -> 0.7.1 to match the workspace; Cargo.lock updated. cargo build --workspace green; server/mcp/api-types/compiler suites green (schema_routes.rs reopen-after-apply flakes under parallel IO on a near-full disk, passes single-threaded — a pre-existing main test, unchanged by the merge).
commit
adc36adf32
44 changed files with 3595 additions and 528 deletions
70
docs/dev/docs-issues.md
Normal file
@ -0,0 +1,70 @@
# User Docs Coherence Ledger

**Last review:** 2026-06-20 (against 0.7.1)
**Status:** all open findings resolved — living ledger for future audits.

This page tracks stale or incoherent user-doc claims found during broad docs
reviews. Findings are validated against current **code/behavior**, not just
cross-doc consistency. Record new findings as they surface; mark them resolved
(with the fixing commit) once the public pages are corrected.

## Resolved — 2026-06-20 docs/user coherence sweep

Every finding from the 2026-06-20 review was validated (all reproduced) and
fixed. Branch `docs/user-coherence-0-7-1`.

| Pri | Finding | Resolution |
|---|---|---|
| P1 | `cluster apply` documented as catalog-only / "Stage 3A" with graph+schema deferred — in both `cli/reference.md` and the shipped CLI help (`cli.rs`) | Rewrote both to describe the real converge behavior (creates graphs, applies schema with soft drops, writes catalog, executes approved deletes in one ordered run); `deferred` now means the genuinely-unsupported case (standalone schema delete). |
| P1 | Stored-query exposure had two contracts: `server.md` documented a per-query `mcp:{expose:false}` knob; cluster docs said all queries are listed | Confirmed in code: cluster registry has no expose field (`QueryConfig`), boot bridge hardcodes `expose: true` (`omnigraph-server` settings), no GQ-level annotation. Removed the knob from `server.md`; documented "every applied query is listed; per-query exposure may become a Cedar-policy decision later". |
| P1 | The same stale "`mcp.expose == true` subset" contract lived in the **OpenAPI surface**: utoipa annotations (`handlers.rs:1029,1037`, `omnigraph-api-types/src/lib.rs:404`) drove `openapi.json` (Greptile catch on #293) | Updated the three Rust doc-comment/annotation strings to "every stored query" and regenerated `openapi.json` (`OMNIGRAPH_UPDATE_OPENAPI=1`); drift test green. Same-change per AGENTS.md rule 4. |
| P2 | `schema/index.md` claimed `allow_data_loss` honored "uniformly across transports" incl. HTTP `POST /schema/apply` | Scoped to the direct/embedded path; added that cluster-managed graphs evolve via `cluster apply` (soft drops only) and the HTTP route is 409-disabled for cluster serving. |
| P2 | `/load` missing from admission / body-limit / rate-limit / manifest-conflict prose (named `/ingest` only); constants called it "Ingest body limit" | Documented `/load` as canonical everywhere with `/ingest` as the deprecated alias; renamed the constant to "Load (bulk-write) body limit". |
| P2 | CLI "Bearer token resolution" section listed removed `omnigraph.yaml` keys (`graphs.<name>.bearer_token_env`, `auth.env_file`) | Replaced with a pointer to the keyed-credential model (`OMNIGRAPH_TOKEN_<NAME>` → `~/.omnigraph/credentials` → `OMNIGRAPH_BEARER_TOKEN`); no plaintext-in-config path. |
| P2 | Flat route names in a cluster-only server (`POST /query`, `POST /mutate`, `GET /queries`, `POST /queries/{name}`) | Added a one-line note that the per-graph subsections use shorthand under `/graphs/{id}/…`; the endpoint table is already fully qualified. |
| — | `version` printed `omnigraph 0.3.x` | → `0.7.x`. |
| — | `search/indexes.md` used deprecated `ingest --mode merge` | → `load --mode merge`. |
| — | `config.md` `deferred` disposition described as "graph/schema change, later phase" | → "an unsupported change (e.g. standalone schema delete)". |
| — | Stale stage labels (`Stage 3A`, `Stage 2C`, `Stage 1`) in active reference docs | Removed / reworded to plain language; release notes keep history. |
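
The keyed-credential lookup order named in the bearer-token row above can be sketched as follows. This is a minimal illustration and not the CLI's implementation: the function name `resolve_token`, the upper-casing of `<NAME>`, and the choice to pass the environment and the credentials file as already-parsed maps are all assumptions made for the example. Only the three-step order comes from the ledger.

```rust
use std::collections::HashMap;

/// Hypothetical resolver: first match wins, in the order the ledger gives.
/// `env` stands in for process environment variables and `credentials` for
/// the parsed contents of `~/.omnigraph/credentials` (file format assumed).
fn resolve_token(
    name: &str,
    env: &HashMap<String, String>,
    credentials: &HashMap<String, String>,
) -> Option<String> {
    // Assumption: <NAME> is the graph name upper-cased, with '-' mapped to '_'.
    let keyed = format!("OMNIGRAPH_TOKEN_{}", name.to_uppercase().replace('-', "_"));
    env.get(&keyed)                                     // 1. keyed env var
        .or_else(|| credentials.get(name))              // 2. credentials file entry
        .or_else(|| env.get("OMNIGRAPH_BEARER_TOKEN"))  // 3. global fallback
        .cloned()
}

fn main() {
    let mut env = HashMap::new();
    env.insert("OMNIGRAPH_BEARER_TOKEN".to_string(), "global".to_string());
    let credentials = HashMap::new();
    // Only the global fallback is set, so it is what gets returned.
    println!("{:?}", resolve_token("prod", &env, &credentials));
}
```

Taking the environment as a map rather than reading `std::env` directly keeps the sketch deterministic and easy to test; no step reads a token from a config file, which matches the "no plaintext-in-config path" note.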

## Open — surfaced 2026-06-20, not yet fixed

- **Stale "config-only apply" / "Stage 3A" comments in `omnigraph-cluster`
  source** (internal rustdoc, not user docs — out of scope for the docs sweep
  above): `src/types.rs:147` ("Applied changes execute (config-only query/policy
  catalog writes)"), `src/types.rs:265` ("Output of config-only cluster apply"),
  `src/diff.rs:256`, and `src/tests.rs:1129` ("config-only apply (Stage 3A)").
  Apply now also runs graph creates, schema applies, and approved deletes
  (`diff.rs:411` `GraphCreate` / `SchemaApply`; the Stage-4 create/schema/delete
  executors + tests `apply_creates_graph_and_unblocks_dependents`,
  `apply_schema_update_and_dependent_query_in_one_run`,
  `apply_blocks_graph_delete_without_approval`). Update these comments in a
  cluster-crate change.
- **Cross-repo drift from this sweep** (separate repos):
  - `omnigraph-ts` SDK — its generated `spec/openapi.json` +
    `packages/sdk/src/generated/types.gen.ts` still describe the `GET /queries`
    catalog as the `mcp.expose` subset. **No hand-fix:** the SDK's
    `scripts/sync-spec.ts` pulls openapi.json from a *tagged* omnigraph release
    (`/omnigraph/v{version}/openapi.json`), and the catalog fix landed on main
    *after* the v0.7.1 tag — so it is in no tag yet and a hand-edit would be
    overwritten on the next sync. It flows in automatically when the SDK bumps
    to a tag containing the fix (v0.7.2+). Tracked, not actioned.
  - `omnigraph-cookbooks/docs/best-practices.md` `bearer_token_env` chain —
    **RESOLVED** by omnigraph-cookbooks PR #26 (2026-06-21), which deleted
    `docs/best-practices.md` as part of the 0.7 restructure; the stale chain
    survives nowhere on `main`.

## Verification checklist (re-run on the next docs audit)

```bash
rg -n "Stage [0-9]|graph/schema changes are deferred|reserved for later stages" docs/user crates/omnigraph-cli/src/cli.rs
rg -n "POST /query|POST /mutate|GET /queries|POST /queries/\{name\}|POST /schema/apply" docs/user
rg -n "ingest --mode|Ingest body limit|/ingest" docs/user
rg -n "0\.3\.x|bearer_token_env|auth\.env_file" docs/user
rg -n "expose: false|mcp\.expose" docs/user
```

Expected: active user docs have no matches for stale phrases, or the remaining
matches are explicitly marked as deprecated aliases, "no longer exist" notes, or
route shorthand disclaimed relative to `/graphs/{id}`. Release notes are allowed
to preserve historical behavior.

@ -41,6 +41,7 @@ constraints. User-facing behavior should still be documented through
 | Error taxonomy and serialization | [errors.md](../user/operations/errors.md) |
 | Constants and tunables | [constants.md](../user/reference/constants.md) |
 | Transaction model public contract | [transactions.md](../user/branching/transactions.md) |
+| User-doc coherence cleanup ledger | [docs-issues.md](docs-issues.md) |

 ## Project Operations

@ -91,7 +92,8 @@ Working documents for in-flight feature work. Removed when the work lands.
 | Restructure the CLI around explicit planes — one graph-addressing model, declared capability surface, plane-grouped help (expands RFC-009 Phase 4) | [rfc-010-cli-planes-restructure.md](rfc-010-cli-planes-restructure.md) |
 | CLI refactoring — one addressing & config model post-`omnigraph.yaml`: scope + `--graph` + derived access path, served-default / privileged-direct, profiles, named queries, capability classifier (completes RFC-008) | [rfc-011-cli-refactoring.md](rfc-011-cli-refactoring.md) |
 | Provider-independent embedding configuration — one resolved `EmbeddingConfig` + sealed provider enum (Gemini/OpenAI/Mock), identity recorded in the schema IR, query-time same-space validation, NFR floor | [rfc-012-embedding-provider-config.md](rfc-012-embedding-provider-config.md) |
-| Tenancy model — cluster-as-tenant cells (silo the data, pool the compute): `CellRuntime` lifts the per-cluster runtime, one server hosts N cells resolved by host before auth, WorkOS org→cell 1:1 with per-cell audience, tiered dedicated/pooled/on-prem on one binary | [rfc-013-tenancy-cells.md](rfc-013-tenancy-cells.md) |
+| Write-path latency — capture-once `WriteTxn`, version-pinned opens, one `GraphPublishAuthority` fed declarative `PublishPlan`s, manifest-authoritative lineage, epoch fence, bounded history (compaction + cleanup), and an IO-counted cost contract (`iss-write-s3-roundtrip-amplification`, `iss-991`) | [rfc-013-write-path-latency.md](rfc-013-write-path-latency.md) |
+| Tenancy model — cluster-as-tenant cells (silo the data, pool the compute): `CellRuntime` lifts the per-cluster runtime, one server hosts N cells resolved by host before auth, WorkOS org→cell 1:1 with per-cell audience, tiered dedicated/pooled/on-prem on one binary | [rfc-014-tenancy-cells.md](rfc-014-tenancy-cells.md) |

 ## Boundary

@ -285,11 +285,14 @@ them explicit.
 because Lance branch names can be deleted/recreated at the same version number;
 the manifest e_tag is carried into synthetic snapshot ids when available, and
 a detected same-branch manifest refresh clears read caches as the fallback for
-e_tag-less table locations/topology. Remaining: the internal metadata tables
-(`__manifest`, `_graph_commits`) are still not compacted, so the probe and
-refresh cost still grows with fragment count on a long-lived graph (the
-`optimize`-covers-internal-tables follow-up); the commit graph is not yet
-reconcilable from the manifest; and the traversal id-map is still rebuilt.
+e_tag-less table locations/topology. Remaining: `optimize` now compacts the
+internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2),
+so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat
+in history; but they are not yet brought into `cleanup` (version GC), so the
+`_versions/` chain still grows until an explicit cleanup (the cleanup half is
+deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
+graph is not yet reconcilable from the manifest; and the traversal id-map is
+still rebuilt.
 - **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
 the commit-graph head from storage before appending, so a same-branch write
 after an external commit no longer forks the commit DAG by parenting off a
1467
docs/dev/rfc-013-write-path-latency.md
Normal file
File diff suppressed because it is too large
@ -1,4 +1,4 @@
-# RFC-013: Tenancy model — cluster-as-tenant cells, pooled compute
+# RFC-014: Tenancy model — cluster-as-tenant cells, pooled compute

 **Status:** Proposed — general architecture (server topology, identity, deployment).
 **Date:** 2026-06-16
@ -27,6 +27,8 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 | `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
 | `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
 | `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
+| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
+| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
 | `lifecycle.rs` | Graph lifecycle, schema state |
 | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
 | `changes.rs` | `diff_between` / `diff_commits` |
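
The source-walk guard described in the `forbidden_apis.rs` row can be pictured with the sketch below. It is a simplified illustration, not the actual test: the function name `violations`, the plain substring matching, and the assumption that the `// forbidden-api-allow:` sentinel sits on the same line as the call are all hypothetical. The pattern strings themselves are the ones the row names.

```rust
/// Direct-open calls the row lists as forbidden in engine code.
const FORBIDDEN: &[&str] = &[
    "Dataset::open",
    "DatasetBuilder::from_uri",
    "DatasetBuilder::from_namespace",
];

/// Assumption: a reviewed line carries this sentinel on the same line.
const SENTINEL: &str = "// forbidden-api-allow:";

/// Returns (1-based line number, matched pattern) for every unexempted hit.
fn violations(source: &str) -> Vec<(usize, &'static str)> {
    source
        .lines()
        .enumerate()
        .filter(|(_, line)| !line.contains(SENTINEL))
        .filter_map(|(i, line)| {
            FORBIDDEN
                .iter()
                .find(|pat| line.contains(**pat))
                .map(|pat| (i + 1, *pat))
        })
        .collect()
}

fn main() {
    let source = "let a = Dataset::open(uri);\n\
                  let b = Dataset::open(uri); // forbidden-api-allow: reviewed\n\
                  let c = Snapshot::open(id);";
    for (line, pat) in violations(source) {
        println!("line {line}: forbidden call `{pat}`");
    }
}
```

A guard of this shape fails on the first line and passes the other two: the second is exempted by the sentinel and the third routes through `Snapshot::open`.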
@ -70,9 +72,10 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav

 ## RustFS / S3 integration

-CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job):
+CI runs these S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job):

 - `cargo test -p omnigraph-engine --test s3_storage`
+- `cargo test -p omnigraph-engine --test write_cost_s3` (RFC-013 step 3a's data-table opener cost gate — flat across commit depth on S3; the term local FS can't reproduce)
 - `cargo test -p omnigraph-server --test s3` (single-graph serving + config-free `--cluster s3://` boot)
 - `cargo test -p omnigraph-cluster --test s3_cluster` (full control-plane lifecycle on the bucket)
 - `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
@ -127,7 +130,7 @@ When you pick up any change, walk through this:
 6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
 7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
 8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.
-9. **Bound hot-path cost at history depth.** If the change touches a read or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a fixed object-op count) against a fixture with realistic *commit-history depth*, not just realistic row counts. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.
+9. **Bound hot-path cost at history depth.** If the change touches a read, **write**, or open path, add or extend a test that asserts a *bounded* cost (e.g. a warm same-branch read performs zero `Dataset::open`, or a per-write read-op count flat across commit depth) against a fixture with realistic *commit-history depth*, not just realistic row counts. Reuse the shared `helpers::cost` harness (`measure`/`IoCounts`/`assert_flat`) — don't hand-roll `IOTracker` wiring. Cost that scales with history is invisible on a shallow fixture and only bites in production. See "Cost-budget tests" below.

 ## Cost-budget tests: bound hot-path cost at history depth

@ -135,6 +138,7 @@ Correctness bugs fail loudly in tests; cost-scaling bugs pass every test and deg

 - **Assert a cost budget, not just a result.** For a read/open path, assert the number of `Dataset::open` calls (or object-store ops) a warm query performs, and that it does not grow with commit count. The reference is LanceDB's IO-counted tests, which assert a cached read costs 0-1 IO and carry a named regression test against "a list call on every subsequent query."
 - **Test at history depth.** Build a fixture with many *commits* (not many rows) and assert warm-read cost is flat across depths. A shallow fixture cannot catch an O(commits) cost.
+- **Use the shared harness, and gate each term on the backend where it manifests.** `helpers::cost` (`measure`/`IoCounts`/`assert_flat`/`local_graph`/`s3_graph`) is the one place the `IOTracker`/task-local plumbing lives — consume it, don't duplicate it. The write path has *two distinct* depth terms that split cleanly across backends, and conflating them is a real trap (the local data-table read count grows with depth too, but for a different reason — the merge-insert/RI scan reading O(depth) *fragments*, reduced by compaction, not by the opener): (1) the **internal-table** scan term (`__manifest`/`_graph_commits` fragment scans) reproduces on **any** backend including local FS, so `write_cost.rs` gates it on local every-PR; (2) the **data-table opener** term (latest-version resolution) is a per-object-store-RPC phenomenon — local-FS resolves latest with one cheap `read_dir` regardless of the opener used, so the namespace-vs-direct difference is **invisible on local** and only shows on a real object store (per-version GETs), gated by the bucket-gated `write_cost_s3.rs`. Same harness, different fixture; each term asserted where it actually appears.
 - This is the testing companion to invariant 15 in [docs/dev/invariants.md](invariants.md) (hot-path cost is bounded by work, not history).

 When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.
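
To make the "flat across depths" assertion concrete, here is a minimal sketch of what a flatness check over a measured cost curve could look like. It only mirrors the parameter list `assert_flat(curve, select, slack, what)` quoted above; the body, the `check_flat` name, the `IoCounts` fields, and the rule "no depth may exceed the shallowest depth's count by more than `slack`" are assumptions for illustration, and the real `helpers::cost` harness may differ.

```rust
/// Hypothetical per-measurement counters, split by table class.
#[derive(Debug, Clone, Copy)]
struct IoCounts {
    internal_table_reads: u64,
    data_table_reads: u64,
}

/// Checks that the selected counter stays within `slack` of the value
/// measured at the first (shallowest) depth in `curve`.
fn check_flat(
    curve: &[(usize, IoCounts)],
    select: impl Fn(&IoCounts) -> u64,
    slack: u64,
    what: &str,
) -> Result<(), String> {
    let Some((base_depth, base)) = curve.first() else {
        return Err(format!("{what}: empty curve"));
    };
    let baseline = select(base);
    for (depth, counts) in curve {
        let got = select(counts);
        if got > baseline + slack {
            return Err(format!(
                "{what}: {got} ops at depth {depth}, baseline {baseline} at depth {base_depth}, slack {slack}"
            ));
        }
    }
    Ok(())
}

fn main() {
    // Made-up measurements at commit-history depths 1, 50, and 500.
    let curve = [
        (1usize, IoCounts { internal_table_reads: 3, data_table_reads: 2 }),
        (50, IoCounts { internal_table_reads: 3, data_table_reads: 9 }),
        (500, IoCounts { internal_table_reads: 4, data_table_reads: 40 }),
    ];
    // The internal-table term stays within slack, so this check passes.
    assert!(check_flat(&curve, |c| c.internal_table_reads, 1, "internal-table reads").is_ok());
    // The data-table term grows with depth, so this check reports an error.
    if let Err(msg) = check_flat(&curve, |c| c.data_table_reads, 1, "data-table reads") {
        println!("{msg}");
    }
}
```

Selecting one counter per check is what lets the two write-path depth terms be asserted separately, each on the backend where it shows up.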