mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-24 02:38:06 +02:00

Andrew Altshuler 9c792649e2

docs(user): coherence cleanup aligned with 0.7.1 (#293)

* docs(cli): fix cluster apply semantics — converges graphs+schema, not config-only

`cluster apply` creates graphs, applies schema updates (soft drops), writes
stored-query/policy catalog resources, and executes approved graph deletes in
one ordered run. Both the user docs and the shipped CLI help text still
described it as a "Stage 3A" config-only (query/policy) subset that defers
graph/schema changes "to a later stage" — wrong since the graph/schema executor
landed.

- docs/user/cli/reference.md: rewrite the cluster paragraph to describe apply's
  actual converge behavior; keep deferred for the genuinely-unsupported case
  (standalone schema deletes); drop the stale "Stage 3A" / "reserved for later
  stages" framing.
- crates/omnigraph-cli/src/cli.rs: fix the `cluster apply` help text to match.

Part of the docs/user coherence cleanup (docs/dev/docs-issues.md, P1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7

* docs(server): align stored-query exposure with cluster-only behavior

server.md documented a per-query expose knob ("`mcp.expose` defaults to true;
set `mcp: { expose: false }` to hide from the catalog") that does not exist in
the only deployment mode. Cluster-only serving lists every stored query: the
cluster registry has no expose field (`QueryConfig { file }`) and the boot
bridge hardcodes `expose: true` for all cluster queries
(omnigraph-server settings), and there is no GQ-level expose annotation. This
contradicted clusters/config.md, which already states the correct behavior.

Replace the knob bullet with the cluster truth (every applied query is listed;
per-query exposure may become a Cedar-policy decision later) and drop the
"`mcp.expose` stored queries" phrasing from the catalog description, the
endpoint table, and the intro. The `mcp_expose` JSON catalog field is unchanged
(still emitted, always true in cluster mode).

Part of the docs/user coherence cleanup (docs/dev/docs-issues.md, P1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7

* docs(schema): split direct/embedded vs cluster-managed schema apply

schema/index.md claimed `allow_data_loss` is "honored uniformly across
transports" and listed HTTP `POST /schema/apply` among them. But that route is
409-disabled for cluster-backed serving (already documented in server.md), and
cluster-managed graphs evolve only through `cluster apply` with soft drops —
there is no cluster HTTP data-loss path.

Scope the data-loss flag to the direct/embedded path (`schema apply --store`,
SDK), and add a paragraph: cluster-managed graphs use `cluster apply`
(soft drops only); HTTP `POST /schema/apply` is 409 for cluster serving; direct
apply against a cluster-managed path is refused. Cross-refs server + cluster
docs.

Part of the docs/user coherence cleanup (docs/dev/docs-issues.md, P2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7

* docs(server): document /load as canonical in limits + admission prose

The endpoint table already listed both `/load` (canonical) and `/ingest`
(deprecated alias) at 32 MB, but the admission-control, body-limit,
rate-limit, and manifest-conflict prose named only `/ingest` — and the
constants page called the limit "Ingest body limit". Add `/load` alongside (or
ahead of) `/ingest` everywhere, and rename the constant to "Load (bulk-write)
body limit" noting the `/ingest` alias shares it.

Part of the docs/user coherence cleanup (docs/dev/docs-issues.md, P2).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7

* docs(cli): drop stale bearer-token keys + fix version string

The "Bearer token resolution (CLI)" section still listed removed omnigraph.yaml
keys (`graphs.<name>.bearer_token_env`, `auth.env_file`) — config surfaces that
no longer exist and that implied plaintext tokens in config. Replace it with a
pointer to the keyed-credential model documented above
(`OMNIGRAPH_TOKEN_<NAME>` → `~/.omnigraph/credentials` →
`OMNIGRAPH_BEARER_TOKEN`). Also fix the `version` row: the CLI prints 0.7.x, not
0.3.x.

Part of the docs/user coherence cleanup (docs/dev/docs-issues.md, P2 + smaller).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7

* docs: route-spelling note + drop stale stage/deferred crumbs

- server.md: add a one-line note that the per-graph subsections name routes in
  shorthand (`GET /queries`, `POST /query`, `POST /mutate`,
  `POST /queries/{name}`) but every one is served under `/graphs/{id}/…` — the
  endpoint table is already fully-qualified.
- clusters/config.md: redefine the `deferred` plan disposition as an unsupported
  change (e.g. a standalone schema delete) instead of "graph/schema change,
  later phase" (graph creates and schema updates apply now); drop the "Stage 2C"
  label from the lock-recovery note.
- search/indexes.md: `ingest --mode merge` → canonical `load --mode merge`.

Part of the docs/user coherence cleanup (docs/dev/docs-issues.md, P2 + smaller).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7

* docs(dev): track user-docs coherence ledger; mark 2026-06-20 findings resolved

Convert the scratch review notes into a tracked living ledger and link it from
the dev index. All ten findings from the 2026-06-20 docs/user sweep are
validated and fixed in this branch (P1 cluster-apply semantics + stored-query
exposure; P2 schema-apply paths, /load canonical, bearer-token keys, route
shorthand; plus version/ingest/deferred/stage crumbs). The verification grep
checklist is retained for future audits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7

* docs(api): align GET /queries OpenAPI contract with cluster-only behavior

Greptile P1 on #293: the prose fix in server.md left the OpenAPI surface stale.
The utoipa annotations (handlers.rs, omnigraph-api-types QueriesCatalogOutput)
still described the catalog as "the `mcp.expose == true` subset", and those
drive the checked-in openapi.json — so SDK consumers read a contract the
cluster-only server does not honor (it lists every stored query).

Update the three Rust doc-comment/annotation strings to "every stored query"
and regenerate openapi.json (OMNIGRAPH_UPDATE_OPENAPI=1; drift test green) in
the same change, per AGENTS.md rule 4. Ledger updated: this finding resolved,
plus the cross-repo drift it surfaced (omnigraph-ts generated spec/types and
omnigraph-cookbooks best-practices bearer_token_env) tracked as open follow-ups.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01FQ1Hf4eXLsJmeLUkTYBEw7

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

2026-06-21 00:02:34 +03:00

15 KiB


HTTP Server (`omnigraph-server`)

Axum 0.8 + tokio + utoipa-generated OpenAPI. Cluster-only boot: the server always boots from a cluster (--cluster <dir | s3://…>) and serves N graphs (N ≥ 1) under cluster routes. There is no longer a single-graph flat-route mode, no positional <URI> boot, no --target, and no omnigraph.yaml-graphs:-map boot. All HTTP is nested under /graphs/{graph_id}/...; /healthz and the management /graphs enumeration stay flat.

Boot

Cluster boot (the only boot)

omnigraph-server --cluster <dir | s3://…> --bind 0.0.0.0:8080

omnigraph-server --cluster <dir-or-uri> boots from the cluster catalog's applied revision. The server resolves that revision into per-graph startup configs (id, URI, optional per-graph policy, stored-query registry) plus an optional server-level policy, then opens every configured graph in parallel at startup (bounded concurrency = 4, quarantining graph-specific open failures). Routing is always multi-graph — requests to bare flat protected paths (/read, /snapshot, …) return 404; the served surface is /graphs/{graph_id}/.... See cluster-config.md for what is read and the readiness rules.
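As a minimal client-side sketch of the routing rule above (not the server's implementation), the helper below builds a nested per-graph URL. The function name `graph_url`, the base URL, and the graph id `social` are hypothetical; only the `/graphs/{graph_id}/...` prefix comes from this page.

```python
from urllib.parse import quote


def graph_url(base: str, graph_id: str, route: str) -> str:
    """Return base + /graphs/{graph_id}/{route}, percent-encoding the graph id."""
    return f"{base.rstrip('/')}/graphs/{quote(graph_id, safe='')}/{route.lstrip('/')}"


# Hypothetical base URL and graph id, for illustration only.
print(graph_url("http://localhost:8080", "social", "/query"))
```

A bare flat path such as `/read` is never produced by this helper, which matches the statement that flat protected paths return 404.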

Readiness is fail-fast for cluster-global problems: missing or unreadable state, invalid/unattributable recovery sidecars, unreadable shared catalog payloads, cluster policy errors, or zero healthy graphs. Graph-attributed pending recovery sidecars and graph-specific startup failures quarantine that graph instead; the server logs startup diagnostics and serves the remaining healthy graphs. GET /graphs enumerates ready/served graphs only, so quarantined graphs are absent and their routes return 404.

Operators who want the original all-or-nothing boot contract can pass --require-all-graphs or set OMNIGRAPH_REQUIRE_ALL_GRAPHS=1. In that mode, any graph quarantine, graph-open failure, stored-query startup failure, or embedding-provider resolution failure aborts startup.

A scheme-qualified argument (s3://…) reads the ledger straight from the storage root, with no local config directory. --bind, --unauthenticated, and the bearer-token env vars all apply.

Stored-query validation at startup

If a graph declares a queries: registry (see cli-reference), the server loads and type-checks every stored query against that graph's live schema at startup. Query parse/type failures quarantine that graph; if no graph remains healthy, startup refuses. Two MCP-exposed queries claiming the same tool name are likewise graph-local startup failures. Non-blocking advisories (e.g. an MCP-exposed query with a vector parameter an agent cannot supply) are logged. Validate offline before deploying with omnigraph queries validate. Discover the stored queries as a typed tool catalog with GET /queries, and invoke one over HTTP with POST /queries/{name} (both below).

Endpoint inventory

Per-graph endpoints — all nested under /graphs/{id}/.... {id} is the graph id from the cluster's applied revision:

Method	Path	Auth	Action
GET	`/healthz`	none	—
GET	`/openapi.json`	none	— (strips security if auth disabled; emits the nested cluster paths with `cluster_` operation-id prefix)
GET	`/graphs/{id}/snapshot?branch=`	bearer + `read`	snapshot of branch
POST	`/graphs/{id}/query`	bearer + `read`	inline read query (canonical; clean field names `query`/`name`; mutations → 400)
POST	`/graphs/{id}/read`	bearer + `read`	deprecated alias of `/query` (legacy field names `query_source`/`query_name`, byte-stable response; carries `Deprecation: true` + `Link: <query>; rel="successor-version"`)
POST	`/graphs/{id}/export`	bearer + `export`	NDJSON stream
POST	`/graphs/{id}/mutate`	bearer + `change`	mutation (canonical; `query`/`name`; accepts legacy `query_source`/`query_name` as serde aliases)
POST	`/graphs/{id}/change`	bearer + `change`	deprecated alias of `/mutate` (carries `Deprecation: true` + `Link: <mutate>; rel="successor-version"`)
GET	`/graphs/{id}/queries`	bearer + `read`	list the graph's stored queries as a typed tool catalog
POST	`/graphs/{id}/queries/{name}`	bearer + `invoke_query` (+ `change` for a stored mutation)	invoke a named query from the `queries:` registry; deny == 404
GET	`/graphs/{id}/schema`	bearer + `read`	get current `.pg` source
POST	`/graphs/{id}/schema/apply`	bearer + `schema_apply` (target=`main`)	disabled for cluster-backed serving; returns 409 and points operators at `omnigraph cluster apply` + restart
POST	`/graphs/{id}/load`	bearer + `branch_create` (only when `from` is set and the branch is created) + `change`	bulk load (canonical); branch creation is opt-in via `from` — without it a missing `branch` is a 404, never an implicit fork (32 MB body limit)
POST	`/graphs/{id}/ingest`	bearer + `branch_create` (only when `from` is set and the branch is created) + `change`	deprecated alias of `/load` (carries `Deprecation: true` + `Link: <load>; rel="successor-version"`) (32 MB body limit)
GET	`/graphs/{id}/branches`	bearer + `read`	list branches
POST	`/graphs/{id}/branches`	bearer + `branch_create`	create
DELETE	`/graphs/{id}/branches/{branch}`	bearer + `branch_delete`	delete
POST	`/graphs/{id}/branches/merge`	bearer + `branch_merge`	merge `source → target`
GET	`/graphs/{id}/commits?branch=`	bearer + `read`	list
GET	`/graphs/{id}/commits/{commit_id}`	bearer + `read`	show

Server-level management endpoints:

Method	Path	Auth	Action
GET	`/graphs`	bearer + `graph_list` on `Server::"root"`	list ready/served graphs

The per-graph subsections below name routes in shorthand (GET /queries, POST /query, POST /mutate, POST /queries/{name}); every one is served under the /graphs/{id}/… prefix shown in the table. Only /graphs, /healthz, and /openapi.json (as listed in the tables above) are flat.

Stored-query catalog (`GET /queries`)

List the graph's stored queries as a typed tool catalog — enough for a client (e.g. an MCP server) to register each as a tool without fetching .gq source. Each entry: { name, tool_name, description, instruction, mutation, params }, where each param is { name, kind, item_kind?, vector_dim?, nullable }. kind is one of string | bool | int | bigint | float | date | datetime | blob | vector | list (decomposed so a consumer maps it with a closed switch, never re-parsing GQ type spelling). bigint (I64/U64), date, datetime, and blob are carried as JSON strings — a 64-bit integer loses precision as a JSON number, dates are ISO strings, and a blob is a URI string.

Read-gated (works in default-deny mode). The catalog is graph-wide (branch-independent; read is authorized against main).
Every stored query in the applied registry is listed. Cluster-served graphs have no per-query expose flag today — every query in the cluster queries: registry appears in the catalog. (Per-query exposure may become a Cedar-policy decision in a later release; see cluster-config.)
Not Cedar-filtered per query (yet). A caller with read but not invoke_query can list a query they can't invoke (which would 404). Closing that gap is future per-query authorization; for now the catalog is a discovery surface and invoke_query remains the invocation gate.
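The closed `kind` switch described above can be sketched on the client side as follows. This is an illustration only: `encode_param` is a hypothetical helper, and it assumes that values sent in `params` use the same string encoding the catalog describes for bigint, date, datetime, and blob.

```python
from datetime import date, datetime


def encode_param(kind: str, value):
    """Convert a Python value to a JSON-safe wire value for one catalog `kind`."""
    if value is None:
        return None
    if kind == "bigint":
        # 64-bit integers travel as strings so they keep full precision.
        return str(int(value))
    if kind in ("date", "datetime"):
        # date and datetime objects both provide ISO 8601 formatting.
        return value.isoformat()
    if kind in ("string", "blob"):
        # A blob is carried as a URI string.
        return str(value)
    if kind == "bool":
        return bool(value)
    if kind == "int":
        return int(value)
    if kind == "float":
        return float(value)
    if kind in ("vector", "list"):
        return list(value)
    # The kind set is closed, so anything else is a contract violation.
    raise ValueError(f"unknown param kind: {kind!r}")
```

Raising on an unknown kind keeps the switch closed: a new kind in a later release would surface as an explicit error rather than a silently mis-encoded value.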

Stored-query invocation (`POST /queries/{name}`)

Invoke a curated, server-side stored query by name — the source comes from the graph's queries: registry, so the client never sends .gq. The request body itself is optional; omit it for no-param queries, or send { "params": { … }, "branch": "main", "snapshot": null }, where every field is optional and params keys match the query's declared parameters. The response is the read envelope (ReadOutput) for a stored read or the mutation envelope (ChangeOutput) for a stored mutation — serialized untagged, so the wire shape is identical to /query / /mutate.

Gate: invoke_query (per-graph, graph-scoped) at the boundary. A stored mutation is double-gated — it also passes the engine's change gate, so an actor with invoke_query but not change gets 403.
Deny == unknown, for callers without invoke_query: an invoke_query denial and an unknown query name return the same 404 (identical body), so the catalog can't be probed. A caller that holds invoke_query may still get the inner gate's 403 for an existing query it can't read/change (the double-gate, above), so existence is visible to grant-holders by design.
Requires an explicit policy grant when auth is on. In default-deny mode (bearer tokens but no policy.file), only read is permitted, so every /queries/{name} call returns 404 until an invoke_query rule is configured.
A stored mutation cannot target a snapshot (400); a parameter type error is a structured 400 naming the parameter.
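A stored-query invocation might be assembled like this with the Python standard library. It is a sketch under assumptions: `build_invoke_request`, the base URL, the graph id, and the query name are hypothetical, and the token is assumed to travel in a standard `Authorization: Bearer` header. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def build_invoke_request(base, graph_id, name, token, params=None, branch=None):
    """Build a POST /graphs/{id}/queries/{name} request; the body is optional."""
    body = {}
    if params is not None:
        body["params"] = params
    if branch is not None:
        body["branch"] = branch
    # Omit the body entirely for a no-param query on the default branch.
    data = json.dumps(body).encode("utf-8") if body else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    url = f"{base.rstrip('/')}/graphs/{graph_id}/queries/{name}"
    return Request(url, data=data, headers=headers, method="POST")


req = build_invoke_request(
    "http://localhost:8080", "social", "find_person", "example-token",
    params={"n": "Alice"},
)
print(req.get_method(), req.full_url)
```

When sending such a request, a 404 should be read as "unknown or not permitted" and a 403 as "the query exists but the inner gate refused it", per the rules above.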

Adding and removing graphs

Runtime add/remove via API is not exposed — neither POST /graphs nor DELETE /graphs/{id} is implemented. Operators add or remove graphs by running cluster apply against the cluster (which publishes a new applied revision) and restarting the server so it boots from the new revision. The server treats the cluster source as operator-owned and never writes it.

A future release may introduce a managed registry and re-expose runtime mutation on top of it.

Inline read queries (`POST /query`)

POST /query is the read-only, agent-friendly twin of POST /read. The request body uses clean field names that match the CLI -e flag and the GQ query keyword:

{
  "query":    "query find($n: String) { match { $p: Person { name: $n } } return { $p.name } }",
  "name":     "find",
  "params":   { "n": "Alice" },
  "branch":   "main",
  "snapshot": null
}

Response shape is identical to /read (ReadOutput). If the inline source contains mutations (insert / update / delete), the request is rejected with HTTP 400 and an error pointing the caller at POST /mutate — the read-only contract is enforced at the URL.

POST /mutate is the canonical mutation endpoint. It accepts the same clean field names (query, name); the legacy field names query_source and query_name continue to deserialize as serde aliases so existing clients keep working without changes.
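The mapping between legacy and clean field names can be sketched as a small client-side rename. `to_canonical_body` is a hypothetical helper, not part of any Omnigraph SDK; the only facts taken from this page are the two name pairs.

```python
# Legacy request field names and their clean equivalents, as listed above.
LEGACY_TO_CANONICAL = {"query_source": "query", "query_name": "name"}


def to_canonical_body(body: dict) -> dict:
    """Return a copy of `body` with legacy field names replaced by clean ones."""
    out = {}
    for key, value in body.items():
        new_key = LEGACY_TO_CANONICAL.get(key, key)
        if new_key in out:
            # Refuse ambiguous input rather than silently picking one value.
            raise ValueError(f"both legacy and clean forms of {new_key!r} present")
        out[new_key] = value
    return out


legacy = {"query_source": "query q() { ... }", "query_name": "q", "branch": "main"}
print(to_canonical_body(legacy))
```

Because the server keeps accepting the legacy names on /mutate, a rename like this is optional for existing clients; it is mainly useful when standardizing request bodies across a codebase.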

Deprecated names (`/read`, `/change`)

POST /read and POST /change are kept for back-compat indefinitely — they are byte-stable on the request side and otherwise behave identically to /query / /mutate. They are flagged as deprecated through three independent channels:

OpenAPI: the operations carry deprecated: true in openapi.json, so every OpenAPI codegen (typescript-fetch, openapi-generator, oapi-codegen, …) emits a @deprecated marker on the generated SDK method.
Response headers (RFC 9745): every response carries Deprecation: true.
Response headers (RFC 8288): every response carries a Link header pointing at the canonical successor: Link: <query>; rel="successor-version" for /read, and Link: <mutate>; rel="successor-version" for /change. SDKs and HTTP proxies can pick the successor up automatically.

Migration is purely cosmetic on the client side — swap the URL path, leave the request body and response handling alone.
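A client or proxy could detect the deprecation signal from the two response headers roughly as follows. This is a simplified sketch: `successor_of` is a hypothetical name, and splitting the Link header on commas is a shortcut that would mis-handle link targets containing commas.

```python
import re

# Matches one Link entry of the form <target>; rel="successor-version".
_SUCCESSOR = re.compile(r'\s*<([^>]*)>\s*;\s*rel="?successor-version"?')


def successor_of(headers: dict):
    """Return the successor-version target if the response is marked deprecated."""
    lowered = {k.lower(): v for k, v in headers.items()}
    if lowered.get("deprecation") != "true":
        return None
    for part in lowered.get("link", "").split(","):
        match = _SUCCESSOR.match(part)
        if match:
            return match.group(1)
    return None


print(successor_of({"Deprecation": "true", "Link": '<query>; rel="successor-version"'}))
```

The returned target (`query` or `mutate`) is relative, so a caller would resolve it against the URL of the deprecated route it just called.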

Streaming

Only /export streams (application/x-ndjson, MPSC channel + Body::from_stream). Everything else is buffered JSON.
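Consuming the NDJSON stream incrementally, rather than buffering the whole export, could look like the sketch below. `iter_ndjson` is a hypothetical helper; it accepts any iterable of byte lines and makes no assumption about what each exported record contains.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(lines: Iterable[bytes]) -> Iterator[object]:
    """Yield one parsed JSON value per non-empty line of an NDJSON stream."""
    for raw in lines:
        line = raw.strip()
        if line:  # tolerate blank lines between records
            yield json.loads(line)


# Stand-in for a streamed response body; the records are made up.
sample = [b'{"a": 1}\n', b"\n", b'{"b": 2}\n']
for record in iter_ndjson(sample):
    print(record)
```

A response object returned by `urllib.request.urlopen` can be iterated line by line, so it can be passed to this generator directly.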

Error model

Uniform ErrorOutput { error, code?, merge_conflicts[], manifest_conflict? } with code ∈ unauthorized | forbidden | bad_request | not_found | conflict | too_many_requests | internal. Merge conflicts attach structured MergeConflictOutput { table_key, row_id?, kind, message }.

manifest_conflict is set on concurrent-write rejections (HTTP 409): the caller's pre-write view of one table's manifest version was stale. ManifestConflictOutput { table_key, expected, actual } tells the client which table to refresh and retry. This is the conflict shape produced when concurrent /mutate (or its /change alias) or /load (or its deprecated /ingest alias) calls race on the same (table, branch).
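One way a client might turn an ErrorOutput body into a next action is sketched below. The function name `next_step`, the action labels, and the sample table key are hypothetical; the field names follow the ErrorOutput, MergeConflictOutput, and ManifestConflictOutput shapes described above.

```python
def next_step(status: int, body: dict):
    """Pick a client action from an HTTP status and a parsed ErrorOutput body."""
    manifest = body.get("manifest_conflict")
    if status == 409 and manifest:
        # Stale pre-write view of one table: refresh that table, then retry.
        return ("refresh_and_retry", manifest["table_key"])
    conflicts = body.get("merge_conflicts") or []
    if conflicts:
        # Structured merge conflicts need resolution, not a blind retry.
        return ("resolve_merge", [c["table_key"] for c in conflicts])
    if status == 429:
        return ("back_off", None)
    return ("fail", body.get("error"))


stale = {
    "error": "stale manifest",
    "code": "conflict",
    "merge_conflicts": [],
    "manifest_conflict": {"table_key": "node:Person", "expected": 4, "actual": 5},
}
print(next_step(409, stale))
```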

HTTP status codes used: 200, 400, 401, 403, 404, 409, 429, 500.

Per-actor admission control

Disjoint (table, branch) writes from different actors now run concurrently, guarded only by the engine's per-(table, branch) write queue. To keep one heavy actor from exhausting shared capacity (Lance I/O, manifest churn, network), the server gates mutating handlers through per-process admission limits configured from environment variables:

Env var	Default	Purpose
`OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX`	16	Concurrent in-flight mutations per actor
`OMNIGRAPH_PER_ACTOR_BYTES_MAX`	4 GiB	In-flight estimated bytes per actor

When an actor exceeds its in-flight count or byte budget, the server returns HTTP 429 Too Many Requests with code: too_many_requests and a Retry-After header (seconds). The actor should back off; other actors are unaffected.
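A client-side back-off that honors the Retry-After header could be sketched as follows. `retry_delay` and its one-second fallback are assumptions made for illustration; the page only states that the header carries a number of seconds.

```python
def retry_delay(status: int, headers: dict, default: float = 1.0):
    """Return seconds to wait before retrying a 429, or None for other statuses."""
    if status != 429:
        return None
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        return max(0.0, float(lowered.get("retry-after")))
    except (TypeError, ValueError):
        # Header missing or not a number: fall back to the default delay.
        return default


print(retry_delay(429, {"Retry-After": "3"}))
```

A caller would sleep for the returned number of seconds and then resend the same mutation; since the limits are per actor, other actors do not need to slow down.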

Cedar policy authorization runs before admission accounting so denied requests don't consume admission slots.

Today admission gates every mutating handler: /mutate (and its deprecated alias /change), /load (and its deprecated alias /ingest), /branches/{create,delete,merge}, and /schema/apply. Read-only endpoints (/snapshot, /query, /read, /export, /branches GET, /commits, /schema GET) are not admission-gated.

Body limits

Default: 1 MB
/load (and its deprecated /ingest alias): 32 MB

Auth model (`bearer + SHA-256`)

Tokens are SHA-256 hashed on startup; plaintext is never persisted in memory.
Constant-time comparison.
Three sources, in precedence:
1. OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET — AWS Secrets Manager (build with --features aws)
2. OMNIGRAPH_SERVER_BEARER_TOKENS_FILE or OMNIGRAPH_SERVER_BEARER_TOKENS_JSON — JSON {actor_id: token, …}
3. OMNIGRAPH_SERVER_BEARER_TOKEN — single legacy token, actor default
If no tokens are configured, startup refuses unless --unauthenticated or OMNIGRAPH_UNAUTHENTICATED=1 explicitly opts into open local-dev mode. A policy file without tokens is also rejected at startup. In open mode /openapi.json strips the security scheme.

See deployment.md for token-source operational details.
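The hash-then-compare idea behind this model can be illustrated in Python. This is not the server's code (the server is written in Rust); `build_token_table` and `authenticate` are hypothetical names, and the actor and token values are made up.

```python
import hashlib
import hmac


def hash_token(token: str) -> bytes:
    """Return the SHA-256 digest of a bearer token."""
    return hashlib.sha256(token.encode("utf-8")).digest()


def build_token_table(tokens: dict) -> dict:
    """Hash every configured token once, keeping only digests per actor."""
    return {actor: hash_token(tok) for actor, tok in tokens.items()}


def authenticate(table: dict, presented: str):
    """Return the matching actor id, or None, using constant-time comparison."""
    candidate = hash_token(presented)
    matched = None
    for actor, digest in table.items():
        # compare_digest avoids leaking how many leading bytes matched.
        if hmac.compare_digest(digest, candidate):
            matched = actor
    return matched


table = build_token_table({"alice": "example-token-1", "bob": "example-token-2"})
print(authenticate(table, "example-token-2"))
```

The loop deliberately checks every entry instead of returning on the first match, so the time taken does not depend on which actor's token was presented.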

Tracing & observability

tower_http::TraceLayer::new_for_http()
Policy decisions logged at INFO level with actor, action, branch, decision, matched rule
Startup logs: token source name, graph URI, bind address
Graceful SIGINT shutdown

Not implemented (by design or "TBD")

CORS — not configured; add tower_http::cors if needed.
Rate limiting — per-actor admission control gates /mutate (alias /change), /load (alias /ingest), /branches/{create,delete,merge}, /schema/apply (see "Per-actor admission control" above). No global rate limiter is configured; add one (for example tower::limit, since tower_http::limit only caps request-body size) if a graph-wide cap is needed.
Pagination — none (commits/branches return everything; export streams).
Runtime graph add/remove — run cluster apply and restart.
