omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-09 01:35:18 +02:00

Author	SHA1	Message	Date
Ragnor Comerford	54bbe902e5	Merge remote-tracking branch 'origin/main' into ragnorc/scrutinize-rfc-002	2026-06-02 18:11:29 +02:00
Ragnor Comerford	d54bccb940	fix(optimize): skip blob-bearing tables to avoid Lance compaction crash (#138) * test(optimize): pin Lance blob-column compaction failure as a surface guard Lance compact_files mis-decodes blob-v2 columns under its forced BlobHandling::AllBinary read ("more fields in the schema than provided column indices"), failing even a pristine uniform-V2_2 multi-fragment blob table; reads use descriptor handling and are unaffected. Guard 10 reproduces this and is self-retiring: it turns red on the Lance bump that fixes the bug, forcing LANCE_SUPPORTS_BLOB_COMPACTION to flip. * fix(optimize): skip blob-bearing tables instead of crashing compaction omnigraph optimize aborted the whole sweep when any node/edge table had a Blob property: Lance compact_files cannot decode blob-v2 columns under AllBinary (the column-index error pinned by the surface guard). Skip blob-bearing tables behind a LANCE_SUPPORTS_BLOB_COMPACTION gate and report them via TableOptimizeStats.skipped / SkipReason (surfaced in the CLI and a tracing::warn) instead of erroring, which also isolates the failure so the other tables still compact.
Reads/writes are unaffected; only fragment/space reclamation on blob tables is deferred until the upstream Lance fix. Adds a maintenance.rs regression test (validated red with the column-index symptom before the fix, green after), a concise v0.6.1 release note, and updates docs (maintenance, cli-reference, AGENTS capability matrix, invariants Known Gaps, lance.md audit, constants). * refactor(optimize): make TableOptimizeStats and SkipReason non_exhaustive Both are returned result types, never built by callers, so #[non_exhaustive] makes this the last field/variant addition that can break downstream literal construction and keeps future ones non-breaking (review feedback on the public-field addition). The v0.6.1 Compatibility Notes call out the source-level change. Also drops the now-stale "RED today / GREEN after the fix lands" narration in the optimize_skips_blob_table_and_reports_skip test (historical regression context now that the fix is in this branch), and folds in the expanded v0.6.1 release note. * chore(release): bump workspace to v0.6.1 Coherent version bump to accompany the v0.6.1 release note: all five crate manifests + path-dependency constraints, Cargo.lock, the AGENTS.md surveyed-version line, and openapi.json info.version move 0.6.0 -> 0.6.1. Matches the established release pattern (#118 landed the v0.6.0 note + bump together) and resolves the Codex/Devin review flag that a v0.6.1 note without a bump leaves CARGO_PKG_VERSION reporting 0.6.0 and mixed package versions.	2026-06-02 17:12:00 +02:00
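The skip-and-report behavior described in the commit above can be sketched roughly as follows. This is an illustration only, not the project's code: the names `LANCE_SUPPORTS_BLOB_COMPACTION`, `TableOptimizeStats`, `skipped`, and `SkipReason` come from the commit message, while the field layout, the `BlobColumnsUnsupported` variant, the `TableInfo` type, and the `optimize_all` function are hypothetical stand-ins.

```rust
/// Gate named in the commit message; assumed here to be a plain boolean constant.
const LANCE_SUPPORTS_BLOB_COMPACTION: bool = false;

/// Why a table was skipped. The variant name is an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
#[non_exhaustive]
pub enum SkipReason {
    BlobColumnsUnsupported,
}

/// Per-table result. Only `skipped` is named in the commit; other fields are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
#[non_exhaustive]
pub struct TableOptimizeStats {
    pub table: String,
    pub compacted: bool,
    pub skipped: Option<SkipReason>,
}

/// Hypothetical stand-in for a node/edge table descriptor.
pub struct TableInfo {
    pub name: String,
    pub has_blob_column: bool,
}

/// Visit every table; blob-bearing tables are reported as skipped rather than
/// aborting the sweep, so the remaining tables are still processed.
pub fn optimize_all(tables: &[TableInfo]) -> Vec<TableOptimizeStats> {
    tables
        .iter()
        .map(|t| {
            if t.has_blob_column && !LANCE_SUPPORTS_BLOB_COMPACTION {
                TableOptimizeStats {
                    table: t.name.clone(),
                    compacted: false,
                    skipped: Some(SkipReason::BlobColumnsUnsupported),
                }
            } else {
                // A real implementation would run compaction here.
                TableOptimizeStats {
                    table: t.name.clone(),
                    compacted: true,
                    skipped: None,
                }
            }
        })
        .collect()
}
```

Returning a per-table result instead of a single `Result` for the whole sweep is what lets one unsupported table be isolated without failing the others.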
Ragnor Comerford	4de7865847	docs(rfc-002): reserve cloud multi-tenancy shapes (forward-compat) Folds in the validated parts of the cloud-deployment workstream briefing. Code claims verified to the line: GraphKey { tenant_id: Option<TenantId>, graph_id } and ResolvedActor.tenant_id already ship (MR-668, identity.rs:116,189), and tenant is server-resolved (MR-731, identity.rs:180) -- so these are cheap reservations, not new machinery. Added (reserve only, parse-but-reject; tenant never in locator/path/body): - Non-Goals: cloud-mode multi-tenancy out of scope; shapes reserved so it is additive. - 6: serve.auth.oauth.issuers as a LIST + tenant_claim (the one-way door); field schema deferred to MR-956 RFC 0001 to avoid a second OIDC config. Server-side OIDC reframed as Federated-Auth-owned (may precede V6), not 'my V6'. - 6: serve.policy is a tagged source at the policy level (file today; directory/manifest reserved) -- NOT a source: wrapper (pushback on the briefing's prescription; the wrapper is the only actually-breaking part and is inconsistent with storage:/auth:). - 7: credential identity unit becomes (server, organization) for multi-org on one cloud endpoint -- endpoint-binding alone can't disambiguate; reserve omnigraph:<server>[/<org>] keying. - 9: unified registry preserves GraphKey { tenant_id, graph_id }; don't flatten to graph_id-only; GET /graphs tenant-scoped in Cloud. - Open questions: OIDC ownership/timeline reconciliation. Held the speculation line: organization selector, omnigraph:// URI sugar, and --organization flag are additive-later, so they stay notes (Non-Goals), not new fields/flags shipped now. Nit corrected: AuthSource::Oidc / graph:* scopes are reserved via #[non_exhaustive], not present draft variants.	2026-06-02 16:57:15 +02:00
Ragnor Comerford	ad968d95c4	docs(rfc-002): add phased implementation plan + divergence analysis Synthesized from a six-agent code+upstream validation pass. Added: - V0 'Foundations' phase (crate + api-types extraction, version gate, layered-config fixture harness + keychain seam, test relocation). - Phase detail (sizing/gates/exit, file touchpoints) for V0-V6, a critical-path + parallelization graph, and a validation-findings list. - '## Divergence & single source of truth': states what the design prevents structurally (config<->CLI via the typed locator; config<->HTTP capability surface) vs only reduces (CLI<->HTTP route paths), and records the shared-route-table/generated-client move as the residual structural fix. Inline corrections from the audit: - storage.profile is env-only in Lance + omnigraph -> scoped out of v1; region/endpoint feasible now (Lance per-dataset storage_options). - single-mode GET /graphs is 405->403-by-default (not ->200) without a serve.policy; recorded the wire vs Cedar graph_id reconciliation decision. - omnigraph-config extraction alone does not shed Axum (CLI imports api::*) -> note the omnigraph-api-types companion crate.	2026-06-02 15:55:40 +02:00
Ragnor Comerford	db3683eb26	docs(rfc-002): scope auth ban + fix migration prefix (3rd review) Address the third review (4 points): 1. serve.auth conflict: the lower-trust auth ban applied too broadly. Scope it to servers.<name>.auth (client credential sourcing); serve.auth (secret-free server-side accept config) stays valid in a committed deployment manifest. Tightened the same wording in section 6 and 7 for consistency. 2. legacy URI split preserves the path prefix: strip only the trailing /graphs/{gid}; endpoint keeps host + any reverse-proxy path. 3. define the explicit-credential surface: a project-only server is unauthenticated/local-dev by default; authenticated use needs promotion to a trusted layer or an operator-supplied --token-from flag (future, listed in 10). 4. OAuth no longer leaks into the V3 CLI surface: login OAuth device flow marked V6 in the CLI bullet and the N11 breadboard row.	2026-06-02 14:20:32 +02:00
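Point 2 of the commit above (strip only the trailing /graphs/{gid}, keep host plus any reverse-proxy path) can be illustrated with a small string-level sketch. The function name `split_legacy_remote_uri` and its return shape are hypothetical; the RFC may specify additional validation that is not modelled here.

```rust
/// Split a legacy remote URI of the form `<endpoint>/graphs/<gid>` into
/// `(endpoint, graph_id)`. Only the trailing `/graphs/<gid>` segment is
/// stripped, so any reverse-proxy path prefix stays part of the endpoint.
/// Returns `None` when the URI does not end in `/graphs/<gid>`.
pub fn split_legacy_remote_uri(uri: &str) -> Option<(String, String)> {
    // Tolerate a trailing slash after the graph id.
    let trimmed = uri.trim_end_matches('/');
    // Split at the LAST occurrence so earlier path segments are preserved.
    let (endpoint, graph_id) = trimmed.rsplit_once("/graphs/")?;
    // The graph id must be a single, non-empty path segment.
    if endpoint.is_empty() || graph_id.is_empty() || graph_id.contains('/') {
        return None;
    }
    Some((endpoint.to_string(), graph_id.to_string()))
}
```

With the endpoint and graph id held separately, a client that always appends `/graphs/{id}/` to the endpoint cannot double the prefix.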
Ragnor Comerford	eaeacb2385	docs(rfc-002): tighten credential trust model + accuracy (2nd review) Address the second implementation-readiness review (7 points): 1. env-token endpoint-binding was not enforceable as written -> replace with a trusted-origin credential model: ambient creds (env/keychain/profile) apply only to servers whose identity came from a trusted layer; login-written creds additionally bind to their issued-for endpoint. 2. project-layer auth: a lower-trust layer may define endpoint-only servers but may not carry an auth: block at all (command = repo-authored RCE) - now a validation rule, not just prose. 3. legacy remote-URI migration: split https://host/graphs/{gid} into endpoint+graph_id so V2's always-/graphs/{id}/ client can't double the prefix. 4. summary realigned with body: enumeration is graph_list-gated, oauth reserved (not first-class), secrets out-of-repo (not 'structurally unreachable'). 5. disambiguate higher-precedence (project wins merges) vs higher-trust (global owns identity) - they run opposite for the project layer. 6. drop top-level 'queries' from the named-resource merge map (per-graph only). 7. mark OMNIGRAPH_BIND proposed, not current; binary honors --bind/server.bind only (lib.rs:899).	2026-06-02 13:23:58 +02:00
Ragnor Comerford	3a53fb3c94	docs(rfc-002): rewrite config & CLI architecture + readiness review Rewrite RFC-002 around a typed GraphLocator (storage: XOR server:+graph_id:), servers:+graphs: with three-tier addressing, serve: vs servers: de-collision, global-first layered config, a method x source auth model, and an omnigraph-config crate extraction. Verified against code, not ticket status. Incorporates the implementation-readiness review (10 points): 1. current flag is --target, not --graph; --graph canonical + --target alias 2. credential-redirection fix: endpoint-bound creds + layer identity rule + AX threat model 3. no-arg resolution: defaults.graph for bare commands; defaults.server only namespaces unknown ids 4. route unification spec: canonical single-mode graph_id; GET /graphs lists served set 5. serve.graphs replaces server.graph (preserves serve-a-subset) 6. restore query.roots (ad-hoc --query path resolution) 7. soften 'structurally unreachable'; move mTLS key off the repo tree 8. legacy bearer_token_env -> synthesized-server migration 9. enumeration caveat: known-id addressing vs graph_list-gated discovery 10. mark oauth/mtls reserved; full impl deferred to V6 Also realigns the docs/dev/index.md entry.	2026-06-02 13:12:06 +02:00
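The typed GraphLocator (storage: XOR server: + graph_id:) named in the commit above could be modelled as an enum, so that an entry mixing both forms cannot be represented after parsing. The sketch below is an assumption-laden illustration: `GraphLocator`, `storage`, `server`, and `graph_id` are taken from the commit message, while `RawGraphEntry`, `resolve_locator`, the variant names, and the error strings are hypothetical. It also ignores the `defaults.server` namespacing the RFC describes.

```rust
/// Typed locator: exactly one of the two addressing forms.
#[derive(Debug, PartialEq, Eq)]
pub enum GraphLocator {
    /// Embedded graph opened directly from storage.
    Embedded { storage: String },
    /// Graph reached through a named server.
    Remote { server: String, graph_id: String },
}

/// Hypothetical untyped entry as it might be read from the config file.
#[derive(Default)]
pub struct RawGraphEntry {
    pub storage: Option<String>,
    pub server: Option<String>,
    pub graph_id: Option<String>,
}

/// Enforce the XOR rule once, at load time.
pub fn resolve_locator(raw: RawGraphEntry) -> Result<GraphLocator, String> {
    match (raw.storage, raw.server, raw.graph_id) {
        (Some(storage), None, None) => Ok(GraphLocator::Embedded { storage }),
        (None, Some(server), Some(graph_id)) => {
            Ok(GraphLocator::Remote { server, graph_id })
        }
        (Some(_), Some(_), _) | (Some(_), _, Some(_)) => {
            Err("storage: cannot be combined with server:/graph_id:".to_string())
        }
        (None, Some(_), None) => Err("server: requires graph_id:".to_string()),
        (None, None, Some(_)) => Err("graph_id: requires server:".to_string()),
        (None, None, None) => {
            Err("expected storage: or server: + graph_id:".to_string())
        }
    }
}
```

Code downstream of `resolve_locator` then matches on the enum and never re-checks which fields were set.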
Ragnor Comerford	3c2b1b8051	Stored-query registry foundation + config/CLI RFC-002 (#128 ) * MR-969: add stored-query registry config surface Introduce the `queries:` block in omnigraph.yaml — an inline `name -> entry` map of stored queries, per-graph (`graphs.<id>.queries`) and top-level for single-graph mode, mirroring how `policy` is wired in both modes. Each entry points at a `.gq` file and carries optional MCP exposure settings (`expose`, `tool_name`), defaulting to not-exposed. Additive: absent `queries:` leaves current behavior unchanged. - QueryEntry { file, mcp: McpSettings { expose, tool_name } } - `queries` field on TargetConfig + OmnigraphConfig (serde default) - query_entries() / target_query_entries() accessors - resolve_query_file() — base_dir-relative `.gq` path resolution - round-trip + absent-block tests Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Add stored-query registry loader and GraphHandle wiring Add a `queries` module: QueryRegistry loads each declared `.gq` entry, parses it, and selects the query whose symbol matches the manifest key, asserting the two agree (key == `query <name>` symbol). Identity is the query name; a key/symbol mismatch is a load-time error. Errors are collected, not fail-fast, so a bad registry surfaces every broken entry at once. Schema type-checking is deliberately left to a separate pass so the loader stays callable without an open engine. Thread an `Option<Arc<QueryRegistry>>` through GraphHandle alongside the per-graph policy; the URI-canonicalizing clone propagates it. Production openers default to None for now — the boot path loads and attaches the registry in a later change. 
- QueryRegistry::{from_specs, load, lookup, iter}; StoredQuery::is_mutation - GraphHandle.queries field, propagated on canonical clone - registry unit tests: identity match/mismatch, multi-query selection, per-entry parse errors, error collection, mutation classification Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: add RFC-002 config & CLI architecture Layered config (user-global ~/.config/omnigraph/ + per-project), a unifying `target` abstraction resolving to (locus, graph, sub-state, credential) with embedded-URI XOR remote-server loci, multi-server × multi-graph client targeting, credentials by-reference, and the file-naming decision: project and server config are one artifact (`omnigraph.yaml`); the only differently-named file is the user-global `config.yaml`, split by scope not role. Includes the 12-factor bind portability rule (prefer --bind/OMNIGRAPH_BIND over a committed server.bind) and the defined-locally / invoked-remotely model for stored queries. Derived from first principles working backwards from what the engine enables; validated against kube/Helix/git/compose. Linked from docs/dev/index.md. Proposed; phased rollout for the MR-973/974/981 family. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Add check() to validate stored queries against the live schema A pure check(registry, catalog) that type-checks every stored query via the same typecheck_query_decl the engine runs for inline queries — no parallel implementation. Failures are collected, not fail-fast, so an operator sees every broken query (e.g. a type/property a migration renamed or removed) in one pass. Breakages are fatal (the boot path will refuse to start); warnings are advisory. Pure over (registry, catalog) so it is callable both at boot (engine catalog) and offline from the CLI without an open engine. 
Advisory lint: an mcp.expose:true query that declares a Vector(N) parameter warns — an LLM cannot supply a raw embedding vector; such a query should take a String parameter and embed server-side. Warns rather than rejects, since service-to-service callers may pass vectors. - CheckReport { breakages, warnings }; has_breakages / is_clean - tests: valid query, unknown type, unknown property, collect-not-fail-fast, vector-param-exposed warns, unexposed silent Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Drop internal plan-label refs from stored-query config comments Doc comments referenced sequencing labels ("C2") that mean nothing to a reader; reword to describe the behavior directly. Comment-only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: reconcile aliases with the role model in RFC-002 Place the existing client-only `aliases:` block in the client/server role split: aliases are client-role (CLI, embedded, ungated) and may live in both user-global and project config; `queries:` is server-role (deployment manifest only). They overlap as "name -> .gq"; `queries:` is the superset, and the end-state subsumes aliases (definition -> queries, target/branch/format -> client invocation context, positional args -> CLI sugar). v1 keeps aliases unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: make RFC-002 config global-first, project-optional The global user config is the primary, self-sufficient default; the CLI works from any directory with no project file (the kubectl/aws/gh posture), a deliberate flip from today's project-anchored behavior. The project omnigraph.yaml becomes an optional repo-scoped override and the deployment manifest. Uniform schema, both layers optional; global can hold any section including a personal server's graphs/queries. Additive: project still overrides global; the flip adds a fallback layer below the project file rather than removing it. 
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: justify XDG ~/.config/omnigraph over legacy ~/.omnigraph in RFC-002 Make the rationale explicit: XDG-first because OmniGraph is a client that will cache remote catalogs and keep session state alongside secrets, and XDG separates config / cache / state into distinct dirs (clear cache without touching creds; backups skip cache) whereas a single ~/.omnigraph/ mixes them. Honor ~/.omnigraph/ as a fallback for the peer-group (aws/kube/docker/helix) expectation. Add XDG_CACHE_HOME / XDG_STATE_HOME to the override precedence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: build RFC-002 credentials on the existing env-file mechanism OmniGraph already has credentials-by-reference: bearer_token_env names the env var, and auth.env_file is a git-ignored dotenv the CLI auto-loads (real env vars win), resolved via resolve_remote_bearer_token. The RFC's proposed credentials.yaml + token_env were redundant parallel inventions. Reconcile: reuse bearer_token_env (extend to servers.<name>) and auth.env_file (add a global ~/.config/omnigraph/.env layered under the project .env.omni); OS keychain is an additive future resolver. No new credentials.yaml. Updated summary, non-goals, background, file-naming, credentials, example, login, migration, rollout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: use single ~/.omnigraph dir (Helix-style), not XDG, in RFC-002 Reverse the earlier XDG-first call. The prior argument rested on a false dichotomy (single-dir => mixed config/cache/state); in fact the peer tools (aws, kube, helix) achieve separation via SUBDIRECTORIES inside one ~/.tool/ dir (~/.aws/sso/cache/, ~/.kube/cache/), getting cache hygiene AND one discoverable place. So everything goes under ~/.omnigraph/: config.yaml, credentials (dotenv, 0600), cache/, state/. Lower cognitive load, matches what DB/cloud-CLI users expect, matches Helix. 
OMNIGRAPH_HOME overrides; $XDG_CONFIG_HOME optionally honored but ~/.omnigraph/ is canonical. Updated all paths, the rationale paragraph, the file-naming table (added a cache/state row), and env precedence. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: reconcile RFC-002 with shipped/planned CLI tickets Align with reality found in existing tickets: - Noun is graph/graphs, not target/targets (MR-603 done renamed the config key targets->graphs, flag --graph). Use graphs:/--graph; an entry is embedded (uri) XOR remote (server + remote graph name). - ~/.omnigraph/ confirmed by MR-581 (og template pull, done) which already quick-starts templates there. - Templates already exist (MR-581/MR-531) — not invented here. - The init family is already specced (init, quickstart MR-973, serve MR-970, prune MR-972, mcp install MR-974, agent-mode MR-981); this RFC only adds the user route (~/.omnigraph/config.yaml + login). - aliases: -> operations: planned (MR-839). - bearer_token_env gap tracked in MR-971. - query lint/check already exist (MR-639) — registry validator must not collide with the singular `query check`. Add a Reconciliation section; fix the canonical example to graphs:/--graph. Also: merge semantics refined (deep-merge settings, replace named entries, replace lists, config view --resolved --show-origin). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: correct stale-ticket claims and fold init/bootstrap design into RFC-002 Verify against code, not ticket statuses (MR-581 is marked done but is stale/unbuilt): no ~/.omnigraph usage, no template/serve/quickstart/ prune/login commands exist; config still uses aliases: (no operations:). So ~/.omnigraph/ stands on peer-convention merits alone, and templates are a design question, not a foothold. 
Add §7.5: the three-tier init model (user route = login + ~/.omnigraph/config.yaml; thin project init; fat quickstart + templates) with first-principles positions (split init/login, in-place refuse-if-exists, interactive vs --auto/agent-mode, --template flag, secrets-on-scaffold gitignore rule). This RFC owns only the user route; the rest are sibling tickets (MR-973/970/972/974/981). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: breadboard + slice Shape A in RFC-002 Add the implementation breadboard (places P1-P5, affordances N1-N14 with NEW markers, mermaid) and five vertical slices for the selected config/ CLI/init shape: V1 global layer + merge engine + config view; V2 remote graphs + HTTP-client path + credential resolution; V3 omnigraph login; V4 init-hardening + quickstart + templates (rides MR-970); V5 agent-mode (MR-981). Rollout reordered to the slice sequence; spikes X1-X4 gate their owning slice. V1-V2 close the substantive client->server gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Add InvokeQuery Cedar action (coarse, graph-scoped) A per-graph, branch-scoped action that gates invoking a server-side stored query by name. Coarse for now: an `invoke_query` allow rule permits any stored query on the graph; a future, additive refinement adds an optional per-query-name scope without changing rules written against the coarse action. Enforcement is at the HTTP boundary; the engine `_as` writers still enforce read/change per the query body, so a stored mutation is double-gated (invoke_query to reach the tool, change for the write). No call site yet — the invocation handler wires it in a later change (same pattern as Admin/GraphList added ahead of consumers). 
- variant + as_str/resource_kind(Graph)/FromStr/uses_branch_scope - Cedar schema: invoke_query appliesTo Graph - tests: per-graph allow/deny, branch-scope accepted Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Load and type-check stored queries at server boot, refusing breakage At startup the server now loads each graph's stored-query registry, type-checks every query against that graph's live schema, and refuses to boot if any query references a type/property the schema doesn't have (same posture as bad policy YAML) — so schema drift surfaces at the deploy boundary, not silently at invocation. Non-blocking warnings are logged. The validated registry is attached to the GraphHandle (the two production sites previously held `queries: None`). Loading (parse + key==symbol identity) happens at settings-build time where the config is in scope; the schema type-check happens after each engine opens (single mode in `open_single_with_queries`, multi mode in `open_single_graph`). `open_with_bearer_tokens_and_policy` delegates with an empty registry so its 18 test callers are unchanged; the public `new_` constructors are unchanged (only the private build path threads the registry). - ServerConfigMode::Single / GraphStartupConfig carry the loaded registry - boot tests: valid registry boots; type-broken query refuses boot + names it Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Add `omnigraph queries validate` and `queries list` CLI `queries validate` type-checks the stored-query registry against the live schema offline — it opens the selected graph, runs the same check() the server runs at boot, prints breakages/warnings (human or --json), and exits non-zero on any breakage — so an operator can catch a query broken by a schema change without restarting the server. `queries list` prints each registered query's name, MCP exposure, and typed params. 
Named `validate` (not `check`) to avoid overlap with the existing `omnigraph lint` — `query check`/`query lint` are already deprecated argv-shims to `lint`. Registry entries resolve like the server: a named graph uses its per-graph `queries:`; otherwise the top-level one. - Queries subcommand group; reuses QueryRegistry::load + check from omnigraph-server; local-only (needs the schema), mirrors lint - tests: clean registry exits 0, broken query exits non-zero + names it, list shows the query and its typed params Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * Route registry selection through one shared query_entries_for The "which queries: block applies for graph X" rule existed twice — the server boot path and the CLI's registry_entries — and had already drifted: the CLI carried an unreachable unwrap_or_else fallback the server lacked. Add OmnigraphConfig::query_entries_for(graph: Option<&str>) as the single definition (named graph -> its per-graph block; otherwise top-level) and route all three sites through it: server single mode, server multi-graph loop, and the CLI. The CLI's dead fallback arm is deleted; CLI and server now resolve identically by construction. No behavior change. Extends the config round-trip test to pin the selector, including the unknown-name -> top-level fallback the deleted CLI arm covered. * Funnel registry validation through one validate_and_attach gate The check -> refuse-on-breakage -> log-warnings -> empty->None block was copy-pasted across both open paths (single mode and the multi-graph per-graph open), differing only by the graph label. A third opener could attach a registry that was never schema-checked. Extract validate_and_attach(queries, catalog, label) -> Option<Arc<..>> as the single gate both paths call, so attaching an unchecked registry is no longer expressible. The catalog handle is an owned Arc, so calling it before the multi-mode policy match (which rebinds db) is borrow-clean. No behavior change. 
Adds a direct unit test of the helper (empty / clean / breakage incl. the graph label in the message) — covering the multi-graph path's logic, which previously had no boot-refusal coverage. * Resolve param types structurally in the MCP vector lint The exposed-query advisory detected vector params with type_name.starts_with("Vector(") — a second copy of the compiler's own ScalarType::from_str_name vector parsing that could drift from it. Key the lint off PropType::from_param_type_name + ScalarType::Vector(_) instead, the one canonical resolver the type system already uses. Any future param-suppliability lint now reads the structured type rather than scanning the surface string. Behavior-preserving: the grammar forbids list-of-vector params (list_type = "[" base_type "]", and base_type excludes Vector), so the only input where the structured and string checks could differ is unparseable. Adds a guard test that an exposed String param does not false-trigger the warning. * Refuse duplicate MCP tool names across exposed stored queries The effective MCP tool name (explicit tool_name, else the query name) is a second identity namespace beside the registry key, but nothing enforced it unique — two exposed queries could claim one catalog key, and each consumer re-derived the name ad hoc. Add StoredQuery::effective_tool_name() as the one definition, and a load-time uniqueness pass in from_specs over exposed queries: a collision is a collected LoadError naming the loser and the winner. Scoped to exposed queries (unexposed have no MCP tool); deterministic over the BTreeMap so the first-declared wins and the error order is stable. New (rare) refusal: a config with colliding exposed tool names now fails `omnigraph queries validate` offline and refuses server boot, the same posture as a malformed registry. Release-note-worthy. 
Test-first: duplicate_exposed_tool_name_is_a_load_error (red before the pass, green after) + a CLI offline test; the unexposed sibling pins the exposed-only scope; effective_tool_name asserts folded into the load test. * docs: document the queries registry, CLI, and invoke_query action The stored-query surface shipped without user docs. Add it, per the same-PR maintenance contract: - policy.md: invoke_query as per-graph action #10 (branch-scoped), with the double-gating note; renumber graph_list; add it to the branch_scope list. - cli-reference.md: the `queries validate \| list` command, and the `queries:` config block (per-graph + top-level) with mcp.expose/tool_name and the tool-name uniqueness rule. - server.md: boot-time stored-query type-check (refuse on breakage), noting invocation over HTTP/MCP is not yet exposed. * Add POST /queries/{name} stored-query invocation handler Invoke a curated server-side stored query by name: source + name come from the per-graph queries: registry, the client sends only runtime inputs (params, branch, snapshot). Gated by the invoke_query Cedar action at the boundary; the handler delegates to the existing run_query/run_mutate, whose inner Read/Change enforce still runs — so a stored mutation is double-gated (invoke_query to reach the tool, change for the write). - InvokeStoredQueryRequest + an untagged InvokeStoredQueryResponse { Read(ReadOutput), Change(ChangeOutput) } → one Json<_> return type and a oneOf 200 schema (a correct contract, not a wrong-but-simple one). - Route lives in per_graph_protected → single-mode /queries/{name} and multi-mode /graphs/{id}/queries/{name} for free. - Deny == unknown: an invoke_query denial and a missing query both return the same 404, so the catalog can't be probed by an unauthorized caller. - OpenAPI regenerated; tests cover read, mutation double-gate (403 vs 200), bad-param 400, and the identical-404 deny path. 
Completes the MR-969 V1 invocation slice (registry + /queries/{name} + invoke_query). * docs: stored-query invocation endpoint; flip the not-yet-exposed caveat Now that POST /queries/{name} ships (C7), document it: add the endpoint to server.md's inventory + an invocation section (body, untagged read/mutate envelope, invoke_query gate, double-gated mutations, deny == 404), and flip the startup note that said invocation was not yet exposed. In policy.md, replace "no invocation call site yet" on the invoke_query action with a pointer to the endpoint. * Scope the stored-query 404-hiding claim to non-invoke_query callers Review found the deny==404 catalog-hiding was overstated as a contract: it holds only at the outer invoke_query gate. A caller that HOLDS invoke_query but lacks read/change gets the inner gate's 403 for an existing query vs 404 for an unknown one — so existence is visible to grant-holders by design (the intended double-gate). The handler docstring, OpenAPI 404 description, and server.md all claimed the 404 was airtight against any denied actor. Correct the wording in all three (no behavior change) and add the missing symmetric test (invoke_query but no read -> 403 for an existing query, 404 for unknown) so the actual contract is pinned. Also document that in default-deny mode (tokens, no policy) every invocation 404s until an invoke_query rule is configured. Nits: the from_specs collision comment said "first declared wins" but it is lexicographically-first by name (BTreeMap); the effective_tool_name docstring overclaimed the CLI display routes through it (it resolves the rule on its own output DTO). * Default mcp.expose to true (the manifest entry is the opt-in) expose controls MCP-catalog membership only — it is not an authorization gate (invocation is gated by invoke_query regardless). So requiring a per-query mcp.expose: true was friction with no safety benefit: a non-exposed query is still HTTP-invocable by name. 
Flip the default so declaring a query in the manifest exposes it to the agent tool catalog by default; expose: false is the escape hatch for service-only queries. Both the absent-mcp path (Default impl) and the present-but-no-expose path (serde default fn) now yield true. Doc comments + cli-reference updated; the config round-trip test asserts the new default. * Add GET /queries stored-query catalog endpoint List a graph's mcp.expose stored queries as a typed tool catalog so a client (the MCP server) can register them as tools without fetching .gq source. Each entry carries name, MCP tool_name, description/instruction, a read/mutate flag, and decomposed typed params (kind enum: string\|bool\|int\| bigint\|float\|date\|datetime\|blob\|vector\|list, plus item_kind for lists and vector_dim) — so the consumer builds an input schema with a closed match and never re-parses omnigraph type spelling. I64/U64 are bigint (string on the wire): a JSON number loses precision past 2^53 and the engine already accepts decimal strings. Read-gated (works in default-deny; the catalog is graph-wide, authorized against main). NOT Cedar-filtered per query yet — a reader can list a query whose invoke_query they lack (documented gap until per-query authz lands); invocation stays invoke_query-gated + deny==404. - api: QueriesCatalogOutput / QueryCatalogEntry / ParamDescriptor / ParamKind + query_catalog_entry (reuses PropType::from_param_type_name; scalar_kind is exhaustive, so a new ScalarType is a compile error here until catalogued). - GET /queries route in per_graph_protected (→ /graphs/{id}/queries in multi mode); OpenAPI regenerated; path allowlists updated. - Tests: projection unit (every kind, list, vector, nullable, mutation, empty) + handler (exposed-only filter, read-gate probe-oracle, empty registry). 
* docs: GET /queries stored-query catalog endpoint Document the catalog: the endpoint table row (GET /queries, read-gated), a catalog section (typed-param kind enum, bigint/date/datetime/blob-as-string, graph-wide/branch-independent, mcp.expose default true, the read-gated probe-oracle gap), and flip the startup note now that the catalog ships. * Collect file-I/O and parse errors in QueryRegistry::load in one pass load() early-returned on any unreadable .gq file, masking parse / identity / tool-name-collision errors in the OTHER (readable) files — so an operator fixed the missing file, restarted, and only then saw the next broken query. Now it collects I/O errors but still runs from_specs on the readable specs and returns the union, so every broken entry surfaces at once (matching the collected-errors contract the rest of the registry already follows). Safe: from_specs' tool-name collision check runs over loaded queries only, so dropping an I/O-failed entry can only under-report a collision, never invent one. I/O errors are ordered first (BTreeMap key order), then spec errors. Adds a load-level test (tempdir: a valid, a missing, and a parse-broken .gq) asserting all three surface in one Err — confirmed red before the fix. * Make invoke_query graph-scoped (one branch authority) invoke_query gates reaching the curated stored-query surface — a graph-level capability. Per-branch/snapshot access is already enforced by the inner read/change gate in run_query/run_mutate (authorized against the resolved branch), so branch-scoping the outer gate was redundant AND wrong for snapshot reads (it defaulted to main). Drop the branch dimension: remove InvokeQuery from uses_branch_scope (it joins admin as graph-scoped) and authorize the boundary gate with branch: None. Lossless: an actor confined to branch X by their read/change rules can still only invoke a stored query that touches X. 
A rule that sets branch_scope on invoke_query is now rejected by validate() — write invoke_query in its own rule. Ripple (atomic): restructure the server invoke fixture so invoke_query sits in its own branch_scope-free rule; invert invoke_query_is_branch_scoped -> invoke_query_rejects_branch_scope; the per-graph authorize test uses branch: None; docs (policy.md, server.md, the InvokeQuery doc). No wire/OpenAPI change. * Resolve graph config by identity, not server mode Which policy/queries block applies for a graph was decided three different, mode-dependent ways: single-mode boot used top-level even for a named graph; multi-mode used per-graph (and silently ignored a top-level queries block); the CLI used per-graph for a named target. So `queries validate --target prod` could check a different registry than the single-mode server loaded, and a named graph's per-graph policy/queries were silently shadowed. Make config a function of graph IDENTITY: a graph served by NAME (--target/server.graph, a graphs: entry) uses its own graphs.<name>.{policy, queries}; a bare URI is anonymous and uses top-level. One rule, applied by single-mode boot, multi-mode boot, and the CLI — so they can't diverge and the CLI predicts the server exactly. No silent ignore: serving a named graph while a top-level policy/queries block is populated now refuses boot, naming the block (the multi-mode top-level-policy bail, extended to queries and to single-mode-named). The CLI's `queries validate` derives the schema URI and the registry from ONE selection, and a positional URI forces anonymous (ignoring cli.graph) so the two can't come from different graphs. BREAKING (released behavior): single mode by name (--target/server.graph) with top-level policy/queries previously used top-level; it now uses the per-graph block and refuses boot if top-level is also populated. Bare-URI single mode is unchanged. Loud, with migration text pointing at graphs.<name>. 
- config: resolve_policy_file_for (policy sibling of query_entries_for, no top-level fallback) + populated_top_level_blocks for the coherence check. - characterization tests (single-mode named -> per-graph; named + top-level -> bail; multi-mode top-level queries -> bail; CLI positional-URI -> top-level). - docs: policy.md, server.md, cli-reference.md. * docs: RFC-002 credentials keyed by server name (keychain/profile/env) Reworks the RFC's credentials model: secrets are keyed by server name — OS keychain `omnigraph:<server>` (preferred) -> a `[<server>]` profile in `~/.omnigraph/credentials` -> `OMNIGRAPH_TOKEN[_<SERVER>]` env (CI), the AWS/gh/kube model. `servers.<name>` is endpoint-only by default but may carry an explicit, secret-free `auth: { token: { env|file|command|keychain } }` source. The shipped `bearer_token_env` + `.env.omni` dotenv remain a legacy compat path; no `credentials.yaml`. * docs: RFC-002 — typed graph locator (storage/server/graph_id), not a uri string Add §1.1: the resolved graph address is a typed GraphLocator (Embedded{storage} | Remote{server, graph_id}), not a flat uri: String. Diagnoses the string model's cost in the code today (~16 is_remote_uri forks, TargetConfig can't express multi-server x multi-graph, the CLI bails on remote, the ts SDK models baseUrl+graphId separately) and settles the YAML naming so the key names the locus: - storage: (embedded) — shipped uri: is a deprecated alias - server: + graph_id: (remote) — graph_id defaults to the entry key - storage xor server, reject both/neither (no silent ambiguity) Kills the graphs:/graph: collision and the uri:-might-be-a-server ambiguity. Updates the §1/§8 examples and the entry-shape notes to the new naming. * Test: queries list must reject an unknown --target queries list opens no graph URI, so unknown-graph validation does not ride along on resolve_target_uri the way it does for every other command. 
The new test reproduces the gap: with an unknown --target the command currently exits 0 and prints the (empty) top-level registry instead of erroring like the URI-resolving commands do. Fails against current code; the fix follows. * Validate the graph selection in queries list Graph-existence validation was a side effect of URI resolution: every URI-resolving command rejects an unknown --target via resolve_target_uri, but queries list opens no URI, so query_entries_for(Some(unknown)) silently fell back to the top-level registry and showed the wrong (or empty) catalog. Make membership a property of the selection: add the fallible resolve_graph_selection alongside the infallible query_entries_for (a known name passes through, an unknown name errors with the same message as resolve_target_uri, None stays anonymous), and validate the selection in execute_queries_list. query_entries_for is unchanged — server boot's bare-URI path still needs its None -> top-level arm. * Surface policy-engine errors from stored-query invoke The invoke handler mapped every authorize_request failure to 404 ('stored query not found'), which collapsed the authorization decision (deny -> 403) together with operational failures (no actor -> 401, Cedar evaluation error -> 500). A real policy-engine 500 was hidden as a missing query. Separate the two concerns instead of sniffing the masked status. Extract authorize() returning an Authz { Allowed, Denied(msg) } decision and reserve Err for operational failures only; authorize_request becomes a thin wrapper that maps Denied -> 403, so the 16 deny-as-403 callers are unchanged. The invoke handler now matches the decision directly: a denial stays 404 (deny == missing, so the catalog can't be probed without the grant), while a 401/500 propagates with its true status. 500 is now a reachable outcome on POST /queries/{name}; document it in the endpoint responses and regenerate openapi.json. 
* Extract the named-graph/top-level coherence rule into one helper The rule 'a named graph uses its own graphs.<name> block, so a populated top-level block is a config error' lived inline in single-mode server boot. Extract it to OmnigraphConfig::ensure_top_level_blocks_honored so the same definition can be shared by the CLI selection gate (next commit) and the two can't drift. Boot calls the helper; the message is reworded context-neutral (drops 'serving') so it reads correctly from both boot and the CLI. Behavior-preserving: multi-graph mode keeps its own unconditional check, and single_mode_named_graph_rejects_top_level_blocks still passes. * Test: queries validate/list must reject a named graph with a top-level block Server boot refuses a config where a graph is selected by name yet a top-level queries:/policy.file block is populated (the block would be silently ignored). The CLI's queries validate/list resolve the same named selection but skip that coherence check, so they give a false green / list the per-graph block. The new test reproduces it: validate prints OK and list succeeds where boot would refuse. Fails against current code; the fix follows. * Enforce top-level coherence in the single CLI selection gate queries validate validated graph membership only as a side effect of URI resolution and queries list only via resolve_graph_selection's membership check; neither applied the named-graph/top-level coherence rule server boot enforces, so both gave a false green on a config boot refuses. Fold ensure_top_level_blocks_honored into resolve_graph_selection so it is the single gate that returns only valid + server-coherent selections, and route resolve_selected_graph (queries validate) through it; queries list already calls the gate. A named graph with a populated top-level block now errors in both commands, matching boot. A positional URI stays anonymous (top-level honored), so queries_validate_positional_uri_ignores_default_graph is unaffected. 
* docs: RFC-003 — MCP server surface for omnigraph-server Detailed MCP-transport design for the stored-query/MCP work, building on the shipped #128 registry. Corrects the draft against the branch head: the coarse invoke_query gate + 404 denial-masking are already wired (server_invoke_query), so per-query invoke_query scope (PolicyRequest has no query-name dimension yet) is the real prerequisite; positions the doc as superseding rfc-001's MCP transport (/mcp/tools+/mcp/invoke) and reconciles the shipped mcp.expose YAML form and the schema-introspection non-goal; grounds the parity surface in the actual omnigraph-ts package (13 tools with read/change ids, 2 resources). * docs(config): clarify graph config boundaries * fix(config): enforce graph-scoped policies and query validation * fix(cli): require graph selection for scoped query registries * fix(server): preserve named graph id in single mode policy * fix(cli): share graph identity for policy resolution * test(cli): cover policy tooling server graph selection * fix(cli): honor server graph for policy tooling --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-01 22:50:31 +02:00
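The "Surface policy-engine errors from stored-query invoke" change above separates the policy decision from operational failures. A minimal sketch of that shape follows, assuming a toy policy: the actor names, the `AuthzError` type, the `invoke_status` helper, and the plain `u16` status codes are hypothetical stand-ins, not the server's actual types. Only the `Authz { Allowed, Denied(msg) }` decision, the `authorize` / `authorize_request` split, and the status mapping come from the commit message.

```rust
/// A policy decision that was reached without any operational failure.
#[derive(Debug, PartialEq)]
enum Authz {
    Allowed,
    Denied(String),
}

/// Operational failures, kept apart from a policy denial (hypothetical type).
#[derive(Debug, PartialEq)]
enum AuthzError {
    NoActor,
    PolicyEngine(String),
}

impl AuthzError {
    /// Status the failure propagates with: 401 for no actor, 500 for an engine error.
    fn status(&self) -> u16 {
        match self {
            AuthzError::NoActor => 401,
            AuthzError::PolicyEngine(_) => 500,
        }
    }
}

/// Toy policy (assumption): only "agent" holds invoke_query, and the actor
/// "broken-policy" simulates a policy evaluation error.
fn authorize(actor: Option<&str>, action: &str) -> Result<Authz, AuthzError> {
    let actor = actor.ok_or(AuthzError::NoActor)?;
    if actor == "broken-policy" {
        return Err(AuthzError::PolicyEngine("evaluation failed".to_string()));
    }
    if actor == "agent" && action == "invoke_query" {
        Ok(Authz::Allowed)
    } else {
        Ok(Authz::Denied(format!("{actor} may not {action}")))
    }
}

/// Thin wrapper for ordinary endpoints: a denial becomes 403.
fn authorize_request(actor: Option<&str>, action: &str) -> Result<(), u16> {
    match authorize(actor, action) {
        Ok(Authz::Allowed) => Ok(()),
        Ok(Authz::Denied(_)) => Err(403),
        Err(e) => Err(e.status()),
    }
}

/// Stored-query invoke: a denial is masked as 404, while operational
/// failures keep their true status.
fn invoke_status(actor: Option<&str>) -> u16 {
    match authorize(actor, "invoke_query") {
        Ok(Authz::Allowed) => 200,
        Ok(Authz::Denied(_)) => 404,
        Err(e) => e.status(),
    }
}
```

Under these assumptions, `invoke_status(Some("reader"))` yields 404 while `authorize_request(Some("reader"), "invoke_query")` yields `Err(403)`. The two callers differ only in how they map `Denied`.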
Ragnor Comerford	353c0c876a	fix(branch): make branch delete correct under partial failure (#137 ) * test(lance): pin force_delete_branch surface guard Pin the Lance 6.0.1 force_delete_branch behavior the branch-delete single-authority redesign relies on: plain delete_branch errors on a missing ref, force_delete_branch removes an existing forked branch, and the local-store quirk where force_delete on a fully-absent branch still errors (worked around by the upcoming TableStore::force_delete_branch). Re-pin the docs/dev/lance.md alignment stanza (9 guards; 4 runtime). * feat(storage): add force branch-delete to TableStore + CommitGraph Add TableStore::force_delete_branch and CommitGraph::force_delete_branch (idempotent: tolerate an already-absent branch via Lance RefNotFound / NotFound), plus CommitGraph::list_branches for the cleanup reconciler to diff against the manifest authority. RefConflict (referencing descendants) is still surfaced. Unused until the branch-delete rewire. * test(maintenance): red — cleanup reconciles orphaned branch forks Forge a Lance branch on the Person table that the manifest never references (a zombie fork from an incomplete prior delete) and assert cleanup reclaims it while leaving main intact. Fails today: cleanup does not yet reconcile orphaned forks. Goes green with the next commit. * fix(maintenance): reconcile orphaned branch forks in cleanup Add reconcile_orphaned_branches: force_delete_branch every per-table and commit-graph Lance branch absent from the manifest branch set (the authority), children-before-parents. Folded into cleanup_all_tables, runs before version GC. Idempotent and authority-derived; no-ops once nothing is orphaned, and would harmlessly find nothing if a future Lance atomic multi-dataset branch op prevented orphans. Adds TableStore::list_branches and exposes graph_commits_uri(pub crate). Turns the maintenance red test green. 
* test(failpoints): red — branch_delete partial failure converges Add the branch_delete.before_table_cleanup failpoint hook (inert without the feature) and a regression test: a cleanup-step failure after the manifest authority flip must leave branch_delete returning Ok, the branch gone, the orphan stranded, then reclaimed by cleanup, and the name reusable. Fails today: cleanup_deleted_branch_tables propagates the error as a hard failure. Goes green with the next commit. * fix(branch): best-effort fork reclaim after the manifest flip Make branch_delete treat per-table forks and the commit-graph branch as derived state reclaimed best-effort with force_delete_branch after the manifest authority flip. A reclaim failure (transient error, or the branch_delete.before_table_cleanup failpoint) is logged via tracing::warn and swallowed: the branch is already gone and the cleanup reconciler converges the orphan. cleanup_deleted_branch_tables no longer returns an error or blocks the call. Turns the partial-failure recovery test green. * test(failpoints): red — recreate over orphaned fork is actionable After a partial-failure delete leaves a fork orphaned, recreating the branch name and writing to the previously-forked table before cleanup runs currently surfaces the opaque ExpectedVersionMismatch ("stale view ... expected manifest table version N"). Assert instead a clear error pointing the user at cleanup. Goes green with the next commit. * fix(branch): actionable orphan-collision error in fork_branch_from_state When a fork's create_branch collides with an existing target ref, reuse it only if its head matches source_version (a legitimate concurrent first-write). A version mismatch means a zombie fork from an incomplete prior delete: return a manifest_conflict pointing the user at `omnigraph cleanup`, instead of the opaque ExpectedVersionMismatch. Turns the recreate-over-orphan red test green. 
* docs(invariants): single-authority branch-lifecycle + Lance forward-compat Record branch delete in the Current Truth Matrix: manifest is the single authority flipped atomically first, per-table forks + commit-graph branch are derived state reclaimed best-effort with the cleanup reconciler as backstop, and reusing a name whose reclaim failed surfaces an actionable error. Note the reconciler is authority-derived and degrades to a no-op under a future Lance atomic multi-dataset branch op, the same shape as invariant 7. * test(failpoints): red — cleanup isolates a single-table failure Add the cleanup.table_gc failpoint hook (inert without the feature) and an error: Option<String> field on TableCleanupStats (mechanical, always None for now). Regression test: a one-shot version-GC failure for one table must not abort the whole cleanup — assert cleanup still succeeds, surfaces the failure per-table in stats, and the independent reconcile pass still reclaimed an orphan. Fails today: the version-GC collect aborts on the first table error. Goes green with the next commit. * fix(maintenance): fault-isolate cleanup per table Make the cleanup sweep do as much as it can and converge on re-run instead of aborting wholesale on one table's transient error (invariant 13). The version-GC loop now records a per-table failure on its stats row (error: Some) and logs it rather than collecting into a Result that aborts; reconcile_orphaned_branches isolates per-table and commit-graph failures into BranchReconcileStats.failures. The CLI reports any failed tables and tells the user to rerun cleanup. Addresses the Devin review finding. Turns the single-table-failure test green. 
* test(failpoints): red — branch_create heals commit-graph zombie + is atomic Add the branch_delete.before_commit_graph_reclaim failpoint hook and two regression tests: (a) recreating a name whose delete left a commit-graph zombie must succeed (today it dies on Lance's internal Clone error), and (b) branch_create must roll back the manifest branch when the derived commit-graph branch fails (today it leaves the manifest branch created while returning Err). Both fail now; green with the next commit. The existing branch_create_failpoint_triggers test still passes. * fix(branch): make branch_create atomic + heal commit-graph zombie branch_create now flips the manifest authority first, then creates the derived commit-graph branch in create_commit_graph_branch, force-dropping any orphaned commit-graph ref left by an incomplete prior delete (the manifest branch is fresh, so a same-named commit-graph branch is provably a zombie). If commit-graph creation fails, the manifest branch is rolled back so the name never half-exists. Addresses the Codex review finding. Turns the two branch_create red tests green; existing tests unaffected. * test(failpoints): red — fork collision misclassifies live concurrent fork Add the fork.before_classify failpoint hook and a concurrency test: when a concurrent first-write legitimately wins the fork race, the loser must get a retryable refresh-and-retry, not the misleading run-cleanup orphan error. Today the version-comparison misclassifies the live fork as an orphan (the Cursor finding). Goes green with the next commit. * fix(branch): manifest-arbitrated fork-collision classification Classify a fork collision by the manifest authority instead of comparing Lance branch versions. Before forking, open_owned_dataset_for_branch_write re-reads the live manifest: if the table is already forked on the active branch, a concurrent first-write won and the loser gets a retryable refresh-and-retry (not a misleading orphan error). 
fork_branch_from_state no longer guesses from versions — a create collision past that check is an orphan, so it returns the actionable cleanup error. Addresses the Cursor finding; turns the live-concurrent-fork test green, zombie path unchanged. * test(failpoints): close branch-lifecycle test gaps Three coverage additions for the branch-delete work (behavior already correct; these lock it in and catch regressions): - cleanup_isolates_reconcile_failure: inject a force-delete failure into the reconcile loop (new cleanup.reconcile_fork hook) and assert the sweep continues + converges on re-run. Directly covers the reconcile loop the Devin finding was about (previously only version-GC was). - cleanup_reclaims_orphaned_commit_graph_branch: forge a commit-graph orphan via the delete reclaim failpoint and assert cleanup's reconcile_commit_graph_orphans drops it (previously untested). - fork_collision_with_live_concurrent_fork_is_retryable: replace the fixed 300ms sleep with a deterministic readiness signal (cfg_callback + compare_exchange atomics) so the two-writer ordering can't flake. Full failpoints suite 31/0.	2026-06-01 13:28:38 +02:00
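The single-authority branch-delete design above can be sketched as a toy in-memory model. Everything here is an assumption for illustration: the `Graph` struct, its two sets, and the `reclaim_fails` flag (standing in for a transient error or failpoint) are hypothetical and are not the engine's types. The sketch only shows the ordering described in the commit: flip the manifest first, reclaim forks best-effort, and let the cleanup reconciler drop whatever the manifest no longer references.

```rust
use std::collections::BTreeSet;

/// Toy model: the manifest branch set is the authority; forks are derived state.
struct Graph {
    manifest_branches: BTreeSet<String>,
    /// Per-table and commit-graph forks, collapsed into one set for brevity.
    table_forks: BTreeSet<String>,
}

impl Graph {
    fn new(branches: &[&str]) -> Self {
        let set: BTreeSet<String> = branches.iter().map(|b| b.to_string()).collect();
        Graph { manifest_branches: set.clone(), table_forks: set }
    }

    /// Delete flips the manifest authority first, then reclaims forks best-effort.
    fn branch_delete(&mut self, name: &str, reclaim_fails: bool) -> Result<(), String> {
        if !self.manifest_branches.remove(name) {
            return Err(format!("unknown branch {name}"));
        }
        if !reclaim_fails {
            self.table_forks.remove(name);
        }
        // A reclaim failure is swallowed here; the branch is already gone from
        // the authority, and the reconciler below converges the orphan.
        Ok(())
    }

    /// Cleanup reconciler: drop every fork absent from the manifest branch set
    /// and return the reclaimed names.
    fn reconcile_orphaned_branches(&mut self) -> Vec<String> {
        let orphans: Vec<String> = self
            .table_forks
            .difference(&self.manifest_branches)
            .cloned()
            .collect();
        for orphan in &orphans {
            self.table_forks.remove(orphan);
        }
        orphans
    }
}
```

In this model a delete with a failed reclaim still returns `Ok`, the orphan stays in `table_forks` until `reconcile_orphaned_branches` runs, and a second reconcile pass finds nothing, which mirrors the idempotence the commit describes.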
Ragnor Comerford	2d5c4b1202	docs: rename runs.md/runs.rs → writes and repoint all references (#131) The Run state machine was removed in MR-771 (v0.4.0); `docs/dev/runs.md` and `crates/omnigraph/tests/runs.rs` have since documented and tested the direct-publish write path, so the "runs" name was misleading. - git mv docs/dev/runs.md → docs/dev/writes.md (reframe H1 + intro; keep MR-771 history note) - git mv crates/omnigraph/tests/runs.rs → tests/writes.rs (reframe header) - repoint every runs.md / runs.rs reference across docs, AGENTS.md, and source comments - fix four pre-existing broken `docs/runs.md` links (the file never lived at that path) to `docs/dev/writes.md` - fix the stale v0.4.0 anchor to the live section No behavior change: every source edit is a comment. Engine builds and the renamed test passes 25/25; scripts/check-agents-md.sh passes. The run-removal cleanup itself (run_registry.rs guard, __run__ prefix) is deferred to MR-770.	2026-05-30 23:20:56 +02:00
Andrew Altshuler	854ad0afcb	feat(server): compose OMNIGRAPH_TARGET_URI with OMNIGRAPH_CONFIG in entrypoint (#129) The container entrypoint's URI and config branches were mutually exclusive, so a deployment driven by OMNIGRAPH_TARGET_URI could never load a policy file. Forward --config alongside the positional URI when OMNIGRAPH_CONFIG is also set (the URI still wins via resolve_target_uri), enabling Cedar policy without changing how the URI is provided. Add docker/entrypoint_test.sh (arg-composition cases) + a CI job, and document the env-var contract in docs/user/deployment.md. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-05-30 20:17:55 +01:00
Ragnor Comerford	24413844ae	Add Windows release binaries (#127) * Add Windows release binaries * Fix Windows installer downloads	2026-05-30 14:23:40 +02:00
Ragnor Comerford	50910b3753	docs: align release artifact docs	2026-05-29 14:04:16 +02:00
devin-ai-integration[bot]	1a4d2cee97	feat: inline query strings in CLI and HTTP server (#110 ) * feat(MR-656): inline query strings in CLI and HTTP server CLI: - Add -e / --query-string <STRING> to omnigraph read and omnigraph change - Exactly one of --query, --query-string, --alias is required (3-way XOR) - Empty --query-string is rejected with a clear error HTTP: - New POST /query (read-only, clean field names: query/name/params/branch/snapshot) - Mutations on /query are rejected with 400 -- use POST /change instead - ChangeRequest fields polished: query (alias query_source), name (alias query_name) - POST /read and POST /change remain byte-compatible for existing clients Tests: - cli.rs: -e happy-path on read/change, mutex error vs --query, empty -e rejected - system_local.rs: inline -e read and -e change exercise the local flow - system_remote.rs: inline -e read/change over HTTP plus direct /query 200/400 - server.rs: /query 200, /query 400 on mutation, /change legacy field alias - openapi.rs: new /query path, QueryRequest schema, ChangeRequest field-name polish Docs: cli.md (-e examples), cli-reference.md (read/change rows), server.md (/query) Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * feat(MR-656): rename read/change to query/mutate with deprecation signals HTTP server: - Add POST /mutate as canonical write endpoint (pairs with POST /query). - Mark POST /read and POST /change as deprecated. Three-channel signal: * OpenAPI: `deprecated: true` on the operation (every codegen flags the generated SDK method). * RFC 9745: response `Deprecation: true` header on every response. * RFC 8288: response `Link: </successor>; rel="successor-version"` pointing at /query and /mutate respectively. - Share business logic across /mutate and /change via run_mutate(); the /change wrapper is the only place that adds the deprecation headers. - ChangeRequest field aliases (query_source/query_name) preserved. 
- AliasCommand serde now accepts `query`/`mutate` alongside `read`/`change`. CLI: - Promote `omnigraph query` / `omnigraph mutate` to top-level canonical subcommands (clap visible_alias keeps `omnigraph read` / `omnigraph change` working forever). - Promote `omnigraph lint` / `omnigraph check` to top-level (was nested under `omnigraph query lint`, which is now a deprecated argv shim that rewrites to the canonical form). - Argv-level preprocessing prints a one-line deprecation warning to stderr when any legacy spelling is used. Canonical names are silent. Tests: - Server: /mutate works, /change emits Deprecation+Link headers, /read emits Deprecation+Link headers, /query carries no deprecation signal. - OpenAPI: /read and /change flagged deprecated; /query and /mutate not. - CLI: canonical `lint` matches deprecated `query lint` / `query check` output; `read` / `change` print deprecation warnings. Docs: - cli.md: new canonical examples; "Deprecated names" migration table. - cli-reference.md: top-level table updated; aliases.<name>.command accepts both legacy and canonical spellings. - server.md: endpoint inventory shows /query and /mutate as canonical and /read and /change as deprecated; dedicated section explains the three-channel deprecation signal. - og-cheet-sheet.md: use new `omnigraph lint` / `omnigraph check`. - openapi.json regenerated. Migration is purely cosmetic — every deprecated form continues to work indefinitely; only the spelling changes. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * fix(MR-656): address Devin Review findings on /query and /change Two issues raised by Devin Review on PR #110: 1. `POST /query` mutation-rejection error pointed at the deprecated `/change` endpoint instead of the canonical `/mutate`. Fixed in three places: the runtime error message in `server_query`, the utoipa 400-response description, and the handler doc comment. 
The `QueryRequest` schema docstrings in `api.rs` got the same update so the openapi.json bodies match. Server and openapi tests updated. 2. `execute_change_remote` serialized `ChangeRequest` directly, which emits the new canonical field names `query` / `name` on the wire. `#[serde(alias = "query_source")]` only affects deserialization, so a newer CLI talking to an older server would have its `/change` POST body fail with "missing field: query_source". Fixed by extracting a `legacy_change_request_body` helper that hand-rolls the JSON with the legacy keys (`query_source` / `query_name`), the same byte-stable contract `execute_read_remote` already uses against `/read`. Added two unit tests on the helper to lock the wire shape in. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * docs(dev): RFC 001 — inline + stored queries, envelope, MCP Tracked artifact consolidating the design across MR-656 (this branch), MR-976 (Phase 1 envelope hardening parent, with MR-977/978/979/980 sub-issues), and MR-969 (stored queries + MCP). Sections: * Two paths, one engine — inline `/query` + `/mutate` (this PR) coexist with stored `/queries/{name}` (MR-969). Same `run_query` / `run_mutate` backend (the fold-in landed in the previous commit). * Request envelope ("before") — Idempotency-Key, If-Match, X-Deadline, X-Trace-Id, expect, dry_run, fields. Phase 1 ships the load-bearing subset on `/mutate`. * Response envelope ("after") — audit_id, snapshot_id, commit_id, stats, warnings. Closes the provenance loop today's `ChangeOutput` leaves open. * `.gq` pragmas — `@description`, `@returns`, `@mcp`. Source-of-truth for the stored-query agent contract; no separate YAML registry. * Multi-graph MCP — per-graph `/graphs/{id}/mcp/tools` + `/mcp/invoke`. Token binds to one graph by default; cross-graph agents loop. * Cedar split — `read`/`change` for inline, `invoke_query` for stored. Operators deny ad-hoc for agent groups while keeping curated tool list open. 
* Rejected alternatives — per-env override files, compiled bundles, tool-name prefixing across graphs, body-field graph dispatch. Index entry added under "Active Implementation Plans" so future agents land on the RFC before touching queries / mutations / envelope code. `scripts/check-agents-md.sh` clean (35 links, 34 docs). * docs(server): clarify why run_query lacks AppState parameter run_mutate takes state for workload admission; run_query doesn't because reads aren't admission-gated today. Mark the asymmetry as intentional and flag the two future events that would grow the signature: Phase 1's `expect: { max_rows_scanned: N }` budget (MR-976) or per-actor admission extending to stored-read invocations (MR-969). Prevents the natural "make these symmetrical" follow-up. * refactor(server): run_query / run_mutate take &ResolvedActor Replace `Option<Extension<ResolvedActor>>` in the helpers with `Option<&ResolvedActor>`. Saves MR-969's stored-query handler from wrapping a bare actor in axum's `Extension(...)` before calling. Handler signatures (`server_query`, `server_read`, `server_mutate`, `server_change`) keep `Option<Extension<ResolvedActor>>` because that is what axum injects, and unwrap at the call site with `actor.as_ref().map(|Extension(actor)| actor)`. Net: -13/+10 LOC, 89/0 server tests pass. * docs(releases): v0.6.0 — describe inline + canonical-named queries (MR-656) Extend the v0.6.0 release notes to cover the third piece of work landing alongside the graph terminology rename and multi-graph server mode: canonical-named `POST /query` and `POST /mutate` endpoints, the CLI's new `-e/--query-string` flag, the top-level promotion of `lint` / `check`, and the three-channel deprecation signal on `/read` and `/change` (OpenAPI `deprecated: true` + RFC 9745 + RFC 8288). Additions: * Top blurb: "Two pieces" -> "Three pieces" with a bullet describing the rename + inline flow. 
* Breaking Changes: new "Query / mutation rename" subsection covering the `ChangeRequest` field rename (with the back-compat serde aliases and the CLI's `legacy_change_request_body` byte-stable wire helper) and the `omnigraph query lint` -> `omnigraph lint` move. * New: 5 bullets — the two endpoints, the CLI subcommands, the `-e` flag, the deprecation signal channels, the widened `aliases.<name>.command` vocabulary. * User Impact: one bullet making explicit that the rename is cosmetic on the client side and migration is voluntary. * Documentation: pointers to the updated `server.md` / `cli.md` / `cli-reference.md` and the new `docs/dev/rfc-001-queries-envelope-mcp.md`. +15/-1 lines. `./scripts/check-agents-md.sh` clean. * refactor(cli): demote `check` from visible_alias to deprecation shim `omnigraph check` was a clap `visible_alias` on `lint`, advertised in `--help` as an equivalent canonical name. Per MR-981 §6 (long-form flags as canonical, short forms as visible aliases), visible aliases on subcommand names hurt agent CX: agents emit either spelling depending on training-data drift, and there's no length signal pointing at the canonical name. Changes: * Remove `#[command(visible_alias = "check")]` from the `Lint` variant. `omnigraph --help` now shows only `lint`. * Add bare `check` to `rewrite_deprecated_argv` so `omnigraph check <args>` still works — it rewrites to `omnigraph lint <args>` and emits a one-line stderr deprecation warning, matching the existing pattern for `read` / `change` / `query lint` / `query check`. * Fix the nested `query check` shim to substitute `check` -> `lint` in the rewritten argv (previously it relied on `check` being a visible_alias to reach the `Lint` variant). * New test `deprecated_check_top_level_rewrites_to_lint` covers: bare `check` produces identical stdout to `lint`, emits the deprecation warning, and `check` does NOT appear as an alias in `omnigraph --help`. 
* Release notes updated to reflect the deprecation-shim treatment and cross-reference MR-981 §6 reasoning. Cargo / Go users typing `check` still work indefinitely; one stderr nudge per invocation teaches the canonical name. Agents see only `lint` in `--help --json` so they emit one canonical form. 67/0 omnigraph-cli tests pass; 39 workspace test suites green. --------- Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com> Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com> Co-authored-by: Ragnor Comerford <hello@ragnor.co>	2026-05-29 13:41:54 +02:00
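The argv-level deprecation shim described above (`read` / `change` / `check` / `query lint` / `query check` rewritten to their canonical names, with a one-line warning) might look roughly like the following. This is a sketch under assumptions, not the CLI's code: the function name is borrowed from the commit message, but the signature, the warning wording, and returning the warning instead of printing it to stderr are choices made here so the example can be tested.

```rust
/// Rewrite a legacy subcommand spelling to its canonical form.
/// Returns the (possibly rewritten) argv and an optional deprecation warning.
/// Hypothetical signature; the real CLI prints the warning to stderr.
fn rewrite_deprecated_argv(args: &[String]) -> (Vec<String>, Option<String>) {
    // args[0] is the program name; args[1] the subcommand, args[2] a nested one.
    let sub = args.get(1).map(String::as_str);
    let nested = args.get(2).map(String::as_str);

    // (canonical name, number of legacy tokens consumed, legacy spelling)
    let (canonical, consumed, legacy): (&str, usize, String) = match (sub, nested) {
        (Some("query"), Some(n @ ("lint" | "check"))) => ("lint", 2, format!("query {n}")),
        (Some("read"), _) => ("query", 1, "read".to_string()),
        (Some("change"), _) => ("mutate", 1, "change".to_string()),
        (Some("check"), _) => ("lint", 1, "check".to_string()),
        // Canonical or unknown spellings pass through silently.
        _ => return (args.to_vec(), None),
    };

    let mut out = vec![args[0].clone(), canonical.to_string()];
    out.extend_from_slice(&args[1 + consumed..]);
    let warning = format!("warning: `{legacy}` is deprecated; use `{canonical}`");
    (out, Some(warning))
}
```

With this sketch, `omnigraph query check q.gq` becomes `omnigraph lint q.gq` plus a warning, while `omnigraph query -e ...` is returned unchanged with no warning, matching the "canonical names are silent" behavior in the commit message.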
Ragnor Comerford	e0f13b32c5	(feat): multi-graph server mode (#119 ) * mr-668: add GraphId newtype + Cloud-mode forward identity stubs (PR 1/10) PR 1 of the MR-668 multi-graph server work. Pure types, no runtime behavior changes yet. Ships the validated identity vocabulary that the rest of the implementation will consume: - `GraphId(String)` — `^[a-zA-Z0-9-]{1,64}$`, leading underscore rejected (engine reserves every `_` filename), reserved route names rejected (`policies`, `healthz`, `openapi`, `openapi.json`, `graphs`). Validation lives in `try_from` only; serde `Deserialize` re-runs it so JSON payloads cannot bypass. - `TenantId(String)` — same regex shape as GraphId. `None` in Cluster mode; reserved for Cloud mode (RFC 0003) where it carries the OAuth `org_id` claim. - `GraphKey { tenant_id: Option<TenantId>, graph_id }` — the registry HashMap key. `cluster()` constructor for the Cluster-mode default. - `Scope` enum with `Full` variant — Cluster mode default; RFC 0004 will extend with OAuth scopes (`graph:read`/`write`/`admin`/``). - `AuthSource` enum with `Static` variant — Cluster mode default; RFC 0001 step 1 will add `Oidc`. - `ResolvedActor { actor_id, tenant_id, scopes, source }` — replaces the upcoming refactor of `AuthenticatedActor(Arc<str>)` in PR 4a. Per MR-668 design decision 13: ship the Cloud-mode forward type shapes now (no `TokenVerifier` trait yet — that's RFC 0001 step 1) so handler signatures stay stable across the Cluster → Cloud trajectory. `Scope` and `AuthSource` use `#[non_exhaustive]` so future variants don't break caller matches. Tests: 26 new (15 graph_id + 11 identity), all passing. No regression in the existing 36 server library tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: Omnigraph::init error-path cleanup + three failpoints (PR 2a/10) PR 2a of the MR-668 multi-graph server work. 
Bug fix: a partially-failed `Omnigraph::init` previously left orphan schema files at the graph URI, making the URI unusable for a retry (the next `init` would refuse because `_schema.pg` already exists). Changes: 1. `init_with_storage` now wraps the I/O phase. On any error from `init_storage_phase`, calls `best_effort_cleanup_init_artifacts` to remove the three schema files before returning the original error: - `_schema.pg` - `_schema.ir.json` - `__schema_state.json` Cleanup is best-effort: a failure to delete is logged via `tracing::warn` but does NOT mask the init error. 2. Three failpoints added at the init phase boundaries: - `init.after_schema_pg_written` - `init.after_schema_contract_written` - `init.after_coordinator_init` 3. Four new failpoint tests in `tests/failpoints.rs` pin the cleanup behavior at each boundary plus the "original error wins over cleanup error" contract. All 23 failpoint tests pass. Coverage gap (documented in code comments): Lance per-type datasets and `__manifest/` directory created by `GraphCoordinator::init` are NOT cleaned up after a coordinator-init-phase failure. Recursive directory deletion requires `StorageAdapter::delete_prefix`, which was deferred along with `DELETE /graphs/{id}` (originally PR 2b). When that primitive lands, the third failpoint test can be tightened to assert the graph root is fully empty. Tests: 4 new (init_failpoint_), all 23 failpoint tests green. No regression in the 105 engine library tests or 64 end_to_end tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> mr-668: add GraphHandle + GraphRegistry data structure (PR 3/10) PR 3 of the MR-668 multi-graph server work. Pure data structure — no routing changes yet (that's PR 4a). New file: `crates/omnigraph-server/src/registry.rs` - `GraphHandle { key: GraphKey, uri: String, engine: Arc<Omnigraph>, policy: Option<Arc<PolicyEngine>> }` — the per-graph state that the routing middleware (PR 4a) will inject as a request extension. 
- `RegistrySnapshot { graphs: HashMap<GraphKey, Arc<GraphHandle>> }` — immutable snapshot; replaced atomically via `ArcSwap`. - `GraphRegistry { snapshot: ArcSwap<_>, mutate: Mutex<()> }` — lock-free reads, mutex-serialized mutations. - `RegistryLookup { Ready(Arc<GraphHandle>) \| Gone }` — two-valued, no `Tombstoned` variant since DELETE is deferred in v0.7.0 scope. - `InsertError { DuplicateKey \| DuplicateUri }` — both rejection cases for create-graph (maps to HTTP 409 in PR 7). - Methods: `new`, `from_handles` (bulk startup-time init), `get`, `list`, `len`, `insert`. Race semantics pinned by three multi-thread tests: - `concurrent_insert_same_key_exactly_one_succeeds` — N=8 spawned inserts with the same key; exactly 1 returns Ok, 7 return DuplicateKey. - `concurrent_insert_distinct_keys_all_succeed` — N=8 spawned inserts with distinct keys; all succeed. - `concurrent_reads_during_inserts_see_consistent_snapshots` — reader loop concurrent with sequential writes; every listed handle's key resolves via `get()` (no torn state). Why no tombstones field: `DELETE /graphs/{id}` is deferred to bound the scope of v0.7.0. Without a delete endpoint, there's no use for tombstones — every key in the registry is `Ready`, and every key not in the registry is `Gone`. When DELETE lands later, the `Tombstoned` variant + `tombstones: HashSet<GraphKey>` slot in additively without breaking caller signatures (the `Gone` variant remains the "not currently active" case). Why `tokio::sync::Mutex`: insert is async because PR 7's flow holds this mutex across the atomic YAML rewrite step (file I/O). std::Mutex would footgun across .await. Dependency additions: `arc-swap = { workspace = true }`, `thiserror = { workspace = true }` (used by InsertError). Tests: 12 new (12 passing). 74 server lib tests total green (62 from PR 1 + 12 new). Clippy clean on server crate. 
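Illustrative sketch of the snapshot-swap registry described above, not the code in `registry.rs`. It assumes a std-only substitute for `ArcSwap` (an `RwLock` that readers hold only long enough to clone an `Arc`), a synchronous `Mutex` instead of `tokio::sync::Mutex`, and a pared-down `GraphHandle` with string keys; those simplifications are assumptions made so the example compiles without external crates.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};
use std::thread;

/// Simplified stand-in for the commit's `GraphHandle` (no engine, no policy).
#[derive(Debug)]
struct GraphHandle {
    key: String,
    uri: String,
}

#[derive(Debug, PartialEq)]
enum InsertError {
    DuplicateKey,
    DuplicateUri,
}

#[derive(Debug)]
enum RegistryLookup {
    Ready(Arc<GraphHandle>),
    Gone,
}

/// Immutable snapshot: writers build a new map and swap the `Arc`.
type Snapshot = Arc<HashMap<String, Arc<GraphHandle>>>;

struct GraphRegistry {
    // Assumption: `RwLock<Arc<_>>` stands in for `ArcSwap<_>` here.
    snapshot: RwLock<Snapshot>,
    // Serializes writers so the duplicate check and the swap happen together.
    mutate: Mutex<()>,
}

impl GraphRegistry {
    fn new() -> Self {
        Self { snapshot: RwLock::new(Arc::new(HashMap::new())), mutate: Mutex::new(()) }
    }

    /// Readers clone the `Arc` and then work on their own consistent snapshot.
    fn load(&self) -> Snapshot {
        Arc::clone(&self.snapshot.read().unwrap())
    }

    fn get(&self, key: &str) -> RegistryLookup {
        match self.load().get(key) {
            Some(handle) => RegistryLookup::Ready(Arc::clone(handle)),
            None => RegistryLookup::Gone,
        }
    }

    fn insert(&self, handle: GraphHandle) -> Result<(), InsertError> {
        let _guard = self.mutate.lock().unwrap();
        let current = self.load();
        if current.contains_key(&handle.key) {
            return Err(InsertError::DuplicateKey);
        }
        if current.values().any(|h| h.uri == handle.uri) {
            return Err(InsertError::DuplicateUri);
        }
        // Copy-on-write: clone the map, add the entry, publish the new snapshot.
        let mut next: HashMap<_, _> = (*current).clone();
        next.insert(handle.key.clone(), Arc::new(handle));
        *self.snapshot.write().unwrap() = Arc::new(next);
        Ok(())
    }
}

fn main() {
    // Mirrors the same-key race test: eight inserts, one winner.
    let registry = Arc::new(GraphRegistry::new());
    let workers: Vec<_> = (0..8)
        .map(|i| {
            let r = Arc::clone(&registry);
            thread::spawn(move || {
                r.insert(GraphHandle { key: "alpha".into(), uri: format!("/data/{i}.omni") })
            })
        })
        .collect();
    let winners = workers.into_iter().map(|w| w.join().unwrap()).filter(Result::is_ok).count();
    println!("successful inserts: {winners}");
}
```

Because every reader works from a cloned `Arc`, a handle obtained before an insert stays valid afterwards, which is the property the long-running export path relies on.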
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: router restructure + handler refactor for multi-graph (PR 4a/10) PR 4a of the MR-668 multi-graph server work. The heaviest single PR — rewires every handler to extract `Arc<GraphHandle>` from a routing middleware, replaces `AuthenticatedActor(Arc<str>)` with `ResolvedActor` everywhere, and adds the `ServerMode` discriminator. Behavior changes: - Single mode (legacy `omnigraph-server <URI>`): flat routes (`/snapshot`, `/read`, `/branches`, …) continue to work exactly as v0.6.0. Internally, the registry holds a single handle keyed by the sentinel `SINGLE_GRAPH_KEY_ID = "default"`; routing middleware injects that handle on every request. No HTTP-visible change. - Multi mode (new): routes nest under `/graphs/{graph_id}/...`. Routing middleware extracts the graph id from the path, looks it up in the registry, and injects the handle. 404 if not found. (Multi-mode startup itself lands in PR 5; this PR provides the router-side wiring.) AppState refactor: - `engine: Arc<Omnigraph>` and `policy_engine: Option<Arc<PolicyEngine>>` fields removed — both now live inside `GraphHandle` in the registry. - `mode: ServerMode { Single { uri } \| Multi { config_path } }` added. - `registry: Arc<GraphRegistry>` added. - `server_policy: Option<Arc<PolicyEngine>>` added (placeholder for management endpoints in PR 6b; unused today). - Existing constructors (`new`, `new_with_bearer_token{s,_and_policy}`, `new_with_workload`, `open`) build a single-mode AppState internally and remain source-compatible. Tests that constructed AppState via these constructors continue to work. - `with_policy_engine` post-construction setter — rebuilds the single-mode handle with the policy attached. Engine-layer enforcement is NOT reinstalled (matches the old single-field semantics; `open_with_bearer_tokens_and_policy` is the path that installs both layers). - `new_multi` constructor added for PR 5's startup loop. 
- `uri()` now returns `Option<&str>` (Some in single, None in multi). Routing middleware: - `resolve_graph_handle` injects `Arc<GraphHandle>` as a request extension. Mode-aware: single returns the only handle; multi parses `/graphs/{graph_id}/...` from the URI. Returns 404 in multi mode when the graph id is unregistered. Records `graph_id` on the current tracing span. - `require_bearer_auth` updated to insert `ResolvedActor` (was `AuthenticatedActor`). Handler refactor — every protected handler: - Gains `Extension(handle): Extension<Arc<GraphHandle>>` param. - Replaces `state.engine` → `handle.engine`. - Replaces `state.policy_engine()` → `handle.policy.as_deref()`. - Replaces `state.uri()` → `handle.uri.as_str()` (or `.clone()` where String is needed). - Replaces `Arc::clone(&state.engine)` → `Arc::clone(&handle.engine)` (the spawn-and-clone pattern in `server_export` — proof that a long-running export survives the registry being mutated later). authorize_request signature: - Was: `(state: &AppState, actor: Option<&AuthenticatedActor>, request: PolicyRequest)`. - Now: `(actor: Option<&ResolvedActor>, policy: Option<&PolicyEngine>, request: PolicyRequest)`. - Per-graph callers pass `handle.policy.as_deref()`. The (future PR 6b) management endpoints will pass `state.server_policy.as_deref()`. MR-731 invariant preserved: - The single chokepoint `request.actor_id = actor.actor_id.as_ref().to_string()` inside `authorize_request` still overwrites any client-supplied actor identity. Regression test `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` at `tests/server.rs:1114-1216` passes unchanged. Tests: 0 new (the registry race tests in PR 3 already cover the data structure; this PR exercises them indirectly via the existing test suite). 74 lib + 57 server integration + 60 openapi = 191 tests green. Clippy clean. LOC: +397 insertions, -153 deletions in `crates/omnigraph-server/src/lib.rs`. 
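Illustrative sketch of the mode-aware lookup that `resolve_graph_handle` is described as performing. The function names, the plain `HashMap<String, String>` registry, and the 400 status for a malformed path are assumptions for the example; only the `default` sentinel, the `/graphs/{graph_id}/...` shape, and the 404 for an unregistered id come from the message above.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the commit's `ServerMode` (payload fields omitted).
enum ServerMode {
    Single,
    Multi,
}

/// Sentinel key named in the commit message for single mode.
const SINGLE_GRAPH_KEY_ID: &str = "default";

/// Pulls `{graph_id}` out of `/graphs/{graph_id}/...`; `None` if the shape is wrong.
fn graph_id_from_path(path: &str) -> Option<&str> {
    let rest = path.strip_prefix("/graphs/")?;
    let id = rest.split('/').next()?;
    if id.is_empty() { None } else { Some(id) }
}

/// Hypothetical resolver: returns the graph URI, or an HTTP status code on failure.
fn resolve_graph<'a>(
    mode: &ServerMode,
    registry: &'a HashMap<String, String>,
    original_path: &str,
) -> Result<&'a str, u16> {
    let id = match mode {
        // Single mode ignores the path and always uses the sentinel key.
        ServerMode::Single => SINGLE_GRAPH_KEY_ID,
        // Assumption: a path that does not match the cluster shape maps to 400.
        ServerMode::Multi => graph_id_from_path(original_path).ok_or(400u16)?,
    };
    // Unregistered graph ids map to 404.
    registry.get(id).map(String::as_str).ok_or(404)
}

fn main() {
    let mut registry = HashMap::new();
    registry.insert("alpha".to_string(), "/data/alpha.omni".to_string());
    println!("{:?}", resolve_graph(&ServerMode::Multi, &registry, "/graphs/alpha/snapshot"));
    println!("{:?}", resolve_graph(&ServerMode::Multi, &registry, "/graphs/beta/snapshot"));
}
```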
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> mr-668: OpenAPI multi-mode cluster filter (PR 4b/10) PR 4b of the MR-668 multi-graph server work. In multi mode, the served `/openapi.json` reports cluster routes (`/graphs/{graph_id}/...`) instead of the legacy flat protected paths — matching what `build_app` actually mounts (PR 4a's `Router::nest`). Single mode is unchanged. Implementation: - New `server_openapi` branch: when `state.mode()` is `Multi`, call `nest_paths_under_cluster_prefix(&mut doc)` after `ApiDoc::openapi()`. - The rewrite consumes `doc.paths.paths`, then for every path-item: - If the path is in `ALWAYS_FLAT_PATHS` (`/healthz` for now), keep it flat. - Otherwise, prefix every operation_id with `cluster_` and reinsert the item at `/graphs/{graph_id}<original_path>`. - Single mode hits no extra work — the path map is untouched. - The static `ApiDoc::openapi()` still emits the flat surface, so in-process callers (the existing `openapi_json()` helper in tests) see the unmodified spec. Why cluster_ prefix on operation IDs: OpenAPI specs require unique operation_ids across the document. With both flat (single-mode) and cluster (multi-mode) surfaces ever co-existing in a generated SDK, the prefix prevents collision. The current served doc only carries one surface, so the prefix is forward-compat with potential future dual-surface generation. Tests: 6 new in `tests/openapi.rs`, all via the `/openapi.json` route (not the static `ApiDoc::openapi()` helper): - `multi_mode_openapi_lists_cluster_paths` — every protected path appears as a cluster variant. - `multi_mode_openapi_drops_flat_protected_paths` — flat protected paths are absent. - `multi_mode_openapi_keeps_healthz_flat` — `/healthz` survives. - `multi_mode_openapi_prefixes_operation_ids_with_cluster` — every cluster operation_id starts with `cluster_`. - `multi_mode_operation_ids_are_unique` — no operation_id collisions. 
- `single_mode_openapi_unchanged_by_cluster_filter` — single mode still emits the legacy flat surface (regression). New test helper `app_for_multi_mode(graph_ids)` exercises the new `AppState::new_multi` constructor from PR 4a — first user of multi-mode construction outside of unit tests. Result: 66 openapi tests + 57 server integration tests + 74 lib tests = 197 green. No regression in the existing OpenAPI drift check (`openapi_spec_is_up_to_date` still validates the static flat surface matches the committed openapi.json). LOC: +67 in lib.rs (rewrite logic), +219 in tests/openapi.rs (test suite + helper). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: multi-graph startup + mode inference (PR 5/10) PR 5 of the MR-668 multi-graph server work. This is the first PR that makes multi mode actually usable end-to-end: operators invoking `omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:` map and no single-mode selector now get a running multi-graph server. Mode inference (MR-668 decision 2, four-rule matrix in `load_server_settings`): 1. CLI `<URI>` positional → Single 2. CLI `--target <name>` → Single (URI from graphs.<name>) 3. `server.graph` in config → Single (URI from graphs.<name>) 4. `--config` + non-empty `graphs:` + no single-mode selector → Multi (all entries in `graphs:`) 5. otherwise → error with migration hint Rule 5's error message names every escape hatch so operators can fix their invocation without grepping docs. Config schema extensions: - `TargetConfig.policy: PolicySettings` (per-graph Cedar policy file). `#[serde(default)]` so existing single-graph YAMLs keep parsing. - `ServerDefaults.policy: PolicySettings` (server-level Cedar policy for management endpoints — loaded in PR 5, wired into `GET /graphs` in PR 6b). - `OmnigraphConfig::resolve_target_policy_file(name)` and `resolve_server_policy_file()` helpers — both resolve relative to the config file's `base_dir`. 
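Illustrative sketch of the mode-inference matrix listed above, written as a pure function. The `Inputs` struct, the `infer_mode` name, and the error strings are hypothetical; the real logic lives in `load_server_settings` and may differ. Precedence between `--target` and `server.graph` (CLI first) is an assumption drawn from the rule ordering.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum Mode {
    Single { uri: String },
    Multi { graphs: Vec<(String, String)> },
}

/// Hypothetical bundle of everything the inference looks at.
#[derive(Clone, Copy)]
struct Inputs<'a> {
    cli_uri: Option<&'a str>,
    cli_target: Option<&'a str>,
    config_server_graph: Option<&'a str>,
    /// `Some` when `--config` was given; maps graph name to URI.
    config_graphs: Option<&'a BTreeMap<String, String>>,
}

fn infer_mode(i: &Inputs<'_>) -> Result<Mode, String> {
    // Rule 1: a positional URI always selects single mode.
    if let Some(uri) = i.cli_uri {
        return Ok(Mode::Single { uri: uri.to_string() });
    }
    // Rules 2 and 3: a named selector resolves through `graphs.<name>`.
    if let Some(name) = i.cli_target.or(i.config_server_graph) {
        let uri = i
            .config_graphs
            .and_then(|graphs| graphs.get(name))
            .ok_or_else(|| format!("graph `{name}` is not defined under `graphs:`"))?;
        return Ok(Mode::Single { uri: uri.clone() });
    }
    match i.config_graphs {
        // Rule 4: config with a non-empty map and no selector means multi mode.
        Some(graphs) if !graphs.is_empty() => Ok(Mode::Multi {
            graphs: graphs.iter().map(|(k, v)| (k.clone(), v.clone())).collect(),
        }),
        // Rule 5: everything else is an error that names the escape hatches.
        _ => Err("no graph selected: pass <URI>, --target <name>, set server.graph, \
                  or use --config with a non-empty graphs: map"
            .to_string()),
    }
}

fn main() {
    let mut graphs = BTreeMap::new();
    graphs.insert("dev".to_string(), "/data/dev.omni".to_string());
    let inputs = Inputs {
        cli_uri: None,
        cli_target: None,
        config_server_graph: None,
        config_graphs: Some(&graphs),
    };
    println!("{:?}", infer_mode(&inputs));
}
```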
Public types added to `omnigraph-server`: - `ServerConfigMode { Single { uri, policy_file } \| Multi { graphs, config_path, server_policy_file } }`. - `GraphStartupConfig { graph_id, uri, policy_file }` — one entry per graph in multi mode. `ServerConfig` shape change: - WAS: `{ uri: String, bind, policy_file, allow_unauthenticated }`. - NOW: `{ mode: ServerConfigMode, bind, allow_unauthenticated }`. - Breaking for any code that constructs `ServerConfig` directly. `main.rs` is unaffected (uses `load_server_settings`). `serve()` now forks on `ServerConfig.mode`: - Single: existing flow via `AppState::open_with_bearer_tokens_and_policy`. - Multi: parallel open via `futures::stream::iter(graphs) .map(open_single_graph).buffer_unordered(4).collect()`. Bound 4 is a rule-of-thumb for I/O-bound work — at N≤10 this trades startup latency for a small amount of concurrent S3/Lance open pressure. Fail-fast: first open error aborts startup; in-flight opens drop their engine via Arc (Lance datasets close cleanly). New helper `open_single_graph(GraphStartupConfig)`: - Validates `GraphId` per the regex in PR 1. - `Omnigraph::open(uri).await` with descriptive error context. - Loads per-graph policy file and re-applies it via `Omnigraph::with_policy` (engine-layer enforcement, MR-722). - Returns `Arc<GraphHandle>` ready for the registry. Routing middleware bug fix: - `Router::nest("/graphs/{graph_id}", inner)` rewrites `request.uri().path()` to the inner suffix (e.g. `/snapshot`). The previous middleware tried to parse `{graph_id}` from `request.uri().path()` and got 400 instead of 200. Fixed by reading from `axum::extract::OriginalUri` request extension, which preserves the pre-rewrite URI. - Caught by the two new tests `cluster_routes_dispatch_per_graph_handle` and `cluster_route_for_unknown_graph_returns_404`. Tests (14 new, all passing): - Four-rule matrix: one test per branch + the joint case `mode_inference_cli_uri_overrides_graphs_map` + the empty-graphs-map error case. 
- Per-graph + server-level policy file path resolution. - Reserved `GraphId` rejection at startup. - End-to-end multi-graph routing: two graphs side by side, each cluster route hits the right engine. - Unknown graph id under cluster prefix → 404. - Flat routes 404 in multi mode. Inline `ServerConfig` test (`serve_refuses_to_start_in_state_1_without_unauthenticated`) and three `server_settings_` tests updated to the new `mode` shape. Result: 211 server tests green (74 lib + 71 integration + 66 openapi), MR-731 regression test still pinned and passing. LOC: +45 config.rs, +281 lib.rs (net), +395 tests/server.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> mr-668: Cedar resource-model refactor (PR 6a/10) PR 6a of the MR-668 multi-graph server work. Policy-crate-only refactor — no HTTP handler changes, no operator-supplied policy.yaml changes. Sets up the chassis that PR 6b's `GET /graphs` consumes. Two new `PolicyAction` variants: - `GraphCreate` — gates `POST /graphs` (deferred behavioral PR). - `GraphList` — gates `GET /graphs` (lands in PR 6b). Note: `GraphDelete` is intentionally NOT added in this PR. `DELETE /graphs/{id}` is deferred from MR-668's v0.7.0 scope to bound complexity (no `delete_prefix`, no tombstone, no `RegistryLookup::Tombstoned`). Adding the Cedar action without a consumer would be the same kind of "dead vocabulary" trap the `Admin` variant already documents. New `PolicyResourceKind { Graph, Server }` enum, plus a `PolicyAction::resource_kind()` method that classifies every action. Per-graph actions (Read, Change, BranchCreate, …) bind to `Omnigraph::Graph::"<graph_label>"`; server-scoped actions (GraphCreate, GraphList) bind to the singleton `Omnigraph::Server::"root"`. `Admin` stays classified as per-graph for now — MR-724 will pick the final shape when the first consumer surface ships. Cedar schema string additions: - `entity Server;` - `action "graph_create" appliesTo { principal: Actor, resource: Server, ... 
}` - `action "graph_list" appliesTo { principal: Actor, resource: Server, ... }` Compiler updates: - `compile_policy_source` picks the resource literal based on the action's `resource_kind`. Existing graph-only policies generate the same Cedar source as before — pinned by `per_graph_rules_continue_to_work_alongside_server_rules`. - `compile_entities` includes the `Server::"root"` entity only when a rule references a server-scoped action. Keeps test assertions for graph-only policies tight. - `PolicyEngine::authorize` builds the right resource UID at request time based on `request.action.resource_kind()`. Validation rules added to `PolicyConfig::validate`: - A rule may not mix server-scoped and per-graph actions (different resource kinds need different `permit` clauses). - Server-scoped actions cannot have `branch_scope` or `target_branch_scope` — there's no branch context at the server level. Operator impact: zero. The Cedar schema `Omnigraph::Server` entity is internally referenced by `compile_policy_source`; operator policy.yaml files only declare actions in `rules[].allow.actions` and never reference the resource entity directly. Decision 6's "internal rename only; operator policies unaffected" contract is preserved and pinned by `per_graph_rules_continue_to_work_alongside_server_rules`. Tests: 5 new (11 policy tests total, up from 6): - `graph_list_action_authorizes_against_server_resource` - `graph_create_action_authorizes_against_server_resource` - `server_scoped_rule_cannot_use_branch_scope` - `rule_mixing_server_and_per_graph_actions_is_rejected` - `per_graph_rules_continue_to_work_alongside_server_rules` No regression: 145 server tests (74 lib + 71 integration) still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: GET /graphs endpoint + per-graph policy wire-up (PR 6b/10) PR 6b of the MR-668 multi-graph server work. 
First management endpoint — `GET /graphs` lists every graph registered with the server, gated by the server-level Cedar policy from PR 6a. New API shapes (in `omnigraph-server::api`): - `GraphInfo { graph_id, uri }` — one entry per registered graph. - `GraphListResponse { graphs: Vec<GraphInfo> }` — sorted alphabetically by `graph_id` for deterministic output. Handler `server_graphs_list`: - Mounted at `GET /graphs` in both modes. - Single mode: returns 405 (resource exists in the API surface, just not operational without a `graphs:` map). 405 chosen over 404 so clients see "resource exists, wrong context" rather than "no such resource". - Multi mode: requires bearer auth (when configured); Cedar-gated by `PolicyAction::GraphList` against `Omnigraph::Server::"root"` (PR 6a's chassis). Returns the sorted registry list. Cedar gate composition: - When no `server.policy.file` is configured, the MR-723 default-deny falls through: `GraphList` is not `Read`, so an authenticated actor without a server policy gets 403. This is the right default — don't expose the registry until the operator explicitly authorizes it. - When a server policy is configured, Cedar evaluates the rule. The test `get_graphs_with_server_policy_authorizes_per_cedar` pins the admin-allow / viewer-deny split. Routing: - New `management` sub-router holding `/graphs` (auth-required, no `resolve_graph_handle` middleware — operates on the registry, not a single graph). - Single mode merges flat protected routes + management. - Multi mode merges nested `/graphs/{graph_id}/...` + management. OpenAPI: - `server_graphs_list` registered in `ApiDoc::paths(...)`. - `EXPECTED_PATHS` in `tests/openapi.rs` gains `/graphs`. - `openapi.json` regenerated (auto-tracked by `openapi_spec_is_up_to_date` in CI). 
Tests: 4 new in `tests/server.rs::multi_graph_startup`: - `get_graphs_lists_registered_graphs_in_multi_mode` - `get_graphs_returns_405_in_single_mode` - `get_graphs_requires_bearer_auth_when_configured` - `get_graphs_with_server_policy_authorizes_per_cedar` What's NOT in this PR (deferred): - Per-graph policy enforcement is wired through `handle.policy` (PR 4a already did this); PR 6b doesn't add new per-graph behavior beyond making sure the server policy lookup composes cleanly alongside it. - `POST /graphs` (PR 7) and `DELETE /graphs/{id}` (out of scope for v0.7.0). - CLI `omnigraph graphs list` (PR 8 will add). Result: 215 server tests green (74 lib + 66 openapi + 75 integration), 11 policy tests green. MR-731 spoof regression preserved across all this work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: POST /graphs runtime create endpoint (PR 7/10) PR 7 of the MR-668 multi-graph server work. Operators can now add a graph to a running multi-graph server without restarting: curl -X POST http://server/graphs \ -H "Content-Type: application/json" \ -d '{ "graph_id": "beta", "uri": "/data/beta.omni", "schema": { "source": "node Person { name: String @key }\n" }, "policy": { "file": "./policies/beta.yaml" } }' DELETE remains deferred (out of v0.7.0 scope per the trimmed plan — no `delete_prefix`, no tombstones). Body shape (decision 7): - Nested `schema: { source: "..." }` (mirrors the `policy: { file }` pattern; leaves room for future fields without breakage). - Optional nested `policy: { file: "..." }` for per-graph Cedar. - 32 MiB body limit (reuses `INGEST_REQUEST_BODY_LIMIT_BYTES`). - Asymmetric with `SchemaApplyRequest` which keeps flat `schema_source: String` — documented in api.rs. Atomic YAML rewrite + drift detection: - New `config::rewrite_atomic(path, new_config, expected_hash)`: flock → re-read + hash check → serialize → write `.tmp` → fsync → rename → fsync parent dir. 
Returns the new hash for the caller to update its in-memory baseline. - New `config::hash_config_file(path)` — SHA-256 of the on-disk bytes, used at startup and after each rewrite. - New `RewriteAtomicError { Drift \| Io \| Serialize }` enum. - `AppState.config_hash: Option<Arc<Mutex<[u8;32]>>>` carries the in-memory baseline. Updated after every successful rewrite so subsequent POSTs don't false-trigger drift. - The mutex is `std::sync::Mutex` (brief critical section, no .await inside). The flock itself serializes file access process-wide AND across multiple server instances (defense in depth). - All sync I/O runs inside `tokio::task::spawn_blocking` — flock is sync. Handler ordering (the load-bearing sequence): 1. Mode check: 405 in single mode. 2. Cedar authorize: `GraphCreate` against `Omnigraph::Server::"root"`. 3. Validate body: `GraphId::try_from` (regex + reserved-name), empty schema/uri checks, per-graph policy file parse. 4. Pre-check registry for duplicate graph_id / duplicate uri (409). 5. `Omnigraph::init` the new engine. 6. Atomic YAML rewrite (drift detection inside). 7. Publish in registry (atomic re-check via `GraphRegistry::insert`). Failure modes (documented in handler rustdoc): - Init fails → orphan storage at `req.uri` (PR 2a cleans up schema files; Lance datasets remain orphans until `delete_prefix` lands). - YAML rewrite fails (drift, IO) → orphan storage; YAML unchanged. - Registry insert fails (race) → YAML has entry but registry doesn't; next restart opens it cleanly. New dependency: `fs2 = "0.4"` (workspace + omnigraph-server). POSIX-only file locking. Linux/macOS deployment supported; Windows out of scope. Tests (10 new in `tests/server.rs::multi_graph_startup`): - `post_graphs_creates_a_new_graph_end_to_end` — happy path, includes YAML inspection to confirm the rewrite landed. - `post_graphs_baseline_hash_updates_between_rewrites` — two POSTs in a row both succeed (drift baseline updates correctly). 
- `post_graphs_duplicate_graph_id_returns_409` - `post_graphs_duplicate_uri_returns_409` - `post_graphs_invalid_graph_id_returns_400` (reserved name) - `post_graphs_empty_schema_source_returns_400` - `post_graphs_returns_405_in_single_mode` - `post_graphs_yaml_drift_detection_returns_503` — operator hand-edits omnigraph.yaml; server refuses to clobber. - `hash_config_file_is_deterministic_and_detects_changes` - `rewrite_atomic_refuses_when_hash_drifts` OpenAPI: `server_graphs_create` registered in `ApiDoc::paths(...)`; openapi.json regenerated. Result: 225 server tests green (74 lib + 66 openapi + 85 integration), all MR-731 regressions still pinned. LOC: ~580 lib.rs net (handler + helpers), ~120 config.rs (rewrite machinery), +71 api.rs (request/response shapes), +332 tests/server.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: CLI omnigraph graphs list/create (PR 8/10) PR 8 of the MR-668 multi-graph server work. CLI parity for the v0.7.0 management surface: operators can now manage graphs from the command line against a running multi-graph server. omnigraph graphs list --target dev --json omnigraph graphs create \ --target dev \ --graph-id beta \ --graph-uri /data/beta.omni \ --schema schema.pg DELETE is intentionally absent — server-side DELETE was deferred from v0.7.0 scope, and shipping a client subcommand for a server endpoint that doesn't exist would be dead vocabulary. The help output, the subcommand enum, and the test that pins it (`graphs_subcommand_help_lists_list_and_create`) all agree. CLI architecture (modeled on `BranchCommand`): - New `Command::Graphs { command: GraphsCommand }` top-level variant. - `GraphsCommand { List, Create }` enum. - List: GET `<base>/graphs`. Stdout is `<graph_id>\t<uri>` per line, or JSON via `--json`. - Create: reads `--schema <path>` from local disk, inlines as `schema: { source: <file> }` in the POST body (nested per MR-668 decision 7). 
Optional `--policy-file <path>` becomes `policy: { file: <path> }`. Returns 201 → "created graph X at Y" or JSON via `--json`. - Both subcommands reject local URI targets with a clear "remote multi-graph server URL" error. New API type imports in the CLI: `GraphCreateRequest`, `GraphCreateResponse`, `GraphListResponse`, `GraphSchemaSpec`, `GraphPolicySpec` — all from `omnigraph-server::api`. Tests: - cli.rs (4 new, non-network): * `graphs_subcommand_help_lists_list_and_create` — pins the deferral of `delete` (catches scope creep). * `graphs_list_against_local_uri_errors_with_remote_only_message` * `graphs_create_against_local_uri_errors_with_remote_only_message` * `graphs_create_with_missing_schema_file_errors` — pins the IO context in the schema-read error path. - system_remote.rs (1 new, `#[ignore]` like its peers): * `graphs_list_and_create_against_multi_graph_server` — spawns a multi-mode server, calls `graphs list` (sees `alpha`), `graphs create` (adds `beta`), `graphs list` again (sees both), and confirms the new graph is reachable via its cluster route. CLI suite: 62 tests green (58 existing + 4 new). The new ignored end-to-end test runs locally with `cargo test --ignored`. LOC: +159 main.rs (enum + handlers), +88 cli.rs (unit tests), +131 system_remote.rs (integration test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10) PR 9 — the final integration PR for MR-668 multi-graph server work. Closes the v0.7.0 release. Composite lifecycle tests (closes gaps flagged in PR 7's coverage review): - `multi_graph_lifecycle_post_query_restart_persistence` — POST a graph, query it via cluster route, reload the config from disk and confirm `load_server_settings` sees the rewritten YAML. Validates the "restart resolves orphans" failure-mode story. 
- `per_graph_policy_enforced_on_post_created_graph` — POST a graph with a per-graph policy attached, then send authenticated read and change requests. Per-graph Cedar enforcement fires correctly on a POST-created graph (engine-layer policy reinstalled via `Omnigraph::with_policy` inside the create flow). - `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent POSTs with distinct graph_ids all return 201. Caught a real race in `rewrite_atomic` (see below). Race fix — `rewrite_atomic_with_modify`: The first composite test surfaced a real bug. The old `rewrite_atomic(path, new_config, expected_hash)` captured the baseline hash OUTSIDE the flock, then called rewrite_atomic which re-acquired it inside. Under concurrent writers: - POST A: captures baseline H0, calls rewrite_atomic. - POST B: captures baseline H0 too (before A's update lands). - A: acquires flock, on-disk == H0, writes H1, releases. - A: updates baseline H0 → H1. - B: tries to acquire flock — waits. - B: acquires flock. On-disk is now H1. Expected (captured before A finished) is H0. MISMATCH → spurious Drift error. Worse: even if the timing happens to align, B's `updated` config was constructed from BYTES read before the flock. B writes a config that doesn't include A's new graph — silent data loss. The fix: new `config::rewrite_atomic_with_modify(path, baseline, modify)` takes a closure. Inside the flock + baseline mutex: 1. Read on-disk bytes, hash, compare to baseline. 2. Parse on-disk YAML. 3. Call `modify(parsed)` to produce the new config — receives fresh on-disk state, returns the modification. 4. Serialize + write + fsync + rename + update baseline. Everything is read-modify-write under the same critical section. Concurrent writers serialize cleanly. Test confirmed this is no longer a race. The old `rewrite_atomic(path, new_config, expected_hash)` API stays for tests that don't need the read-modify-write shape; the POST handler switches to the new shape. 
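Illustrative sketch of the closure-based read-modify-write shape described above. It is an in-memory model only: one `Mutex` stands in for both the flock and the baseline mutex, a `String` stands in for the YAML file, and `DefaultHasher` stands in for SHA-256. `ConfigStore`, `rewrite_with_modify`, and `hand_edit` are hypothetical names; the sketch does not model temp files, rename, or fsync.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Debug, PartialEq)]
enum RewriteError {
    Drift,
}

struct State {
    on_disk: String, // pretend file contents, one graph id per line
    baseline: u64,   // hash of the contents this process last wrote
}

/// Hypothetical stand-in for the config file plus its drift baseline.
struct ConfigStore {
    inner: Mutex<State>,
}

fn hash_str(s: &str) -> u64 {
    // Assumption: a non-cryptographic hash is enough for the illustration.
    let mut hasher = DefaultHasher::new();
    s.hash(&mut hasher);
    hasher.finish()
}

impl ConfigStore {
    fn new(initial: &str) -> Self {
        let state = State { on_disk: initial.to_string(), baseline: hash_str(initial) };
        Self { inner: Mutex::new(state) }
    }

    /// Hash check, parse, modify, and write all happen under one lock.
    fn rewrite_with_modify<F>(&self, modify: F) -> Result<(), RewriteError>
    where
        F: FnOnce(Vec<String>) -> Vec<String>,
    {
        let mut state = self.inner.lock().unwrap();
        // 1. Compare the current contents against the baseline.
        if hash_str(&state.on_disk) != state.baseline {
            return Err(RewriteError::Drift);
        }
        // 2. Parse the fresh contents, never a copy read before the lock.
        let parsed: Vec<String> = state.on_disk.lines().map(str::to_string).collect();
        // 3. The caller's closure sees the fresh state and returns the update.
        let updated = modify(parsed).join("\n");
        // 4. Write and move the baseline forward in the same critical section.
        let new_hash = hash_str(&updated);
        state.on_disk = updated;
        state.baseline = new_hash;
        Ok(())
    }

    /// Simulates an operator editing the file behind the server's back.
    fn hand_edit(&self, contents: &str) {
        self.inner.lock().unwrap().on_disk = contents.to_string();
    }

    fn contents(&self) -> String {
        self.inner.lock().unwrap().on_disk.clone()
    }
}

fn main() {
    // Four concurrent writers with distinct ids; none should see drift.
    let store = Arc::new(ConfigStore::new("alpha"));
    let workers: Vec<_> = (0..4)
        .map(|i| {
            let s = Arc::clone(&store);
            thread::spawn(move || {
                s.rewrite_with_modify(|mut graphs| {
                    graphs.push(format!("graph-{i}"));
                    graphs
                })
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap().expect("no spurious drift");
    }
    println!("{} graphs recorded", store.contents().lines().count());
}
```

Because the closure receives the state parsed inside the lock, a second writer always builds on the first writer's result, which removes both the spurious `Drift` and the lost-update case from the trace above.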
Version bump v0.6.0 → v0.7.0: - All 5 `crates//Cargo.toml` (compiler, engine, policy, cli, server) plus their inter-crate `path` dep version constraints. - `Cargo.lock` regenerated by `cargo build --workspace`. - `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server row updated to mention multi-graph + cluster routes + atomic YAML rewrite. - `openapi.json` regenerated. Docs: - `docs/releases/v0.7.0.md` (new) — release notes with breaking changes, new features, deferred items (DELETE, `delete_prefix`, actor forwarding), and the single→multi migration recipe. - `docs/user/server.md` — substantial section additions for the two modes, mode inference, cluster endpoint table, management endpoints, `omnigraph.yaml` ownership contract, `POST /graphs` body shape + status codes. - `docs/user/cli.md` — `omnigraph graphs list/create` section, deferred-DELETE note. - `docs/user/policy.md` — server-scoped Cedar actions (`graph_create`, `graph_list`), per-graph vs server-level policy composition, example server-level policy. Workspace test pass: 573 tests green across all crates. Zero failures. MR-731 spoof regression still pinned and passing across the entire 10-PR series. This commit closes MR-668. v0.7.0 is ready for tagging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt) The POST /graphs runtime-create endpoint shipped in PR 7/10 has three unresolved high-severity bugs: - flock-on-renamed-inode race: the YAML flock is taken on omnigraph.yaml itself, then a temp file is renamed over it. Cross-process writers end up locking different inodes — both believing they hold exclusive access. - duplicate-check outside the file lock: precheck runs against the in-memory registry only; the locked closure does config.graphs.insert(...) unconditionally. 
Concurrent same-id POSTs can persist the loser in YAML while the in-memory registry keeps the winner — they disagree after restart. - best_effort_cleanup_init_artifacts deletes _schema.pg / _schema.ir.json / __schema_state.json on any init failure. An accidental re-init against an existing graph's URI destroys its schema; subsequent open() fails at read_text(_schema.pg). The correct fix is a Lance-style cluster catalog (reserve → init → publish with recovery sidecars), parallel to the engine's existing __manifest discipline. That work is out of scope for v0.7.0. For now, disable runtime add/remove from the network and CLI surface. Operators add graphs by editing omnigraph.yaml and restarting. The GET /graphs read-only enumeration stays. Removed: - POST /graphs handler + router fragment + utoipa registration - 13 post_graphs_* server tests + 3 composite POST tests + multi_mode_app_with_real_config / post_graph helpers - CLI omnigraph graphs create subcommand + its handler + cli.rs tests - system_remote.rs combined list+create test trimmed to list-only - YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError, staging_path, hash_config_file, AppState::config_hash field + threading through new_multi and open_multi_graph_state - fs2 dependency (verified absent from cargo tree) - sha2/fs2 imports in config.rs (only the rewrite path used them) - Cedar PolicyAction::GraphCreate variant + "graph_create" match arms + action def in Cedar schema + graph_create_action_authorizes_against_server_resource test - GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec / GraphPolicySpec API types (only the POST handler / CLI imported them) Kept: - GET /graphs (read-only enumeration) and graph_list Cedar action - omnigraph graphs list CLI subcommand - All multi-graph startup, mode inference, cluster routes, per-graph + server-level Cedar policies - server_settings_drive_multi_graph_startup_end_to_end (the test that covers operator-authored YAML + restart — the path 
that survives) - best_effort_cleanup_init_artifacts and the three init failpoints (still reachable from CLI `omnigraph init`; preflight fix deferred as a follow-up) - GraphRegistry::insert and its concurrency tests — production callers gone, but the method is the natural seam for the future cluster-catalog work Also fixed (transcript issue 4): - ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI advertises the management route correctly (was previously rewritten to /graphs/{graph_id}/graphs) - multi_mode_openapi_keeps_healthz_flat → renamed to multi_mode_openapi_keeps_management_paths_flat, asserts both /healthz and /graphs stay flat - multi_mode_openapi_prefixes_operation_ids_with_cluster skips /graphs in addition to /healthz Doc fixes: - docs/user/cli.md: graphs list example was --target http://..., but --target is a config-graph-name lookup; corrected to --uri. Removed the graphs create example. - docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml ownership", and "POST /graphs body shape" sections. Added a paragraph stating runtime add/remove is not exposed in v0.7.0. - docs/user/policy.md: dropped graph_create action; reworded the "Configuration" line to clarify that server-scoped rules (graph_list) take neither branch_scope nor target_branch_scope. - docs/releases/v0.7.0.md: rewrote release narrative — multi-graph mode ships; runtime add/remove deferred. - AGENTS.md: HTTP server bullet and capability matrix row updated to reflect read-only GET /graphs and the operator-edit workflow. - openapi.json regenerated; /graphs has only .get, no .post. Diff: 17 files, +123 −1525 LOC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: comment cleanup and policy format style Strip "PR Na/Nb" sub-PR references throughout MR-668 surfaces — they were useful during the 10-PR delivery sequence but rot now that the work is in the tree. Keep the MR-668 umbrella references. 
Also: - Add explicit `when = when` and `resource_literal = resource_literal` named args in `compile_policy_source`'s outer `format!` to match the surrounding crate style (already explicit for `group` and `action`). - Rename the best-effort cleanup tracing target from "omnigraph::init" to "omnigraph::init::cleanup" so operators can filter init-failure cleanup events separately from init's other log lines. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop actor_id from PolicyRequest; pass actor as separate arg The MR-731 "server-authoritative actor identity" invariant was enforced by an in-function chokepoint (`request.actor_id = actor.actor_id...` overwrite inside `authorize_request`). That worked but relied on every caller passing in a `PolicyRequest` and trusting the overwrite — a comment-enforced invariant. Move the invariant into the type system: * `PolicyRequest` no longer carries `actor_id`. The struct now models what a caller wants to do, not who they are. * `PolicyEngine::authorize(actor_id: &str, request: &PolicyRequest)` and `validate_request(actor_id, request)` take identity as a separate argument. The same shape `PolicyChecker::check` already had for the engine layer. * `authorize_request` in the HTTP layer extracts `actor_id` from the bearer-resolved `ResolvedActor` and passes it positionally — no overwrite step that could be skipped. * CLI `omnigraph policy explain` updated (the only other consumer that built a `PolicyRequest`). Public API break for the `omnigraph-policy` crate. Worth it: handlers can no longer accidentally populate `actor_id` from a request body field, and external consumers are forced by the compiler to source actor identity from a trusted path. The MR-731 chokepoint test `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` still passes — the bearer-resolved actor is what reaches the engine. 
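A rough sketch of the shape this describes, assuming a toy allow-list engine: `PolicyRequest`, `ResolvedActor`, `PolicyEngine::authorize`, and `authorize_request` are names from the message, while the fields and the allow-list logic are hypothetical placeholders for the real Cedar evaluation.

```rust
/// What the caller wants to do. Deliberately has no `actor_id` field.
struct PolicyRequest {
    action: String, // hypothetical field for illustration
}

/// Identity the server resolved from the bearer token.
struct ResolvedActor {
    actor_id: String,
}

/// Toy engine (assumption): a list of permitted (actor, action) pairs.
struct PolicyEngine {
    allowed: Vec<(String, String)>,
}

impl PolicyEngine {
    /// Identity arrives as its own argument, never inside the request.
    fn authorize(&self, actor_id: &str, request: &PolicyRequest) -> Result<(), String> {
        let permitted = self
            .allowed
            .iter()
            .any(|(actor, action)| actor.as_str() == actor_id && action.as_str() == request.action.as_str());
        if permitted {
            Ok(())
        } else {
            Err(format!("{} may not {}", actor_id, request.action))
        }
    }
}

/// HTTP-layer helper: the only source of `actor_id` is the resolved actor.
fn authorize_request(
    engine: &PolicyEngine,
    actor: &ResolvedActor,
    request: &PolicyRequest,
) -> Result<(), String> {
    engine.authorize(&actor.actor_id, request)
}
```

With no `actor_id` on the request struct, a handler has nowhere to put a client-supplied identity, so there is no overwrite step to forget.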
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: consolidate AppState single-mode constructors; delete with_policy_engine The prior `with_policy_engine` constructor reused the engine `Arc` from the existing handle (`engine: Arc::clone(&existing.engine)`) without re-applying `Omnigraph::with_policy`. Combined with `new_with_workload`, the documented composition pattern was `AppState::new_with_workload(...).with_policy_engine(p)` — which produced an `AppState` whose HTTP layer enforced Cedar but whose underlying engine had no `PolicyChecker` installed. Any caller reaching the engine via `state.registry().list()[i].engine` could bypass policy entirely. The doc comment named this gap; the type system didn't. Make composition impossible to get wrong: * Add `AppState::new_single(uri, db, tokens, Option<PolicyEngine>, WorkloadController)` — canonical single-mode constructor that takes every option together and routes through `build_single_mode` (which applies `db.with_policy(checker)` to the engine itself). * `new`, `new_with_bearer_token`, `new_with_bearer_tokens`, `new_with_bearer_tokens_and_policy`, `new_with_workload` all become thin wrappers around `new_single`. * Delete `with_policy_engine`. There is no post-construction policy install path any more; the single linear construction forces HTTP-layer and engine-layer policy to install together or not at all. Regression test `engine_layer_policy_fires_via_direct_arc_omnigraph_from_new_single` constructs an `AppState::new_single` with a deny-all policy, pulls the `Arc<Omnigraph>` from the registry handle (the same path an embedded SDK consumer would take), and asserts a direct `mutate_as` call returns `OmniError::Policy`. Pre-fix this test would have succeeded the mutation. Test caller in `ingest_per_actor_admission_cap_returns_429` migrates from `.with_policy_engine(...)` to `new_single(..., Some(policy_engine), workload)`. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: derive any_per_graph_policy on RegistrySnapshot; simplify dup check

`AppState::requires_bearer_auth` walked the entire registry per request (cloning Arcs into a `Vec`, then `.iter().any(|h| h.policy.is_some())`) to decide whether the auth middleware should challenge. The walk is unnecessary: the answer only changes when the registry mutates, which is exactly the moment a new snapshot is constructed.

Move the flag onto the snapshot itself:

* `RegistrySnapshot { graphs, any_per_graph_policy: bool }`.
* `RegistrySnapshot::new(graphs)` is the only construction path; it derives the flag from `graphs.values().any(|h| h.policy.is_some())` so the cached value can't drift from the source data.
* `Default` delegates to `new(HashMap::new())`.
* `GraphRegistry::from_handles` and `insert` build snapshots via `RegistrySnapshot::new(...)`.
* `GraphRegistry::snapshot_ref()` exposes the current snapshot through an `arc_swap::Guard`; callers that need cached derived state go through this accessor (callers that only want `graphs` still use `list` / `get`).

`requires_bearer_auth` becomes one `ArcSwap::load` + bool read.

Also (drive-by, same file, same hunk): replace the dead `if let Some(other) = seen_uris.get(...)` + `let _ = other;` pattern in `from_handles` with a plain `seen_uris.contains_key(...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: fail-fast multi-graph startup with try_collect

The `open_multi_graph_state` doc comment claims "Fail-fast — the first open error aborts startup; other in-flight opens are dropped" but the code did

    .buffer_unordered(4)
    .collect::<Vec<_>>()
    .await
    .into_iter()
    .collect::<Result<Vec<_>>>()?;

which drains every future in the stream before propagating the first `Err`. With N S3-backed graphs and graph #2 failing fast, the caller still waits for #1, #3, #4, … to either succeed or fail before seeing the error.
Replace the four-line dance with `futures::TryStreamExt::try_collect`, which short-circuits on the first `Err` and drops the rest. The doc comment now matches behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop unused State extractor from 7 read-only handlers After the routing-middleware refactor moved the engine into the per-graph `GraphHandle` (extracted via `Extension<Arc<GraphHandle>>`), seven read-only handlers — `server_snapshot`, `server_read`, `server_export`, `server_schema_get`, `server_branch_list`, `server_commit_list`, `server_commit_show` — kept an unused `State(_state): State<AppState>` extractor. Drop it. Each request avoids one `FromRequestParts` clone of `AppState`'s Arcs. Handlers that actually use state (workload admission for write paths, `server_policy` for management endpoints) keep theirs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: emit info! for graph routing decision `tracing::Span::current().record("graph_id", ...)` in the routing middleware silently no-ops here: no upstream `#[tracing::instrument]` on the handlers declares a `graph_id` field, and `TraceLayer::new_for_http` doesn't either. The recorded value never lands anywhere visible. Replace with an explicit `info!(graph_id = %handle.key.graph_id, "graph routed")` event so operators can grep logs and correlate requests with the active graph. In single mode the value is the sentinel `"default"`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: align GET /graphs 405 body code with HTTP status The single-mode `GET /graphs` handler returned an `ApiError` built via struct literal with `status: METHOD_NOT_ALLOWED, code: BadRequest`. The body code disagreed with the HTTP status — clients deserializing on `code` saw `bad_request`, clients deserializing on `status` saw 405. Same bug class as the earlier 503+Conflict mismatch on the removed YAML drift path. 
Close the class for this one remaining instance: * Add `ErrorCode::MethodNotAllowed` to the API enum. * Add `ApiError::method_not_allowed(msg)` — pairs the 405 status with the matching code. * Replace the struct literal in `server_graphs_list` with the constructor. * Regenerate `openapi.json` (adds `method_not_allowed` to the ErrorCode schema enum). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop unused axum::handler::Handler import The import landed in earlier work but no current call site uses it. Emitted an `unused_imports` warning on every server build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop unused fs2 workspace dependency `fs2 = "0.4"` lingered in [workspace.dependencies] after the POST /graphs flock-on-rename design was pulled. `cargo tree -i fs2` reports no consumers in the workspace and the dep is not in Cargo.lock. Removing the declaration closes the "phantom dep" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: AGENTS.md Cedar row no longer hardcodes action count The "8 actions" claim drifted as soon as MR-668 added `graph_list`. Bumping the count would just push the drift one PR forward; the correct-by-design fix is to defer to the canonical list in docs/user/policy.md and stop maintaining a duplicate count. Closes the "doc hardcodes a count that drifts from the enum" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: cfg(test)-gate GraphRegistry::insert and its mutex `insert` and the `mutate: Mutex<()>` that serializes it had no runtime consumer in v0.7.0 — the only insertion path at startup is `from_handles`, and runtime add/remove is deferred until a managed cluster catalog ships. Leaving both `pub` and live made them a "looks like API, isn't" footgun: a future change could build on `insert` without re-establishing the concurrency contract with an actual consumer in scope. 
Gate both together (`#[cfg(test)]` on the method, the field, and the `tokio::sync::Mutex` import) so the race-pinning tests still compile but production cannot reach them. When a real consumer ships, ungate both — they're a unit. Closes the "public API with no runtime consumer drifts toward incorrect" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: drop vestigial PolicyEngine surface * `validate_request` had zero callsites — pure surface for nothing. * `deny`'s `_actor_id` and `_request` parameters were both unused (the underscore prefix gave it away); the message is built by the caller before `deny` ever sees the request. Trim both. Closes the "public API that the type system can't justify" class for the policy engine. No behavior change; every existing test stays green because the deletions never had a runtime effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: regression test for init re-init footgun (red) A second `Omnigraph::init` against an existing graph URI today destroys the existing graph's schema artifacts. `init_storage_phase` overwrites `_schema.pg` before any preflight, and on the inner `GraphCoordinator::init` failure that follows, `best_effort_cleanup_init_artifacts` deletes all three schema files. The existing Lance datasets and `__manifest/` survive but the schema metadata is gone — unrecoverable without operator surgery. This test exercises that path and currently fails with "_schema.pg must not be deleted by a failed re-init", confirming the destructive cleanup branch fires. The fix in the next commit makes the test pass by preflighting with `storage.exists()` and returning a typed error before any write touches disk. Per AGENTS.md rule 12, the test commit lands just before the fix commit so the red → green pair is visible in `git log` and a reviewer can check out this commit alone to reproduce. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: close init re-init footgun via InitOptions preflight (green) `Omnigraph::init` is "create a new graph"; existing graphs need an explicit overwrite. Today's behavior — silently overwrite schema files, then on inner failure delete them via best-effort cleanup — is destructive against an existing graph regardless of which branch fires. Correct-by-design fix: * New `InitOptions { force: bool }` struct (default `force: false`). * New `Omnigraph::init_with_options(uri, schema, options)`. The old `Omnigraph::init(uri, schema)` is a thin shortcut that passes `InitOptions::default()`. * `init_with_storage` runs a `storage.exists()` preflight on the three schema URIs BEFORE any parse, write, or coordinator call. Any hit → typed `OmniError::AlreadyInitialized { uri }`. The destructive code paths (the `write_text` overwrite and the best-effort cleanup) are now unreachable in strict mode against an existing graph. * `force: true` skips the preflight; existing operators who actually mean to overwrite opt in explicitly. * CLI: `omnigraph init --force` maps to `InitOptions { force: true }`. * HTTP: `OmniError::AlreadyInitialized` maps to 409 via `ApiError::from_omni`. Not currently HTTP-reachable (POST /graphs was pulled), but the wiring lands here so a future runtime create endpoint has one canonical translation. Closes the "init is destructive against existing state" class. The regression test added in the previous commit (`init_on_existing_graph_uri_does_not_destroy_existing_schema`) turns green: the original schema files now survive a second init attempt byte-for-byte, and the call errors cleanly with `AlreadyInitialized`. The four existing `init_failpoint_after__cleans_up_` tests stay green — strict mode's preflight passes on a fresh tempdir, and cleanup still runs as before when a failpoint fires mid-write. 
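The preflight described above could look roughly like this. It is a sketch under stated assumptions: `InitOptions`, `init_with_options`, and `OmniError::AlreadyInitialized` are named in the message, but the in-memory `MemStorage`, the function signatures, and writing only `_schema.pg` are simplifications invented for the example.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, Default)]
struct InitOptions {
    force: bool, // defaults to false, i.e. strict mode
}

#[derive(Debug, PartialEq)]
enum OmniError {
    AlreadyInitialized { uri: String },
}

/// Hypothetical in-memory storage keyed by full path.
#[derive(Default)]
struct MemStorage {
    files: HashMap<String, String>,
}

impl MemStorage {
    fn exists(&self, path: &str) -> bool {
        self.files.contains_key(path)
    }
    fn write_text(&mut self, path: String, text: &str) {
        self.files.insert(path, text.to_string());
    }
}

/// The three schema artifacts named in the commit message.
const SCHEMA_FILES: [&str; 3] = ["_schema.pg", "_schema.ir.json", "__schema_state.json"];

fn init_with_options(
    storage: &mut MemStorage,
    uri: &str,
    schema: &str,
    options: InitOptions,
) -> Result<(), OmniError> {
    // Preflight runs before any write, so a strict re-init cannot clobber files.
    if !options.force {
        for name in SCHEMA_FILES.iter() {
            if storage.exists(&format!("{}/{}", uri, name)) {
                return Err(OmniError::AlreadyInitialized { uri: uri.to_string() });
            }
        }
    }
    // Simplification: only one of the three artifacts is written here.
    storage.write_text(format!("{}/_schema.pg", uri), schema);
    Ok(())
}

/// Thin shortcut that passes the default (strict) options.
fn init(storage: &mut MemStorage, uri: &str, schema: &str) -> Result<(), OmniError> {
    init_with_options(storage, uri, schema, InitOptions::default())
}
```

The point of the ordering is that the typed error is returned before the first write, which leaves the cleanup path with nothing of the existing graph to delete.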
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: split PolicyEngine::load into kind-typed loaders Pre-fix, every caller of `PolicyEngine::load(path, graph_id)` passed some `graph_id` argument — even when the policy was server-scoped and Cedar's resolution would never touch a Graph entity. The server-level loader at lib.rs passed the meaningless sentinel `"server"`. A graph policy file containing a `graph_list` rule compiled fine; a server policy file containing a `read` rule compiled fine. Both silently no-op'd at request time because the engine kind and the rule's resource kind disagreed. Correct-by-design fix: replace `load` with two kind-typed loaders. * `PolicyEngine::load_graph(path, graph_id)` — for per-graph policy files. Rejects any rule whose action `resource_kind()` is `Server`. * `PolicyEngine::load_server(path)` — for server-level policy files. Takes no `graph_id`: server-scoped actions resolve against the singleton `Omnigraph::Server::"root"` entity, never a Graph. Rejects any rule whose action `resource_kind()` is `Graph`. The old `load` is hard-deleted in the same commit because every in-tree consumer migrates here (no semver promise on the workspace crate, no external pinners). New `PolicyEngineKind` enum types the loader's intent; `validate_kind_alignment` is the load-time check that closes the "wrong action, wrong file, silent no-op" class — operators get a load-time error instead of confused-and- silent behavior at request time. Callsites migrated: * server lib.rs:374 (single-mode per-graph) → load_graph * server lib.rs:1065 (multi-mode server) → load_server * server lib.rs:1103 (multi-mode per-graph) → load_graph * CLI main.rs:732 (resolve_policy_engine) → load_graph * tests/server.rs ×5 (4 graph, 1 server) → load_graph/load_server * policy_engine_chassis.rs → load_graph Four new in-source tests pin the contract: both rejection paths and both positive paths. 
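As an illustration of the load-time check, here is a minimal sketch. `PolicyEngineKind`, `validate_kind_alignment`, and `resource_kind()` are names from the message; the two-variant `PolicyAction` enum, the `PolicyResourceKind` shape, and the signature taking a slice of actions are assumptions standing in for real policy-file parsing.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PolicyResourceKind {
    Graph,
    Server,
}

/// Reduced action set for illustration (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PolicyAction {
    Read,      // per-graph action
    GraphList, // server-scoped action
}

impl PolicyAction {
    fn resource_kind(self) -> PolicyResourceKind {
        match self {
            PolicyAction::Read => PolicyResourceKind::Graph,
            PolicyAction::GraphList => PolicyResourceKind::Server,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PolicyEngineKind {
    Graph,
    Server,
}

/// Reject any rule whose action targets the other kind of resource.
fn validate_kind_alignment(
    kind: PolicyEngineKind,
    rule_actions: &[PolicyAction],
) -> Result<(), String> {
    let expected = match kind {
        PolicyEngineKind::Graph => PolicyResourceKind::Graph,
        PolicyEngineKind::Server => PolicyResourceKind::Server,
    };
    for action in rule_actions {
        if action.resource_kind() != expected {
            return Err(format!(
                "{:?} rule does not belong in a {:?} policy file",
                action, kind
            ));
        }
    }
    Ok(())
}
```

A `load_graph`-style loader would call this with `PolicyEngineKind::Graph` and a `load_server`-style loader with `PolicyEngineKind::Server`, so a misplaced rule fails at load time instead of silently never matching.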
Closes the "operator puts an action in the wrong file and the rule silently never matches" class. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: introduce GraphRouting, retire single_mode_handle Pre-fix, `AppState` always carried `Arc<GraphRegistry>` even when serving one graph. Single mode populated the registry with one handle keyed by the `SINGLE_GRAPH_KEY_ID = "default"` sentinel; `single_mode_handle` walked the registry, asserted `len == 1`, and returned the single element with a 500-class "programmer error" branch on mismatch. Three smells in a row — magic key, walk-and-assert, programmer-error guard — all because the single-mode runtime was forced through a multi-mode abstraction. Correct-by-design fix: type the routing. * New `pub enum GraphRouting { Single { handle }, Multi { registry, config_path } }` on `AppState`. The `Single` arm carries the handle directly — no registry, no key, no walk. * `resolve_graph_handle` middleware matches on `routing`. Single mode returns the handle in O(1); multi mode does the same path-extract + registry lookup as before. The 500-class programmer-error branch is gone — the type system now makes the violated invariant ("single mode has exactly one handle") unrepresentable. * `requires_bearer_auth` reads `handle.policy.is_some()` directly in the Single arm; Multi arm still uses the cached `any_per_graph_policy` flag. `ServerMode` and the legacy `registry` field on `AppState` are still populated for now — C-3 removes both once every reader is migrated. The `SINGLE_GRAPH_KEY_ID` sentinel and `ServerMode` will also go away in C-3. Closes the "single mode forced through a multi-mode abstraction" class. All 76 server integration tests stay green: handlers still extract `Extension<Arc<GraphHandle>>` from the request, so the middleware's internal change is invisible to them. 
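A simplified sketch of the typed routing, for orientation only. The `GraphRouting` variants and the two function names follow the message; the `GraphHandle` fields, the plain `HashMap` used as the registry, and computing the policy flag by scanning the map (rather than reading a cached `any_per_graph_policy`) are assumptions.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::sync::Arc;

/// Hypothetical, reduced handle: the real one carries the engine and policy.
struct GraphHandle {
    uri: String,
    has_policy: bool,
}

enum GraphRouting {
    /// Single mode carries the handle directly: no registry, no key, no walk.
    Single { handle: Arc<GraphHandle> },
    /// Multi mode looks handles up by graph id.
    Multi {
        registry: HashMap<String, Arc<GraphHandle>>,
        config_path: PathBuf,
    },
}

fn resolve_graph_handle(routing: &GraphRouting, graph_id: Option<&str>) -> Option<Arc<GraphHandle>> {
    match routing {
        // O(1): the only handle there is.
        GraphRouting::Single { handle } => Some(Arc::clone(handle)),
        // Path-extracted id, then registry lookup; a missing id yields None.
        GraphRouting::Multi { registry, .. } => registry.get(graph_id?).cloned(),
    }
}

fn requires_bearer_auth(routing: &GraphRouting) -> bool {
    match routing {
        GraphRouting::Single { handle } => handle.has_policy,
        // Simplification: scan instead of reading a cached flag.
        GraphRouting::Multi { registry, .. } => registry.values().any(|h| h.has_policy),
    }
}
```

Because `Single` holds exactly one handle by construction, the "single mode has exactly one handle" invariant needs no runtime assertion.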
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: remove ServerMode, registry field, and the SINGLE_GRAPH sentinel C-1/C-2 introduced `GraphRouting` and pointed the middleware at it. This commit removes the legacy shape that's now dead: * `ServerMode` enum — deleted. Single mode's `uri` lives on `handle.uri`; multi mode's `config_path` lives on the `GraphRouting::Multi` arm. * `AppState.mode: ServerMode` field — deleted. * `AppState.registry: Arc<GraphRegistry>` field — deleted. Multi mode's registry is on `GraphRouting::Multi { registry, .. }`; single mode has no registry at all. * `AppState::mode()`, `AppState::uri()`, `AppState::registry()` accessors — deleted. New `AppState::routing() -> &GraphRouting` is the single public entry point. * `SINGLE_GRAPH_KEY_ID` constant — deleted. `GraphHandle.key` is still required by the struct, but in single mode the key is now only a tracing label (`"default"`, inlined with a comment naming its sole remaining purpose). Single-mode flat routes never carry a `{graph_id}` parameter, so the key is never compared against user input, and there is no registry where it could be a map key. C-1/C-2 already removed the registry walk that the sentinel was named for. Callers migrated: * `build_app` (lib.rs:944) — matches on `state.routing()` instead of `state.mode()`. * `server_graphs_list` (lib.rs:1162) — destructures the Multi arm to get the registry; Single arm short-circuits to 405. * `server_openapi` (lib.rs:1217) — matches the Multi arm for the cluster-prefix rewrite. * `tests/server.rs:3735` — the B2 footgun regression test now matches on `state.routing()` to extract the single-mode handle (the test's earlier `state.registry().list().next()` shape was the closest pre-fix analog to "embedded consumer reaches the engine"; the new shape is more direct). Closes the entire "single mode forced through a multi-mode abstraction" class. After this commit: * No magic sentinel as a routing key. 
* No `single_mode_handle` walk-and-assert helper. * No 500-class "programmer error" branch in the middleware. * No two-field discriminant on `AppState` where one would do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: regression test for nested-route path extraction (red) `server_branch_delete` and `server_commit_show` use bare `Path<String>` extractors. In single-mode flat routes (`/branches/{branch}`, `/commits/{commit_id}`) this works — one capture, one value. In multi-graph cluster routes (`/graphs/{graph_id}/branches/{branch}`, `/graphs/{graph_id}/commits/{commit_id}`) axum 0.8 propagates the outer `{graph_id}` capture into the inner handler, so the extractor sees two captures and 500s with "Wrong number of path arguments. Expected 1 but got 2." `cluster_routes_dispatch_per_graph_handle` only exercises `/snapshot` (no Path extractor), so the regression slipped through. This test closes that gap structurally: every cluster route with an inner path param gets exercised here. Currently fails with the exact symptom above. Fix in the next commit makes it pass. Per AGENTS.md rule 12, the red test commit lands just before the fix so the pair is visible in `git log` and a reviewer can check out this commit alone to reproduce. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: named-field path-param structs for nested cluster routes (green) `Path<String>` deserializes one path-param value positionally. Single-mode flat routes (`/branches/{branch}`, `/commits/{commit_id}`) have one capture; multi-mode nested routes (`/graphs/{graph_id}/branches/{branch}`, `/graphs/{graph_id}/commits/{commit_id}`) have two — axum 0.8 propagates the outer capture into nested handlers. Same handler, two different shapes; the multi-mode shape 500s with "Wrong number of path arguments. Expected 1 but got 2." Symptomatic fix: change to `Path<(String, String)>` and ignore the first element. 
Breaks again the moment we add another nest layer (e.g. tenant in Cloud mode). Correct-by-design fix: named-field structs deserialized by name from axum's path-param map. Each handler picks only the fields it needs. Stable across single / multi / future-cloud nest depths because deserialization is by field name, not position. * New `BranchPath { branch: String }` (file-local to lib.rs) * New `CommitPath { commit_id: String }` * `server_branch_delete` extractor → `Path<BranchPath>` * `server_commit_show` extractor → `Path<CommitPath>` Closes the "handler path-extractor type is positional and breaks when route nesting changes" class. Red test from the previous commit turns green. All 77 server tests pass (single-mode branch delete + commit show, plus new multi-mode coverage). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: centralize policy-requires-tokens check in the runtime classifier Single-mode `open_with_bearer_tokens_and_policy` bailed at lib.rs:380 when policy was installed and no tokens. Multi-mode `open_multi_graph_state` had no equivalent: the server started, every request 401'd because no token could ever match, and the operator spent time debugging a misconfiguration the single-mode path would have caught at startup. The doc/code contradiction made the gap easy to miss: the `ServerRuntimeState::PolicyEnabled` docstring said tokens-or-not was "unusual but valid — every request fails 401 without a bearer, which is effectively 'locked'." The single-mode bail contradicted that. In practice, silent-401-on-every-request is bug-shaped, not feature-shaped (operators wanting deny-all should configure tokens plus a deny-all Cedar rule to get meaningful 403s with policy-decision logging). Symptomatic fix: add a copy of the bail to multi-mode. Two copies that can drift again the next time a startup path is added. 
Correct-by-design fix: hoist the check into `classify_server_runtime_state` so both modes get the same enforcement from one source of truth. The classifier becomes the single source of truth for "should we start?" and adding a startup invariant there is now the natural extension point for any future mode.

Classifier matrix is now complete:

| has_tokens | has_policy | allow_unauthenticated | Result |
|---|---|---|---|
| F | F | F | bail (existing) |
| F | F | T | Open (existing) |
| T | F | * | DefaultDeny (existing) |
| F | T | * | bail (NEW: closes the gap) |
| T | T | * | PolicyEnabled (existing) |

Changes:

* `classify_server_runtime_state` (lib.rs:870-890) gains the `(false, true, _) => bail!(…)` arm with a clear message naming the failure mode and the two valid resolutions.
* `open_with_bearer_tokens_and_policy` (lib.rs:369+) drops its redundant local bail; the classifier rejected the invalid case before construction was reached.
* `ServerRuntimeState::PolicyEnabled` docstring is rewritten: drops the "(unusual but valid)" carve-out and states plainly that PolicyEnabled requires tokens. Names the explicit alternative (tokens + deny-all Cedar rule) for operators who want the all-requests-denied behavior.
* `classify_policy_enabled_always_wins` test is renamed to `classify_policy_enabled_requires_tokens` and the now-invalid `(false, true, _)` assertion is removed (covered by the new rejection test).
* New `classify_policy_without_tokens_is_rejected` test covers the new arm.
* New `serve_refuses_to_start_with_policy_but_no_tokens_multi_mode` integration test pins the multi-mode propagation path, symmetric with the existing single-mode `serve_refuses_to_start_in_state_1_without_unauthenticated`.

Closes the "single mode and multi mode startup branches can drift on safety invariants" class.
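The matrix maps directly onto a tuple match. The following is a sketch under assumptions: the function name and the `Open` / `DefaultDeny` / `PolicyEnabled` results come from the message, while the three-bool signature and the `String` error (standing in for the `bail!` in the real code) are illustrative.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ServerRuntimeState {
    Open,
    DefaultDeny,
    PolicyEnabled,
}

/// One place that answers "should we start, and in which state?".
fn classify_server_runtime_state(
    has_tokens: bool,
    has_policy: bool,
    allow_unauthenticated: bool,
) -> Result<ServerRuntimeState, String> {
    match (has_tokens, has_policy, allow_unauthenticated) {
        // No tokens, no policy, no explicit opt-in: refuse to start.
        (false, false, false) => {
            Err("no tokens or policy configured; pass the unauthenticated opt-in to run open".to_string())
        }
        (false, false, true) => Ok(ServerRuntimeState::Open),
        (true, false, _) => Ok(ServerRuntimeState::DefaultDeny),
        // The new arm: policy without tokens would 401 every request.
        (false, true, _) => {
            Err("policy configured without bearer tokens; add tokens or remove the policy".to_string())
        }
        (true, true, _) => Ok(ServerRuntimeState::PolicyEnabled),
    }
}
```

Since the match is exhaustive over the tuple, a future startup path that calls the classifier inherits every arm, including the rejection.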
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* mr-668: close coverage gaps surfaced by the test-coverage audit

The bot-review pass and the subsequent coverage audit surfaced two material gaps in PR #119's test surface, both easy to close and both worth closing before merge.

* Gap 1: cluster-route sweep. The Bug-1 path-extractor regression slipped through because `cluster_routes_dispatch_per_graph_handle` only exercised `/snapshot`. The other seven protected cluster routes (`/read`, `/change`, `/export`, `/schema`, `/schema/apply`, `/ingest`, `/branches/merge`) were implicitly trusted to work without any multi-mode integration test.

  Add `all_protected_cluster_routes_resolve_to_their_handler` (`tests/server.rs`), which hits each protected cluster route with a minimal request and asserts the response is consistent with the handler being reached: no 404 (router didn't match), no 500 with "Wrong number of path arguments" (Bug-1 class), no 500 with "missing extension" (routing middleware didn't inject the handle). Status code is a negative assertion because each handler's happy-path inputs differ; what matters is "the request reached the handler," not "the handler returned 200", which is already pinned by the single-mode tests.

* Gap 2: `--force` happy path. The strict re-init regression test (`init_on_existing_graph_uri_does_not_destroy_existing_schema`) pins the error path; nothing pinned the `force: true` escape hatch actually doing what its docstring claims.

  Add `init_with_force_recovers_from_orphan_schema_files` (`tests/lifecycle.rs`). It writes a bare `_schema.pg` to simulate orphan files from a failed prior init, confirms strict mode bails as expected, then confirms `init_with_options(force: true)` succeeds and produces a functional graph.

  Note: the test follows the documented semantics. Force skips the preflight only; it does NOT purge existing Lance state.
An earlier draft of the test (against full overwrite of an existing populated graph) failed because `GraphCoordinator::init` errored on the existing `__manifest`, which is exactly the limitation the `InitOptions::force` docstring already calls out. Recursive purge needs `StorageAdapter::delete_prefix` (tracked separately). Coverage is now fully aligned with the PR's claims. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: regression test for GraphList open-mode bypass (red) Cursor bot's review at commit `4120448` surfaced that `server_graphs_list` returns 200 in Open mode (`--unauthenticated`, no tokens, no policy), exposing the full graph registry — graph IDs and URIs that may contain S3 bucket paths or internal hostnames — to any unauthenticated caller. Root cause: `authorize_request`'s no-policy fallback only denies when `actor.is_some()`. In Open mode `actor: None`, so the denial branch never fires and the call returns `Ok(())`. The docstring on `server_graphs_list` claims the endpoint is "Cedar-gated" and that we "don't leak the registry until the operator explicitly authorizes it" — but Open mode has no Cedar at all, so the docstring intent and the code disagree. This commit renames the existing `get_graphs_lists_registered_graphs_in_multi_mode` test to `get_graphs_denied_in_open_mode_without_server_policy` and flips the assertion from 200 → 403. Today this fails (server returns 200) — exactly the symptom the bot named. The fix in the next commit tightens the no-policy fallback to deny server-scoped actions unconditionally, regardless of mode. Per AGENTS.md rule 12, the red test commit lands just before the fix so the red → green pair is visible in `git log` and a reviewer can check out this commit alone to reproduce. 
Sort-order coverage that previously lived in the renamed test moves to `get_graphs_with_server_policy_authorizes_per_cedar` in the next commit, where the admin-200 response is operator- authorized and a non-empty body is asserted. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: server-scoped actions always require explicit policy (green) `server_graphs_list` returned 200 in Open mode (`--unauthenticated`, no tokens, no policy) because `authorize_request`'s no-policy fallback only denied when `actor.is_some()` AND action != Read. In Open mode `actor: None`, so the denial branch never fired and the call returned `Ok(())` — leaking the registry (graph IDs + URIs that may contain S3 bucket paths or internal hostnames) to any unauthenticated caller. The docstring on `server_graphs_list` claimed it was "Cedar-gated" and that the server should "not leak the registry until the operator explicitly authorizes it" — docstring intent and code disagreed. Symptomatic fix: special-case GraphList. Breaks the moment another server-scoped action (`graph_create`, `graph_delete`) is added. Correct-by-design fix: tie authorization to the action's `resource_kind()`. Server-scoped actions (`PolicyResourceKind::Server`) always require explicit policy authorization — there is no runtime state where they're served by default. Per-graph actions keep the existing default-deny logic (DefaultDeny denies non-Read for authenticated actors; Open mode allows everything per the operator's `--unauthenticated` opt-in for graph DATA, but not for server topology). The fix uses the existing `PolicyResourceKind` enum that #119 already added — no new abstraction. Future server-scoped actions (runtime `graph_create`/`graph_delete` when the cluster catalog ships) automatically pick up the same enforcement without any per-action handler change. 
Changes: * `crates/omnigraph-server/src/lib.rs:51` — re-export `PolicyResourceKind` (the kind discriminator was already public on the omnigraph-policy crate; needed in scope here). * `crates/omnigraph-server/src/lib.rs:1457` — `authorize_request`'s no-policy fallback gains a server-scoped-action check that fires before the actor-based default-deny logic. Error message names the failure mode and points at `server.policy.file`. * `crates/omnigraph-server/tests/server.rs:5037` — `get_graphs_with_server_policy_authorizes_per_cedar` extended to register two graphs in non-alphabetical order and assert the admin-200 response is sorted alphabetically. Restores the sort-order coverage that lived in `get_graphs_lists_registered_graphs_in_multi_mode` before the red commit renamed it to assert denial. Also bundles a small adjacent cleanup that the bot-review flagged: * `crates/omnigraph-server/src/graph_id.rs:124` — drop the unreachable `"openapi.json"` entry from `is_reserved`. The regex `^[a-zA-Z0-9-]{1,64}$` rejects every dot-containing name before `is_reserved` can run, so dotted entries in this list were dead code that misled readers into thinking the list needed to cover them. Comment now names the structural exclusion. The `rejects_reserved_route_names` test loses its `openapi.json` row (covered by `rejects_dots` via the regex). Closes the "server-scoped management actions silently leak in Open mode" class. Red test from the previous commit (`get_graphs_denied_in_open_mode_without_server_policy`) turns green; all 78 server integration tests + 76 lib tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: fold multi-graph work into v0.6.0 (no separate v0.7.0 release) The branch had bumped workspace versions to 0.7.0 and added a dedicated `docs/releases/v0.7.0.md` for the multi-graph work. Per scope decision: ship the graph-rename and the multi-graph mode in one v0.6.0 release. 
Changes: * Workspace versions bumped 0.7.0 → 0.6.0 in every crate manifest (`omnigraph`, `omnigraph-compiler`, `omnigraph-policy`, `omnigraph-server`, `omnigraph-cli`) and their internal `path = ..., version = "..."` dependency constraints. * `docs/releases/v0.7.0.md` content merged into `docs/releases/v0.6.0.md`, retargeted to a single coherent v0.6.0 release note covering both the graph terminology rename and the multi-graph server mode. The original v0.7.0.md is deleted. * All `v0.7.0` / `0.7.0` doc and comment references throughout `crates/`, `docs/`, `AGENTS.md`, and `openapi.json` retargeted to `v0.6.0` / `0.6.0`. `Cargo.lock` regenerated to match. * OpenAPI spec regenerated via `OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi openapi_spec_is_up_to_date` — `"version": "0.6.0"` now. Verification: * `cargo build --workspace` — clean (6 pre-existing engine warnings only). * `cargo test --workspace --locked` — zero failures across all 39 test result groups. * `bash scripts/check-agents-md.sh` — passes (34 links / 33 docs). * `grep -rn "0\.7\.0\\|v0\.7\.0" --include='.rs' --include='.md' --include='.json' --include='.toml' .` returns no workspace hits. The three remaining `0.7.0` strings in `Cargo.lock` belong to unrelated 3rd-party crates (`pem-rfc7468`, `radium`, `rand_xoshiro`). The git tag and crates.io publish happen later — this commit just consolidates the surface so the eventual release is one coherent v0.6.0 covering all the work since v0.5.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mr-668: sanitize internal refs from v0.6.0 release notes cubic-dev-ai P2 comments flagged that the release notes carried internal Linear ticket and RFC references (MR-668, MR-731, MR-723, RFC 0003, RFC 0004). Per AGENTS.md maintenance rule 5, "Release docs are public project history. 
Describe capabilities, behavior changes, breaking changes, upgrade notes, and user impact; do not reference private ticket systems, internal codenames, or planning shorthand that an outside contributor cannot inspect." The bot's comments are correct against our own published contract — they were a docs-quality regression introduced when I drafted these notes. Replaced each internal reference with the public-facing concept it stood for. The substantive content (capabilities, behavior, guarantees) was already present alongside the refs; sanitization just trimmed the bracketed ticket labels: * Line 6: dropped `(MR-668)` from the multi-graph mode summary — the descriptive name was already self-sufficient. * Line 24: `MR-731 spoof defense` → `the bearer-derived-actor- identity guarantee`; `Forward-compat for Cloud mode (RFC 0003) and OAuth provider (RFC 0004)` → "forward-compat seams for future multi-tenant and OAuth deployments; they're inert in this release" — describes what the operator sees instead of pointing at planning docs. * Line 26: `MR-731's server-authoritative-actor invariant` → "the server-authoritative-actor invariant: actor identity is always sourced from the bearer-token match resolved at the auth boundary" — the public-facing statement of the guarantee. * Line 36: `(MR-723 default-deny otherwise rejects …)` → "without a server policy the default-deny posture rejects …" — same content, no ticket label. * Line 121: `MR-731 spoof regression test` → "The bearer-auth- derived-actor-identity regression test (client-supplied identity headers are ignored; the server-resolved actor is the only identity Cedar sees)" — describes what the test guards instead of naming the originating ticket. Verified: `grep -E 'MR-\d+\|RFC[ -]?\d+' docs/releases/v0.6.0.md` returns no matches; the rest of `docs/releases/` is also clean. `scripts/check-agents-md.sh` passes. 
Note: cubic-dev-ai also flagged `crates/omnigraph-cli/src/main.rs:276` ("doc comment incorrectly references v0.6.0 for a command that only exists in v0.7.0"). That comment is based on a stale model of the release surface — after folding v0.7.0 into v0.6.0 in the previous commit, the multi-graph CLI surface IS in v0.6.0 and the comment is correct as written. No change needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * fix: close validated init and multi-graph gaps * chore: address review cleanup comments --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 16:19:31 +02:00
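The "server-scoped actions always require explicit policy" commit above ties the no-policy fallback to the action's `resource_kind()`. A minimal sketch of that ordering follows, assuming a reduced action set. `PolicyResourceKind::Server` and `resource_kind()` are named in the commit message; the `Graph` variant, the `Action` enum, and `authorize_without_policy` are hypothetical names used only for illustration, not the actual `authorize_request` code.

```rust
/// Which kind of resource an action targets. `Server` is named in the commit
/// message; the `Graph` variant name is an assumption.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PolicyResourceKind {
    Server,
    Graph,
}

/// Hypothetical, reduced action set for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Read,
    Change,
    GraphList,
}

impl Action {
    fn resource_kind(self) -> PolicyResourceKind {
        match self {
            Action::GraphList => PolicyResourceKind::Server,
            Action::Read | Action::Change => PolicyResourceKind::Graph,
        }
    }
}

/// Fallback used only when no policy engine is installed.
/// `actor` is `None` in Open mode and `Some(..)` in DefaultDeny mode.
fn authorize_without_policy(action: Action, actor: Option<&str>) -> Result<(), String> {
    // Server-scoped check runs first: there is no state where these are served by default.
    if action.resource_kind() == PolicyResourceKind::Server {
        return Err(format!(
            "{action:?} requires an explicit server policy (see server.policy.file)"
        ));
    }
    match (actor, action) {
        // DefaultDeny: authenticated actors may only read.
        (Some(_), Action::Read) => Ok(()),
        (Some(_), _) => Err(format!("{action:?} denied: default-deny without a policy")),
        // Open mode: the operator opted in to open access for graph data.
        (None, _) => Ok(()),
    }
}
```

Because the check keys on the resource kind rather than on `GraphList` specifically, a future server-scoped action added to the enum would be denied by the same branch without a per-action special case.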
Ragnor Comerford	fd41f798b7	chore(codeowners): remove aaltshuler as owner	2026-05-28 11:41:38 +02:00
Ragnor Comerford	972a6e047b	chore(codeowners): add ragnorc as engineering owner	2026-05-28 11:20:51 +02:00
Ragnor Comerford	cc2412dc65	Rename repo terminology to graph (#118)	2026-05-24 16:46:00 +01:00
Andrew Altshuler	bb1fe57640	release: v0.5.0 (#115 ) * gitignore: exclude docs/internal/ from publication Mirrors the existing "Local-only working files (not for the public repo)" pattern. Working notes filed under docs/internal/ stay on the contributor's machine instead of cluttering the published doc tree or tripping the AGENTS.md / docs-index cross-link check (scripts/check-agents-md.sh enumerates every docs/.md and requires each one to be linked from an audience index — internal notes don't have an audience index by definition). Incidental to the v0.5.0 release; lands separately from the version bump commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> ci: skip docs/internal/ in agents-md cross-link check Matches the .gitignore exclusion. Mirrors the existing 'docs/releases/' exclusion pattern: notes under docs/internal/ aren't part of the published doc tree and don't need to be linked from an audience index. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * release: v0.5.0 — Lance 6 substrate, Cedar policy engine, schema-lint v1 Bumps the workspace from 0.4.2 to 0.5.0. Release notes at docs/releases/v0.5.0.md. Three user-visible pillars motivate the minor bump: 1. Lance 6.0.1 substrate (DataFusion 52→53, Arrow 57→58) 2. Engine-wide Cedar policy enforcement on every _as writer; server defaults to deny-all; signed-token-claim-only actor identity 3. Schema-lint v1 chassis: OG-XXX-NNN codes, soft drops, and `--allow-data-loss` (Hard mode) for destructive migrations Plus structured DataFusion Expr filter pushdown (unblocks CompOp::Contains via array_has), HTTP allow_data_loss parity, inline .gq sources on CLI/HTTP, optional CORS layer, and bug fixes (merge-insert dup-rowid, branch-merge coordinator restore on error, blob columns in branch merge). 
Sites bumped: - 5 crate [package].version lines (omnigraph, omnigraph-cli, omnigraph-compiler, omnigraph-policy, omnigraph-server) - 10 internal path-dep `version = "..."` constraints across the four manifests that depend on sister crates (engine, server, cli, plus engine's dev-dep on the compiler) - Cargo.lock (regenerated via cargo update --workspace) - AGENTS.md "Version surveyed:" - openapi.json `info.version` (regenerated via OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi) Verification: - cargo test --workspace --locked: 907/907 green - cargo test -p omnigraph-engine --test failpoints --features failpoints: 19/19 green - cargo test -p omnigraph-engine --test lance_surface_guards: 3/3 - scripts/check-agents-md.sh: clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 13:59:42 +01:00
Andrew Altshuler	3551e0d40e	chore(lance): bump 4.0.0 → 6.0.1 (DataFusion 52→53, Arrow 57→58) (#111 ) * tests: add lance_surface_guards pre-flight pins for the v6 bump Land 8 named guards in a new test file that pin Lance API surfaces OmniGraph relies on. Each guard turns a silent-break risk (variant rename, struct restructure, async-flip) into a red CI bar instead of runtime drift. Guards (mapped to the silent-break inventory from the v6 migration plan): Runtime (#[tokio::test]): 1. lance_error_too_much_write_contention_variant_exists — pins the variant referenced by db/manifest/publisher.rs::map_lance_publish_error. 2. manifest_location_field_shape — pins .path/.size/.e_tag/.naming_scheme types and ManifestLocation accessor returning &Self (the access pattern at db/manifest/metadata.rs:84-88). 6. write_params_default_does_not_set_storage_version — confirms our explicit V2_2 pin remains load-bearing (blob v2 requirement). Compile-only async fns (#[allow(...)] + unimplemented!() placeholders; never run, but cargo build --tests enforces the API shape): 3. checkout_version + restore chain — pins the recovery rollback hammer at db/manifest/recovery.rs:505-522. 4. DatasetBuilder::from_namespace().with_branch().with_version().load() — pins the namespace builder chain at db/manifest/namespace.rs:162-174. 5. MergeInsertBuilder fluent chain — pins the manifest CAS at db/manifest/publisher.rs:370-391, including the return shape (Arc<Dataset>, MergeStats). 7. compact_files(&mut ds, CompactionOptions, None) — pins db/omnigraph/optimize.rs:107. 8. DeleteResult { new_dataset, num_deleted_rows } — pins the inline delete result shape (MR-A will repurpose this guard to the staged two-phase variant once Lance #6658 migration lands). This is commit 1 of the chore/lance-6.0.1 migration. Cargo bump follows in commit 2 (will trigger the guards under v6 if any surface drifted). Per the migration plan at ~/.claude/plans/shimmering-percolating-duckling.md (written this session). 
Two guards from the plan deferred to follow-up: - manifest_cas_returns_row_level_contention_variant (full publisher race integration test — needs harness scaffolding) - table_version_metadata_byte_compatible_with_v4 (TableVersionMetadata is pub(crate); requires test reach extension). Verified on v4: cargo test -p omnigraph-engine --test lance_surface_guards passes 3/3 runtime tests; cargo build -p omnigraph-engine --tests compiles all 5 compile-only guards clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore(deps): bump Lance 4.0.0 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58 The Cargo bump itself. Source is intentionally untouched — this commit will not compile. The compile errors are the work-list for subsequent commits on this branch. Lance updates: lance + 7 sub-crates 4.0.0 → 6.0.1. Transitive churn: + lance-tokenizer v6.0.1 (vendored tokenizer per Lance PR #6512) + object_store 0.13.x (Lance 6 brings it transitively; our explicit pin stays at 0.12.5 for now — revisit in stages if diamond bites) - tantivy* crates (replaced by lance-tokenizer) Compile error landscape on this commit (11 errors): • 1× E0432: `lance_index::DatasetIndexExt` import (Lance PR #6280 moved it to lance::index). Sites: table_store.rs:20, db/manifest.rs:37 (the second site was missed by the pre-flight inventory). • 8× E0599: `create_index_builder` / `load_indices` missing on `lance::Dataset` — all downstream of the DatasetIndexExt move. Once the import is corrected on table_store.rs and db/manifest.rs, these resolve automatically. • 2× E0063: missing field `is_only_declared` in `DescribeTableResponse` initializer at db/manifest/namespace.rs:221, 364. New Lance namespace field per the v5 namespace restructure (PR #6186). Surface guards (lance_surface_guards.rs, commit `d571fa8`) all still compile + the 3 runtime ones pass on v6 — none of the silent-break surfaces drifted. 
That's the load-bearing observation: the publisher CAS chain, ManifestLocation field shape, checkout_version/restore, DatasetBuilder fluent chain, MergeInsertBuilder return shape, WriteParams::default, compact_files signature, and DeleteResult fields are all v6-stable. Next commits address the 11 errors per the migration plan stages 3-8. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * imports: move DatasetIndexExt to lance::index (Lance PR #6280) Lance 5.0 (PR #6280) moved `DatasetIndexExt` out of `lance-index` into `lance::index`. `is_system_index` and `IndexType` stayed in `lance-index`. Mechanical update of 6 import sites: crates/omnigraph/src/table_store.rs:20 — split into two `use` lines crates/omnigraph-server/tests/server.rs:10 — was traits::DatasetIndexExt crates/omnigraph/tests/search.rs:6 crates/omnigraph/tests/branching.rs:7 crates/omnigraph/tests/failpoints.rs:467 crates/omnigraph-cli/tests/cli.rs:3 — was traits::DatasetIndexExt All 9 E0599 cascading errors on .create_index_builder / .load_indices resolve once the trait is back in scope. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * namespace: add is_only_declared field to DescribeTableResponse Lance namespace 6.0.0 added `is_only_declared: Option<bool>` to `DescribeTableResponse` (lance-namespace-reqwest-client 0.7+ via the v5.0 namespace API restructure, Lance PR #6186). Set to `Some(false)` because every table BranchManifestNamespace returns from describe_table is materialized — the manifest snapshot only includes entries for tables we've already opened via Dataset::open. Two sites in db/manifest/namespace.rs (BranchManifestNamespace + StagedTableNamespace impls of LanceNamespace::describe_table). Closes the last two compile errors from the v6 bump in the engine lib. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * cargo: add lance to omnigraph-cli + omnigraph-server dev-deps Stage 3 moved DatasetIndexExt imports from `lance-index` to `lance::index` in the cli and server test crates. Both crates only had `lance-index` in their dev-dependencies; add `lance` alongside so the new path resolves. This is the last compile-error fix from the v6 bump — `cargo build --workspace --tests` is now green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * docs: refresh Lance alignment audit for v6.0.1; bump surveyed version Per CLAUDE.md maintenance rule 2 (same-PR docs): - docs/dev/lance.md: replace the v4.0.1 alignment audit stanza with the v6.0.1 audit. Captures every v5/v6 finding from this PR (the DatasetIndexExt move, DescribeTableResponse.is_only_declared, MergeInsertBuilder return shape, ManifestLocation field shape, LanceFileVersion::default flip, file-reader async, tokenizer vendor, Lance #6658/#6666/#6877 status). Cross-references each guard in tests/lance_surface_guards.rs. - AGENTS.md: bump "Storage substrate: Lance 4.x" → "Lance 6.x". Note: surveyed crate version stays at 0.4.2 — substrate version bumps are independent of OmniGraph's release version. - crates/omnigraph/src/storage_layer.rs: update the trait module-level doc-comment to reflect that Lance #6658 closed 2026-05-14 and delete_where two-phase migration is MR-A (the next follow-up). #6666 stays open; create_vector_index inline residual stays. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * tests: silence clippy::diverging_sub_expression on compile-only guards The five `_compile_` async fns in lance_surface_guards.rs use `let ds: Dataset = unimplemented!()` as a placeholder so type inference can chase the method chain we want to pin, without ever running the function. Clippy's `diverging_sub_expression` lint flags this pattern because the RHS diverges; that's the entire point. 
Added to the per-fn `#[allow(...)]` list, alongside dead_code / unreachable_code / unused_variables / unused_mut already there. No behavior change. cargo test -p omnigraph-engine --test lance_surface_guards still 3/3 green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> docs: correct #6658 status — closed but API ships in Lance v7.x, not v6.0.1 The audit stanza in docs/dev/lance.md and the storage_layer.rs trait doc-comment both implied the public DeleteBuilder::execute_uncommitted API shipped with Lance 6.0.1. It did not. Issue #6658 closed 2026-05-14, but binary search across the release stream confirms: v6.0.1 ❌ no pub async fn execute_uncommitted on DeleteBuilder v6.1.0-rc.1 ❌ v7.0.0-beta.5 ❌ v7.0.0-beta.10 ✅ first appearance v7.0.0-rc.1 ✅ So MR-A (delete two-phase migration) is gated on the Lance v7.x bump, not on this PR. v7.0.0-rc.1 dropped 2026-05-21; GA likely within a week. No behavior change. Doc-only correction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * ci(lib): bump recursion_limit to 256 — Lance 6 trait depth on Linux Lance 6's heavier trait surface around futures/streams in storage_layer.rs's staged-write API pushes the rustc trait-resolution recursion limit past the default 128 on Linux builds. CI on PR #111 surfaced this in both `Test Workspace` and `Test omnigraph-server --features aws`: error: queries overflow the depth limit! = help: consider increasing the recursion limit by adding a `#![recursion_limit = "256"]` attribute to your crate (`omnigraph`) = note: query depth increased by 130 when computing layout of `{async block@crates/omnigraph/src/storage_layer.rs:697:5: 697:10}` (The async block is `stage_create_btree_index`'s body — its return type is several layers of `impl Future<Output=Result<StagedHandle>>` deep on top of Lance's own builder return types.) Local macOS builds happened to short-circuit before tripping the limit, which is why this didn't surface during the v6 bump sequence. 
The fix rustc itself suggests is one line at the crate root. No behavior change. Revisit if a future Lance bump stops needing it. Verified: `cargo build --locked -p omnigraph-server --features aws` compiles clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 00:42:29 +01:00
Andrew Altshuler	aadfa11ecb	schema: HTTP allow_data_loss exposure + e2e drop coverage (MR-694 follow-up) (#107) The schema-lint chassis v1.2 (PR #100) shipped `--allow-data-loss` on the CLI, but `SchemaApplyRequest` had no equivalent field — Hard-mode drops were CLI-only. This commit closes that feature gap and adds e2e test coverage for drop modes across HTTP + CLI, plus data preservation on additive apply, plus a CLI↔SDK plan-parity assertion. Feature gap closed: - `crates/omnigraph-server/src/api.rs` — added `allow_data_loss: bool` (default false via `#[serde(default)]`) to `SchemaApplyRequest`. Added `Default` derive so test usages can use `..Default::default()`. - `crates/omnigraph-server/src/lib.rs` — `server_schema_apply` now constructs `SchemaApplyOptions { allow_data_loss: request.allow_data_loss }` and threads through to `apply_schema_as`. - `crates/omnigraph-cli/src/main.rs` — remote-URI schema-apply path used to bail with "--allow-data-loss not yet supported on remote"; now forwards the flag into the JSON payload so the CLI behaves identically against local and remote URIs. - `openapi.json` — regenerated; only diff is the new field on `SchemaApplyRequest`. 
Tests added (8 new): * `crates/omnigraph-server/tests/server.rs` (+5): - `schema_apply_route_soft_drops_property_via_http` — POST schema removing nullable property, verify catalog reflects the drop AND `snapshot_at_version(pre)` still has `age` in the field list (time-travel reachability is the Soft contract). - `schema_apply_route_soft_drops_node_type_via_http` — POST schema removing `Company` node + cascading `WorksAt` edge. - `schema_apply_route_hard_drops_property_with_allow_data_loss` — POST with `allow_data_loss: true`, verify plan step reports `mode: hard`. - `schema_apply_route_keeps_drops_soft_without_flag` — same schema without flag, verify `mode: soft`. Pins default semantics against accidental Hard promotion. - `schema_apply_route_additive_property_preserves_existing_rows` — load fixture, POST adding nullable property, verify row count preserved (SDK suite covers data preservation on drops + renames; additive AddProperty wasn't pinned). Plus helpers `schema_without_age` and `schema_without_company`. * `crates/omnigraph-cli/tests/cli.rs` (+3): - `schema_apply_allow_data_loss_flag_promotes_drops_to_hard` — CLI `omnigraph schema apply --allow-data-loss --schema X.pg --json`, verify plan step has `mode: hard`. - `schema_apply_without_allow_data_loss_keeps_soft_drops` — without flag, verify Soft. - `schema_plan_parity_cli_and_sdk` — same `.pg` source through `Omnigraph::plan_schema` (SDK) and `omnigraph schema plan --json` (CLI), assert the steps array is byte-identical post-JSON. HTTP has no `/schema/plan` endpoint; apply-side parity is implicitly covered by the HTTP drop tests + CLI drop tests using identical fixtures. Docs: - `docs/user/schema-language.md` — new "Destructive drops" section documenting Soft vs Hard semantics and that `allow_data_loss` is now honored uniformly across CLI / HTTP / SDK. Verification: every new test passes; full `cargo test --workspace --locked` green; `scripts/check-agents-md.sh` passes. 
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 01:56:46 +03:00
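The commit above adds `allow_data_loss` to `SchemaApplyRequest` with a default of false and a `Default` derive for `..Default::default()` in tests. The sketch below shows how that default keeps drops Soft unless the flag is set. It assumes a simplified request struct without serde; the `schema_source` field, the `DropMode` enum, and `drop_mode` are hypothetical names, not the server's actual types.

```rust
/// Simplified stand-in for the request body. `allow_data_loss` is from the
/// commit message; `schema_source` is a hypothetical field name.
#[derive(Debug, Default, Clone)]
struct SchemaApplyRequest {
    schema_source: String,
    /// `bool::default()` is false, so drops stay Soft unless the caller opts in.
    allow_data_loss: bool,
}

/// Hypothetical enum mirroring the `mode: soft` / `mode: hard` plan output.
#[derive(Debug, PartialEq, Eq)]
enum DropMode {
    Soft,
    Hard,
}

/// Decide the drop mode for a request; only an explicit flag promotes to Hard.
fn drop_mode(request: &SchemaApplyRequest) -> DropMode {
    if request.allow_data_loss {
        DropMode::Hard
    } else {
        DropMode::Soft
    }
}
```

A test fixture can then set only the fields it cares about, for example `SchemaApplyRequest { allow_data_loss: true, ..Default::default() }`.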
Andrew Altshuler	f3f2a051ba	policy: server 3-state default-deny matrix (MR-723) (#105 ) Closes the "tokens but no policy" trap. Pre-MR-723, an operator who configured bearer tokens and forgot to set policy.file got a server that required auth and then permitted every action — the illusion of protection. After MR-723, that configuration is default-deny: only `read` actions succeed; every other action returns HTTP 403. Three startup states, classified deterministically: - Open — no tokens, no policy. Requires explicit `--unauthenticated` flag or `OMNIGRAPH_UNAUTHENTICATED=1`; otherwise `serve()` refuses to start. Forces the operator to opt in to "fully open dev mode" so it can't happen accidentally. - DefaultDeny — tokens configured, no policy. `authorize_request` rejects every action except `Read` with 403. The warn-log on startup names the misconfiguration explicitly. - PolicyEnabled — policy file configured. Cedar evaluates every request, unchanged from pre-MR-723. What landed: - `ServerConfig.allow_unauthenticated: bool` + `--unauthenticated` flag on the `omnigraph-server` bin + `OMNIGRAPH_UNAUTHENTICATED` env var (`load_server_settings` honors both). - New `classify_server_runtime_state(has_tokens, has_policy, allow_unauthenticated) -> Result<ServerRuntimeState>` pure function. `serve()` calls it before opening the engine and bails with a clear error when the operator hits the no-tokens-no-policy-no-flag cell. - `authorize_request` state-2 branch: when `policy_engine()` is None but the bearer-auth middleware delivered an authenticated actor, any action other than `Read` returns 403 with a message that names the misconfiguration. - `AppState::with_policy_engine(self, engine)` builder method so integration tests that need a custom workload (`new_with_workload`) can still install a permit-all policy without a new constructor. 
- `app_for_loaded_repo_with_auth(token)` and `app_for_loaded_repo_with_auth_tokens(tokens)` test helpers now install a permit-all policy alongside tokens — they previously represented the "tokens but no policy" state that MR-723 makes default-deny, and tests that don't care about policy were inadvertently coupled to the loophole. Tests: - `classify_` unit tests (3) — every cell of the matrix. - `default_deny_mode_allows_read_for_authenticated_actor` — GET /snapshot succeeds with bearer token + no policy. - `default_deny_mode_rejects_change_with_forbidden` — POST /change rejected with 403 + "default-deny" message. - `default_deny_mode_rejects_schema_apply_with_forbidden` — POST /schema/apply rejected with 403 + "default-deny" message. - New `app_for_repo_with_auth_tokens_only(schema, tokens)` helper builds the State-2 fixture without policy. The pre-MR-723 helpers `app_for_loaded_repo_with_auth` shift semantics to "tokens + permit-all" so existing tests retain their original intent. docs/user/policy.md: new "Server runtime states (MR-723)" section documents the matrix and the explicit `--unauthenticated` opt-in. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 17:02:26 +03:00
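The three startup states described in the MR-723 commit above can be sketched as a pure function. The function name, parameter names, and state names come from the commit message. The `String` error type, the error text, and the handling of "policy configured without tokens" (treated here as `PolicyEnabled`) are assumptions.

```rust
#[derive(Debug, PartialEq, Eq)]
enum ServerRuntimeState {
    Open,
    DefaultDeny,
    PolicyEnabled,
}

/// Classify the startup configuration. The real function returns a crate
/// `Result`; a `String` error is used here to stay self-contained.
fn classify_server_runtime_state(
    has_tokens: bool,
    has_policy: bool,
    allow_unauthenticated: bool,
) -> Result<ServerRuntimeState, String> {
    if has_policy {
        // A configured policy file means Cedar evaluates every request.
        // Assumption: this holds whether or not tokens are configured.
        return Ok(ServerRuntimeState::PolicyEnabled);
    }
    if has_tokens {
        // Tokens without a policy: only `Read` is allowed downstream.
        return Ok(ServerRuntimeState::DefaultDeny);
    }
    if allow_unauthenticated {
        // No tokens and no policy, with an explicit operator opt-in.
        Ok(ServerRuntimeState::Open)
    } else {
        Err("no tokens and no policy configured; pass --unauthenticated to run fully open"
            .to_string())
    }
}
```

Keeping the classification free of I/O is what lets `serve()` call it before opening the engine and lets unit tests cover each cell directly.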
Andrew Altshuler	a275306a15	policy: CLI policy injection — local writes go through engine enforce (MR-722) (#104) Closes the CLI side of the policy chassis fan-out. Before this commit, CLI direct-engine writes bypassed Cedar entirely because the CLI never called `Omnigraph::with_policy(...)` for non-`policy validate|test|explain` subcommands. After this commit, every CLI direct-engine writer (change, load, ingest, branch create/delete/merge, schema apply) opens the engine via a new `open_local_db_with_policy(uri, &config)` helper that installs the configured `PolicyEngine` when `policy.file` is set, and threads the resolved actor through to the `_as` writer methods. Actor identity resolution: - New top-level `--as <ACTOR>` global flag on the CLI overrides config. - New `cli.actor` field in `omnigraph.yaml` provides a default actor. - Precedence: `--as` > `cli.actor` > None. - When policy is configured and neither is set, the engine-layer footgun guard fires and the write is denied — silent bypass via "I forgot the actor" is exactly what the guard prevents. - Remote HTTP writes ignore both — bearer-token-resolved server-side. Helpers added in main.rs: - `open_local_db_with_policy(uri, &config) -> Result<Omnigraph>` — opens the DB and installs the PolicyEngine when configured. Without policy this is identical to a bare `Omnigraph::open`. - `resolve_cli_actor(cli_as, &config) -> Option<&str>` — implements the flag > config > None precedence. Engine: added `load_file_as` to the loader as the actor-aware mirror of `load_file`, so CLI file-path loads flow through the same enforce gate as in-memory `load_as` calls. Test rewrite: `local_cli_policy_tooling_is_end_to_end_while_local_writes_stay_unenforced` was the explicit assertion of the pre-chassis hole. Renamed and split: - `local_cli_policy_tooling_is_end_to_end` — sanity for the read-only policy CLI surfaces (validate/test/explain), unchanged behavior. 
- `local_cli_change_enforces_engine_layer_policy` — the new assertion: policy installed + no actor → footgun-guard denial; `--as act-bruno` on protected main → Cedar denial; `--as act-ragnor` (admins-write rule) on main → permit, write committed. POLICY_E2E_YAML gains an `admins-write` rule so the permit case has a non-trivial actor to exercise. docs/user/policy.md updated with `cli.actor` + `--as <ACTOR>` usage. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 04:06:21 +03:00
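The `--as` > `cli.actor` > None precedence from the commit above fits in one expression. This is a sketch, not the code in `main.rs`: the `resolve_cli_actor` name and return type come from the commit message, while the `CliConfig` struct and its `actor` field are hypothetical stand-ins for the parsed `omnigraph.yaml`.

```rust
/// Hypothetical stand-in for the parsed config; only the `cli.actor` value matters here.
#[derive(Debug, Default)]
struct CliConfig {
    actor: Option<String>,
}

/// The `--as` flag wins, then the config default, otherwise no actor.
fn resolve_cli_actor<'a>(cli_as: Option<&'a str>, config: &'a CliConfig) -> Option<&'a str> {
    cli_as.or(config.actor.as_deref())
}
```

Returning `None` rather than inventing a default actor matters for the guard described in the commit: with a policy installed and no actor, the engine denies the write instead of silently bypassing enforcement.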
Andrew Altshuler	da42beec41	policy: chassis fan-out — _as variants on the remaining 6 writers (MR-722) (#103 ) PR #102 wired apply_schema_as. This PR completes the chassis-side coverage so every public mutating engine entry point hits the same Omnigraph::enforce(action, scope, actor) gate regardless of transport: - mutate_as → enforce(Change, Branch(branch), actor) - load_as → enforce(Change, Branch(branch), actor) - ingest_as → enforce(Change, Branch(branch), actor); also threads actor through the implicit branch_create_from_as so fresh-branch ingest correctly hits BranchCreate too - branch_create_as → enforce(BranchCreate, TargetBranch(name), actor) - branch_create_from_as → enforce(BranchCreate, BranchTransition { source, target }, actor) - branch_delete_as → enforce(BranchDelete, TargetBranch(name), actor) - branch_merge_as → enforce(BranchMerge, BranchTransition { source, target }, actor) Three new _as variants for branch ops (create, create_from, delete) that had no actor surface before; existing actor-less variants delegate with actor=None so the no-policy path is a strict no-op. HTTP handlers updated to thread the resolved actor into the new _as variants for branch_create and branch_delete (was previously dropped). 14 new SDK chassis tests (one allow + one deny pair per wired writer); the existing 4 apply_schema_as tests stay. All 18 pass. docs/user/policy.md updated to describe engine-wide enforcement and the coarse-vs-fine layer split (engine = action gate, query layer per-row = MR-725 future). AGENTS.md capability matrix updated to match. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 03:38:18 +03:00
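The delegation pattern in the commit above (actor-less writers call their `_as` variant with `actor = None`, and every `_as` variant passes through one `enforce` gate) can be sketched as follows. Everything here is a simplified assumption: `Engine`, `PolicyFn`, and `admins_write` are hypothetical, the policy is a plain function pointer rather than Cedar, and the denial when a policy is installed but no actor is supplied follows the guard behavior described in the neighboring CLI commit.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PolicyAction {
    Change,
}

#[derive(Debug, Clone, PartialEq, Eq)]
enum ResourceScope {
    Branch(String),
}

/// Hypothetical policy hook: returns true when the actor is permitted.
type PolicyFn = fn(PolicyAction, &ResourceScope, &str) -> bool;

/// Hypothetical example policy: only one named actor may write.
fn admins_write(_action: PolicyAction, _scope: &ResourceScope, actor: &str) -> bool {
    actor == "act-ragnor"
}

/// Hypothetical engine stand-in; `writes` records committed changes.
struct Engine {
    policy: Option<PolicyFn>,
    writes: Vec<String>,
}

impl Engine {
    /// Single gate shared by every writer.
    fn enforce(
        &self,
        action: PolicyAction,
        scope: &ResourceScope,
        actor: Option<&str>,
    ) -> Result<(), String> {
        // No policy installed: enforcement is a strict no-op.
        let Some(policy) = self.policy else { return Ok(()) };
        // Policy installed but no actor: deny rather than bypass silently.
        let Some(actor) = actor else {
            return Err("policy is configured but no actor was supplied".to_string());
        };
        if policy(action, scope, actor) {
            Ok(())
        } else {
            Err(format!("{actor} is not permitted to {action:?} on {scope:?}"))
        }
    }

    /// Actor-aware writer: checks the gate, then commits.
    fn mutate_as(&mut self, branch: &str, change: &str, actor: Option<&str>) -> Result<(), String> {
        self.enforce(PolicyAction::Change, &ResourceScope::Branch(branch.to_string()), actor)?;
        self.writes.push(format!("{branch}: {change}"));
        Ok(())
    }

    /// Actor-less writer delegates with `actor = None`.
    fn mutate(&mut self, branch: &str, change: &str) -> Result<(), String> {
        self.mutate_as(branch, change, None)
    }
}
```

Routing the actor-less method through the `_as` variant keeps a single code path, so a transport that forgets to pass an actor cannot skip the gate once a policy is installed.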
Andrew Altshuler	7a86f654d4	policy: codify signed-token-claim-only actor identity (MR-731) (#101 ) Warm-up commit for the policy chassis epic (MR-722). PR #1 of the chassis series — same role as schema-lint v1's commit #1 baseline. Zero behavioral change; establishes the regression test, the load-bearing doc comment, and the user-doc paragraph for an invariant already true in code. Server auth already resolves `actor_id` from the matched bearer token at `omnigraph-server/src/lib.rs:692-694`, overwriting whatever the handler put in the PolicyRequest. The principle is named in docs/dev/invariants.md Hard Invariant 11 ("clients cannot set actor identity directly"). What was missing: a regression test, a load-bearing doc comment at the resolution site, and a user-facing documentation paragraph. This commit adds all three. Why first. The actor-identity invariant is the foundation every other policy decision stands on. If `actor_id` can be spoofed, every chassis primitive (per-row scope, audit log, two-person rule) becomes ungated. Pinning the invariant first means PR #2 (the chassis core) doesn't have to re-prove this assertion. Changes: * crates/omnigraph-server/tests/server.rs — new regression test actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers with three sub-assertions: - spoof-up: bearer for denied actor + X-Actor-Id naming allowed actor → 403 (header doesn't promote) - spoof-down: bearer for allowed actor + X-Actor-Id naming denied actor → 200 (header doesn't demote) - empty-string spoof: empty X-Actor-Id doesn't clear resolved actor Cross-link to MR-777 (auth boundary cases — actor-id collision + malformed bearer) noted in the test docstring. * crates/omnigraph-server/src/lib.rs — expanded doc comment at the actor-resolution site explaining the SECURITY INVARIANT, citing Hard Invariant 11, the Supabase RLS history footgun, and the regression test that pins the contract. 
Reader thinking "I should let clients override actor_id for impersonation" hits this comment first. * docs/user/policy.md — new "Actor identity (signed-claim-only)" section near the existing Server enforcement section. Closes the user-facing doc gap MR-731's "Done when" requires. Architectural decisions for PR #2+ pinned this session (not implemented here, recorded so future implementers don't re-litigate): - PolicyEngine moves to new `omnigraph-policy` workspace crate so both engine and server can depend on it (Q2). - `enforce(action, scope, actor)` will take a new `ResourceScope` enum, leaving room for MR-725's per-type and per-row variants (Q3). - `PolicyAction::Admin` is kept and wired (Option A) — meta-action for policy-management surfaces (hot reload, audit log query, approvals list) as those consumer features land (Q4). Test results: - cargo test -p omnigraph-server --test server: 45 pass (44 existing + 1 new); no regressions - scripts/check-agents-md.sh: passes (34 links / 33 docs OK) Out of scope (PR #2+): - Omnigraph::with_policy() + enforce() method - omnigraph-policy crate creation - ResourceScope enum - CLI policy injection into Omnigraph - HTTP-layer redundant-check removal - MR-724 Admin action wiring (PR #2) - MR-723 default-deny 3-state (PR #4) - MR-736 severity warn/deny (PR #5) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 02:51:34 +03:00
Andrew Altshuler	e98347eb7b	schema-lint chassis v1.0: DropProperty Soft + code-tagged diagnostics (MR-694) (#90 ) * schema-lint chassis v1 (WIP): tier surfacing + plan doc First commit of the chassis v1 branch. Lands a small, foundational slice without behavior change, plus a planning doc that lays out the remaining 7 commits in sequence so the PR can be reviewed incrementally. This commit: - Adds SchemaMigrationStep::diagnostic() returning the full &'static DiagnosticCode (family + tier + severity) for UnsupportedChange steps with codes. Renderers can now reach the tier without re-implementing the code → tier lookup. - CLI `omnigraph schema plan` output now displays tier alongside code: unsupported change on node:Person.age [OG-DS-104, destructive]: removing property 'Person.age' is not supported in schema migration v1 Operators see at-a-glance the kind of risk each rejection represents — not just the rule identifier. - No behavior change. All 11 existing schema_apply tests still pass. Planning doc at docs/schema-lint-v1-plan.md tracks the 7 remaining commits to bring v1 to feature-complete: 1. (this commit) Tier surfacing in plan output. 2. Soft / Hard mode enum on drop steps. 3. Tombstone fields on catalog IR. 4. Planner emits DropProperty { Soft } by default. 5. Apply path implements Soft mode. 6. Convert PR #62 destructive-rejection tests. 7. --allow-data-loss flag + Hard mode. 8. (optional) Tombstone unhide / restore command. Delete the planning doc when v1 lands. Intentionally checked in to the WIP branch so the scope is reviewable; not intended as a permanent doc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * schema-lint v1 commit 2: DropMode + dormant Drop* variants Second commit of the chassis v1 branch. Lands the type-level shape of soft/hard drops without wiring them up. 
Variants are reachable from emitters but the planner doesn't produce them yet; the apply path returns an explicit not-yet-implemented error if one shows up via deserialization. Added: - `DropMode { Soft, Hard }` — orthogonal to `SafetyTier`. Tier classifies the rule's risk class; mode is the operator's intent for data treatment. - `Soft` → catalog tombstone, data retained. Tier: safe. - `Hard` → Lance-level removal. Tier: destructive; will require --allow-data-loss to apply (commit 7). - `SchemaMigrationStep::DropType { type_kind, name, mode }` and `SchemaMigrationStep::DropProperty { type_kind, type_name, property_name, mode }` variants. - Re-export `DropMode` from `omnigraph_compiler::DropMode` so downstream crates don't reach into the catalog submodule. - CLI `render_schema_plan_step` arms for both variants, surfacing the mode in plan output: `drop property 'Person.age' of node 'Person' (soft mode)`. - `apply_schema_with_lock` exhaustive match arm for the two new variants that returns `manifest_internal` with a clear not-yet-implemented message. If a SchemaIR JSON containing Drop{Type,Property} arrives (e.g. from a future tool or hand- written), the apply path fails explicitly rather than silently misclassifying. - Two new in-source tests: - `drop_steps_round_trip_through_serde` — pins the wire shape for all four (variant × mode) combinations. - `drop_mode_serde_uses_snake_case` — pins external-tool- friendly serialization (`"soft"` / `"hard"`). Build: clean, only pre-existing warnings. Tests: - omnigraph-compiler schema_plan: 6/6 (4 existing + 2 new). - omnigraph-engine schema_apply: 11/11 (unchanged — planner still emits UnsupportedChange for removal paths). Next commit (commit 3 per docs/schema-lint-v1-plan.md): add the `tombstoned: bool` fields to NodeIR / EdgeIR / PropertyIR for the catalog representation of soft-mode tombstones. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * plan doc: reframe v1 around Lance native drop_columns After a substrate audit of the Lance data-evolution guide on 2026-05-13, the v1 plan was simplified. Two key findings: 1. Lance's `drop_columns()` is already metadata-only and reversible via time travel until cleanup. No need for a parallel `tombstoned: bool` field in our catalog IR — Lance's version graph IS the tombstone. 2. The full schema_apply substrate migration (add_columns, drop_columns, alter_columns vs. stage_overwrite across all step types) is consolidated in MR-948 as a sibling issue. v1 only uses the relevant slice (drop_columns for OG-DS-1XX). Net plan changes: - Commit 3 (original): tombstone fields on catalog IR → dropped. No catalog IR change needed. The Lance drop_columns commit IS the tombstone. - Commit 5 (original): apply path writes tombstoned: true → replaced with: apply path calls Dataset::drop_columns([name]). - Commit 7 Hard mode: stage_overwrite removing the column → replaced with: drop_columns + compact_files + cleanup_old_versions. Same APIs omnigraph cleanup already uses. - Commit 8 (original): omnigraph schema unhide → dropped. Time travel is the undo (omnigraph snapshot --at <commit>). Net result: 8 commits → 5 commits. ~250 LoC less surface. More substrate-aligned. The chassis types from commit 2 (DropMode enum, DropType / DropProperty variants) are kept exactly as designed; only the implementation strategy changed. The Lance docs quote is included in the doc so future readers see the substrate behavior cited verbatim. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * schema-lint v1 commit 3: emit + apply DropProperty { Soft } Wire the dormant DropProperty variant end-to-end for the Soft case. Per docs/schema-lint-v1-plan.md, commit #3 of the schema-lint chassis v1 series (MR-694). 
Planner (schema_plan.rs): - plan_properties: emit DropProperty { type_kind, type_name, property_name, mode: Soft } instead of UnsupportedChange when a property exists in accepted but not in desired. Plan is now supported = true for drop-only changes. Apply (schema_apply.rs): - Route DropProperty { Soft } through rewritten_tables. The existing batch_for_schema_apply_rewrite path already iterates the target schema fields, so a property absent from desired_catalog is naturally projected away. The prior Lance version retains the dropped column for time-travel reversibility (until cleanup runs). - DropType still errors (lands in commit #4 with different mechanics: __manifest entry removal instead of column projection). - DropProperty { Hard } still errors (lands in commit #5 with --allow-data-loss CLI flag + immediate compact_files + cleanup_old_versions). Tests: - Planner unit test plan_emits_soft_drop_for_removed_nullable_property asserts the variant emission + supported = true + no UnsupportedChange. - Integration test apply_schema_drops_a_nullable_property_softly_ preserves_prior_version (replaces the former apply_schema_rejects_dropping_a_property_with_data) asserts: (a) plan contains DropProperty { Soft } (b) apply succeeds + manifest advances + row count unchanged (c) current dataset schema lacks the dropped column (d) snapshot_at_version(pre_drop) still has the dropped column (e) reopen consistency — drop preserved across engine restart Recovery: rides on SidecarKind::SchemaApply per MR-847. No new sidecar kind needed; the entire apply path is already sidecar-wrapped. Substrate alignment: this commit uses the stage_overwrite full-rewrite path (full_rewrite cost class) rather than Lance native drop_columns (catalog_only cost class). MR-948 is the follow-up substrate-alignment refactor that introduces a LanceColumnOp surface and switches the metadata-only case onto drop_columns. Functional outcome is identical; cost-class improvement deferred. 
Test results: - cargo test -p omnigraph-compiler --lib: 238 passed - cargo test -p omnigraph-engine --test schema_apply: 11 passed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs: move schema-lint-v1-plan into docs/dev/ + add to index Post-rebase fixup for the docs split (#93). The plan doc was added to docs/ at the top level before main reorganized to docs/{user,dev}/. This moves it into docs/dev/ and adds an entry to docs/dev/index.md under a new "Active Implementation Plans" section so the check-agents-md.sh link check passes. Per the original commit message (`617a77d`), the plan doc is intentionally temporary — it will be deleted when v1 lands. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-16 16:30:03 +03:00
Andrew Altshuler	0de5f69d86	docs: drop npx mdrip; use curl | pandoc for full-page fetches (#97) The previous "fetch the full page" recommendation in AGENTS.md and docs/dev/lance.md pointed at an unknown-author npm CLI that, on consent, wrote agent-targeted content into AGENTS.md and modified .gitignore / tsconfig.json. Source audit was clean of malicious code but the self-perpetuating prompt-injection pattern combined with a single maintainer and ~21 downloads/day made it not worth the risk. Switched to the curl + pandoc command already documented as the no-tool option. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 16:06:24 +03:00
Andrew Altshuler	60eee78465	docs: split user and developer docs (#93 )	2026-05-15 03:45:22 +03:00
Andrew Altshuler	6bad829ed0	branch-protection: declarative policy + apply script (#89 ) Branch protection on main, declared as code rather than as opaque GitHub UI state. Pairs with the CODEOWNERS chassis (#88): once this PR lands and an admin runs the apply script, every PR to main must satisfy code-owner review and the listed required checks. Components: - .github/branch-protection.json — the policy. Edit this to change required checks, review counts, etc. Includes a _comment field for human readers; the apply script strips it before PUT. - scripts/apply-branch-protection.sh — idempotent apply via `gh api`. Reads back current state for verification. Supports DRY_RUN=1. - docs/branch-protection.md — explains the policy, how to apply, how to change, why declared as code. - AGENTS.md topic-index row. Policy summary: - Required status checks (strict): Classify Changes, Check AGENTS.md Links, Test Workspace, Test omnigraph-server --features aws, CODEOWNERS / drift, CODEOWNERS / noedit. - Required approving reviews: 1, must be a code owner. - Dismiss stale reviews on new commits. - Required linear history (squash or rebase merges only). - No force pushes, no deletions, no admin bypasses. - Required conversation resolution. What's NOT in this PR: - Required signed commits — not yet; maintainers must enroll GPG/SSH signing first or merges will block. - Tag protection for v* tags — separate PR. - Additional required checks (cargo deny, audit, fmt, clippy, CodeQL, schema-lint MR-946) — separate PRs as each lands. - The script is NOT run by CI. Branch-protection changes are admin actions; CI-driven auto-apply would defeat the purpose. Manual invocation is the audit point. How to apply after merge: ./scripts/apply-branch-protection.sh Requires gh-CLI auth with repo-admin permissions. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:38:20 +03:00
Andrew Altshuler	730712b73f	codeowners: yml source of truth + generator + drift CI (#88 ) * codeowners: generator + drift CI + initial roles Source-of-truth approach to CODEOWNERS: yml is hand-edited, CODEOWNERS is generated and CI-enforced. Every role change is a reviewable PR with a permanent in-repo audit trail. No GitHub UI clicks, no shadow state. Initial roles: engineering @aaltshuler owns crates/** + default (.github/, scripts/, Cargo., openapi.json, everything else not docs) docs @aaltshuler @ragnorc owns docs/, README.md, AGENTS.md, CLAUDE.md, SECURITY.md Per GitHub semantics, multiple owners on a CODEOWNERS line means "any one satisfies the review" — for docs, either named member can approve. Strict "N distinct approvers" would need a CI workaround (not wired today; tracked for future hardening). Components: - .github/codeowners-roles.yml — source of truth. Edit this. - .github/scripts/render-codeowners.py — generator (PyYAML; ~100 LoC). - .github/CODEOWNERS — generated. CI rejects hand-edits. - .github/workflows/codeowners.yml — two checks: drift: re-render and assert CODEOWNERS matches. * noedit: reject PRs that edit CODEOWNERS without editing the yml. - docs/codeowners.md — explains the source-of-truth pattern, how to change roles, how to add new roles. - AGENTS.md topic-index row. What's NOT in this PR: - Branch protection on main (separate PR; needs `gh api` call against the org). - Required-reviewer enforcement (depends on branch protection landing). - Required CI status checks (depends on branch protection landing). - Scheduled rotation (the schedule: block in the yml + a weekly workflow). Today's roles are stable; rotation isn't needed yet. - Linear-as-source-of-truth integration (Approach 4 from the design discussion; deferred). Verified: - Generator output is deterministic (idempotent re-runs). - scripts/check-agents-md.sh OK (28 links, 28 docs). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * codeowners: fix catch-all ordering (Devin review #88) Devin caught a real bug: GitHub CODEOWNERS uses "last match wins" semantics, but the generator emitted the catch-all `*` AFTER specific patterns. Net effect: `*` won for every file, silently nullifying the docs role and never routing reviews to @ragnorc. Fix is one-line — emit the default `*` line before iterating the specific paths. Also: - Added a regression assertion in the generator: after rendering, the first non-comment line must start with `*` if a default is configured. Generator exits non-zero otherwise. Catches the same class of mistake in any future refactor. - Rewrote the yml header comment, which incorrectly stated "keep more-specific paths after broader patterns" (correct for GitHub semantics but the generator was doing the opposite — so the comment read as a description of behavior when it was actually a contradicted intention). Verified by re-rendering: `*` is now line 12, `crates/` is line 14, `docs/` is line 15, etc. README.md matches both `*` and `README.md`; `README.md` is later → wins → @aaltshuler + @ragnorc both assigned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:26:06 +03:00
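The "last match wins" ordering bug described in the commit above can be sketched in a few lines of Rust. This is an illustration only, not the repository's generator (which the commit says is a Python script): the `Rule` type, the simplified `matches` function, and the `buggy_order` / `fixed_order` helpers are all hypothetical, and real CODEOWNERS patterns follow richer gitignore-style rules than the three cases handled here.

```rust
/// One rendered CODEOWNERS rule: a path pattern plus its owners.
/// (Hypothetical type for illustration.)
struct Rule {
    pattern: &'static str,
    owners: &'static [&'static str],
}

// Owner sets taken from the roles named in the commit message.
const ENGINEERING: &[&str] = &["@aaltshuler"];
const DOCS: &[&str] = &["@aaltshuler", "@ragnorc"];

/// Simplified matcher (assumption): `*` matches every path, a pattern ending
/// in `/` matches anything under that directory, anything else is exact.
fn matches(pattern: &str, path: &str) -> bool {
    if pattern == "*" {
        true
    } else if pattern.ends_with('/') {
        path.starts_with(pattern)
    } else {
        pattern == path
    }
}

/// "Last match wins": scan the rules from the bottom of the file upwards
/// and return the owners of the first rule that matches.
fn owners_for(rules: &[Rule], path: &str) -> Option<&'static [&'static str]> {
    rules
        .iter()
        .rev()
        .find(|rule| matches(rule.pattern, path))
        .map(|rule| rule.owners)
}

/// Catch-all emitted AFTER the specific patterns: it shadows all of them.
fn buggy_order() -> Vec<Rule> {
    vec![
        Rule { pattern: "docs/", owners: DOCS },
        Rule { pattern: "README.md", owners: DOCS },
        Rule { pattern: "*", owners: ENGINEERING },
    ]
}

/// Catch-all emitted FIRST: later, more specific rules override it.
fn fixed_order() -> Vec<Rule> {
    vec![
        Rule { pattern: "*", owners: ENGINEERING },
        Rule { pattern: "docs/", owners: DOCS },
        Rule { pattern: "README.md", owners: DOCS },
    ]
}
```

With the buggy ordering, `owners_for(&buggy_order(), "README.md")` resolves to the engineering owners only; with the fixed ordering it resolves to both docs owners, while a path such as `crates/foo.rs` still falls through to the default.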
Andrew Altshuler	c142dafdf3	schema-lint chassis v0: code-tagged diagnostics (MR-694) (#87 ) First slice of the schema-lint chassis. Adds stable `OG-XXX-NNN` codes to schema-migration rejections so operators can suppress, look up, and filter on identifiers rather than free-text prose. Atlas-style chassis adapted to omnigraph's typed-IR substrate (no SQL injection vector, no per-engine locks, native edge/vector/embedding types). What's in v0: - New `omnigraph-compiler/src/lint/` module with: - `diagnostic.rs` — Family / SafetyTier / Severity enums covering ten families (DS, MF, CD, BC, NM, OW, NL, VE, ED, LK). Only DS and MF are populated in this PR. - `codes.rs` — 8 DiagnosticCode constants (OG-DS-101..105, OG-MF-103, OG-MF-104, OG-MF-106). Five of the eight are wired to real emission sites; the other three are reserved. - Unit tests for catalog invariants: codes unique, prefix matches family, suffixes are 3-digit, destructive defaults to error, lookup() works, EMITTED_IN_V0 codes exist in ALL_CODES. - `SchemaMigrationStep::UnsupportedChange` gains an optional `code: Option<String>` field. New `unsupported_error_message()` helper prefixes the message with `[code]` when present. - 5 of 17 existing rejection paths now carry codes: - `removing node type` → OG-DS-102 - `removing edge type` → OG-DS-103 - `removing property` → OG-DS-104 - `adding required property without backfill` → OG-MF-103 - `changing property type` → OG-MF-106 Remaining 12 paths carry `code: None` and are tagged as future work. - `schema_apply` surfaces the formatted error (with `[code]` prefix); CLI `omnigraph schema plan` renders the code on the `unsupported change on <entity>` line. - PR #62 destructive-rejection tests in `tests/schema_apply.rs` now assert on the stable code (`msg.contains("OG-DS-104")`) instead of the error-message substring. 11/11 tests pass. - New `docs/schema-lint.md` documents the v0 catalog + the 10 families + Atlas prior art. AGENTS.md index updated. 
What's explicitly NOT in v0 (subsequent PRs): - No severity config in `omnigraph.yaml` (MR-694 §2). - No `@allow(OG-XXX-NNN, "rationale")` suppression directive (§3). - No `--allow-data-loss` flag or destructive-tier enforcement. - No new `SchemaMigrationStep` variants (soft/hard drops, default, widen/narrow). MR-700, MR-697 land those. - No pre-migration checks (MR-941). - No CD / VE / LK / NM family rules (MR-942..945). - No CI integration (MR-946). Tests: 235 compiler tests, 11 schema_apply integration tests, 14 lint module tests, 55 CLI tests — all green. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 17:08:18 +03:00
Ragnor Comerford	53d41a30b4	Merge pull request #85 from ModernRelay/ragnorc/survey-state engine: pin stable-row-id preservation through stage_overwrite	2026-05-12 17:24:55 -07:00
Ragnor Comerford	2121d9f6c3	docs: storage stable-row-ids reflects every dataset The L1 capability list claimed the flag was enabled "for the commit-graph and run-registry datasets" — stale. Every Lance dataset OmniGraph creates has enable_stable_row_ids: true; the run-registry datasets are gone since MR-771. Replace with a single paragraph capturing the invariant, the consequences (row-version columns available, CreateIndex × Rewrite not retryable, Lance reader version required), the legacy-dataset constraint (one-way at create, dump-and-reload to migrate), and a pointer to the regression test in staged_writes.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:56:51 -07:00
Ragnor Comerford	24c0558180	docs: lead AGENTS.md first principle with integrated-over-time framing Reframes the first-principle section to lead with Winters' "engineering is programming integrated over time" as the lens, keeping "minimize ongoing liability" as the operative directive and folding in "complexity should be earned." Adds a new Tiebreakers subsection with two rules that the prior section lacked clean appeals for: - correctness > simplicity > performance (lexicographic) - reversibility shapes evidence demand (reversible → prod metrics over napkin math over RFCs; irreversible → RFC up-front) Adds a Hyrum's-Law deny-list entry in both AGENTS.md and docs/invariants.md §IX: shipping observable behavior is shipping a contract, even when undocumented. Net always-on context cost: ~7 lines. No renumbering of §I–VIII invariants; Hyrum's Law lands in the deny-list to avoid breaking back-references. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 16:27:24 -07:00
Claude	e22d468e27	Add maintenance + destructive-migration test coverage The audit of test coverage flagged three holes: - `omnigraph optimize` and `omnigraph cleanup` had no integration tests (no `maintenance.rs`). Add one covering empty/idempotent edges, the policy-validation contract on `cleanup`, and head preservation under aggressive policies. - `apply_schema` only covered I32 -> I64 type-change rejection. Add the symmetric narrowing case plus rejections for the other destructive shapes (drop property with data, drop node type, drop edge type, add required property without backfill) and assert the manifest version doesn't advance. Add a positive `@rename_from` case to pin the stable-type-id contract preserves rows through a rename. - `docs/testing.md` was missing `validators.rs` and the new `maintenance.rs` from its file table; bump the count and add rows.	2026-05-12 23:36:01 +03:00
devin-ai-integration[bot]	6914e0256e	MR-786: merge-pair truth table with exhaustive op-variant matrix (#81) * MR-786: merge-pair truth table with exhaustive op-variant matrix Add crates/omnigraph/tests/merge_truth_table.rs that enumerates every (left_op, right_op) cell from the operation vocabulary named in the ticket — {noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel} — and asserts the deterministic outcome of Omnigraph::branch_merge against a structured oracle. The matrix is built in a 9x9 match in build_case, so adding a new OpVariant is a compile-time, fail-on-omission task. Today's mutation grammar only exposes insert | update set | delete (see docs/query-language.md), so the 36 cells over the first six ops are executable and the 45 cells involving dropProperty/addLabel/removeLabel are recorded as Expected::Unsupported with a note. Each executable cell spins up a fresh tempdir, applies one mutation per branch, calls branch_merge, and asserts either: * MergeOutcome (AlreadyUpToDate / FastForward / Merged) plus a GraphAssert on the affected entities, or * an OmniError::MergeConflicts whose entries match the expected table_key + MergeConflictKind (row_id is optional because edge ULIDs are generated at runtime). branch_merge is directional, so the (L, R) and (R, L) cells live in separate entries in the matrix and are run independently — the op-pair symmetry encoded in build_case serves as the commutativity oracle without doubling the runtime. End-to-end the suite runs in ~10s on a fresh build, well under the 30s budget asserted at the bottom of the test. Also adds a row to docs/testing.md so the test-coverage map points future agents at this file. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * Use one Omnigraph handle for both branches Self-review caught that the runner was opening two Omnigraph handles on the same temp dataset (one for main, a second via Omnigraph::open for feature). 
tests/branching.rs uses one handle and passes the branch name to mutate_branch — same pattern works here and avoids any cache-coherency surprises between the two handles. Also drops the post-merge reopen, which only existed to give the second handle a fresh snapshot. Runtime drops ~10s -> ~9s. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * Assert exact conflict count, not subset inclusion cubic and Devin Review both flagged that check_outcome's Expected::Conflicts arm only enforces want ⊆ got, so a regression that produces a spurious extra conflict (e.g. emitting both OrphanEdge and a stray DivergentInsert) would silently pass the truth-table cell. For a deterministic oracle that's the wrong direction — the cell pins the exact conflict-artifact set, not a lower bound. Add an assert_eq!(got.len(), want.len()) before the existence loop. All 36 executable cells still pass; runtime unchanged. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * Subsume 4 conflict tests in branching.rs into truth table The four `branch_merge_reports_*_conflict` tests (DivergentUpdate / DivergentInsert / DeleteVsUpdate / OrphanEdge) were redundant with the deterministic-oracle cells in the new `merge_truth_table.rs` and only added drift risk. To preserve the post-conflict invariant that lived in `branch_merge_reports_divergent_update_conflict` (target unchanged after a failed merge), the truth-table runner now generalizes it: on every `Conflicts` cell, main's state is asserted against `state_after_apply_only(right_op)`. That gives strictly more coverage than the deleted tests carried, since the invariant now applies to all seven conflict cells, not just one. The `UniqueViolation` and `CardinalityViolation` cases stay in `branching.rs` — they're combinatorial (require >1 op per side with a non-default schema) and out of scope for the pair-wise truth table. 
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * Fix misleading 'Total edges: 0' comment in (AddEdge, RemoveEdge) cell Devin Review flagged that the comment said 'Total edges: 0' while the parenthetical math evaluates to 1 (matching `GraphAssert::base()`). The assertion is correct; only the leading number in the comment was wrong. Reworded to 'Net edges: … = 1 (matches base)' so the prose agrees with both the math and the assertion. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> --------- Co-authored-by: Ragnor <ragnor@modernrelay.com> Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-12 22:36:01 +03:00
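The "exact conflict count, not subset inclusion" fix from the commit above can be illustrated with a small Rust sketch. It is not the code in `merge_truth_table.rs`: the `ConflictKind` enum borrows two variant names mentioned in the commit message, while `subset_check` and `exact_check` are hypothetical helpers that only mirror the described shape (a length comparison placed ahead of the existence loop).

```rust
/// Two of the conflict kinds named in the commit message; the enum itself
/// is a stand-in for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ConflictKind {
    DivergentInsert,
    OrphanEdge,
}

/// The weaker check: every expected conflict appears somewhere in `got`
/// (want ⊆ got). Extra, spurious conflicts in `got` go unnoticed.
fn subset_check(want: &[ConflictKind], got: &[ConflictKind]) -> bool {
    want.iter().all(|w| got.contains(w))
}

/// The stricter check: compare lengths first, then run the same existence
/// loop. A stray extra conflict now fails the comparison.
/// Note: for inputs with duplicate entries this is still not full multiset
/// equality; it only mirrors the length-plus-existence shape described above.
fn exact_check(want: &[ConflictKind], got: &[ConflictKind]) -> bool {
    got.len() == want.len() && subset_check(want, got)
}
```

For example, with `want = [OrphanEdge]` and `got = [OrphanEdge, DivergentInsert]`, `subset_check` accepts the result while `exact_check` rejects it, which is the regression class the commit set out to catch.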
Ragnor Comerford	3bd072c917	docs: add docs/transactions.md — branch-as-transaction explainer (#69 ) The architectural rule "no cross-query BEGIN/COMMIT; branches fill that role" lives in docs/invariants.md §VI.23 but is not surfaced anywhere user-facing. New users coming from Postgres/MySQL hit the gap when they realize multiple queries on main are independently atomic, not jointly atomic. This page explains the model with worked examples: * Single-query multi-statement (atomic by default) * Two separate queries on main (NOT atomic — common surprise) * Many queries via a branch (atomic at merge) * Coordinating multiple agents via branch-per-agent Plus a comparison table to BEGIN/COMMIT, failure-mode rundown, and "when to use what" decision matrix. Linked from AGENTS.md "Where to find each topic" between branches-commits.md and runs.md. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:35:57 +03:00
Devin AI	4eb865b340	docs: expand 0.4.2 release notes	2026-05-10 14:37:58 +00:00
Devin AI	e44a4704eb	docs: fix admission gating description	2026-05-10 14:16:26 +00:00
Devin AI	a42d178119	release: prepare omnigraph 0.4.2	2026-05-10 14:02:28 +00:00
Devin AI	6a3f0677ae	server: drop unwired try_admit_rewrite / 503 admission surface	2026-05-09 20:58:17 +00:00
Ragnor Comerford	6ef07386d3	docs+engine: refresh server.md rate-limiting note; cache version() TOCTOU Two cleanups bundled because they're both single-line, post-MR-686 hygiene flagged by cubic during PR review: - docs/server.md:102 said "Rate limiting — none" while the new admission-control section earlier in the file documents 429s on the five mutating handlers. Replace with a pointer to the admission section and clarify that no graph-wide rate limiter is wired. - schema_apply.rs:445-451 called `db.version().await` twice — once for the conditional check, once in the error format string — creating a cosmetic TOCTOU under interior mutability. Cache the result in `current_manifest_version` so the error message reflects the version that triggered the rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:59:45 +02:00
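The version-caching cleanup in the commit above follows a general pattern: read a changeable value once, then use that same reading for both the check and the error message. The Rust sketch below is a synchronous stand-in, not the code in `schema_apply.rs`; the `Db` struct, its `version` method, and `check_expected_version` are hypothetical, and only the local name `current_manifest_version` is taken from the commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for a handle whose version can change underneath
/// the caller (interior mutability), so two reads may disagree.
struct Db {
    version: AtomicU64,
}

impl Db {
    fn version(&self) -> u64 {
        self.version.load(Ordering::SeqCst)
    }
}

/// Reads the version exactly once and reuses the cached value in both the
/// comparison and the error text, so the message always reports the
/// version that actually triggered the rejection.
fn check_expected_version(db: &Db, expected: u64) -> Result<(), String> {
    let current_manifest_version = db.version();
    if current_manifest_version != expected {
        return Err(format!(
            "expected manifest version {expected}, found {current_manifest_version}"
        ));
    }
    Ok(())
}
```

Calling `db.version()` a second time inside the `format!` would reopen the gap: a concurrent update between the two reads could make the message report a version different from the one that failed the check.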
Ragnor Comerford	7aca6ddac5	docs: PR 2 documentation pass (server / architecture / §VI.23) - docs/server.md: new "Per-actor admission control (MR-686)" section documenting WorkloadController defaults, the 429/503 mapping with Retry-After semantics, the Cedar-then-admission ordering, and the /change-only-for-now scope. Adds 429 / 503 to the listed HTTP status codes and `too_many_requests` / `service_unavailable` to the ErrorCode enumeration in the error model paragraph. - docs/architecture.md: server/CLI diagram updated. Adds WorkloadController and WriteQueueManager nodes; flow is HTTP -> auth -> Cedar -> admission -> engine -> queue. Engine label changed to "Arc<Omnigraph>" to reflect the AppState flip. Prose now points at server.md and runs.md for the admission/queue contracts. The CLI's bypass-admission note is preserved. - docs/invariants.md §VI.23 status annotation: explicitly cites the per-(table, branch) writer-queue + revalidation-under-queue as closing the Lance-HEAD-vs-manifest drift class under concurrent writers once the global RwLock is removed (PR 2 Step F). Continuous in-process rollback recovery still aspirational (MR-870 ticket). scripts/check-agents-md.sh passes (26 links, 26 docs). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 17:09:49 +02:00
Ragnor Comerford	c12f6adb0c	docs/invariants: add §VI.35-37 + non-commitments for MR-686 Three new §VI invariants name what OmniGraph commits to as an agent-native system of record: branches as the cross-query coordination primitive, per-query isolation as a per-query opt-in (Serializable up, eventual down), and type-aware agent-resolvable merges. Plus an explicit non-commitments subsection so reviewers see what is intentionally out of scope (Strict Serializable across queries, cross-process linearizable single-object writes, auto-resolution of ambiguous merge conflicts). §VII and §VIII renumber by +3 to make room (35-43 -> 38-46, 44-47 -> 47-50); deny-list and review-checklist references in §IX/§X follow. testing.md's pre-existing stale §VII.33/34/36 references resolve to their actual §VIII.47/48/50 targets in the same pass. staged_writes.rs:866's docstring gains an MR-686 forward reference so the load-bearing concurrency-hazard test points readers at the queue work that closes the gap. §VI.34 is preserved alongside the broader §VI.36 to keep its MR-425 pointer addressable; the overlap is documented in §VI.36's status line. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 14:45:54 +02:00
Ragnor Comerford	a30666bc38	docs/tests: reserve Phase A/B/C/D for the per-writer recovery flow Three terminologies were calling themselves Phase A/B in PR #72: 1. Per-writer recovery (canonical, four phases A/B/C/D — sidecar / commit_staged loop / manifest publish / sidecar delete in `docs/runs.md:157`). 2. Per-table staged-write contract from MR-793 (two phases — `stage_*` then `commit_staged`). 3. Test-narrative scaffolding (Phase A = setup the failure, Phase B = verify recovery — used as section dividers in failpoints.rs). Same letters, three meanings; three reviewers including the bots have already misread the code in the resulting fog. This change keeps "Phase A/B/C/D" exclusively for #1 and rewrites the other two: - `ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable` → `ensure_indices_stage_btree_failure_leaves_existing_tables_writable` (matches the `stage_create_btree_index` API verb). - Comment at `table_ops.rs:610` and the test docstring at `failpoints.rs:807` rewrite "a Phase A failure in the staged-index path" → "a stage-step failure in the staged-index path". - Twelve `// Phase A:` / `// Phase B:` test scaffolding comment headers in `failpoints.rs` (across six test fns) become `// Setup:` / `// Recovery:`. - A "Phase letter convention" note added near `docs/runs.md:165` spells the rule out for future readers. Also bundled: rename `composite_flow_init_load_branch_merge_time_travel_optimize_cleanup` → `composite_flow_canonical_lifecycle` so it pairs as a story name with `composite_flow_multi_branch_sequential_merges` (the previously-deferred symmetry rename). No behaviour change. Both renamed tests pass; full failpoints (18) + composite_flow (2) suites pass; workspace baseline + clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 22:46:03 +02:00
Ragnor Comerford	9fc6526ec0	tests: multi-branch sequential merges compositional flow Adds `composite_flow_multi_branch_sequential_merges` covering the agent-workflow pattern that single-merge tests in `branching.rs` cannot reach: two feature branches diverging from main with main writes interleaved between every diverge point, sequential merges into main, time-travel through the resulting merge DAG, and reopen consistency over a multi-merge history. The script (18 numbered steps with assertions per step): init+load → mutate main → branch feat-a → mutate main → mutate feat-a → branch feat-b → mutate feat-b → mutate feat-a (with edge) → merge feat-a → mutate main → merge feat-b → time-travel to pre-merge-a + pre-merge-b → reopen + verify. Catches eight compositional gap categories that only surface with ≥2 merges and main mutations between them: base/LCA recomputation across two merges, manifest-pin propagation through merge commits, time-travel through merge DAG without state bleed-through, branch-DAG consistency, sibling-branch isolation under writes elsewhere, post-merge main-write integration, multi-merge reopen replay, and clean-flow recovery-sidecar absence. `composite_flow.rs` was added to `docs/testing.md` so the before-every-task checklist points agents at the file before duplicating coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 19:34:04 +02:00
Ragnor Comerford	815ff743f5	recovery: refresh-time roll-forward closes the in-process residual + invariants helper Bundle of three correctness fixes plus a shared invariants helper that existing tests now use. 1. SchemaApply atomicity: close the residual gap where a sidecar exists but staging files don't (e.g., Phase B failure BEFORE `_schema.pg.staging` write). `recover_schema_state_files` now returns a `SchemaStateRecovery` discriminator (`Noop` / `CleanedStaging` / `CompletedStagingRename { schema_apply_sidecar }`); the token threads through `recover_manifest_drift` → `process_sidecar`. SchemaApply sidecars are eligible for roll-forward ONLY when the staging rename completed in the same recovery pass. Full mode rolls back; RollForwardOnly defers. Without this, recovery would publish the manifest pin against new-schema data while `_schema.pg` stayed old (real corruption). New failpoint `schema_apply.before_staging_write` + new test `schema_apply_without_schema_staging_rolls_back_on_next_open` pin the gating. 2. Rollback target correction. Rollback now restores Lance HEAD to the current manifest pin (`state.manifest_pinned`) instead of the sidecar's `expected_version`. For UnexpectedAtP1/UnexpectedMultistep classifications these can differ; the old code could regress Lance HEAD past the manifest pin, re-introducing drift in the OTHER direction. The new behavior establishes `Lance HEAD == manifest pin` post-rollback — the canonical drift-free invariant. Param renamed from `expected_version` → `target_version` to match. Audit `to_version` records the actual restore target. This is a latent-behavior change. Any external consumer that compared `audit.to_version` against `sidecar.expected_version` for non-trivial classifications now sees the manifest pin instead. 3. Audit commit-graph unification. `record_audit` now opens the per-branch commit graph for ANY sidecar with `sidecar.branch.is_some()` — not just BranchMerge. 
Plain Mutation/Load/EnsureIndices commits on a feature branch now correctly land on that branch's commit graph, instead of main's. Closes the class of bug analogous to D2 but for non-merge writers. Pre-existing repos with non-main commits already on main's commit graph stay where they are; future recoveries write to the per-branch ref. Mixed-version compatibility is asymmetric but safe (old binaries ignore per-branch refs they don't know about; new binaries read both). 4. Recovery invariants helper + branch-axis cells. New `tests/helpers/recovery.rs` (~505 LOC) exports `assert_post_recovery_invariants(repo, op_id, RecoveryExpectation)` plus a `TableExpectation` builder. Six existing recovery tests refactored to call it; per-test bespoke assertions replaced. Two new branch-axis cells added in `tests/failpoints.rs`: - `recovery_rolls_forward_load_on_feature_branch` - `recovery_rolls_forward_ensure_indices_on_feature_branch` The loader gains a `mutation.post_finalize_pre_publisher` failpoint hook (gated on the `failpoints` feature; zero-cost in release) so the load test can pin the same Phase B → Phase C boundary the mutation path uses. Misc: - `Omnigraph::refresh` extracts `reload_schema_if_source_changed`: early-return when schema source unchanged (saves IR parse + catalog rebuild on the steady-state refresh path). - New test injection point `failpoint_publish_table_head_without_index_rebuild_for_test` under `#[cfg(feature = "failpoints")]`. Tests: 31 recovery + failpoint integration tests pass (14 + 17, up from 14 + 16). Full workspace sweep with `--features failpoints` clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-05 16:04:48 +02:00
Ragnor Comerford	aaa031e834	recovery: refresh-time roll-forward closes the in-process residual Adds RecoveryMode { Full, RollForwardOnly } and wires Omnigraph::refresh to invoke roll-forward-only recovery. This closes the documented "long-running server between Phase B failure and process restart" residual without requiring a restart, for the common case (mutation / load finalize → publisher failure). Why roll-forward only and not full sweep: * Roll-forward is safe under concurrency (publisher uses row-level CAS). * Roll-back uses Dataset::restore, which "wins" against concurrent Append/Update/Delete/CreateIndex/Merge per check_restore_txn — silently orphaning the concurrent writer's commit (pinned by tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning). Sidecars that classify as RollBack-eligible are LEFT ON DISK for the next ReadWrite open, where no concurrent writers exist and full restore is safe. Implementation: * recovery.rs: RecoveryMode enum; recover_manifest_drift takes mode; process_sidecar branches on mode for Abort and RollBack — both defer to next ReadWrite open under RollForwardOnly. RollForward behavior unchanged. * omnigraph.rs: Omnigraph::refresh promoted to pub; calls recover_manifest_drift in RollForwardOnly mode after coordinator refresh. Steady-state cost: one list_dir of __recovery (early return on empty). Adds refresh_coordinator_only — pub(crate) — for engine-internal callers that hold an in-flight sidecar (the schema_apply lease-check + lock-release paths). Without this split, refresh would race the in-flight sidecar. * schema_apply.rs: switch all 6 internal db.refresh() call sites to refresh_coordinator_only(). Tests: * refresh_runs_roll_forward_recovery_in_process — trigger mutation.post_finalize_pre_publisher; without restart, call db.refresh(); assert sidecar deleted, drifted row visible, subsequent mutation succeeds. 
* refresh_defers_rollback_eligible_sidecar_to_next_open — synthesize a Mutation sidecar with bogus expected (UnexpectedAtP1 → RollBack); refresh leaves it on disk and Lance HEAD unchanged; drop and reopen runs the full sweep which advances HEAD via restore. Docs: * docs/runs.md "Long-running servers" caveat updated to describe the refresh-time roll-forward path and the rollback-defer behavior. * docs/invariants.md §VI.23 status line updated to reflect in-process closure of the common case. Workspace tests pass with --features failpoints; no regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 00:15:42 +02:00
Ragnor Comerford	05e52f2ee0	recovery: rename composite test, strip ticket references, address review Three bundled changes: 1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and the test function). OmniGraph is consumed by both humans and agents - naming the test after one audience misframes the library. 2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and review-round labels from source, tests, and docs added by this branch. Internal traceability belongs in commit messages and PR descriptions, not in checked-in artifacts. Upstream lance-format/lance issue refs and pre-existing MR-XXX refs in docs not touched by this branch are left alone. 3. Two outstanding review findings addressed: - `needs_index_work_node` / `needs_index_work_edge`: propagate `count_rows` errors instead of `unwrap_or(0)`. Silently treating transient I/O failures as "0 rows" risked skipping a table from the recovery sidecar pin set that was actually about to be modified. - `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`: strengthen the assertion to fail when sidecar B classifies under a stale snapshot. The new assertion checks post-recovery Lance HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar deleted + audit rows present" pair passed in both the bug and fix paths because both delete the sidecar and write an audit row; the differentiator is the post-recovery HEAD. Strengthening the assertion exposed an additional nuance: in this overlapping-sidecar scenario sidecar B's audit kind is RolledBack (no-op) rather than RolledForward, since sidecar A's roll-forward publishes Lance HEAD as the new manifest pin (absorbing B's work). The docstring now explains why this is correct given current `roll_forward_all` semantics. All workspace tests pass with --features failpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 13:56:36 +02:00
Ragnor Comerford	164bafbbe7	recovery: address PR #72 review findings Bot reviewers (cubic, cursor, chatgpt-codex) caught 4 merge-blocking bugs + 3 strongly-recommended fixes + 3 doc errors in the initial PR. Each fix has a paired test demonstrating the bug before the fix. Merge-blocking fixes: - BranchMerge moved to loose-match classifier arm. publish_rewritten_merge_table runs multiple commit_staged calls per table (merge_insert + delete_where + index rebuilds). Strict classification rolled back valid completed Phase B work as UnexpectedMultistep. Three new unit tests pin the loose-match behavior for BranchMerge. - branch_merge sidecar uses self.active_branch() (the resolved target branch) instead of inferring from the first sorted table key. The previous heuristic could record None (== main) when the merge target was a non-main branch, causing recovery to publish to the wrong manifest namespace. - Best-effort sidecar delete in all 5 writer sites (mutation, loader, schema_apply, branch_merge, ensure_indices). Previously, a sidecar cleanup failure after a successful manifest publish would error out the user's call for a write that already landed. Now: log a warning and ignore — the next open's recovery sweep tidies the stale sidecar via NoMovement classification. - ensure_indices sidecar scoped to tables that need work via new helpers needs_index_work_node / needs_index_work_edge. Previously the sidecar pinned every catalog table; if only one needed indexing, the others classified as NoMovement and the all-or-nothing decision rolled back legitimate index work. Strongly-recommended fixes: - recover_manifest_drift now takes &mut GraphCoordinator and refreshes between sidecars. Sidecar B's classification needs to see sidecar A's manifest changes, otherwise B can be classified against stale pins and incorrectly roll back work that just landed. - list_sidecars sorts URIs before reading. 
Sidecar filenames are ULIDs (chronologically sortable), so this gives deterministic, time-ordered processing. Filesystem-order was nondeterministic. - ReadOnly opens skip recover_schema_state_files too (was: only the MR-847 sweep was gated). Read-only consumers may run with read-only credentials; silent open-time mutations violate the contract. Doc cleanups: - Removed stale "Phase 4 placeholder" comment from recover_manifest_drift. - docs/runs.md decision-tree wording now correctly surfaces the InvariantViolation abort path. - docs/branches-commits.md clarifies actor_id is in _graph_commit_actors.lance (joined by graph_commit_id), not on _graph_commits.lance itself. Test surface (post-fixes): - 25 unit tests in db::manifest::recovery (+4 from this commit). - 10 integration tests in tests/recovery.rs (+3 from this commit). - ~672 tests across ~25 binaries pass with --features failpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-03 12:21:40 +02:00


87 commits