omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-21 02:28:07 +02:00

Author	SHA1	Message	Date
Ragnor Comerford	4e2f18a95e	mr-668: split PolicyEngine::load into kind-typed loaders
Pre-fix, every caller of `PolicyEngine::load(path, graph_id)` passed some `graph_id` argument, even when the policy was server-scoped and Cedar's resolution would never touch a Graph entity. The server-level loader at lib.rs passed the meaningless sentinel `"server"`. A graph policy file containing a `graph_list` rule compiled fine; a server policy file containing a `read` rule compiled fine. Both silently no-op'd at request time because the engine kind and the rule's resource kind disagreed.
Correct-by-design fix: replace `load` with two kind-typed loaders.
* `PolicyEngine::load_graph(path, graph_id)`: for per-graph policy files. Rejects any rule whose action `resource_kind()` is `Server`.
* `PolicyEngine::load_server(path)`: for server-level policy files. Takes no `graph_id`: server-scoped actions resolve against the singleton `Omnigraph::Server::"root"` entity, never a Graph. Rejects any rule whose action `resource_kind()` is `Graph`.
The old `load` is hard-deleted in the same commit because every in-tree consumer migrates here (no semver promise on the workspace crate, no external pinners).
New `PolicyEngineKind` enum types the loader's intent; `validate_kind_alignment` is the load-time check that closes the "wrong action, wrong file, silent no-op" class: operators get a load-time error instead of confused-and-silent behavior at request time.
Callsites migrated:
* server lib.rs:374 (single-mode per-graph) → load_graph
* server lib.rs:1065 (multi-mode server) → load_server
* server lib.rs:1103 (multi-mode per-graph) → load_graph
* CLI main.rs:732 (resolve_policy_engine) → load_graph
* tests/server.rs ×5 (4 graph, 1 server) → load_graph/load_server
* policy_engine_chassis.rs → load_graph
Four new in-source tests pin the contract: both rejection paths and both positive paths. Closes the "operator puts an action in the wrong file and the rule silently never matches" class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:35:22 +02:00
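The load-time check described in 4e2f18a95e can be sketched roughly as follows. This is a minimal illustration, not the crate's code: the names `PolicyEngineKind`, `validate_kind_alignment`, and `resource_kind()` come from the commit message, but the signatures, the reduced three-action vocabulary, and the `KindMismatch` error type are assumptions made for the sketch. File parsing and Cedar compilation are left out entirely.

```rust
/// Which Cedar resource an action is evaluated against (names from the commit log).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyResourceKind {
    Graph,
    Server,
}

/// Reduced action vocabulary, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyAction {
    Read,
    Change,
    GraphList,
}

impl PolicyAction {
    pub fn resource_kind(self) -> PolicyResourceKind {
        match self {
            PolicyAction::Read | PolicyAction::Change => PolicyResourceKind::Graph,
            PolicyAction::GraphList => PolicyResourceKind::Server,
        }
    }
}

/// Which loader was called: `load_graph` or `load_server`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyEngineKind {
    Graph,
    Server,
}

/// Hypothetical error type: the first action that does not belong in this file.
#[derive(Debug, PartialEq, Eq)]
pub struct KindMismatch {
    pub action: PolicyAction,
    pub engine: PolicyEngineKind,
}

/// Assumed shape of the load-time check: every action in the file must bind to
/// the resource kind the loader expects, otherwise loading fails up front.
pub fn validate_kind_alignment(
    engine: PolicyEngineKind,
    actions: &[PolicyAction],
) -> Result<(), KindMismatch> {
    let expected = match engine {
        PolicyEngineKind::Graph => PolicyResourceKind::Graph,
        PolicyEngineKind::Server => PolicyResourceKind::Server,
    };
    for &action in actions {
        if action.resource_kind() != expected {
            return Err(KindMismatch { action, engine });
        }
    }
    Ok(())
}
```

Under these assumptions, a `graph_list` rule in a per-graph file fails at load time with the offending action named, rather than compiling and never matching.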
Ragnor Comerford	2bb6e24fe3	mr-668: drop vestigial PolicyEngine surface * `validate_request` had zero callsites — pure surface for nothing. * `deny`'s `_actor_id` and `_request` parameters were both unused (the underscore prefix gave it away); the message is built by the caller before `deny` ever sees the request. Trim both. Closes the "public API that the type system can't justify" class for the policy engine. No behavior change; every existing test stays green because the deletions never had a runtime effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 13:04:59 +02:00
Ragnor Comerford	76ee061cac	mr-668: drop actor_id from PolicyRequest; pass actor as separate arg The MR-731 "server-authoritative actor identity" invariant was enforced by an in-function chokepoint (`request.actor_id = actor.actor_id...` overwrite inside `authorize_request`). That worked but relied on every caller passing in a `PolicyRequest` and trusting the overwrite — a comment-enforced invariant. Move the invariant into the type system: * `PolicyRequest` no longer carries `actor_id`. The struct now models what a caller wants to do, not who they are. * `PolicyEngine::authorize(actor_id: &str, request: &PolicyRequest)` and `validate_request(actor_id, request)` take identity as a separate argument. The same shape `PolicyChecker::check` already had for the engine layer. * `authorize_request` in the HTTP layer extracts `actor_id` from the bearer-resolved `ResolvedActor` and passes it positionally — no overwrite step that could be skipped. * CLI `omnigraph policy explain` updated (the only other consumer that built a `PolicyRequest`). Public API break for the `omnigraph-policy` crate. Worth it: handlers can no longer accidentally populate `actor_id` from a request body field, and external consumers are forced by the compiler to source actor identity from a trusted path. The MR-731 chokepoint test `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` still passes — the bearer-resolved actor is what reaches the engine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:00:52 +02:00
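The API shape that 76ee061cac describes, with identity passed beside the request rather than inside it, might look like the sketch below. The `authorize(actor_id: &str, request: &PolicyRequest)` parameter list and the `ResolvedActor` and `authorize_request` names are from the commit message. Everything else is an assumption for illustration: the struct fields, the `String` error type, the `new` constructor, and the grant-table lookup that stands in for Cedar evaluation.

```rust
/// What the caller wants to do. There is deliberately no `actor_id` field, so a
/// handler cannot populate identity from a request body.
pub struct PolicyRequest {
    pub action: String,
    pub branch: Option<String>,
}

/// Identity resolved from the bearer token by the HTTP layer.
pub struct ResolvedActor {
    pub actor_id: String,
}

/// Stand-in engine: a flat (actor, action) grant table instead of Cedar.
pub struct PolicyEngine {
    grants: Vec<(String, String)>,
}

impl PolicyEngine {
    /// Hypothetical constructor for the sketch.
    pub fn new(grants: &[(&str, &str)]) -> Self {
        Self {
            grants: grants
                .iter()
                .map(|(actor, action)| (actor.to_string(), action.to_string()))
                .collect(),
        }
    }

    /// Identity arrives as its own argument; the request only describes intent.
    pub fn authorize(&self, actor_id: &str, request: &PolicyRequest) -> Result<(), String> {
        let allowed = self
            .grants
            .iter()
            .any(|(actor, action)| actor == actor_id && *action == request.action);
        if allowed {
            Ok(())
        } else {
            Err(format!("{actor_id} may not {}", request.action))
        }
    }
}

/// HTTP-layer wrapper: the only source of `actor_id` is the resolved actor, so
/// there is no overwrite step that a caller could skip.
pub fn authorize_request(
    engine: &PolicyEngine,
    actor: &ResolvedActor,
    request: &PolicyRequest,
) -> Result<(), String> {
    engine.authorize(&actor.actor_id, request)
}
```

The design point is that a caller cannot construct a request that carries a forged identity, because the type has nowhere to put one.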
Ragnor Comerford	52f28cebe8	mr-668: comment cleanup and policy format style Strip "PR Na/Nb" sub-PR references throughout MR-668 surfaces — they were useful during the 10-PR delivery sequence but rot now that the work is in the tree. Keep the MR-668 umbrella references. Also: - Add explicit `when = when` and `resource_literal = resource_literal` named args in `compile_policy_source`'s outer `format!` to match the surrounding crate style (already explicit for `group` and `action`). - Rename the best-effort cleanup tracing target from "omnigraph::init" to "omnigraph::init::cleanup" so operators can filter init-failure cleanup events separately from init's other log lines. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 11:57:04 +02:00
Ragnor Comerford	937fd6382d	mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt) The POST /graphs runtime-create endpoint shipped in PR 7/10 has three unresolved high-severity bugs: - flock-on-renamed-inode race: the YAML flock is taken on omnigraph.yaml itself, then a temp file is renamed over it. Cross-process writers end up locking different inodes — both believing they hold exclusive access. - duplicate-check outside the file lock: precheck runs against the in-memory registry only; the locked closure does config.graphs.insert(...) unconditionally. Concurrent same-id POSTs can persist the loser in YAML while the in-memory registry keeps the winner — they disagree after restart. - best_effort_cleanup_init_artifacts deletes _schema.pg / _schema.ir.json / __schema_state.json on any init failure. An accidental re-init against an existing graph's URI destroys its schema; subsequent open() fails at read_text(_schema.pg). The correct fix is a Lance-style cluster catalog (reserve → init → publish with recovery sidecars), parallel to the engine's existing __manifest discipline. That work is out of scope for v0.7.0. For now, disable runtime add/remove from the network and CLI surface. Operators add graphs by editing omnigraph.yaml and restarting. The GET /graphs read-only enumeration stays. 
Removed: - POST /graphs handler + router fragment + utoipa registration - 13 post_graphs_* server tests + 3 composite POST tests + multi_mode_app_with_real_config / post_graph helpers - CLI omnigraph graphs create subcommand + its handler + cli.rs tests - system_remote.rs combined list+create test trimmed to list-only - YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError, staging_path, hash_config_file, AppState::config_hash field + threading through new_multi and open_multi_graph_state - fs2 dependency (verified absent from cargo tree) - sha2/fs2 imports in config.rs (only the rewrite path used them) - Cedar PolicyAction::GraphCreate variant + "graph_create" match arms + action def in Cedar schema + graph_create_action_authorizes_against_server_resource test - GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec / GraphPolicySpec API types (only the POST handler / CLI imported them) Kept: - GET /graphs (read-only enumeration) and graph_list Cedar action - omnigraph graphs list CLI subcommand - All multi-graph startup, mode inference, cluster routes, per-graph + server-level Cedar policies - server_settings_drive_multi_graph_startup_end_to_end (the test that covers operator-authored YAML + restart — the path that survives) - best_effort_cleanup_init_artifacts and the three init failpoints (still reachable from CLI `omnigraph init`; preflight fix deferred as a follow-up) - GraphRegistry::insert and its concurrency tests — production callers gone, but the method is the natural seam for the future cluster-catalog work Also fixed (transcript issue 4): - ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI advertises the management route correctly (was previously rewritten to /graphs/{graph_id}/graphs) - multi_mode_openapi_keeps_healthz_flat → renamed to multi_mode_openapi_keeps_management_paths_flat, asserts both /healthz and /graphs stay flat - multi_mode_openapi_prefixes_operation_ids_with_cluster skips /graphs in addition to /healthz 
Doc fixes: - docs/user/cli.md: graphs list example was --target http://..., but --target is a config-graph-name lookup; corrected to --uri. Removed the graphs create example. - docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml ownership", and "POST /graphs body shape" sections. Added a paragraph stating runtime add/remove is not exposed in v0.7.0. - docs/user/policy.md: dropped graph_create action; reworded the "Configuration" line to clarify that server-scoped rules (graph_list) take neither branch_scope nor target_branch_scope. - docs/releases/v0.7.0.md: rewrote release narrative — multi-graph mode ships; runtime add/remove deferred. - AGENTS.md: HTTP server bullet and capability matrix row updated to reflect read-only GET /graphs and the operator-edit workflow. - openapi.json regenerated; /graphs has only .get, no .post. Diff: 17 files, +123 −1525 LOC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-26 17:49:38 +02:00
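The `ALWAYS_FLAT_PATHS` fix in 937fd6382d can be pictured with a small sketch. Only the constant's name, the `/healthz` and `/graphs` entries, and the `/graphs/{graph_id}` cluster prefix are taken from the commit message. The function `multi_mode_path` and the `/read` route used in the usage note are hypothetical, and the real OpenAPI rewriting presumably operates on a richer document than a bare path string.

```rust
/// Management paths that must not receive the per-graph cluster prefix.
pub const ALWAYS_FLAT_PATHS: &[&str] = &["/healthz", "/graphs"];

/// Hypothetical helper: map a single-mode path to its multi-mode form.
pub fn multi_mode_path(path: &str) -> String {
    if ALWAYS_FLAT_PATHS.contains(&path) {
        // Management route: advertised as-is.
        path.to_string()
    } else {
        // Per-graph route: nested under the cluster prefix.
        format!("/graphs/{{graph_id}}{path}")
    }
}
```

With `/graphs` missing from the list, `multi_mode_path("/graphs")` would produce `/graphs/{graph_id}/graphs`, which is the mis-advertised route the commit mentions. A per-graph path such as `/read` still maps to `/graphs/{graph_id}/read`.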
Ragnor Comerford	d11c18fb27	mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10) PR 9 — the final integration PR for MR-668 multi-graph server work. Closes the v0.7.0 release. Composite lifecycle tests (closes gaps flagged in PR 7's coverage review): - `multi_graph_lifecycle_post_query_restart_persistence` — POST a graph, query it via cluster route, reload the config from disk and confirm `load_server_settings` sees the rewritten YAML. Validates the "restart resolves orphans" failure-mode story. - `per_graph_policy_enforced_on_post_created_graph` — POST a graph with a per-graph policy attached, then send authenticated read and change requests. Per-graph Cedar enforcement fires correctly on a POST-created graph (engine-layer policy reinstalled via `Omnigraph::with_policy` inside the create flow). - `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent POSTs with distinct graph_ids all return 201. Caught a real race in `rewrite_atomic` (see below). Race fix — `rewrite_atomic_with_modify`: The first composite test surfaced a real bug. The old `rewrite_atomic(path, new_config, expected_hash)` captured the baseline hash OUTSIDE the flock, then called rewrite_atomic which re-acquired it inside. Under concurrent writers: - POST A: captures baseline H0, calls rewrite_atomic. - POST B: captures baseline H0 too (before A's update lands). - A: acquires flock, on-disk == H0, writes H1, releases. - A: updates baseline H0 → H1. - B: tries to acquire flock — waits. - B: acquires flock. On-disk is now H1. Expected (captured before A finished) is H0. MISMATCH → spurious Drift error. Worse: even if the timing happens to align, B's `updated` config was constructed from BYTES read before the flock. B writes a config that doesn't include A's new graph — silent data loss. The fix: new `config::rewrite_atomic_with_modify(path, baseline, modify)` takes a closure. Inside the flock + baseline mutex: 1. Read on-disk bytes, hash, compare to baseline. 2. 
Parse on-disk YAML. 3. Call `modify(parsed)` to produce the new config — receives fresh on-disk state, returns the modification. 4. Serialize + write + fsync + rename + update baseline. Everything is read-modify-write under the same critical section. Concurrent writers serialize cleanly. Test confirmed this is no longer a race. The old `rewrite_atomic(path, new_config, expected_hash)` API stays for tests that don't need the read-modify-write shape; the POST handler switches to the new shape. Version bump v0.6.0 → v0.7.0: - All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server) plus their inter-crate `path` dep version constraints. - `Cargo.lock` regenerated by `cargo build --workspace`. - `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server row updated to mention multi-graph + cluster routes + atomic YAML rewrite. - `openapi.json` regenerated. Docs: - `docs/releases/v0.7.0.md` (new) — release notes with breaking changes, new features, deferred items (DELETE, `delete_prefix`, actor forwarding), and the single→multi migration recipe. - `docs/user/server.md` — substantial section additions for the two modes, mode inference, cluster endpoint table, management endpoints, `omnigraph.yaml` ownership contract, `POST /graphs` body shape + status codes. - `docs/user/cli.md` — `omnigraph graphs list/create` section, deferred-DELETE note. - `docs/user/policy.md` — server-scoped Cedar actions (`graph_create`, `graph_list`), per-graph vs server-level policy composition, example server-level policy. Workspace test pass: 573 tests green across all crates. Zero failures. MR-731 spoof regression still pinned and passing across the entire 10-PR series. This commit closes MR-668. v0.7.0 is ready for tagging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 21:32:49 +02:00
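The four-step read-modify-write in d11c18fb27 can be sketched as below, under loud assumptions. An in-process `Mutex<u64>` stands in for both the flock and the baseline mutex, so cross-process locking is not modelled. The standard library's `DefaultHasher` replaces the real content hash. The config is treated as a list of text lines rather than parsed YAML. The names `rewrite_with_modify`, `hash_bytes`, and `RewriteError` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::{self, File};
use std::hash::{Hash, Hasher};
use std::io::{self, Write};
use std::path::Path;
use std::sync::Mutex;

/// Stand-in content hash (the real code's hash function is not modelled here).
pub fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

#[derive(Debug)]
pub enum RewriteError {
    /// On-disk content no longer matches the baseline hash.
    Drift,
    Io(io::Error),
}

impl From<io::Error> for RewriteError {
    fn from(err: io::Error) -> Self {
        RewriteError::Io(err)
    }
}

/// Read, compare, modify, and write inside one critical section.
pub fn rewrite_with_modify<F>(
    path: &Path,
    baseline: &Mutex<u64>,
    modify: F,
) -> Result<(), RewriteError>
where
    F: FnOnce(Vec<String>) -> Vec<String>,
{
    // The guard is held until the function returns, covering every step below.
    let mut expected = baseline.lock().unwrap();

    // 1. Read on-disk bytes, hash, compare to baseline.
    let on_disk = fs::read(path)?;
    if hash_bytes(&on_disk) != *expected {
        return Err(RewriteError::Drift);
    }

    // 2. "Parse" the on-disk config (lines here, YAML in the real code).
    let parsed: Vec<String> = String::from_utf8_lossy(&on_disk)
        .lines()
        .map(str::to_owned)
        .collect();

    // 3. The closure sees fresh on-disk state and returns the new config.
    let updated = modify(parsed);

    // 4. Serialize, write to a temp file, fsync, rename, update baseline.
    let mut bytes = updated.join("\n").into_bytes();
    bytes.push(b'\n');
    let tmp = path.with_extension("tmp");
    let mut file = File::create(&tmp)?;
    file.write_all(&bytes)?;
    file.sync_all()?;
    fs::rename(&tmp, path)?;
    *expected = hash_bytes(&bytes);
    Ok(())
}
```

Because the closure runs after the fresh read and under the same guard, a second writer builds its update on top of the first writer's result instead of on stale bytes, which is the data-loss case the commit describes.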
Ragnor Comerford	0e5aa036f4	mr-668: Cedar resource-model refactor (PR 6a/10) PR 6a of the MR-668 multi-graph server work. Policy-crate-only refactor — no HTTP handler changes, no operator-supplied policy.yaml changes. Sets up the chassis that PR 6b's `GET /graphs` consumes. Two new `PolicyAction` variants: - `GraphCreate` — gates `POST /graphs` (deferred behavioral PR). - `GraphList` — gates `GET /graphs` (lands in PR 6b). Note: `GraphDelete` is intentionally NOT added in this PR. `DELETE /graphs/{id}` is deferred from MR-668's v0.7.0 scope to bound complexity (no `delete_prefix`, no tombstone, no `RegistryLookup::Tombstoned`). Adding the Cedar action without a consumer would be the same kind of "dead vocabulary" trap the `Admin` variant already documents. New `PolicyResourceKind { Graph, Server }` enum, plus a `PolicyAction::resource_kind()` method that classifies every action. Per-graph actions (Read, Change, BranchCreate, …) bind to `Omnigraph::Graph::"<graph_label>"`; server-scoped actions (GraphCreate, GraphList) bind to the singleton `Omnigraph::Server::"root"`. `Admin` stays classified as per-graph for now — MR-724 will pick the final shape when the first consumer surface ships. Cedar schema string additions: - `entity Server;` - `action "graph_create" appliesTo { principal: Actor, resource: Server, ... }` - `action "graph_list" appliesTo { principal: Actor, resource: Server, ... }` Compiler updates: - `compile_policy_source` picks the resource literal based on the action's `resource_kind`. Existing graph-only policies generate the same Cedar source as before — pinned by `per_graph_rules_continue_to_work_alongside_server_rules`. - `compile_entities` includes the `Server::"root"` entity only when a rule references a server-scoped action. Keeps test assertions for graph-only policies tight. - `PolicyEngine::authorize` builds the right resource UID at request time based on `request.action.resource_kind()`. 
Validation rules added to `PolicyConfig::validate`: - A rule may not mix server-scoped and per-graph actions (different resource kinds need different `permit` clauses). - Server-scoped actions cannot have `branch_scope` or `target_branch_scope` — there's no branch context at the server level. Operator impact: zero. The Cedar schema `Omnigraph::Server` entity is internally referenced by `compile_policy_source`; operator policy.yaml files only declare actions in `rules[].allow.actions` and never reference the resource entity directly. Decision 6's "internal rename only; operator policies unaffected" contract is preserved and pinned by `per_graph_rules_continue_to_work_alongside_server_rules`. Tests: 5 new (11 policy tests total, up from 6): - `graph_list_action_authorizes_against_server_resource` - `graph_create_action_authorizes_against_server_resource` - `server_scoped_rule_cannot_use_branch_scope` - `rule_mixing_server_and_per_graph_actions_is_rejected` - `per_graph_rules_continue_to_work_alongside_server_rules` No regression: 145 server tests (74 lib + 71 integration) still green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-25 20:20:35 +02:00
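The two validation rules and the resource-literal choice from 0e5aa036f4 could be sketched like this. The enum and method names and the two Cedar entity literals are taken from the commit message. The `Rule` struct, the `validate_rule` and `resource_literal` functions, their signatures, and the rejection of an empty action list are assumptions for illustration and are not the actual `PolicyConfig::validate` code.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyResourceKind {
    Graph,
    Server,
}

/// Reduced action vocabulary, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyAction {
    Read,
    Change,
    BranchCreate,
    GraphList,
}

impl PolicyAction {
    pub fn resource_kind(self) -> PolicyResourceKind {
        match self {
            PolicyAction::GraphList => PolicyResourceKind::Server,
            _ => PolicyResourceKind::Graph,
        }
    }
}

/// Hypothetical in-memory form of one policy rule.
#[derive(Debug, Default)]
pub struct Rule {
    pub actions: Vec<PolicyAction>,
    pub branch_scope: Option<String>,
    pub target_branch_scope: Option<String>,
}

/// Returns the single resource kind the rule binds to, or a validation error.
pub fn validate_rule(rule: &Rule) -> Result<PolicyResourceKind, String> {
    // Assumption of this sketch: an empty action list is rejected.
    let first = rule.actions.first().ok_or("rule lists no actions")?;
    let kind = first.resource_kind();

    // Rule 1: no mixing of server-scoped and per-graph actions.
    if rule.actions.iter().any(|a| a.resource_kind() != kind) {
        return Err("rule mixes server-scoped and per-graph actions".to_string());
    }

    // Rule 2: server-scoped rules have no branch context.
    if kind == PolicyResourceKind::Server
        && (rule.branch_scope.is_some() || rule.target_branch_scope.is_some())
    {
        return Err("server-scoped rule cannot use branch_scope or target_branch_scope".to_string());
    }
    Ok(kind)
}

/// The Cedar resource literal a compiled `permit` clause would reference.
pub fn resource_literal(kind: PolicyResourceKind, graph_label: &str) -> String {
    match kind {
        PolicyResourceKind::Graph => format!("Omnigraph::Graph::\"{graph_label}\""),
        PolicyResourceKind::Server => "Omnigraph::Server::\"root\"".to_string(),
    }
}
```

Since a validated rule resolves to exactly one resource kind, the compiler can emit one `permit` clause per rule with a single resource literal, which is why mixed rules have to be rejected.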
Ragnor Comerford	cc2412dc65	Rename repo terminology to graph (#118)	2026-05-24 16:46:00 +01:00
Andrew Altshuler	bb1fe57640	release: v0.5.0 (#115 ) * gitignore: exclude docs/internal/ from publication Mirrors the existing "Local-only working files (not for the public repo)" pattern. Working notes filed under docs/internal/ stay on the contributor's machine instead of cluttering the published doc tree or tripping the AGENTS.md / docs-index cross-link check (scripts/check-agents-md.sh enumerates every docs/.md and requires each one to be linked from an audience index — internal notes don't have an audience index by definition). Incidental to the v0.5.0 release; lands separately from the version bump commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> ci: skip docs/internal/ in agents-md cross-link check Matches the .gitignore exclusion. Mirrors the existing 'docs/releases/' exclusion pattern: notes under docs/internal/ aren't part of the published doc tree and don't need to be linked from an audience index. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * release: v0.5.0 — Lance 6 substrate, Cedar policy engine, schema-lint v1 Bumps the workspace from 0.4.2 to 0.5.0. Release notes at docs/releases/v0.5.0.md. Three user-visible pillars motivate the minor bump: 1. Lance 6.0.1 substrate (DataFusion 52→53, Arrow 57→58) 2. Engine-wide Cedar policy enforcement on every _as writer; server defaults to deny-all; signed-token-claim-only actor identity 3. Schema-lint v1 chassis: OG-XXX-NNN codes, soft drops, and `--allow-data-loss` (Hard mode) for destructive migrations Plus structured DataFusion Expr filter pushdown (unblocks CompOp::Contains via array_has), HTTP allow_data_loss parity, inline .gq sources on CLI/HTTP, optional CORS layer, and bug fixes (merge-insert dup-rowid, branch-merge coordinator restore on error, blob columns in branch merge). 
Sites bumped: - 5 crate [package].version lines (omnigraph, omnigraph-cli, omnigraph-compiler, omnigraph-policy, omnigraph-server) - 10 internal path-dep `version = "..."` constraints across the four manifests that depend on sister crates (engine, server, cli, plus engine's dev-dep on the compiler) - Cargo.lock (regenerated via cargo update --workspace) - AGENTS.md "Version surveyed:" - openapi.json `info.version` (regenerated via OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi) Verification: - cargo test --workspace --locked: 907/907 green - cargo test -p omnigraph-engine --test failpoints --features failpoints: 19/19 green - cargo test -p omnigraph-engine --test lance_surface_guards: 3/3 - scripts/check-agents-md.sh: clean Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-23 13:59:42 +01:00
Andrew Altshuler	9973683261	policy: chassis core — omnigraph-policy crate + Omnigraph::enforce() (MR-722) (#102 ) PR #2 of the policy chassis series (PR #1 = MR-731, merged in #101). The structural fix that moves Cedar enforcement from HTTP-only to engine-wide. apply_schema is the proof-of-concept writer; PR #3 fans the enforce() call out to the remaining six (mutate_as, load, ingest_as, branch_create_from, branch_delete, branch_merge). ## What lands ### New crate: omnigraph-policy The 844-line policy.rs moves from `omnigraph-server` into a new `omnigraph-policy` workspace crate so both engine and server can depend on it. Cedar dependency moves with it. The server's policy.rs becomes a re-export shim (`pub use omnigraph_policy::`) so existing `omnigraph_server::PolicyAction` etc. paths keep working — CLI and test consumers don't have to migrate in one go. ### New trait: PolicyChecker ```rust pub trait PolicyChecker: Send + Sync { fn check(&self, action: PolicyAction, scope: &ResourceScope, actor: &str) -> Result<(), PolicyError>; } ``` `PolicyEngine` (Cedar-backed) implements it. `Omnigraph::with_policy()` takes `Arc<dyn PolicyChecker>`. Engine tests mock the trait without spinning up Cedar. MR-725 will extend the trait with `predicate_for()` for query-layer pushdown — additive, no call-site changes. ### New enum: ResourceScope Four variants — Graph, Branch, TargetBranch, BranchTransition — mapping cleanly to today's `(branch, target_branch)` shape on PolicyRequest via `to_branch_pair()`. Each engine writer picks the variant that matches the existing HTTP-layer convention so engine and HTTP evaluate the same Cedar decision. Invariant: ResourceScope stays at branch granularity. Per-type and per-row scope are MR-725's territory, not engine-layer's. Adding Type/Row variants here creates two places per-type policy can be evaluated, which can drift. See chassis design refinements comment on MR-722 (2026-05-17). 
### Omnigraph::with_policy() + enforce() New `policy: Option<Arc<dyn PolicyChecker>>` field on Omnigraph, None by default (preserves embedded/dev no-enforcement mode). * `with_policy(self, checker)` setter — builder-style, consumes self. * `enforce(action, scope, actor)` — the gate. When policy is None, no-op. When policy is Some AND actor is None, hard error — silent bypass via "I forgot the actor" is exactly the footgun this gate is here to prevent. ### apply_schema_as: first writer wired * New public method `apply_schema_as(source, options, actor)` that calls `enforce(SchemaApply, TargetBranch("main"), actor)` before acquiring the schema-apply lock or doing any other work. * Existing `apply_schema(source)` and `apply_schema_with_options(...)` delegate to it with actor=None (no-actor variants). * HTTP handler `server_schema_apply` updated to call apply_schema_as with the resolved actor. AppState construction injects the PolicyEngine into Omnigraph via `with_policy`. HTTP-layer authorize_request still fires first; the engine gate is the redundant-but-correct backstop and the only path that protects SDK / embedded callers. PR #3 removes the HTTP redundancy. ### OmniError::Policy New error variant for engine-layer policy denial / evaluation failure. ApiError::from_omni maps it to 403. ### MR-724 Admin action — Option A reservation PolicyAction::Admin kept in the enum with a load-bearing doc comment naming its future consumers (hot reload, audit log query, approvals list per MR-726 / MR-732 / MR-734). No enforce(Admin, ...) call site exists yet — the variant is reserved so the action vocabulary is complete from chassis day one. MR-724 closes when the first consumer surface ships. 
### New SDK-side integration test `crates/omnigraph/tests/policy_engine_chassis.rs` — four tests covering: * Policy denies for unauthorized actor → OmniError::Policy * Policy permits for authorized actor → apply succeeds * Policy installed + no actor → hard error (forget-the-actor footgun) * No policy → no-op (embedded/dev default still works) These exercise the engine path directly — no HTTP layer involved. ## Test results - cargo test --workspace --locked --no-fail-fast: 851 passed, 0 failed * 45 server tests (existing) pass * 14 schema_apply tests (existing) pass * 4 new chassis tests pass * 60 OpenAPI tests pass (no HTTP API surface changes) * No regressions across the workspace ## Architectural decisions baked in Per MR-722 chassis design refinements comment (2026-05-17): 1. PolicyChecker is a trait, not just a concrete. Engine and server consume the trait. MR-725 adds predicate_for() additively. 2. ResourceScope stays at branch granularity. No Type/Row variants. 3. Coarse-vs-fine framing pinned: engine-layer is action gate; query-layer (MR-725) is predicate gate. Both backed by same Cedar engine; non-overlapping responsibilities. 4. Admin action reserved for policy-management surfaces (MR-724 Option A). ## Pending follow-ups (PR #3+) - Fan-out enforce() to mutate_as, load, ingest_as, branch_create_from, branch_delete, branch_merge (PR #3). - Remove HTTP-layer authorize_request redundancy once engine gate covers all writers (PR #3). - CLI policy injection into Omnigraph for non-`policy validate\|test\|explain` subcommands (PR #3 or follow-up). - MR-723 default-deny 3-state matrix (PR #4). - MR-736 severity warn/deny (PR #5). - AGENTS.md scope-of-enforcement rewrite once chassis fully lands. - Coarse-vs-fine framing in docs/user/policy.md. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 00:36:36 +03:00
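The engine gate from 9973683261 and its four-case contract can be sketched as follows. The `PolicyChecker` trait signature is quoted from the commit message, and `with_policy`, `enforce`, `OmniError::Policy`, and the four `ResourceScope` variant names also come from it. The variant payloads, the `PolicyError` and `OmniError` definitions, the `Option<&str>` actor parameter, and the `AllowOnly` mock are assumptions for illustration.

```rust
use std::sync::Arc;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyAction {
    SchemaApply,
}

/// Variant names from the commit log; payload shapes are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResourceScope {
    Graph,
    Branch(String),
    TargetBranch(String),
    BranchTransition { from: String, to: String },
}

#[derive(Debug, PartialEq, Eq)]
pub struct PolicyError(pub String);

pub trait PolicyChecker: Send + Sync {
    fn check(&self, action: PolicyAction, scope: &ResourceScope, actor: &str)
        -> Result<(), PolicyError>;
}

#[derive(Debug, PartialEq, Eq)]
pub enum OmniError {
    Policy(String),
}

/// Stand-in for the engine handle: only the policy field is modelled.
#[derive(Default)]
pub struct Omnigraph {
    policy: Option<Arc<dyn PolicyChecker>>,
}

impl Omnigraph {
    /// Builder-style setter that consumes `self`.
    pub fn with_policy(mut self, checker: Arc<dyn PolicyChecker>) -> Self {
        self.policy = Some(checker);
        self
    }

    /// The gate: no-op without a policy, hard error when a policy is installed
    /// but the caller supplied no actor, otherwise defer to the checker.
    pub fn enforce(
        &self,
        action: PolicyAction,
        scope: &ResourceScope,
        actor: Option<&str>,
    ) -> Result<(), OmniError> {
        let checker = match &self.policy {
            None => return Ok(()),
            Some(checker) => checker,
        };
        let actor = actor.ok_or_else(|| {
            OmniError::Policy("policy installed but no actor supplied".to_string())
        })?;
        checker
            .check(action, scope, actor)
            .map_err(|err| OmniError::Policy(err.0))
    }
}

/// Mock checker for tests: permits exactly one actor, no Cedar involved.
pub struct AllowOnly(pub &'static str);

impl PolicyChecker for AllowOnly {
    fn check(&self, _action: PolicyAction, _scope: &ResourceScope, actor: &str)
        -> Result<(), PolicyError> {
        if actor == self.0 {
            Ok(())
        } else {
            Err(PolicyError(format!("{actor} denied")))
        }
    }
}
```

Because the engine depends on the trait rather than the Cedar-backed `PolicyEngine`, a mock like `AllowOnly` is enough to exercise all four cases: deny, permit, missing actor, and no policy.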

10 commits