mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-21 02:28:07 +02:00
10 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
4e2f18a95e
|
mr-668: split PolicyEngine::load into kind-typed loaders
Pre-fix, every caller of `PolicyEngine::load(path, graph_id)` passed *some* `graph_id` argument, even when the policy was server-scoped and Cedar's resolution would never touch a Graph entity. The server-level loader at lib.rs passed the meaningless sentinel `"server"`. A graph policy file containing a `graph_list` rule compiled fine; a server policy file containing a `read` rule compiled fine. Both silently no-op'd at request time because the engine kind and the rule's resource kind disagreed.
Correct-by-design fix: replace `load` with two kind-typed loaders.
* `PolicyEngine::load_graph(path, graph_id)`: for per-graph policy files. Rejects any rule whose action `resource_kind()` is `Server`.
* `PolicyEngine::load_server(path)`: for server-level policy files. Takes no `graph_id`: server-scoped actions resolve against the singleton `Omnigraph::Server::"root"` entity, never a Graph. Rejects any rule whose action `resource_kind()` is `Graph`.
The old `load` is hard-deleted in the same commit because every in-tree consumer migrates here (no semver promise on the workspace crate, no external pinners).
New `PolicyEngineKind` enum types the loader's intent; `validate_kind_alignment` is the load-time check that closes the "wrong action, wrong file, silent no-op" class. Operators get a load-time error instead of confused-and-silent behavior at request time.
Callsites migrated:
* server lib.rs:374 (single-mode per-graph) → load_graph
* server lib.rs:1065 (multi-mode server) → load_server
* server lib.rs:1103 (multi-mode per-graph) → load_graph
* CLI main.rs:732 (resolve_policy_engine) → load_graph
* tests/server.rs ×5 (4 graph, 1 server) → load_graph/load_server
* policy_engine_chassis.rs → load_graph
Four new in-source tests pin the contract: both rejection paths and both positive paths.
Closes the "operator puts an action in the wrong file and the rule silently never matches" class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
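A minimal sketch of the kind-alignment idea described in this commit, not the crate's actual code. The names `PolicyEngineKind`, `validate_kind_alignment`, `load_graph`, `load_server`, and `resource_kind()` come from the message; the reduced action list, the `KindMismatch` error type, the `graph_id` payload, and the choice to pass already-parsed actions instead of a file path are assumptions made to keep the example self-contained.

```rust
/// Resource kind an action binds to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyResourceKind {
    Graph,
    Server,
}

/// Reduced action vocabulary, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyAction {
    Read,
    Change,
    GraphList,
}

impl PolicyAction {
    pub fn resource_kind(self) -> PolicyResourceKind {
        match self {
            PolicyAction::Read | PolicyAction::Change => PolicyResourceKind::Graph,
            PolicyAction::GraphList => PolicyResourceKind::Server,
        }
    }
}

/// The loader's intent. The payload on `Graph` is an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum PolicyEngineKind {
    Graph { graph_id: String },
    Server,
}

/// Hypothetical error: the offending action and the kind the file expects.
#[derive(Debug, PartialEq, Eq)]
pub struct KindMismatch {
    pub action: PolicyAction,
    pub expected: PolicyResourceKind,
}

/// Load-time check: every action in the file must match the engine kind.
pub fn validate_kind_alignment(
    kind: &PolicyEngineKind,
    actions: &[PolicyAction],
) -> Result<(), KindMismatch> {
    let expected = match kind {
        PolicyEngineKind::Graph { .. } => PolicyResourceKind::Graph,
        PolicyEngineKind::Server => PolicyResourceKind::Server,
    };
    match actions.iter().copied().find(|a| a.resource_kind() != expected) {
        Some(action) => Err(KindMismatch { action, expected }),
        None => Ok(()),
    }
}

#[derive(Debug)]
pub struct PolicyEngine {
    pub kind: PolicyEngineKind,
}

impl PolicyEngine {
    /// Takes already-parsed actions instead of a path to keep the sketch small.
    pub fn load_graph(actions: &[PolicyAction], graph_id: &str) -> Result<Self, KindMismatch> {
        let kind = PolicyEngineKind::Graph { graph_id: graph_id.to_string() };
        validate_kind_alignment(&kind, actions)?;
        Ok(Self { kind })
    }

    /// No `graph_id` parameter: server-scoped actions never resolve a Graph.
    pub fn load_server(actions: &[PolicyAction]) -> Result<Self, KindMismatch> {
        let kind = PolicyEngineKind::Server;
        validate_kind_alignment(&kind, actions)?;
        Ok(Self { kind })
    }
}
```

Under these assumptions, `PolicyEngine::load_graph(&[PolicyAction::GraphList], "g1")` fails at load time rather than producing an engine whose rule can never match.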
||
|
|
2bb6e24fe3
|
mr-668: drop vestigial PolicyEngine surface
* `validate_request` had zero callsites: pure surface for nothing.
* `deny`'s `_actor_id` and `_request` parameters were both unused (the underscore prefix gave it away); the message is built by the caller before `deny` ever sees the request. Trim both.
Closes the "public API that the type system can't justify" class for the policy engine. No behavior change; every existing test stays green because the deletions never had a runtime effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
76ee061cac
|
mr-668: drop actor_id from PolicyRequest; pass actor as separate arg
The MR-731 "server-authoritative actor identity" invariant was enforced by an in-function chokepoint (`request.actor_id = actor.actor_id...` overwrite inside `authorize_request`). That worked but relied on every caller passing in a `PolicyRequest` and trusting the overwrite: a comment-enforced invariant.
Move the invariant into the type system:
* `PolicyRequest` no longer carries `actor_id`. The struct now models what a caller wants to do, not who they are.
* `PolicyEngine::authorize(actor_id: &str, request: &PolicyRequest)` and `validate_request(actor_id, request)` take identity as a separate argument. The same shape `PolicyChecker::check` already had for the engine layer.
* `authorize_request` in the HTTP layer extracts `actor_id` from the bearer-resolved `ResolvedActor` and passes it positionally, with no overwrite step that could be skipped.
* CLI `omnigraph policy explain` updated (the only other consumer that built a `PolicyRequest`).
Public API break for the `omnigraph-policy` crate. Worth it: handlers can no longer accidentally populate `actor_id` from a request body field, and external consumers are forced by the compiler to source actor identity from a trusted path.
The MR-731 chokepoint test `actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers` still passes: the bearer-resolved actor is what reaches the engine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
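The shape change can be sketched as follows. This is an illustration under stated assumptions, not the crate's code: the field list on `PolicyRequest`, the shape of `ResolvedActor`, the boolean return type, and the toy allow-list engine are all hypothetical. Only the idea that identity travels as a separate `actor_id: &str` argument is taken from the commit message.

```rust
/// What the caller wants to do. Deliberately carries no identity field.
/// The fields shown are illustrative; the real struct is not reproduced here.
#[derive(Debug, Clone)]
pub struct PolicyRequest {
    pub action: String,
    pub branch: Option<String>,
}

/// Identity the server resolved from a bearer token (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ResolvedActor {
    pub actor_id: String,
}

/// Toy engine: a list of (actor_id, action) pairs that are permitted.
pub struct PolicyEngine {
    permitted: Vec<(String, String)>,
}

impl PolicyEngine {
    pub fn new(permitted: &[(&str, &str)]) -> Self {
        Self {
            permitted: permitted
                .iter()
                .map(|(actor, action)| (actor.to_string(), action.to_string()))
                .collect(),
        }
    }

    /// Identity is a separate argument, so a request value cannot smuggle one in.
    pub fn authorize(&self, actor_id: &str, request: &PolicyRequest) -> bool {
        self.permitted.iter().any(|(actor, action)| {
            actor.as_str() == actor_id && action.as_str() == request.action.as_str()
        })
    }
}

/// HTTP-layer helper: the actor comes from the resolved token, never the body.
pub fn authorize_request(
    engine: &PolicyEngine,
    actor: &ResolvedActor,
    request: &PolicyRequest,
) -> bool {
    engine.authorize(&actor.actor_id, request)
}
```

Because `PolicyRequest` has no `actor_id` field in this sketch, a handler that deserializes a request body into it has nowhere to put a client-supplied identity.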
||
|
|
52f28cebe8
|
mr-668: comment cleanup and policy format style
Strip "PR Na/Nb" sub-PR references throughout MR-668 surfaces. They were useful during the 10-PR delivery sequence but rot now that the work is in the tree. Keep the MR-668 umbrella references.
Also:
- Add explicit `when = when` and `resource_literal = resource_literal` named args in `compile_policy_source`'s outer `format!` to match the surrounding crate style (already explicit for `group` and `action`).
- Rename the best-effort cleanup tracing target from "omnigraph::init" to "omnigraph::init::cleanup" so operators can filter init-failure cleanup events separately from init's other log lines.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
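For readers unfamiliar with the style point, a small sketch of a `format!` call with every named argument spelled out. The function name `permit_clause` and the template string are invented for illustration; the actual template inside `compile_policy_source` is not shown in this log.

```rust
/// Builds a policy clause with every `format!` argument named explicitly.
/// The template text is hypothetical, not the crate's real Cedar output.
pub fn permit_clause(group: &str, action: &str, resource_literal: &str, when: &str) -> String {
    format!(
        "permit(principal in {group}, action == {action}, resource == {resource_literal}){when};",
        group = group,
        action = action,
        resource_literal = resource_literal,
        when = when,
    )
}
```

Rust would also accept the implicit-capture form without the trailing `name = name` list; the explicit form is a consistency choice, which matches the commit's "no behavior change" note.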
||
|
|
937fd6382d
|
mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt)
The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:
- flock-on-renamed-inode race: the YAML flock is taken on
omnigraph.yaml itself, then a temp file is renamed over it.
Cross-process writers end up locking different inodes — both
believing they hold exclusive access.
- duplicate-check outside the file lock: precheck runs against
the in-memory registry only; the locked closure does
config.graphs.insert(...) unconditionally. Concurrent same-id
POSTs can persist the loser in YAML while the in-memory registry
keeps the winner — they disagree after restart.
- best_effort_cleanup_init_artifacts deletes _schema.pg /
_schema.ir.json / __schema_state.json on any init failure. An
accidental re-init against an existing graph's URI destroys its
schema; subsequent open() fails at read_text(_schema.pg).
The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.
For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.
Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
staging_path, hash_config_file, AppState::config_hash field +
threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
+ action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
GraphPolicySpec API types (only the POST handler / CLI imported them)
Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
that covers operator-authored YAML + restart — the path that
survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
(still reachable from CLI `omnigraph init`; preflight fix deferred
as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
callers gone, but the method is the natural seam for the future
cluster-catalog work
Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
advertises the management route correctly (was previously rewritten
to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
multi_mode_openapi_keeps_management_paths_flat, asserts both
/healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
/graphs in addition to /healthz
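The flat-path fix can be pictured with a short sketch. `ALWAYS_FLAT_PATHS` and the `/graphs/{graph_id}` prefix are taken from the message above; the helper name `multi_mode_path` and the example route `/read` are assumptions, and the real OpenAPI rewrite logic is not shown in this log.

```rust
/// Paths that stay at the root in multi-graph mode.
pub const ALWAYS_FLAT_PATHS: &[&str] = &["/healthz", "/graphs"];

/// Hypothetical helper: prefix per-graph routes, leave management routes flat.
pub fn multi_mode_path(path: &str) -> String {
    if ALWAYS_FLAT_PATHS.contains(&path) {
        path.to_string()
    } else {
        // `{{` and `}}` emit literal braces, so the template keeps `{graph_id}`.
        format!("/graphs/{{graph_id}}{path}")
    }
}
```

Without `/graphs` in the list, this helper would produce `/graphs/{graph_id}/graphs`, which is the incorrect path the commit describes.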
Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
but --target is a config-graph-name lookup; corrected to --uri.
Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
ownership", and "POST /graphs body shape" sections. Added a
paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
"Configuration" line to clarify that server-scoped rules (graph_list)
take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.
Diff: 17 files, +123 −1525 LOC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
d11c18fb27
|
mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10)
PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.
Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
- `multi_graph_lifecycle_post_query_restart_persistence` — POST a
graph, query it via cluster route, reload the config from disk
and confirm `load_server_settings` sees the rewritten YAML.
Validates the "restart resolves orphans" failure-mode story.
- `per_graph_policy_enforced_on_post_created_graph` — POST a graph
with a per-graph policy attached, then send authenticated read
and change requests. Per-graph Cedar enforcement fires correctly
on a POST-created graph (engine-layer policy reinstalled via
`Omnigraph::with_policy` inside the create flow).
- `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
POSTs with distinct graph_ids all return 201. Caught a real
race in `rewrite_atomic` (see below).
Race fix — `rewrite_atomic_with_modify`:
The concurrent-POST composite test surfaced a real bug. The old call
path around `rewrite_atomic(path, new_config, expected_hash)` captured
the baseline hash OUTSIDE the flock, then called `rewrite_atomic`, which
compared it against the on-disk hash inside the flock. Under concurrent writers:
- POST A: captures baseline H0, calls rewrite_atomic.
- POST B: captures baseline H0 too (before A's update lands).
- A: acquires flock, on-disk == H0, writes H1, releases.
- A: updates baseline H0 → H1.
- B: tries to acquire flock — waits.
- B: acquires flock. On-disk is now H1. Expected (captured
before A finished) is H0. MISMATCH → spurious Drift error.
Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.
The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
1. Read on-disk bytes, hash, compare to baseline.
2. Parse on-disk YAML.
3. Call `modify(parsed)` to produce the new config — receives
fresh on-disk state, returns the modification.
4. Serialize + write + fsync + rename + update baseline.
Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.
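The four steps above can be sketched with the standard library alone. This is not the removed `config::rewrite_atomic_with_modify`: the `u64` hash from `DefaultHasher`, the `RewriteError` type, the `.tmp` staging name, and the use of a plain `String` in place of parsed YAML are assumptions. The sketch also holds only an in-process `Mutex`; the cross-process flock is omitted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::{self, File};
use std::hash::{Hash, Hasher};
use std::io::{self, Write};
use std::path::Path;
use std::sync::Mutex;

/// Stand-in content hash; the hash used by the real code is an open detail here.
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

#[derive(Debug)]
pub enum RewriteError {
    /// On-disk content no longer matches the recorded baseline.
    Drift { expected: u64, found: u64 },
    Io(io::Error),
}

impl From<io::Error> for RewriteError {
    fn from(err: io::Error) -> Self {
        RewriteError::Io(err)
    }
}

pub fn rewrite_atomic_with_modify<F>(
    path: &Path,
    baseline: &Mutex<u64>,
    modify: F,
) -> Result<(), RewriteError>
where
    F: FnOnce(String) -> String,
{
    // Hold the baseline lock for the whole read-modify-write cycle.
    let mut expected = baseline.lock().unwrap();

    // 1. Read on-disk bytes, hash, compare to baseline.
    let on_disk = fs::read(path)?;
    let found = hash_bytes(&on_disk);
    if found != *expected {
        return Err(RewriteError::Drift { expected: *expected, found });
    }

    // 2-3. "Parse" (plain text here) and let the caller modify fresh state.
    let updated = modify(String::from_utf8_lossy(&on_disk).into_owned());

    // 4. Write to a staging file, fsync, rename over the target, update baseline.
    let staging = path.with_extension("tmp");
    let mut file = File::create(&staging)?;
    file.write_all(updated.as_bytes())?;
    file.sync_all()?;
    fs::rename(&staging, path)?;
    *expected = hash_bytes(updated.as_bytes());
    Ok(())
}
```

The closure receives the content read inside the critical section, so a second writer always builds on the first writer's result instead of on bytes it read earlier.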
The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.
Version bump v0.6.0 → v0.7.0:
- All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
plus their inter-crate `path` dep version constraints.
- `Cargo.lock` regenerated by `cargo build --workspace`.
- `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
row updated to mention multi-graph + cluster routes + atomic YAML
rewrite.
- `openapi.json` regenerated.
Docs:
- `docs/releases/v0.7.0.md` (new) — release notes with breaking
changes, new features, deferred items (DELETE, `delete_prefix`,
actor forwarding), and the single→multi migration recipe.
- `docs/user/server.md` — substantial section additions for the
two modes, mode inference, cluster endpoint table, management
endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
body shape + status codes.
- `docs/user/cli.md` — `omnigraph graphs list/create` section,
deferred-DELETE note.
- `docs/user/policy.md` — server-scoped Cedar actions
(`graph_create`, `graph_list`), per-graph vs server-level policy
composition, example server-level policy.
Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.
This commit closes MR-668. v0.7.0 is ready for tagging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
0e5aa036f4
|
mr-668: Cedar resource-model refactor (PR 6a/10)
PR 6a of the MR-668 multi-graph server work. Policy-crate-only refactor —
no HTTP handler changes, no operator-supplied policy.yaml changes. Sets
up the chassis that PR 6b's `GET /graphs` consumes.
Two new `PolicyAction` variants:
- `GraphCreate` — gates `POST /graphs` (deferred behavioral PR).
- `GraphList` — gates `GET /graphs` (lands in PR 6b).
Note: `GraphDelete` is intentionally NOT added in this PR. `DELETE
/graphs/{id}` is deferred from MR-668's v0.7.0 scope to bound complexity
(no `delete_prefix`, no tombstone, no `RegistryLookup::Tombstoned`).
Adding the Cedar action without a consumer would be the same kind of
"dead vocabulary" trap the `Admin` variant already documents.
New `PolicyResourceKind { Graph, Server }` enum, plus a
`PolicyAction::resource_kind()` method that classifies every action.
Per-graph actions (Read, Change, BranchCreate, …) bind to
`Omnigraph::Graph::"<graph_label>"`; server-scoped actions
(GraphCreate, GraphList) bind to the singleton
`Omnigraph::Server::"root"`. `Admin` stays classified as per-graph for
now — MR-724 will pick the final shape when the first consumer surface
ships.
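A minimal sketch of the classification described above. The enum and method names and the two entity literals come from the message; the trimmed action list and the helper `resource_literal` (as a free function taking a graph label) are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyResourceKind {
    Graph,
    Server,
}

/// Trimmed action list; the real enum has more per-graph variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyAction {
    Read,
    Change,
    BranchCreate,
    Admin,
    GraphCreate,
    GraphList,
}

impl PolicyAction {
    /// Classifies every action; the exhaustive match forces new variants to pick a kind.
    pub fn resource_kind(self) -> PolicyResourceKind {
        match self {
            PolicyAction::GraphCreate | PolicyAction::GraphList => PolicyResourceKind::Server,
            // `Admin` stays per-graph for now, as the message states.
            PolicyAction::Read
            | PolicyAction::Change
            | PolicyAction::BranchCreate
            | PolicyAction::Admin => PolicyResourceKind::Graph,
        }
    }
}

/// Hypothetical helper: the Cedar entity literal an action binds to.
pub fn resource_literal(action: PolicyAction, graph_label: &str) -> String {
    match action.resource_kind() {
        PolicyResourceKind::Graph => format!("Omnigraph::Graph::\"{graph_label}\""),
        PolicyResourceKind::Server => "Omnigraph::Server::\"root\"".to_string(),
    }
}
```

Matching without a wildcard arm is a deliberate choice in this sketch: adding a variant later fails to compile until it is classified.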
Cedar schema string additions:
- `entity Server;`
- `action "graph_create" appliesTo { principal: Actor, resource: Server, ... }`
- `action "graph_list" appliesTo { principal: Actor, resource: Server, ... }`
Compiler updates:
- `compile_policy_source` picks the resource literal based on the
action's `resource_kind`. Existing graph-only policies generate
the same Cedar source as before — pinned by
`per_graph_rules_continue_to_work_alongside_server_rules`.
- `compile_entities` includes the `Server::"root"` entity only when
a rule references a server-scoped action. Keeps test assertions
for graph-only policies tight.
- `PolicyEngine::authorize` builds the right resource UID at
request time based on `request.action.resource_kind()`.
Validation rules added to `PolicyConfig::validate`:
- A rule may not mix server-scoped and per-graph actions (different
resource kinds need different `permit` clauses).
- Server-scoped actions cannot have `branch_scope` or
`target_branch_scope` — there's no branch context at the server
level.
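The two validation rules can be sketched as a single per-rule check. The `Rule` struct, the `RuleError` variants, and the function name `validate_rule` are hypothetical; only the two conditions themselves come from the list above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Graph,
    Server,
}

/// Hypothetical, reduced view of one policy rule.
pub struct Rule {
    /// Resource kind of each action the rule allows.
    pub action_kinds: Vec<Kind>,
    pub branch_scope: Option<String>,
    pub target_branch_scope: Option<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RuleError {
    MixedResourceKinds,
    BranchScopeOnServerRule,
}

pub fn validate_rule(rule: &Rule) -> Result<(), RuleError> {
    let has_server = rule.action_kinds.contains(&Kind::Server);
    let has_graph = rule.action_kinds.contains(&Kind::Graph);

    // Different resource kinds need different permit clauses.
    if has_server && has_graph {
        return Err(RuleError::MixedResourceKinds);
    }
    // There is no branch context at the server level.
    if has_server && (rule.branch_scope.is_some() || rule.target_branch_scope.is_some()) {
        return Err(RuleError::BranchScopeOnServerRule);
    }
    Ok(())
}
```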
Operator impact: zero. The Cedar schema `Omnigraph::Server` entity is
internally referenced by `compile_policy_source`; operator policy.yaml
files only declare actions in `rules[].allow.actions` and never
reference the resource entity directly. Decision 6's "internal rename
only; operator policies unaffected" contract is preserved and pinned
by `per_graph_rules_continue_to_work_alongside_server_rules`.
Tests: 5 new (11 policy tests total, up from 6):
- `graph_list_action_authorizes_against_server_resource`
- `graph_create_action_authorizes_against_server_resource`
- `server_scoped_rule_cannot_use_branch_scope`
- `rule_mixing_server_and_per_graph_actions_is_rejected`
- `per_graph_rules_continue_to_work_alongside_server_rules`
No regression: 145 server tests (74 lib + 71 integration) still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
cc2412dc65
|
Rename repo terminology to graph (#118)
|
||
|
|
bb1fe57640
|
release: v0.5.0 (#115)
* gitignore: exclude docs/internal/ from publication
Mirrors the existing "Local-only working files (not for the public
repo)" pattern. Working notes filed under docs/internal/ stay on the
contributor's machine instead of cluttering the published doc tree
or tripping the AGENTS.md / docs-index cross-link check
(scripts/check-agents-md.sh enumerates every docs/*.md and requires
each one to be linked from an audience index — internal notes don't
have an audience index by definition).
Incidental to the v0.5.0 release; lands separately from the version
bump commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: skip docs/internal/ in agents-md cross-link check
Matches the .gitignore exclusion. Mirrors the existing 'docs/releases/'
exclusion pattern: notes under docs/internal/ aren't part of the
published doc tree and don't need to be linked from an audience index.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* release: v0.5.0 — Lance 6 substrate, Cedar policy engine, schema-lint v1
Bumps the workspace from 0.4.2 to 0.5.0. Release notes at
docs/releases/v0.5.0.md.
Three user-visible pillars motivate the minor bump:
1. Lance 6.0.1 substrate (DataFusion 52→53, Arrow 57→58)
2. Engine-wide Cedar policy enforcement on every _as writer; server
defaults to deny-all; signed-token-claim-only actor identity
3. Schema-lint v1 chassis: OG-XXX-NNN codes, soft drops, and
`--allow-data-loss` (Hard mode) for destructive migrations
Plus structured DataFusion Expr filter pushdown (unblocks
CompOp::Contains via array_has), HTTP allow_data_loss parity, inline
.gq sources on CLI/HTTP, optional CORS layer, and bug fixes
(merge-insert dup-rowid, branch-merge coordinator restore on error,
blob columns in branch merge).
Sites bumped:
- 5 crate [package].version lines (omnigraph, omnigraph-cli,
omnigraph-compiler, omnigraph-policy, omnigraph-server)
- 10 internal path-dep `version = "..."` constraints across the
four manifests that depend on sister crates (engine, server, cli,
plus engine's dev-dep on the compiler)
- Cargo.lock (regenerated via cargo update --workspace)
- AGENTS.md "Version surveyed:"
- openapi.json `info.version` (regenerated via
OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test
openapi)
Verification:
- cargo test --workspace --locked: 907/907 green
- cargo test -p omnigraph-engine --test failpoints --features
failpoints: 19/19 green
- cargo test -p omnigraph-engine --test lance_surface_guards: 3/3
- scripts/check-agents-md.sh: clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
9973683261
|
policy: chassis core — omnigraph-policy crate + Omnigraph::enforce() (MR-722) (#102)
PR #2 of the policy chassis series (PR #1 = MR-731, merged in #101). The structural fix that moves Cedar enforcement from HTTP-only to engine-wide. apply_schema is the proof-of-concept writer; PR #3 fans the enforce() call out to the remaining six (mutate_as, load, ingest_as, branch_create_from, branch_delete, branch_merge).
## What lands
### New crate: omnigraph-policy
The 844-line policy.rs moves from `omnigraph-server` into a new `omnigraph-policy` workspace crate so both engine and server can depend on it. Cedar dependency moves with it. The server's policy.rs becomes a re-export shim (`pub use omnigraph_policy::*`) so existing `omnigraph_server::PolicyAction` etc. paths keep working; CLI and test consumers don't have to migrate in one go.
### New trait: PolicyChecker
```rust
pub trait PolicyChecker: Send + Sync {
    fn check(
        &self,
        action: PolicyAction,
        scope: &ResourceScope,
        actor: &str,
    ) -> Result<(), PolicyError>;
}
```
`PolicyEngine` (Cedar-backed) implements it. `Omnigraph::with_policy()` takes `Arc<dyn PolicyChecker>`. Engine tests mock the trait without spinning up Cedar. MR-725 will extend the trait with `predicate_for()` for query-layer pushdown: additive, no call-site changes.
### New enum: ResourceScope
Four variants (Graph, Branch, TargetBranch, BranchTransition) mapping cleanly to today's `(branch, target_branch)` shape on PolicyRequest via `to_branch_pair()`. Each engine writer picks the variant that matches the existing HTTP-layer convention so engine and HTTP evaluate the same Cedar decision.
**Invariant**: ResourceScope stays at branch granularity. Per-type and per-row scope are MR-725's territory, not engine-layer's. Adding Type/Row variants here creates two places per-type policy can be evaluated, which can drift. See chassis design refinements comment on MR-722 (2026-05-17).
### Omnigraph::with_policy() + enforce()
* New `policy: Option<Arc<dyn PolicyChecker>>` field on Omnigraph, None by default (preserves embedded/dev no-enforcement mode).
* `with_policy(self, checker)` setter: builder-style, consumes self.
* `enforce(action, scope, actor)`: the gate. When policy is None, no-op. When policy is Some AND actor is None, hard error; silent bypass via "I forgot the actor" is exactly the footgun this gate is here to prevent.
### apply_schema_as: first writer wired
* New public method `apply_schema_as(source, options, actor)` that calls `enforce(SchemaApply, TargetBranch("main"), actor)` before acquiring the schema-apply lock or doing any other work.
* Existing `apply_schema(source)` and `apply_schema_with_options(...)` delegate to it with actor=None (no-actor variants).
* HTTP handler `server_schema_apply` updated to call apply_schema_as with the resolved actor. AppState construction injects the PolicyEngine into Omnigraph via `with_policy`. HTTP-layer authorize_request still fires first; the engine gate is the redundant-but-correct backstop and the only path that protects SDK / embedded callers. PR #3 removes the HTTP redundancy.
### OmniError::Policy
New error variant for engine-layer policy denial / evaluation failure. ApiError::from_omni maps it to 403.
### MR-724 Admin action: Option A reservation
PolicyAction::Admin kept in the enum with a load-bearing doc comment naming its future consumers (hot reload, audit log query, approvals list per MR-726 / MR-732 / MR-734). No enforce(Admin, ...) call site exists yet; the variant is reserved so the action vocabulary is complete from chassis day one. MR-724 closes when the first consumer surface ships.
### New SDK-side integration test
`crates/omnigraph/tests/policy_engine_chassis.rs`: four tests covering:
* Policy denies for unauthorized actor → OmniError::Policy
* Policy permits for authorized actor → apply succeeds
* Policy installed + no actor → hard error (forget-the-actor footgun)
* No policy → no-op (embedded/dev default still works)
These exercise the engine path directly, with no HTTP layer involved.
## Test results
- cargo test --workspace --locked --no-fail-fast: 851 passed, 0 failed
  * 45 server tests (existing) pass
  * 14 schema_apply tests (existing) pass
  * 4 new chassis tests pass
  * 60 OpenAPI tests pass (no HTTP API surface changes)
  * No regressions across the workspace
## Architectural decisions baked in
Per MR-722 chassis design refinements comment (2026-05-17):
1. PolicyChecker is a trait, not just a concrete. Engine and server consume the trait. MR-725 adds predicate_for() additively.
2. ResourceScope stays at branch granularity. No Type/Row variants.
3. Coarse-vs-fine framing pinned: engine-layer is action gate; query-layer (MR-725) is predicate gate. Both backed by same Cedar engine; non-overlapping responsibilities.
4. Admin action reserved for policy-management surfaces (MR-724 Option A).
## Pending follow-ups (PR #3+)
- Fan-out enforce() to mutate_as, load, ingest_as, branch_create_from, branch_delete, branch_merge (PR #3).
- Remove HTTP-layer authorize_request redundancy once engine gate covers all writers (PR #3).
- CLI policy injection into Omnigraph for non-`policy validate|test|explain` subcommands (PR #3 or follow-up).
- MR-723 default-deny 3-state matrix (PR #4).
- MR-736 severity warn/deny (PR #5).
- AGENTS.md scope-of-enforcement rewrite once chassis fully lands.
- Coarse-vs-fine framing in docs/user/policy.md.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> |
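The three-way behavior of the `enforce()` gate can be sketched as below. The trait signature, `with_policy`, and the None / Some-without-actor / Some-with-actor rules follow the message; the `PolicyError` variants, the `String` payload on `ResourceScope::TargetBranch`, the single-variant `PolicyAction`, and the `AllowOnly` test double are assumptions, and the real `Omnigraph` type carries far more state.

```rust
use std::sync::Arc;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyAction {
    SchemaApply,
}

/// Only the variants needed here; payload shapes are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResourceScope {
    Graph,
    TargetBranch(String),
}

/// Hypothetical error variants for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum PolicyError {
    Denied,
    MissingActor,
}

pub trait PolicyChecker: Send + Sync {
    fn check(
        &self,
        action: PolicyAction,
        scope: &ResourceScope,
        actor: &str,
    ) -> Result<(), PolicyError>;
}

#[derive(Default)]
pub struct Omnigraph {
    policy: Option<Arc<dyn PolicyChecker>>,
}

impl Omnigraph {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style setter that consumes self.
    pub fn with_policy(mut self, checker: Arc<dyn PolicyChecker>) -> Self {
        self.policy = Some(checker);
        self
    }

    /// The gate: no policy is a no-op; a policy without an actor is a hard error.
    pub fn enforce(
        &self,
        action: PolicyAction,
        scope: &ResourceScope,
        actor: Option<&str>,
    ) -> Result<(), PolicyError> {
        match (&self.policy, actor) {
            (None, _) => Ok(()),
            (Some(_), None) => Err(PolicyError::MissingActor),
            (Some(checker), Some(actor)) => checker.check(action, scope, actor),
        }
    }
}

/// Test double that permits exactly one actor, with no Cedar involved.
pub struct AllowOnly(pub &'static str);

impl PolicyChecker for AllowOnly {
    fn check(
        &self,
        _action: PolicyAction,
        _scope: &ResourceScope,
        actor: &str,
    ) -> Result<(), PolicyError> {
        if actor == self.0 { Ok(()) } else { Err(PolicyError::Denied) }
    }
}
```

A writer in the style of `apply_schema_as` would call `enforce(PolicyAction::SchemaApply, &ResourceScope::TargetBranch("main".to_string()), actor)` before doing any other work; the four cases then line up with the four chassis tests listed in the commit.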