PR #2 of the policy chassis series (PR #1 = MR-731, merged in #101).
The structural fix that moves Cedar enforcement from HTTP-only to
engine-wide. apply_schema is the proof-of-concept writer; PR #3 fans
the enforce() call out to the remaining six (mutate_as, load,
ingest_as, branch_create_from, branch_delete, branch_merge).
## What lands
### New crate: omnigraph-policy
The 844-line policy.rs moves from `omnigraph-server` into a new
`omnigraph-policy` workspace crate so both engine and server can
depend on it. Cedar dependency moves with it. The server's policy.rs
becomes a re-export shim (`pub use omnigraph_policy::*`) so existing
`omnigraph_server::PolicyAction` etc. paths keep working — CLI and
test consumers don't have to migrate in one go.
### New trait: PolicyChecker
```rust
pub trait PolicyChecker: Send + Sync {
    fn check(&self, action: PolicyAction, scope: &ResourceScope,
             actor: &str) -> Result<(), PolicyError>;
}
```
`PolicyEngine` (Cedar-backed) implements it. `Omnigraph::with_policy()`
takes `Arc<dyn PolicyChecker>`. Engine tests mock the trait without
spinning up Cedar. MR-725 will extend the trait with `predicate_for()`
for query-layer pushdown — additive, no call-site changes.
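
A minimal sketch of how an engine test might mock the trait without Cedar. The stand-in `PolicyAction`, `ResourceScope`, and `PolicyError` definitions, the `AllowOnly` double, and the `mock_checker` helper are all hypothetical; the real types live in `omnigraph-policy` and are not shown here.

```rust
use std::sync::Arc;

// Stand-in types (assumptions): the real definitions live in omnigraph-policy.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyAction {
    SchemaApply,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResourceScope {
    TargetBranch(String),
}

#[derive(Debug, PartialEq, Eq)]
pub struct PolicyError(pub String);

// Same signature as the trait quoted above.
pub trait PolicyChecker: Send + Sync {
    fn check(&self, action: PolicyAction, scope: &ResourceScope,
             actor: &str) -> Result<(), PolicyError>;
}

/// Hypothetical test double: permits exactly one actor, denies everyone else.
pub struct AllowOnly(pub &'static str);

impl PolicyChecker for AllowOnly {
    fn check(&self, action: PolicyAction, scope: &ResourceScope,
             actor: &str) -> Result<(), PolicyError> {
        if actor == self.0 {
            Ok(())
        } else {
            Err(PolicyError(format!("{actor} may not {action:?} on {scope:?}")))
        }
    }
}

/// Produces the `Arc<dyn PolicyChecker>` shape that `with_policy()` accepts.
pub fn mock_checker() -> Arc<dyn PolicyChecker> {
    Arc::new(AllowOnly("alice"))
}
```

Because the engine only sees `Arc<dyn PolicyChecker>`, a double like this can stand in for `PolicyEngine` in tests.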
### New enum: ResourceScope
Four variants — Graph, Branch, TargetBranch, BranchTransition —
mapping cleanly to today's `(branch, target_branch)` shape on
PolicyRequest via `to_branch_pair()`. Each engine writer picks the
variant that matches the existing HTTP-layer convention so engine
and HTTP evaluate the same Cedar decision.
**Invariant**: ResourceScope stays at branch granularity. Per-type
and per-row scope are MR-725's territory, not engine-layer's.
Adding Type/Row variants here creates two places per-type policy
can be evaluated, which can drift. See chassis design refinements
comment on MR-722 (2026-05-17).
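
A rough sketch of what the four variants and `to_branch_pair()` could look like. The variant payloads and the exact mapping onto `(branch, target_branch)` are assumptions; the text above names only the variants and the method.

```rust
/// Assumed payloads: the PR description only names the four variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ResourceScope {
    Graph,
    Branch(String),
    TargetBranch(String),
    BranchTransition { from: String, to: String },
}

impl ResourceScope {
    /// Assumed mapping onto the (branch, target_branch) pair on PolicyRequest.
    pub fn to_branch_pair(&self) -> (Option<&str>, Option<&str>) {
        match self {
            ResourceScope::Graph => (None, None),
            ResourceScope::Branch(b) => (Some(b.as_str()), None),
            ResourceScope::TargetBranch(t) => (None, Some(t.as_str())),
            ResourceScope::BranchTransition { from, to } => {
                (Some(from.as_str()), Some(to.as_str()))
            }
        }
    }
}
```

Keeping the enum at these four branch-level shapes means there is a single conversion point into the existing request shape.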
### Omnigraph::with_policy() + enforce()
* New `policy: Option<Arc<dyn PolicyChecker>>` field on Omnigraph,
None by default (preserves embedded/dev no-enforcement mode).
* `with_policy(self, checker)` setter — builder-style, consumes self.
* `enforce(action, scope, actor)` — the gate. When policy is None,
no-op. When policy is Some AND actor is None, hard error — silent
bypass via "I forgot the actor" is exactly the footgun this gate
is here to prevent.
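
The gate's three outcomes can be sketched as follows. This is an illustration under simplifying assumptions, not the engine's code: `Engine`, `EnforceError`, and `OnlyAlice` are hypothetical names, and action and scope are plain strings instead of `PolicyAction` and `ResourceScope`.

```rust
use std::sync::Arc;

#[derive(Debug, PartialEq, Eq)]
pub enum EnforceError {
    /// A policy is installed but the caller supplied no actor.
    MissingActor,
    /// The checker rejected the request.
    Denied(String),
}

// Simplified trait (assumption): strings replace the real action/scope types.
pub trait PolicyChecker: Send + Sync {
    fn check(&self, action: &str, scope: &str, actor: &str) -> Result<(), String>;
}

#[derive(Default)]
pub struct Engine {
    policy: Option<Arc<dyn PolicyChecker>>,
}

impl Engine {
    /// Builder-style setter that consumes self.
    pub fn with_policy(mut self, checker: Arc<dyn PolicyChecker>) -> Self {
        self.policy = Some(checker);
        self
    }

    /// No policy: no-op. Policy without actor: hard error. Otherwise: ask the checker.
    pub fn enforce(&self, action: &str, scope: &str, actor: Option<&str>)
        -> Result<(), EnforceError> {
        let policy = match self.policy.as_ref() {
            Some(p) => p,
            None => return Ok(()),
        };
        let actor = actor.ok_or(EnforceError::MissingActor)?;
        policy.check(action, scope, actor).map_err(EnforceError::Denied)
    }
}

/// Hypothetical checker used only to exercise the gate.
pub struct OnlyAlice;

impl PolicyChecker for OnlyAlice {
    fn check(&self, _action: &str, _scope: &str, actor: &str) -> Result<(), String> {
        if actor == "alice" { Ok(()) } else { Err(format!("{actor} denied")) }
    }
}
```

The missing-actor case is checked before the checker is consulted, so a forgotten actor can never be mistaken for a permit.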
### apply_schema_as: first writer wired
* New public method `apply_schema_as(source, options, actor)` that
calls `enforce(SchemaApply, TargetBranch("main"), actor)` before
acquiring the schema-apply lock or doing any other work.
* Existing `apply_schema(source)` and `apply_schema_with_options(...)`
delegate to it with actor=None (no-actor variants).
* HTTP handler `server_schema_apply` updated to call apply_schema_as
with the resolved actor. AppState construction injects the
PolicyEngine into Omnigraph via `with_policy`. HTTP-layer
authorize_request still fires first; the engine gate is the
redundant-but-correct backstop and the only path that protects SDK
/ embedded callers. PR #3 removes the HTTP redundancy.
### OmniError::Policy
New error variant for engine-layer policy denial / evaluation
failure. ApiError::from_omni maps it to 403.
### MR-724 Admin action — Option A reservation
PolicyAction::Admin kept in the enum with a load-bearing doc
comment naming its future consumers (hot reload, audit log query,
approvals list per MR-726 / MR-732 / MR-734). No enforce(Admin, ...)
call site exists yet — the variant is reserved so the action
vocabulary is complete from chassis day one. MR-724 closes when
the first consumer surface ships.
### New SDK-side integration test
`crates/omnigraph/tests/policy_engine_chassis.rs` — four tests
covering:
* Policy denies for unauthorized actor → OmniError::Policy
* Policy permits for authorized actor → apply succeeds
* Policy installed + no actor → hard error (forget-the-actor footgun)
* No policy → no-op (embedded/dev default still works)
These exercise the engine path directly — no HTTP layer involved.
## Test results
- cargo test --workspace --locked --no-fail-fast: 851 passed, 0 failed
  * 45 server tests (existing) pass
  * 14 schema_apply tests (existing) pass
  * 4 new chassis tests pass
  * 60 OpenAPI tests pass (no HTTP API surface changes)
  * No regressions across the workspace
## Architectural decisions baked in
Per MR-722 chassis design refinements comment (2026-05-17):
1. PolicyChecker is a trait, not just a concrete. Engine and server
consume the trait. MR-725 adds predicate_for() additively.
2. ResourceScope stays at branch granularity. No Type/Row variants.
3. Coarse-vs-fine framing pinned: engine-layer is action gate;
query-layer (MR-725) is predicate gate. Both backed by same Cedar
engine; non-overlapping responsibilities.
4. Admin action reserved for policy-management surfaces (MR-724
Option A).
## Pending follow-ups (PR #3+)
- Fan-out enforce() to mutate_as, load, ingest_as, branch_create_from,
branch_delete, branch_merge (PR #3).
- Remove HTTP-layer authorize_request redundancy once engine gate
covers all writers (PR #3).
- CLI policy injection into Omnigraph for non-`policy validate|test|explain`
subcommands (PR #3 or follow-up).
- MR-723 default-deny 3-state matrix (PR #4).
- MR-736 severity warn/deny (PR #5).
- AGENTS.md scope-of-enforcement rewrite once chassis fully lands.
- Coarse-vs-fine framing in docs/user/policy.md.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
PR 2 wraps the Omnigraph engine's catalog and schema_source fields in
ArcSwap so reads stay zero-cost while apply_schema can swap atomically
without &mut self. arc-swap lands as an unused workspace dep here so the
follow-up commits that wrap fields can land in isolation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add the scaffolding for the in-memory staged-write rewire — no behavior
change yet:
* New crates/omnigraph/src/exec/staging.rs with MutationStaging,
PendingTable, PendingMode, StagedTablePath, plus the end-of-query
finalize() that issues one stage_* + commit_staged per pending
table (Merge mode dedupes by id, last-write-wins).
* TableStore::scan_with_pending and count_rows_with_pending helpers:
scan committed rows through Lance, scan pending rows through a
DataFusion MemTable, and concatenate the two. Sidesteps the
Scanner::with_fragments filter-pushdown limitation documented on
scan_with_staged.
* Add datafusion = "52" to workspace + omnigraph-engine deps for
MemTable (transitively pulled by Lance already).
Engine code still uses the legacy MutationStaging shape; the rewire
lands in subsequent commits.
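
As an illustration of the Merge-mode rule mentioned above (dedupe by id, last-write-wins), here is a small sketch over plain structs. `Row` and `dedupe_last_write_wins` are hypothetical names; the real finalize() works on Arrow batches, which this does not attempt to model.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a pending row keyed by id.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Row {
    pub id: String,
    pub value: i64,
}

/// Keep one row per id. A later write replaces an earlier one in place, so
/// the output keeps the order in which each id first appeared.
pub fn dedupe_last_write_wins(rows: Vec<Row>) -> Vec<Row> {
    let mut slot_by_id: HashMap<String, usize> = HashMap::new();
    let mut out: Vec<Row> = Vec::new();
    for row in rows {
        // Copy the slot index out so the map is free to be mutated below.
        let existing = slot_by_id.get(&row.id).copied();
        match existing {
            Some(slot) => out[slot] = row,
            None => {
                slot_by_id.insert(row.id.clone(), out.len());
                out.push(row);
            }
        }
    }
    out
}
```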
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Parallel per-type load writes + omnigraph optimize/cleanup CLI
## MR-677.3 — parallel per-type load writes
The load path already groups records into one RecordBatch per type and
makes one Lance commit per table (loader::mod.rs:249-..), but those
commits ran sequentially. The node and edge write loops now go through
a new helper, `write_batches_concurrently`, which drives the per-table
writes with `futures::stream::StreamExt::buffered(N)`. Concurrency is
tunable via `OMNIGRAPH_LOAD_CONCURRENCY` (default 8).
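
One way the concurrency knob could be read is sketched below. The helper names are hypothetical, and falling back to the default on unparseable or zero values is an assumption, not something this commit states.

```rust
/// Default taken from the commit message.
pub const DEFAULT_LOAD_CONCURRENCY: usize = 8;

/// Parse a raw override. Assumption: missing, non-numeric, or zero values
/// all fall back to the default.
pub fn parse_concurrency(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_LOAD_CONCURRENCY)
}

/// Read the override from the environment variable named in the commit.
pub fn load_concurrency() -> usize {
    parse_concurrency(std::env::var("OMNIGRAPH_LOAD_CONCURRENCY").ok().as_deref())
}
```

Splitting parsing from the environment read keeps the fallback rule testable without mutating process state.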
## MR-676 — `omnigraph optimize` and `omnigraph cleanup`
New CLI subcommands that walk every node + edge table in the repo:
- `omnigraph optimize <uri>` — runs Lance `compact_files` on each
table to merge small fragments into fewer larger ones.
- `omnigraph cleanup <uri> --keep N | --older-than 7d --confirm` —
runs Lance `cleanup_old_versions` to prune historical manifests +
unique fragments. Requires `--confirm` because it's destructive.
Supports both count-based and time-based retention (or both AND'd
together). Time uses chrono `DateTime<Utc>` (added as a workspace
dep, default-features off).
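
To illustrate how count-based and time-based retention combine when both are given, here is a sketch of the AND rule. It is not Lance's `cleanup_old_versions` logic: `VersionInfo`, `Retention`, and `prunable` are hypothetical, ages are plain seconds rather than `DateTime<Utc>`, and pruning nothing when neither limit is set is an assumption.

```rust
/// Hypothetical description of one table version.
pub struct VersionInfo {
    pub version: u64,
    pub age_secs: u64,
}

/// Hypothetical retention options mirroring `--keep N` and `--older-than`.
pub struct Retention {
    pub keep: Option<usize>,
    pub older_than_secs: Option<u64>,
}

/// Return the versions eligible for pruning. `versions` must be sorted
/// newest first. When both limits are set, a version must fail both to be
/// pruned (the AND rule). With no limit set, nothing is pruned (assumption).
pub fn prunable(versions: &[VersionInfo], r: &Retention) -> Vec<u64> {
    if r.keep.is_none() && r.older_than_secs.is_none() {
        return Vec::new();
    }
    versions
        .iter()
        .enumerate()
        .filter(|(rank, v)| {
            let beyond_keep = r.keep.map_or(true, |k| *rank >= k);
            let old_enough = r.older_than_secs.map_or(true, |s| v.age_secs > s);
            beyond_keep && old_enough
        })
        .map(|(_, v)| v.version)
        .collect()
}
```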
Both commands run their per-table loops in parallel (8-way bounded,
`OMNIGRAPH_MAINTENANCE_CONCURRENCY` env override). Smoke-tested
against the 114-table prod graph: optimize went 7m15s sequential
→ 1m28s parallel. cleanup --keep 1 removed 137 historical versions
across 114 tables in 1m57s without disrupting `/healthz` or query
responses.
Public API on `Omnigraph`:
    pub async fn optimize(&mut self) -> Result<Vec<TableOptimizeStats>>
    pub async fn cleanup(&mut self, opts: CleanupPolicyOptions)
        -> Result<Vec<TableCleanupStats>>
All 10 existing loader tests still pass.
Closes MR-676.
Partially addresses MR-677: this covers MR-677.3 (parallel by type).
MR-677.1 applies to the `omnigraph embed` path rather than load, since
load doesn't call Gemini directly, and MR-677.2 was already in place.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: regenerate openapi.json
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Fixes two live authz bugs in omnigraph-server:
- Bearer-token lookup previously used HashMap::get, which compares keys with
Eq and short-circuits on the first differing byte — a network-observable
timing oracle for brute-forcing tokens. Tokens are now stored as SHA-256
digests and compared with subtle::ConstantTimeEq, iterating every entry
unconditionally so total work is independent of which slot matches. Raw
token bytes no longer live in server memory after startup.
- authorize_request now overwrites PolicyRequest.actor_id from the
authenticated session instead of trusting the handler-supplied field,
which previously defaulted to "" via unwrap_or_default(). The empty
string can no longer reach Cedar as a policy subject even if a future
refactor drops the None check.
External API of AppState constructors is unchanged — tokens still enter as
Vec<(String, String)> and are hashed on the way in.
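
The lookup shape described above can be sketched with the standard library alone. This is a stand-in, not the server code: the fix uses SHA-256 digests and `subtle::ConstantTimeEq`, whereas this sketch takes digests as precomputed 32-byte arrays and uses a hand-rolled XOR-accumulate comparison, which a compiler is free to optimize and which should not be relied on in production. `Digest`, `ct_eq`, and `lookup_actor` are hypothetical names.

```rust
/// Assumed digest width (SHA-256 output size).
pub type Digest = [u8; 32];

/// Hand-rolled stand-in for subtle::ConstantTimeEq: touches every byte and
/// only inspects the accumulated difference at the end.
fn ct_eq(a: &Digest, b: &Digest) -> bool {
    let mut diff = 0u8;
    for i in 0..a.len() {
        diff |= a[i] ^ b[i];
    }
    diff == 0
}

/// Compare the presented digest against every stored entry, never exiting
/// early, so the number of comparisons does not depend on which slot matches.
pub fn lookup_actor<'a>(table: &'a [(Digest, String)], presented: &Digest) -> Option<&'a str> {
    let mut found: Option<&'a str> = None;
    for (digest, actor) in table {
        if ct_eq(digest, presented) {
            found = Some(actor.as_str());
        }
    }
    found
}
```

Unlike `HashMap::get`, the scan costs O(n) per request; that is the price of making the work independent of the match position.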
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Integrate utoipa 5 to auto-generate an OpenAPI 3.1 spec from the existing
Axum handlers and serde types. All 16 endpoints are annotated with path
metadata, request/response schemas, security requirements, and tags. A
public /openapi.json endpoint serves the spec without requiring auth.
Includes 59 tests covering path completeness, HTTP methods, schema fields,
enum variants, security scheme, path/query parameters, request bodies,
response references, and endpoint integration.