The import landed in earlier work but no current call site uses it.
Emitted an `unused_imports` warning on every server build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single-mode `GET /graphs` handler returned an `ApiError` built
via struct literal with `status: METHOD_NOT_ALLOWED, code: BadRequest`.
The body code disagreed with the HTTP status — clients deserializing
on `code` saw `bad_request`, clients deserializing on `status` saw
405. Same bug class as the earlier 503+Conflict mismatch on the
removed YAML drift path.
Close the class for this one remaining instance:
* Add `ErrorCode::MethodNotAllowed` to the API enum.
* Add `ApiError::method_not_allowed(msg)` — pairs the 405 status
with the matching code.
* Replace the struct literal in `server_graphs_list` with the
constructor.
* Regenerate `openapi.json` (adds `method_not_allowed` to the
ErrorCode schema enum).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`tracing::Span::current().record("graph_id", ...)` in the routing
middleware silently no-ops here: no upstream `#[tracing::instrument]`
on the handlers declares a `graph_id` field, and `TraceLayer::new_for_http`
doesn't either. The recorded value never lands anywhere visible.
Replace with an explicit `info!(graph_id = %handle.key.graph_id,
"graph routed")` event so operators can grep logs and correlate
requests with the active graph. In single mode the value is the
sentinel `"default"`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the routing-middleware refactor moved the engine into the per-graph
`GraphHandle` (extracted via `Extension<Arc<GraphHandle>>`), seven
read-only handlers — `server_snapshot`, `server_read`, `server_export`,
`server_schema_get`, `server_branch_list`, `server_commit_list`,
`server_commit_show` — kept an unused `State(_state): State<AppState>`
extractor. Drop it. Each request avoids one `FromRequestParts` clone
of `AppState`'s Arcs.
Handlers that actually use state (workload admission for write paths,
`server_policy` for management endpoints) keep theirs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `open_multi_graph_state` doc comment claims "Fail-fast — the
first open error aborts startup; other in-flight opens are dropped"
but the code did
.buffer_unordered(4)
.collect::<Vec<_>>()
.await
.into_iter()
.collect::<Result<Vec<_>>>()?;
which drains every future in the stream before propagating the first
`Err`. With N S3-backed graphs and graph #2 failing fast, the caller
still waits for #1, #3, #4, … to either succeed or fail before
seeing the error.
Replace the four-line dance with `futures::TryStreamExt::try_collect`,
which short-circuits on the first `Err` and drops the rest. The
doc comment now matches behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`AppState::requires_bearer_auth` walked the entire registry per
request (cloning Arcs into a `Vec`, then `.iter().any(|h| h.policy
.is_some())`) to decide whether the auth middleware should challenge.
The walk is unnecessary — the answer only changes when the registry
mutates, which is exactly the moment a new snapshot is constructed.
Move the flag onto the snapshot itself:
* `RegistrySnapshot { graphs, any_per_graph_policy: bool }`.
* `RegistrySnapshot::new(graphs)` is the only construction path —
it derives the flag from `graphs.values().any(|h| h.policy
.is_some())` so the cached value can't drift from the source data.
* `Default` delegates to `new(HashMap::new())`.
* `GraphRegistry::from_handles` and `insert` build snapshots via
`RegistrySnapshot::new(...)`.
* `GraphRegistry::snapshot_ref()` exposes the current snapshot
through an `arc_swap::Guard`; callers that need cached derived
state go through this accessor (callers that only want `graphs`
still use `list` / `get`).
`requires_bearer_auth` becomes one `ArcSwap::load` + bool read.
Also (drive-by, same file, same hunk): replace the dead
`if let Some(other) = seen_uris.get(...)` + `let _ = other;` pattern
in `from_handles` with a plain `seen_uris.contains_key(...)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior `with_policy_engine` constructor reused the engine `Arc`
from the existing handle (`engine: Arc::clone(&existing.engine)`)
without re-applying `Omnigraph::with_policy`. Combined with
`new_with_workload`, the documented composition pattern was
`AppState::new_with_workload(...).with_policy_engine(p)` — which
produced an `AppState` whose HTTP layer enforced Cedar but whose
underlying engine had no `PolicyChecker` installed. Any caller
reaching the engine via `state.registry().list()[i].engine` could
bypass policy entirely. The doc comment named this gap; the type
system didn't.
Make composition impossible to get wrong:
* Add `AppState::new_single(uri, db, tokens, Option<PolicyEngine>,
WorkloadController)` — canonical single-mode constructor that
takes every option together and routes through `build_single_mode`
(which applies `db.with_policy(checker)` to the engine itself).
* `new`, `new_with_bearer_token`, `new_with_bearer_tokens`,
`new_with_bearer_tokens_and_policy`, `new_with_workload` all become
thin wrappers around `new_single`.
* Delete `with_policy_engine`. There is no post-construction policy
install path any more; the single linear construction forces
HTTP-layer and engine-layer policy to install together or not at all.
Regression test `engine_layer_policy_fires_via_direct_arc_omnigraph_from_new_single`
constructs an `AppState::new_single` with a deny-all policy, pulls
the `Arc<Omnigraph>` from the registry handle (the same path an
embedded SDK consumer would take), and asserts a direct `mutate_as`
call returns `OmniError::Policy`. Pre-fix this test would have
succeeded the mutation.
Test caller in `ingest_per_actor_admission_cap_returns_429`
migrates from `.with_policy_engine(...)` to `new_single(...,
Some(policy_engine), workload)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The MR-731 "server-authoritative actor identity" invariant was enforced
by an in-function chokepoint (`request.actor_id = actor.actor_id...`
overwrite inside `authorize_request`). That worked but relied on every
caller passing in a `PolicyRequest` and trusting the overwrite — a
comment-enforced invariant.
Move the invariant into the type system:
* `PolicyRequest` no longer carries `actor_id`. The struct now models
what a caller wants to do, not who they are.
* `PolicyEngine::authorize(actor_id: &str, request: &PolicyRequest)`
and `validate_request(actor_id, request)` take identity as a
separate argument. The same shape `PolicyChecker::check` already had
for the engine layer.
* `authorize_request` in the HTTP layer extracts `actor_id` from the
bearer-resolved `ResolvedActor` and passes it positionally — no
overwrite step that could be skipped.
* CLI `omnigraph policy explain` updated (the only other consumer
that built a `PolicyRequest`).
Public API break for the `omnigraph-policy` crate. Worth it: handlers
can no longer accidentally populate `actor_id` from a request body
field, and external consumers are forced by the compiler to source
actor identity from a trusted path.
The MR-731 chokepoint test
`actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers`
still passes — the bearer-resolved actor is what reaches the engine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Strip "PR Na/Nb" sub-PR references throughout MR-668 surfaces — they
were useful during the 10-PR delivery sequence but rot now that the
work is in the tree. Keep the MR-668 umbrella references.
Also:
- Add explicit `when = when` and `resource_literal = resource_literal`
named args in `compile_policy_source`'s outer `format!` to match the
surrounding crate style (already explicit for `group` and `action`).
- Rename the best-effort cleanup tracing target from
"omnigraph::init" to "omnigraph::init::cleanup" so operators can
filter init-failure cleanup events separately from init's other
log lines.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:
- flock-on-renamed-inode race: the YAML flock is taken on
omnigraph.yaml itself, then a temp file is renamed over it.
Cross-process writers end up locking different inodes — both
believing they hold exclusive access.
- duplicate-check outside the file lock: precheck runs against
the in-memory registry only; the locked closure does
config.graphs.insert(...) unconditionally. Concurrent same-id
POSTs can persist the loser in YAML while the in-memory registry
keeps the winner — they disagree after restart.
- best_effort_cleanup_init_artifacts deletes _schema.pg /
_schema.ir.json / __schema_state.json on any init failure. An
accidental re-init against an existing graph's URI destroys its
schema; subsequent open() fails at read_text(_schema.pg).
The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.
For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.
Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
staging_path, hash_config_file, AppState::config_hash field +
threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
+ action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
GraphPolicySpec API types (only the POST handler / CLI imported them)
Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
that covers operator-authored YAML + restart — the path that
survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
(still reachable from CLI `omnigraph init`; preflight fix deferred
as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
callers gone, but the method is the natural seam for the future
cluster-catalog work
Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
advertises the management route correctly (was previously rewritten
to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
multi_mode_openapi_keeps_management_paths_flat, asserts both
/healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
/graphs in addition to /healthz
Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
but --target is a config-graph-name lookup; corrected to --uri.
Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
ownership", and "POST /graphs body shape" sections. Added a
paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
"Configuration" line to clarify that server-scoped rules (graph_list)
take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.
Diff: 17 files, +123 −1525 LOC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.
Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
- `multi_graph_lifecycle_post_query_restart_persistence` — POST a
graph, query it via cluster route, reload the config from disk
and confirm `load_server_settings` sees the rewritten YAML.
Validates the "restart resolves orphans" failure-mode story.
- `per_graph_policy_enforced_on_post_created_graph` — POST a graph
with a per-graph policy attached, then send authenticated read
and change requests. Per-graph Cedar enforcement fires correctly
on a POST-created graph (engine-layer policy reinstalled via
`Omnigraph::with_policy` inside the create flow).
- `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
POSTs with distinct graph_ids all return 201. Caught a real
race in `rewrite_atomic` (see below).
Race fix — `rewrite_atomic_with_modify`:
The first composite test surfaced a real bug. The old
`rewrite_atomic(path, new_config, expected_hash)` captured the
baseline hash OUTSIDE the flock, then called rewrite_atomic which
re-acquired it inside. Under concurrent writers:
- POST A: captures baseline H0, calls rewrite_atomic.
- POST B: captures baseline H0 too (before A's update lands).
- A: acquires flock, on-disk == H0, writes H1, releases.
- A: updates baseline H0 → H1.
- B: tries to acquire flock — waits.
- B: acquires flock. On-disk is now H1. Expected (captured
before A finished) is H0. MISMATCH → spurious Drift error.
Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.
The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
1. Read on-disk bytes, hash, compare to baseline.
2. Parse on-disk YAML.
3. Call `modify(parsed)` to produce the new config — receives
fresh on-disk state, returns the modification.
4. Serialize + write + fsync + rename + update baseline.
Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.
The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.
Version bump v0.6.0 → v0.7.0:
- All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
plus their inter-crate `path` dep version constraints.
- `Cargo.lock` regenerated by `cargo build --workspace`.
- `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
row updated to mention multi-graph + cluster routes + atomic YAML
rewrite.
- `openapi.json` regenerated.
Docs:
- `docs/releases/v0.7.0.md` (new) — release notes with breaking
changes, new features, deferred items (DELETE, `delete_prefix`,
actor forwarding), and the single→multi migration recipe.
- `docs/user/server.md` — substantial section additions for the
two modes, mode inference, cluster endpoint table, management
endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
body shape + status codes.
- `docs/user/cli.md` — `omnigraph graphs list/create` section,
deferred-DELETE note.
- `docs/user/policy.md` — server-scoped Cedar actions
(`graph_create`, `graph_list`), per-graph vs server-level policy
composition, example server-level policy.
Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.
This commit closes MR-668. v0.7.0 is ready for tagging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 7 of the MR-668 multi-graph server work. Operators can now add a
graph to a running multi-graph server without restarting:
curl -X POST http://server/graphs \
-H "Content-Type: application/json" \
-d '{
"graph_id": "beta",
"uri": "/data/beta.omni",
"schema": { "source": "node Person { name: String @key }\n" },
"policy": { "file": "./policies/beta.yaml" }
}'
DELETE remains deferred (out of v0.7.0 scope per the trimmed plan —
no `delete_prefix`, no tombstones).
Body shape (decision 7):
- Nested `schema: { source: "..." }` (mirrors the `policy: { file }`
pattern; leaves room for future fields without breakage).
- Optional nested `policy: { file: "..." }` for per-graph Cedar.
- 32 MiB body limit (reuses `INGEST_REQUEST_BODY_LIMIT_BYTES`).
- Asymmetric with `SchemaApplyRequest` which keeps flat
`schema_source: String` — documented in api.rs.
Atomic YAML rewrite + drift detection:
- New `config::rewrite_atomic(path, new_config, expected_hash)`:
flock → re-read + hash check → serialize → write `.tmp` → fsync
→ rename → fsync parent dir. Returns the new hash for the caller
to update its in-memory baseline.
- New `config::hash_config_file(path)` — SHA-256 of the on-disk
bytes, used at startup and after each rewrite.
- New `RewriteAtomicError { Drift | Io | Serialize }` enum.
- `AppState.config_hash: Option<Arc<Mutex<[u8;32]>>>` carries the
in-memory baseline. Updated after every successful rewrite so
subsequent POSTs don't false-trigger drift.
- The mutex is `std::sync::Mutex` (brief critical section, no .await
inside). The flock itself serializes file access process-wide
AND across multiple server instances (defense in depth).
- All sync I/O runs inside `tokio::task::spawn_blocking` — flock
is sync.
Handler ordering (the load-bearing sequence):
1. Mode check: 405 in single mode.
2. Cedar authorize: `GraphCreate` against `Omnigraph::Server::"root"`.
3. Validate body: `GraphId::try_from` (regex + reserved-name), empty
schema/uri checks, per-graph policy file parse.
4. Pre-check registry for duplicate graph_id / duplicate uri (409).
5. `Omnigraph::init` the new engine.
6. Atomic YAML rewrite (drift detection inside).
7. Publish in registry (atomic re-check via `GraphRegistry::insert`).
Failure modes (documented in handler rustdoc):
- Init fails → orphan storage at `req.uri` (PR 2a cleans up schema
files; Lance datasets remain orphans until `delete_prefix` lands).
- YAML rewrite fails (drift, IO) → orphan storage; YAML unchanged.
- Registry insert fails (race) → YAML has entry but registry doesn't;
next restart opens it cleanly.
New dependency: `fs2 = "0.4"` (workspace + omnigraph-server). POSIX-only
file locking. Linux/macOS deployment supported; Windows out of scope.
Tests (10 new in `tests/server.rs::multi_graph_startup`):
- `post_graphs_creates_a_new_graph_end_to_end` — happy path, includes
YAML inspection to confirm the rewrite landed.
- `post_graphs_baseline_hash_updates_between_rewrites` — two POSTs in
a row both succeed (drift baseline updates correctly).
- `post_graphs_duplicate_graph_id_returns_409`
- `post_graphs_duplicate_uri_returns_409`
- `post_graphs_invalid_graph_id_returns_400` (reserved name)
- `post_graphs_empty_schema_source_returns_400`
- `post_graphs_returns_405_in_single_mode`
- `post_graphs_yaml_drift_detection_returns_503` — operator hand-edits
omnigraph.yaml; server refuses to clobber.
- `hash_config_file_is_deterministic_and_detects_changes`
- `rewrite_atomic_refuses_when_hash_drifts`
OpenAPI: `server_graphs_create` registered in `ApiDoc::paths(...)`;
openapi.json regenerated.
Result: 225 server tests green (74 lib + 66 openapi + 85 integration),
all MR-731 regressions still pinned.
LOC: ~580 lib.rs net (handler + helpers), ~120 config.rs (rewrite
machinery), +71 api.rs (request/response shapes), +332 tests/server.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 6b of the MR-668 multi-graph server work. First management endpoint —
`GET /graphs` lists every graph registered with the server, gated by the
server-level Cedar policy from PR 6a.
New API shapes (in `omnigraph-server::api`):
- `GraphInfo { graph_id, uri }` — one entry per registered graph.
- `GraphListResponse { graphs: Vec<GraphInfo> }` — sorted alphabetically
by `graph_id` for deterministic output.
Handler `server_graphs_list`:
- Mounted at `GET /graphs` in both modes.
- Single mode: returns 405 (resource exists in the API surface, just
not operational without a `graphs:` map). 405 chosen over 404 so
clients see "resource exists, wrong context" rather than "no such
resource".
- Multi mode: requires bearer auth (when configured); Cedar-gated by
`PolicyAction::GraphList` against `Omnigraph::Server::"root"`
(PR 6a's chassis). Returns the sorted registry list.
Cedar gate composition:
- When no `server.policy.file` is configured, the MR-723 default-deny
falls through: `GraphList` is not `Read`, so an authenticated actor
without a server policy gets 403. This is the right default — don't
expose the registry until the operator explicitly authorizes it.
- When a server policy is configured, Cedar evaluates the rule. The
test `get_graphs_with_server_policy_authorizes_per_cedar` pins the
admin-allow / viewer-deny split.
Routing:
- New `management` sub-router holding `/graphs` (auth-required, no
`resolve_graph_handle` middleware — operates on the registry, not
a single graph).
- Single mode merges flat protected routes + management.
- Multi mode merges nested `/graphs/{graph_id}/...` + management.
OpenAPI:
- `server_graphs_list` registered in `ApiDoc::paths(...)`.
- `EXPECTED_PATHS` in `tests/openapi.rs` gains `/graphs`.
- `openapi.json` regenerated (auto-tracked by
`openapi_spec_is_up_to_date` in CI).
Tests: 4 new in `tests/server.rs::multi_graph_startup`:
- `get_graphs_lists_registered_graphs_in_multi_mode`
- `get_graphs_returns_405_in_single_mode`
- `get_graphs_requires_bearer_auth_when_configured`
- `get_graphs_with_server_policy_authorizes_per_cedar`
What's NOT in this PR (deferred):
- Per-graph policy enforcement is wired through `handle.policy`
(PR 4a already did this); PR 6b doesn't add new per-graph
behavior beyond making sure the server policy lookup composes
cleanly alongside it.
- `POST /graphs` (PR 7) and `DELETE /graphs/{id}` (out of scope
for v0.7.0).
- CLI `omnigraph graphs list` (PR 8 will add).
Result: 215 server tests green (74 lib + 66 openapi + 75 integration),
11 policy tests green. MR-731 spoof regression preserved across all
this work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 5 of the MR-668 multi-graph server work. This is the first PR that
makes multi mode actually usable end-to-end: operators invoking
`omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:`
map and no single-mode selector now get a running multi-graph server.
Mode inference (MR-668 decision 2, four-rule matrix in
`load_server_settings`):
1. CLI `<URI>` positional → Single
2. CLI `--target <name>` → Single (URI from graphs.<name>)
3. `server.graph` in config → Single (URI from graphs.<name>)
4. `--config` + non-empty `graphs:` + no single-mode selector
→ Multi (all entries in `graphs:`)
5. otherwise → error with migration hint
Rule 5's error message names every escape hatch so operators can fix
their invocation without grepping docs.
Config schema extensions:
- `TargetConfig.policy: PolicySettings` (per-graph Cedar policy file).
`#[serde(default)]` so existing single-graph YAMLs keep parsing.
- `ServerDefaults.policy: PolicySettings` (server-level Cedar policy
for management endpoints — loaded in PR 5, wired into `GET /graphs`
in PR 6b).
- `OmnigraphConfig::resolve_target_policy_file(name)` and
`resolve_server_policy_file()` helpers — both resolve relative to
the config file's `base_dir`.
Public types added to `omnigraph-server`:
- `ServerConfigMode { Single { uri, policy_file } | Multi { graphs,
config_path, server_policy_file } }`.
- `GraphStartupConfig { graph_id, uri, policy_file }` — one entry
per graph in multi mode.
`ServerConfig` shape change:
- WAS: `{ uri: String, bind, policy_file, allow_unauthenticated }`.
- NOW: `{ mode: ServerConfigMode, bind, allow_unauthenticated }`.
- Breaking for any code that constructs `ServerConfig` directly.
`main.rs` is unaffected (uses `load_server_settings`).
`serve()` now forks on `ServerConfig.mode`:
- Single: existing flow via `AppState::open_with_bearer_tokens_and_policy`.
- Multi: parallel open via `futures::stream::iter(graphs)
.map(open_single_graph).buffer_unordered(4).collect()`. Bound 4 is
a rule-of-thumb for I/O-bound work — at N≤10 this trades startup
latency for a small amount of concurrent S3/Lance open pressure.
Fail-fast: first open error aborts startup; in-flight opens drop
their engine via Arc (Lance datasets close cleanly).
New helper `open_single_graph(GraphStartupConfig)`:
- Validates `GraphId` per the regex in PR 1.
- `Omnigraph::open(uri).await` with descriptive error context.
- Loads per-graph policy file and re-applies it via
`Omnigraph::with_policy` (engine-layer enforcement, MR-722).
- Returns `Arc<GraphHandle>` ready for the registry.
Routing middleware bug fix:
- `Router::nest("/graphs/{graph_id}", inner)` rewrites
`request.uri().path()` to the inner suffix (e.g. `/snapshot`).
The previous middleware tried to parse `{graph_id}` from
`request.uri().path()` and got 400 instead of 200. Fixed by reading
from `axum::extract::OriginalUri` request extension, which preserves
the pre-rewrite URI.
- Caught by the two new tests
`cluster_routes_dispatch_per_graph_handle` and
`cluster_route_for_unknown_graph_returns_404`.
Tests (14 new, all passing):
- Four-rule matrix: one test per branch + the joint case
`mode_inference_cli_uri_overrides_graphs_map` + the empty-graphs-map
error case.
- Per-graph + server-level policy file path resolution.
- Reserved `GraphId` rejection at startup.
- End-to-end multi-graph routing: two graphs side by side, each
cluster route hits the right engine.
- Unknown graph id under cluster prefix → 404.
- Flat routes 404 in multi mode.
Inline `ServerConfig` test (`serve_refuses_to_start_in_state_1_without_unauthenticated`)
and three `server_settings_*` tests updated to the new `mode` shape.
Result: 211 server tests green (74 lib + 71 integration + 66 openapi),
MR-731 regression test still pinned and passing.
LOC: +45 config.rs, +281 lib.rs (net), +395 tests/server.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 4b of the MR-668 multi-graph server work. In multi mode, the served
`/openapi.json` reports cluster routes (`/graphs/{graph_id}/...`) instead
of the legacy flat protected paths — matching what `build_app` actually
mounts (PR 4a's `Router::nest`). Single mode is unchanged.
Implementation:
- New `server_openapi` branch: when `state.mode()` is `Multi`, call
`nest_paths_under_cluster_prefix(&mut doc)` after `ApiDoc::openapi()`.
- The rewrite consumes `doc.paths.paths`, then for every path-item:
- If the path is in `ALWAYS_FLAT_PATHS` (`/healthz` for now), keep
it flat.
- Otherwise, prefix every operation_id with `cluster_` and reinsert
the item at `/graphs/{graph_id}<original_path>`.
- Single mode hits no extra work — the path map is untouched.
- The static `ApiDoc::openapi()` still emits the flat surface, so
in-process callers (the existing `openapi_json()` helper in tests)
see the unmodified spec.
Why cluster_ prefix on operation IDs: OpenAPI specs require unique
operation_ids across the document. With both flat (single-mode) and
cluster (multi-mode) surfaces ever co-existing in a generated SDK,
the prefix prevents collision. The current served doc only carries
one surface, so the prefix is forward-compat with potential future
dual-surface generation.
Tests: 6 new in `tests/openapi.rs`, all via the `/openapi.json` route
(not the static `ApiDoc::openapi()` helper):
- `multi_mode_openapi_lists_cluster_paths` — every protected path
appears as a cluster variant.
- `multi_mode_openapi_drops_flat_protected_paths` — flat protected
paths are absent.
- `multi_mode_openapi_keeps_healthz_flat` — `/healthz` survives.
- `multi_mode_openapi_prefixes_operation_ids_with_cluster` — every
cluster operation_id starts with `cluster_`.
- `multi_mode_operation_ids_are_unique` — no operation_id collisions.
- `single_mode_openapi_unchanged_by_cluster_filter` — single mode
still emits the legacy flat surface (regression).
New test helper `app_for_multi_mode(graph_ids)` exercises the new
`AppState::new_multi` constructor from PR 4a — first user of multi-mode
construction outside of unit tests.
Result: 66 openapi tests + 57 server integration tests + 74 lib tests
= 197 green. No regression in the existing OpenAPI drift check
(`openapi_spec_is_up_to_date` still validates the static flat surface
matches the committed openapi.json).
LOC: +67 in lib.rs (rewrite logic), +219 in tests/openapi.rs (test
suite + helper).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 4a of the MR-668 multi-graph server work. The heaviest single PR —
rewires every handler to extract `Arc<GraphHandle>` from a routing
middleware, replaces `AuthenticatedActor(Arc<str>)` with `ResolvedActor`
everywhere, and adds the `ServerMode` discriminator.
Behavior changes:
- **Single mode** (legacy `omnigraph-server <URI>`): flat routes
(`/snapshot`, `/read`, `/branches`, …) continue to work exactly as
v0.6.0. Internally, the registry holds a single handle keyed by the
sentinel `SINGLE_GRAPH_KEY_ID = "default"`; routing middleware injects
that handle on every request. No HTTP-visible change.
- **Multi mode** (new): routes nest under `/graphs/{graph_id}/...`.
Routing middleware extracts the graph id from the path, looks it up
in the registry, and injects the handle. 404 if not found.
(Multi-mode startup itself lands in PR 5; this PR provides the
router-side wiring.)
AppState refactor:
- `engine: Arc<Omnigraph>` and `policy_engine: Option<Arc<PolicyEngine>>`
fields removed — both now live inside `GraphHandle` in the registry.
- `mode: ServerMode { Single { uri } | Multi { config_path } }` added.
- `registry: Arc<GraphRegistry>` added.
- `server_policy: Option<Arc<PolicyEngine>>` added (placeholder for
management endpoints in PR 6b; unused today).
- Existing constructors (`new`, `new_with_bearer_token{s,_and_policy}`,
`new_with_workload`, `open*`) build a single-mode AppState
internally and remain source-compatible. Tests that constructed
AppState via these constructors continue to work.
- `with_policy_engine` post-construction setter — rebuilds the
single-mode handle with the policy attached. Engine-layer
enforcement is NOT reinstalled (matches the old single-field
semantics; `open_with_bearer_tokens_and_policy` is the path that
installs both layers).
- `new_multi` constructor added for PR 5's startup loop.
- `uri()` now returns `Option<&str>` (Some in single, None in multi).
Routing middleware:
- `resolve_graph_handle` injects `Arc<GraphHandle>` as a request
extension. Mode-aware: single returns the only handle; multi parses
`/graphs/{graph_id}/...` from the URI. Returns 404 in multi mode
when the graph id is unregistered. Records `graph_id` on the
current tracing span.
- `require_bearer_auth` updated to insert `ResolvedActor` (was
`AuthenticatedActor`).
Handler refactor — every protected handler:
- Gains `Extension(handle): Extension<Arc<GraphHandle>>` param.
- Replaces `state.engine` → `handle.engine`.
- Replaces `state.policy_engine()` → `handle.policy.as_deref()`.
- Replaces `state.uri()` → `handle.uri.as_str()` (or `.clone()`
where String is needed).
- Replaces `Arc::clone(&state.engine)` → `Arc::clone(&handle.engine)`
(the spawn-and-clone pattern in `server_export` — proof that a
long-running export survives the registry being mutated later).
authorize_request signature:
- Was: `(state: &AppState, actor: Option<&AuthenticatedActor>, request: PolicyRequest)`.
- Now: `(actor: Option<&ResolvedActor>, policy: Option<&PolicyEngine>, request: PolicyRequest)`.
- Per-graph callers pass `handle.policy.as_deref()`. The (future PR 6b)
management endpoints will pass `state.server_policy.as_deref()`.
MR-731 invariant preserved:
- The single chokepoint `request.actor_id = actor.actor_id.as_ref().to_string()`
inside `authorize_request` still overwrites any client-supplied
actor identity. Regression test
`actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers`
at `tests/server.rs:1114-1216` passes unchanged.
Tests: 0 new (the registry race tests in PR 3 already cover the
data structure; this PR exercises them indirectly via the existing
test suite). 74 lib + 57 server integration + 60 openapi = 191 tests
green. Clippy clean.
LOC: +397 insertions, -153 deletions in `crates/omnigraph-server/src/lib.rs`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 3 of the MR-668 multi-graph server work. Pure data structure — no
routing changes yet (that's PR 4a).
New file: `crates/omnigraph-server/src/registry.rs`
- `GraphHandle { key: GraphKey, uri: String, engine: Arc<Omnigraph>,
policy: Option<Arc<PolicyEngine>> }` — the per-graph state that the
routing middleware (PR 4a) will inject as a request extension.
- `RegistrySnapshot { graphs: HashMap<GraphKey, Arc<GraphHandle>> }` —
immutable snapshot; replaced atomically via `ArcSwap`.
- `GraphRegistry { snapshot: ArcSwap<_>, mutate: Mutex<()> }` — lock-free
reads, mutex-serialized mutations.
- `RegistryLookup { Ready(Arc<GraphHandle>) | Gone }` — two-valued, no
`Tombstoned` variant since DELETE is deferred in v0.7.0 scope.
- `InsertError { DuplicateKey | DuplicateUri }` — both rejection cases
for create-graph (maps to HTTP 409 in PR 7).
- Methods: `new`, `from_handles` (bulk startup-time init), `get`, `list`,
`len`, `insert`.
Race semantics pinned by three multi-thread tests:
- `concurrent_insert_same_key_exactly_one_succeeds` — N=8 spawned
inserts with the same key; exactly 1 returns Ok, 7 return DuplicateKey.
- `concurrent_insert_distinct_keys_all_succeed` — N=8 spawned inserts
with distinct keys; all succeed.
- `concurrent_reads_during_inserts_see_consistent_snapshots` — reader
loop concurrent with sequential writes; every listed handle's key
resolves via `get()` (no torn state).
Why no tombstones field: `DELETE /graphs/{id}` is deferred to bound
the scope of v0.7.0. Without a delete endpoint, there's no use for
tombstones — every key in the registry is `Ready`, and every key
not in the registry is `Gone`. When DELETE lands later, the
`Tombstoned` variant + `tombstones: HashSet<GraphKey>` slot in
additively without breaking caller signatures (the `Gone` variant
remains the "not currently active" case).
Why `tokio::sync::Mutex`: insert is async because PR 7's flow holds
this mutex across the atomic YAML rewrite step (file I/O). std::Mutex
would footgun across .await.
Dependency additions: `arc-swap = { workspace = true }`,
`thiserror = { workspace = true }` (used by InsertError).
Tests: 12 new (12 passing). 74 server lib tests total green
(62 from PR 1 + 12 new). Clippy clean on server crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR 1 of the MR-668 multi-graph server work. Pure types, no runtime
behavior changes yet.
Ships the validated identity vocabulary that the rest of the implementation
will consume:
- `GraphId(String)` — `^[a-zA-Z0-9-]{1,64}$`, leading underscore rejected
(engine reserves every `_*` filename), reserved route names rejected
(`policies`, `healthz`, `openapi`, `openapi.json`, `graphs`). Validation
lives in `try_from` only; serde `Deserialize` re-runs it so JSON payloads
cannot bypass.
- `TenantId(String)` — same regex shape as GraphId. `None` in Cluster
mode; reserved for Cloud mode (RFC 0003) where it carries the OAuth
`org_id` claim.
- `GraphKey { tenant_id: Option<TenantId>, graph_id }` — the registry
HashMap key. `cluster()` constructor for the Cluster-mode default.
- `Scope` enum with `Full` variant — Cluster mode default; RFC 0004 will
extend with OAuth scopes (`graph:read`/`write`/`admin`/`*`).
- `AuthSource` enum with `Static` variant — Cluster mode default; RFC
0001 step 1 will add `Oidc`.
- `ResolvedActor { actor_id, tenant_id, scopes, source }` — replaces the
upcoming refactor of `AuthenticatedActor(Arc<str>)` in PR 4a.
Per MR-668 design decision 13: ship the Cloud-mode forward type shapes
now (no `TokenVerifier` trait yet — that's RFC 0001 step 1) so handler
signatures stay stable across the Cluster → Cloud trajectory. `Scope`
and `AuthSource` use `#[non_exhaustive]` so future variants don't break
caller matches.
Tests: 26 new (15 graph_id + 11 identity), all passing. No regression
in the existing 36 server library tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* gitignore: exclude docs/internal/ from publication
Mirrors the existing "Local-only working files (not for the public
repo)" pattern. Working notes filed under docs/internal/ stay on the
contributor's machine instead of cluttering the published doc tree
or tripping the AGENTS.md / docs-index cross-link check
(scripts/check-agents-md.sh enumerates every docs/*.md and requires
each one to be linked from an audience index — internal notes don't
have an audience index by definition).
Incidental to the v0.5.0 release; lands separately from the version
bump commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci: skip docs/internal/ in agents-md cross-link check
Matches the .gitignore exclusion. Mirrors the existing 'docs/releases/'
exclusion pattern: notes under docs/internal/ aren't part of the
published doc tree and don't need to be linked from an audience index.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* release: v0.5.0 — Lance 6 substrate, Cedar policy engine, schema-lint v1
Bumps the workspace from 0.4.2 to 0.5.0. Release notes at
docs/releases/v0.5.0.md.
Three user-visible pillars motivate the minor bump:
1. Lance 6.0.1 substrate (DataFusion 52→53, Arrow 57→58)
2. Engine-wide Cedar policy enforcement on every _as writer; server
defaults to deny-all; signed-token-claim-only actor identity
3. Schema-lint v1 chassis: OG-XXX-NNN codes, soft drops, and
`--allow-data-loss` (Hard mode) for destructive migrations
Plus structured DataFusion Expr filter pushdown (unblocks
CompOp::Contains via array_has), HTTP allow_data_loss parity, inline
.gq sources on CLI/HTTP, optional CORS layer, and bug fixes
(merge-insert dup-rowid, branch-merge coordinator restore on error,
blob columns in branch merge).
Sites bumped:
- 5 crate [package].version lines (omnigraph, omnigraph-cli,
omnigraph-compiler, omnigraph-policy, omnigraph-server)
- 10 internal path-dep `version = "..."` constraints across the
four manifests that depend on sister crates (engine, server, cli,
plus engine's dev-dep on the compiler)
- Cargo.lock (regenerated via cargo update --workspace)
- AGENTS.md "Version surveyed:"
- openapi.json `info.version` (regenerated via
OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test
openapi)
Verification:
- cargo test --workspace --locked: 907/907 green
- cargo test -p omnigraph-engine --test failpoints --features
failpoints: 19/19 green
- cargo test -p omnigraph-engine --test lance_surface_guards: 3/3
- scripts/check-agents-md.sh: clean
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tests: add lance_surface_guards pre-flight pins for the v6 bump
Land 8 named guards in a new test file that pin Lance API surfaces
OmniGraph relies on. Each guard turns a silent-break risk (variant
rename, struct restructure, async-flip) into a red CI bar instead of
runtime drift.
Guards (mapped to the silent-break inventory from the v6 migration plan):
Runtime (#[tokio::test]):
1. lance_error_too_much_write_contention_variant_exists — pins the
variant referenced by db/manifest/publisher.rs::map_lance_publish_error.
2. manifest_location_field_shape — pins .path/.size/.e_tag/.naming_scheme
types and ManifestLocation accessor returning &Self (the access
pattern at db/manifest/metadata.rs:84-88).
6. write_params_default_does_not_set_storage_version — confirms our
explicit V2_2 pin remains load-bearing (blob v2 requirement).
Compile-only async fns (#[allow(...)] + unimplemented!() placeholders;
never run, but cargo build --tests enforces the API shape):
3. checkout_version + restore chain — pins the recovery rollback hammer
at db/manifest/recovery.rs:505-522.
4. DatasetBuilder::from_namespace().with_branch().with_version().load()
— pins the namespace builder chain at db/manifest/namespace.rs:162-174.
5. MergeInsertBuilder fluent chain — pins the manifest CAS at
db/manifest/publisher.rs:370-391, including the return shape
(Arc<Dataset>, MergeStats).
7. compact_files(&mut ds, CompactionOptions, None) — pins
db/omnigraph/optimize.rs:107.
8. DeleteResult { new_dataset, num_deleted_rows } — pins the inline
delete result shape (MR-A will repurpose this guard to the staged
two-phase variant once Lance #6658 migration lands).
This is commit 1 of the chore/lance-6.0.1 migration. Cargo bump
follows in commit 2 (will trigger the guards under v6 if any surface
drifted).
Per the migration plan at ~/.claude/plans/shimmering-percolating-duckling.md
(written this session). Two guards from the plan deferred to follow-up:
- manifest_cas_returns_row_level_contention_variant (full publisher
race integration test — needs harness scaffolding)
- table_version_metadata_byte_compatible_with_v4 (TableVersionMetadata
is pub(crate); requires test reach extension).
Verified on v4: cargo test -p omnigraph-engine --test lance_surface_guards
passes 3/3 runtime tests; cargo build -p omnigraph-engine --tests
compiles all 5 compile-only guards clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(deps): bump Lance 4.0.0 → 6.0.1, DataFusion 52 → 53, Arrow 57 → 58
The Cargo bump itself. Source is intentionally untouched — this commit
will not compile. The compile errors are the work-list for subsequent
commits on this branch.
Lance updates: lance + 7 sub-crates 4.0.0 → 6.0.1. Transitive churn:
+ lance-tokenizer v6.0.1 (vendored tokenizer per Lance PR #6512)
+ object_store 0.13.x (Lance 6 brings it transitively; our explicit
pin stays at 0.12.5 for now — revisit in stages if diamond bites)
- tantivy* crates (replaced by lance-tokenizer)
Compile error landscape on this commit (11 errors):
• 1× E0432: `lance_index::DatasetIndexExt` import (Lance PR #6280
moved it to lance::index). Sites: table_store.rs:20,
db/manifest.rs:37 (the second site was missed by the pre-flight
inventory).
• 8× E0599: `create_index_builder` / `load_indices` missing on
`lance::Dataset` — all downstream of the DatasetIndexExt move.
Once the import is corrected on table_store.rs and db/manifest.rs,
these resolve automatically.
• 2× E0063: missing field `is_only_declared` in `DescribeTableResponse`
initializer at db/manifest/namespace.rs:221, 364. New Lance
namespace field per the v5 namespace restructure (PR #6186).
Surface guards (lance_surface_guards.rs, commit d571fa8) all still
compile + the 3 runtime ones pass on v6 — none of the silent-break
surfaces drifted. That's the load-bearing observation: the publisher
CAS chain, ManifestLocation field shape, checkout_version/restore,
DatasetBuilder fluent chain, MergeInsertBuilder return shape,
WriteParams::default, compact_files signature, and DeleteResult
fields are all v6-stable.
Next commits address the 11 errors per the migration plan stages
3-8.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* imports: move DatasetIndexExt to lance::index (Lance PR #6280)
Lance 5.0 (PR #6280) moved `DatasetIndexExt` out of `lance-index` into
`lance::index`. `is_system_index` and `IndexType` stayed in `lance-index`.
Mechanical update of 6 import sites:
crates/omnigraph/src/table_store.rs:20 — split into two `use` lines
crates/omnigraph-server/tests/server.rs:10 — was traits::DatasetIndexExt
crates/omnigraph/tests/search.rs:6
crates/omnigraph/tests/branching.rs:7
crates/omnigraph/tests/failpoints.rs:467
crates/omnigraph-cli/tests/cli.rs:3 — was traits::DatasetIndexExt
All 9 E0599 cascading errors on .create_index_builder / .load_indices
resolve once the trait is back in scope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* namespace: add is_only_declared field to DescribeTableResponse
Lance namespace 6.0.0 added `is_only_declared: Option<bool>` to
`DescribeTableResponse` (lance-namespace-reqwest-client 0.7+ via the
v5.0 namespace API restructure, Lance PR #6186). Set to `Some(false)`
because every table BranchManifestNamespace returns from describe_table
is materialized — the manifest snapshot only includes entries for
tables we've already opened via Dataset::open.
Two sites in db/manifest/namespace.rs (BranchManifestNamespace +
StagedTableNamespace impls of LanceNamespace::describe_table).
Closes the last two compile errors from the v6 bump in the engine lib.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* cargo: add lance to omnigraph-cli + omnigraph-server dev-deps
Stage 3 moved DatasetIndexExt imports from `lance-index` to `lance::index`
in the cli and server test crates. Both crates only had `lance-index`
in their dev-dependencies; add `lance` alongside so the new path
resolves.
This is the last compile-error fix from the v6 bump — `cargo build
--workspace --tests` is now green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: refresh Lance alignment audit for v6.0.1; bump surveyed version
Per CLAUDE.md maintenance rule 2 (same-PR docs):
- docs/dev/lance.md: replace the v4.0.1 alignment audit stanza with
the v6.0.1 audit. Captures every v5/v6 finding from this PR (the
DatasetIndexExt move, DescribeTableResponse.is_only_declared,
MergeInsertBuilder return shape, ManifestLocation field shape,
LanceFileVersion::default flip, file-reader async, tokenizer
vendor, Lance #6658/#6666/#6877 status). Cross-references each
guard in tests/lance_surface_guards.rs.
- AGENTS.md: bump "Storage substrate: Lance 4.x" → "Lance 6.x".
Note: surveyed crate version stays at 0.4.2 — substrate version
bumps are independent of OmniGraph's release version.
- crates/omnigraph/src/storage_layer.rs: update the trait module-level
doc-comment to reflect that Lance #6658 closed 2026-05-14 and
delete_where two-phase migration is MR-A (the next follow-up).
#6666 stays open; create_vector_index inline residual stays.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* tests: silence clippy::diverging_sub_expression on compile-only guards
The five `_compile_*` async fns in lance_surface_guards.rs use
`let ds: Dataset = unimplemented!()` as a placeholder so type inference
can chase the method chain we want to pin, without ever running the
function. Clippy's `diverging_sub_expression` lint flags this pattern
because the RHS diverges; that's the entire point. Added to the
per-fn `#[allow(...)]` list, alongside dead_code / unreachable_code /
unused_variables / unused_mut already there.
No behavior change. cargo test -p omnigraph-engine --test
lance_surface_guards still 3/3 green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: correct #6658 status — closed but API ships in Lance v7.x, not v6.0.1
The audit stanza in docs/dev/lance.md and the storage_layer.rs trait
doc-comment both implied the public DeleteBuilder::execute_uncommitted
API shipped with Lance 6.0.1. It did not. Issue #6658 closed
2026-05-14, but binary search across the release stream confirms:
v6.0.1 ❌ no pub async fn execute_uncommitted on DeleteBuilder
v6.1.0-rc.1 ❌
v7.0.0-beta.5 ❌
v7.0.0-beta.10 ✅ first appearance
v7.0.0-rc.1 ✅
So MR-A (delete two-phase migration) is gated on the Lance v7.x bump,
not on this PR. v7.0.0-rc.1 dropped 2026-05-21; GA likely within a
week.
No behavior change. Doc-only correction.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* ci(lib): bump recursion_limit to 256 — Lance 6 trait depth on Linux
Lance 6's heavier trait surface around futures/streams in storage_layer.rs's
staged-write API pushes the rustc trait-resolution recursion limit past
the default 128 on Linux builds. CI on PR #111 surfaced this in both
`Test Workspace` and `Test omnigraph-server --features aws`:
error: queries overflow the depth limit!
= help: consider increasing the recursion limit by adding a
`#![recursion_limit = "256"]` attribute to your crate (`omnigraph`)
= note: query depth increased by 130 when computing layout of
`{async block@crates/omnigraph/src/storage_layer.rs:697:5: 697:10}`
(The async block is `stage_create_btree_index`'s body — its return type
is several layers of `impl Future<Output=Result<StagedHandle>>` deep on
top of Lance's own builder return types.)
Local macOS builds happened to short-circuit before tripping the limit,
which is why this didn't surface during the v6 bump sequence. The fix
rustc itself suggests is one line at the crate root.
No behavior change. Revisit if a future Lance bump stops needing it.
Verified: `cargo build --locked -p omnigraph-server --features aws`
compiles clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The schema-lint chassis v1.2 (PR #100) shipped `--allow-data-loss` on
the CLI, but `SchemaApplyRequest` had no equivalent field — Hard-mode
drops were CLI-only. This commit closes that feature gap and adds e2e
test coverage for drop modes across HTTP + CLI, plus data preservation
on additive apply, plus a CLI↔SDK plan-parity assertion.
Feature gap closed:
- `crates/omnigraph-server/src/api.rs` — added `allow_data_loss: bool`
(default false via `#[serde(default)]`) to `SchemaApplyRequest`.
Added `Default` derive so test usages can use `..Default::default()`.
- `crates/omnigraph-server/src/lib.rs` — `server_schema_apply` now
constructs `SchemaApplyOptions { allow_data_loss: request.allow_data_loss }`
and threads through to `apply_schema_as`.
- `crates/omnigraph-cli/src/main.rs` — remote-URI schema-apply path
used to bail with "--allow-data-loss not yet supported on remote";
now forwards the flag into the JSON payload so the CLI behaves
identically against local and remote URIs.
- `openapi.json` — regenerated; only diff is the new field on
`SchemaApplyRequest`.
Tests added (8 new):
* `crates/omnigraph-server/tests/server.rs` (+5):
- `schema_apply_route_soft_drops_property_via_http` — POST schema
removing nullable property, verify catalog reflects the drop AND
`snapshot_at_version(pre)` still has `age` in the field list
(time-travel reachability is the Soft contract).
- `schema_apply_route_soft_drops_node_type_via_http` — POST schema
removing `Company` node + cascading `WorksAt` edge.
- `schema_apply_route_hard_drops_property_with_allow_data_loss` —
POST with `allow_data_loss: true`, verify plan step reports
`mode: hard`.
- `schema_apply_route_keeps_drops_soft_without_flag` — same schema
without flag, verify `mode: soft`. Pins default semantics against
accidental Hard promotion.
- `schema_apply_route_additive_property_preserves_existing_rows` —
load fixture, POST adding nullable property, verify row count
preserved (SDK suite covers data preservation on drops + renames;
additive AddProperty wasn't pinned).
Plus helpers `schema_without_age` and `schema_without_company`.
* `crates/omnigraph-cli/tests/cli.rs` (+3):
- `schema_apply_allow_data_loss_flag_promotes_drops_to_hard` — CLI
`omnigraph schema apply --allow-data-loss --schema X.pg --json`,
verify plan step has `mode: hard`.
- `schema_apply_without_allow_data_loss_keeps_soft_drops` — without
flag, verify Soft.
- `schema_plan_parity_cli_and_sdk` — same `.pg` source through
`Omnigraph::plan_schema` (SDK) and `omnigraph schema plan --json`
(CLI), assert the steps array is byte-identical post-JSON. HTTP
has no `/schema/plan` endpoint; apply-side parity is implicitly
covered by the HTTP drop tests + CLI drop tests using identical
fixtures.
Docs:
- `docs/user/schema-language.md` — new "Destructive drops" section
documenting Soft vs Hard semantics and that `allow_data_loss` is
now honored uniformly across CLI / HTTP / SDK.
Verification: every new test passes; full `cargo test --workspace --locked`
green; `scripts/check-agents-md.sh` passes.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* tests: policy chassis e2e gap-fills (MR-722 follow-up)
Audit after PRs #101-105 surfaced real e2e gaps in the policy chassis
that could let regressions ride through silently. Coverage was strong
at the SDK level (18 chassis tests) and reasonable at HTTP (12+ policy
tests), but the CLI×writer matrix was asymmetric (only `change` tested
end-to-end), the `cli.actor` config-only precedence path was untested,
the `OMNIGRAPH_UNAUTHENTICATED` env-var read path was unexercised,
`serve()`'s startup-refusal propagation was structural-review only,
and engine↔HTTP decision parity was a structural property without a
test pinning it. This commit closes those gaps.
Added (15 new tests, all test-only):
* `policy_engine_chassis.rs` (+2): `load_file_as` allow + deny pair —
PR #104 added the actor-aware mirror of `load_file` but it was only
exercised via CLI integration; this is direct-SDK coverage.
* `omnigraph-server/src/lib.rs` mod tests (+2):
- `unauthenticated_env_var_classification` — consolidated single
test (process-global env var; running parallel would race) that
pins truthy values, falsy values, unset, and CLI-flag-overrides-
env behavior of the `OMNIGRAPH_UNAUTHENTICATED` read path inside
`load_server_settings`.
- `serve_refuses_to_start_in_state_1_without_unauthenticated` —
`#[serial]` integration test. Clears all bearer-token env vars,
builds a `ServerConfig` with no policy file and no flag, calls
`serve(config).await`, asserts Err before any side-effecting
work (Lance dataset open, TcpListener::bind). Guards the
classifier→serve propagation path so a future refactor that
drops the call turns red.
* `omnigraph-server/tests/server.rs` (+4): `policy_decision_parity_*`
— four cases (Change×allowed+denied, BranchMerge×allowed+denied).
Each case runs the same Cedar decision via both SDK
(`Omnigraph::with_policy().mutate_as` / `branch_merge_as`) and HTTP
(`POST /change` / `POST /branches/merge`) and asserts both either
Allow or Deny. The structural property (both paths call
`PolicyChecker::check`) is now test-asserted.
* `omnigraph-cli/tests/system_local.rs` (+8): the CLI×writer matrix
fan-out:
- `local_cli_load_enforces_engine_layer_policy`
- `local_cli_ingest_enforces_engine_layer_policy`
- `local_cli_schema_apply_enforces_engine_layer_policy`
- `local_cli_branch_create_enforces_engine_layer_policy`
- `local_cli_branch_delete_enforces_engine_layer_policy`
- `local_cli_branch_merge_enforces_engine_layer_policy`
Each: one denied case (`--as act-bruno` against protected main) +
one allowed case (`--as act-ragnor` via existing/extended admins-*
rules).
Plus:
- `local_cli_actor_from_config_used_when_no_flag` — proves the
config-only precedence path works.
- `local_cli_actor_flag_overrides_config_actor` — proves the
`--as` flag wins over `cli.actor` in the config.
Adds `local_policy_config_with_actor` helper. Extends
`POLICY_E2E_YAML` with `admins-branch-ops` (BranchCreate +
BranchDelete) and `admins-schema-apply` rules so the CLI×writer
matrix has positive-case rule coverage.
Verification: all new tests pass; full `cargo test --workspace
--locked` is green; `scripts/check-agents-md.sh` passes.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* tests: serialize env-touching server lib tests to fix CI flake
CI flake on PR #106's Test Workspace job: two of the new tests
(`serve_refuses_to_start_in_state_1_without_unauthenticated` and
`unauthenticated_env_var_classification`) raced against
`server_bearer_tokens_from_env_reads_legacy_token_and_token_file`,
which sets `OMNIGRAPH_SERVER_BEARER_TOKEN` via `EnvGuard`.
While `serve_refuses` was mid-execution with its EnvGuard cleared,
the bearer-token test's EnvGuard had `OMNIGRAPH_SERVER_BEARER_TOKEN`
set; `resolve_token_source()` saw it and classified the runtime
state as `DefaultDeny` rather than refusing — so the test panicked
with "Dataset at path X not found" instead of the expected refusal
message. The unauthenticated test had the symmetric failure: its
`OMNIGRAPH_UNAUTHENTICATED="anything"` got overwritten by a peer
`EnvGuard` drop.
Fix: mark every test that uses `EnvGuard` with `#[serial]` so they
serialize against each other (default key). Already on
`serve_refuses_to_start_in_state_1_without_unauthenticated`; added
to `unauthenticated_env_var_classification` and
`server_bearer_tokens_from_env_reads_legacy_token_and_token_file`.
The `parse_bearer_tokens_json_*` tests don't touch env vars and
stay parallel.
Locally green (36 tests pass on my workstation); the parallelism
issue is CI-runner-specific (more aggressive thread interleaving)
but the fix is universal.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Closes the "tokens but no policy" trap. Pre-MR-723, an operator who
configured bearer tokens and forgot to set policy.file got a server
that required auth and then permitted every action — the illusion of
protection. After MR-723, that configuration is default-deny: only
`read` actions succeed; every other action returns HTTP 403.
Three startup states, classified deterministically:
- **Open** — no tokens, no policy. Requires explicit
`--unauthenticated` flag or `OMNIGRAPH_UNAUTHENTICATED=1`; otherwise
`serve()` refuses to start. Forces the operator to opt in to
"fully open dev mode" so it can't happen accidentally.
- **DefaultDeny** — tokens configured, no policy. `authorize_request`
rejects every action except `Read` with 403. The warn-log on
startup names the misconfiguration explicitly.
- **PolicyEnabled** — policy file configured. Cedar evaluates every
request, unchanged from pre-MR-723.
What landed:
- `ServerConfig.allow_unauthenticated: bool` + `--unauthenticated` flag
on the `omnigraph-server` bin + `OMNIGRAPH_UNAUTHENTICATED` env var
(`load_server_settings` honors both).
- New `classify_server_runtime_state(has_tokens, has_policy,
allow_unauthenticated) -> Result<ServerRuntimeState>` pure function.
`serve()` calls it before opening the engine and bails with a clear
error when the operator hits the no-tokens-no-policy-no-flag cell.
- `authorize_request` state-2 branch: when `policy_engine()` is None
but the bearer-auth middleware delivered an authenticated actor, any
action other than `Read` returns 403 with a message that names the
misconfiguration.
- `AppState::with_policy_engine(self, engine)` builder method so
integration tests that need a custom workload (`new_with_workload`)
can still install a permit-all policy without a new constructor.
- `app_for_loaded_repo_with_auth(token)` and
`app_for_loaded_repo_with_auth_tokens(tokens)` test helpers now
install a permit-all policy alongside tokens — they previously
represented the "tokens but no policy" state that MR-723 makes
default-deny, and tests that don't care about policy were
inadvertently coupled to the loophole.
Tests:
- `classify_*` unit tests (3) — every cell of the matrix.
- `default_deny_mode_allows_read_for_authenticated_actor` — GET
/snapshot succeeds with bearer token + no policy.
- `default_deny_mode_rejects_change_with_forbidden` — POST /change
rejected with 403 + "default-deny" message.
- `default_deny_mode_rejects_schema_apply_with_forbidden` — POST
/schema/apply rejected with 403 + "default-deny" message.
- New `app_for_repo_with_auth_tokens_only(schema, tokens)` helper
builds the State-2 fixture without policy. The pre-MR-723 helpers
`app_for_loaded_repo_with_auth*` shift semantics to "tokens +
permit-all" so existing tests retain their original intent.
docs/user/policy.md: new "Server runtime states (MR-723)" section
documents the matrix and the explicit `--unauthenticated` opt-in.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Closes the CLI side of the policy chassis fan-out. Before this commit,
CLI direct-engine writes bypassed Cedar entirely because the CLI never
called `Omnigraph::with_policy(...)` for non-`policy validate|test|explain`
subcommands. After this commit, every CLI direct-engine writer
(change, load, ingest, branch create/delete/merge, schema apply) opens
the engine via a new `open_local_db_with_policy(uri, &config)` helper
that installs the configured `PolicyEngine` when `policy.file` is set,
and threads the resolved actor through to the `_as` writer methods.
Actor identity resolution:
- New top-level `--as <ACTOR>` global flag on the CLI overrides config.
- New `cli.actor` field in `omnigraph.yaml` provides a default actor.
- Precedence: `--as` > `cli.actor` > None.
- When policy is configured and neither is set, the engine-layer
footgun guard fires and the write is denied — silent bypass via
"I forgot the actor" is exactly what the guard prevents.
- Remote HTTP writes ignore both — bearer-token-resolved server-side.
Helpers added in main.rs:
- `open_local_db_with_policy(uri, &config) -> Result<Omnigraph>` —
opens the DB and installs the PolicyEngine when configured. Without
policy this is identical to a bare `Omnigraph::open`.
- `resolve_cli_actor(cli_as, &config) -> Option<&str>` — implements
the flag > config > None precedence.
Engine: added `load_file_as` to the loader as the actor-aware mirror of
`load_file`, so CLI file-path loads flow through the same enforce gate
as in-memory `load_as` calls.
Test rewrite: `local_cli_policy_tooling_is_end_to_end_while_local_writes_stay_unenforced`
was the explicit assertion of the pre-chassis hole. Renamed and split:
- `local_cli_policy_tooling_is_end_to_end` — sanity for the read-only
policy CLI surfaces (validate/test/explain), unchanged behavior.
- `local_cli_change_enforces_engine_layer_policy` — the new assertion:
policy installed + no actor → footgun-guard denial; `--as act-bruno`
on protected main → Cedar denial; `--as act-ragnor` (admins-write
rule) on main → permit, write committed.
POLICY_E2E_YAML gains an `admins-write` rule so the permit case has
a non-trivial actor to exercise.
docs/user/policy.md updated with `cli.actor` + `--as <ACTOR>` usage.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
PR #102 wired apply_schema_as. This PR completes the chassis-side
coverage so every public mutating engine entry point hits the same
Omnigraph::enforce(action, scope, actor) gate regardless of transport:
- mutate_as → enforce(Change, Branch(branch), actor)
- load_as → enforce(Change, Branch(branch), actor)
- ingest_as → enforce(Change, Branch(branch), actor); also threads
actor through the implicit branch_create_from_as so fresh-branch
ingest correctly hits BranchCreate too
- branch_create_as → enforce(BranchCreate, TargetBranch(name), actor)
- branch_create_from_as → enforce(BranchCreate,
BranchTransition { source, target }, actor)
- branch_delete_as → enforce(BranchDelete, TargetBranch(name), actor)
- branch_merge_as → enforce(BranchMerge,
BranchTransition { source, target }, actor)
Three new _as variants for branch ops (create, create_from, delete)
that had no actor surface before; existing actor-less variants delegate
with actor=None so the no-policy path is a strict no-op.
HTTP handlers updated to thread the resolved actor into the new _as
variants for branch_create and branch_delete (was previously dropped).
14 new SDK chassis tests (one allow + one deny pair per wired writer);
the existing 4 apply_schema_as tests stay. All 18 pass.
docs/user/policy.md updated to describe engine-wide enforcement and the
coarse-vs-fine layer split (engine = action gate, query layer per-row =
MR-725 future). AGENTS.md capability matrix updated to match.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
PR #2 of the policy chassis series (PR #1 = MR-731, merged in #101).
The structural fix that moves Cedar enforcement from HTTP-only to
engine-wide. apply_schema is the proof-of-concept writer; PR #3 fans
the enforce() call out to the remaining six (mutate_as, load,
ingest_as, branch_create_from, branch_delete, branch_merge).
## What lands
### New crate: omnigraph-policy
The 844-line policy.rs moves from `omnigraph-server` into a new
`omnigraph-policy` workspace crate so both engine and server can
depend on it. Cedar dependency moves with it. The server's policy.rs
becomes a re-export shim (`pub use omnigraph_policy::*`) so existing
`omnigraph_server::PolicyAction` etc. paths keep working — CLI and
test consumers don't have to migrate in one go.
### New trait: PolicyChecker
```rust
pub trait PolicyChecker: Send + Sync {
fn check(&self, action: PolicyAction, scope: &ResourceScope,
actor: &str) -> Result<(), PolicyError>;
}
```
`PolicyEngine` (Cedar-backed) implements it. `Omnigraph::with_policy()`
takes `Arc<dyn PolicyChecker>`. Engine tests mock the trait without
spinning up Cedar. MR-725 will extend the trait with `predicate_for()`
for query-layer pushdown — additive, no call-site changes.
### New enum: ResourceScope
Four variants — Graph, Branch, TargetBranch, BranchTransition —
mapping cleanly to today's `(branch, target_branch)` shape on
PolicyRequest via `to_branch_pair()`. Each engine writer picks the
variant that matches the existing HTTP-layer convention so engine
and HTTP evaluate the same Cedar decision.
**Invariant**: ResourceScope stays at branch granularity. Per-type
and per-row scope are MR-725's territory, not engine-layer's.
Adding Type/Row variants here creates two places per-type policy
can be evaluated, which can drift. See chassis design refinements
comment on MR-722 (2026-05-17).
### Omnigraph::with_policy() + enforce()
* New `policy: Option<Arc<dyn PolicyChecker>>` field on Omnigraph,
None by default (preserves embedded/dev no-enforcement mode).
* `with_policy(self, checker)` setter — builder-style, consumes self.
* `enforce(action, scope, actor)` — the gate. When policy is None,
no-op. When policy is Some AND actor is None, hard error — silent
bypass via "I forgot the actor" is exactly the footgun this gate
is here to prevent.
### apply_schema_as: first writer wired
* New public method `apply_schema_as(source, options, actor)` that
calls `enforce(SchemaApply, TargetBranch("main"), actor)` before
acquiring the schema-apply lock or doing any other work.
* Existing `apply_schema(source)` and `apply_schema_with_options(...)`
delegate to it with actor=None (no-actor variants).
* HTTP handler `server_schema_apply` updated to call apply_schema_as
with the resolved actor. AppState construction injects the
PolicyEngine into Omnigraph via `with_policy`. HTTP-layer
authorize_request still fires first; the engine gate is the
redundant-but-correct backstop and the only path that protects SDK
/ embedded callers. PR #3 removes the HTTP redundancy.
### OmniError::Policy
New error variant for engine-layer policy denial / evaluation
failure. ApiError::from_omni maps it to 403.
### MR-724 Admin action — Option A reservation
PolicyAction::Admin kept in the enum with a load-bearing doc
comment naming its future consumers (hot reload, audit log query,
approvals list per MR-726 / MR-732 / MR-734). No enforce(Admin, ...)
call site exists yet — the variant is reserved so the action
vocabulary is complete from chassis day one. MR-724 closes when
the first consumer surface ships.
### New SDK-side integration test
`crates/omnigraph/tests/policy_engine_chassis.rs` — four tests
covering:
* Policy denies for unauthorized actor → OmniError::Policy
* Policy permits for authorized actor → apply succeeds
* Policy installed + no actor → hard error (forget-the-actor footgun)
* No policy → no-op (embedded/dev default still works)
These exercise the engine path directly — no HTTP layer involved.
## Test results
- cargo test --workspace --locked --no-fail-fast: 851 passed, 0 failed
* 45 server tests (existing) pass
* 14 schema_apply tests (existing) pass
* 4 new chassis tests pass
* 60 OpenAPI tests pass (no HTTP API surface changes)
* No regressions across the workspace
## Architectural decisions baked in
Per MR-722 chassis design refinements comment (2026-05-17):
1. PolicyChecker is a trait, not just a concrete. Engine and server
consume the trait. MR-725 adds predicate_for() additively.
2. ResourceScope stays at branch granularity. No Type/Row variants.
3. Coarse-vs-fine framing pinned: engine-layer is action gate;
query-layer (MR-725) is predicate gate. Both backed by same Cedar
engine; non-overlapping responsibilities.
4. Admin action reserved for policy-management surfaces (MR-724
Option A).
## Pending follow-ups (PR #3+)
- Fan-out enforce() to mutate_as, load, ingest_as, branch_create_from,
branch_delete, branch_merge (PR #3).
- Remove HTTP-layer authorize_request redundancy once engine gate
covers all writers (PR #3).
- CLI policy injection into Omnigraph for non-`policy validate|test|explain`
subcommands (PR #3 or follow-up).
- MR-723 default-deny 3-state matrix (PR #4).
- MR-736 severity warn/deny (PR #5).
- AGENTS.md scope-of-enforcement rewrite once chassis fully lands.
- Coarse-vs-fine framing in docs/user/policy.md.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Warm-up commit for the policy chassis epic (MR-722). PR #1 of the
chassis series — same role as schema-lint v1's commit #1 baseline.
Zero behavioral change; establishes the regression test, the
load-bearing doc comment, and the user-doc paragraph for an
invariant already true in code.
Server auth already resolves `actor_id` from the matched bearer
token at `omnigraph-server/src/lib.rs:692-694`, overwriting whatever
the handler put in the PolicyRequest. The principle is named in
docs/dev/invariants.md Hard Invariant 11 ("clients cannot set actor
identity directly"). What was missing: a regression test, a
load-bearing doc comment at the resolution site, and a user-facing
documentation paragraph. This commit adds all three.
Why first. The actor-identity invariant is the foundation every
other policy decision stands on. If `actor_id` can be spoofed, every
chassis primitive (per-row scope, audit log, two-person rule)
becomes ungated. Pinning the invariant first means PR #2 (the
chassis core) doesn't have to re-prove this assertion.
Changes:
* crates/omnigraph-server/tests/server.rs — new regression test
actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers
with three sub-assertions:
- spoof-up: bearer for denied actor + X-Actor-Id naming allowed
actor → 403 (header doesn't promote)
- spoof-down: bearer for allowed actor + X-Actor-Id naming denied
actor → 200 (header doesn't demote)
- empty-string spoof: empty X-Actor-Id doesn't clear resolved actor
Cross-link to MR-777 (auth boundary cases — actor-id collision +
malformed bearer) noted in the test docstring.
* crates/omnigraph-server/src/lib.rs — expanded doc comment at
the actor-resolution site explaining the SECURITY INVARIANT,
citing Hard Invariant 11, the Supabase RLS history footgun, and
the regression test that pins the contract. Reader thinking "I
should let clients override actor_id for impersonation" hits
this comment first.
* docs/user/policy.md — new "Actor identity (signed-claim-only)"
section near the existing Server enforcement section. Closes the
user-facing doc gap MR-731's "Done when" requires.
Architectural decisions for PR #2+ pinned this session (not
implemented here, recorded so future implementers don't re-litigate):
- PolicyEngine moves to new `omnigraph-policy` workspace crate so
both engine and server can depend on it (Q2).
- `enforce(action, scope, actor)` will take a new `ResourceScope`
enum, leaving room for MR-725's per-type and per-row variants (Q3).
- `PolicyAction::Admin` is kept and wired (Option A) — meta-action
for policy-management surfaces (hot reload, audit log query,
approvals list) as those consumer features land (Q4).
Test results:
- cargo test -p omnigraph-server --test server: 45 pass (44 existing
+ 1 new); no regressions
- scripts/check-agents-md.sh: passes (34 links / 33 docs OK)
Out of scope (PR #2+):
- Omnigraph::with_policy() + enforce() method
- omnigraph-policy crate creation
- ResourceScope enum
- CLI policy injection into Omnigraph
- HTTP-layer redundant-check removal
- MR-724 Admin action wiring (PR #2)
- MR-723 default-deny 3-state (PR #4)
- MR-736 severity warn/deny (PR #5)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Cursor Bugbot LOW on commit 3ad359d: try_admit_rewrite is defined and
tested but no HTTP handler calls it; the six handler OpenAPI
annotations declared status = 503 (added in 8e1a8e7) but try_admit
(the only path handlers invoke) returns 429 only. 503 was unreachable.
Fix: remove (status = 503, ...) from the six handler OpenAPI
annotations and regenerate openapi.json. Kept as forward-looking
infrastructure: try_admit_rewrite, global rewrite semaphore,
RejectReason::GlobalRewriteExhausted, ApiError::ServiceUnavailable,
the 503 branch in IntoResponse, --global-rewrite-cap, and
OMNIGRAPH_GLOBAL_REWRITE_MAX. When a future commit wires
try_admit_rewrite into a handler, the 503 OpenAPI annotation lands
alongside that wiring.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migrates `ingest_per_actor_admission_cap_returns_429` from env-var
override to direct `WorkloadController::new(1, ...)` construction via
`AppState::new_with_workload`. Removes the `EnvGuard` and the
`#[serial]` annotation that paired with it.
Why correct by design (AGENTS.md rule 9): the previous round's matrix
fix (commit 8bd9a5f) shielded the matrix from this test's env
mutation, but the broader bug class — "test A's process-wide env
mutation can leak into any test B that calls
`AppState::open` / `WorkloadController::from_env()`" — was still
reachable by any future test that didn't think to opt out. Closing
the class at the source: this test no longer mutates global state at
all, so no other test needs to defend against it.
Net effect:
- This test no longer needs `#[serial]` (was the only reason it was
marked) — runs in parallel with the rest of the suite.
- The matrix's defensive `with_defaults()` construction (commit
8bd9a5f) remains correct but is no longer required for correctness;
it's now a "belt and suspenders" guard against any FUTURE
env-mutating test.
Verified locally: both tests pass when run together; full server
suite (44 tests) green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 4 CI failure: Test Workspace and server-aws both red on
`concurrent_branch_ops_morphological_matrix` cell b
("merge × merge: same-target-distinct-sources") — second merge
returned 429 instead of 200. The matrix passes locally.
Root cause: cargo test runs tests in parallel by default. The admission
test `ingest_per_actor_admission_cap_returns_429` is wrapped with
`#[serial]` and an EnvGuard that sets
`OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1` for its duration. Process-wide
env vars are visible to concurrently-running tests; the matrix's
`Harness::new()` called `AppState::open()` which delegates to
`WorkloadController::from_env()`, picking up cap=1 if it ran while
the admission test held the EnvGuard. With cap=1 + 2 concurrent
merges in cell b, one merge waits behind merge_exclusive while the
other is admitted; the waiter holds its admission permit, but a
fresh actor permit is needed when admission is per-actor — the
second merge's permit acquisition fails because the first hasn't
released yet, and 429 fires.
Fix (correct by design, AGENTS.md rule 9): the matrix harness builds
the WorkloadController explicitly via
`WorkloadController::with_defaults()` and passes it to
`AppState::new_with_workload`, the constructor added in commit
22d76db. Closes the bug class "tests pick up another concurrent test's
env override at construction time" — the matrix is now insulated from
any env-var manipulation in the rest of the test suite.
Verified locally: with `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1` set in the
environment, the matrix passes (it ignores env entirely now).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit added `concurrent_branch_ops_morphological_matrix`
covering 11 cells with stronger assertions (identity + post-op /change
+ reopen). The three narrow tests it replaces:
- concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator
→ matrix cell f, with identity assertions added
- concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other
→ matrix cells a + b + c, with identity assertions that close the
symmetric-swap blind spot cubic flagged on commit 64f2b99
- concurrent_change_during_branch_merge_preserves_writes
→ matrix cell d
The matrix retains the original tests' diagnostic granularity through
named cell labels in every assertion message ("[a:merge×merge:distinct-targets]
merge a"), so a CI failure points to the exact cell + invariant.
Net: 522 lines removed, 0 coverage lost. All other server tests pass
unchanged (44 total).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces three narrow concurrent_branch_* tests (folded in below) with
one parameterized matrix test covering 11 representative
(op_a, op_b, target_overlap) cells, asserting C1-C6 uniformly:
C1 — both complete (no deadlock; tokio::time::timeout(15s))
C2 — status: both 200 or exactly one clean conflict; never 500
C3 — per-target row count
C4 — per-target row identity (named persons present + absent — catches
the symmetric-swap class that count assertions miss; cubic P2 on
commit 64f2b99 flagged this gap on the round-3 merge race test)
C5 — engine state coherent (subsequent /snapshot consistent)
C6 — post-op /change on main succeeds (engine isn't poisoned)
Cells:
a. Merge × Merge, distinct targets — branch_merge_impl race pin
b. Merge × Merge, same target / distinct sources — merge_exclusive serialization
c. Merge × Merge, same source / distinct targets — fanout
d. Merge × Change, into target — per-(table, branch) queue
e. Merge × BranchCreateFrom, target — interaction with refresh path
f. BranchCreateFrom × BranchCreateFrom, distinct parents — round-1 race pin
g. BranchCreateFrom × BranchDelete, unrelated branches — disjoint state
h. BranchDelete × BranchDelete, distinct branches — concurrent refresh
i. BranchDelete × Change, distinct branch — refresh-side vs writer
j. BranchCreateFrom × Change, on source — fork-while-writing
k. Reopen consistency after concurrent pair — disk-vs-cache drift
Each cell:
- spins up its own tempdir + AppState so failures don't cascade,
- aligns the pair at a tokio::sync::Barrier so both reach the engine
close in time,
- wraps in a 15s deadlock timeout,
- asserts identity via a /read with the `get_person` fixture query
(specific names must be present on the right branch and absent from
the wrong one).
Subsumes:
- concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator
(now cell f, with identity assertions added)
- concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other
(now cells a + b + c, with identity assertions; the symmetric-swap
blind spot cubic flagged on commit 64f2b99 is closed)
- concurrent_change_during_branch_merge_preserves_writes
(now cell d)
Those three narrow tests are removed in the next commit so this lands
green standalone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the cubic P2 finding on commit 22d76db: `Semaphore::new(concurrency.max(1))`
silently coerced --heavy-concurrency=0 to 1, so the JSON output reported
0 while execution actually used 1. Reported settings differed from
actual.
Adds an explicit `--heavy-concurrency > 0` check in `main()` (with a
helpful error message pointing to --heavy-batches=0 as the way to
disable heavy traffic) and a defensive `assert!()` inside
`drive_heavy_actor` so future callers can't pass 0 silently.
Verified: `bench_actor_isolation --heavy-concurrency 0` exits with
code 2 and the explanatory error message.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix.
Cursor Bugbot HIGH on commit 22d76db rediscovered the residual flagged
in the round 1 honest-review note: `branch_merge_impl` at
`crates/omnigraph/src/exec/merge.rs:1085-1100` still uses the
swap_coordinator_for_branch + operate + restore_coordinator pattern
across three separate `coordinator.write().await` acquisitions. The
same shape that branch_create_from_impl shed in commit 4ffbf6e.
The test spawns two concurrent /branches/merge calls A (feature-a →
target-a) and B (feature-b → target-b) aligned at a tokio::sync::Barrier
so both reach swap_coordinator_for_branch close in time. M=4
iterations boost race-catching odds.
Currently fails on 22d76db with target-a=5, target-b=4: B's merge
landed on the wrong coord — target-b never got Frank because A's
swap pushed self.coordinator to target-a, B's swap captured target-a
as B's "previous", and B's restore set self.coordinator back to
target-a (not the original main). Subsequent operations using
self.coordinator point at the wrong branch.
Fix lands in the next commit: serialize concurrent branch merges via
`merge_exclusive: Arc<tokio::sync::Mutex<()>>` held across the entire
swap-operate-restore window. Closes the bug class "non-atomic
three-step coordinator manipulation" for branch_merge by serializing
merges relative to each other; per-(table, branch) queue inside the
merge body still lets merges and other writers run concurrently.
A deeper "operate on local coord" refactor (the round-1 fix shape for
branch_create_from) requires unwinding `branch_merge_on_current_target`
and its uses of `self.snapshot()` / `self.ensure_commit_graph_initialized()`;
deferred to a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cubic findings on bench_actor_isolation.rs flagged together:
P2 (lib.rs:202): `unsafe { std::env::set_var(...) }` ran inside
`#[tokio::main] async fn main()` AFTER the multi-thread tokio runtime
was up. Rust 2024 made `set_var` unsafe because libc's `setenv` is
not thread-safe; concurrent env reads from logging or runtime
internals can race or read torn state.
Fix (correct by design, AGENTS.md rule 9): add a public
`AppState::new_with_workload(uri, db, bearer_tokens, workload)`
constructor that takes a caller-built `WorkloadController`. Tests and
benches override per-actor caps via the constructor instead of
mutating global env. Closes the bug class "tests need to mutate
global env to override AppState defaults."
P2 (lib.rs:130): heavy actor's `oneshot.await` inside the loop
serialized — heavy in-flight count was always 1, so cap=1 never
tripped on the heavy side. The bench validated isolation (light p99
bounded) but didn't demonstrate the rejection path.
Fix: add a `--heavy-concurrency` arg (default 4) and spawn batches
as concurrent tokio tasks bounded by an internal semaphore. With
heavy_concurrency=4 and inflight_cap=1, the bench now reports
heavy_too_many_requests > 0 and heavy_ok == 1 at peak — proving the
gate fires for the heavy actor.
Sample run on local FS (4 light actors × 30 ops, 20 heavy batches ×
50 rows, heavy_concurrency=4, cap=1):
heavy_ok: 1
heavy_too_many_requests: 19
light_ok: 120
light_too_many_requests: 0
light_p99: 565 ms (target < 2 s)
Heavy saturates its own cap; light actors are completely unaffected.
The isolation property is now empirically proven by the rejection
counts rather than just by the latency tail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the cubic finding (P2) at lib.rs:1061: the new admission gates
add HTTP 429 / 503 failure paths but the affected endpoint
`#[utoipa::path(... responses(...) ...)]` annotations weren't updated.
Also closes a pre-existing miss on /change (admission-gated since
PR 2 Step F).
Adds (status = 429, ...) and (status = 503, ...) to all six
admission-gated endpoints:
- POST /change (operation_id = "change")
- POST /schema/apply (operation_id = "applySchema")
- POST /ingest (operation_id = "ingest")
- POST /branches (operation_id = "createBranch")
- DELETE /branches/{branch} (operation_id = "deleteBranch")
- POST /branches/merge (operation_id = "mergeBranches")
The descriptions reference the `Retry-After` header, which the
`IntoResponse for ApiError` impl emits on both codes (added in
commit c745dd6).
openapi.json regenerated via OMNIGRAPH_UPDATE_OPENAPI=1; the openapi
sentinel test passes both with the regen flag and in strict-check
mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Empirical proof of MR-686's central design promise: per-actor
admission control isolates noisy actors from light traffic. The
existing bench_concurrent_http harness measures aggregate throughput;
this harness measures the latency tail seen by light actors while a
heavy actor saturates its own per-actor cap.
Setup: one "heavy" actor flooding /ingest with multi-row NDJSON
batches; N "light" actors each running short bursts of /change
inserts, each authenticating with a distinct bearer token so the
WorkloadController accounts them as separate identities.
Output: heavy throughput / 429 count, light p50/p95/p99/max latency.
Acceptance heuristic on local FS: light-actor p99 < 2 s while the
heavy actor saturates its own cap.
Sample run on local FS, cap=1, 4 light actors x 30 ops, 20 heavy
batches x 50 rows: light p99 = 710 ms, light errors = 0 (well under
the 2 s acceptance target). The test demonstrates the isolation
property — the heavy /ingest holds its own admission slot but
doesn't affect light actors since they have separate per-actor
state.
Usage:
cargo run --release -p omnigraph-server --example bench_actor_isolation -- \
--light-actors 4 --light-ops-per-actor 30 \
--heavy-batches 20 --heavy-rows-per-batch 50 \
--inflight-cap 1 \
--output .context/bench-results/after-pr2-phase2/actor-isolation.json
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Future-proofs against MR-895 work that may move or remove the
per-(table, branch) writer queue acquisition inside `branch_merge`
(`crates/omnigraph/src/exec/merge.rs:1224`). Today the queue
linearizes a concurrent /change on main against a `branch_merge
feature → main` on the same touched tables; both succeed and the
inserted row is preserved post-merge.
Codex flagged this scenario as a P1 in PR #75 review claiming the
merge could silently overwrite concurrent target writes because the
source-rewrite path opens with `MutationOpKind::Merge` (skipping the
strict pre-stage check). Validation showed the queue at merge.rs:1224
is held across both Phase B (per-table commit_staged) and Phase C
(manifest publish), so there's no interleave window. The Merge
op_kind only affects same-process pre-stage drift detection, not
cross-write linearization. The test passes on f925ad1; landing it
as a regression sentinel catches future changes that drop the queue
acquisition.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the cubic acceptance-criteria gap (❌ "Integration test: two
/change requests targeting different (table_key, branch) execute
concurrently end-to-end"). The bench harness measures the throughput
side; this test is the regression sentinel that catches a future
change which accidentally re-introduces graph-wide serialization on
the disjoint path.
Spawns 4 concurrent /change inserts on node:Person and 4 on
node:Company. All 8 must return 200, and the post-test row counts
on each table must reflect every insert.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the doc-vs-code gap at api.rs:343 and lib.rs:344-355: the
documentation claims `Retry-After` is set on TooManyRequests /
ServiceUnavailable responses, but `IntoResponse for ApiError`
emitted only `(StatusCode, Json(ErrorOutput))` — no header.
Wires a constant `RETRY_AFTER_SECONDS = "60"` for both 429 and 503
codes. Plumbing per-RejectReason durations through is a follow-up;
the admission rejects we surface today recover bounded by request
handler duration rather than calendar wait, so a constant suffices.
Pinned by `ingest_per_actor_admission_cap_returns_429`. Test now
fully green: 1+ of 8 concurrent /ingest under cap=1 receives 429
with Retry-After: 60.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the gap that admission control only fired on /change. A heavy
actor sending bulk-ingest traffic could exhaust shared engine capacity
(Lance I/O threads, manifest churn) without hitting the per-actor cap.
Wires `state.workload.try_admit(&actor_arc, est_bytes)` into the five
remaining mutating handlers AFTER Cedar authorization (so denied
requests don't consume admission slots) and BEFORE the engine call.
Byte estimates per handler:
- /ingest: request.data.len() (NDJSON body)
- /schema/apply: request.schema_source.len()
- /branches/create, /branches/delete, /branches/merge: 256
(small JSON; the heavy work is bounded per-(table, branch) by the
engine's writer queue rather than by request size)
The admission guard is held in `let _admission = ...` so it stays
alive until handler return, releasing the count permit + decrementing
the byte budget on drop.
Pinned by `ingest_per_actor_admission_cap_returns_429` (previous
commit). The test still fails on the Retry-After header assertion;
the next commit emits the header.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix. Currently fails on f925ad1 with 8/8 statuses returning
200 because /ingest does not call WorkloadController::try_admit.
The test pins:
- /ingest is gated on per-actor admission control (returns 429 when
the cap is exceeded).
- 429 responses carry the structured `code: too_many_requests` error
body so clients can distinguish them from generic conflicts.
- 429 responses include a `Retry-After` header so clients can implement
bounded backoff. The doc claim at api.rs:343 and lib.rs:344 was that
this header exists; the IntoResponse impl currently emits no headers.
Two follow-up commits will turn this green:
1. Wire WorkloadController::try_admit on /ingest and the four other
mutating handlers (Block 2.1).
2. Emit the Retry-After header on 429/503 responses (Block 2.2).
The test uses #[serial] + EnvGuard to override
OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1 without racing parallel tests, then
spawns 8 concurrent /ingest tasks aligned at a tokio::sync::Barrier so
multiple tasks reach try_admit close in time. With cap=1, at least one
must be rejected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The existing change_concurrent_inserts_same_key_serialize_without_409
test claimed in its comment "asserts the final row count equals N" but
only checked HTTP status codes. cubic flagged the gap; this commit
adds the actual /snapshot read after the concurrent inserts to verify
all N batches landed (no silent overwrite) by comparing the post-test
node:Person row_count against SEED + N.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix so the red → green pair is visible in git log.
The test demonstrates that two concurrent `POST /branches` calls with
distinct `from` parents corrupt coordinator state: A's "operate" step
runs against B's swapped coordinator instead of its own, forking the
new branch off the wrong parent's HEAD.
Currently fails on f925ad1 with all 8 gamma branches (declared
parent: alpha, 5 rows) reporting 4 rows — beta's row count. The
operate step ran against beta's coord because B's swap interleaved
between A's swap and A's operate.
Fix lands in the next commit: hold a single `coordinator.write().await`
guard across the entire swap-operate-restore sequence in
`branch_create_from_impl` so the three steps are atomic relative to
other callers.
Closes the bug class "non-atomic three-step coordinator manipulation
under &self callers" rather than guarding the specific call site —
the right architectural seam (single critical section per swap-restore
sequence) eliminates the interleave window for branch_create_from and
any future swap-restore caller.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>