mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-12 01:45:14 +02:00
27 commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
e676c151bb |
feat(engine): unify load/ingest — load_as gains an optional fork base
load_as/load_file_as gain a base: Option<&str> parameter: with Some(base) a missing target branch is forked from base first (the former ingest semantics); with None the target branch must exist — staging fails on an unknown branch, so a typo'd name can never create one. LoadResult gains branch/base_branch/branch_created metadata (additive). The ingest family (ingest, ingest_as, ingest_file, ingest_file_as) becomes #[deprecated] shims over load_as that preserve the historical contract exactly (from: None still means fork from main; base recorded even when no fork happened). IngestResult and to_ingest_tables stay for the shims and the server until the removal release. The layered policy check is unchanged: Change on the target branch always, BranchCreate additionally when a fork actually happens (enforced inside branch_create_from_as with the actor threaded through). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> |
||
|
|
2c578a60b2
|
(feat) convert engine call sites to &dyn TableStorage; demote legacy TableStore methods to pub(crate) (#86)
* MR-854: convert engine call sites to &dyn TableStorage; demote legacy methods
Phase 1b: every db.table_store.X(...) call site converts to
db.storage().X(...), reaching the storage layer through the sealed
TableStorage trait (returns &dyn TableStorage). Opaque SnapshotHandle
and StagedHandle replace bare lance::Dataset and Transaction in the
threaded values.
Phase 9: the inherent inline-commit methods on TableStore
(append_batch, merge_insert_batch{,es}, overwrite_batch,
create_btree_index, create_inverted_index) demote from pub to
pub(crate). Their only remaining direct users are table_store.rs
itself and the bulk loader's LoadMode::{Append, Overwrite, Merge}
concurrent fast-paths in loader::write_batch_to_dataset (no
two-phase shape in Lance 4.0.0 — closes after lance#6658 and #6666).
Docs:
- invariants.md \u00a7VI.23: drop "at the writer-trait surface"
qualifier; staged primitives are now the only engine surface.
- runs.md: residual matrix shrinks to delete_where and
create_vector_index (the two upstream-blocked residuals).
- forbidden_apis.rs: replace transitional language with the
current allow-list shape (table_store.rs + loader concurrent
fast-path only).
Files touched:
- changes/mod.rs, db/omnigraph.rs (+export/optimize/schema_apply/
table_ops.rs), exec/{merge,mod,mutation,staging}.rs,
loader/mod.rs, storage_layer.rs, table_store.rs,
tests/forbidden_apis.rs, docs/{invariants,runs}.md.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* MR-854: replace test-only inline-commit append callers with local Lance helpers
After demoting TableStore::append_batch from pub to pub(crate), the
integration tests in tests/recovery.rs and tests/staged_writes.rs
that previously called store.append_batch(...) directly to simulate
HEAD-ahead-of-manifest drift can no longer access the inherent
method. Replace those calls with small in-test helpers that do a raw
Dataset::append (the same body the inherent method runs).
- tests/helpers/mod.rs gains lance_append_inline (shared helper).
- tests/staged_writes.rs gets a file-local lance_append_inline_local
(staged_writes.rs does not import helpers::).
- tests/recovery.rs drops the unused TableStore import in the one
function whose store binding became unused after the conversion.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* MR-854: retrigger CI for flaky Test Workspace job
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* MR-854: convert remaining table_store call sites in export.rs / read_blob
Two leftover `self.table_store.X` / `db.table_store.X` call sites were
missed in the initial sweep — flagged by Devin Review on PR #86. Both
now go through the trait surface:
- `entity_from_snapshot` (db/omnigraph/export.rs): switch from
`db.table_store.open_snapshot_table` + `db.table_store.scan` to
`db.storage().open_snapshot_at_table` + `db.storage().scan`.
- `read_blob` (db/omnigraph.rs): replace
`snapshot.open(table_key)` + `self.table_store.first_row_id_for_filter`
with `self.storage().open_snapshot_at_table` +
`self.storage().first_row_id_for_filter`. The follow-up
`take_blobs` call still needs an `Arc<Dataset>` (it's a Lance blob
accessor not surfaced through the trait), so we hand off via
`SnapshotHandle::into_arc()` with a comment.
After this commit, no engine code outside `table_store.rs` reaches the
inherent `TableStore` API — the docs/runs.md and docs/invariants.md
claim is now uniformly true.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* MR-854: post-rebase doc fixes (Lance 6.0.1, MR-A framing, into_dataset note)
Reviewer feedback on the rebased PR:
* docs/dev/writes.md residuals matrix: drop demoted methods from the trait-surface table (now `pub(crate)`); keep only the two genuine trait-surface residuals (`delete_where`, `create_vector_index`); reframe under MR-A (Lance v7.x bump) per docs/dev/lance.md.
* tests/forbidden_apis.rs: update transitional allow-list header to (a) drop the truncate_table mislabel (truncate_table is a Lance Dataset method, not a TableStore method — overwrite_batch's internal call), (b) reframe trait-surface residuals under MR-A / Lance #6666.
* crates/omnigraph/src/storage_layer.rs::SnapshotHandle::{into_arc, into_dataset}: add single-ref invariant doc — both consume Arc via try_unwrap-or-clone; sibling SnapshotHandle clones across an await point force a deep Dataset clone.
* Replace lance-4.0.0 version refs with lance-6.0.1 in active source/test/dev-doc comments (storage_layer.rs, table_store.rs, table_ops.rs, schema_apply.rs, merge.rs, recovery.rs, staged_writes.rs, consistency.rs, docs/dev/execution.md, docs/user/query-language.md). Historical refs in docs/releases/v0.4.1.md and the canonical "Lance 4.0.0 → 6.0.1 migration" line in docs/dev/lance.md left intact.
No engine code changes.
* MR-854: update docs/dev/invariants.md Storage trait row + gap entry
Reviewer feedback: the docs reorg landed; the invariant row now lives in
docs/dev/invariants.md with stable headings (no more numbered §VI.23).
Update two pieces to reflect MR-854 completion:
* Status table 'Storage trait' row: was 'full call-site migration ... incomplete';
now 'engine call sites all route through db.storage() (MR-854); inline-commit
inherent methods are pub(crate)-demoted; capability/stat surfaces are roadmap'.
* 'Known Gaps' 'Storage abstraction' entry: was 'older inherent TableStore call
sites and inline residuals remain'; now names the closed scope (MR-854 — call
sites migrated, methods demoted, loader fast-paths) and the remaining
trait-surface residuals under MR-A (Lance v7.x bump) and Lance #6666.
Cross-links to docs/dev/lance.md and docs/dev/writes.md so the framing stays
co-located with the canonical Lance surface tracking.
* MR-854: remove dead inline-commit methods from the storage surface
The loader concurrent fast-path (write_batch_to_dataset) is only reached
for LoadMode::Overwrite — Append/Merge route through MutationStaging — so
its Append/Merge arms were unreachable. Collapse it to overwrite-only and
drop the now-unused mode params, which removes the only callers of:
- TableStorage::append_batch + TableStorage::merge_insert_batches (trait)
- TableStore::merge_insert_batch + merge_insert_batches (inherent)
create_btree_index / create_inverted_index had zero callers anywhere
(scalar index builds use the stage_* primitives). Remove both from the
trait and the inherent impl.
Inherent append_batch stays pub(crate): overwrite_batch and recovery
tests use it. Migrate the one trait-append_batch test caller
(seed_person_row) to stage_append + commit_staged. The merge_insert
FirstSeen-workaround rationale moves from the deleted merge_insert_batch
into stage_merge_insert (now the sole merge path). No behavior change.
Also corrects the inaccurate loader residual comment (the prior text
blamed Lance #6658/#6666, which are the delete and vector-index issues,
for keeping overwrite inline; a stage_overwrite primitive already exists
and schema_apply uses it).
* MR-854: seal db.storage() to staged-only; move residuals to InlineCommitResidual
Split the three remaining inline-commit writes (overwrite_batch,
delete_where, create_vector_index) off the TableStorage trait onto a new
sealed InlineCommitResidual trait, reachable only via the explicit
Omnigraph::storage_inline_residual() accessor. db.storage() now exposes
only staged primitives + reads, so engine code cannot couple a write
with a Lance HEAD advance through the default surface — MR-793 acceptance
§1 ("no public method commits as a side effect of writing") now holds by
construction, not by review + naming.
Call sites moved to storage_inline_residual(): loader overwrite
fast-path, the three mutation delete_where paths, the branch-merge
delete, and the vector-index build. Impl bodies are unchanged (same
delegation to the pub(crate) inherent methods); this is a pure surface
reshape with no behavior change.
The residual trait holds two genuinely upstream-blocked methods
(delete_where -> Lance #6658/v7.x, create_vector_index -> Lance #6666)
plus overwrite_batch, kept for the loader's cross-table bulk-overwrite
concurrency until its staged migration lands (tracked follow-up).
* MR-854 docs: describe the staged-only seal; fix stale Lance index URLs
- writes.md / invariants.md / AGENTS.md: the inline-commit residuals now
live on InlineCommitResidual behind db.storage_inline_residual(), so
acceptance §1 holds by construction rather than 'option (b)' per-method
enumeration. Drop the inaccurate 'until Lance exposes
Operation::Overwrite { fragments }' claim (that op exists; stage_overwrite
already builds it) and reframe overwrite_batch as a removable legacy
residual gated on the loader's bulk-overwrite concurrency.
- forbidden_apis.rs: rewrite the allow-list doc for the split surface.
- lance.md: the index spec pages moved from /format/table/index/ to
/format/index/ in Lance 6.x (the old paths 404). Fix all 13 URLs.
* MR-854: fix stale lance-4.0.0 comment refs flagged in review
Addresses greptile (exec/merge.rs) and aaltshuler's stale-version blocker:
update lance-4.0.0 -> 6.0.1 in the comment/doc refs within this PR's
footprint (exec/merge.rs, exec/mutation.rs, docs/dev/writes.md). Also
corrects exec/merge.rs to cite lance#6666 (not #6658) for
build_index_metadata_from_segments — that is the vector-index segment-commit
API; #6658 is the two-phase delete. (Pre-existing 4.0.0 refs in untouched
files like architecture.md/storage.md are main's incomplete migration
cleanup, left out of scope.)
* fix(storage): stage loader overwrites
* fix(storage): stage empty schema rewrites
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Ragnor Comerford <hello@ragnor.co>
|
||
|
|
e0d88d1295
|
fix(unique): collision-free tuple key shared by intake and merge, loud on un-keyable types (#160)
* fix(unique): collision-free tuple key shared by intake and merge, loud on un-keyable types Hardening on top of #133. That PR introduced a shared `loader::composite_unique_key(parts)` joining per-column scalars with U+001F and routed both intake and branch-merge through it, closing the original '|' vs U+001F separator drift. This takes the shared keying the rest of the way to correct-by-design: - Collision-free by construction: the key is now the tuple of per-column scalar strings (Vec<String>) keyed directly, no separator, so no data value (not even a literal U+001F) can forge a collision. - One scalar converter across both paths: intake used an explicit type-match, merge used Arrow's array_value_to_string. Both now derive the key through composite_unique_key(group_columns, row), so they can't drift on conversion. - Loud on un-keyable types: the scalar converter returned None for any Arrow type it didn't recognize, and the caller treated None as null-exempt, so a @unique on a column type it couldn't reduce (list, blob) was silently un-enforced. It now returns Err, surfacing the constraint it can't enforce instead of weakening it in silence. Tests: - consistency::composite_unique_key_is_consistent_across_intake_and_merge pins that intake and merge key the tuple identically (load-on-branch then merge of values containing '|'). - loader unit tests pin tuple keying + null exemption and the loud error on an un-keyable (binary) column. Docs: invariants truth-matrix updated; stale loader/mod.rs line pointers fixed. Scope unchanged: intra-batch / merge-candidate-set only; cross-version uniqueness against committed rows stays a documented gap. * fix(unique): cover all string encodings; make format_tuple private (PR #160 review) Addresses two Greptile P2 comments on PR #160: - unique_key_scalar handled only StringArray (Utf8). The loud-on-unknown-type behavior turned any legal string column that read back as LargeUtf8 or Utf8View into a hard write failure (the old code silently returned None). Add LargeStringArray and StringViewArray arms so a legal string column is keyable in every physical Arrow encoding; the Err path now fires only for a genuinely un-keyable logical type (list/blob/vector), never a legal value in an unenumerated encoding. - format_tuple was pub(crate) but only used within loader/mod.rs; make it a private fn (matches the old format_unique_columns it replaced, minimal exposed surface). New unit test unique_key_scalar_handles_all_string_encodings pins that Utf8 / LargeUtf8 / Utf8View all render rather than error. |
||
|
|
b7f5276ab5
|
fix(loader): enforce composite @unique(a, b) as a true composite key (#133)
* fix(loader): enforce composite @unique(a, b) as a true composite key
Node/edge composite uniqueness constraints were flattened into a single
list of property names, so @unique(a, b) was enforced as independent
single-field checks @unique(a) AND @unique(b) at intake. Preserve the
constraint grouping and check each group as a composite key, mirroring
the merge-path enforcement. Error messages now name the full composite.
MR-983
* docs: clarify unit-separator comment in composite unique check
* docs: fix separator reference in composite unique comment (merge.rs also uses U+001F)
* fix(merge): align composite @unique key separator with intake (U+001F)
The branch-merge path (update_unique_constraints) joined composite key
columns with '|', while intake joins with U+001F. The same @unique(a, b)
was keyed two different ways, and '|'-join can raise phantom merge
conflicts for values containing '|' (e.g. ('x|y','z') vs ('x','y|z')).
Factor the tuple-join into one shared helper (loader::composite_unique_key)
so the intake and merge paths cannot drift again. Add branching regression
tests for edge @unique(src, dst) on the merge path.
Refs MR-983.
---------
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Andrew Altshuler <andrew@collectivelab.io>
|
||
|
|
4a66d6e071
|
fix(loader): accept multi-line (pretty-printed) JSON in load (#146)
The loader read input line-by-line (reader.lines() + serde_json::from_str per line), so any delta where a JSON object spanned multiple lines failed with 'invalid JSON on line 1: EOF while parsing an object'. Compact JSONL worked; pretty-printed JSON never did. Switch to a streaming value deserializer (Deserializer::from_reader().into_iter::<Value>()), which treats any whitespace (including newlines inside objects) as a separator — so both compact JSONL and pretty-printed JSON load. Error labels switch from line numbers to record numbers (line numbers are meaningless once objects span lines). Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com> |
||
|
|
2d5c4b1202
|
docs: rename runs.md/runs.rs → writes and repoint all references (#131)
Some checks failed
CI / Classify Changes (push) Has been cancelled
CI / Check AGENTS.md Links (push) Has been cancelled
CI / Container Entrypoint (push) Has been cancelled
Release Edge / Prepare edge release (push) Has been cancelled
CI / Test Workspace (push) Has been cancelled
CI / Test omnigraph-server --features aws (push) Has been cancelled
CI / Test Windows release binaries (push) Has been cancelled
CI / RustFS S3 Integration (push) Has been cancelled
Release Edge / Build edge omnigraph-linux-x86_64 (push) Has been cancelled
Release Edge / Build edge omnigraph-macos-arm64 (push) Has been cancelled
Release Edge / Build edge omnigraph-windows-x86_64 (push) Has been cancelled
Release Edge / Smoke Windows installer (push) Has been cancelled
The Run state machine was removed in MR-771 (v0.4.0); `docs/dev/runs.md` and `crates/omnigraph/tests/runs.rs` have since documented and tested the direct-publish write path, so the "runs" name was misleading. - git mv docs/dev/runs.md → docs/dev/writes.md (reframe H1 + intro; keep MR-771 history note) - git mv crates/omnigraph/tests/runs.rs → tests/writes.rs (reframe header) - repoint every runs.md / runs.rs reference across docs, AGENTS.md, and source comments - fix four pre-existing broken `docs/runs.md` links (the file never lived at that path) to `docs/dev/writes.md` - fix the stale v0.4.0 anchor to the live section No behavior change: every source edit is a comment. Engine builds and the renamed test passes 25/25; scripts/check-agents-md.sh passes. The run-removal cleanup itself (run_registry.rs guard, __run__ prefix) is deferred to MR-770. |
||
|
|
e0f13b32c5
|
(feat): multi-graph server mode (#119)
* mr-668: add GraphId newtype + Cloud-mode forward identity stubs (PR 1/10)
PR 1 of the MR-668 multi-graph server work. Pure types, no runtime
behavior changes yet.
Ships the validated identity vocabulary that the rest of the implementation
will consume:
- `GraphId(String)` — `^[a-zA-Z0-9-]{1,64}$`, leading underscore rejected
(engine reserves every `_*` filename), reserved route names rejected
(`policies`, `healthz`, `openapi`, `openapi.json`, `graphs`). Validation
lives in `try_from` only; serde `Deserialize` re-runs it so JSON payloads
cannot bypass.
- `TenantId(String)` — same regex shape as GraphId. `None` in Cluster
mode; reserved for Cloud mode (RFC 0003) where it carries the OAuth
`org_id` claim.
- `GraphKey { tenant_id: Option<TenantId>, graph_id }` — the registry
HashMap key. `cluster()` constructor for the Cluster-mode default.
- `Scope` enum with `Full` variant — Cluster mode default; RFC 0004 will
extend with OAuth scopes (`graph:read`/`write`/`admin`/`*`).
- `AuthSource` enum with `Static` variant — Cluster mode default; RFC
0001 step 1 will add `Oidc`.
- `ResolvedActor { actor_id, tenant_id, scopes, source }` — replaces the
upcoming refactor of `AuthenticatedActor(Arc<str>)` in PR 4a.
Per MR-668 design decision 13: ship the Cloud-mode forward type shapes
now (no `TokenVerifier` trait yet — that's RFC 0001 step 1) so handler
signatures stay stable across the Cluster → Cloud trajectory. `Scope`
and `AuthSource` use `#[non_exhaustive]` so future variants don't break
caller matches.
Tests: 26 new (15 graph_id + 11 identity), all passing. No regression
in the existing 36 server library tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: Omnigraph::init error-path cleanup + three failpoints (PR 2a/10)
PR 2a of the MR-668 multi-graph server work. Bug fix: a partially-failed
`Omnigraph::init` previously left orphan schema files at the graph URI,
making the URI unusable for a retry (the next `init` would refuse because
`_schema.pg` already exists).
Changes:
1. `init_with_storage` now wraps the I/O phase. On any error from
`init_storage_phase`, calls `best_effort_cleanup_init_artifacts` to
remove the three schema files before returning the original error:
- `_schema.pg`
- `_schema.ir.json`
- `__schema_state.json`
Cleanup is best-effort: a failure to delete is logged via
`tracing::warn` but does NOT mask the init error.
2. Three failpoints added at the init phase boundaries:
- `init.after_schema_pg_written`
- `init.after_schema_contract_written`
- `init.after_coordinator_init`
3. Four new failpoint tests in `tests/failpoints.rs` pin the cleanup
behavior at each boundary plus the "original error wins over cleanup
error" contract. All 23 failpoint tests pass.
Coverage gap (documented in code comments):
Lance per-type datasets and `__manifest/` directory created by
`GraphCoordinator::init` are NOT cleaned up after a coordinator-init-phase
failure. Recursive directory deletion requires `StorageAdapter::delete_prefix`,
which was deferred along with `DELETE /graphs/{id}` (originally PR 2b). When
that primitive lands, the third failpoint test can be tightened to assert
the graph root is fully empty.
Tests: 4 new (init_failpoint_*), all 23 failpoint tests green. No
regression in the 105 engine library tests or 64 end_to_end tests.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: add GraphHandle + GraphRegistry data structure (PR 3/10)
PR 3 of the MR-668 multi-graph server work. Pure data structure — no
routing changes yet (that's PR 4a).
New file: `crates/omnigraph-server/src/registry.rs`
- `GraphHandle { key: GraphKey, uri: String, engine: Arc<Omnigraph>,
policy: Option<Arc<PolicyEngine>> }` — the per-graph state that the
routing middleware (PR 4a) will inject as a request extension.
- `RegistrySnapshot { graphs: HashMap<GraphKey, Arc<GraphHandle>> }` —
immutable snapshot; replaced atomically via `ArcSwap`.
- `GraphRegistry { snapshot: ArcSwap<_>, mutate: Mutex<()> }` — lock-free
reads, mutex-serialized mutations.
- `RegistryLookup { Ready(Arc<GraphHandle>) | Gone }` — two-valued, no
`Tombstoned` variant since DELETE is deferred in v0.7.0 scope.
- `InsertError { DuplicateKey | DuplicateUri }` — both rejection cases
for create-graph (maps to HTTP 409 in PR 7).
- Methods: `new`, `from_handles` (bulk startup-time init), `get`, `list`,
`len`, `insert`.
Race semantics pinned by three multi-thread tests:
- `concurrent_insert_same_key_exactly_one_succeeds` — N=8 spawned
inserts with the same key; exactly 1 returns Ok, 7 return DuplicateKey.
- `concurrent_insert_distinct_keys_all_succeed` — N=8 spawned inserts
with distinct keys; all succeed.
- `concurrent_reads_during_inserts_see_consistent_snapshots` — reader
loop concurrent with sequential writes; every listed handle's key
resolves via `get()` (no torn state).
Why no tombstones field: `DELETE /graphs/{id}` is deferred to bound
the scope of v0.7.0. Without a delete endpoint, there's no use for
tombstones — every key in the registry is `Ready`, and every key
not in the registry is `Gone`. When DELETE lands later, the
`Tombstoned` variant + `tombstones: HashSet<GraphKey>` slot in
additively without breaking caller signatures (the `Gone` variant
remains the "not currently active" case).
Why `tokio::sync::Mutex`: insert is async because PR 7's flow holds
this mutex across the atomic YAML rewrite step (file I/O). std::Mutex
would footgun across .await.
Dependency additions: `arc-swap = { workspace = true }`,
`thiserror = { workspace = true }` (used by InsertError).
Tests: 12 new (12 passing). 74 server lib tests total green
(62 from PR 1 + 12 new). Clippy clean on server crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: router restructure + handler refactor for multi-graph (PR 4a/10)
PR 4a of the MR-668 multi-graph server work. The heaviest single PR —
rewires every handler to extract `Arc<GraphHandle>` from a routing
middleware, replaces `AuthenticatedActor(Arc<str>)` with `ResolvedActor`
everywhere, and adds the `ServerMode` discriminator.
Behavior changes:
- **Single mode** (legacy `omnigraph-server <URI>`): flat routes
(`/snapshot`, `/read`, `/branches`, …) continue to work exactly as
v0.6.0. Internally, the registry holds a single handle keyed by the
sentinel `SINGLE_GRAPH_KEY_ID = "default"`; routing middleware injects
that handle on every request. No HTTP-visible change.
- **Multi mode** (new): routes nest under `/graphs/{graph_id}/...`.
Routing middleware extracts the graph id from the path, looks it up
in the registry, and injects the handle. 404 if not found.
(Multi-mode startup itself lands in PR 5; this PR provides the
router-side wiring.)
AppState refactor:
- `engine: Arc<Omnigraph>` and `policy_engine: Option<Arc<PolicyEngine>>`
fields removed — both now live inside `GraphHandle` in the registry.
- `mode: ServerMode { Single { uri } | Multi { config_path } }` added.
- `registry: Arc<GraphRegistry>` added.
- `server_policy: Option<Arc<PolicyEngine>>` added (placeholder for
management endpoints in PR 6b; unused today).
- Existing constructors (`new`, `new_with_bearer_token{s,_and_policy}`,
`new_with_workload`, `open*`) build a single-mode AppState
internally and remain source-compatible. Tests that constructed
AppState via these constructors continue to work.
- `with_policy_engine` post-construction setter — rebuilds the
single-mode handle with the policy attached. Engine-layer
enforcement is NOT reinstalled (matches the old single-field
semantics; `open_with_bearer_tokens_and_policy` is the path that
installs both layers).
- `new_multi` constructor added for PR 5's startup loop.
- `uri()` now returns `Option<&str>` (Some in single, None in multi).
Routing middleware:
- `resolve_graph_handle` injects `Arc<GraphHandle>` as a request
extension. Mode-aware: single returns the only handle; multi parses
`/graphs/{graph_id}/...` from the URI. Returns 404 in multi mode
when the graph id is unregistered. Records `graph_id` on the
current tracing span.
- `require_bearer_auth` updated to insert `ResolvedActor` (was
`AuthenticatedActor`).
Handler refactor — every protected handler:
- Gains `Extension(handle): Extension<Arc<GraphHandle>>` param.
- Replaces `state.engine` → `handle.engine`.
- Replaces `state.policy_engine()` → `handle.policy.as_deref()`.
- Replaces `state.uri()` → `handle.uri.as_str()` (or `.clone()`
where String is needed).
- Replaces `Arc::clone(&state.engine)` → `Arc::clone(&handle.engine)`
(the spawn-and-clone pattern in `server_export` — proof that a
long-running export survives the registry being mutated later).
authorize_request signature:
- Was: `(state: &AppState, actor: Option<&AuthenticatedActor>, request: PolicyRequest)`.
- Now: `(actor: Option<&ResolvedActor>, policy: Option<&PolicyEngine>, request: PolicyRequest)`.
- Per-graph callers pass `handle.policy.as_deref()`. The (future PR 6b)
management endpoints will pass `state.server_policy.as_deref()`.
MR-731 invariant preserved:
- The single chokepoint `request.actor_id = actor.actor_id.as_ref().to_string()`
inside `authorize_request` still overwrites any client-supplied
actor identity. Regression test
`actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers`
at `tests/server.rs:1114-1216` passes unchanged.
Tests: 0 new (the registry race tests in PR 3 already cover the
data structure; this PR exercises them indirectly via the existing
test suite). 74 lib + 57 server integration + 60 openapi = 191 tests
green. Clippy clean.
LOC: +397 insertions, -153 deletions in `crates/omnigraph-server/src/lib.rs`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: OpenAPI multi-mode cluster filter (PR 4b/10)
PR 4b of the MR-668 multi-graph server work. In multi mode, the served
`/openapi.json` reports cluster routes (`/graphs/{graph_id}/...`) instead
of the legacy flat protected paths — matching what `build_app` actually
mounts (PR 4a's `Router::nest`). Single mode is unchanged.
Implementation:
- New `server_openapi` branch: when `state.mode()` is `Multi`, call
`nest_paths_under_cluster_prefix(&mut doc)` after `ApiDoc::openapi()`.
- The rewrite consumes `doc.paths.paths`, then for every path-item:
- If the path is in `ALWAYS_FLAT_PATHS` (`/healthz` for now), keep
it flat.
- Otherwise, prefix every operation_id with `cluster_` and reinsert
the item at `/graphs/{graph_id}<original_path>`.
- Single mode hits no extra work — the path map is untouched.
- The static `ApiDoc::openapi()` still emits the flat surface, so
in-process callers (the existing `openapi_json()` helper in tests)
see the unmodified spec.
Why cluster_ prefix on operation IDs: OpenAPI specs require unique
operation_ids across the document. With both flat (single-mode) and
cluster (multi-mode) surfaces ever co-existing in a generated SDK,
the prefix prevents collision. The current served doc only carries
one surface, so the prefix is forward-compat with potential future
dual-surface generation.
Tests: 6 new in `tests/openapi.rs`, all via the `/openapi.json` route
(not the static `ApiDoc::openapi()` helper):
- `multi_mode_openapi_lists_cluster_paths` — every protected path
appears as a cluster variant.
- `multi_mode_openapi_drops_flat_protected_paths` — flat protected
paths are absent.
- `multi_mode_openapi_keeps_healthz_flat` — `/healthz` survives.
- `multi_mode_openapi_prefixes_operation_ids_with_cluster` — every
cluster operation_id starts with `cluster_`.
- `multi_mode_operation_ids_are_unique` — no operation_id collisions.
- `single_mode_openapi_unchanged_by_cluster_filter` — single mode
still emits the legacy flat surface (regression).
New test helper `app_for_multi_mode(graph_ids)` exercises the new
`AppState::new_multi` constructor from PR 4a — first user of multi-mode
construction outside of unit tests.
Result: 66 openapi tests + 57 server integration tests + 74 lib tests
= 197 green. No regression in the existing OpenAPI drift check
(`openapi_spec_is_up_to_date` still validates the static flat surface
matches the committed openapi.json).
LOC: +67 in lib.rs (rewrite logic), +219 in tests/openapi.rs (test
suite + helper).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: multi-graph startup + mode inference (PR 5/10)
PR 5 of the MR-668 multi-graph server work. This is the first PR that
makes multi mode actually usable end-to-end: operators invoking
`omnigraph-server --config omnigraph.yaml` with a non-empty `graphs:`
map and no single-mode selector now get a running multi-graph server.
Mode inference (MR-668 decision 2, four-rule matrix in
`load_server_settings`):
1. CLI `<URI>` positional → Single
2. CLI `--target <name>` → Single (URI from graphs.<name>)
3. `server.graph` in config → Single (URI from graphs.<name>)
4. `--config` + non-empty `graphs:` + no single-mode selector
→ Multi (all entries in `graphs:`)
5. otherwise → error with migration hint
Rule 5's error message names every escape hatch so operators can fix
their invocation without grepping docs.
Config schema extensions:
- `TargetConfig.policy: PolicySettings` (per-graph Cedar policy file).
`#[serde(default)]` so existing single-graph YAMLs keep parsing.
- `ServerDefaults.policy: PolicySettings` (server-level Cedar policy
for management endpoints — loaded in PR 5, wired into `GET /graphs`
in PR 6b).
- `OmnigraphConfig::resolve_target_policy_file(name)` and
`resolve_server_policy_file()` helpers — both resolve relative to
the config file's `base_dir`.
Public types added to `omnigraph-server`:
- `ServerConfigMode { Single { uri, policy_file } | Multi { graphs,
config_path, server_policy_file } }`.
- `GraphStartupConfig { graph_id, uri, policy_file }` — one entry
per graph in multi mode.
`ServerConfig` shape change:
- WAS: `{ uri: String, bind, policy_file, allow_unauthenticated }`.
- NOW: `{ mode: ServerConfigMode, bind, allow_unauthenticated }`.
- Breaking for any code that constructs `ServerConfig` directly.
`main.rs` is unaffected (uses `load_server_settings`).
`serve()` now forks on `ServerConfig.mode`:
- Single: existing flow via `AppState::open_with_bearer_tokens_and_policy`.
- Multi: parallel open via `futures::stream::iter(graphs)
.map(open_single_graph).buffer_unordered(4).collect()`. Bound 4 is
a rule-of-thumb for I/O-bound work — at N≤10 this trades startup
latency for a small amount of concurrent S3/Lance open pressure.
Fail-fast: first open error aborts startup; in-flight opens drop
their engine via Arc (Lance datasets close cleanly).
New helper `open_single_graph(GraphStartupConfig)`:
- Validates `GraphId` per the regex in PR 1.
- `Omnigraph::open(uri).await` with descriptive error context.
- Loads per-graph policy file and re-applies it via
`Omnigraph::with_policy` (engine-layer enforcement, MR-722).
- Returns `Arc<GraphHandle>` ready for the registry.
Routing middleware bug fix:
- `Router::nest("/graphs/{graph_id}", inner)` rewrites
`request.uri().path()` to the inner suffix (e.g. `/snapshot`).
The previous middleware tried to parse `{graph_id}` from
`request.uri().path()` and got 400 instead of 200. Fixed by reading
from `axum::extract::OriginalUri` request extension, which preserves
the pre-rewrite URI.
- Caught by the two new tests
`cluster_routes_dispatch_per_graph_handle` and
`cluster_route_for_unknown_graph_returns_404`.
Tests (14 new, all passing):
- Four-rule matrix: one test per branch + the joint case
`mode_inference_cli_uri_overrides_graphs_map` + the empty-graphs-map
error case.
- Per-graph + server-level policy file path resolution.
- Reserved `GraphId` rejection at startup.
- End-to-end multi-graph routing: two graphs side by side, each
cluster route hits the right engine.
- Unknown graph id under cluster prefix → 404.
- Flat routes 404 in multi mode.
Inline `ServerConfig` test (`serve_refuses_to_start_in_state_1_without_unauthenticated`)
and three `server_settings_*` tests updated to the new `mode` shape.
Result: 211 server tests green (74 lib + 71 integration + 66 openapi),
MR-731 regression test still pinned and passing.
LOC: +45 config.rs, +281 lib.rs (net), +395 tests/server.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: Cedar resource-model refactor (PR 6a/10)
PR 6a of the MR-668 multi-graph server work. Policy-crate-only refactor —
no HTTP handler changes, no operator-supplied policy.yaml changes. Sets
up the chassis that PR 6b's `GET /graphs` consumes.
Two new `PolicyAction` variants:
- `GraphCreate` — gates `POST /graphs` (deferred behavioral PR).
- `GraphList` — gates `GET /graphs` (lands in PR 6b).
Note: `GraphDelete` is intentionally NOT added in this PR. `DELETE
/graphs/{id}` is deferred from MR-668's v0.7.0 scope to bound complexity
(no `delete_prefix`, no tombstone, no `RegistryLookup::Tombstoned`).
Adding the Cedar action without a consumer would be the same kind of
"dead vocabulary" trap the `Admin` variant already documents.
New `PolicyResourceKind { Graph, Server }` enum, plus a
`PolicyAction::resource_kind()` method that classifies every action.
Per-graph actions (Read, Change, BranchCreate, …) bind to
`Omnigraph::Graph::"<graph_label>"`; server-scoped actions
(GraphCreate, GraphList) bind to the singleton
`Omnigraph::Server::"root"`. `Admin` stays classified as per-graph for
now — MR-724 will pick the final shape when the first consumer surface
ships.
Cedar schema string additions:
- `entity Server;`
- `action "graph_create" appliesTo { principal: Actor, resource: Server, ... }`
- `action "graph_list" appliesTo { principal: Actor, resource: Server, ... }`
Compiler updates:
- `compile_policy_source` picks the resource literal based on the
action's `resource_kind`. Existing graph-only policies generate
the same Cedar source as before — pinned by
`per_graph_rules_continue_to_work_alongside_server_rules`.
- `compile_entities` includes the `Server::"root"` entity only when
a rule references a server-scoped action. Keeps test assertions
for graph-only policies tight.
- `PolicyEngine::authorize` builds the right resource UID at
request time based on `request.action.resource_kind()`.
Validation rules added to `PolicyConfig::validate`:
- A rule may not mix server-scoped and per-graph actions (different
resource kinds need different `permit` clauses).
- Server-scoped actions cannot have `branch_scope` or
`target_branch_scope` — there's no branch context at the server
level.
Operator impact: zero. The Cedar schema `Omnigraph::Server` entity is
internally referenced by `compile_policy_source`; operator policy.yaml
files only declare actions in `rules[].allow.actions` and never
reference the resource entity directly. Decision 6's "internal rename
only; operator policies unaffected" contract is preserved and pinned
by `per_graph_rules_continue_to_work_alongside_server_rules`.
Tests: 5 new (11 policy tests total, up from 6):
- `graph_list_action_authorizes_against_server_resource`
- `graph_create_action_authorizes_against_server_resource`
- `server_scoped_rule_cannot_use_branch_scope`
- `rule_mixing_server_and_per_graph_actions_is_rejected`
- `per_graph_rules_continue_to_work_alongside_server_rules`
No regression: 145 server tests (74 lib + 71 integration) still green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: GET /graphs endpoint + per-graph policy wire-up (PR 6b/10)
PR 6b of the MR-668 multi-graph server work. First management endpoint —
`GET /graphs` lists every graph registered with the server, gated by the
server-level Cedar policy from PR 6a.
New API shapes (in `omnigraph-server::api`):
- `GraphInfo { graph_id, uri }` — one entry per registered graph.
- `GraphListResponse { graphs: Vec<GraphInfo> }` — sorted alphabetically
by `graph_id` for deterministic output.
Handler `server_graphs_list`:
- Mounted at `GET /graphs` in both modes.
- Single mode: returns 405 (resource exists in the API surface, just
not operational without a `graphs:` map). 405 chosen over 404 so
clients see "resource exists, wrong context" rather than "no such
resource".
- Multi mode: requires bearer auth (when configured); Cedar-gated by
`PolicyAction::GraphList` against `Omnigraph::Server::"root"`
(PR 6a's chassis). Returns the sorted registry list.
Cedar gate composition:
- When no `server.policy.file` is configured, the MR-723 default-deny
falls through: `GraphList` is not `Read`, so an authenticated actor
without a server policy gets 403. This is the right default — don't
expose the registry until the operator explicitly authorizes it.
- When a server policy is configured, Cedar evaluates the rule. The
test `get_graphs_with_server_policy_authorizes_per_cedar` pins the
admin-allow / viewer-deny split.
Routing:
- New `management` sub-router holding `/graphs` (auth-required, no
`resolve_graph_handle` middleware — operates on the registry, not
a single graph).
- Single mode merges flat protected routes + management.
- Multi mode merges nested `/graphs/{graph_id}/...` + management.
OpenAPI:
- `server_graphs_list` registered in `ApiDoc::paths(...)`.
- `EXPECTED_PATHS` in `tests/openapi.rs` gains `/graphs`.
- `openapi.json` regenerated (auto-tracked by
`openapi_spec_is_up_to_date` in CI).
Tests: 4 new in `tests/server.rs::multi_graph_startup`:
- `get_graphs_lists_registered_graphs_in_multi_mode`
- `get_graphs_returns_405_in_single_mode`
- `get_graphs_requires_bearer_auth_when_configured`
- `get_graphs_with_server_policy_authorizes_per_cedar`
What's NOT in this PR (deferred):
- Per-graph policy enforcement is wired through `handle.policy`
(PR 4a already did this); PR 6b doesn't add new per-graph
behavior beyond making sure the server policy lookup composes
cleanly alongside it.
- `POST /graphs` (PR 7) and `DELETE /graphs/{id}` (out of scope
for v0.7.0).
- CLI `omnigraph graphs list` (PR 8 will add).
Result: 215 server tests green (74 lib + 66 openapi + 75 integration),
11 policy tests green. MR-731 spoof regression preserved across all
this work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: POST /graphs runtime create endpoint (PR 7/10)
PR 7 of the MR-668 multi-graph server work. Operators can now add a
graph to a running multi-graph server without restarting:
curl -X POST http://server/graphs \
-H "Content-Type: application/json" \
-d '{
"graph_id": "beta",
"uri": "/data/beta.omni",
"schema": { "source": "node Person { name: String @key }\n" },
"policy": { "file": "./policies/beta.yaml" }
}'
DELETE remains deferred (out of v0.7.0 scope per the trimmed plan —
no `delete_prefix`, no tombstones).
Body shape (decision 7):
- Nested `schema: { source: "..." }` (mirrors the `policy: { file }`
pattern; leaves room for future fields without breakage).
- Optional nested `policy: { file: "..." }` for per-graph Cedar.
- 32 MiB body limit (reuses `INGEST_REQUEST_BODY_LIMIT_BYTES`).
- Asymmetric with `SchemaApplyRequest` which keeps flat
`schema_source: String` — documented in api.rs.
Atomic YAML rewrite + drift detection:
- New `config::rewrite_atomic(path, new_config, expected_hash)`:
flock → re-read + hash check → serialize → write `.tmp` → fsync
→ rename → fsync parent dir. Returns the new hash for the caller
to update its in-memory baseline.
- New `config::hash_config_file(path)` — SHA-256 of the on-disk
bytes, used at startup and after each rewrite.
- New `RewriteAtomicError { Drift | Io | Serialize }` enum.
- `AppState.config_hash: Option<Arc<Mutex<[u8;32]>>>` carries the
in-memory baseline. Updated after every successful rewrite so
subsequent POSTs don't false-trigger drift.
- The mutex is `std::sync::Mutex` (brief critical section, no .await
inside). The flock itself serializes file access process-wide
AND across multiple server instances (defense in depth).
- All sync I/O runs inside `tokio::task::spawn_blocking` — flock
is sync.
Handler ordering (the load-bearing sequence):
1. Mode check: 405 in single mode.
2. Cedar authorize: `GraphCreate` against `Omnigraph::Server::"root"`.
3. Validate body: `GraphId::try_from` (regex + reserved-name), empty
schema/uri checks, per-graph policy file parse.
4. Pre-check registry for duplicate graph_id / duplicate uri (409).
5. `Omnigraph::init` the new engine.
6. Atomic YAML rewrite (drift detection inside).
7. Publish in registry (atomic re-check via `GraphRegistry::insert`).
Failure modes (documented in handler rustdoc):
- Init fails → orphan storage at `req.uri` (PR 2a cleans up schema
files; Lance datasets remain orphans until `delete_prefix` lands).
- YAML rewrite fails (drift, IO) → orphan storage; YAML unchanged.
- Registry insert fails (race) → YAML has entry but registry doesn't;
next restart opens it cleanly.
New dependency: `fs2 = "0.4"` (workspace + omnigraph-server). POSIX-only
file locking. Linux/macOS deployment supported; Windows out of scope.
Tests (10 new in `tests/server.rs::multi_graph_startup`):
- `post_graphs_creates_a_new_graph_end_to_end` — happy path, includes
YAML inspection to confirm the rewrite landed.
- `post_graphs_baseline_hash_updates_between_rewrites` — two POSTs in
a row both succeed (drift baseline updates correctly).
- `post_graphs_duplicate_graph_id_returns_409`
- `post_graphs_duplicate_uri_returns_409`
- `post_graphs_invalid_graph_id_returns_400` (reserved name)
- `post_graphs_empty_schema_source_returns_400`
- `post_graphs_returns_405_in_single_mode`
- `post_graphs_yaml_drift_detection_returns_503` — operator hand-edits
omnigraph.yaml; server refuses to clobber.
- `hash_config_file_is_deterministic_and_detects_changes`
- `rewrite_atomic_refuses_when_hash_drifts`
OpenAPI: `server_graphs_create` registered in `ApiDoc::paths(...)`;
openapi.json regenerated.
Result: 225 server tests green (74 lib + 66 openapi + 85 integration),
all MR-731 regressions still pinned.
LOC: ~580 lib.rs net (handler + helpers), ~120 config.rs (rewrite
machinery), +71 api.rs (request/response shapes), +332 tests/server.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: CLI omnigraph graphs list/create (PR 8/10)
PR 8 of the MR-668 multi-graph server work. CLI parity for the
v0.7.0 management surface: operators can now manage graphs from
the command line against a running multi-graph server.
omnigraph graphs list --target dev --json
omnigraph graphs create \
--target dev \
--graph-id beta \
--graph-uri /data/beta.omni \
--schema schema.pg
DELETE is intentionally absent — server-side DELETE was deferred from
v0.7.0 scope, and shipping a client subcommand for a server endpoint
that doesn't exist would be dead vocabulary. The help output, the
subcommand enum, and the test that pins it (`graphs_subcommand_help_
lists_list_and_create`) all agree.
CLI architecture (modeled on `BranchCommand`):
- New `Command::Graphs { command: GraphsCommand }` top-level variant.
- `GraphsCommand { List, Create }` enum.
- List: GET `<base>/graphs`. Stdout is `<graph_id>\t<uri>` per line,
or JSON via `--json`.
- Create: reads `--schema <path>` from local disk, inlines as
`schema: { source: <file> }` in the POST body (nested per
MR-668 decision 7). Optional `--policy-file <path>` becomes
`policy: { file: <path> }`. Returns 201 → "created graph X at Y"
or JSON via `--json`.
- Both subcommands reject local URI targets with a clear
"remote multi-graph server URL" error.
New API type imports in the CLI: `GraphCreateRequest`,
`GraphCreateResponse`, `GraphListResponse`, `GraphSchemaSpec`,
`GraphPolicySpec` — all from `omnigraph-server::api`.
Tests:
- cli.rs (4 new, non-network):
* `graphs_subcommand_help_lists_list_and_create` — pins the
deferral of `delete` (catches scope creep).
* `graphs_list_against_local_uri_errors_with_remote_only_message`
* `graphs_create_against_local_uri_errors_with_remote_only_message`
* `graphs_create_with_missing_schema_file_errors` — pins the
IO context in the schema-read error path.
- system_remote.rs (1 new, `#[ignore]` like its peers):
* `graphs_list_and_create_against_multi_graph_server` — spawns a
multi-mode server, calls `graphs list` (sees `alpha`),
`graphs create` (adds `beta`), `graphs list` again (sees both),
and confirms the new graph is reachable via its cluster route.
CLI suite: 62 tests green (58 existing + 4 new). The new ignored
end-to-end test runs locally with `cargo test --ignored`.
LOC: +159 main.rs (enum + handlers), +88 cli.rs (unit tests),
+131 system_remote.rs (integration test).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: composite e2e tests, race fix, v0.7.0 release (PR 9/10)
PR 9 — the final integration PR for MR-668 multi-graph server work.
Closes the v0.7.0 release.
Composite lifecycle tests (closes gaps flagged in PR 7's coverage
review):
- `multi_graph_lifecycle_post_query_restart_persistence` — POST a
graph, query it via cluster route, reload the config from disk
and confirm `load_server_settings` sees the rewritten YAML.
Validates the "restart resolves orphans" failure-mode story.
- `per_graph_policy_enforced_on_post_created_graph` — POST a graph
with a per-graph policy attached, then send authenticated read
and change requests. Per-graph Cedar enforcement fires correctly
on a POST-created graph (engine-layer policy reinstalled via
`Omnigraph::with_policy` inside the create flow).
- `concurrent_post_graphs_distinct_ids_all_succeed` — 4 concurrent
POSTs with distinct graph_ids all return 201. Caught a real
race in `rewrite_atomic` (see below).
Race fix — `rewrite_atomic_with_modify`:
The first composite test surfaced a real bug. The old
`rewrite_atomic(path, new_config, expected_hash)` captured the
baseline hash OUTSIDE the flock, then called rewrite_atomic which
re-acquired it inside. Under concurrent writers:
- POST A: captures baseline H0, calls rewrite_atomic.
- POST B: captures baseline H0 too (before A's update lands).
- A: acquires flock, on-disk == H0, writes H1, releases.
- A: updates baseline H0 → H1.
- B: tries to acquire flock — waits.
- B: acquires flock. On-disk is now H1. Expected (captured
before A finished) is H0. MISMATCH → spurious Drift error.
Worse: even if the timing happens to align, B's `updated` config
was constructed from BYTES read before the flock. B writes a config
that doesn't include A's new graph — silent data loss.
The fix: new `config::rewrite_atomic_with_modify(path, baseline,
modify)` takes a closure. Inside the flock + baseline mutex:
1. Read on-disk bytes, hash, compare to baseline.
2. Parse on-disk YAML.
3. Call `modify(parsed)` to produce the new config — receives
fresh on-disk state, returns the modification.
4. Serialize + write + fsync + rename + update baseline.
Everything is read-modify-write under the same critical section.
Concurrent writers serialize cleanly. Test confirmed this is no
longer a race.
The old `rewrite_atomic(path, new_config, expected_hash)` API stays
for tests that don't need the read-modify-write shape; the POST
handler switches to the new shape.
Version bump v0.6.0 → v0.7.0:
- All 5 `crates/*/Cargo.toml` (compiler, engine, policy, cli, server)
plus their inter-crate `path` dep version constraints.
- `Cargo.lock` regenerated by `cargo build --workspace`.
- `AGENTS.md` "Version surveyed" line, capability matrix HTTP-server
row updated to mention multi-graph + cluster routes + atomic YAML
rewrite.
- `openapi.json` regenerated.
Docs:
- `docs/releases/v0.7.0.md` (new) — release notes with breaking
changes, new features, deferred items (DELETE, `delete_prefix`,
actor forwarding), and the single→multi migration recipe.
- `docs/user/server.md` — substantial section additions for the
two modes, mode inference, cluster endpoint table, management
endpoints, `omnigraph.yaml` ownership contract, `POST /graphs`
body shape + status codes.
- `docs/user/cli.md` — `omnigraph graphs list/create` section,
deferred-DELETE note.
- `docs/user/policy.md` — server-scoped Cedar actions
(`graph_create`, `graph_list`), per-graph vs server-level policy
composition, example server-level policy.
Workspace test pass: 573 tests green across all crates. Zero
failures. MR-731 spoof regression still pinned and passing across
the entire 10-PR series.
This commit closes MR-668. v0.7.0 is ready for tagging.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: remove POST /graphs and CLI graphs create (defer runtime graph mgmt)
The POST /graphs runtime-create endpoint shipped in PR 7/10 has three
unresolved high-severity bugs:
- flock-on-renamed-inode race: the YAML flock is taken on
omnigraph.yaml itself, then a temp file is renamed over it.
Cross-process writers end up locking different inodes — both
believing they hold exclusive access.
- duplicate-check outside the file lock: precheck runs against
the in-memory registry only; the locked closure does
config.graphs.insert(...) unconditionally. Concurrent same-id
POSTs can persist the loser in YAML while the in-memory registry
keeps the winner — they disagree after restart.
- best_effort_cleanup_init_artifacts deletes _schema.pg /
_schema.ir.json / __schema_state.json on any init failure. An
accidental re-init against an existing graph's URI destroys its
schema; subsequent open() fails at read_text(_schema.pg).
The correct fix is a Lance-style cluster catalog (reserve → init →
publish with recovery sidecars), parallel to the engine's existing
__manifest discipline. That work is out of scope for v0.7.0.
For now, disable runtime add/remove from the network and CLI surface.
Operators add graphs by editing omnigraph.yaml and restarting. The
GET /graphs read-only enumeration stays.
Removed:
- POST /graphs handler + router fragment + utoipa registration
- 13 post_graphs_* server tests + 3 composite POST tests +
multi_mode_app_with_real_config / post_graph helpers
- CLI omnigraph graphs create subcommand + its handler + cli.rs tests
- system_remote.rs combined list+create test trimmed to list-only
- YAML rewrite infra: rewrite_atomic[_with_modify], RewriteAtomicError,
staging_path, hash_config_file, AppState::config_hash field +
threading through new_multi and open_multi_graph_state
- fs2 dependency (verified absent from cargo tree)
- sha2/fs2 imports in config.rs (only the rewrite path used them)
- Cedar PolicyAction::GraphCreate variant + "graph_create" match arms
+ action def in Cedar schema + graph_create_action_authorizes_against_server_resource test
- GraphCreateRequest / GraphCreateResponse / GraphSchemaSpec /
GraphPolicySpec API types (only the POST handler / CLI imported them)
Kept:
- GET /graphs (read-only enumeration) and graph_list Cedar action
- omnigraph graphs list CLI subcommand
- All multi-graph startup, mode inference, cluster routes,
per-graph + server-level Cedar policies
- server_settings_drive_multi_graph_startup_end_to_end (the test
that covers operator-authored YAML + restart — the path that
survives)
- best_effort_cleanup_init_artifacts and the three init failpoints
(still reachable from CLI `omnigraph init`; preflight fix deferred
as a follow-up)
- GraphRegistry::insert and its concurrency tests — production
callers gone, but the method is the natural seam for the future
cluster-catalog work
Also fixed (transcript issue 4):
- ALWAYS_FLAT_PATHS now includes /graphs so multi-mode OpenAPI
advertises the management route correctly (was previously rewritten
to /graphs/{graph_id}/graphs)
- multi_mode_openapi_keeps_healthz_flat → renamed to
multi_mode_openapi_keeps_management_paths_flat, asserts both
/healthz and /graphs stay flat
- multi_mode_openapi_prefixes_operation_ids_with_cluster skips
/graphs in addition to /healthz
Doc fixes:
- docs/user/cli.md: graphs list example was --target http://...,
but --target is a config-graph-name lookup; corrected to --uri.
Removed the graphs create example.
- docs/user/server.md: dropped POST /graphs row, "omnigraph.yaml
ownership", and "POST /graphs body shape" sections. Added a
paragraph stating runtime add/remove is not exposed in v0.7.0.
- docs/user/policy.md: dropped graph_create action; reworded the
"Configuration" line to clarify that server-scoped rules (graph_list)
take neither branch_scope nor target_branch_scope.
- docs/releases/v0.7.0.md: rewrote release narrative — multi-graph
mode ships; runtime add/remove deferred.
- AGENTS.md: HTTP server bullet and capability matrix row updated to
reflect read-only GET /graphs and the operator-edit workflow.
- openapi.json regenerated; /graphs has only .get, no .post.
Diff: 17 files, +123 −1525 LOC.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: comment cleanup and policy format style
Strip "PR Na/Nb" sub-PR references throughout MR-668 surfaces — they
were useful during the 10-PR delivery sequence but rot now that the
work is in the tree. Keep the MR-668 umbrella references.
Also:
- Add explicit `when = when` and `resource_literal = resource_literal`
named args in `compile_policy_source`'s outer `format!` to match the
surrounding crate style (already explicit for `group` and `action`).
- Rename the best-effort cleanup tracing target from
"omnigraph::init" to "omnigraph::init::cleanup" so operators can
filter init-failure cleanup events separately from init's other
log lines.
No behavior change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: drop actor_id from PolicyRequest; pass actor as separate arg
The MR-731 "server-authoritative actor identity" invariant was enforced
by an in-function chokepoint (`request.actor_id = actor.actor_id...`
overwrite inside `authorize_request`). That worked but relied on every
caller passing in a `PolicyRequest` and trusting the overwrite — a
comment-enforced invariant.
Move the invariant into the type system:
* `PolicyRequest` no longer carries `actor_id`. The struct now models
what a caller wants to do, not who they are.
* `PolicyEngine::authorize(actor_id: &str, request: &PolicyRequest)`
and `validate_request(actor_id, request)` take identity as a
separate argument. The same shape `PolicyChecker::check` already had
for the engine layer.
* `authorize_request` in the HTTP layer extracts `actor_id` from the
bearer-resolved `ResolvedActor` and passes it positionally — no
overwrite step that could be skipped.
* CLI `omnigraph policy explain` updated (the only other consumer
that built a `PolicyRequest`).
Public API break for the `omnigraph-policy` crate. Worth it: handlers
can no longer accidentally populate `actor_id` from a request body
field, and external consumers are forced by the compiler to source
actor identity from a trusted path.
The MR-731 chokepoint test
`actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers`
still passes — the bearer-resolved actor is what reaches the engine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: consolidate AppState single-mode constructors; delete with_policy_engine
The prior `with_policy_engine` constructor reused the engine `Arc`
from the existing handle (`engine: Arc::clone(&existing.engine)`)
without re-applying `Omnigraph::with_policy`. Combined with
`new_with_workload`, the documented composition pattern was
`AppState::new_with_workload(...).with_policy_engine(p)` — which
produced an `AppState` whose HTTP layer enforced Cedar but whose
underlying engine had no `PolicyChecker` installed. Any caller
reaching the engine via `state.registry().list()[i].engine` could
bypass policy entirely. The doc comment named this gap; the type
system didn't.
Make composition impossible to get wrong:
* Add `AppState::new_single(uri, db, tokens, Option<PolicyEngine>,
WorkloadController)` — canonical single-mode constructor that
takes every option together and routes through `build_single_mode`
(which applies `db.with_policy(checker)` to the engine itself).
* `new`, `new_with_bearer_token`, `new_with_bearer_tokens`,
`new_with_bearer_tokens_and_policy`, `new_with_workload` all become
thin wrappers around `new_single`.
* Delete `with_policy_engine`. There is no post-construction policy
install path any more; the single linear construction forces
HTTP-layer and engine-layer policy to install together or not at all.
Regression test `engine_layer_policy_fires_via_direct_arc_omnigraph_from_new_single`
constructs an `AppState::new_single` with a deny-all policy, pulls
the `Arc<Omnigraph>` from the registry handle (the same path an
embedded SDK consumer would take), and asserts a direct `mutate_as`
call returns `OmniError::Policy`. Pre-fix this test would have
succeeded the mutation.
Test caller in `ingest_per_actor_admission_cap_returns_429`
migrates from `.with_policy_engine(...)` to `new_single(...,
Some(policy_engine), workload)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: derive any_per_graph_policy on RegistrySnapshot; simplify dup check
`AppState::requires_bearer_auth` walked the entire registry per
request (cloning Arcs into a `Vec`, then `.iter().any(|h| h.policy
.is_some())`) to decide whether the auth middleware should challenge.
The walk is unnecessary — the answer only changes when the registry
mutates, which is exactly the moment a new snapshot is constructed.
Move the flag onto the snapshot itself:
* `RegistrySnapshot { graphs, any_per_graph_policy: bool }`.
* `RegistrySnapshot::new(graphs)` is the only construction path —
it derives the flag from `graphs.values().any(|h| h.policy
.is_some())` so the cached value can't drift from the source data.
* `Default` delegates to `new(HashMap::new())`.
* `GraphRegistry::from_handles` and `insert` build snapshots via
`RegistrySnapshot::new(...)`.
* `GraphRegistry::snapshot_ref()` exposes the current snapshot
through an `arc_swap::Guard`; callers that need cached derived
state go through this accessor (callers that only want `graphs`
still use `list` / `get`).
`requires_bearer_auth` becomes one `ArcSwap::load` + bool read.
Also (drive-by, same file, same hunk): replace the dead
`if let Some(other) = seen_uris.get(...)` + `let _ = other;` pattern
in `from_handles` with a plain `seen_uris.contains_key(...)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: fail-fast multi-graph startup with try_collect
The `open_multi_graph_state` doc comment claims "Fail-fast — the
first open error aborts startup; other in-flight opens are dropped"
but the code did
.buffer_unordered(4)
.collect::<Vec<_>>()
.await
.into_iter()
.collect::<Result<Vec<_>>>()?;
which drains every future in the stream before propagating the first
`Err`. With N S3-backed graphs and graph #2 failing fast, the caller
still waits for #1, #3, #4, … to either succeed or fail before
seeing the error.
Replace the four-line dance with `futures::TryStreamExt::try_collect`,
which short-circuits on the first `Err` and drops the rest. The
doc comment now matches behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: drop unused State extractor from 7 read-only handlers
After the routing-middleware refactor moved the engine into the per-graph
`GraphHandle` (extracted via `Extension<Arc<GraphHandle>>`), seven
read-only handlers — `server_snapshot`, `server_read`, `server_export`,
`server_schema_get`, `server_branch_list`, `server_commit_list`,
`server_commit_show` — kept an unused `State(_state): State<AppState>`
extractor. Drop it. Each request avoids one `FromRequestParts` clone
of `AppState`'s Arcs.
Handlers that actually use state (workload admission for write paths,
`server_policy` for management endpoints) keep theirs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: emit info! for graph routing decision
`tracing::Span::current().record("graph_id", ...)` in the routing
middleware silently no-ops here: no upstream `#[tracing::instrument]`
on the handlers declares a `graph_id` field, and `TraceLayer::new_for_http`
doesn't either. The recorded value never lands anywhere visible.
Replace with an explicit `info!(graph_id = %handle.key.graph_id,
"graph routed")` event so operators can grep logs and correlate
requests with the active graph. In single mode the value is the
sentinel `"default"`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: align GET /graphs 405 body code with HTTP status
The single-mode `GET /graphs` handler returned an `ApiError` built
via struct literal with `status: METHOD_NOT_ALLOWED, code: BadRequest`.
The body code disagreed with the HTTP status — clients deserializing
on `code` saw `bad_request`, clients deserializing on `status` saw
405. Same bug class as the earlier 503+Conflict mismatch on the
removed YAML drift path.
Close the class for this one remaining instance:
* Add `ErrorCode::MethodNotAllowed` to the API enum.
* Add `ApiError::method_not_allowed(msg)` — pairs the 405 status
with the matching code.
* Replace the struct literal in `server_graphs_list` with the
constructor.
* Regenerate `openapi.json` (adds `method_not_allowed` to the
ErrorCode schema enum).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: drop unused axum::handler::Handler import
The import landed in earlier work but no current call site uses it.
Emitted an `unused_imports` warning on every server build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: drop unused fs2 workspace dependency
`fs2 = "0.4"` lingered in [workspace.dependencies] after the
POST /graphs flock-on-rename design was pulled. `cargo tree -i fs2`
reports no consumers in the workspace and the dep is not in
Cargo.lock. Removing the declaration closes the "phantom dep" class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: AGENTS.md Cedar row no longer hardcodes action count
The "8 actions" claim drifted as soon as MR-668 added `graph_list`.
Bumping the count would just push the drift one PR forward; the
correct-by-design fix is to defer to the canonical list in
docs/user/policy.md and stop maintaining a duplicate count.
Closes the "doc hardcodes a count that drifts from the enum" class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: cfg(test)-gate GraphRegistry::insert and its mutex
`insert` and the `mutate: Mutex<()>` that serializes it had no
runtime consumer in v0.7.0 — the only insertion path at startup
is `from_handles`, and runtime add/remove is deferred until a
managed cluster catalog ships. Leaving both `pub` and live made
them a "looks like API, isn't" footgun: a future change could
build on `insert` without re-establishing the concurrency contract
with an actual consumer in scope.
Gate both together (`#[cfg(test)]` on the method, the field, and
the `tokio::sync::Mutex` import) so the race-pinning tests still
compile but production cannot reach them. When a real consumer
ships, ungate both — they're a unit. Closes the "public API with
no runtime consumer drifts toward incorrect" class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: drop vestigial PolicyEngine surface
* `validate_request` had zero callsites — pure surface for nothing.
* `deny`'s `_actor_id` and `_request` parameters were both unused
(the underscore prefix gave it away); the message is built by the
caller before `deny` ever sees the request. Trim both.
Closes the "public API that the type system can't justify" class
for the policy engine. No behavior change; every existing test
stays green because the deletions never had a runtime effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: regression test for init re-init footgun (red)
A second `Omnigraph::init` against an existing graph URI today
destroys the existing graph's schema artifacts. `init_storage_phase`
overwrites `_schema.pg` before any preflight, and on the inner
`GraphCoordinator::init` failure that follows,
`best_effort_cleanup_init_artifacts` deletes all three schema files.
The existing Lance datasets and `__manifest/` survive but the
schema metadata is gone — unrecoverable without operator surgery.
This test exercises that path and currently fails with
"_schema.pg must not be deleted by a failed re-init", confirming
the destructive cleanup branch fires. The fix in the next commit
makes the test pass by preflighting with `storage.exists()` and
returning a typed error before any write touches disk.
Per AGENTS.md rule 12, the test commit lands just before the fix
commit so the red → green pair is visible in `git log` and a
reviewer can check out this commit alone to reproduce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: close init re-init footgun via InitOptions preflight (green)
`Omnigraph::init` is "create a new graph"; existing graphs need
an explicit overwrite. Today's behavior — silently overwrite
schema files, then on inner failure delete them via best-effort
cleanup — is destructive against an existing graph regardless of
which branch fires.
Correct-by-design fix:
* New `InitOptions { force: bool }` struct (default `force: false`).
* New `Omnigraph::init_with_options(uri, schema, options)`. The
old `Omnigraph::init(uri, schema)` is a thin shortcut that
passes `InitOptions::default()`.
* `init_with_storage` runs a `storage.exists()` preflight on the
three schema URIs BEFORE any parse, write, or coordinator call.
Any hit → typed `OmniError::AlreadyInitialized { uri }`. The
destructive code paths (the `write_text` overwrite and the
best-effort cleanup) are now unreachable in strict mode against
an existing graph.
* `force: true` skips the preflight; existing operators who
actually mean to overwrite opt in explicitly.
* CLI: `omnigraph init --force` maps to `InitOptions { force: true }`.
* HTTP: `OmniError::AlreadyInitialized` maps to 409 via
`ApiError::from_omni`. Not currently HTTP-reachable (POST /graphs
was pulled), but the wiring lands here so a future runtime
create endpoint has one canonical translation.
Closes the "init is destructive against existing state" class.
The regression test added in the previous commit
(`init_on_existing_graph_uri_does_not_destroy_existing_schema`)
turns green: the original schema files now survive a second
init attempt byte-for-byte, and the call errors cleanly with
`AlreadyInitialized`. The four existing
`init_failpoint_after_*_cleans_up_*` tests stay green — strict
mode's preflight passes on a fresh tempdir, and cleanup still
runs as before when a failpoint fires mid-write.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: split PolicyEngine::load into kind-typed loaders
Pre-fix, every caller of `PolicyEngine::load(path, graph_id)`
passed *some* `graph_id` argument — even when the policy was
server-scoped and Cedar's resolution would never touch a Graph
entity. The server-level loader at lib.rs passed the meaningless
sentinel `"server"`. A graph policy file containing a `graph_list`
rule compiled fine; a server policy file containing a `read` rule
compiled fine. Both silently no-op'd at request time because the
engine kind and the rule's resource kind disagreed.
Correct-by-design fix: replace `load` with two kind-typed loaders.
* `PolicyEngine::load_graph(path, graph_id)` — for per-graph
policy files. Rejects any rule whose action `resource_kind()`
is `Server`.
* `PolicyEngine::load_server(path)` — for server-level policy
files. Takes no `graph_id`: server-scoped actions resolve against
the singleton `Omnigraph::Server::"root"` entity, never a Graph.
Rejects any rule whose action `resource_kind()` is `Graph`.
The old `load` is hard-deleted in the same commit because every
in-tree consumer migrates here (no semver promise on the workspace
crate, no external pinners). New `PolicyEngineKind` enum types
the loader's intent; `validate_kind_alignment` is the load-time
check that closes the "wrong action, wrong file, silent no-op"
class — operators get a load-time error instead of confused-and-
silent behavior at request time.
Callsites migrated:
* server lib.rs:374 (single-mode per-graph) → load_graph
* server lib.rs:1065 (multi-mode server) → load_server
* server lib.rs:1103 (multi-mode per-graph) → load_graph
* CLI main.rs:732 (resolve_policy_engine) → load_graph
* tests/server.rs ×5 (4 graph, 1 server) → load_graph/load_server
* policy_engine_chassis.rs → load_graph
Four new in-source tests pin the contract: both rejection paths
and both positive paths.
Closes the "operator puts an action in the wrong file and the
rule silently never matches" class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: introduce GraphRouting, retire single_mode_handle
Pre-fix, `AppState` always carried `Arc<GraphRegistry>` even when
serving one graph. Single mode populated the registry with one
handle keyed by the `SINGLE_GRAPH_KEY_ID = "default"` sentinel;
`single_mode_handle` walked the registry, asserted `len == 1`,
and returned the single element with a 500-class "programmer
error" branch on mismatch. Three smells in a row — magic key,
walk-and-assert, programmer-error guard — all because the
single-mode runtime was forced through a multi-mode abstraction.
Correct-by-design fix: type the routing.
* New `pub enum GraphRouting { Single { handle }, Multi { registry,
config_path } }` on `AppState`. The `Single` arm carries the handle
directly — no registry, no key, no walk.
* `resolve_graph_handle` middleware matches on `routing`. Single mode
returns the handle in O(1); multi mode does the same path-extract +
registry lookup as before. The 500-class programmer-error branch
is gone — the type system now makes the violated invariant
("single mode has exactly one handle") unrepresentable.
* `requires_bearer_auth` reads `handle.policy.is_some()` directly
in the Single arm; Multi arm still uses the cached
`any_per_graph_policy` flag.
`ServerMode` and the legacy `registry` field on `AppState` are still
populated for now — C-3 removes both once every reader is migrated.
The `SINGLE_GRAPH_KEY_ID` sentinel and `ServerMode` will also go
away in C-3.
Closes the "single mode forced through a multi-mode abstraction"
class. All 76 server integration tests stay green: handlers still
extract `Extension<Arc<GraphHandle>>` from the request, so the
middleware's internal change is invisible to them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: remove ServerMode, registry field, and the SINGLE_GRAPH sentinel
C-1/C-2 introduced `GraphRouting` and pointed the middleware at it.
This commit removes the legacy shape that's now dead:
* `ServerMode` enum — deleted. Single mode's `uri` lives on
`handle.uri`; multi mode's `config_path` lives on the
`GraphRouting::Multi` arm.
* `AppState.mode: ServerMode` field — deleted.
* `AppState.registry: Arc<GraphRegistry>` field — deleted. Multi
mode's registry is on `GraphRouting::Multi { registry, .. }`;
single mode has no registry at all.
* `AppState::mode()`, `AppState::uri()`, `AppState::registry()`
accessors — deleted. New `AppState::routing() -> &GraphRouting`
is the single public entry point.
* `SINGLE_GRAPH_KEY_ID` constant — deleted. `GraphHandle.key` is
still required by the struct, but in single mode the key is now
only a tracing label (`"default"`, inlined with a comment naming
its sole remaining purpose). Single-mode flat routes never carry
a `{graph_id}` parameter, so the key is never compared against
user input, and there is no registry where it could be a map
key. C-1/C-2 already removed the registry walk that the sentinel
was named for.
Callers migrated:
* `build_app` (lib.rs:944) — matches on `state.routing()` instead
of `state.mode()`.
* `server_graphs_list` (lib.rs:1162) — destructures the Multi arm
to get the registry; Single arm short-circuits to 405.
* `server_openapi` (lib.rs:1217) — matches the Multi arm for the
cluster-prefix rewrite.
* `tests/server.rs:3735` — the B2 footgun regression test now
matches on `state.routing()` to extract the single-mode handle
(the test's earlier `state.registry().list().next()` shape was
the closest pre-fix analog to "embedded consumer reaches the
engine"; the new shape is more direct).
Closes the entire "single mode forced through a multi-mode
abstraction" class. After this commit:
* No magic sentinel as a routing key.
* No `single_mode_handle` walk-and-assert helper.
* No 500-class "programmer error" branch in the middleware.
* No two-field discriminant on `AppState` where one would do.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: regression test for nested-route path extraction (red)
`server_branch_delete` and `server_commit_show` use bare
`Path<String>` extractors. In single-mode flat routes
(`/branches/{branch}`, `/commits/{commit_id}`) this works — one
capture, one value. In multi-graph cluster routes
(`/graphs/{graph_id}/branches/{branch}`,
`/graphs/{graph_id}/commits/{commit_id}`) axum 0.8 propagates the
outer `{graph_id}` capture into the inner handler, so the
extractor sees two captures and 500s with
"Wrong number of path arguments. Expected 1 but got 2."
`cluster_routes_dispatch_per_graph_handle` only exercises
`/snapshot` (no Path extractor), so the regression slipped through.
This test closes that gap structurally: every cluster route with
an inner path param gets exercised here.
Currently fails with the exact symptom above. Fix in the next
commit makes it pass.
Per AGENTS.md rule 12, the red test commit lands just before the
fix so the pair is visible in `git log` and a reviewer can check
out this commit alone to reproduce.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: named-field path-param structs for nested cluster routes (green)
`Path<String>` deserializes one path-param value positionally.
Single-mode flat routes (`/branches/{branch}`,
`/commits/{commit_id}`) have one capture; multi-mode nested routes
(`/graphs/{graph_id}/branches/{branch}`,
`/graphs/{graph_id}/commits/{commit_id}`) have two — axum 0.8
propagates the outer capture into nested handlers. Same handler,
two different shapes; the multi-mode shape 500s with
"Wrong number of path arguments. Expected 1 but got 2."
Symptomatic fix: change to `Path<(String, String)>` and ignore the
first element. Breaks again the moment we add another nest layer
(e.g. tenant in Cloud mode).
Correct-by-design fix: named-field structs deserialized by name
from axum's path-param map. Each handler picks only the fields it
needs. Stable across single / multi / future-cloud nest depths
because deserialization is by field name, not position.
* New `BranchPath { branch: String }` (file-local to lib.rs)
* New `CommitPath { commit_id: String }`
* `server_branch_delete` extractor → `Path<BranchPath>`
* `server_commit_show` extractor → `Path<CommitPath>`
Closes the "handler path-extractor type is positional and breaks
when route nesting changes" class. Red test from the previous
commit turns green. All 77 server tests pass (single-mode branch
delete + commit show, plus new multi-mode coverage).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: centralize policy-requires-tokens check in the runtime classifier
Single-mode `open_with_bearer_tokens_and_policy` bailed at lib.rs:380
when policy was installed and no tokens. Multi-mode
`open_multi_graph_state` had no equivalent: the server started, every
request 401'd because no token could ever match, and the operator
spent time debugging a misconfiguration the single-mode path would
have caught at startup.
The doc/code contradiction made the gap easy to miss: the
`ServerRuntimeState::PolicyEnabled` docstring said tokens-or-not
was "unusual but valid — every request fails 401 without a bearer,
which is effectively 'locked'." The single-mode bail contradicted
that. In practice, silent-401-on-every-request is bug-shaped, not
feature-shaped (operators wanting deny-all should configure tokens
plus a deny-all Cedar rule to get meaningful 403s with
policy-decision logging).
Symptomatic fix: add a copy of the bail to multi-mode. Two copies
that can drift again the next time a startup path is added.
Correct-by-design fix: hoist the check into
`classify_server_runtime_state` so both modes get the same
enforcement from one source of truth. The classifier becomes the
single source of truth for "should we start?" and adding a startup
invariant there is now the natural extension point for any future
mode.
Classifier matrix is now complete:
| has_tokens | has_policy | allow_unauthenticated | Result |
|---|---|---|---|
| F | F | F | bail (existing) |
| F | F | T | Open (existing) |
| T | F | * | DefaultDeny (existing) |
| F | T | * | bail (NEW — closes the gap) |
| T | T | * | PolicyEnabled (existing) |
Changes:
* `classify_server_runtime_state` (lib.rs:870-890) gains the
`(false, true, _) => bail!(…)` arm with a clear message naming
the failure mode and the two valid resolutions.
* `open_with_bearer_tokens_and_policy` (lib.rs:369+) drops its
redundant local bail — the classifier rejected the invalid case
before construction was reached.
* `ServerRuntimeState::PolicyEnabled` docstring is rewritten:
drops the "(unusual but valid)" carve-out and states plainly
that PolicyEnabled requires tokens. Names the explicit
alternative (tokens + deny-all Cedar rule) for operators who
want the all-requests-denied behavior.
* `classify_policy_enabled_always_wins` test is renamed to
`classify_policy_enabled_requires_tokens` and the now-invalid
`(false, true, _)` assertion is removed (covered by the new
rejection test).
* New `classify_policy_without_tokens_is_rejected` test covers the
new arm.
* New `serve_refuses_to_start_with_policy_but_no_tokens_multi_mode`
integration test pins the multi-mode propagation path —
symmetric with the existing single-mode
`serve_refuses_to_start_in_state_1_without_unauthenticated`.
Closes the "single mode and multi mode startup branches can drift
on safety invariants" class.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: close coverage gaps surfaced by the test-coverage audit
The bot-review pass and the subsequent coverage audit surfaced two
material gaps in PR #119's test surface — both easy to close, both
worth closing before merge.
* **Gap 1 — cluster-route sweep.** The Bug-1 path-extractor
regression slipped through because
`cluster_routes_dispatch_per_graph_handle` only exercised
`/snapshot`. The other six protected cluster routes (`/read`,
`/change`, `/export`, `/schema`, `/schema/apply`, `/ingest`,
`/branches/merge`) were implicitly trusted to work without any
multi-mode integration test.
Add `all_protected_cluster_routes_resolve_to_their_handler`
(`tests/server.rs`) that hits each protected cluster route with
a minimal request and asserts the response is consistent with
the handler being reached — no 404 (router didn't match), no 500
with "Wrong number of path arguments" (Bug-1 class), no 500 with
"missing extension" (routing middleware didn't inject the handle).
Status code is a negative assertion because each handler's
happy-path inputs differ; what matters is "the request reached
the handler," not "the handler returned 200" — that's already
pinned by the single-mode tests.
* **Gap 2 — `--force` happy path.** The strict re-init regression
test (`init_on_existing_graph_uri_does_not_destroy_existing_schema`)
pins the error path; nothing pinned the `force: true` escape
hatch actually doing what its docstring claims.
Add `init_with_force_recovers_from_orphan_schema_files`
(`tests/lifecycle.rs`). Writes a bare `_schema.pg` to simulate
orphan files from a failed prior init, confirms strict mode
bails as expected, then confirms `init_with_options(force: true)`
succeeds and produces a functional graph.
Note: the test follows the documented semantics — force skips
the preflight only, it does NOT purge existing Lance state. An
earlier draft of the test (against full overwrite of an existing
populated graph) failed because `GraphCoordinator::init` errored
on the existing `__manifest`, which is exactly the limitation
the `InitOptions::force` docstring already calls out. Recursive
purge needs `StorageAdapter::delete_prefix` (tracked separately).
Coverage is now fully aligned with the PR's claims.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* mr-668: regression test for GraphList open-mode bypass (red)
Cursor bot's review at commit
|
||
|
|
a275306a15
|
policy: CLI policy injection — local writes go through engine enforce (MR-722) (#104)
Closes the CLI side of the policy chassis fan-out. Before this commit, CLI direct-engine writes bypassed Cedar entirely because the CLI never called `Omnigraph::with_policy(...)` for non-`policy validate|test|explain` subcommands. After this commit, every CLI direct-engine writer (change, load, ingest, branch create/delete/merge, schema apply) opens the engine via a new `open_local_db_with_policy(uri, &config)` helper that installs the configured `PolicyEngine` when `policy.file` is set, and threads the resolved actor through to the `_as` writer methods. Actor identity resolution: - New top-level `--as <ACTOR>` global flag on the CLI overrides config. - New `cli.actor` field in `omnigraph.yaml` provides a default actor. - Precedence: `--as` > `cli.actor` > None. - When policy is configured and neither is set, the engine-layer footgun guard fires and the write is denied — silent bypass via "I forgot the actor" is exactly what the guard prevents. - Remote HTTP writes ignore both — bearer-token-resolved server-side. Helpers added in main.rs: - `open_local_db_with_policy(uri, &config) -> Result<Omnigraph>` — opens the DB and installs the PolicyEngine when configured. Without policy this is identical to a bare `Omnigraph::open`. - `resolve_cli_actor(cli_as, &config) -> Option<&str>` — implements the flag > config > None precedence. Engine: added `load_file_as` to the loader as the actor-aware mirror of `load_file`, so CLI file-path loads flow through the same enforce gate as in-memory `load_as` calls. Test rewrite: `local_cli_policy_tooling_is_end_to_end_while_local_writes_stay_unenforced` was the explicit assertion of the pre-chassis hole. Renamed and split: - `local_cli_policy_tooling_is_end_to_end` — sanity for the read-only policy CLI surfaces (validate/test/explain), unchanged behavior. - `local_cli_change_enforces_engine_layer_policy` — the new assertion: policy installed + no actor → footgun-guard denial; `--as act-bruno` on protected main → Cedar denial; `--as act-ragnor` (admins-write rule) on main → permit, write committed. POLICY_E2E_YAML gains an `admins-write` rule so the permit case has a non-trivial actor to exercise. docs/user/policy.md updated with `cli.actor` + `--as <ACTOR>` usage. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
da42beec41
|
policy: chassis fan-out — _as variants on the remaining 6 writers (MR-722) (#103)
PR #102 wired apply_schema_as. This PR completes the chassis-side coverage so every public mutating engine entry point hits the same Omnigraph::enforce(action, scope, actor) gate regardless of transport: - mutate_as → enforce(Change, Branch(branch), actor) - load_as → enforce(Change, Branch(branch), actor) - ingest_as → enforce(Change, Branch(branch), actor); also threads actor through the implicit branch_create_from_as so fresh-branch ingest correctly hits BranchCreate too - branch_create_as → enforce(BranchCreate, TargetBranch(name), actor) - branch_create_from_as → enforce(BranchCreate, BranchTransition { source, target }, actor) - branch_delete_as → enforce(BranchDelete, TargetBranch(name), actor) - branch_merge_as → enforce(BranchMerge, BranchTransition { source, target }, actor) Three new _as variants for branch ops (create, create_from, delete) that had no actor surface before; existing actor-less variants delegate with actor=None so the no-policy path is a strict no-op. HTTP handlers updated to thread the resolved actor into the new _as variants for branch_create and branch_delete (was previously dropped). 14 new SDK chassis tests (one allow + one deny pair per wired writer); the existing 4 apply_schema_as tests stay. All 18 pass. docs/user/policy.md updated to describe engine-wide enforcement and the coarse-vs-fine layer split (engine = action gate, query layer per-row = MR-725 future). AGENTS.md capability matrix updated to match. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
4ca527cc53
|
staging: op-kind-aware drift check at commit_all entry
Closes the bug class "Lance internal conflict surfaces as 500 instead of 409" for in-process concurrent strict-op writers on the same row. Pre-fix: in `MutationStaging::commit_all`, after queue acquisition, the staged Lance transaction (built against V0) was handed straight to `commit_staged`. When Lance HEAD has advanced past V0 (because the queue's prior winner already published), Lance's transaction conflict resolver fires `RetryableCommitConflict` for Update vs Update on the same row, which wraps as `OmniError::Lance(<string>)` and the API maps it to HTTP 500. Users see "internal server error" instead of a clean retryable conflict. Fix: track the strictest `MutationOpKind` per touched table on `MutationStaging` and propagate through `StagedMutation`. In `commit_all`'s recapture loop, before each `commit_staged`, fail-fast with `OmniError::manifest_expected_version_mismatch` (→ HTTP 409 ExpectedVersionMismatch) for tables whose tracked op_kind has `strict_pre_stage_version_check() == true` (Update/Delete/SchemaRewrite) when the staged dataset's version doesn't match the fresh manifest pin under the queue. Insert/Merge tables skip the check — concurrent inserts on disjoint keys legitimately coexist via Lance's auto-rebase, so the check would over-reject the existing same-key insert path. Threading: `ensure_path` now takes `op_kind` and stores it in a new `op_kinds: HashMap<String, MutationOpKind>` on `MutationStaging`, with strictness-upgrade semantics so mixed insert+update on the same table still fires the strict check at commit time. `StagedMutation` carries `op_kinds` through to `commit_all`. Pinned by `change_concurrent_updates_same_key_serialize_via_publisher_cas` in `crates/omnigraph-server/tests/server.rs` (added in the previous commit). All Phase 2 sentinels still pass: change_concurrent_inserts_same_key_serialize_without_409, change_conflict_returns_manifest_conflict_409, branch_merge_conflict_response_includes_structured_conflicts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
f925ad1739
|
mr-686: Phase 2 — op-kind-aware version check + coord Mutex → RwLock
Fix A: op-kind-aware ensure_expected_version. Insert/Merge skip the strict pre-stage check; Update/Delete/SchemaRewrite keep it. New MutationOpKind enum threaded through open_for_mutation_on_branch / open_owned_dataset_for_branch_write / reopen_for_mutation and all callers (execute_insert/update/delete_node/delete_edge, branch_merge::publish_rewritten_merge_table, schema_apply, ensure_indices_for_branch, loader Append/Merge/Overwrite). Closes the 77% rejection rate on same-key concurrent inserts. Fix B: coordinator Mutex -> RwLock. Reads parallelize via .read(); writes serialize via .write(). Atomic-commit invariant preserved by the single .write() covering commit_manifest_updates + record_graph_commit. Bench-as-test change_concurrent_inserts_same_key_serialize_without_409 (server.rs:2180) spawns 12 concurrent /change inserts on a single (table, branch); asserts every request returns 200. Was failing pre-Phase-2; passes post-Phase-2. change_conflict_returns_manifest_conflict_409 (cross-process drift sentinel) and branch_merge_conflict_response_includes_structured_conflicts both still pass. Bench (after-pr2-phase2): - single-actor 1x1: 14.9 ops/s, p50 68ms (baseline 12.3, +22%) - disjoint 8x8: 7.04 ops/s, p50 1023ms (baseline 6.24, +13%) - same-key 8x1: 2.62 ops/s, 0 errors (after-pr2: 77% errors) Disjoint stayed at +13% — Fix B's RwLock helped read paths but the publisher's .write() critical section still serializes graph-wide. Splitting GraphCoordinator into per-concern primitives (manifest in ArcSwap, commit_graph in RwLock, atomic-commit serializer) is the deferred next step. 102 lib + 30 branching + 24 runs + 16 staged_writes + 63 end_to_end + 40 server tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
d08c42c369
|
engine: convert write APIs from &mut self to &self (PR 2 Step C)
The interior-mutability primitives from Step B (catalog ArcSwap,
schema_source ArcSwap, coordinator Mutex, and RuntimeCache's existing
internal locking) make every Omnigraph engine write API safe to expose
under &self. This commit flips the public surface so the HTTP server
can hold Arc<Omnigraph> in PR 2 Step F instead of Arc<RwLock<Omnigraph>>.
Public API conversions:
- mutate, mutate_as
- ingest, ingest_as, ingest_file, ingest_file_as
- load, load_as, load_file
- branch_merge, branch_merge_as
- apply_schema
- ensure_indices, ensure_indices_on
- optimize
Inner functions converted in lockstep (their signatures must match the
new caller shape):
- mutate_with_current_actor, ingest_with_current_actor,
load_direct_on_branch
- execute_named_mutation, execute_insert, execute_update,
execute_delete, execute_delete_node, execute_delete_edge
- branch_merge_impl, branch_merge_on_current_target
- load_jsonl_reader
- schema_apply::{apply_schema, apply_schema_with_lock,
acquire_schema_apply_lock, release_schema_apply_lock,
ensure_schema_apply_idle}
- table_ops::{ensure_indices, ensure_indices_on,
ensure_indices_for_branch, commit_prepared_updates,
commit_prepared_updates_with_expected,
commit_prepared_updates_on_branch,
commit_prepared_updates_on_branch_with_expected,
commit_manifest_updates, record_merge_commit,
ensure_commit_graph_initialized, commit_updates_on_branch_with_expected}
- optimize::optimize_all_tables
- Omnigraph::commit_manifest_updates, record_merge_commit,
commit_updates_on_branch_with_expected, ensure_commit_graph_initialized
The conversion is mechanical: callers that previously took `db: &mut
Omnigraph` now take `db: &Omnigraph`; every interior mutation goes
through the existing locks (coordinator.lock().await, store_catalog,
runtime_cache.invalidate_all). No new locks acquired, no new lock-order
hazards introduced.
102 lib tests + 24 runs + 30 branching + 63 end_to_end + 39 server
tests pass. Workspace compiles clean (1 warning on a now-redundant `mut`
binding in CLI; cleaned up in a follow-up). The remaining work in PR 2
is the AppState flip (Arc<RwLock<Omnigraph>> -> Arc<Omnigraph> +
WorkloadController), the revalidation perf optimization in commit_all,
and the WorkloadController itself.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
011f9b9610
|
engine: wrap coordinator in tokio Mutex (PR 2 Step B continued)
Wraps the GraphCoordinator field in `Arc<tokio::sync::Mutex<...>>` so engine APIs can move from `&mut self` to `&self` without giving up the coordinator's mutating refresh path. Lock acquisition order: always before runtime_cache (when both are needed in one scope). Critical sections stay short — load+clone for snapshot/version/current_branch, single-method delegations elsewhere. Public API changes: - `Omnigraph::version()` and `Omnigraph::snapshot()` (pub(crate)) become async; callers add `.await`. - `Omnigraph::active_branch()` returns `Option<String>` (cloned) instead of `Option<&str>` borrowed from the coordinator. Callers either `.await` the result + use `.as_deref()`, or hoist into a binding. `&self`-converted methods this round (tied to the coordinator wrap, not the Step C surface refactor): - `swap_coordinator_for_branch` - `restore_coordinator` (now async; was sync) - `sync_branch` - `refresh` - `refresh_coordinator_only` - `reload_schema_if_source_changed` - `branch_create`, `branch_create_from`, `branch_delete`, `branch_list` - `delete_branch_storage_only` - `ensure_branch_delete_safe` - `ensure_schema_apply_idle` - `ensure_schema_apply_idle` helper in schema_apply.rs (matches signature) Caller updates: branch_create_from_impl threads `restore_coordinator`'s new async signature; schema_apply, table_ops, exec/merge wrap every direct `db.coordinator.X()` in `db.coordinator.lock().await.X()`; exec/merge hoists `active_branch_for_keys` once outside the per-table closure that builds queue keys + sidecar pins. All 102 lib tests + 30 branching + 24 runs + 10 lifecycle + 16 staged_writes + 63 end_to_end pass workspace-wide. Zero test regressions; the only behavior change is on the `Omnigraph` API surface (sync -> async on the three accessors above). Step C (engine API conversion: apply_schema, mutate_as, ingest_as, branch_merge_as &mut self -> &self) follows in a subsequent commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
fcb47620d3
|
mr-686: bundle PR 0/1a/1b foundation + PR 2 catalog/schema_source ArcSwap
Bundles the working-tree state from the prior session (PR 0 bench harness,
PR 1a audit_actor_id removal, PR 1b WriteQueueManager + writer integration)
together with the first half of PR 2's interior-mutability foundation
(catalog and schema_source wrapped in Arc<ArcSwap<...>>). The two streams
intermix in 7 of the same files, so splitting via git add -p was
impractical. Subsequent PR 2 steps land as separate atomic commits.
PR 0 — server-level concurrent /change bench harness
- crates/omnigraph-server/examples/bench_concurrent_http.rs (new)
- .context/bench-results/{baseline-main,after-pr1}/ (gitignored)
PR 1a — drop the audit_actor_id field, thread per-call
- removed Omnigraph::audit_actor_id and the swap-restore patterns in
mutation.rs, merge.rs, loader/mod.rs
- actor_id: Option<&str> threaded through MutationStaging::finalize,
mutate_with_current_actor, ingest_with_current_actor,
branch_merge_impl, branch_merge_on_current_target,
commit_prepared_updates*, record_merge_commit,
commit_updates_on_branch_with_expected
- apply_schema and ensure_indices_for_branch pass None (system-attributed)
PR 1b — per-(table_key, branch) write queue + revalidation + sidecar
- new crates/omnigraph/src/db/write_queue.rs with WriteQueueManager,
acquire/acquire_many, sorted+deduped acquisition; 6 unit tests
- Arc<WriteQueueManager> field on Omnigraph + db.write_queue() accessor
- MutationStaging::finalize split into stage_all (Phase A, no queue)
and StagedMutation::commit_all (Phase B, acquire_many + revalidate
pins + sidecar + commit_staged); guards held across publisher
- delete-only mutations now emit recovery sidecars; revalidation
extended to inline_committed tables
- branch_merge_on_current_target, apply_schema_with_lock, and
ensure_indices_for_branch acquire per-table queues for their
touched tables
PR 2 Step B (partial) — catalog and schema_source via ArcSwap
- catalog: Catalog -> Arc<ArcSwap<Catalog>>
- schema_source: String -> Arc<ArcSwap<String>>
- public accessors return Arc<Catalog> / Arc<String>; readers bind
locally where the borrow has to outlive an expression
- new pub(crate) store_catalog / store_schema_source helpers replace
the field assignments in apply_schema and reload_schema_if_source_changed
- 117 tests across lifecycle/end_to_end/branching/runs pass; engine
lib + workspace compile clean
Coordinator wrap (Mutex) and the &mut self -> &self engine API
conversion follow in subsequent commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
815ff743f5
|
recovery: refresh-time roll-forward closes the in-process residual + invariants helper
Bundle of three correctness fixes plus a shared invariants helper that
existing tests now use.
1. SchemaApply atomicity: close the residual gap where a sidecar exists
but staging files don't (e.g., Phase B failure BEFORE
`_schema.pg.staging` write). `recover_schema_state_files` now returns
a `SchemaStateRecovery` discriminator (`Noop` /
`CleanedStaging` / `CompletedStagingRename { schema_apply_sidecar }`);
the token threads through `recover_manifest_drift` →
`process_sidecar`. SchemaApply sidecars are eligible for roll-forward
ONLY when the staging rename completed in the same recovery pass.
Full mode rolls back; RollForwardOnly defers. Without this, recovery
would publish the manifest pin against new-schema data while
`_schema.pg` stayed old (real corruption). New failpoint
`schema_apply.before_staging_write` + new test
`schema_apply_without_schema_staging_rolls_back_on_next_open` pin
the gating.
2. Rollback target correction. Rollback now restores Lance HEAD to the
current manifest pin (`state.manifest_pinned`) instead of the
sidecar's `expected_version`. For UnexpectedAtP1/UnexpectedMultistep
classifications these can differ; the old code could regress Lance
HEAD past the manifest pin, re-introducing drift in the OTHER
direction. The new behavior establishes `Lance HEAD == manifest pin`
post-rollback — the canonical drift-free invariant. Param renamed
from `expected_version` → `target_version` to match. Audit
`to_version` records the actual restore target.
This is a latent-behavior change. Any external consumer that compared
`audit.to_version` against `sidecar.expected_version` for non-trivial
classifications now sees the manifest pin instead.
3. Audit commit-graph unification. `record_audit` now opens the
per-branch commit graph for ANY sidecar with `sidecar.branch.is_some()`
— not just BranchMerge. Plain Mutation/Load/EnsureIndices commits on a
feature branch now correctly land on that branch's commit graph,
instead of main's. Closes the class of bug analogous to D2 but for
non-merge writers.
Pre-existing repos with non-main commits already on main's commit
graph stay where they are; future recoveries write to the per-branch
ref. Mixed-version compatibility is asymmetric but safe (old binaries
ignore per-branch refs they don't know about; new binaries read both).
4. Recovery invariants helper + branch-axis cells. New
`tests/helpers/recovery.rs` (~505 LOC) exports
`assert_post_recovery_invariants(repo, op_id, RecoveryExpectation)`
plus a `TableExpectation` builder. Six existing recovery tests
refactored to call it; per-test bespoke assertions replaced. Two new
branch-axis cells added in `tests/failpoints.rs`:
- `recovery_rolls_forward_load_on_feature_branch`
- `recovery_rolls_forward_ensure_indices_on_feature_branch`
The loader gains a `mutation.post_finalize_pre_publisher` failpoint
hook (gated on the `failpoints` feature; zero-cost in release) so the
load test can pin the same Phase B → Phase C boundary the mutation
path uses.
Misc:
- `Omnigraph::refresh` extracts `reload_schema_if_source_changed`:
early-return when schema source unchanged (saves IR parse + catalog
rebuild on the steady-state refresh path).
- New test injection point
`failpoint_publish_table_head_without_index_rebuild_for_test`
under `#[cfg(feature = "failpoints")]`.
Tests: 31 recovery + failpoint integration tests pass (14 + 17, up from
14 + 16). Full workspace sweep with `--features failpoints` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
05e52f2ee0
|
recovery: rename composite test, strip ticket references, address review
Three bundled changes:
1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
the test function). OmniGraph is consumed by both humans and agents
- naming the test after one audience misframes the library.
2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
review-round labels from source, tests, and docs added by this
branch. Internal traceability belongs in commit messages and PR
descriptions, not in checked-in artifacts. Upstream
lance-format/lance issue refs and pre-existing MR-XXX refs in docs
not touched by this branch are left alone.
3. Two outstanding review findings addressed:
- `needs_index_work_node` / `needs_index_work_edge`: propagate
`count_rows` errors instead of `unwrap_or(0)`. Silently treating
transient I/O failures as "0 rows" risked skipping a table from
the recovery sidecar pin set that was actually about to be
modified.
- `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
strengthen the assertion to fail when sidecar B classifies under
a stale snapshot. The new assertion checks post-recovery Lance
HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
deleted + audit rows present" pair passed in both the bug and
fix paths because both delete the sidecar and write an audit
row; the differentiator is the post-recovery HEAD. Strengthening
the assertion exposed an additional nuance: in this overlapping-
sidecar scenario sidecar B's audit kind is RolledBack (no-op)
rather than RolledForward, since sidecar A's roll-forward
publishes Lance HEAD as the new manifest pin (absorbing B's
work). The docstring now explains why this is correct given
current `roll_forward_all` semantics.
All workspace tests pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
164bafbbe7
|
recovery: address PR #72 review findings
Bot reviewers (cubic, cursor, chatgpt-codex) caught 4 merge-blocking bugs + 3 strongly-recommended fixes + 3 doc errors in the initial PR. Each fix has a paired test demonstrating the bug before the fix. Merge-blocking fixes: - BranchMerge moved to loose-match classifier arm. publish_rewritten_ merge_table runs multiple commit_staged calls per table (merge_insert + delete_where + index rebuilds). Strict classification rolled back valid completed Phase B work as UnexpectedMultistep. Three new unit tests pin the loose-match behavior for BranchMerge. - branch_merge sidecar uses self.active_branch() (the resolved target branch) instead of inferring from the first sorted table key. The previous heuristic could record None (== main) when the merge target was a non-main branch, causing recovery to publish to the wrong manifest namespace. - Best-effort sidecar delete in all 5 writer sites (mutation, loader, schema_apply, branch_merge, ensure_indices). Previously, a sidecar cleanup failure after a successful manifest publish would error out the user's call for a write that already landed. Now: log a warning and ignore — the next open's recovery sweep tidies the stale sidecar via NoMovement classification. - ensure_indices sidecar scoped to tables that need work via new helpers needs_index_work_node / needs_index_work_edge. Previously the sidecar pinned every catalog table; if only one needed indexing, the others classified as NoMovement and the all-or-nothing decision rolled back legitimate index work. Strongly-recommended fixes: - recover_manifest_drift now takes &mut GraphCoordinator and refreshes between sidecars. Sidecar B's classification needs to see sidecar A's manifest changes, otherwise B can be classified against stale pins and incorrectly roll back work that just landed. - list_sidecars sorts URIs before reading. Sidecar filenames are ULIDs (chronologically sortable), so this gives deterministic, time-ordered processing. Filesystem-order was nondeterministic. - ReadOnly opens skip recover_schema_state_files too (was: only the MR-847 sweep was gated). Read-only consumers may run with read-only credentials; silent open-time mutations violate the contract. Doc cleanups: - Removed stale "Phase 4 placeholder" comment from recover_manifest_drift. - docs/runs.md decision-tree wording now correctly surfaces the InvariantViolation abort path. - docs/branches-commits.md clarifies actor_id is in _graph_commit_actors.lance (joined by graph_commit_id), not on _graph_commits.lance itself. Test surface (post-fixes): - 25 unit tests in db::manifest::recovery (+4 from this commit). - 10 integration tests in tests/recovery.rs (+3 from this commit). - ~672 tests across ~25 binaries pass with --features failpoints. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
49ca7e5068
|
recovery: wire sidecar into MutationStaging::finalize + flip headline test (Phase 5)
Production wiring (~120 LOC):
- `MutationStaging::finalize` now takes a `SidecarKind` parameter and
returns an additional `Option<RecoverySidecarHandle>`. Builds a
Vec<SidecarTablePin> from `pending` BEFORE the per-table commit_staged
loop and writes the sidecar via `recovery::write_sidecar`. Skips the
sidecar when `pending` is empty (delete-only mutation; D₂ keeps these
out of the staged-write path so the option is just a clean signal,
not a code path users hit).
- `exec/mutation.rs::execute_mutation_as` (around line 740): destructure
the new third element, pass `SidecarKind::Mutation`, delete the
sidecar after `commit_updates_on_branch_with_expected` succeeds.
- `loader/mod.rs::ingest_loaded` (around line 540): same shape, with
`SidecarKind::Load`. The Overwrite path stays inline-commit (legacy
residual; out of MR-847 scope per docs/runs.md).
- New engine accessors `Omnigraph::storage_adapter()` and
`Omnigraph::root_uri()` for the sidecar I/O. The pre-existing
`db.storage` field stays private; no other engine code reaches around
the accessor.
- Re-exports from `db::manifest`: `new_sidecar`, `write_sidecar`,
`delete_sidecar`, plus the `RecoverySidecar*` types and `SidecarKind`,
so consumers in `exec/` can use them via `crate::db::manifest::...`.
Bugfix folded in (~5 LOC): make `coordinator` mutable in
`Omnigraph::open_with_storage_and_mode` and call `coordinator.refresh()`
after the recovery sweep returns. Roll-forward advances the manifest
pin on disk; without the refresh the returned engine carried a stale
in-memory snapshot. The Phase 4 tests passed only because they
opened Lance datasets directly rather than going through `db.snapshot()`.
Storage adapter (~15 LOC): `LocalStorageAdapter::write_text` now ensures
the parent directory exists via `tokio::fs::create_dir_all`. Required
because the sidecar protocol writes into `__recovery/` which doesn't
pre-exist after `Omnigraph::init`. S3 has no equivalent; PutObject is
path-agnostic.
Headline test flip (~150 LOC):
- `tests/failpoints.rs::finalize_publisher_residual_drifts_lance_head_until_next_writer_recovers`
is replaced by `recovery_rolls_forward_after_finalize_publisher_failure`.
Same setup (failpoint at `mutation.post_finalize_pre_publisher`) but
after the synthetic failure the test:
1. Asserts the sidecar persists in `__recovery/` for the recovery
sweep to find.
2. Drops the engine handle.
3. Reopens via `Omnigraph::open` — recovery sweep classifies
RolledPastExpected, decides RollForward, publishes the manifest
update, records the audit row, deletes the sidecar.
4. Asserts the sidecar is gone.
5. Asserts the originally-attempted Eve insert is now visible
(Person count = 1).
6. Asserts a subsequent insert succeeds without
ExpectedVersionMismatch (Person count = 2).
7. Asserts the audit dataset `_graph_commit_recoveries.lance` exists.
This is the headline contract the MR-847 acceptance criteria require.
All other failpoint and runs tests continue to pass (8 + 24 unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
044ed46019
|
chore: scrub Linear ticket numbers and review-bot mentions from code comments
OmniGraph is OSS; internal Linear ticket references and code-review-bot mentions in source-code comments don't help external readers and leak internal tooling. Replace ticket numbers (MR-XXX) with descriptive prose, drop linear.app URLs, and remove inline mentions of Cursor/Bugbot/Cubic/Codex review threads. Scope is limited to source-code comments (`crates/`). Docs under `docs/` keep their MR-XXX references — those are part of the established change-history narrative for in-repo docs and don't require a Linear account to find context for. No behavior changes; no public API changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
ea16c74329
|
MR-794 step 2: address Cursor Bugbot follow-ups on commits 3223b51 + 052b6e6
Four code/doc fixes from the latest Cursor Bugbot pass: * **Misplaced doc comment in table_store.rs (Medium):** the doc block intended for `scan_pending_batches` was, after my earlier edit, attached to `collect_string_column_values` because the new helper was inserted between the original docblock and `scan_pending_batches`. Move the docblock back onto its function and add a note about the shared SQL-dialect contract with the Lance scanner (the predicate goes to both, which is fine for `predicate_to_sql`'s plain comparison shapes today; future Lance-specific scanner extensions in the filter would need translation). * **Missing null check on committed `id` column (Low):** the committed-side loop in `collect_node_ids_with_pending` (and the parallel non-pending `collect_node_ids`) read `id_col.value(i)` without `is_valid(i)` first. `id` is the @key column on every node type and non-nullable by schema, so this is unreachable today, but the inconsistency with the pending-side `is_valid` guard is worth closing for symmetry / defense. * **Misleading comment in count_pending_src_with_dedupe (Low):** the comment claimed "fall back to naive counting" but the code did `continue`. Fix: it's unreachable in practice (the pending-side schema always contains the key when the caller passes one), so failing loudly with a typed error if it ever does fire is correct — silently skipping the batch would let `@card` violations slip past validation. * **PendingTable.schema mismatch surfaces too late (Medium):** PendingTable captures the schema from the first batch and never updates it. On a blob-bearing table, `insert` produces a full-schema batch and `update` (without assigning every blob) produces a subset-schema batch. Pre-fix the mismatch surfaced inside finalize/MemTable construction — distant from the offending op. Post-fix `MutationStaging::append_batch` validates the new batch's schema against the existing accumulator's schema and returns a typed error directing the caller to split the mutation. Error fires at the offending op, not at end-of-query. New helper `schemas_compatible` compares field name + data_type pairs; nullability and field metadata differences stay tolerated (downstream concat already permits those). Cubic Cursor Bugbot finding #5 (cascade delete edge re-open) self-resolved in the bot's own analysis ("logic appears sound on re-examination") — no action. New test on tests/runs.rs: * append_batch_rejects_mismatched_schema_in_blob_table_at_offending_op — pins the early-error path. Builds a blob-bearing schema, runs an `insert + update` query where the update doesn't assign the blob, asserts the error fires at the second op with the "Split the mutation" message and the manifest is unchanged. Local: tests/runs.rs 24/24 passed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
3223b51cf1
|
MR-794 step 2: address PR #68 review — merge semantics, cardinality, residual
Five fixes from PR #68 review (Cursor Bugbot + Codex + Cubic): * **scan_with_pending gains merge-shadow semantics** (Codex P1, Cubic P1#1): new `key_column: Option<&str>` parameter. When set, committed rows whose key value appears in any pending batch are excluded from the scan — making `scan_with_pending` correctly merge-semantic for chained updates instead of naively unioning. execute_update calls with Some("id"). Without this, a chained `update where age > 30` could match a row whose pending value already moved out of range. * **Multi-delete on same table no longer trips ExpectedVersionMismatch** (Cursor Bugbot HIGH): open_table_for_mutation routes through reopen_for_mutation when staging.inline_committed has the table, using the post-inline-commit Lance version captured at record_inline time. The legacy open_for_mutation_on_branch fence (Lance HEAD == manifest pinned) is correct cross-writer but wrong intra-query when deletes have already advanced HEAD on this table. Branch goes away when Lance ships two-phase delete (lance-format/lance#6658). * **Cardinality validation consolidated** (Cursor LOW + Codex P2 + Cubic P1#2 + Cubic P2): new exec/staging::count_src_per_edge + enforce_cardinality_bounds shared by mutation and loader paths. Restores the missing min-cardinality check on the engine path. Loader Merge mode passes Some("id") to dedupe edges being updated by id (not double-count committed + pending). Loader Append mode and engine path pass None (ULID-generated ids never collide). * **Dead count_rows_with_pending removed** (Cursor LOW): never called. * **Misleading concat-helper comment fixed** (Cubic P3): claimed schema normalization the helper doesn't implement. Updated to match reality. * **Documentation honesty** (Cubic P1#3): MR-794 narrows but doesn't eliminate the "Lance HEAD ahead of __manifest" drift class. Drift is unreachable for op-execution failures (the partial_failure test pins this), but a residual remains at the finalize→publisher boundary because Lance has no multi-dataset commit primitive: per-table commit_staged calls run sequentially before manifest commit. Updated docs/runs.md, docs/invariants.md §VI.25, docs/releases/v0.4.1.md to scope the claim precisely. * **Failpoint test pinning the residual**: new mutation.post_finalize_pre_publisher failpoint + two tests in tests/failpoints.rs that confirm the documented residual behavior. Catches future regressions that widen the residual. Test additions on tests/runs.rs: * chained_updates_with_overlapping_predicate_respects_intermediate_value * multi_statement_delete_on_same_node_table * cascade_delete_node_then_explicit_delete_edge_on_same_table * mutation_insert_edge_enforces_min_cardinality * load_merge_mode_dedupes_edge_for_cardinality_count 113/113 engine integration tests pass (runs + end_to_end + consistency + staged_writes + validators). Failpoints feature build runs in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
bbf610ea9b
|
MR-794 step 2: rewire loader for Append/Merge — Overwrite stays inline
load_jsonl_reader dispatches on mode: Append/Merge use the MutationStaging accumulator (per-type batch staging, single stage_* + commit_staged per touched table at end-of-load, publisher CAS). Overwrite keeps the legacy concurrent inline-commit path (truncate-then-append doesn't fit the staged shape; overwrite has no in-flight read-your-writes requirement). * New helpers collect_node_ids_with_pending and validate_edge_cardinality_with_pending_loader — loader analogs of the engine's pending-aware validators. * Phase 2c (RI) and Phase 3 (cardinality) consult pending batches for Append/Merge so a mid-load failure aborts the load before any Lance write reaches HEAD. A failed Append/Merge load no longer advances Lance HEAD on staged tables — the next load on the same tables proceeds normally with no ExpectedVersionMismatch. Overwrite mode's drift residual is unchanged from today's behavior; documented in docs/runs.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
a7109d5fba
|
MR-771: address PR review feedback
Three fixes from automated PR review on #65: 1. Internal-branch guard in mutation/load (Cursor Bugbot, Medium). Pre-MR-771 the begin_run path called ensure_public_branch_ref; the direct-publish replacements only normalized the name. A caller passing __run__* or __schema_apply_lock__ verbatim could write directly to a system branch. Re-add the explicit guard at the public write boundary in mutate_with_current_actor and load. 2. Panic-safe coordinator restoration (Cursor Bugbot, High). The previous swap-and-restore pattern would skip restore_coordinator if execute_named_mutation panicked, leaving the handle pinned to the wrong branch indefinitely. Replace with a CoordinatorRestoreGuard RAII type that captures the previous coordinator on swap and restores it in Drop. 3. Flaky cancel-safety test (cubic, P2). tests/runs.rs::cancelled_mutation_future_leaves_no_state asserted manifest version equality after handle.abort(), but abort races the spawned task. Re-frame around what actually defines cancel safety: no __run__* branches, no _graph_runs.lance, no synthesized public branches. The fourth comment (Codex P1: branch_delete losing its in-flight write barrier) is bigger in scope — fits in the MR-794 storage-trait staging story rather than a hotfix here. Tracked there. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
35be20cb05
|
MR-771: demote Run to direct-publish via expected_table_versions CAS
mutate_as and load now write directly to target tables and call the publisher once at the end with per-table expected versions; the Run state machine, _graph_runs.lance writers, __run__ staging branches, and server /runs/* endpoints are removed. Multi-statement mutations remain atomic at the manifest level via an in-memory MutationStaging accumulator that gives read-your-writes within a query and a single publish at the end. Concurrent-writer conflicts surface as ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the old DivergentUpdate merge shape. Documents one known limitation in docs/runs.md: a multi-statement mid-query failure where op-N writes a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the manifest until a follow-up introduces per-table Lance branches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
56b6319197
|
Enforce schema validators on every write path (#59)
Several validators were defined but only called from a subset of write paths, so writes that violated @unique, @range, @check, enum, or @cardinality constraints could silently succeed and corrupt data. Adds two new helpers in loader/mod.rs: - validate_enum_constraints — batch-level enum check, scans Arrow string columns (and list-of-string columns) for values outside the declared set - enforce_unique_constraints_intra_batch — single-batch duplicate detection over named columns; partial enforcement (does not check against committed rows yet — cross-batch enforcement is a separate effort) Wires the validators into: - load_jsonl_reader nodes (alongside the existing validate_value_constraints call) and edges (which had no enum or unique check at all) - exec/mutation.rs node insert, edge insert, and update paths - mutation edge insert now also calls validate_edge_cardinality after the row lands but before the manifest commit, matching the loader's Phase 3 behavior A new tests/validators.rs suite asserts rejection on every entry path for invalid enum values, @range violations, intra-batch @unique duplicates, and edge @card excesses. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
74eb5a5380
|
Parallel per-type load writes + omnigraph optimize/cleanup CLI (#46)
* Parallel per-type load writes + omnigraph optimize/cleanup CLI
## MR-677.3 — parallel per-type load writes
The load path already groups records into one RecordBatch per type and
makes one Lance commit per table (loader::mod.rs:249-..), but those
commits ran sequentially. Wrap node and edge write loops in
`futures::stream::buffered(N)` against a new helper
`write_batches_concurrently`. Concurrency tunable via
`OMNIGRAPH_LOAD_CONCURRENCY` (default 8).
## MR-676 — `omnigraph optimize` and `omnigraph cleanup`
New CLI subcommands that walk every node + edge table in the repo:
- `omnigraph optimize <uri>` — runs Lance `compact_files` on each
table to merge small fragments into fewer larger ones.
- `omnigraph cleanup <uri> --keep N | --older-than 7d --confirm` —
runs Lance `cleanup_old_versions` to prune historical manifests +
unique fragments. Requires `--confirm` because it's destructive.
Supports both count-based and time-based retention (or both AND'd
together). Time uses chrono `DateTime<Utc>` (added as a workspace
dep, default-features off).
Both commands run their per-table loops in parallel (8-way bounded,
`OMNIGRAPH_MAINTENANCE_CONCURRENCY` env override). Smoke-tested
against the 114-table prod graph: optimize went 7m15s sequential
→ 1m28s parallel. cleanup --keep 1 removed 137 historical versions
across 114 tables in 1m57s without disrupting `/healthz` or query
responses.
Public API on `Omnigraph`:
pub async fn optimize(&mut self) -> Result<Vec<TableOptimizeStats>>
pub async fn cleanup(&mut self, opts: CleanupPolicyOptions)
-> Result<Vec<TableCleanupStats>>
All 10 existing loader tests still pass.
Closes MR-676.
Partially addresses MR-677 (the .3 — parallel by type — piece;
MR-677.1 is for the `omnigraph embed` path, not load, since load
doesn't call Gemini directly. .2 was already in place).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: regenerate openapi.json
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
|
||
|
|
338289656a | Initial public Omnigraph repository |