omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-09 01:35:18 +02:00

Author	SHA1	Message	Date
Devin AI	e44a4704eb	docs: fix admission gating description	2026-05-10 14:16:26 +00:00
Devin AI	a42d178119	release: prepare omnigraph 0.4.2	2026-05-10 14:02:28 +00:00
Devin AI	6a3f0677ae	server: drop unwired try_admit_rewrite / 503 admission surface	2026-05-09 20:58:17 +00:00
Devin AI	4bb7964af9	tests: matrix cell k asserts post-reopen row count	2026-05-09 20:16:44 +00:00
Devin AI	708e170dc5	engine: branch-merge revalidates target snapshot under queue	2026-05-09 20:16:12 +00:00
Ragnor Comerford	f9a0f31f80	server: drop 503 from OpenAPI on admission-gated endpoints (unreachable) Cursor Bugbot LOW on commit `3ad359d`: try_admit_rewrite is defined and tested but no HTTP handler calls it; the six handler OpenAPI annotations declared status = 503 (added in `8e1a8e7`) but try_admit (the only path handlers invoke) returns 429 only. 503 was unreachable. Fix: remove (status = 503, ...) from the six handler OpenAPI annotations and regenerate openapi.json. Kept as forward-looking infrastructure: try_admit_rewrite, global rewrite semaphore, RejectReason::GlobalRewriteExhausted, ApiError::ServiceUnavailable, the 503 branch in IntoResponse, --global-rewrite-cap, and OMNIGRAPH_GLOBAL_REWRITE_MAX. When a future commit wires try_admit_rewrite into a handler, the 503 OpenAPI annotation lands alongside that wiring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:54:24 +02:00
Ragnor Comerford	3ad359db8b	tests: admission test uses new_with_workload, drops env mutation + #[serial] Migrates `ingest_per_actor_admission_cap_returns_429` from env-var override to direct `WorkloadController::new(1, ...)` construction via `AppState::new_with_workload`. Removes the `EnvGuard` and the `#[serial]` annotation that paired with it. Why correct by design (AGENTS.md rule 9): the previous round's matrix fix (commit `8bd9a5f`) shielded the matrix from this test's env mutation, but the broader bug class — "test A's process-wide env mutation can leak into any test B that calls `AppState::open` / `WorkloadController::from_env()`" — was still reachable by any future test that didn't think to opt out. Closing the class at the source: this test no longer mutates global state at all, so no other test needs to defend against it. Net effect: - This test no longer needs `#[serial]` (was the only reason it was marked) — runs in parallel with the rest of the suite. - The matrix's defensive `with_defaults()` construction (commit `8bd9a5f`) remains correct but is no longer required for correctness; it's now a "belt and suspenders" guard against any FUTURE env-mutating test. Verified locally: both tests pass when run together; full server suite (44 tests) green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:35:41 +02:00
Ragnor Comerford	8bd9a5ff14	tests: matrix harness uses with_defaults() workload, not from_env() Round 4 CI failure: Test Workspace and server-aws both red on `concurrent_branch_ops_morphological_matrix` cell b ("merge × merge: same-target-distinct-sources") — second merge returned 429 instead of 200. The matrix passes locally. Root cause: cargo test runs tests in parallel by default. The admission test `ingest_per_actor_admission_cap_returns_429` is wrapped with `#[serial]` and an EnvGuard that sets `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1` for its duration. Process-wide env vars are visible to concurrently-running tests; the matrix's `Harness::new()` called `AppState::open()` which delegates to `WorkloadController::from_env()`, picking up cap=1 if it ran while the admission test held the EnvGuard. With cap=1 + 2 concurrent merges in cell b, one merge waits behind merge_exclusive while the other is admitted; the waiter holds its admission permit, but a fresh actor permit is needed when admission is per-actor — the second merge's permit acquisition fails because the first hasn't released yet, and 429 fires. Fix (correct by design, AGENTS.md rule 9): the matrix harness builds the WorkloadController explicitly via `WorkloadController::with_defaults()` and passes it to `AppState::new_with_workload`, the constructor added in commit `22d76db`. Closes the bug class "tests pick up another concurrent test's env override at construction time" — the matrix is now insulated from any env-var manipulation in the rest of the test suite. Verified locally: with `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1` set in the environment, the matrix passes (it ignores env entirely now). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:19:42 +02:00
Ragnor Comerford	99b0941478	tests: remove three narrow concurrent_branch_* tests subsumed by T1 The previous commit added `concurrent_branch_ops_morphological_matrix` covering 11 cells with stronger assertions (identity + post-op /change + reopen). The three narrow tests it replaces: - concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator → matrix cell f, with identity assertions added - concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other → matrix cells a + b + c, with identity assertions that close the symmetric-swap blind spot cubic flagged on commit `64f2b99` - concurrent_change_during_branch_merge_preserves_writes → matrix cell d The matrix retains the original tests' diagnostic granularity through named cell labels in every assertion message ("[a:merge×merge:distinct-targets] merge a"), so a CI failure points to the exact cell + invariant. Net: 522 lines removed, 0 coverage lost. All other server tests pass unchanged (44 total). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:09:21 +02:00
Ragnor Comerford	ac8594462e	tests: branch-ops morphological matrix (T1) Replaces three narrow concurrent_branch_* tests (folded in below) with one parameterized matrix test covering 11 representative (op_a, op_b, target_overlap) cells, asserting C1-C6 uniformly: C1 — both complete (no deadlock; tokio::time::timeout(15s)) C2 — status: both 200 or exactly one clean conflict; never 500 C3 — per-target row count C4 — per-target row identity (named persons present + absent — catches the symmetric-swap class that count assertions miss; cubic P2 on commit `64f2b99` flagged this gap on the round-3 merge race test) C5 — engine state coherent (subsequent /snapshot consistent) C6 — post-op /change on main succeeds (engine isn't poisoned) Cells: a. Merge × Merge, distinct targets — branch_merge_impl race pin b. Merge × Merge, same target / distinct sources — merge_exclusive serialization c. Merge × Merge, same source / distinct targets — fanout d. Merge × Change, into target — per-(table, branch) queue e. Merge × BranchCreateFrom, target — interaction with refresh path f. BranchCreateFrom × BranchCreateFrom, distinct parents — round-1 race pin g. BranchCreateFrom × BranchDelete, unrelated branches — disjoint state h. BranchDelete × BranchDelete, distinct branches — concurrent refresh i. BranchDelete × Change, distinct branch — refresh-side vs writer j. BranchCreateFrom × Change, on source — fork-while-writing k. Reopen consistency after concurrent pair — disk-vs-cache drift Each cell: - spins up its own tempdir + AppState so failures don't cascade, - aligns the pair at a tokio::sync::Barrier so both reach the engine close in time, - wraps in a 15s deadlock timeout, - asserts identity via a /read with the `get_person` fixture query (specific names must be present on the right branch and absent from the wrong one). Subsumes: - concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator (now cell f, with identity assertions added) - concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other (now cells a + b + c, with identity assertions; the symmetric-swap blind spot cubic flagged on commit `64f2b99` is closed) - concurrent_change_during_branch_merge_preserves_writes (now cell d) Those three narrow tests are removed in the next commit so this lands green standalone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:07:37 +02:00
Ragnor Comerford	64f2b994f5	bench: assert --heavy-concurrency > 0 instead of silently clamping Closes the cubic P2 finding on commit `22d76db`: `Semaphore::new(concurrency.max(1))` silently coerced --heavy-concurrency=0 to 1, so the JSON output reported 0 while execution actually used 1. Reported settings differed from actual. Adds an explicit `--heavy-concurrency > 0` check in `main()` (with a helpful error message pointing to --heavy-batches=0 as the way to disable heavy traffic) and a defensive `assert!()` inside `drive_heavy_actor` so future callers can't pass 0 silently. Verified: `bench_actor_isolation --heavy-concurrency 0` exits with code 2 and the explanatory error message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 19:23:02 +02:00
Ragnor Comerford	2b2e723125	tests: pin branch_merge swap-restore race (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix. Cursor Bugbot HIGH on commit `22d76db` rediscovered the residual flagged in the round 1 honest-review note: `branch_merge_impl` at `crates/omnigraph/src/exec/merge.rs:1085-1100` still uses the swap_coordinator_for_branch + operate + restore_coordinator pattern across three separate `coordinator.write().await` acquisitions. The same shape that branch_create_from_impl shed in commit `4ffbf6e`. The test spawns two concurrent /branches/merge calls A (feature-a → target-a) and B (feature-b → target-b) aligned at a tokio::sync::Barrier so both reach swap_coordinator_for_branch close in time. M=4 iterations boost race-catching odds. Currently fails on `22d76db` with target-a=5, target-b=4: B's merge landed on the wrong coord — target-b never got Frank because A's swap pushed self.coordinator to target-a, B's swap captured target-a as B's "previous", and B's restore set self.coordinator back to target-a (not the original main). Subsequent operations using self.coordinator point at the wrong branch. Fix lands in the next commit: serialize concurrent branch merges via `merge_exclusive: Arc<tokio::sync::Mutex<()>>` held across the entire swap-operate-restore window. Closes the bug class "non-atomic three-step coordinator manipulation" for branch_merge by serializing merges relative to each other; per-(table, branch) queue inside the merge body still lets merges and other writers run concurrently. A deeper "operate on local coord" refactor (the round-1 fix shape for branch_create_from) requires unwinding `branch_merge_on_current_target` and its uses of `self.snapshot()` / `self.ensure_commit_graph_initialized()`; deferred to a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 19:12:03 +02:00
Ragnor Comerford	22d76dbb40	server+bench: AppState::new_with_workload; bench drops set_var, exercises heavy cap Two cubic findings on bench_actor_isolation.rs flagged together: P2 (lib.rs:202): `unsafe { std::env::set_var(...) }` ran inside `#[tokio::main] async fn main()` AFTER the multi-thread tokio runtime was up. Rust 2024 made `set_var` unsafe because libc's `setenv` is not thread-safe; concurrent env reads from logging or runtime internals can race or read torn state. Fix (correct by design, AGENTS.md rule 9): add a public `AppState::new_with_workload(uri, db, bearer_tokens, workload)` constructor that takes a caller-built `WorkloadController`. Tests and benches override per-actor caps via the constructor instead of mutating global env. Closes the bug class "tests need to mutate global env to override AppState defaults." P2 (lib.rs:130): heavy actor's `oneshot.await` inside the loop serialized — heavy in-flight count was always 1, so cap=1 never tripped on the heavy side. The bench validated isolation (light p99 bounded) but didn't demonstrate the rejection path. Fix: add a `--heavy-concurrency` arg (default 4) and spawn batches as concurrent tokio tasks bounded by an internal semaphore. With heavy_concurrency=4 and inflight_cap=1, the bench now reports heavy_too_many_requests > 0 and heavy_ok == 1 at peak — proving the gate fires for the heavy actor. Sample run on local FS (4 light actors × 30 ops, 20 heavy batches × 50 rows, heavy_concurrency=4, cap=1): heavy_ok: 1 heavy_too_many_requests: 19 light_ok: 120 light_too_many_requests: 0 light_p99: 565 ms (target < 2 s) Heavy saturates its own cap; light actors are completely unaffected. The isolation property is now empirically proven by the rejection counts rather than just by the latency tail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:57:42 +02:00
Ragnor Comerford	8e1a8e7d55	server: document 429 / 503 in admission-gated endpoint OpenAPI responses Closes the cubic finding (P2) at lib.rs:1061: the new admission gates add HTTP 429 / 503 failure paths but the affected endpoint `#[utoipa::path(... responses(...) ...)]` annotations weren't updated. Also closes a pre-existing miss on /change (admission-gated since PR 2 Step F). Adds (status = 429, ...) and (status = 503, ...) to all six admission-gated endpoints: - POST /change (operation_id = "change") - POST /schema/apply (operation_id = "applySchema") - POST /ingest (operation_id = "ingest") - POST /branches (operation_id = "createBranch") - DELETE /branches/{branch} (operation_id = "deleteBranch") - POST /branches/merge (operation_id = "mergeBranches") The descriptions reference the `Retry-After` header, which the `IntoResponse for ApiError` impl emits on both codes (added in commit `c745dd6`). openapi.json regenerated via OMNIGRAPH_UPDATE_OPENAPI=1; the openapi sentinel test passes both with the regen flag and in strict-check mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:49:02 +02:00
Ragnor Comerford	b09a0972cb	bench: add actor-isolation harness for WorkloadController Empirical proof of MR-686's central design promise: per-actor admission control isolates noisy actors from light traffic. The existing bench_concurrent_http harness measures aggregate throughput; this harness measures the latency tail seen by light actors while a heavy actor saturates its own per-actor cap. Setup: one "heavy" actor flooding /ingest with multi-row NDJSON batches; N "light" actors each running short bursts of /change inserts, each authenticating with a distinct bearer token so the WorkloadController accounts them as separate identities. Output: heavy throughput / 429 count, light p50/p95/p99/max latency. Acceptance heuristic on local FS: light-actor p99 < 2 s while the heavy actor saturates its own cap. Sample run on local FS, cap=1, 4 light actors x 30 ops, 20 heavy batches x 50 rows: light p99 = 710 ms, light errors = 0 (well under the 2 s acceptance target). The test demonstrates the isolation property — the heavy /ingest holds its own admission slot but doesn't affect light actors since they have separate per-actor state. Usage: cargo run --release -p omnigraph-server --example bench_actor_isolation -- \ --light-actors 4 --light-ops-per-actor 30 \ --heavy-batches 20 --heavy-rows-per-batch 50 \ --inflight-cap 1 \ --output .context/bench-results/after-pr2-phase2/actor-isolation.json Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:12:50 +02:00
Ragnor Comerford	976aa0ec1d	tests: pin concurrent /change + branch_merge interleave preserves writes Future-proofs against MR-895 work that may move or remove the per-(table, branch) writer queue acquisition inside `branch_merge` (`crates/omnigraph/src/exec/merge.rs:1224`). Today the queue linearizes a concurrent /change on main against a `branch_merge feature → main` on the same touched tables; both succeed and the inserted row is preserved post-merge. Codex flagged this scenario as a P1 in PR #75 review claiming the merge could silently overwrite concurrent target writes because the source-rewrite path opens with `MutationOpKind::Merge` (skipping the strict pre-stage check). Validation showed the queue at merge.rs:1224 is held across both Phase B (per-table commit_staged) and Phase C (manifest publish), so there's no interleave window. The Merge op_kind only affects same-process pre-stage drift detection, not cross-write linearization. The test passes on `f925ad1`; landing it as a regression sentinel catches future changes that drop the queue acquisition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:03:05 +02:00
Ragnor Comerford	5520ab72ff	tests: pin disjoint /change concurrency at HTTP level Closes the cubic acceptance-criteria gap (❌ "Integration test: two /change requests targeting different (table_key, branch) execute concurrently end-to-end"). The bench harness measures the throughput side; this test is the regression sentinel that catches a future change which accidentally re-introduces graph-wide serialization on the disjoint path. Spawns 4 concurrent /change inserts on node:Person and 4 on node:Company. All 8 must return 200, and the post-test row counts on each table must reflect every insert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:01:52 +02:00
Ragnor Comerford	c745dd69ae	server: emit Retry-After header on 429 / 503 responses Closes the doc-vs-code gap at api.rs:343 and lib.rs:344-355: the documentation claims `Retry-After` is set on TooManyRequests / ServiceUnavailable responses, but `IntoResponse for ApiError` emitted only `(StatusCode, Json(ErrorOutput))` — no header. Wires a constant `RETRY_AFTER_SECONDS = "60"` for both 429 and 503 codes. Plumbing per-RejectReason durations through is a follow-up; the admission rejects we surface today recover bounded by request handler duration rather than calendar wait, so a constant suffices. Pinned by `ingest_per_actor_admission_cap_returns_429`. Test now fully green: 1+ of 8 concurrent /ingest under cap=1 receives 429 with Retry-After: 60. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:58:47 +02:00
Ragnor Comerford	05a8bd5de1	server: gate /ingest /branches/* /schema/apply on per-actor admission Closes the gap that admission control only fired on /change. A heavy actor sending bulk-ingest traffic could exhaust shared engine capacity (Lance I/O threads, manifest churn) without hitting the per-actor cap. Wires `state.workload.try_admit(&actor_arc, est_bytes)` into the five remaining mutating handlers AFTER Cedar authorization (so denied requests don't consume admission slots) and BEFORE the engine call. Byte estimates per handler: - /ingest: request.data.len() (NDJSON body) - /schema/apply: request.schema_source.len() - /branches/create, /branches/delete, /branches/merge: 256 (small JSON; the heavy work is bounded per-(table, branch) by the engine's writer queue rather than by request size) The admission guard is held in `let _admission = ...` so it stays alive until handler return, releasing the count permit + decrementing the byte budget on drop. Pinned by `ingest_per_actor_admission_cap_returns_429` (previous commit). The test still fails on the Retry-After header assertion; the next commit emits the header. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:57:53 +02:00
Ragnor Comerford	0976cbebc5	tests: pin /ingest admission gate + 429 Retry-After (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix. Currently fails on `f925ad1` with 8/8 statuses returning 200 because /ingest does not call WorkloadController::try_admit. The test pins: - /ingest is gated on per-actor admission control (returns 429 when the cap is exceeded). - 429 responses carry the structured `code: too_many_requests` error body so clients can distinguish them from generic conflicts. - 429 responses include a `Retry-After` header so clients can implement bounded backoff. The doc claim at api.rs:343 and lib.rs:344 was that this header exists; the IntoResponse impl currently emits no headers. Two follow-up commits will turn this green: 1. Wire WorkloadController::try_admit on /ingest and the four other mutating handlers (Block 2.1). 2. Emit the Retry-After header on 429/503 responses (Block 2.2). The test uses #[serial] + EnvGuard to override OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1 without racing parallel tests, then spawns 8 concurrent /ingest tasks aligned at a tokio::sync::Barrier so multiple tasks reach try_admit close in time. With cap=1, at least one must be rejected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:57:01 +02:00
Ragnor Comerford	c263732b1a	tests: extend same-key insert test with /snapshot row-count assertion The existing change_concurrent_inserts_same_key_serialize_without_409 test claimed in its comment "asserts the final row count equals N" but only checked HTTP status codes. cubic flagged the gap; this commit adds the actual /snapshot read after the concurrent inserts to verify all N batches landed (no silent overwrite) by comparing the post-test node:Person row_count against SEED + N. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:49:38 +02:00
Ragnor Comerford	3b33e9ac56	tests: pin branch_create_from swap-restore race (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix so the red → green pair is visible in git log. The test demonstrates that two concurrent `POST /branches` calls with distinct `from` parents corrupt coordinator state: A's "operate" step runs against B's swapped coordinator instead of its own, forking the new branch off the wrong parent's HEAD. Currently fails on `f925ad1` with all 8 gamma branches (declared parent: alpha, 5 rows) reporting 4 rows — beta's row count. The operate step ran against beta's coord because B's swap interleaved between A's swap and A's operate. Fix lands in the next commit: hold a single `coordinator.write().await` guard across the entire swap-operate-restore sequence in `branch_create_from_impl` so the three steps are atomic relative to other callers. Closes the bug class "non-atomic three-step coordinator manipulation under &self callers" rather than guarding the specific call site — the right architectural seam (single critical section per swap-restore sequence) eliminates the interleave window for branch_create_from and any future swap-restore caller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:44:50 +02:00
Ragnor Comerford	ebf5a5769d	tests: pin UPDATE RYW under in-process concurrency (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix so the red → green pair is visible in git log. The test asserts the RYW invariant for in-process concurrent UPDATEs on the same row: exactly one writer commits and N-1 receive 409 manifest_conflict. Currently fails on `f925ad1` with 1 x 200 + 7 x 500: > "storage: Retryable commit conflict for version 6: This Update > transaction was preempted by concurrent transaction Update at > version 6. Please retry." Lance's transaction conflict resolver correctly detects the Update vs Update race, but the error wraps as `OmniError::Lance(<string>)` and the API surfaces it as 500 internal rather than 409 retryable conflict. Users see "internal server error" for what is documented as a retryable conflict path. The fix lands in the next commit: an op-kind-aware drift check at the commit_all entry that returns 409 ExpectedVersionMismatch for tables whose first touch was Update / Delete / SchemaRewrite when the staged dataset version drifts from the manifest pin under the queue. Closes the bug class "Lance internal conflict surfaces as 500 instead of 409" rather than mapping the specific Lance error variant — the right architectural layer (engine boundary, under the queue) catches the drift before commit_staged ever runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:33:53 +02:00
Ragnor Comerford	f925ad1739	mr-686: Phase 2 — op-kind-aware version check + coord Mutex → RwLock Fix A: op-kind-aware ensure_expected_version. Insert/Merge skip the strict pre-stage check; Update/Delete/SchemaRewrite keep it. New MutationOpKind enum threaded through open_for_mutation_on_branch / open_owned_dataset_for_branch_write / reopen_for_mutation and all callers (execute_insert/update/delete_node/delete_edge, branch_merge::publish_rewritten_merge_table, schema_apply, ensure_indices_for_branch, loader Append/Merge/Overwrite). Closes the 77% rejection rate on same-key concurrent inserts. Fix B: coordinator Mutex -> RwLock. Reads parallelize via .read(); writes serialize via .write(). Atomic-commit invariant preserved by the single .write() covering commit_manifest_updates + record_graph_commit. Bench-as-test change_concurrent_inserts_same_key_serialize_without_409 (server.rs:2180) spawns 12 concurrent /change inserts on a single (table, branch); asserts every request returns 200. Was failing pre-Phase-2; passes post-Phase-2. change_conflict_returns_manifest_conflict_409 (cross-process drift sentinel) and branch_merge_conflict_response_includes_structured_conflicts both still pass. Bench (after-pr2-phase2): - single-actor 1x1: 14.9 ops/s, p50 68ms (baseline 12.3, +22%) - disjoint 8x8: 7.04 ops/s, p50 1023ms (baseline 6.24, +13%) - same-key 8x1: 2.62 ops/s, 0 errors (after-pr2: 77% errors) Disjoint stayed at +13% — Fix B's RwLock helped read paths but the publisher's .write() critical section still serializes graph-wide. Splitting GraphCoordinator into per-concern primitives (manifest in ArcSwap, commit_graph in RwLock, atomic-commit serializer) is the deferred next step. 102 lib + 30 branching + 24 runs + 16 staged_writes + 63 end_to_end + 40 server tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 12:42:26 +02:00
Ragnor Comerford	c15962e6b0	server: flip AppState to Arc<Omnigraph>, wire admission on /change (PR 2 Step F) The substantive PR 2 change. Removes the global server `RwLock<Omnigraph>` that has serialized every mutating request across all actors. Disjoint `(table, branch)` writes from different actors now run concurrently, guarded only by the engine's per-(table, branch) write queue (PR 1b) and per-actor admission control (PR 2 Step E). AppState changes: - `db: Arc<RwLock<Omnigraph>>` -> `engine: Arc<Omnigraph>` - New field: `workload: Arc<workload::WorkloadController>` initialized from env (`OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=16`, `OMNIGRAPH_PER_ACTOR_BYTES_MAX=4GiB`, `OMNIGRAPH_GLOBAL_REWRITE_MAX=4`). - `tokio::sync::RwLock` import dropped. Handler updates (16 sites): - All `Arc::clone(&state.db).read_owned().await` and `write_owned()` calls replaced with `let db = &state.engine`. Engine APIs are now `&self` (Step C) so this works directly. - `/export` clones `Arc<Omnigraph>` once and moves into the spawned task instead of acquiring a long-held read lock. - `/change` handler additionally wires `state.workload.try_admit(&actor_arc, est_bytes)`. Cedar runs FIRST so denied requests don't consume admission slots; admission runs SECOND before the engine call. `est_bytes` uses the request body size as a coarse proxy. API surface additions (`api::ErrorCode`): - `TooManyRequests` -> HTTP 429 (per-actor cap exceeded; respect `Retry-After`) - `ServiceUnavailable` -> HTTP 503 (global rewrite pool exhausted) `ApiError` constructors `too_many_requests` / `service_unavailable` and `from_workload_reject` (maps `RejectReason` variants to HTTP status). Other mutating handlers (`/ingest`, `/branches/*`, `/branches/merge`, `/schema/apply`) currently flow through the Arc<Omnigraph> path without admission gates; wiring those is mechanical and lands as a follow-up. The /change hot path covers the bulk of MR-686's load profile. OpenAPI regenerated to include the new ErrorCode variants. 102 lib + 39 server tests + 5 workload tests pass. The regression sentinel `change_conflict_returns_manifest_conflict_409` continues to pass (revalidation perf opt + per-table queue + publisher CAS preserve manifest_conflict semantics under concurrent writers). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 17:08:26 +02:00
Ragnor Comerford	17a1665002	server: add WorkloadController for per-actor admission (PR 2 Step E) PR 2 removes the global server `RwLock<Omnigraph>` (Step F). Without admission control, one heavy actor would exhaust shared capacity (Lance I/O threads, manifest churn, network) and starve other actors. The WorkloadController bounds per-actor in-flight count + bytes and provides a global rewrite-pool semaphore for compaction / index builds. New file: `crates/omnigraph-server/src/workload.rs` (~250 LOC + 5 tests). API: - `WorkloadController::new(inflight_cap, byte_cap, rewrite_cap)` / `from_env()` / `with_defaults()`. - `try_admit(actor_id, est_bytes) -> Result<AdmissionGuard, RejectReason>` acquires both an in-flight count permit and adds est_bytes to the per-actor counter atomically; returns RejectReason on either gate. - `try_admit_rewrite() -> Result<RewriteGuard, RejectReason>` for the global rewrite pool (Step F maps RewriteGuard exhaustion to HTTP 503). - `RejectReason::{InFlightCountExceeded, ByteBudgetExceeded, GlobalRewriteExhausted}`. Race-free admission via `tokio::sync::Semaphore::try_acquire_owned()` for the count gate (master plan Finding 6: independent atomic load+check+add lets two callers both pass a cap-N check; the Semaphore gate is atomic). Bytes use `fetch_add` + decrement-on-rejection so the cap is never exceeded even on rollback. Defaults (override via env): - OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=16 - OMNIGRAPH_PER_ACTOR_BYTES_MAX=4_294_967_296 (4 GiB) - OMNIGRAPH_GLOBAL_REWRITE_MAX=4 Tests cover under-cap admission, byte-budget rollback, per-actor isolation, global rewrite cap, and the load-bearing 32-concurrent-vs- cap-16 race test (forces real contention via a broadcast release channel so guards can't recycle permits task-by-task; pins the master plan's race-free invariant). Adds workspace dep `dashmap = "6"` for per-actor state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:59:45 +02:00
Ragnor Comerford	fcb47620d3	mr-686: bundle PR 0/1a/1b foundation + PR 2 catalog/schema_source ArcSwap Bundles the working-tree state from the prior session (PR 0 bench harness, PR 1a audit_actor_id removal, PR 1b WriteQueueManager + writer integration) together with the first half of PR 2's interior-mutability foundation (catalog and schema_source wrapped in Arc<ArcSwap<...>>). The two streams intermix in 7 of the same files, so splitting via git add -p was impractical. Subsequent PR 2 steps land as separate atomic commits. PR 0 — server-level concurrent /change bench harness - crates/omnigraph-server/examples/bench_concurrent_http.rs (new) - .context/bench-results/{baseline-main,after-pr1}/ (gitignored) PR 1a — drop the audit_actor_id field, thread per-call - removed Omnigraph::audit_actor_id and the swap-restore patterns in mutation.rs, merge.rs, loader/mod.rs - actor_id: Option<&str> threaded through MutationStaging::finalize, mutate_with_current_actor, ingest_with_current_actor, branch_merge_impl, branch_merge_on_current_target, commit_prepared_updates*, record_merge_commit, commit_updates_on_branch_with_expected - apply_schema and ensure_indices_for_branch pass None (system-attributed) PR 1b — per-(table_key, branch) write queue + revalidation + sidecar - new crates/omnigraph/src/db/write_queue.rs with WriteQueueManager, acquire/acquire_many, sorted+deduped acquisition; 6 unit tests - Arc<WriteQueueManager> field on Omnigraph + db.write_queue() accessor - MutationStaging::finalize split into stage_all (Phase A, no queue) and StagedMutation::commit_all (Phase B, acquire_many + revalidate pins + sidecar + commit_staged); guards held across publisher - delete-only mutations now emit recovery sidecars; revalidation extended to inline_committed tables - branch_merge_on_current_target, apply_schema_with_lock, and ensure_indices_for_branch acquire per-table queues for their touched tables PR 2 Step B (partial) — catalog and schema_source via ArcSwap - catalog: Catalog -> Arc<ArcSwap<Catalog>> - schema_source: String -> Arc<ArcSwap<String>> - public accessors return Arc<Catalog> / Arc<String>; readers bind locally where the borrow has to outlive an expression - new pub(crate) store_catalog / store_schema_source helpers replace the field assignments in apply_schema and reload_schema_if_source_changed - 117 tests across lifecycle/end_to_end/branching/runs pass; engine lib + workspace compile clean Coordinator wrap (Mutex) and the &mut self -> &self engine API conversion follow in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-07 16:22:38 +02:00
Ragnor Comerford	8726ffe0a3	release: bump version to 0.4.1	2026-05-02 23:20:50 +02:00
Ragnor Comerford	044ed46019	chore: scrub Linear ticket numbers and review-bot mentions from code comments OmniGraph is OSS; internal Linear ticket references and code-review-bot mentions in source-code comments don't help external readers and leak internal tooling. Replace ticket numbers (MR-XXX) with descriptive prose, drop linear.app URLs, and remove inline mentions of Cursor/Bugbot/Cubic/Codex review threads. Scope is limited to source-code comments (`crates/`). Docs under `docs/` keep their MR-XXX references — those are part of the established change-history narrative for in-repo docs and don't require a Linear account to find context for. No behavior changes; no public API changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 22:45:38 +02:00
Ragnor Comerford	35be20cb05	MR-771: demote Run to direct-publish via expected_table_versions CAS mutate_as and load now write directly to target tables and call the publisher once at the end with per-table expected versions; the Run state machine, _graph_runs.lance writers, __run__ staging branches, and server /runs/* endpoints are removed. Multi-statement mutations remain atomic at the manifest level via an in-memory MutationStaging accumulator that gives read-your-writes within a query and a single publish at the end. Concurrent-writer conflicts surface as ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the old DivergentUpdate merge shape. Documents one known limitation in docs/runs.md: a multi-statement mid-query failure where op-N writes a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the manifest until a follow-up introduces per-table Lance branches. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-30 08:52:50 +02:00
Andrew Altshuler	7310f69928	Revert "Merge pull request #49 from ModernRelay/ragnorc/x-request-id" (#54) This reverts commit `b352fca13c`, reversing changes made to `748ad334a9`.	2026-04-26 15:56:29 +03:00
Ragnor Comerford	b352fca13c	Merge pull request #49 from ModernRelay/ragnorc/x-request-id Add X-Request-Id middleware	2026-04-26 12:33:33 +02:00
Ragnor Comerford	e14b203208	Reuse X_REQUEST_ID constant for inbound header lookup Both Cursor Bugbot and Cubic flagged that the inbound `headers().get(...)` call constructed `HeaderName::from_static("x-request-id")` inline instead of reusing the `X_REQUEST_ID` constant defined at the top of the file. The two were already kept in sync by both being `from_static("x-request-id")`, but a future rename would have to touch both sites or risk silent drift between read and write. Also drops the now-unused `header` module import. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-26 12:05:19 +02:00
Ragnor Comerford	748ad334a9	Merge pull request #48 from ModernRelay/ragnorc/api-sdk-research Polish OpenAPI spec for SDK generation	2026-04-26 11:52:46 +02:00
Ragnor Comerford	284c9377c2	Add X-Request-Id middleware Per-request ULID minted at the edge, exposed in request extensions and on the response header. Caller-supplied X-Request-Id is echoed when well-formed (1..=128 ASCII printable characters); otherwise rejected and replaced with a fresh ULID so the value is always safe to log. Companion to the TypeScript SDK redesign — clients now correlate logs across the wire by reading X-Request-Id from response headers (and the SDK already surfaces it on every OmnigraphError as `requestId`). No spec change required; the header is a transport-layer concern. Tests: - mint a ULID when no header is provided - echo a valid caller-supplied id - reject overlong header (200 chars), mint a fresh ULID Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 22:56:17 +02:00
Ragnor Comerford	7809bf607e	Polish OpenAPI spec for SDK generation Add operation descriptions and examples to utoipa annotations so the generated TypeScript SDK has rich JSDoc, and so future Python/Go SDKs and any /openapi.json docs UI benefit from the same effort. - Doc comments on all 18 handlers (utoipa picks up summary/description) - #[schema(example = ...)] on free-text fields (query_source, schema_source, NDJSON data) and i64 timestamps - Destructive/irreversible warnings on change, applySchema, ingest, mergeBranches, deleteBranch, publishRun, abortRun Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-04-25 16:36:51 +02:00
Andrew Altshuler	74eb5a5380	Parallel per-type load writes + omnigraph optimize/cleanup CLI (#46 ) * Parallel per-type load writes + omnigraph optimize/cleanup CLI ## MR-677.3 — parallel per-type load writes The load path already groups records into one RecordBatch per type and makes one Lance commit per table (loader::mod.rs:249-..), but those commits ran sequentially. Wrap node and edge write loops in `futures::stream::buffered(N)` against a new helper `write_batches_concurrently`. Concurrency tunable via `OMNIGRAPH_LOAD_CONCURRENCY` (default 8). ## MR-676 — `omnigraph optimize` and `omnigraph cleanup` New CLI subcommands that walk every node + edge table in the repo: - `omnigraph optimize <uri>` — runs Lance `compact_files` on each table to merge small fragments into fewer larger ones. - `omnigraph cleanup <uri> --keep N \| --older-than 7d --confirm` — runs Lance `cleanup_old_versions` to prune historical manifests + unique fragments. Requires `--confirm` because it's destructive. Supports both count-based and time-based retention (or both AND'd together). Time uses chrono `DateTime<Utc>` (added as a workspace dep, default-features off). Both commands run their per-table loops in parallel (8-way bounded, `OMNIGRAPH_MAINTENANCE_CONCURRENCY` env override). Smoke-tested against the 114-table prod graph: optimize went 7m15s sequential → 1m28s parallel. cleanup --keep 1 removed 137 historical versions across 114 tables in 1m57s without disrupting `/healthz` or query responses. Public API on `Omnigraph`: pub async fn optimize(&mut self) -> Result<Vec<TableOptimizeStats>> pub async fn cleanup(&mut self, opts: CleanupPolicyOptions) -> Result<Vec<TableCleanupStats>> All 10 existing loader tests still pass. Closes MR-676. Partially addresses MR-677 (the .3 — parallel by type — piece; MR-677.1 is for the `omnigraph embed` path, not load, since load doesn't call Gemini directly. .2 was already in place). 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate openapi.json --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2026-04-25 14:22:14 +03:00
Andrew Altshuler	8649b2084f	Prepare v0.3.0 release (#44) * Prepare v0.3.0 release Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * chore: regenerate openapi.json * ci: retrigger CI on latest openapi.json --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2026-04-21 19:11:34 +03:00
Ragnor Comerford	a157f6a17c	Fold openapi.json auto-sync into main CI test job The separate openapi-sync workflow was duplicating the workspace build (~15 min cold-cache compile), paying the cost twice per PR. Fold the regen + auto-commit into the existing test job: one compile, shared rust-cache, same drift-check semantics. - Same-repo PRs: OMNIGRAPH_UPDATE_OPENAPI=1 during the test run, then commit the regenerated spec back to the PR branch - Fork PRs / pushes: env var empty, test stays in strict drift-check mode - openapi_spec_is_up_to_date treats empty env value as unset, so the conditional workflow env expression works Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 21:00:46 +02:00
Ragnor Comerford	9de2079263	Merge remote-tracking branch 'origin/main' into ragnorc/explore-api # Conflicts: # CONTRIBUTING.md	2026-04-18 20:24:39 +02:00
andrew	7a3bf5c758	Add aws feature + SecretsManagerTokenSource backend Introduces an opt-in AWS Secrets Manager backend for bearer tokens, behind the `aws` Cargo feature. Default builds (on-prem, local dev) don't pull in the AWS SDK and don't pay its compile cost. - New Cargo feature `aws` gates the `aws-config` + `aws-sdk-secretsmanager` optional deps. Default features remain empty. - New `auth::aws::SecretsManagerTokenSource` implements `TokenSource` by fetching a JSON `{"actor_id": "token", ...}` payload from a named Secrets Manager secret. Credentials resolve via the AWS default chain (env, shared config, IMDSv2 instance role, ECS task role) so no explicit plumbing is needed under an IAM role. - New `resolve_token_source()` dispatches based on the `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` env var. If the var is set but the binary was built without `--features aws`, returns a clear rebuild instruction rather than silently falling back. - `serve()` now uses `resolve_token_source()` and logs which source was selected at startup. - `parse_json_secret_payload()` is factored out as a free function so the payload validation (trim whitespace, reject blank actor/token, reject non-object) is unit-testable without the AWS SDK. - New CI job `test_aws_feature` builds + tests with `--features aws`. Not in this PR (follow-ups): - Background refresh loop for rotation. `SecretsManagerTokenSource` advertises `supports_refresh: true` but the AppState-level refresh task isn't wired yet. - Config-YAML dispatch (today the AWS source is selected via env var only; eventually `server.bearer_tokens.source` in `omnigraph.yaml`). Tests: - Default-feature build: 33 lib + 41 integration + 64 openapi. - `--features aws` build: 32 lib (one test is cfg-gated) + 41 + 64. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 03:48:51 +03:00
andrew	af41630520	Extract TokenSource trait for bearer token loading Pure refactor. No behavior change. Introduces a TokenSource trait so additional backends (AWS Secrets Manager, Vault, etc.) can plug in behind feature flags without touching the server wiring. - New module crates/omnigraph-server/src/auth.rs with the TokenSource trait and a single EnvOrFileTokenSource implementation that delegates to the existing server_bearer_tokens_from_env() function. - serve() now constructs EnvOrFileTokenSource and calls load() instead of calling the free function directly. - The trait has a supports_refresh() hook (false for env/file) for future implementations that can rotate without restart. - async-trait added to omnigraph-server deps; it's already in the workspace. Tests: - Unit tests in auth.rs covering load paths and the default supports_refresh / name values. - Existing 128 tests (lib + integration + openapi) pass unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 03:31:43 +03:00
andrew	c338e80180	Harden bearer auth: constant-time compare, hashed at rest, authoritative actor_id Fixes two live authz bugs in omnigraph-server: - Bearer-token lookup previously used HashMap::get, which compares keys with Eq and short-circuits on the first differing byte — a network-observable timing oracle for brute-forcing tokens. Tokens are now stored as SHA-256 digests and compared with subtle::ConstantTimeEq, iterating every entry unconditionally so total work is independent of which slot matches. Raw token bytes no longer live in server memory after startup. - authorize_request now overwrites PolicyRequest.actor_id from the authenticated session instead of trusting the handler-supplied field, which previously defaulted to "" via unwrap_or_default(). The empty string can no longer reach Cedar as a policy subject even if a future refactor drops the None check. External API of AppState constructors is unchanged — tokens still enter as Vec<(String, String)> and are hashed on the way in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 01:41:02 +03:00
andrew	be520f31f4	Polish schema endpoint: rename show, align field name, add tests Review feedback on #23, applied on top of the original commit: - Rename the CLI subcommand from `schema get` to `schema show` to match the existing `run show` / `commit show` convention. A `#[command(alias = "get")]` preserves muscle memory for anyone who already typed `get`. - Rename `SchemaGetOutput` → `SchemaOutput` and its field `source` → `schema_source`, so the get response and the apply request use the same field name for the same concept. - Use `println!` instead of `print!` in the CLI so the shell prompt doesn't land on the last line of schema output. - Add three integration tests on `/schema`: happy path (no auth), 401 when bearer is required but missing, 403 when the policy grants the actor branch_create but not read. Follow-ups left for a separate PR: include `schema_ir_hash` and `schema_identity_version` in the response payload so clients can do drift detection and the server can set an ETag; and a fast-path local read that skips `Omnigraph::open()` when only the schema source is needed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-18 00:30:46 +03:00
Ragnor Comerford	228032a4ac	Add static OpenAPI spec and Stainless SDK config Introduce SDK generation scaffolding: commit a static openapi.json extracted from the Utoipa annotations via a golden-file test, add Stainless workspace/config for TypeScript and Python SDKs, and clean up operation IDs for ergonomic generated method names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-17 14:26:31 +02:00
Claude	0c4df674fa	Add schema get command to CLI and HTTP API Exposes the existing schema_source() method via a new `omnigraph schema get` CLI subcommand and a `GET /schema` API endpoint, allowing users to retrieve the current accepted schema from any graph repository. https://claude.ai/code/session_01UYybeBQks3fz3RJrTHtwQw	2026-04-16 21:15:17 +00:00
andrew	33bdab1fcb	Prepare v0.2.2 release	2026-04-14 20:13:00 +03:00
andrew	3d74cbfc20	Prepare v0.2.1 release	2026-04-14 19:19:00 +03:00
andrew	1a26e2e654	Rename config targets to graphs	2026-04-14 04:12:14 +03:00
andrew	5daeae7571	Prepare v0.2.0 release	2026-04-12 20:35:34 +03:00
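
Commit `17a1665002` above describes the byte gate as `fetch_add` plus decrement-on-rejection. The following is a minimal std-only sketch of that pattern, not the code in `workload.rs`: `ByteBudget`, `try_reserve`, `release`, and `in_use` are hypothetical names invented for illustration, and the real `WorkloadController` may differ in structure and memory ordering.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for one actor's byte counter (assumption, not the real type).
struct ByteBudget {
    cap: u64,
    in_use: AtomicU64,
}

impl ByteBudget {
    fn new(cap: u64) -> Self {
        Self { cap, in_use: AtomicU64::new(0) }
    }

    /// Optimistically add `est_bytes`, then roll back if the new total is over the cap.
    /// The add itself is a single atomic step, so two callers cannot both observe
    /// the same "previous" value and both pass the check.
    fn try_reserve(&self, est_bytes: u64) -> bool {
        let prev = self.in_use.fetch_add(est_bytes, Ordering::AcqRel);
        if prev.saturating_add(est_bytes) > self.cap {
            // Rejected: undo our own contribution so the counter stays accurate.
            self.in_use.fetch_sub(est_bytes, Ordering::AcqRel);
            return false;
        }
        true
    }

    /// Give back bytes from a previously successful reservation.
    fn release(&self, est_bytes: u64) {
        self.in_use.fetch_sub(est_bytes, Ordering::AcqRel);
    }

    /// Current counter value (may briefly include a reservation that is about to roll back).
    fn in_use(&self) -> u64 {
        self.in_use.load(Ordering::Acquire)
    }
}
```

In this sketch the sum of admitted reservations never exceeds `cap`, but a caller can be rejected spuriously if it arrives between another caller's `fetch_add` and its rollback. That makes the gate conservative rather than exact.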

56 commits