omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-09 01:35:18 +02:00

Author	SHA1	Message	Date
Devin AI	327eb821b5	Remove orphaned loader/{constraints,embeddings,jsonl}.rs files These three files in crates/omnigraph/src/loader/ have no `mod` declaration anywhere in the workspace and no `#[path = "…"]` reference. They are not compiled — `touch`-ing them does not trigger `cargo check` to recompile anything. Their imports (`crate::catalog::schema_ir`, `crate::error::NanoError`, `crate::store::manifest::hash_string`, `crate::types::ScalarType`, `super::super::graph::DatasetAccumulator`) reference modules that no longer exist in the engine crate, so they could not even be wired in without further work. They are vestigial code from an earlier monolithic crate layout. The live functionality is independently implemented inside crates/omnigraph/src/loader/mod.rs. These files have been orphaned since the initial public commit. `cargo check --workspace --all-targets` and `cargo test --workspace --no-run` both pass with no new warnings. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-12 22:57:20 +00:00
devin-ai-integration[bot]	6914e0256e	MR-786: merge-pair truth table with exhaustive op-variant matrix (#81) * MR-786: merge-pair truth table with exhaustive op-variant matrix Add crates/omnigraph/tests/merge_truth_table.rs that enumerates every (left_op, right_op) cell from the operation vocabulary named in the ticket — {noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel} — and asserts the deterministic outcome of Omnigraph::branch_merge against a structured oracle. The matrix is built in a 9x9 match in build_case, so adding a new OpVariant is a compile-time, fail-on-omission task. Today's mutation grammar only exposes insert | update set | delete (see docs/query-language.md), so the 36 cells over the first six ops are executable and the 45 cells involving dropProperty/addLabel/removeLabel are recorded as Expected::Unsupported with a note. Each executable cell spins up a fresh tempdir, applies one mutation per branch, calls branch_merge, and asserts either: * MergeOutcome (AlreadyUpToDate / FastForward / Merged) plus a GraphAssert on the affected entities, or * an OmniError::MergeConflicts whose entries match the expected table_key + MergeConflictKind (row_id is optional because edge ULIDs are generated at runtime). branch_merge is directional, so the (L, R) and (R, L) cells live in separate entries in the matrix and are run independently — the op-pair symmetry encoded in build_case serves as the commutativity oracle without doubling the runtime. End-to-end the suite runs in ~10s on a fresh build, well under the 30s budget asserted at the bottom of the test. Also adds a row to docs/testing.md so the test-coverage map points future agents at this file. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * Use one Omnigraph handle for both branches Self-review caught that the runner was opening two Omnigraph handles on the same temp dataset (one for main, a second via Omnigraph::open for feature). tests/branching.rs uses one handle and passes the branch name to mutate_branch — same pattern works here and avoids any cache-coherency surprises between the two handles. Also drops the post-merge reopen, which only existed to give the second handle a fresh snapshot. Runtime drops ~10s -> ~9s. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * Assert exact conflict count, not subset inclusion cubic and Devin Review both flagged that check_outcome's Expected::Conflicts arm only enforces want ⊆ got, so a regression that produces a spurious extra conflict (e.g. emitting both OrphanEdge and a stray DivergentInsert) would silently pass the truth-table cell. For a deterministic oracle that's the wrong direction — the cell pins the exact conflict-artifact set, not a lower bound. Add an assert_eq!(got.len(), want.len()) before the existence loop. All 36 executable cells still pass; runtime unchanged. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * Subsume 4 conflict tests in branching.rs into truth table The four `branch_merge_reports_*_conflict` tests (DivergentUpdate / DivergentInsert / DeleteVsUpdate / OrphanEdge) were redundant with the deterministic-oracle cells in the new `merge_truth_table.rs` and only added drift risk. To preserve the post-conflict invariant that lived in `branch_merge_reports_divergent_update_conflict` (target unchanged after a failed merge), the truth-table runner now generalizes it: on every `Conflicts` cell, main's state is asserted against `state_after_apply_only(right_op)`. That gives strictly more coverage than the deleted tests carried, since the invariant now applies to *all* seven conflict cells, not just one. The `UniqueViolation` and `CardinalityViolation` cases stay in `branching.rs` — they're combinatorial (require >1 op per side with a non-default schema) and out of scope for the pair-wise truth table. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> * Fix misleading 'Total edges: 0' comment in (AddEdge, RemoveEdge) cell Devin Review flagged that the comment said 'Total edges: 0' while the parenthetical math evaluates to 1 (matching `GraphAssert::base()`). The assertion is correct; only the leading number in the comment was wrong. Reworded to 'Net edges: … = 1 (matches base)' so the prose agrees with both the math and the assertion. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com> --------- Co-authored-by: Ragnor <ragnor@modernrelay.com> Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>	2026-05-12 22:36:01 +03:00
Ragnor Comerford	3bd072c917	docs: add docs/transactions.md — branch-as-transaction explainer (#69) The architectural rule "no cross-query BEGIN/COMMIT; branches fill that role" lives in docs/invariants.md §VI.23 but is not surfaced anywhere user-facing. New users coming from Postgres/MySQL hit the gap when they realize multiple queries on main are independently atomic, not jointly atomic. This page explains the model with worked examples: * Single-query multi-statement (atomic by default) * Two separate queries on main (NOT atomic — common surprise) * Many queries via a branch (atomic at merge) * Coordinating multiple agents via branch-per-agent Plus a comparison table to BEGIN/COMMIT, failure-mode rundown, and "when to use what" decision matrix. Linked from AGENTS.md "Where to find each topic" between branches-commits.md and runs.md. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 22:35:57 +03:00
Ragnor Comerford	c9c7c0672e	Update README.md	2026-05-12 08:17:31 -07:00
Ragnor Comerford	c2e3a9e5c3	Add use cases for unified company brain and context graphs	2026-05-12 08:08:08 -07:00
Ragnor Comerford	676c9eab05	Merge pull request #78 from ModernRelay/devin/1778363660-mr-901-blob-branch-merge Fix branch merge with blob columns	2026-05-12 07:31:04 -07:00
Ragnor Comerford	d6d2763609	Merge pull request #80 from ModernRelay/devin/1778524905-mr-923-merge-restore-refresh Fix MR-923: refresh restored coordinator on merge Err path	2026-05-11 15:55:43 -07:00
Devin AI	725d41205e	Drop redundant server-level regression test The matrix cell d:merge×change:into-target already exercises this race: pre-fix it flakes ~20% on shared-CPU hardware (sentinel 409s); post-fix it passes 100% regardless of which side of the racing pair returns first. That flake-to-stable transition is the regression signal. The replacement test (concurrent_merge_clean_409_does_not_poison_next_change_on_target) tried to sharpen this by looping until the clean-409 path fired and then strictly requiring it. On fast CI hardware the race window never opens in 50 iterations, which made the strict variant fail in CI despite passing 10/10 locally. The bug genuinely needs a real concurrent writer to advance on-disk manifest during the swap window — a deterministic failpoint can't substitute because forcing the merge body to Err without a real concurrent writer leaves no cache-vs-disk drift to validate. Reverting to the matrix cell as the sole regression coverage. Updated the comment in merge.rs accordingly. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-11 21:57:47 +00:00
Devin AI	a6c7e5fab5	Use if-let shape for refresh outcome handling Switch from match-on-Result to if-let-Err so the refresh outcome and merge_result outcome are checked independently, making the intent clearer: 'attempt refresh; on Ok-merge-with-refresh-error propagate; on Err-merge-with-refresh-error log and surface the original merge error'. No semantic change — both shapes were valid (wildcard patterns don't move the scrutinee) — but the if-let form sidesteps a needs-second-reading question raised in code review. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-11 21:50:26 +00:00
Devin AI	7d1a40102c	Address review feedback merge.rs: best-effort refresh on the Err path so a refresh-time storage error doesn't replace the merge body's structured error (typically the manifest_conflict that the HTTP layer maps to a 409 with a structured payload) with a less informative one. Ok-path behavior is unchanged — there a refresh failure is propagated so the caller knows the coord's cache is unsynced. server.rs: bump MAX_ITERATIONS to 50 and assert at the end that the named clean-409 path actually fired at least once. With ~20% per-iter rate on shared-CPU CI (per the original MR-923 repro), P(no hit in 50) is < 0.002%. Without this assertion the test silently degraded to exercising only the 200-merge path — covered already by the matrix cell. Both changes per Devin Review + cubic comments on PR #80. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-11 21:35:18 +00:00
Devin AI	b7353e1dc7	Use refresh_coordinator_only to avoid racing branch_merge's sidecar The previous fix used `self.refresh()` to sync the restored coordinator's cache after the swap-restore window. `refresh` runs the `RollForwardOnly` recovery sweep — which, on the merge Err path with a phase-B failure (sidecar written, per-table HEAD advanced, manifest publish skipped), would observe the merge's own in-flight sidecar and close it here. That violates the contract documented on `Omnigraph::refresh`: > Engine-internal callers that already hold an in-flight sidecar > (e.g. `schema_apply` mid-write) MUST use `refresh_coordinator_only` > to avoid the recovery sweep racing their own sidecar. The post-restore step's purpose is to sync the coord cache with disk, not to run recovery, so `refresh_coordinator_only` is the right primitive on both paths. CI surfaced this via `branch_merge_phase_b_failure_recovered_on_next_open` in `crates/omnigraph/tests/failpoints.rs`, which asserts the sidecar persists after the failpoint fires. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-11 21:09:44 +00:00
Devin AI	e91d5615c6	Fix MR-923: refresh restored coordinator on merge Err path branch_merge_impl swaps the coordinator for the merge target, runs the merge body, then restores the original coordinator. A concurrent /change on the same target during this window publishes against the swapped coord, advancing on-disk manifest state that the restored coord doesn't see. The post-restore refresh was previously gated on merge_result.is_ok(), so the clean-409 path (merge body's post_queue_snapshot drift check returning a recoverable conflict) left the restored coord's cached snapshot stale relative to disk. The next sequential /change seeded its publisher expected_versions from that stale cache and 409'd with ExpectedVersionMismatch — a non-retryable conflict surfaced to a caller with no concurrent writer of their own. Refresh on both Ok and Err paths so cached state cannot diverge from the manifest across the swap-restore window. Add a focused regression test (concurrent_merge_clean_409_does_not_poison_next_change_on_target) that loops the cell-d scenario until the clean-409 branch fires and asserts the follow-up sentinel succeeds in that branch specifically. Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-11 20:31:18 +00:00
Devin AI	fca2b74dee	Materialize external blob URIs during branch merge Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-11 12:54:04 +00:00
Devin AI	da89e18e62	Merge main into blob merge fix Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-10 21:55:02 +00:00
Ragnor Comerford	19e9292ec0	Merge pull request #75 from ModernRelay/ragnorc/mr-686-lance Per-table writer queues + per-actor admission + op-kind-aware version check	2026-05-10 23:50:56 +02:00
Devin AI	7a338a8223	agents: keep guide short for context	2026-05-10 14:41:02 +00:00
Devin AI	4eb865b340	docs: expand 0.4.2 release notes	2026-05-10 14:37:58 +00:00
Devin AI	e44a4704eb	docs: fix admission gating description	2026-05-10 14:16:26 +00:00
Devin AI	a42d178119	release: prepare omnigraph 0.4.2	2026-05-10 14:02:28 +00:00
Devin AI	31b8ffe7b5	engine: inline-delete sidecar covers version-mismatch check	2026-05-10 10:37:46 +00:00
Devin AI	01660faa26	Tighten blob descriptor validation Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-10 09:28:44 +00:00
Devin AI	16ac166059	Fix branch merge with blob columns Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>	2026-05-09 22:33:29 +00:00
Devin AI	6a3f0677ae	server: drop unwired try_admit_rewrite / 503 admission surface	2026-05-09 20:58:17 +00:00
Devin AI	4bb7964af9	tests: matrix cell k asserts post-reopen row count	2026-05-09 20:16:44 +00:00
Devin AI	708e170dc5	engine: branch-merge revalidates target snapshot under queue	2026-05-09 20:16:12 +00:00
Devin AI	a6d244e648	engine: strict drift check uses read-time pin	2026-05-09 20:06:25 +00:00
Ragnor Comerford	f9a0f31f80	server: drop 503 from OpenAPI on admission-gated endpoints (unreachable) Cursor Bugbot LOW on commit `3ad359d`: try_admit_rewrite is defined and tested but no HTTP handler calls it; the six handler OpenAPI annotations declared status = 503 (added in `8e1a8e7`) but try_admit (the only path handlers invoke) returns 429 only. 503 was unreachable. Fix: remove (status = 503, ...) from the six handler OpenAPI annotations and regenerate openapi.json. Kept as forward-looking infrastructure: try_admit_rewrite, global rewrite semaphore, RejectReason::GlobalRewriteExhausted, ApiError::ServiceUnavailable, the 503 branch in IntoResponse, --global-rewrite-cap, and OMNIGRAPH_GLOBAL_REWRITE_MAX. When a future commit wires try_admit_rewrite into a handler, the 503 OpenAPI annotation lands alongside that wiring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 21:54:24 +02:00
Ragnor Comerford	3ad359db8b	tests: admission test uses new_with_workload, drops env mutation + #[serial] Migrates `ingest_per_actor_admission_cap_returns_429` from env-var override to direct `WorkloadController::new(1, ...)` construction via `AppState::new_with_workload`. Removes the `EnvGuard` and the `#[serial]` annotation that paired with it. Why correct by design (AGENTS.md rule 9): the previous round's matrix fix (commit `8bd9a5f`) shielded the matrix from this test's env mutation, but the broader bug class — "test A's process-wide env mutation can leak into any test B that calls `AppState::open` / `WorkloadController::from_env()`" — was still reachable by any future test that didn't think to opt out. Closing the class at the source: this test no longer mutates global state at all, so no other test needs to defend against it. Net effect: - This test no longer needs `#[serial]` (was the only reason it was marked) — runs in parallel with the rest of the suite. - The matrix's defensive `with_defaults()` construction (commit `8bd9a5f`) remains correct but is no longer required for correctness; it's now a "belt and suspenders" guard against any FUTURE env-mutating test. Verified locally: both tests pass when run together; full server suite (44 tests) green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:35:41 +02:00
Ragnor Comerford	8bd9a5ff14	tests: matrix harness uses with_defaults() workload, not from_env() Round 4 CI failure: Test Workspace and server-aws both red on `concurrent_branch_ops_morphological_matrix` cell b ("merge × merge: same-target-distinct-sources") — second merge returned 429 instead of 200. The matrix passes locally. Root cause: cargo test runs tests in parallel by default. The admission test `ingest_per_actor_admission_cap_returns_429` is wrapped with `#[serial]` and an EnvGuard that sets `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1` for its duration. Process-wide env vars are visible to concurrently-running tests; the matrix's `Harness::new()` called `AppState::open()` which delegates to `WorkloadController::from_env()`, picking up cap=1 if it ran while the admission test held the EnvGuard. With cap=1 + 2 concurrent merges in cell b, one merge waits behind merge_exclusive while the other is admitted; the waiter holds its admission permit, but a fresh actor permit is needed when admission is per-actor — the second merge's permit acquisition fails because the first hasn't released yet, and 429 fires. Fix (correct by design, AGENTS.md rule 9): the matrix harness builds the WorkloadController explicitly via `WorkloadController::with_defaults()` and passes it to `AppState::new_with_workload`, the constructor added in commit `22d76db`. Closes the bug class "tests pick up another concurrent test's env override at construction time" — the matrix is now insulated from any env-var manipulation in the rest of the test suite. Verified locally: with `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1` set in the environment, the matrix passes (it ignores env entirely now). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:19:42 +02:00
Ragnor Comerford	99b0941478	tests: remove three narrow concurrent_branch_* tests subsumed by T1 The previous commit added `concurrent_branch_ops_morphological_matrix` covering 11 cells with stronger assertions (identity + post-op /change + reopen). The three narrow tests it replaces: - concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator → matrix cell f, with identity assertions added - concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other → matrix cells a + b + c, with identity assertions that close the symmetric-swap blind spot cubic flagged on commit `64f2b99` - concurrent_change_during_branch_merge_preserves_writes → matrix cell d The matrix retains the original tests' diagnostic granularity through named cell labels in every assertion message ("[a:merge×merge:distinct-targets] merge a"), so a CI failure points to the exact cell + invariant. Net: 522 lines removed, 0 coverage lost. All other server tests pass unchanged (44 total). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:09:21 +02:00
Ragnor Comerford	ac8594462e	tests: branch-ops morphological matrix (T1) Replaces three narrow concurrent_branch_* tests (folded in below) with one parameterized matrix test covering 11 representative (op_a, op_b, target_overlap) cells, asserting C1-C6 uniformly: C1 — both complete (no deadlock; tokio::time::timeout(15s)) C2 — status: both 200 or exactly one clean conflict; never 500 C3 — per-target row count C4 — per-target row identity (named persons present + absent — catches the symmetric-swap class that count assertions miss; cubic P2 on commit `64f2b99` flagged this gap on the round-3 merge race test) C5 — engine state coherent (subsequent /snapshot consistent) C6 — post-op /change on main succeeds (engine isn't poisoned) Cells: a. Merge × Merge, distinct targets — branch_merge_impl race pin b. Merge × Merge, same target / distinct sources — merge_exclusive serialization c. Merge × Merge, same source / distinct targets — fanout d. Merge × Change, into target — per-(table, branch) queue e. Merge × BranchCreateFrom, target — interaction with refresh path f. BranchCreateFrom × BranchCreateFrom, distinct parents — round-1 race pin g. BranchCreateFrom × BranchDelete, unrelated branches — disjoint state h. BranchDelete × BranchDelete, distinct branches — concurrent refresh i. BranchDelete × Change, distinct branch — refresh-side vs writer j. BranchCreateFrom × Change, on source — fork-while-writing k. Reopen consistency after concurrent pair — disk-vs-cache drift Each cell: - spins up its own tempdir + AppState so failures don't cascade, - aligns the pair at a tokio::sync::Barrier so both reach the engine close in time, - wraps in a 15s deadlock timeout, - asserts identity via a /read with the `get_person` fixture query (specific names must be present on the right branch and absent from the wrong one). Subsumes: - concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator (now cell f, with identity assertions added) - concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other (now cells a + b + c, with identity assertions; the symmetric-swap blind spot cubic flagged on commit `64f2b99` is closed) - concurrent_change_during_branch_merge_preserves_writes (now cell d) Those three narrow tests are removed in the next commit so this lands green standalone. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 20:07:37 +02:00
Ragnor Comerford	64f2b994f5	bench: assert --heavy-concurrency > 0 instead of silently clamping Closes the cubic P2 finding on commit `22d76db`: `Semaphore::new(concurrency.max(1))` silently coerced --heavy-concurrency=0 to 1, so the JSON output reported 0 while execution actually used 1. Reported settings differed from actual. Adds an explicit `--heavy-concurrency > 0` check in `main()` (with a helpful error message pointing to --heavy-batches=0 as the way to disable heavy traffic) and a defensive `assert!()` inside `drive_heavy_actor` so future callers can't pass 0 silently. Verified: `bench_actor_isolation --heavy-concurrency 0` exits with code 2 and the explanatory error message. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 19:23:02 +02:00
Ragnor Comerford	3e6b2af4e9	engine: serialize concurrent branch merges via merge_exclusive mutex Closes the Cursor Bugbot HIGH on commit `22d76db` (round 2 review): `branch_merge_impl` at `crates/omnigraph/src/exec/merge.rs:1085-1100` still used the swap_coordinator_for_branch + operate + restore_coordinator pattern across three separate `coordinator.write().await` acquisitions. Two concurrent merges with distinct targets would interleave their swaps, leaving each merge's body running against the other's swapped coord — A's `feature_a → target_a` would land its rewrite in target_b instead. Adds `merge_exclusive: Arc<tokio::sync::Mutex<()>>` to `Omnigraph`, held across the entire swap → operate → restore window in `branch_merge_impl`. Concurrent branch merges now serialize relative to each other; everything else (per-(table, branch) writer queues, /change, /ingest) is unaffected. Why the mutex rather than the deeper "operate on local coord" refactor (the round-1 fix shape applied to `branch_create_from`): `branch_merge_on_current_target` calls `self.snapshot()` and `self.ensure_commit_graph_initialized()` internally, which use `self.coordinator` directly. Threading an explicit target coord parameter through the merge body would unwind dozens of call sites. The mutex is a smaller intrusion that fully closes the race. Documented as a follow-up if telemetry shows merge concurrency matters. Pinned by `concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other` (previous commit). Pre-fix: M=4 iterations of concurrent merges deterministically corrupted target row counts. Post-fix: all M iterations land each merge on its declared target. The two adjacent branch concurrency tests (`concurrent_change_during_branch_merge_preserves_writes`, `concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator`) still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 19:14:54 +02:00
Ragnor Comerford	2b2e723125	tests: pin branch_merge swap-restore race (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix. Cursor Bugbot HIGH on commit `22d76db` rediscovered the residual flagged in the round 1 honest-review note: `branch_merge_impl` at `crates/omnigraph/src/exec/merge.rs:1085-1100` still uses the swap_coordinator_for_branch + operate + restore_coordinator pattern across three separate `coordinator.write().await` acquisitions. The same shape that branch_create_from_impl shed in commit `4ffbf6e`. The test spawns two concurrent /branches/merge calls A (feature-a → target-a) and B (feature-b → target-b) aligned at a tokio::sync::Barrier so both reach swap_coordinator_for_branch close in time. M=4 iterations boost race-catching odds. Currently fails on `22d76db` with target-a=5, target-b=4: B's merge landed on the wrong coord — target-b never got Frank because A's swap pushed self.coordinator to target-a, B's swap captured target-a as B's "previous", and B's restore set self.coordinator back to target-a (not the original main). Subsequent operations using self.coordinator point at the wrong branch. Fix lands in the next commit: serialize concurrent branch merges via `merge_exclusive: Arc<tokio::sync::Mutex<()>>` held across the entire swap-operate-restore window. Closes the bug class "non-atomic three-step coordinator manipulation" for branch_merge by serializing merges relative to each other; per-(table, branch) queue inside the merge body still lets merges and other writers run concurrently. A deeper "operate on local coord" refactor (the round-1 fix shape for branch_create_from) requires unwinding `branch_merge_on_current_target` and its uses of `self.snapshot()` / `self.ensure_commit_graph_initialized()`; deferred to a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 19:12:03 +02:00
Ragnor Comerford	22d76dbb40	server+bench: AppState::new_with_workload; bench drops set_var, exercises heavy cap Two cubic findings on bench_actor_isolation.rs flagged together: P2 (lib.rs:202): `unsafe { std::env::set_var(...) }` ran inside `#[tokio::main] async fn main()` AFTER the multi-thread tokio runtime was up. Rust 2024 made `set_var` unsafe because libc's `setenv` is not thread-safe; concurrent env reads from logging or runtime internals can race or read torn state. Fix (correct by design, AGENTS.md rule 9): add a public `AppState::new_with_workload(uri, db, bearer_tokens, workload)` constructor that takes a caller-built `WorkloadController`. Tests and benches override per-actor caps via the constructor instead of mutating global env. Closes the bug class "tests need to mutate global env to override AppState defaults." P2 (lib.rs:130): heavy actor's `oneshot.await` inside the loop serialized — heavy in-flight count was always 1, so cap=1 never tripped on the heavy side. The bench validated isolation (light p99 bounded) but didn't demonstrate the rejection path. Fix: add a `--heavy-concurrency` arg (default 4) and spawn batches as concurrent tokio tasks bounded by an internal semaphore. With heavy_concurrency=4 and inflight_cap=1, the bench now reports heavy_too_many_requests > 0 and heavy_ok == 1 at peak — proving the gate fires for the heavy actor. Sample run on local FS (4 light actors × 30 ops, 20 heavy batches × 50 rows, heavy_concurrency=4, cap=1): heavy_ok: 1 heavy_too_many_requests: 19 light_ok: 120 light_too_many_requests: 0 light_p99: 565 ms (target < 2 s) Heavy saturates its own cap; light actors are completely unaffected. The isolation property is now empirically proven by the rejection counts rather than just by the latency tail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:57:42 +02:00
Ragnor Comerford	8e1a8e7d55	server: document 429 / 503 in admission-gated endpoint OpenAPI responses Closes the cubic finding (P2) at lib.rs:1061: the new admission gates add HTTP 429 / 503 failure paths but the affected endpoint `#[utoipa::path(... responses(...) ...)]` annotations weren't updated. Also closes a pre-existing miss on /change (admission-gated since PR 2 Step F). Adds (status = 429, ...) and (status = 503, ...) to all six admission-gated endpoints: - POST /change (operation_id = "change") - POST /schema/apply (operation_id = "applySchema") - POST /ingest (operation_id = "ingest") - POST /branches (operation_id = "createBranch") - DELETE /branches/{branch} (operation_id = "deleteBranch") - POST /branches/merge (operation_id = "mergeBranches") The descriptions reference the `Retry-After` header, which the `IntoResponse for ApiError` impl emits on both codes (added in commit `c745dd6`). openapi.json regenerated via OMNIGRAPH_UPDATE_OPENAPI=1; the openapi sentinel test passes both with the regen flag and in strict-check mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:49:02 +02:00
Ragnor Comerford	7fc00142a4	engine: scope refresh() write guard to recovery; release before schema reload Closes the HIGH-severity deadlock flagged by Cursor Bugbot on PR #75 review of commit `b09a097`. Pre-fix: `Omnigraph::refresh()` held `coordinator.write().await` from omnigraph.rs:468 through function exit, including across the call to `reload_schema_if_source_changed()` at line 484. That helper's `self.coordinator.read().await` (only reached when on-disk schema source differs from in-memory cache) deadlocked against the outer write guard because tokio's RwLock is non-reentrant. Reachable from `branch_delete` (omnigraph.rs:910) and `branch_merge` (post-merge refresh at merge.rs:1100). Cross-handle scenario: handle A calls apply_schema, handle B's stale cache hits the reload path on its next refresh. Why correct by design (AGENTS.md rule 9): the write guard's purpose is to serialize the recovery sweep's mutation of GraphCoordinator; the schema reload reads coord.branch_list() and stores into the ArcSwap'd schema_source / catalog without touching the coord. The two operations have disjoint lock requirements; coupling them was over-locking. Scoping the guard matches the natural data-flow: snapshot recovery state under the write, release, then reload schema using a fresh read on the same lock. Pinned by `composite_flow_schema_apply_then_branch_ops_no_deadlock_in_refresh` (previous commit). Pre-fix: 15s timeout fires. Post-fix: completes in 0.25s. Both other composite_flow tests still pass: canonical_lifecycle and multi_branch_sequential_merges. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:47:08 +02:00
Ragnor Comerford	8686b1deed	tests: pin refresh() deadlock after schema_apply (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix so the red → green pair is visible in git log. Cursor Bugbot flagged the deadlock at HIGH severity on commit `b09a097`: `Omnigraph::refresh()` holds `coordinator.write().await` from omnigraph.rs:468 through function exit, including across the call to `reload_schema_if_source_changed()` at line 484. That helper, when the on-disk schema source differs from the in-memory cache, attempts `self.coordinator.read().await` at line 496. Tokio's RwLock isn't reentrant — the read blocks waiting for the write to release, the write isn't released until refresh() returns. Hard hang. Reachable from `branch_delete` (omnigraph.rs:910 calls `self.refresh()`) and `branch_merge_as` (post-merge refresh at merge.rs:1100). Cross-handle setup is the realistic trigger: handle A applies a schema, advancing _schema.pg on disk and updating A's ArcSwap cache in-line; handle B has stale in-memory schema_source. B's next refresh() (here via branch_delete) hits the read-after-write reload path because B's cache no longer matches disk. Single-handle is unreachable since apply_schema updates the local cache atomically. Test currently fails on `b09a097` with the timeout firing at 15s, proving branch_delete hung. The next commit scopes the write guard to the recovery section so reload_schema_if_source_changed runs without the write held — uncontested read acquisition, no deadlock. The test extends `composite_flow.rs` with a broader sequence (apply_schema → branch_create → branch_delete → branch_merge → mutate with new column → reopen) so the post-fix path's correctness is pinned alongside the deadlock pin per the user's request. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:46:07 +02:00
Ragnor Comerford	b09a0972cb	bench: add actor-isolation harness for WorkloadController Empirical proof of MR-686's central design promise: per-actor admission control isolates noisy actors from light traffic. The existing bench_concurrent_http harness measures aggregate throughput; this harness measures the latency tail seen by light actors while a heavy actor saturates its own per-actor cap. Setup: one "heavy" actor flooding /ingest with multi-row NDJSON batches; N "light" actors each running short bursts of /change inserts, each authenticating with a distinct bearer token so the WorkloadController accounts them as separate identities. Output: heavy throughput / 429 count, light p50/p95/p99/max latency. Acceptance heuristic on local FS: light-actor p99 < 2 s while the heavy actor saturates its own cap. Sample run on local FS, cap=1, 4 light actors x 30 ops, 20 heavy batches x 50 rows: light p99 = 710 ms, light errors = 0 (well under the 2 s acceptance target). The test demonstrates the isolation property — the heavy /ingest holds its own admission slot but doesn't affect light actors since they have separate per-actor state. Usage: cargo run --release -p omnigraph-server --example bench_actor_isolation -- \ --light-actors 4 --light-ops-per-actor 30 \ --heavy-batches 20 --heavy-rows-per-batch 50 \ --inflight-cap 1 \ --output .context/bench-results/after-pr2-phase2/actor-isolation.json Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:12:50 +02:00
Ragnor Comerford	976aa0ec1d	tests: pin concurrent /change + branch_merge interleave preserves writes Future-proofs against MR-895 work that may move or remove the per-(table, branch) writer queue acquisition inside `branch_merge` (`crates/omnigraph/src/exec/merge.rs:1224`). Today the queue linearizes a concurrent /change on main against a `branch_merge feature → main` on the same touched tables; both succeed and the inserted row is preserved post-merge. Codex flagged this scenario as a P1 in PR #75 review claiming the merge could silently overwrite concurrent target writes because the source-rewrite path opens with `MutationOpKind::Merge` (skipping the strict pre-stage check). Validation showed the queue at merge.rs:1224 is held across both Phase B (per-table commit_staged) and Phase C (manifest publish), so there's no interleave window. The Merge op_kind only affects same-process pre-stage drift detection, not cross-write linearization. The test passes on `f925ad1`; landing it as a regression sentinel catches future changes that drop the queue acquisition. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:03:05 +02:00
Ragnor Comerford	5520ab72ff	tests: pin disjoint /change concurrency at HTTP level Closes the cubic acceptance-criteria gap (❌ "Integration test: two /change requests targeting different (table_key, branch) execute concurrently end-to-end"). The bench harness measures the throughput side; this test is the regression sentinel that catches a future change which accidentally re-introduces graph-wide serialization on the disjoint path. Spawns 4 concurrent /change inserts on node:Person and 4 on node:Company. All 8 must return 200, and the post-test row counts on each table must reflect every insert. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 17:01:52 +02:00
Ragnor Comerford	6ef07386d3	docs+engine: refresh server.md rate-limiting note; cache version() TOCTOU Two cleanups bundled because they're both single-line, post-MR-686 hygiene flagged by cubic during PR review: - docs/server.md:102 said "Rate limiting — none" while the new admission-control section earlier in the file documents 429s on the five mutating handlers. Replace with a pointer to the admission section and clarify that no graph-wide rate limiter is wired. - schema_apply.rs:445-451 called `db.version().await` twice — once for the conditional check, once in the error format string — creating a cosmetic TOCTOU under interior mutability. Cache the result in `current_manifest_version` so the error message reflects the version that triggered the rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:59:45 +02:00
Ragnor Comerford	c745dd69ae	server: emit Retry-After header on 429 / 503 responses Closes the doc-vs-code gap at api.rs:343 and lib.rs:344-355: the documentation claims `Retry-After` is set on TooManyRequests / ServiceUnavailable responses, but `IntoResponse for ApiError` emitted only `(StatusCode, Json(ErrorOutput))` — no header. Wires a constant `RETRY_AFTER_SECONDS = "60"` for both 429 and 503 codes. Plumbing per-RejectReason durations through is a follow-up; the admission rejects we surface today recover bounded by request handler duration rather than calendar wait, so a constant suffices. Pinned by `ingest_per_actor_admission_cap_returns_429`. Test now fully green: 1+ of 8 concurrent /ingest under cap=1 receives 429 with Retry-After: 60. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:58:47 +02:00
Ragnor Comerford	05a8bd5de1	server: gate /ingest /branches/* /schema/apply on per-actor admission Closes the gap that admission control only fired on /change. A heavy actor sending bulk-ingest traffic could exhaust shared engine capacity (Lance I/O threads, manifest churn) without hitting the per-actor cap. Wires `state.workload.try_admit(&actor_arc, est_bytes)` into the five remaining mutating handlers AFTER Cedar authorization (so denied requests don't consume admission slots) and BEFORE the engine call. Byte estimates per handler: - /ingest: request.data.len() (NDJSON body) - /schema/apply: request.schema_source.len() - /branches/create, /branches/delete, /branches/merge: 256 (small JSON; the heavy work is bounded per-(table, branch) by the engine's writer queue rather than by request size) The admission guard is held in `let _admission = ...` so it stays alive until handler return, releasing the count permit + decrementing the byte budget on drop. Pinned by `ingest_per_actor_admission_cap_returns_429` (previous commit). The test still fails on the Retry-After header assertion; the next commit emits the header. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:57:53 +02:00
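The "guard held until handler return" behaviour in the commit above relies on Rust's drop semantics. A minimal sketch, assuming a count-only cap (the byte budget is omitted) and hypothetical `Controller` and `Admission` types that are not the real `WorkloadController` API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

// Hypothetical RAII permit: releases its slot when dropped.
struct Admission {
    inflight: Arc<AtomicUsize>,
}

impl Drop for Admission {
    fn drop(&mut self) {
        self.inflight.fetch_sub(1, Ordering::SeqCst);
    }
}

// Hypothetical per-actor controller with a fixed in-flight cap.
struct Controller {
    inflight: Arc<AtomicUsize>,
    cap: usize,
}

impl Controller {
    fn new(cap: usize) -> Self {
        Controller { inflight: Arc::new(AtomicUsize::new(0)), cap }
    }

    // Returns a permit if a slot is free, or None when the cap is reached
    // (the case a server would map to HTTP 429).
    fn try_admit(&self) -> Option<Admission> {
        let mut current = self.inflight.load(Ordering::SeqCst);
        loop {
            if current >= self.cap {
                return None;
            }
            match self.inflight.compare_exchange(
                current,
                current + 1,
                Ordering::SeqCst,
                Ordering::SeqCst,
            ) {
                Ok(_) => return Some(Admission { inflight: Arc::clone(&self.inflight) }),
                Err(actual) => current = actual, // Lost a race; retry with the fresh count.
            }
        }
    }

    fn inflight(&self) -> usize {
        self.inflight.load(Ordering::SeqCst)
    }
}
```

Binding the permit to a named variable such as `_admission` keeps it alive until the end of the enclosing scope, whereas `let _ = ...` would drop it immediately and release the slot before the engine call runs.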
Ragnor Comerford	0976cbebc5	tests: pin /ingest admission gate + 429 Retry-After (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix. Currently fails on `f925ad1` with 8/8 statuses returning 200 because /ingest does not call WorkloadController::try_admit. The test pins: - /ingest is gated on per-actor admission control (returns 429 when the cap is exceeded). - 429 responses carry the structured `code: too_many_requests` error body so clients can distinguish them from generic conflicts. - 429 responses include a `Retry-After` header so clients can implement bounded backoff. The doc claim at api.rs:343 and lib.rs:344 was that this header exists; the IntoResponse impl currently emits no headers. Two follow-up commits will turn this green: 1. Wire WorkloadController::try_admit on /ingest and the four other mutating handlers (Block 2.1). 2. Emit the Retry-After header on 429/503 responses (Block 2.2). The test uses #[serial] + EnvGuard to override OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1 without racing parallel tests, then spawns 8 concurrent /ingest tasks aligned at a tokio::sync::Barrier so multiple tasks reach try_admit close in time. With cap=1, at least one must be rejected. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:57:01 +02:00
Ragnor Comerford	c263732b1a	tests: extend same-key insert test with /snapshot row-count assertion The existing change_concurrent_inserts_same_key_serialize_without_409 test claimed in its comment "asserts the final row count equals N" but only checked HTTP status codes. cubic flagged the gap; this commit adds the actual /snapshot read after the concurrent inserts to verify all N batches landed (no silent overwrite) by comparing the post-test node:Person row_count against SEED + N. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:49:38 +02:00
Ragnor Comerford	4ffbf6ec61	engine: drop swap-restore in branch_create_from; operate on local coord Closes the swap-restore race in `branch_create_from_impl` by simply not touching `self.coordinator` at all. Open the source-branch coordinator locally, call `branch_create` on it, drop it. The new branch is durable on disk via the manifest write that `GraphCoordinator::branch_create` issues on its own commit graph; subsequent reads of any coord will see it after their normal manifest refresh. Pre-fix: `branch_create_from_impl` ran swap → operate → restore as three separate `coordinator.write().await` acquisitions. Under `&self` concurrency, two callers with distinct source branches could interleave their swaps, leaving each caller's "operate" step running against the other's swapped coordinator and forking the new branch off the wrong HEAD. Pinned by `concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator` (previous commit) which deterministically reproduced the race with 8/8 forks landing on the wrong parent. Why correct by design (AGENTS.md rule 9): closing the bug class "non-atomic three-step coordinator manipulation under &self callers" by removing the manipulation entirely. There's no scratch-space race to lose because there's no scratch space. Note: `branch_merge_impl` at `crates/omnigraph/src/exec/merge.rs:1085-1100` keeps the same swap-restore pattern. Its inner `branch_merge_on_current_target` calls `self.snapshot()` and `self.ensure_commit_graph_initialized()` which acquire the coord lock independently, so the simple "operate on local coord" refactor doesn't compose without a deeper interface change. The per-(table, branch) writer queue inside the merge body (`crates/omnigraph/src/exec/merge.rs:1224`) bounds the damage in practice; a deterministic regression for concurrent merges is tracked under Block 3.1 of the plan. 
`swap_coordinator_for_branch` and `restore_coordinator` remain crate-internal for now (still used by `branch_merge_impl`); a follow-up can remove them if the merge path is similarly refactored. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:48:17 +02:00
Ragnor Comerford	3b33e9ac56	tests: pin branch_create_from swap-restore race (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix so the red → green pair is visible in git log. The test demonstrates that two concurrent `POST /branches` calls with distinct `from` parents corrupt coordinator state: A's "operate" step runs against B's swapped coordinator instead of its own, forking the new branch off the wrong parent's HEAD. Currently fails on `f925ad1` with all 8 gamma branches (declared parent: alpha, 5 rows) reporting 4 rows — beta's row count. The operate step ran against beta's coord because B's swap interleaved between A's swap and A's operate. Fix lands in the next commit: hold a single `coordinator.write().await` guard across the entire swap-operate-restore sequence in `branch_create_from_impl` so the three steps are atomic relative to other callers. Closes the bug class "non-atomic three-step coordinator manipulation under &self callers" rather than guarding the specific call site — the right architectural seam (single critical section per swap-restore sequence) eliminates the interleave window for branch_create_from and any future swap-restore caller. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:44:50 +02:00
Ragnor Comerford	4ca527cc53	staging: op-kind-aware drift check at commit_all entry Closes the bug class "Lance internal conflict surfaces as 500 instead of 409" for in-process concurrent strict-op writers on the same row. Pre-fix: in `MutationStaging::commit_all`, after queue acquisition, the staged Lance transaction (built against V0) was handed straight to `commit_staged`. When Lance HEAD has advanced past V0 (because the queue's prior winner already published), Lance's transaction conflict resolver fires `RetryableCommitConflict` for Update vs Update on the same row, which wraps as `OmniError::Lance(<string>)` and the API maps it to HTTP 500. Users see "internal server error" instead of a clean retryable conflict. Fix: track the strictest `MutationOpKind` per touched table on `MutationStaging` and propagate through `StagedMutation`. In `commit_all`'s recapture loop, before each `commit_staged`, fail-fast with `OmniError::manifest_expected_version_mismatch` (→ HTTP 409 ExpectedVersionMismatch) for tables whose tracked op_kind has `strict_pre_stage_version_check() == true` (Update/Delete/SchemaRewrite) when the staged dataset's version doesn't match the fresh manifest pin under the queue. Insert/Merge tables skip the check — concurrent inserts on disjoint keys legitimately coexist via Lance's auto-rebase, so the check would over-reject the existing same-key insert path. Threading: `ensure_path` now takes `op_kind` and stores it in a new `op_kinds: HashMap<String, MutationOpKind>` on `MutationStaging`, with strictness-upgrade semantics so mixed insert+update on the same table still fires the strict check at commit time. `StagedMutation` carries `op_kinds` through to `commit_all`. Pinned by `change_concurrent_updates_same_key_serialize_via_publisher_cas` in `crates/omnigraph-server/tests/server.rs` (added in the previous commit). 
All Phase 2 sentinels still pass: change_concurrent_inserts_same_key_serialize_without_409, change_conflict_returns_manifest_conflict_409, branch_merge_conflict_response_includes_structured_conflicts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:42:14 +02:00
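The strictness-upgrade tracking and commit-time drift check described in the commit above can be sketched roughly as below. All names here (`OpKind`, `Staging`, `touch`, `check`) are hypothetical simplifications, and the error is a plain `String` rather than the engine's `OmniError`; only the strict/non-strict split (Update, Delete, SchemaRewrite versus Insert, Merge) is taken from the commit message.

```rust
use std::collections::HashMap;

// Hypothetical simplification of the op kinds named in the commit message.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum OpKind {
    Insert,
    Merge,
    Update,
    Delete,
    SchemaRewrite,
}

impl OpKind {
    // Strict kinds require the staged version to match the manifest pin.
    fn strict(self) -> bool {
        matches!(self, OpKind::Update | OpKind::Delete | OpKind::SchemaRewrite)
    }
}

#[derive(Default)]
struct Staging {
    op_kinds: HashMap<String, OpKind>,
}

impl Staging {
    // Record a touch on a table; a strict kind replaces a non-strict one,
    // but a later non-strict touch never downgrades a strict entry.
    fn touch(&mut self, table: &str, kind: OpKind) {
        self.op_kinds
            .entry(table.to_string())
            .and_modify(|existing| {
                if !existing.strict() && kind.strict() {
                    *existing = kind;
                }
            })
            .or_insert(kind);
    }

    // Fail fast when a strictly-tracked table has drifted from the pin.
    fn check(&self, table: &str, staged: u64, pinned: u64) -> Result<(), String> {
        match self.op_kinds.get(table) {
            Some(kind) if kind.strict() && staged != pinned => Err(format!(
                "expected version mismatch on {table}: staged {staged}, manifest {pinned}"
            )),
            _ => Ok(()),
        }
    }
}
```

With this shape, a table touched first by an insert and then by an update is checked strictly at commit time, while a table that only saw inserts skips the check and is left to the storage layer's own conflict handling.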
Ragnor Comerford	ebf5a5769d	tests: pin UPDATE RYW under in-process concurrency (red) Per AGENTS.md rule 8, this commit lands the failing regression test ahead of the fix so the red → green pair is visible in git log. The test asserts the RYW invariant for in-process concurrent UPDATEs on the same row: exactly one writer commits and N-1 receive 409 manifest_conflict. Currently fails on `f925ad1` with 1 x 200 + 7 x 500: > "storage: Retryable commit conflict for version 6: This Update > transaction was preempted by concurrent transaction Update at > version 6. Please retry." Lance's transaction conflict resolver correctly detects the Update vs Update race, but the error wraps as `OmniError::Lance(<string>)` and the API surfaces it as 500 internal rather than 409 retryable conflict. Users see "internal server error" for what is documented as a retryable conflict path. The fix lands in the next commit: an op-kind-aware drift check at the commit_all entry that returns 409 ExpectedVersionMismatch for tables whose first touch was Update / Delete / SchemaRewrite when the staged dataset version drifts from the manifest pin under the queue. Closes the bug class "Lance internal conflict surfaces as 500 instead of 409" rather than mapping the specific Lance error variant — the right architectural layer (engine boundary, under the queue) catches the drift before commit_staged ever runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-08 16:33:53 +02:00


277 commits