Commit graph

56 commits

Author SHA1 Message Date
Devin AI
e44a4704eb docs: fix admission gating description 2026-05-10 14:16:26 +00:00
Devin AI
a42d178119 release: prepare omnigraph 0.4.2 2026-05-10 14:02:28 +00:00
Devin AI
6a3f0677ae server: drop unwired try_admit_rewrite / 503 admission surface 2026-05-09 20:58:17 +00:00
Devin AI
4bb7964af9 tests: matrix cell k asserts post-reopen row count 2026-05-09 20:16:44 +00:00
Devin AI
708e170dc5 engine: branch-merge revalidates target snapshot under queue 2026-05-09 20:16:12 +00:00
Ragnor Comerford
f9a0f31f80
server: drop 503 from OpenAPI on admission-gated endpoints (unreachable)
Cursor Bugbot LOW on commit 3ad359d: try_admit_rewrite is defined and
tested but no HTTP handler calls it; the six handler OpenAPI
annotations declared status = 503 (added in 8e1a8e7) but try_admit
(the only path handlers invoke) returns 429 only. 503 was unreachable.

Fix: remove (status = 503, ...) from the six handler OpenAPI
annotations and regenerate openapi.json. Kept as forward-looking
infrastructure: try_admit_rewrite, global rewrite semaphore,
RejectReason::GlobalRewriteExhausted, ApiError::ServiceUnavailable,
the 503 branch in IntoResponse, --global-rewrite-cap, and
OMNIGRAPH_GLOBAL_REWRITE_MAX. When a future commit wires
try_admit_rewrite into a handler, the 503 OpenAPI annotation lands
alongside that wiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 21:54:24 +02:00
Ragnor Comerford
3ad359db8b
tests: admission test uses new_with_workload, drops env mutation + #[serial]
Migrates `ingest_per_actor_admission_cap_returns_429` from env-var
override to direct `WorkloadController::new(1, ...)` construction via
`AppState::new_with_workload`. Removes the `EnvGuard` and the
`#[serial]` annotation that paired with it.

Why correct by design (AGENTS.md rule 9): the previous round's matrix
fix (commit 8bd9a5f) shielded the matrix from this test's env
mutation, but the broader bug class — "test A's process-wide env
mutation can leak into any test B that calls
`AppState::open` / `WorkloadController::from_env()`" — was still
reachable by any future test that didn't think to opt out. Closing
the class at the source: this test no longer mutates global state at
all, so no other test needs to defend against it.

Net effect:
- This test no longer needs `#[serial]` (was the only reason it was
  marked) — runs in parallel with the rest of the suite.
- The matrix's defensive `with_defaults()` construction (commit
  8bd9a5f) remains correct but is no longer required for correctness;
  it's now a "belt and suspenders" guard against any FUTURE
  env-mutating test.

Verified locally: both tests pass when run together; full server
suite (44 tests) green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:35:41 +02:00
Ragnor Comerford
8bd9a5ff14
tests: matrix harness uses with_defaults() workload, not from_env()
Round 4 CI failure: Test Workspace and server-aws both red on
`concurrent_branch_ops_morphological_matrix` cell b
("merge × merge: same-target-distinct-sources") — second merge
returned 429 instead of 200. The matrix passes locally.

Root cause: cargo test runs tests in parallel by default. The admission
test `ingest_per_actor_admission_cap_returns_429` is wrapped with
`#[serial]` and an EnvGuard that sets
`OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1` for its duration. Process-wide
env vars are visible to concurrently-running tests; the matrix's
`Harness::new()` called `AppState::open()` which delegates to
`WorkloadController::from_env()`, picking up cap=1 if it ran while
the admission test held the EnvGuard. With cap=1 and two concurrent
merges in cell b, the first merge is admitted and takes the actor's
only in-flight permit, holding it while it runs under merge_exclusive.
The second merge needs its own permit from the same per-actor budget;
its acquisition fails because the first hasn't released yet, and 429
fires.

Fix (correct by design, AGENTS.md rule 9): the matrix harness builds
the WorkloadController explicitly via
`WorkloadController::with_defaults()` and passes it to
`AppState::new_with_workload`, the constructor added in commit
22d76db. Closes the bug class "tests pick up another concurrent test's
env override at construction time" — the matrix is now insulated from
any env-var manipulation in the rest of the test suite.

Verified locally: with `OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1` set in the
environment, the matrix passes (it ignores env entirely now).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:19:42 +02:00
Ragnor Comerford
99b0941478
tests: remove three narrow concurrent_branch_* tests subsumed by T1
The previous commit added `concurrent_branch_ops_morphological_matrix`
covering 11 cells with stronger assertions (identity + post-op /change
+ reopen). The three narrow tests it replaces:

- concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator
  → matrix cell f, with identity assertions added
- concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other
  → matrix cells a + b + c, with identity assertions that close the
    symmetric-swap blind spot cubic flagged on commit 64f2b99
- concurrent_change_during_branch_merge_preserves_writes
  → matrix cell d

The matrix retains the original tests' diagnostic granularity through
named cell labels in every assertion message ("[a:merge×merge:distinct-targets]
merge a"), so a CI failure points to the exact cell + invariant.

Net: 522 lines removed, 0 coverage lost. All other server tests pass
unchanged (44 total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:09:21 +02:00
Ragnor Comerford
ac8594462e
tests: branch-ops morphological matrix (T1)
Replaces three narrow concurrent_branch_* tests (folded in below) with
one parameterized matrix test covering 11 representative
(op_a, op_b, target_overlap) cells, asserting C1-C6 uniformly:

  C1 — both complete (no deadlock; tokio::time::timeout(15s))
  C2 — status: both 200 or exactly one clean conflict; never 500
  C3 — per-target row count
  C4 — per-target row identity (named persons present + absent — catches
       the symmetric-swap class that count assertions miss; cubic P2 on
       commit 64f2b99 flagged this gap on the round-3 merge race test)
  C5 — engine state coherent (subsequent /snapshot consistent)
  C6 — post-op /change on main succeeds (engine isn't poisoned)

Cells:

  a. Merge × Merge, distinct targets    — branch_merge_impl race pin
  b. Merge × Merge, same target / distinct sources    — merge_exclusive serialization
  c. Merge × Merge, same source / distinct targets    — fanout
  d. Merge × Change, into target    — per-(table, branch) queue
  e. Merge × BranchCreateFrom, target — interaction with refresh path
  f. BranchCreateFrom × BranchCreateFrom, distinct parents — round-1 race pin
  g. BranchCreateFrom × BranchDelete, unrelated branches — disjoint state
  h. BranchDelete × BranchDelete, distinct branches — concurrent refresh
  i. BranchDelete × Change, distinct branch — refresh-side vs writer
  j. BranchCreateFrom × Change, on source — fork-while-writing
  k. Reopen consistency after concurrent pair — disk-vs-cache drift

Each cell:
- spins up its own tempdir + AppState so failures don't cascade,
- aligns the pair at a tokio::sync::Barrier so both reach the engine
  close in time,
- wraps in a 15s deadlock timeout,
- asserts identity via a /read with the `get_person` fixture query
  (specific names must be present on the right branch and absent from
  the wrong one).
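
A minimal sketch of the per-cell alignment and the C1/C2 checks, assuming
std threads in place of tokio tasks. `run_pair` and `statuses_ok` are
hypothetical names, not the harness's API, and 409 as the "clean conflict"
status is an assumption:

```rust
use std::sync::{mpsc, Arc, Barrier};
use std::thread;
use std::time::Duration;

/// Runs two operations aligned at a barrier and returns their statuses.
/// Returns None if either result fails to arrive within `deadline`
/// (applied per result here, not as one overall budget).
pub fn run_pair<A, B>(op_a: A, op_b: B, deadline: Duration) -> Option<(u16, u16)>
where
    A: FnOnce() -> u16 + Send + 'static,
    B: FnOnce() -> u16 + Send + 'static,
{
    let barrier = Arc::new(Barrier::new(2));
    let (tx, rx) = mpsc::channel();

    let (barrier_a, tx_a) = (Arc::clone(&barrier), tx.clone());
    thread::spawn(move || {
        barrier_a.wait(); // both sides start their op close in time
        let _ = tx_a.send(('a', op_a()));
    });
    thread::spawn(move || {
        barrier.wait();
        let _ = tx.send(('b', op_b()));
    });

    let (mut a, mut b) = (None, None);
    for _ in 0..2 {
        match rx.recv_timeout(deadline) {
            Ok(('a', status)) => a = Some(status),
            Ok((_, status)) => b = Some(status),
            Err(_) => return None, // C1 violated: treat as a deadlock
        }
    }
    Some((a?, b?))
}

/// C2: both 200, or exactly one clean conflict (assumed 409); never 500.
pub fn statuses_ok(a: u16, b: u16) -> bool {
    match (a, b) {
        (200, 200) => true,
        (200, 409) | (409, 200) => true,
        _ => false,
    }
}
```

Spawning the pair from a fresh harness per cell keeps one cell's failure
from cascading into the next.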

Subsumes:
- concurrent_branch_create_from_distinct_parents_does_not_corrupt_coordinator
  (now cell f, with identity assertions added)
- concurrent_branch_merges_distinct_targets_do_not_swap_into_each_other
  (now cells a + b + c, with identity assertions; the symmetric-swap
  blind spot cubic flagged on commit 64f2b99 is closed)
- concurrent_change_during_branch_merge_preserves_writes
  (now cell d)

Those three narrow tests are removed in the next commit so this lands
green standalone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 20:07:37 +02:00
Ragnor Comerford
64f2b994f5
bench: assert --heavy-concurrency > 0 instead of silently clamping
Closes the cubic P2 finding on commit 22d76db: `Semaphore::new(concurrency.max(1))`
silently coerced --heavy-concurrency=0 to 1, so the JSON output reported
0 while execution actually used 1. Reported settings differed from
actual.

Adds an explicit `--heavy-concurrency > 0` check in `main()` (with a
helpful error message pointing to --heavy-batches=0 as the way to
disable heavy traffic) and a defensive `assert!()` inside
`drive_heavy_actor` so future callers can't pass 0 silently.
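
For illustration, the difference between clamping and rejecting can be
sketched as below. `clamped` and `validate_heavy_concurrency` are
hypothetical helper names, and the error text is a paraphrase rather than
the bench's exact message:

```rust
/// The old shape: 0 is silently coerced to 1, so the reported setting
/// and the executed setting can disagree.
pub fn clamped(concurrency: usize) -> usize {
    concurrency.max(1)
}

/// The new shape: 0 is rejected up front with an actionable message.
pub fn validate_heavy_concurrency(concurrency: usize) -> Result<usize, String> {
    if concurrency == 0 {
        return Err(
            "--heavy-concurrency must be > 0; use --heavy-batches=0 to disable heavy traffic"
                .to_string(),
        );
    }
    Ok(concurrency)
}
```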

Verified: `bench_actor_isolation --heavy-concurrency 0` exits with
code 2 and the explanatory error message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 19:23:02 +02:00
Ragnor Comerford
2b2e723125
tests: pin branch_merge swap-restore race (red)
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix.

Cursor Bugbot HIGH on commit 22d76db rediscovered the residual flagged
in the round 1 honest-review note: `branch_merge_impl` at
`crates/omnigraph/src/exec/merge.rs:1085-1100` still uses the
swap_coordinator_for_branch + operate + restore_coordinator pattern
across three separate `coordinator.write().await` acquisitions. This is
the same shape that branch_create_from_impl shed in commit 4ffbf6e.

The test spawns two concurrent /branches/merge calls A (feature-a →
target-a) and B (feature-b → target-b) aligned at a tokio::sync::Barrier
so both reach swap_coordinator_for_branch close in time. M=4
iterations boost race-catching odds.

Currently fails on 22d76db with target-a=5, target-b=4: B's merge
landed on the wrong coord — target-b never got Frank because A's
swap pushed self.coordinator to target-a, B's swap captured target-a
as B's "previous", and B's restore set self.coordinator back to
target-a (not the original main). Subsequent operations using
self.coordinator point at the wrong branch.

Fix lands in the next commit: serialize concurrent branch merges via
`merge_exclusive: Arc<tokio::sync::Mutex<()>>` held across the entire
swap-operate-restore window. Closes the bug class "non-atomic
three-step coordinator manipulation" for branch_merge by serializing
merges relative to each other; per-(table, branch) queue inside the
merge body still lets merges and other writers run concurrently.
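
The interleave can be reproduced deterministically in a single-threaded
sketch. `Engine`, `swap`, and `restore` are simplified stand-ins for the
real coordinator methods (an assumption), and the serialized variant
models what holding one mutex across the whole window achieves:

```rust
/// Simplified stand-in: the shared coordinator is just a branch name.
#[derive(Debug)]
pub struct Engine {
    pub coordinator: &'static str,
}

impl Engine {
    /// Step 1: point the coordinator at `branch`, returning the previous value.
    pub fn swap(&mut self, branch: &'static str) -> &'static str {
        std::mem::replace(&mut self.coordinator, branch)
    }

    /// Step 3: put back whatever the caller captured in step 1.
    pub fn restore(&mut self, previous: &'static str) {
        self.coordinator = previous;
    }
}

/// Racy schedule: A swaps, B swaps, A restores, B restores.
pub fn interleaved() -> &'static str {
    let mut engine = Engine { coordinator: "main" };
    let prev_a = engine.swap("target-a"); // captures "main"
    let prev_b = engine.swap("target-b"); // captures "target-a", not "main"
    engine.restore(prev_a);
    engine.restore(prev_b); // coordinator is left on "target-a"
    engine.coordinator
}

/// Serialized schedule: each swap-operate-restore window runs to completion.
pub fn serialized() -> &'static str {
    let mut engine = Engine { coordinator: "main" };
    for target in ["target-a", "target-b"] {
        let previous = engine.swap(target);
        // ... operate on `target` here ...
        engine.restore(previous);
    }
    engine.coordinator
}
```

In the interleaved schedule the coordinator ends on "target-a"; in the
serialized one it returns to "main".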

A deeper "operate on local coord" refactor (the round-1 fix shape for
branch_create_from) requires unwinding `branch_merge_on_current_target`
and its uses of `self.snapshot()` / `self.ensure_commit_graph_initialized()`;
deferred to a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 19:12:03 +02:00
Ragnor Comerford
22d76dbb40
server+bench: AppState::new_with_workload; bench drops set_var, exercises heavy cap
Two cubic findings on bench_actor_isolation.rs flagged together:

P2 (lib.rs:202): `unsafe { std::env::set_var(...) }` ran inside
`#[tokio::main] async fn main()` AFTER the multi-thread tokio runtime
was up. Rust 2024 made `set_var` unsafe because libc's `setenv` is
not thread-safe; concurrent env reads from logging or runtime
internals can race or read torn state.

Fix (correct by design, AGENTS.md rule 9): add a public
`AppState::new_with_workload(uri, db, bearer_tokens, workload)`
constructor that takes a caller-built `WorkloadController`. Tests and
benches override per-actor caps via the constructor instead of
mutating global env. Closes the bug class "tests need to mutate
global env to override AppState defaults."
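
A reduced sketch of the constructor-injection shape, assuming a
single-field controller. The real `WorkloadController::new` takes three
caps and `AppState::new_with_workload` also takes uri, db, and bearer
tokens; those are dropped here for brevity:

```rust
use std::env;

#[derive(Debug, Clone, PartialEq)]
pub struct WorkloadController {
    pub inflight_cap: usize,
}

impl WorkloadController {
    pub const DEFAULT_INFLIGHT_CAP: usize = 16;

    pub fn new(inflight_cap: usize) -> Self {
        Self { inflight_cap }
    }

    pub fn with_defaults() -> Self {
        Self::new(Self::DEFAULT_INFLIGHT_CAP)
    }

    /// Reads process-wide state, so concurrent tests can observe each
    /// other's overrides.
    pub fn from_env() -> Self {
        let cap = env::var("OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(Self::DEFAULT_INFLIGHT_CAP);
        Self::new(cap)
    }
}

pub struct AppState {
    pub workload: WorkloadController,
}

impl AppState {
    /// Production path: caps come from the environment.
    pub fn open() -> Self {
        Self { workload: WorkloadController::from_env() }
    }

    /// Test and bench path: the caller supplies the controller, no env access.
    pub fn new_with_workload(workload: WorkloadController) -> Self {
        Self { workload }
    }
}
```

A test that wants cap=1 builds `WorkloadController::new(1)` and passes it
in, so nothing outside that test can see the override.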

P2 (lib.rs:130): the heavy actor's `oneshot.await` inside the loop
serialized its requests, so the heavy in-flight count was always 1 and
cap=1 never tripped on the heavy side. The bench validated isolation
(light p99 bounded) but didn't demonstrate the rejection path.

Fix: add a `--heavy-concurrency` arg (default 4) and spawn batches
as concurrent tokio tasks bounded by an internal semaphore. With
heavy_concurrency=4 and inflight_cap=1, the bench now reports
heavy_too_many_requests > 0 and heavy_ok == 1 at peak — proving the
gate fires for the heavy actor.

Sample run on local FS (4 light actors × 30 ops, 20 heavy batches ×
50 rows, heavy_concurrency=4, cap=1):

  heavy_ok: 1
  heavy_too_many_requests: 19
  light_ok: 120
  light_too_many_requests: 0
  light_p99: 565 ms (target < 2 s)

Heavy saturates its own cap; light actors are completely unaffected.
The isolation property is now empirically proven by the rejection
counts rather than just by the latency tail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:57:42 +02:00
Ragnor Comerford
8e1a8e7d55
server: document 429 / 503 in admission-gated endpoint OpenAPI responses
Closes the cubic finding (P2) at lib.rs:1061: the new admission gates
add HTTP 429 / 503 failure paths but the affected endpoint
`#[utoipa::path(... responses(...) ...)]` annotations weren't updated.
Also closes a pre-existing miss on /change (admission-gated since
PR 2 Step F).

Adds (status = 429, ...) and (status = 503, ...) to all six
admission-gated endpoints:

- POST /change (operation_id = "change")
- POST /schema/apply (operation_id = "applySchema")
- POST /ingest (operation_id = "ingest")
- POST /branches (operation_id = "createBranch")
- DELETE /branches/{branch} (operation_id = "deleteBranch")
- POST /branches/merge (operation_id = "mergeBranches")

The descriptions reference the `Retry-After` header, which the
`IntoResponse for ApiError` impl emits on both codes (added in
commit c745dd6).

openapi.json regenerated via OMNIGRAPH_UPDATE_OPENAPI=1; the openapi
sentinel test passes both with the regen flag and in strict-check
mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:49:02 +02:00
Ragnor Comerford
b09a0972cb
bench: add actor-isolation harness for WorkloadController
Empirical proof of MR-686's central design promise: per-actor
admission control isolates noisy actors from light traffic. The
existing bench_concurrent_http harness measures aggregate throughput;
this harness measures the latency tail seen by light actors while a
heavy actor saturates its own per-actor cap.

Setup: one "heavy" actor flooding /ingest with multi-row NDJSON
batches; N "light" actors each running short bursts of /change
inserts, each authenticating with a distinct bearer token so the
WorkloadController accounts them as separate identities.

Output: heavy throughput / 429 count, light p50/p95/p99/max latency.
Acceptance heuristic on local FS: light-actor p99 < 2 s while the
heavy actor saturates its own cap.

Sample run on local FS, cap=1, 4 light actors x 30 ops, 20 heavy
batches x 50 rows: light p99 = 710 ms, light errors = 0 (well under
the 2 s acceptance target). The test demonstrates the isolation
property — the heavy /ingest holds its own admission slot but
doesn't affect light actors since they have separate per-actor
state.

Usage:
    cargo run --release -p omnigraph-server --example bench_actor_isolation -- \
        --light-actors 4 --light-ops-per-actor 30 \
        --heavy-batches 20 --heavy-rows-per-batch 50 \
        --inflight-cap 1 \
        --output .context/bench-results/after-pr2-phase2/actor-isolation.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:12:50 +02:00
Ragnor Comerford
976aa0ec1d
tests: pin concurrent /change + branch_merge interleave preserves writes
Future-proofs against MR-895 work that may move or remove the
per-(table, branch) writer queue acquisition inside `branch_merge`
(`crates/omnigraph/src/exec/merge.rs:1224`). Today the queue
linearizes a concurrent /change on main against a `branch_merge
feature → main` on the same touched tables; both succeed and the
inserted row is preserved post-merge.

Codex flagged this scenario as a P1 in the PR #75 review, claiming the
merge could silently overwrite concurrent target writes because the
source-rewrite path opens with `MutationOpKind::Merge` (skipping the
strict pre-stage check). Validation showed the queue at merge.rs:1224
is held across both Phase B (per-table commit_staged) and Phase C
(manifest publish), so there's no interleave window. The Merge
op_kind only affects same-process pre-stage drift detection, not
cross-write linearization. The test passes on f925ad1; landing it
as a regression sentinel catches future changes that drop the queue
acquisition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:03:05 +02:00
Ragnor Comerford
5520ab72ff
tests: pin disjoint /change concurrency at HTTP level
Closes the cubic acceptance-criteria gap ("Integration test: two
/change requests targeting different (table_key, branch) execute
concurrently end-to-end"). The bench harness measures the throughput
side; this test is the regression sentinel that catches a future
change which accidentally re-introduces graph-wide serialization on
the disjoint path.

Spawns 4 concurrent /change inserts on node:Person and 4 on
node:Company. All 8 must return 200, and the post-test row counts
on each table must reflect every insert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 17:01:52 +02:00
Ragnor Comerford
c745dd69ae
server: emit Retry-After header on 429 / 503 responses
Closes the doc-vs-code gap at api.rs:343 and lib.rs:344-355: the
documentation claims `Retry-After` is set on TooManyRequests /
ServiceUnavailable responses, but `IntoResponse for ApiError`
emitted only `(StatusCode, Json(ErrorOutput))` — no header.

Wires a constant `RETRY_AFTER_SECONDS = "60"` for both 429 and 503
codes. Plumbing per-RejectReason durations through is a follow-up;
the admission rejects we surface today recover bounded by request
handler duration rather than calendar wait, so a constant suffices.
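
Roughly, the mapping looks like the sketch below, written against plain
integers and string pairs instead of the server's HTTP types.
`status_for` and `response_headers` are hypothetical names, and routing
ByteBudgetExceeded to 429 is an assumption based on it being a per-actor
cap:

```rust
pub const RETRY_AFTER_SECONDS: &str = "60";

#[derive(Debug, Clone, Copy)]
pub enum RejectReason {
    InFlightCountExceeded,
    ByteBudgetExceeded,
    GlobalRewriteExhausted,
}

/// Per-actor rejects map to 429; global rewrite-pool exhaustion maps to 503.
pub fn status_for(reason: RejectReason) -> u16 {
    match reason {
        RejectReason::InFlightCountExceeded | RejectReason::ByteBudgetExceeded => 429,
        RejectReason::GlobalRewriteExhausted => 503,
    }
}

/// Both admission statuses carry the same constant Retry-After value.
pub fn response_headers(status: u16) -> Vec<(&'static str, &'static str)> {
    match status {
        429 | 503 => vec![("Retry-After", RETRY_AFTER_SECONDS)],
        _ => Vec::new(),
    }
}
```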

Pinned by `ingest_per_actor_admission_cap_returns_429`. The test is now
fully green: at least one of the 8 concurrent /ingest requests under
cap=1 receives 429 with Retry-After: 60.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:58:47 +02:00
Ragnor Comerford
05a8bd5de1
server: gate /ingest /branches/* /schema/apply on per-actor admission
Closes the gap that admission control only fired on /change. A heavy
actor sending bulk-ingest traffic could exhaust shared engine capacity
(Lance I/O threads, manifest churn) without hitting the per-actor cap.

Wires `state.workload.try_admit(&actor_arc, est_bytes)` into the five
remaining mutating handlers AFTER Cedar authorization (so denied
requests don't consume admission slots) and BEFORE the engine call.
Byte estimates per handler:

- /ingest: request.data.len() (NDJSON body)
- /schema/apply: request.schema_source.len()
- /branches/create, /branches/delete, /branches/merge: 256
  (small JSON; the heavy work is bounded per-(table, branch) by the
  engine's writer queue rather than by request size)

The admission guard is held in `let _admission = ...` so it stays
alive until handler return, releasing the count permit + decrementing
the byte budget on drop.
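
Why the binding name matters, as a std-only sketch: `AdmissionGuard` here
is a simplified stand-in that only tracks a count, and `admit` plus the
two handler functions are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Releases one in-flight slot when dropped.
pub struct AdmissionGuard {
    in_flight: Arc<AtomicUsize>,
}

impl Drop for AdmissionGuard {
    fn drop(&mut self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

pub fn admit(in_flight: &Arc<AtomicUsize>) -> AdmissionGuard {
    in_flight.fetch_add(1, Ordering::AcqRel);
    AdmissionGuard { in_flight: Arc::clone(in_flight) }
}

/// Returns the in-flight count observed during the "engine call".
pub fn handler_named_binding(in_flight: &Arc<AtomicUsize>) -> usize {
    let _admission = admit(in_flight); // guard lives until the function returns
    in_flight.load(Ordering::Acquire)
}

/// Same handler, but the guard is discarded before the "engine call".
pub fn handler_bare_underscore(in_flight: &Arc<AtomicUsize>) -> usize {
    let _ = admit(in_flight); // `_` does not bind, so the guard drops immediately
    in_flight.load(Ordering::Acquire)
}
```

With `let _ = ...` the slot would be released before the engine call even
starts, so the cap would never trip.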

Pinned by `ingest_per_actor_admission_cap_returns_429` (previous
commit). The test still fails on the Retry-After header assertion;
the next commit emits the header.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:57:53 +02:00
Ragnor Comerford
0976cbebc5
tests: pin /ingest admission gate + 429 Retry-After (red)
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix. Currently fails on f925ad1 with 8/8 statuses returning
200 because /ingest does not call WorkloadController::try_admit.

The test pins:
- /ingest is gated on per-actor admission control (returns 429 when
  the cap is exceeded).
- 429 responses carry the structured `code: too_many_requests` error
  body so clients can distinguish them from generic conflicts.
- 429 responses include a `Retry-After` header so clients can implement
  bounded backoff. The doc claim at api.rs:343 and lib.rs:344 was that
  this header exists; the IntoResponse impl currently emits no headers.

Two follow-up commits will turn this green:
1. Wire WorkloadController::try_admit on /ingest and the four other
   mutating handlers (Block 2.1).
2. Emit the Retry-After header on 429/503 responses (Block 2.2).

The test uses #[serial] + EnvGuard to override
OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=1 without racing parallel tests, then
spawns 8 concurrent /ingest tasks aligned at a tokio::sync::Barrier so
multiple tasks reach try_admit close in time. With cap=1, at least one
must be rejected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:57:01 +02:00
Ragnor Comerford
c263732b1a
tests: extend same-key insert test with /snapshot row-count assertion
The existing change_concurrent_inserts_same_key_serialize_without_409
test claimed in its comment "asserts the final row count equals N" but
only checked HTTP status codes. cubic flagged the gap; this commit
adds the actual /snapshot read after the concurrent inserts to verify
all N batches landed (no silent overwrite) by comparing the post-test
node:Person row_count against SEED + N.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:49:38 +02:00
Ragnor Comerford
3b33e9ac56
tests: pin branch_create_from swap-restore race (red)
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix so the red → green pair is visible in git log.

The test demonstrates that two concurrent `POST /branches` calls with
distinct `from` parents corrupt coordinator state: A's "operate" step
runs against B's swapped coordinator instead of its own, forking the
new branch off the wrong parent's HEAD.

Currently fails on f925ad1 with all 8 gamma branches (declared
parent: alpha, 5 rows) reporting 4 rows — beta's row count. The
operate step ran against beta's coord because B's swap interleaved
between A's swap and A's operate.

Fix lands in the next commit: hold a single `coordinator.write().await`
guard across the entire swap-operate-restore sequence in
`branch_create_from_impl` so the three steps are atomic relative to
other callers.

Closes the bug class "non-atomic three-step coordinator manipulation
under &self callers" rather than guarding the specific call site —
the right architectural seam (single critical section per swap-restore
sequence) eliminates the interleave window for branch_create_from and
any future swap-restore caller.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:44:50 +02:00
Ragnor Comerford
ebf5a5769d
tests: pin UPDATE RYW under in-process concurrency (red)
Per AGENTS.md rule 8, this commit lands the failing regression test
ahead of the fix so the red → green pair is visible in git log.

The test asserts the RYW invariant for in-process concurrent UPDATEs on
the same row: exactly one writer commits and N-1 receive 409
manifest_conflict. Currently fails on f925ad1 with 1 x 200 + 7 x 500:

> "storage: Retryable commit conflict for version 6: This Update
>  transaction was preempted by concurrent transaction Update at
>  version 6. Please retry."

Lance's transaction conflict resolver correctly detects the Update vs
Update race, but the error wraps as `OmniError::Lance(<string>)` and the
API surfaces it as 500 internal rather than 409 retryable conflict. Users
see "internal server error" for what is documented as a retryable
conflict path.

The fix lands in the next commit: an op-kind-aware drift check at the
commit_all entry that returns 409 ExpectedVersionMismatch for tables
whose first touch was Update / Delete / SchemaRewrite when the staged
dataset version drifts from the manifest pin under the queue.

Closes the bug class "Lance internal conflict surfaces as 500 instead
of 409" rather than mapping the specific Lance error variant — the
right architectural layer (engine boundary, under the queue) catches
the drift before commit_staged ever runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 16:33:53 +02:00
Ragnor Comerford
f925ad1739
mr-686: Phase 2 — op-kind-aware version check + coord Mutex → RwLock
Fix A: op-kind-aware ensure_expected_version. Insert/Merge skip the
strict pre-stage check; Update/Delete/SchemaRewrite keep it. New
MutationOpKind enum threaded through open_for_mutation_on_branch /
open_owned_dataset_for_branch_write / reopen_for_mutation and all
callers (execute_insert/update/delete_node/delete_edge,
branch_merge::publish_rewritten_merge_table, schema_apply,
ensure_indices_for_branch, loader Append/Merge/Overwrite). Closes the
77% rejection rate on same-key concurrent inserts.
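
The rule in Fix A can be sketched as a pure function. The signature and
the `ExpectedVersionMismatch` struct are illustrative assumptions; the
real `ensure_expected_version` operates on datasets, not bare integers:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MutationOpKind {
    Insert,
    Merge,
    Update,
    Delete,
    SchemaRewrite,
}

#[derive(Debug, PartialEq)]
pub struct ExpectedVersionMismatch {
    pub expected: u64,
    pub actual: u64,
}

/// Insert and Merge tolerate version drift; the other kinds require the
/// dataset version to match the pinned one.
pub fn ensure_expected_version(
    kind: MutationOpKind,
    expected: u64,
    actual: u64,
) -> Result<(), ExpectedVersionMismatch> {
    match kind {
        MutationOpKind::Insert | MutationOpKind::Merge => Ok(()),
        MutationOpKind::Update | MutationOpKind::Delete | MutationOpKind::SchemaRewrite => {
            if expected == actual {
                Ok(())
            } else {
                Err(ExpectedVersionMismatch { expected, actual })
            }
        }
    }
}
```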

Fix B: coordinator Mutex -> RwLock. Reads parallelize via .read();
writes serialize via .write(). Atomic-commit invariant preserved by
the single .write() covering commit_manifest_updates +
record_graph_commit.

Bench-as-test change_concurrent_inserts_same_key_serialize_without_409
(server.rs:2180) spawns 12 concurrent /change inserts on a single
(table, branch); asserts every request returns 200. Was failing
pre-Phase-2; passes post-Phase-2.
change_conflict_returns_manifest_conflict_409 (cross-process drift
sentinel) and branch_merge_conflict_response_includes_structured_conflicts
both still pass.

Bench (after-pr2-phase2):
- single-actor 1x1: 14.9 ops/s, p50 68ms (baseline 12.3, +22%)
- disjoint 8x8:    7.04 ops/s, p50 1023ms (baseline 6.24, +13%)
- same-key 8x1:    2.62 ops/s, 0 errors (after-pr2: 77% errors)

Disjoint stayed at +13% — Fix B's RwLock helped read paths but the
publisher's .write() critical section still serializes graph-wide.
Splitting GraphCoordinator into per-concern primitives (manifest in
ArcSwap, commit_graph in RwLock, atomic-commit serializer) is the
deferred next step.

102 lib + 30 branching + 24 runs + 16 staged_writes + 63 end_to_end
+ 40 server tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-08 12:42:26 +02:00
Ragnor Comerford
c15962e6b0
server: flip AppState to Arc<Omnigraph>, wire admission on /change (PR 2 Step F)
The substantive PR 2 change. Removes the global server `RwLock<Omnigraph>`
that has serialized every mutating request across all actors. Disjoint
`(table, branch)` writes from different actors now run concurrently,
guarded only by the engine's per-(table, branch) write queue (PR 1b)
and per-actor admission control (PR 2 Step E).

AppState changes:
- `db: Arc<RwLock<Omnigraph>>` -> `engine: Arc<Omnigraph>`
- New field: `workload: Arc<workload::WorkloadController>` initialized
  from env (`OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=16`,
  `OMNIGRAPH_PER_ACTOR_BYTES_MAX=4GiB`,
  `OMNIGRAPH_GLOBAL_REWRITE_MAX=4`).
- `tokio::sync::RwLock` import dropped.

Handler updates (16 sites):
- All `Arc::clone(&state.db).read_owned().await` and `write_owned()`
  calls replaced with `let db = &state.engine`. Engine APIs are now
  `&self` (Step C) so this works directly.
- `/export` clones `Arc<Omnigraph>` once and moves into the spawned
  task instead of acquiring a long-held read lock.
- `/change` handler additionally wires
  `state.workload.try_admit(&actor_arc, est_bytes)`. Cedar runs FIRST
  so denied requests don't consume admission slots; admission runs
  SECOND before the engine call. `est_bytes` uses the request body
  size as a coarse proxy.

API surface additions (`api::ErrorCode`):
- `TooManyRequests` -> HTTP 429 (per-actor cap exceeded; respect
  `Retry-After`)
- `ServiceUnavailable` -> HTTP 503 (global rewrite pool exhausted)

`ApiError` constructors `too_many_requests` / `service_unavailable` and
`from_workload_reject` (maps `RejectReason` variants to HTTP status).

Other mutating handlers (`/ingest`, `/branches/*`, `/branches/merge`,
`/schema/apply`) currently flow through the Arc<Omnigraph> path
without admission gates; wiring those is mechanical and lands as a
follow-up. The /change hot path covers the bulk of MR-686's load
profile.

OpenAPI regenerated to include the new ErrorCode variants.
102 lib + 39 server tests + 5 workload tests pass. The regression
sentinel `change_conflict_returns_manifest_conflict_409` continues
to pass (revalidation perf opt + per-table queue + publisher CAS
preserve manifest_conflict semantics under concurrent writers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:08:26 +02:00
Ragnor Comerford
17a1665002
server: add WorkloadController for per-actor admission (PR 2 Step E)
PR 2 removes the global server `RwLock<Omnigraph>` (Step F). Without
admission control, one heavy actor would exhaust shared capacity
(Lance I/O threads, manifest churn, network) and starve other actors.
The WorkloadController bounds per-actor in-flight count + bytes and
provides a global rewrite-pool semaphore for compaction / index builds.

New file: `crates/omnigraph-server/src/workload.rs` (~250 LOC + 5 tests).

API:
- `WorkloadController::new(inflight_cap, byte_cap, rewrite_cap)` /
  `from_env()` / `with_defaults()`.
- `try_admit(actor_id, est_bytes) -> Result<AdmissionGuard, RejectReason>`
  acquires both an in-flight count permit and adds est_bytes to the
  per-actor counter atomically; returns RejectReason on either gate.
- `try_admit_rewrite() -> Result<RewriteGuard, RejectReason>` for the
  global rewrite pool (Step F maps RewriteGuard exhaustion to HTTP 503).
- `RejectReason::{InFlightCountExceeded, ByteBudgetExceeded,
  GlobalRewriteExhausted}`.

Race-free admission via `tokio::sync::Semaphore::try_acquire_owned()`
for the count gate (master plan Finding 6: independent atomic
load+check+add lets two callers both pass a cap-N check; the Semaphore
gate is atomic). Bytes use `fetch_add` + decrement-on-rejection so the
cap is never exceeded even on rollback.
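
A std-only sketch of the two gates, using `AtomicUsize::fetch_update` as
a stand-in for the tokio Semaphore (a swapped-in technique, not the
committed code). `ActorBudget`, `Reject`, and the explicit `release`
method are hypothetical; the real controller returns a guard instead:

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

pub struct ActorBudget {
    in_flight: AtomicUsize,
    bytes: AtomicU64,
    inflight_cap: usize,
    byte_cap: u64,
}

#[derive(Debug, PartialEq)]
pub enum Reject {
    InFlightCountExceeded,
    ByteBudgetExceeded,
}

impl ActorBudget {
    pub fn new(inflight_cap: usize, byte_cap: u64) -> Self {
        Self {
            in_flight: AtomicUsize::new(0),
            bytes: AtomicU64::new(0),
            inflight_cap,
            byte_cap,
        }
    }

    pub fn try_admit(&self, est_bytes: u64) -> Result<(), Reject> {
        // Count gate: check and increment happen as one atomic update, so
        // two callers cannot both pass a check against the same old value.
        let cap = self.inflight_cap;
        self.in_flight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < cap { Some(n + 1) } else { None }
            })
            .map_err(|_| Reject::InFlightCountExceeded)?;

        // Byte gate: add first, then undo both gates if the cap is exceeded.
        let before = self.bytes.fetch_add(est_bytes, Ordering::AcqRel);
        if before.saturating_add(est_bytes) > self.byte_cap {
            self.bytes.fetch_sub(est_bytes, Ordering::AcqRel);
            self.in_flight.fetch_sub(1, Ordering::AcqRel);
            return Err(Reject::ByteBudgetExceeded);
        }
        Ok(())
    }

    /// Must be called once per successful `try_admit` with the same estimate.
    pub fn release(&self, est_bytes: u64) {
        self.bytes.fetch_sub(est_bytes, Ordering::AcqRel);
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}
```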

Defaults (override via env):
- OMNIGRAPH_PER_ACTOR_INFLIGHT_MAX=16
- OMNIGRAPH_PER_ACTOR_BYTES_MAX=4_294_967_296 (4 GiB)
- OMNIGRAPH_GLOBAL_REWRITE_MAX=4

Tests cover under-cap admission, byte-budget rollback, per-actor
isolation, global rewrite cap, and the load-bearing 32-concurrent-vs-
cap-16 race test (forces real contention via a broadcast release
channel so guards can't recycle permits task-by-task; pins the
master plan's race-free invariant).

Adds workspace dep `dashmap = "6"` for per-actor state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:59:45 +02:00
Ragnor Comerford
fcb47620d3
mr-686: bundle PR 0/1a/1b foundation + PR 2 catalog/schema_source ArcSwap
Bundles the working-tree state from the prior session (PR 0 bench harness,
PR 1a audit_actor_id removal, PR 1b WriteQueueManager + writer integration)
together with the first half of PR 2's interior-mutability foundation
(catalog and schema_source wrapped in Arc<ArcSwap<...>>). The two streams
intermix in 7 of the same files, so splitting via git add -p was
impractical. Subsequent PR 2 steps land as separate atomic commits.

PR 0 — server-level concurrent /change bench harness
  - crates/omnigraph-server/examples/bench_concurrent_http.rs (new)
  - .context/bench-results/{baseline-main,after-pr1}/ (gitignored)

PR 1a — drop the audit_actor_id field, thread per-call
  - removed Omnigraph::audit_actor_id and the swap-restore patterns in
    mutation.rs, merge.rs, loader/mod.rs
  - actor_id: Option<&str> threaded through MutationStaging::finalize,
    mutate_with_current_actor, ingest_with_current_actor,
    branch_merge_impl, branch_merge_on_current_target,
    commit_prepared_updates*, record_merge_commit,
    commit_updates_on_branch_with_expected
  - apply_schema and ensure_indices_for_branch pass None (system-attributed)

PR 1b — per-(table_key, branch) write queue + revalidation + sidecar
  - new crates/omnigraph/src/db/write_queue.rs with WriteQueueManager,
    acquire/acquire_many, sorted+deduped acquisition; 6 unit tests
  - Arc<WriteQueueManager> field on Omnigraph + db.write_queue() accessor
  - MutationStaging::finalize split into stage_all (Phase A, no queue)
    and StagedMutation::commit_all (Phase B, acquire_many + revalidate
    pins + sidecar + commit_staged); guards held across publisher
  - delete-only mutations now emit recovery sidecars; revalidation
    extended to inline_committed tables
  - branch_merge_on_current_target, apply_schema_with_lock, and
    ensure_indices_for_branch acquire per-table queues for their
    touched tables
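
A sketch of why the sorted+deduped acquisition matters, assuming a queue key is a `(table_key, branch)` pair of strings; `QueueKey` and `acquisition_order` are hypothetical names, not the real `acquire_many`. If every writer takes its queues in the same global order, two writers with overlapping table sets cannot each hold a queue the other is waiting on, and deduping avoids acquiring the same queue twice.

```rust
/// Hypothetical queue key: (table_key, branch).
pub type QueueKey = (String, String);

/// Normalize a requested key set into the order the queues would be taken in.
/// Sorting gives every caller the same global order; dedup drops repeated keys.
pub fn acquisition_order(mut keys: Vec<QueueKey>) -> Vec<QueueKey> {
    keys.sort();
    keys.dedup();
    keys
}
```

Two callers that request the same tables in different orders, or with repeats, end up with identical acquisition sequences.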

PR 2 Step B (partial) — catalog and schema_source via ArcSwap
  - catalog: Catalog -> Arc<ArcSwap<Catalog>>
  - schema_source: String -> Arc<ArcSwap<String>>
  - public accessors return Arc<Catalog> / Arc<String>; readers bind
    locally where the borrow has to outlive an expression
  - new pub(crate) store_catalog / store_schema_source helpers replace
    the field assignments in apply_schema and reload_schema_if_source_changed
  - 117 tests across lifecycle/end_to_end/branching/runs pass; engine
    lib + workspace compile clean
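
A rough illustration of the snapshot-swap pattern described above. This is not the `arc-swap` crate: `Swappable` is a hypothetical stand-in built on `RwLock<Arc<T>>` so the example needs only the standard library. It shows the same shape, though: readers take an `Arc` snapshot that stays valid after a writer stores a new value through `&self`.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical std-only stand-in for an atomically swappable Arc.
pub struct Swappable<T> {
    inner: RwLock<Arc<T>>,
}

impl<T> Swappable<T> {
    pub fn new(value: T) -> Self {
        Self { inner: RwLock::new(Arc::new(value)) }
    }

    /// Return a snapshot; it is unaffected by later `store` calls.
    pub fn load(&self) -> Arc<T> {
        let guard = self.inner.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Replace the current value without needing `&mut self`.
    pub fn store(&self, value: T) {
        *self.inner.write().unwrap() = Arc::new(value);
    }
}
```

A reader that bound `let catalog = state.load();` before a schema apply keeps seeing the old catalog for as long as it holds that `Arc`; the next `load()` sees the new one.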

Coordinator wrap (Mutex) and the &mut self -> &self engine API
conversion follow in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 16:22:38 +02:00
Ragnor Comerford
8726ffe0a3
release: bump version to 0.4.1 2026-05-02 23:20:50 +02:00
Ragnor Comerford
044ed46019
chore: scrub Linear ticket numbers and review-bot mentions from code comments
OmniGraph is OSS; internal Linear ticket references and code-review-bot
mentions in source-code comments don't help external readers and leak
internal tooling. Replace ticket numbers (MR-XXX) with descriptive
prose, drop linear.app URLs, and remove inline mentions of
Cursor/Bugbot/Cubic/Codex review threads.

Scope is limited to source-code comments (`crates/`). Docs under
`docs/` keep their MR-XXX references — those are part of the
established change-history narrative for in-repo docs and don't
require a Linear account to find context for.

No behavior changes; no public API changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:45:38 +02:00
Ragnor Comerford
35be20cb05
MR-771: demote Run to direct-publish via expected_table_versions CAS
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.
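
A minimal sketch of the compare-and-swap publish described above, under assumptions: the `Manifest` map, the `publish` signature, and the fields on `ExpectedVersionMismatch` are illustrative only. Every expected version is checked before anything is written, so a stale writer changes nothing.

```rust
use std::collections::HashMap;

/// Error named in the commit message; the fields here are illustrative.
#[derive(Debug, PartialEq)]
pub struct ExpectedVersionMismatch {
    pub table: String,
    pub expected: u64,
    pub actual: u64,
}

/// Hypothetical manifest: table name -> currently published version.
pub type Manifest = HashMap<String, u64>;

/// Publish all updates or none. Each update is (table, expected, new version).
pub fn publish(
    manifest: &mut Manifest,
    updates: &[(&str, u64, u64)],
) -> Result<(), ExpectedVersionMismatch> {
    // Phase 1: validate every expected version before mutating anything.
    for &(table, expected, _) in updates {
        let actual = manifest.get(table).copied().unwrap_or(0);
        if actual != expected {
            return Err(ExpectedVersionMismatch {
                table: table.to_string(),
                expected,
                actual,
            });
        }
    }
    // Phase 2: all checks passed, apply the new versions.
    for &(table, _, new_version) in updates {
        manifest.insert(table.to_string(), new_version);
    }
    Ok(())
}
```

A second writer that still expects the pre-publish version of any one table gets the mismatch error, which the server would surface as the 409 `manifest_conflict` mentioned above.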

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 08:52:50 +02:00
Andrew Altshuler
7310f69928
Revert "Merge pull request #49 from ModernRelay/ragnorc/x-request-id" (#54)
This reverts commit b352fca13c, reversing
changes made to 748ad334a9.
2026-04-26 15:56:29 +03:00
Ragnor Comerford
b352fca13c
Merge pull request #49 from ModernRelay/ragnorc/x-request-id
Add X-Request-Id middleware
2026-04-26 12:33:33 +02:00
Ragnor Comerford
e14b203208
Reuse X_REQUEST_ID constant for inbound header lookup
Both Cursor Bugbot and Cubic flagged that the inbound `headers().get(...)`
call constructed `HeaderName::from_static("x-request-id")` inline instead
of reusing the `X_REQUEST_ID` constant defined at the top of the file.
The two were already kept in sync by both being `from_static("x-request-id")`,
but a future rename would have to touch both sites or risk silent drift
between read and write.

Also drops the now-unused `header` module import.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-26 12:05:19 +02:00
Ragnor Comerford
748ad334a9
Merge pull request #48 from ModernRelay/ragnorc/api-sdk-research
Polish OpenAPI spec for SDK generation
2026-04-26 11:52:46 +02:00
Ragnor Comerford
284c9377c2
Add X-Request-Id middleware
Per-request ULID minted at the edge, exposed in request extensions and
on the response header. Caller-supplied X-Request-Id is echoed when
well-formed (1..=128 ASCII printable characters); otherwise rejected
and replaced with a fresh ULID so the value is always safe to log.

Companion to the TypeScript SDK redesign — clients now correlate logs
across the wire by reading X-Request-Id from response headers (and the
SDK already surfaces it on every OmnigraphError as `requestId`).

No spec change required; the header is a transport-layer concern.

Tests:
- mint a ULID when no header is provided
- echo a valid caller-supplied id
- reject overlong header (200 chars), mint a fresh ULID
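
The echo-or-mint rule can be sketched as below. Assumptions: `is_well_formed` and `resolve_request_id` are hypothetical names, "ASCII printable" is taken to mean bytes 0x20 through 0x7E, and ULID minting is passed in as a closure so the example stays dependency-free.

```rust
/// True when the id is 1..=128 bytes long and every byte is printable ASCII.
/// Assumption: "printable" means 0x20..=0x7E (space through tilde).
pub fn is_well_formed(id: &str) -> bool {
    (1..=128).contains(&id.len()) && id.bytes().all(|b| (0x20..=0x7e).contains(&b))
}

/// Echo a well-formed caller-supplied id; otherwise mint a fresh one.
pub fn resolve_request_id(inbound: Option<&str>, mint: impl FnOnce() -> String) -> String {
    match inbound {
        Some(id) if is_well_formed(id) => id.to_string(),
        _ => mint(),
    }
}
```

The three cases line up with the tests listed above: no header mints, a valid header is echoed, and a 200-character header is replaced with a minted id.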

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 22:56:17 +02:00
Ragnor Comerford
7809bf607e
Polish OpenAPI spec for SDK generation
Add operation descriptions and examples to utoipa annotations so the
generated TypeScript SDK has rich JSDoc, and so future Python/Go SDKs
and any /openapi.json docs UI benefit from the same effort.

- Doc comments on all 18 handlers (utoipa picks up summary/description)
- #[schema(example = ...)] on free-text fields (query_source,
  schema_source, NDJSON data) and i64 timestamps
- Destructive/irreversible warnings on change, applySchema, ingest,
  mergeBranches, deleteBranch, publishRun, abortRun

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 16:36:51 +02:00
Andrew Altshuler
74eb5a5380
Parallel per-type load writes + omnigraph optimize/cleanup CLI (#46)
* Parallel per-type load writes + omnigraph optimize/cleanup CLI

## MR-677.3 — parallel per-type load writes

The load path already groups records into one RecordBatch per type and
makes one Lance commit per table (loader::mod.rs:249-..), but those
commits ran sequentially. Wrap node and edge write loops in
`futures::stream::buffered(N)` against a new helper
`write_batches_concurrently`. Concurrency tunable via
`OMNIGRAPH_LOAD_CONCURRENCY` (default 8).

## MR-676 — `omnigraph optimize` and `omnigraph cleanup`

New CLI subcommands that walk every node + edge table in the repo:

- `omnigraph optimize <uri>` — runs Lance `compact_files` on each
  table to merge small fragments into fewer larger ones.
- `omnigraph cleanup <uri> --keep N | --older-than 7d --confirm` —
  runs Lance `cleanup_old_versions` to prune historical manifests +
  unique fragments. Requires `--confirm` because it's destructive.
  Supports both count-based and time-based retention (or both AND'd
  together). Time uses chrono `DateTime<Utc>` (added as a workspace
  dep, default-features off).

Both commands run their per-table loops in parallel (8-way bounded,
`OMNIGRAPH_MAINTENANCE_CONCURRENCY` env override). Smoke-tested
against the 114-table prod graph: optimize went 7m15s sequential
→ 1m28s parallel. cleanup --keep 1 removed 137 historical versions
across 114 tables in 1m57s without disrupting `/healthz` or query
responses.

Public API on `Omnigraph`:

  pub async fn optimize(&mut self) -> Result<Vec<TableOptimizeStats>>
  pub async fn cleanup(&mut self, opts: CleanupPolicyOptions)
      -> Result<Vec<TableCleanupStats>>

All 10 existing loader tests still pass.

Closes MR-676.
Partially addresses MR-677: this lands the .3 piece (parallel by type).
MR-677.1 applies to the `omnigraph embed` path rather than load, since
load doesn't call Gemini directly, and .2 was already in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate openapi.json

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-04-25 14:22:14 +03:00
Andrew Altshuler
8649b2084f
Prepare v0.3.0 release (#44)
* Prepare v0.3.0 release

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: regenerate openapi.json

* ci: retrigger CI on latest openapi.json

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2026-04-21 19:11:34 +03:00
Ragnor Comerford
a157f6a17c
Fold openapi.json auto-sync into main CI test job
The separate openapi-sync workflow was duplicating the workspace build
(~15 min cold-cache compile), paying the cost twice per PR. Fold the
regen + auto-commit into the existing test job: one compile, shared
rust-cache, same drift-check semantics.

- Same-repo PRs: OMNIGRAPH_UPDATE_OPENAPI=1 during the test run, then
  commit the regenerated spec back to the PR branch
- Fork PRs / pushes: env var empty, test stays in strict drift-check mode
- openapi_spec_is_up_to_date treats empty env value as unset, so the
  conditional workflow env expression works

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 21:00:46 +02:00
Ragnor Comerford
9de2079263
Merge remote-tracking branch 'origin/main' into ragnorc/explore-api
# Conflicts:
#	CONTRIBUTING.md
2026-04-18 20:24:39 +02:00
andrew
7a3bf5c758 Add aws feature + SecretsManagerTokenSource backend
Introduces an opt-in AWS Secrets Manager backend for bearer tokens,
behind the `aws` Cargo feature. Default builds (on-prem, local dev)
don't pull in the AWS SDK and don't pay its compile cost.

- New Cargo feature `aws` gates the `aws-config` + `aws-sdk-secretsmanager`
  optional deps. Default features remain empty.
- New `auth::aws::SecretsManagerTokenSource` implements `TokenSource` by
  fetching a JSON `{"actor_id": "token", ...}` payload from a named
  Secrets Manager secret. Credentials resolve via the AWS default chain
  (env, shared config, IMDSv2 instance role, ECS task role) so no
  explicit plumbing is needed under an IAM role.
- New `resolve_token_source()` dispatches based on the
  `OMNIGRAPH_SERVER_BEARER_TOKENS_AWS_SECRET` env var. If the var is set
  but the binary was built without `--features aws`, returns a clear
  rebuild instruction rather than silently falling back.
- `serve()` now uses `resolve_token_source()` and logs which source was
  selected at startup.
- `parse_json_secret_payload()` is factored out as a free function so
  the payload validation (trim whitespace, reject blank actor/token,
  reject non-object) is unit-testable without the AWS SDK.
- New CI job `test_aws_feature` builds + tests with `--features aws`.

Not in this PR (follow-ups):
- Background refresh loop for rotation. `SecretsManagerTokenSource`
  advertises `supports_refresh: true` but the AppState-level refresh
  task isn't wired yet.
- Config-YAML dispatch (today the AWS source is selected via env var
  only; eventually `server.bearer_tokens.source` in `omnigraph.yaml`).

Tests:
- Default-feature build: 33 lib + 41 integration + 64 openapi.
- `--features aws` build: 32 lib (one test is cfg-gated) + 41 + 64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 03:48:51 +03:00
andrew
af41630520 Extract TokenSource trait for bearer token loading
Pure refactor. No behavior change. Introduces a TokenSource trait so
additional backends (AWS Secrets Manager, Vault, etc.) can plug in
behind feature flags without touching the server wiring.

- New module crates/omnigraph-server/src/auth.rs with the TokenSource
  trait and a single EnvOrFileTokenSource implementation that delegates
  to the existing server_bearer_tokens_from_env() function.
- serve() now constructs EnvOrFileTokenSource and calls load() instead
  of calling the free function directly.
- The trait has a supports_refresh() hook (false for env/file) for
  future implementations that can rotate without restart.
- async-trait added to omnigraph-server deps; it's already in the
  workspace.

Tests:
- Unit tests in auth.rs covering load paths and the default supports_refresh
  / name values.
- Existing 128 tests (lib + integration + openapi) pass unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 03:31:43 +03:00
andrew
c338e80180 Harden bearer auth: constant-time compare, hashed at rest, authoritative actor_id
Fixes two live authz bugs in omnigraph-server:

- Bearer-token lookup previously used HashMap::get, which compares keys with
  Eq and short-circuits on the first differing byte — a network-observable
  timing oracle for brute-forcing tokens. Tokens are now stored as SHA-256
  digests and compared with subtle::ConstantTimeEq, iterating every entry
  unconditionally so total work is independent of which slot matches. Raw
  token bytes no longer live in server memory after startup.

- authorize_request now overwrites PolicyRequest.actor_id from the
  authenticated session instead of trusting the handler-supplied field,
  which previously defaulted to "" via unwrap_or_default(). The empty
  string can no longer reach Cedar as a policy subject even if a future
  refactor drops the None check.
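
A sketch of the full-scan lookup from the first bullet, with loud caveats: `ct_eq` and `lookup` are hypothetical, the hand-rolled XOR accumulator only illustrates the idea and is not a substitute for `subtle::ConstantTimeEq` (an optimizing compiler gives no timing guarantee here), and digests are taken as precomputed `[u8; 32]` values because SHA-256 is not in the standard library.

```rust
/// Compare two digests without exiting at the first differing byte.
/// Illustration only; real code should use a vetted constant-time primitive.
pub fn ct_eq(a: &[u8; 32], b: &[u8; 32]) -> bool {
    let mut diff = 0u8;
    for i in 0..32 {
        diff |= a[i] ^ b[i];
    }
    diff == 0
}

/// Scan every stored digest unconditionally; never return early on a match.
pub fn lookup<'a>(entries: &'a [([u8; 32], String)], presented: &[u8; 32]) -> Option<&'a str> {
    let mut found = None;
    for (digest, actor) in entries {
        if ct_eq(digest, presented) {
            found = Some(actor.as_str());
        }
    }
    found
}
```

Because the loop always visits every entry, the number of comparisons does not depend on which slot matches or whether any does.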

External API of AppState constructors is unchanged — tokens still enter as
Vec<(String, String)> and are hashed on the way in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 01:41:02 +03:00
andrew
be520f31f4 Polish schema endpoint: rename show, align field name, add tests
Review feedback on #23, applied on top of the original commit:

- Rename the CLI subcommand from `schema get` to `schema show` to match
  the existing `run show` / `commit show` convention. A `#[command(alias
  = "get")]` preserves muscle memory for anyone who already typed `get`.
- Rename `SchemaGetOutput` → `SchemaOutput` and its field `source` →
  `schema_source`, so the get response and the apply request use the
  same field name for the same concept.
- Use `println!` instead of `print!` in the CLI so the shell prompt
  doesn't land on the last line of schema output.
- Add three integration tests on `/schema`: happy path (no auth),
  401 when bearer is required but missing, 403 when the policy grants
  the actor branch_create but not read.

Follow-ups left for a separate PR: include `schema_ir_hash` and
`schema_identity_version` in the response payload so clients can do
drift detection and the server can set an ETag; and a fast-path local
read that skips `Omnigraph::open()` when only the schema source is
needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-18 00:30:46 +03:00
Ragnor Comerford
228032a4ac
Add static OpenAPI spec and Stainless SDK config
Introduce SDK generation scaffolding: commit a static openapi.json
extracted from the Utoipa annotations via a golden-file test, add
Stainless workspace/config for TypeScript and Python SDKs, and clean
up operation IDs for ergonomic generated method names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-17 14:26:31 +02:00
Claude
0c4df674fa
Add schema get command to CLI and HTTP API
Exposes the existing schema_source() method via a new `omnigraph schema get`
CLI subcommand and a `GET /schema` API endpoint, allowing users to retrieve
the current accepted schema from any graph repository.

https://claude.ai/code/session_01UYybeBQks3fz3RJrTHtwQw
2026-04-16 21:15:17 +00:00
andrew
33bdab1fcb Prepare v0.2.2 release 2026-04-14 20:13:00 +03:00
andrew
3d74cbfc20 Prepare v0.2.1 release 2026-04-14 19:19:00 +03:00
andrew
1a26e2e654 Rename config targets to graphs 2026-04-14 04:12:14 +03:00
andrew
5daeae7571 Prepare v0.2.0 release 2026-04-12 20:35:34 +03:00