mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-18 02:24:27 +02:00
8 commits
6a2dfa7325
fix: self-heal manifest-unreferenced branch forks (stop wedged branches) (#231)
* chore: correct stale global-lock comments
The global Arc<RwLock<Omnigraph>> that once serialized every server write was
removed — the server holds the engine as a lockless Arc<Omnigraph> and write
methods are &self, so the per-(table_key, branch) write queues are now the
actual write-serialization mechanism (in-process only).
Correct comments that still claimed the global lock is 'still in place' /
'today', or framed the queues as MR-686 scaffolding: write_queue.rs module doc,
exec/merge.rs, db/omnigraph/schema_apply.rs, db/manifest/recovery.rs, and the
bench_concurrent_http.rs example (which also wrongly stated mutate_as is
&mut self). workload.rs is left as-is — its 'previous global RwLock' wording is
accurate history.
* test: regression for self-healing a manifest-unreferenced fork
An interrupted first-write fork (create_branch succeeded, the manifest publish
did not) leaves a fully-formed Lance branch ref the manifest never references.
The branch stays a valid manifest branch, so cleanup's reconciler never
reclaims it, and today the next write to that table wedges with 'incomplete
prior delete; run cleanup'.
Forge that exact residue (a live 'feature' branch + a directly-created
'feature' ref on the Person table the manifest doesn't reference) and assert
the next load AND mutate self-heal. Deterministic and local — no S3 or timing,
since the forge IS the post-crash state. Adds a shared node_table_uri helper.
This commit is RED: it reproduces the bug and fails against the unfixed engine
with the predicted symptom. The fix follows in the next commit.
* fix: self-heal manifest-unreferenced branch forks
The first write to a table on a branch lazily forks it via Lance create_branch,
a durable two-phase op that advances Lance state BEFORE the atomic manifest
publish. If the writer dies or its request future is cancelled between the fork
and the publish, the branch ref is fully formed but the manifest never
references it. The next write re-enters the fork path, create_branch collides,
and the engine wedged with 'orphaned table state ... incomplete prior delete;
run cleanup' — which cleanup could not even fix, because the branch is still a
live manifest branch. This hit load, mutate, ingest, and the merge fork path
(one shared engine chokepoint), so a routine deploy restart or client
disconnect could wedge a branch.
Fix: treat the per-table fork ref as derived state of the manifest.
fork_branch_from_state returns a typed ForkOutcome instead of a human
'incomplete prior delete' error; on RefAlreadyExists the db layer reclaims the
manifest-unreferenced fork (force_delete_branch + re-fork, exactly once) and
proceeds.
A live committed fork is still routed to a retryable conflict before the fork
path, so concurrent first-writes stay correct.
Reclaim is only safe if no in-process writer can be mid-fork, so the write
entry points (load, mutate) acquire the per-(table, branch) write queues for
all touched tables up front — before the fork, held through the publish — when
forking a non-main branch. commit_all accepts these pre-held guards instead of
re-acquiring (the queue is non-re-entrant). The merge fork path already holds
the queue and self-heals through the shared wrapper. Cross-process in-flight
forks remain the documented one-winner-CAS gap.
Mechanical prep folded in: mutation IR lowering is hoisted so the touched-table
set is known before execution; commit_all gains the held_guards parameter.
Flips recreate_over_orphaned_fork_before_cleanup_is_actionable to assert
self-heal; fork_collision_with_live_concurrent_fork_is_retryable still holds.
Docs: writes.md cancelled-future note, invariants.md cross-process known gap.
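
A minimal sketch of the reclaim-once flow described above, not the engine's code. `TableRefs`, `fork_with_self_heal`, the `Created` variant, and the boolean manifest check are hypothetical stand-ins; only `ForkOutcome`, `RefAlreadyExists`, `create_branch`, and `force_delete_branch` are names taken from the commit message. For brevity the live-fork conflict is checked at collision time, whereas the message says the engine routes it before the fork path.

```rust
use std::collections::HashSet;

/// Typed result of a fork attempt. `Created` is an assumed variant name.
#[derive(Debug, PartialEq)]
pub enum ForkOutcome {
    Created,
    RefAlreadyExists,
}

/// Hypothetical stand-in for the Lance branch refs of a single table.
#[derive(Default)]
pub struct TableRefs {
    refs: HashSet<String>,
}

impl TableRefs {
    pub fn create_branch(&mut self, branch: &str) -> ForkOutcome {
        if self.refs.insert(branch.to_string()) {
            ForkOutcome::Created
        } else {
            ForkOutcome::RefAlreadyExists
        }
    }

    pub fn force_delete_branch(&mut self, branch: &str) {
        self.refs.remove(branch);
    }
}

/// Fork `branch`. If the ref already exists and the manifest does not place
/// the table on the branch, reclaim it exactly once and fork again.
pub fn fork_with_self_heal(
    refs: &mut TableRefs,
    branch: &str,
    manifest_places_table_on_branch: bool,
) -> Result<(), String> {
    match refs.create_branch(branch) {
        ForkOutcome::Created => Ok(()),
        // A fork the manifest references is live: never destroy it.
        ForkOutcome::RefAlreadyExists if manifest_places_table_on_branch => {
            Err("retryable conflict: live committed fork".to_string())
        }
        ForkOutcome::RefAlreadyExists => {
            refs.force_delete_branch(branch);
            match refs.create_branch(branch) {
                ForkOutcome::Created => Ok(()),
                ForkOutcome::RefAlreadyExists => {
                    Err("fork still collides after one reclaim".to_string())
                }
            }
        }
    }
}
```

Returning a typed outcome lets the caller decide between reclaim and conflict without parsing error prose.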
* fix(cleanup): reconcile per-table manifest-unreferenced forks
reconcile_orphaned_branches keyed orphans on the branch NAME (absent from the
manifest), so it only reclaimed forks from a fully-deleted branch. A fork left
on a still-live branch by an interrupted first-write was never reclaimed — the
backstop the handoff expected cleanup to provide did not cover that case.
Broaden it to a per-table authority test: a Lance branch B on table T is an
orphan iff B is not a live manifest branch (delete-leftover) OR the manifest's
branch-B snapshot does not place T on B (interrupted first-write). Per-branch
snapshots are resolved once and cached across tables. Legitimately-forked
tables, main, and internal/system branches are never reclaimed; children are
dropped before parents to avoid Lance's referenced-parent RefConflict. The
commit-graph half stays whole-branch (per-table doesn't apply there).
This is the guaranteed-convergence backstop to the write-path self-heal: it
reclaims any fork the write path never revisits, and is what Lance's own
create_branch docstring asks embedders to provide for zombie/orphan refs.
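
The per-table authority test above can be sketched as a pure predicate. This is an illustration under assumptions: `ManifestView` and `is_orphan_fork` are hypothetical names, the manifest is modelled as a map from live branch name to the tables placed on it, and only `main` is excluded here (internal/system branches would be excluded the same way).

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical manifest view: live branch name -> tables placed on that branch.
pub type ManifestView = HashMap<String, HashSet<String>>;

/// A Lance branch `branch` on table `table` is an orphan iff the branch is not
/// a live manifest branch, or its snapshot does not place the table on it.
pub fn is_orphan_fork(manifest: &ManifestView, table: &str, branch: &str) -> bool {
    if branch == "main" {
        return false; // main is never reclaimed
    }
    match manifest.get(branch) {
        None => true,                            // delete-leftover
        Some(tables) => !tables.contains(table), // interrupted first-write
    }
}
```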
* fix: reclaim self-validates against fresh manifest authority
The fork reclaim force-deletes a Lance branch ref, gated on the caller's proof
that the manifest does not place the table on the branch. But the first-write
path obtains that proof via snapshot_for_branch, which returns the coordinator's
CACHED snapshot when the handle is bound to the branch (an embedded handle on
the branch, or branch_merge's target swap). If that snapshot is stale and a
concurrent writer already published a legitimate fork, the reclaim would
force-delete it and re-fork from source, stranding the manifest at a version the
recreated ref no longer has.
Make the destructive primitive own its safety precondition: re-derive it from a
FRESH manifest read (fresh_snapshot_for_branch, which bypasses the cache)
immediately before force-deleting. If fresh authority shows the table is on the
branch, refuse with a retryable conflict instead of destroying a valid fork.
Correct for any caller regardless of snapshot staleness. Also stop branching on
Lance's exact RefConflict prose (loosened match; typed-variant is the durable
follow-up). Addresses PR review (Codex P1, Greptile P2).
* fix: cover delete-cascade edges in up-front fork-queue acquisition
A node delete cascades to every edge table touching that node
(execute_delete_node), forking those edge tables during execution. But touched_table_keys
derived the up-front fork-queue set from the IR ops alone (just node:Type), so a
branch delete that forks node + cascade edges held only the node queue —
commit_all then saw cascade-edge keys it had no guard for.
The touched set is a pure function of (IR ops + catalog), so compute the
COMPLETE set: op types plus, for delete-node ops, the cascade edges derived the
same way the executor derives them (from_type/to_type match). Pre-computed now
equals actual by construction.
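
A sketch of computing the complete touched set from ops plus catalog, assuming simplified types. `Op`, `EdgeDef`, and the `edge:` key prefix are hypothetical; `touched_table_keys`, `from_type`, `to_type`, and the `node:Type` key shape come from the message above.

```rust
use std::collections::BTreeSet;

/// Hypothetical, reduced view of the mutation IR.
pub enum Op {
    Insert { node_type: String },
    DeleteNode { node_type: String },
}

/// Hypothetical catalog entry for an edge table.
pub struct EdgeDef {
    pub name: String,
    pub from_type: String,
    pub to_type: String,
}

/// Op types plus, for delete-node ops, every edge table whose endpoint
/// matches the deleted node type.
pub fn touched_table_keys(ops: &[Op], edges: &[EdgeDef]) -> BTreeSet<String> {
    let mut keys = BTreeSet::new();
    for op in ops {
        match op {
            Op::Insert { node_type } => {
                keys.insert(format!("node:{node_type}"));
            }
            Op::DeleteNode { node_type } => {
                keys.insert(format!("node:{node_type}"));
                for edge in edges
                    .iter()
                    .filter(|e| &e.from_type == node_type || &e.to_type == node_type)
                {
                    keys.insert(format!("edge:{}", edge.name));
                }
            }
        }
    }
    keys
}
```

Because the function reads only the ops and the catalog, the queue set can be acquired before any execution starts.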
Also promote commit_all's held-guard coverage check out of debug_assert into an
all-builds check that fails the write with a typed manifest_internal error: a
load-bearing serialization invariant must fail loudly and safely in release, not
silently proceed unguarded if a future execution path ever touches a table
outside the pre-computed set.
Adds branch_cascade_delete_forks_node_and_edges_under_held_queues, which drives
the cascade path on a branch (the gap the existing insert/load tests missed).
Addresses PR review (Cursor medium, Greptile P2).
* fix(cleanup): serialize fork reclaim against in-process live writers
The broadened per-table reconciler force_delete'd an orphan candidate on a LIVE
branch without holding the per-(table, branch) write queue. An in-process
first-write fork in its fork->publish window holds that queue and has not yet
advanced the manifest, so it looks exactly like an origin-2 orphan — concurrent
cleanup could delete the ref the writer still holds and is about to publish.
(The old branch-name-based reconciler did not have this race: a deleted branch
cannot have a live first-write.)
Bring the reconciler under the same invariant the write-path reclaim already
obeys: never force_delete a fork ref without holding the (table, branch) write
queue AND confirming, under it, from a fresh read, that the ref is still
manifest-unreferenced. Acquire one key at a time (no lock-order inversion vs
multi-table acquire_many writers); if the writer published meanwhile, the fresh
re-check sees the table on the branch and skips. Cross-process writers remain
the documented one-winner-CAS gap. Addresses PR review (Cursor high).
* fix: classify create_branch failure by ref existence, not by failure
fork_branch_from_state mapped ANY create_branch failure to RefAlreadyExists,
routing transient I/O / version / Lance-internal errors into the destructive
reclaim path and masking the real error as a retryable conflict.
Branch on the actual fact instead: on create_branch failure, check whether the
ref exists (list_branches). Only a genuinely pre-existing ref — a fully-formed
manifest-unreferenced fork — is a reclaim candidate; any other failure
propagates with fidelity. We deliberately do NOT force-delete on a not-found-ref
failure: it is indistinguishable from a transient error on a fresh create, and
force-deleting there is the overreach the fresh-authority guard already removed.
A phase-1-only Lance zombie (rarer; create_branch interrupted mid its two
internal phases) surfaces as the propagated error for manual reclaim.
Addresses PR review (Cursor medium).
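
The classification rule above, sketched under assumptions: `ForkAttempt` and `classify_create_result` are hypothetical names, and the `existing_refs` slice stands in for the result of a `list_branches` call.

```rust
/// Hypothetical classification of one create_branch attempt.
#[derive(Debug, PartialEq)]
pub enum ForkAttempt {
    Created,
    /// The ref genuinely pre-exists: a reclaim candidate.
    RefAlreadyExists,
    /// Any other failure: propagated unchanged, never routed to reclaim.
    Failed(String),
}

/// Branch on whether the ref exists, not on the mere fact of failure.
pub fn classify_create_result(
    result: Result<(), String>,
    existing_refs: &[&str],
    branch: &str,
) -> ForkAttempt {
    match result {
        Ok(()) => ForkAttempt::Created,
        Err(_) if existing_refs.contains(&branch) => ForkAttempt::RefAlreadyExists,
        Err(e) => ForkAttempt::Failed(e),
    }
}
```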
* fix(cleanup): skip (not delete) on a transient re-check error for a live branch
The reconcile pre-delete re-check treated ANY fresh_snapshot error as 'still an
orphan' and proceeded to force_delete. A transient manifest read failure on a
LIVE branch could therefore destroy a fork the manifest still considers
legitimate — inconsistent with the write-path reclaim (aborts on the same error)
and the candidate scan (skips on snapshot failure).
Distinguish the two origins under the queue: a branch absent from the manifest
authority (origin 1) is a confirmed orphan and is deleted without a fresh read
(no live writer can hold a deleted branch's queue); a LIVE branch (origin 2)
gets the fresh re-check and, on a transient read error, is SKIPPED — never
destroyed on ambiguity — converging on a later cleanup. Same don't-destroy-on-
ambiguous-error principle as the create_branch failure classification.
Addresses PR review (Cursor medium).
* fix(cleanup): unify fork-ref reclaim on fresh authority under the queue
Consolidates the reconcile/reclaim hardening from PR review (the earlier per-site
commits were collapsed when reconciling with the main merge). Both destructive
fork-ref sites — the write-path reclaim and the cleanup reconciler — now share
one classifier, classify_fork_ref -> ForkRefStatus { Legitimate, Orphan,
Indeterminate }, evaluated from FRESH manifest authority under the held
(table, branch) write queue. A fork ref is destroyed ONLY on a confirmed Orphan;
a Legitimate (concurrent writer published a real fork) or Indeterminate
(transient read) status is never destroyed — the write path maps it to a
retryable conflict, cleanup maps it to skip. This closes, by construction:
- reclaim trusting a possibly-cached caller proof (Codex P1);
- reconcile racing an in-process live fork without the queue (Cursor);
- delete-on-transient-error in the re-check (Cursor/Greptile);
- origin-1 trusting a stale live_branches capture for a created-since branch
(Cursor/Greptile P1).
Having one classifier removes the duplication that let the two sites drift.
ForkOutcome is made pub to match the sealed trait method returning it. Verified
green on Lance 7.0.0 (full engine suite + 48/48 failpoints).
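
A sketch of the shared classifier's decision table, assuming a simplified signature. `ManifestView` and `may_destroy` are hypothetical; `classify_fork_ref` and the three `ForkRefStatus` variants are named in the message above, but the real function's parameters are not shown there.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum ForkRefStatus {
    Legitimate,
    Orphan,
    Indeterminate,
}

/// Hypothetical manifest view: live branch name -> tables placed on that branch.
pub type ManifestView = HashMap<String, HashSet<String>>;

/// Classify a fork ref from a fresh manifest read. A failed read is never
/// treated as proof of orphanhood.
pub fn classify_fork_ref(
    fresh: Result<&ManifestView, String>,
    table: &str,
    branch: &str,
) -> ForkRefStatus {
    let manifest = match fresh {
        Ok(m) => m,
        Err(_) => return ForkRefStatus::Indeterminate,
    };
    match manifest.get(branch) {
        Some(tables) if tables.contains(table) => ForkRefStatus::Legitimate,
        _ => ForkRefStatus::Orphan, // table not placed, or branch absent (ghost)
    }
}

/// Only a confirmed Orphan may be destroyed.
pub fn may_destroy(status: &ForkRefStatus) -> bool {
    matches!(status, ForkRefStatus::Orphan)
}
```

The write path would map a non-destroyable status to a retryable conflict and cleanup would map it to a skip, as the message describes.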
* test(cleanup): pin classify_fork_ref decision (Legitimate / Orphan / ghost)
Both fork-ref reclaim sites (write-path reclaim + cleanup reconciler) route
their destroy/skip decision through classify_fork_ref, but it had no direct
test — reverting the fresh-authority logic was not test-detectable. Add a
deterministic in-source unit test that forges each state and asserts the status:
a manifest-placed fork -> Legitimate (never destroyed); a ref the manifest does
not place on the branch -> Orphan; a ref for a branch absent from the manifest
-> Orphan (ghost reclaim preserved). This makes the core fresh-authority
decision behind every reclaim fix revert-detectable in one place.
(The Indeterminate arm — transient read on a live branch -> skip — needs an
injected read failure and is left to the failpoints suite; the cross-process
cleanup-vs-writer and cached-snapshot reclaim races are the documented
one-winner-CAS gap, not reachable same-process bugs, so they are not faked here.)
* test(cleanup): pin the Indeterminate (transient re-check) reclaim arm
Closes the last untested classify_fork_ref arm. Adds a 'classify.fresh_read'
failpoint (no-op without the failpoints feature) that simulates a transient
failure of the fresh-authority read, and a failpoints test driving it through
cleanup: a genuine origin-2 orphan on a LIVE branch whose fresh re-check fails
classifies as Indeterminate, so the reconciler SKIPS it (never destroys on an
inconclusive read) and reclaims it on the next run once the read succeeds.
This makes the don't-destroy-on-ambiguity rule revert-detectable end-to-end.
The only paths now left untested are the cross-process cleanup-vs-writer and
reclaim-vs-publish races — the documented one-winner-CAS gap (cleanup is
&mut self / CLI-only, so no reachable same-process race), not faked here.
* test(server): avoid stale schema apply route handle
* fix(cleanup): report indeterminate fork authority clearly
1a4d2cee97
feat: inline query strings in CLI and HTTP server (#110)
* feat(MR-656): inline query strings in CLI and HTTP server
CLI:
- Add -e / --query-string <STRING> to omnigraph read and omnigraph change
- Exactly one of --query, --query-string, --alias is required (3-way XOR)
- Empty --query-string is rejected with a clear error
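
The three-way XOR and the empty-string rejection can be sketched as one validation function. `QuerySource` and `resolve_query_source` are hypothetical names, and the error strings are illustrative rather than the CLI's actual wording.

```rust
/// Hypothetical resolved source of the query text.
#[derive(Debug, PartialEq)]
pub enum QuerySource {
    File(String),
    Inline(String),
    Alias(String),
}

/// Accept exactly one of --query, --query-string, --alias.
pub fn resolve_query_source(
    query: Option<String>,
    query_string: Option<String>,
    alias: Option<String>,
) -> Result<QuerySource, String> {
    match (query, query_string, alias) {
        (Some(path), None, None) => Ok(QuerySource::File(path)),
        (None, Some(text), None) => {
            if text.is_empty() {
                Err("--query-string must not be empty".to_string())
            } else {
                Ok(QuerySource::Inline(text))
            }
        }
        (None, None, Some(name)) => Ok(QuerySource::Alias(name)),
        _ => Err("exactly one of --query, --query-string, --alias is required".to_string()),
    }
}
```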
HTTP:
- New POST /query (read-only, clean field names: query/name/params/branch/snapshot)
- Mutations on /query are rejected with 400 -- use POST /change instead
- ChangeRequest fields polished: query (alias query_source), name (alias query_name)
- POST /read and POST /change remain byte-compatible for existing clients
Tests:
- cli.rs: -e happy-path on read/change, mutex error vs --query, empty -e rejected
- system_local.rs: inline -e read and -e change exercise the local flow
- system_remote.rs: inline -e read/change over HTTP plus direct /query 200/400
- server.rs: /query 200, /query 400 on mutation, /change legacy field alias
- openapi.rs: new /query path, QueryRequest schema, ChangeRequest field-name polish
Docs: cli.md (-e examples), cli-reference.md (read/change rows), server.md (/query)
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* feat(MR-656): rename read/change to query/mutate with deprecation signals
HTTP server:
- Add POST /mutate as canonical write endpoint (pairs with POST /query).
- Mark POST /read and POST /change as deprecated. Three-channel signal:
* OpenAPI: `deprecated: true` on the operation (every codegen flags
the generated SDK method).
* RFC 9745: response `Deprecation: true` header on every response.
* RFC 8288: response `Link: </successor>; rel="successor-version"`
pointing at /query and /mutate respectively.
- Share business logic across /mutate and /change via run_mutate(); the
/change wrapper is the only place that adds the deprecation headers.
- ChangeRequest field aliases (query_source/query_name) preserved.
- AliasCommand serde now accepts `query`/`mutate` alongside `read`/`change`.
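
A sketch of the two response headers added by the deprecated wrappers, using the header values quoted in the list above. `deprecation_headers` is a hypothetical helper; the real handlers presumably attach these through the server framework's response types.

```rust
/// Return the deprecation headers for a legacy path, or nothing for a
/// canonical one.
pub fn deprecation_headers(path: &str) -> Vec<(&'static str, String)> {
    let successor = match path {
        "/read" => "/query",
        "/change" => "/mutate",
        _ => return Vec::new(), // canonical endpoints carry no signal
    };
    vec![
        ("Deprecation", "true".to_string()),
        ("Link", format!("<{successor}>; rel=\"successor-version\"")),
    ]
}
```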
CLI:
- Promote `omnigraph query` / `omnigraph mutate` to top-level canonical
subcommands (clap visible_alias keeps `omnigraph read` / `omnigraph
change` working forever).
- Promote `omnigraph lint` / `omnigraph check` to top-level (was nested
under `omnigraph query lint`, which is now a deprecated argv shim that
rewrites to the canonical form).
- Argv-level preprocessing prints a one-line deprecation warning to
stderr when any legacy spelling is used. Canonical names are silent.
Tests:
- Server: /mutate works, /change emits Deprecation+Link headers, /read
emits Deprecation+Link headers, /query carries no deprecation signal.
- OpenAPI: /read and /change flagged deprecated; /query and /mutate not.
- CLI: canonical `lint` matches deprecated `query lint` / `query check`
output; `read` / `change` print deprecation warnings.
Docs:
- cli.md: new canonical examples; "Deprecated names" migration table.
- cli-reference.md: top-level table updated; aliases.<name>.command
accepts both legacy and canonical spellings.
- server.md: endpoint inventory shows /query and /mutate as canonical
and /read and /change as deprecated; dedicated section explains the
three-channel deprecation signal.
- og-cheet-sheet.md: use new `omnigraph lint` / `omnigraph check`.
- openapi.json regenerated.
Migration is purely cosmetic — every deprecated form continues to work
indefinitely; only the spelling changes.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* fix(MR-656): address Devin Review findings on /query and /change
Two issues raised by Devin Review on PR #110:
1. `POST /query` mutation-rejection error pointed at the deprecated
`/change` endpoint instead of the canonical `/mutate`. Fixed in
three places: the runtime error message in `server_query`, the
utoipa 400-response description, and the handler doc comment. The
`QueryRequest` schema docstrings in `api.rs` got the same update so
the openapi.json bodies match. Server and openapi tests updated.
2. `execute_change_remote` serialized `ChangeRequest` directly, which
emits the new canonical field names `query` / `name` on the wire.
`#[serde(alias = "query_source")]` only affects deserialization, so
a newer CLI talking to an older server would have its `/change`
POST body fail with "missing field: query_source". Fixed by
extracting a `legacy_change_request_body` helper that hand-rolls
the JSON with the legacy keys (`query_source` / `query_name`), the
same byte-stable contract `execute_read_remote` already uses
against `/read`. Added two unit tests on the helper to lock the
wire shape in.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* docs(dev): RFC 001 — inline + stored queries, envelope, MCP
Tracked artifact consolidating the design across MR-656 (this branch),
MR-976 (Phase 1 envelope hardening parent, with MR-977/978/979/980
sub-issues), and MR-969 (stored queries + MCP).
Sections:
* Two paths, one engine — inline `/query` + `/mutate` (this PR) coexist
with stored `/queries/{name}` (MR-969). Same `run_query` / `run_mutate`
backend (the fold-in landed in the previous commit).
* Request envelope ("before") — Idempotency-Key, If-Match, X-Deadline,
X-Trace-Id, expect, dry_run, fields. Phase 1 ships the load-bearing
subset on `/mutate`.
* Response envelope ("after") — audit_id, snapshot_id, commit_id, stats,
warnings. Closes the provenance loop today's `ChangeOutput` leaves
open.
* `.gq` pragmas — `@description`, `@returns`, `@mcp`. Source-of-truth
for the stored-query agent contract; no separate YAML registry.
* Multi-graph MCP — per-graph `/graphs/{id}/mcp/tools` + `/mcp/invoke`.
Token binds to one graph by default; cross-graph agents loop.
* Cedar split — `read`/`change` for inline, `invoke_query` for stored.
Operators deny ad-hoc for agent groups while keeping curated tool
list open.
* Rejected alternatives — per-env override files, compiled bundles,
tool-name prefixing across graphs, body-field graph dispatch.
Index entry added under "Active Implementation Plans" so future agents
land on the RFC before touching queries / mutations / envelope code.
`scripts/check-agents-md.sh` clean (35 links, 34 docs).
* docs(server): clarify why run_query lacks AppState parameter
run_mutate takes state for workload admission; run_query doesn't because
reads aren't admission-gated today. Mark the asymmetry as intentional and
flag the two future events that would grow the signature: Phase 1's
`expect: { max_rows_scanned: N }` budget (MR-976) or per-actor admission
extending to stored-read invocations (MR-969). Prevents the natural
"make these symmetrical" follow-up.
* refactor(server): run_query / run_mutate take &ResolvedActor
Replace `Option<Extension<ResolvedActor>>` in the helpers with
`Option<&ResolvedActor>`. Saves MR-969's stored-query handler from
wrapping a bare actor in axum's `Extension(...)` before calling.
Handler signatures (`server_query`, `server_read`, `server_mutate`,
`server_change`) keep `Option<Extension<ResolvedActor>>` because that
is what axum injects, and unwrap at the call site with
`actor.as_ref().map(|Extension(actor)| actor)`.
Net: -13/+10 LOC, 89/0 server tests pass.
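
The unwrap at the call site can be illustrated without axum. Here `Extension` is a local tuple struct standing in for axum's type, and the bodies of `run_query` and `server_query` are placeholders; only the `as_ref().map(...)` expression mirrors the commit message.

```rust
/// Local stand-in for axum's `Extension` wrapper.
pub struct Extension<T>(pub T);

/// Reduced, hypothetical actor type.
pub struct ResolvedActor {
    pub id: String,
}

/// Helper takes a plain borrowed actor, so any caller can pass one.
pub fn run_query(actor: Option<&ResolvedActor>) -> String {
    match actor {
        Some(a) => format!("query as {}", a.id),
        None => "query as anonymous".to_string(),
    }
}

/// Handler keeps the wrapped type and unwraps at the call site.
pub fn server_query(actor: Option<Extension<ResolvedActor>>) -> String {
    run_query(actor.as_ref().map(|Extension(actor)| actor))
}
```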
* docs(releases): v0.6.0 — describe inline + canonical-named queries (MR-656)
Extend the v0.6.0 release notes to cover the third piece of work landing
alongside the graph terminology rename and multi-graph server mode:
canonical-named `POST /query` and `POST /mutate` endpoints, the CLI's
new `-e/--query-string` flag, the top-level promotion of `lint` /
`check`, and the three-channel deprecation signal on `/read` and
`/change` (OpenAPI `deprecated: true` + RFC 9745 + RFC 8288).
Additions:
* Top blurb: "Two pieces" -> "Three pieces" with a bullet describing
the rename + inline flow.
* Breaking Changes: new "Query / mutation rename" subsection covering
the `ChangeRequest` field rename (with the back-compat serde aliases
and the CLI's `legacy_change_request_body` byte-stable wire helper)
and the `omnigraph query lint` -> `omnigraph lint` move.
* New: 5 bullets — the two endpoints, the CLI subcommands, the `-e`
flag, the deprecation signal channels, the widened `aliases.<name>.command`
vocabulary.
* User Impact: one bullet making explicit that the rename is cosmetic
on the client side and migration is voluntary.
* Documentation: pointers to the updated `server.md` / `cli.md` /
`cli-reference.md` and the new `docs/dev/rfc-001-queries-envelope-mcp.md`.
+15/-1 lines. `./scripts/check-agents-md.sh` clean.
* refactor(cli): demote `check` from visible_alias to deprecation shim
`omnigraph check` was a clap `visible_alias` on `lint`, advertised in
`--help` as an equivalent canonical name. Per MR-981 §6 (long-form
flags as canonical, short forms as visible aliases), visible aliases
on subcommand names hurt agent CX: agents emit either spelling
depending on training-data drift, and there's no length signal
pointing at the canonical name.
Changes:
* Remove `#[command(visible_alias = "check")]` from the `Lint` variant.
`omnigraph --help` now shows only `lint`.
* Add bare `check` to `rewrite_deprecated_argv` so `omnigraph check
<args>` still works — it rewrites to `omnigraph lint <args>` and
emits a one-line stderr deprecation warning, matching the existing
pattern for `read` / `change` / `query lint` / `query check`.
* Fix the nested `query check` shim to substitute `check` -> `lint` in
the rewritten argv (previously it relied on `check` being a
visible_alias to reach the `Lint` variant).
* New test `deprecated_check_top_level_rewrites_to_lint` covers: bare
`check` produces identical stdout to `lint`, emits the deprecation
warning, and `check` does NOT appear as an alias in `omnigraph
--help`.
* Release notes updated to reflect the deprecation-shim treatment and
cross-reference MR-981 §6 reasoning.
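
A sketch of an argv-level shim in the spirit of `rewrite_deprecated_argv`. The signature, the warning text, and the choice to rewrite `read` / `change` (rather than only warn) are assumptions; `args` is taken to exclude the program name.

```rust
/// Rewrite legacy spellings to canonical ones and return an optional
/// one-line deprecation warning for stderr.
pub fn rewrite_deprecated_argv(args: &[String]) -> (Vec<String>, Option<String>) {
    let strs: Vec<&str> = args.iter().map(String::as_str).collect();
    // (canonical subcommand, number of leading legacy tokens to drop)
    let (canonical, consumed): (&str, usize) = match strs.as_slice() {
        ["query", "lint" | "check", ..] => ("lint", 2),
        ["check", ..] => ("lint", 1),
        ["read", ..] => ("query", 1),
        ["change", ..] => ("mutate", 1),
        _ => return (args.to_vec(), None), // canonical names stay silent
    };
    let legacy = strs[..consumed].join(" ");
    let mut rewritten = vec![canonical.to_string()];
    rewritten.extend(args[consumed..].iter().cloned());
    let warning = format!(
        "warning: `omnigraph {legacy}` is deprecated; use `omnigraph {canonical}`"
    );
    (rewritten, Some(warning))
}
```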
`check` keeps working indefinitely for Cargo / Go users who type it; one stderr
nudge per invocation teaches the canonical name. Agents see only
`lint` in `--help --json` so they emit one canonical form.
67/0 omnigraph-cli tests pass; 39 workspace test suites green.
---------
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Ragnor Comerford <hello@ragnor.co>
cc2412dc65
Rename repo terminology to graph (#118)
6a3f0677ae
server: drop unwired try_admit_rewrite / 503 admission surface
64f2b994f5
bench: assert --heavy-concurrency > 0 instead of silently clamping
Closes the cubic P2 finding on commit
22d76dbb40
server+bench: AppState::new_with_workload; bench drops set_var, exercises heavy cap
Two cubic findings on bench_actor_isolation.rs flagged together:
P2 (lib.rs:202): `unsafe { std::env::set_var(...) }` ran inside
`#[tokio::main] async fn main()` AFTER the multi-thread tokio runtime
was up. Rust 2024 made `set_var` unsafe because libc's `setenv` is
not thread-safe; concurrent env reads from logging or runtime
internals can race or read torn state.
Fix (correct by design, AGENTS.md rule 9): add a public
`AppState::new_with_workload(uri, db, bearer_tokens, workload)`
constructor that takes a caller-built `WorkloadController`. Tests and
benches override per-actor caps via the constructor instead of
mutating global env. Closes the bug class "tests need to mutate
global env to override AppState defaults."
P2 (lib.rs:130): heavy actor's `oneshot.await` inside the loop
serialized — heavy in-flight count was always 1, so cap=1 never
tripped on the heavy side. The bench validated isolation (light p99
bounded) but didn't demonstrate the rejection path.
Fix: add a `--heavy-concurrency` arg (default 4) and spawn batches
as concurrent tokio tasks bounded by an internal semaphore. With
heavy_concurrency=4 and inflight_cap=1, the bench now reports
heavy_too_many_requests > 0 and heavy_ok == 1 at peak — proving the
gate fires for the heavy actor.
Sample run on local FS (4 light actors × 30 ops, 20 heavy batches ×
50 rows, heavy_concurrency=4, cap=1):
heavy_ok: 1
heavy_too_many_requests: 19
light_ok: 120
light_too_many_requests: 0
light_p99: 565 ms (target < 2 s)
Heavy saturates its own cap; light actors are completely unaffected.
The isolation property is now empirically proven by the rejection
counts rather than just by the latency tail.
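
A toy model of per-actor in-flight caps that reproduces the isolation property in miniature. `WorkloadController` is the name used above, but this struct, `try_admit`, and `release` are hypothetical; the real controller is injected through `AppState::new_with_workload` and its API is not shown in the message.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical per-actor admission gate with a fixed in-flight cap.
pub struct WorkloadController {
    cap: usize,
    inflight: Mutex<HashMap<String, usize>>,
}

impl WorkloadController {
    pub fn new(cap: usize) -> Self {
        Self { cap, inflight: Mutex::new(HashMap::new()) }
    }

    /// Returns false (the caller would answer 429) when the actor is at its cap.
    pub fn try_admit(&self, actor: &str) -> bool {
        let mut map = self.inflight.lock().unwrap();
        let count = map.entry(actor.to_string()).or_insert(0);
        if *count >= self.cap {
            false
        } else {
            *count += 1;
            true
        }
    }

    /// Release one slot previously granted to `actor`.
    pub fn release(&self, actor: &str) {
        let mut map = self.inflight.lock().unwrap();
        if let Some(count) = map.get_mut(actor) {
            *count = count.saturating_sub(1);
        }
    }
}
```

Passing a caller-built controller into the constructor lets a test pick the cap directly instead of mutating process-wide environment variables.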
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b09a0972cb
bench: add actor-isolation harness for WorkloadController
Empirical proof of MR-686's central design promise: per-actor
admission control isolates noisy actors from light traffic. The
existing bench_concurrent_http harness measures aggregate throughput;
this harness measures the latency tail seen by light actors while a
heavy actor saturates its own per-actor cap.
Setup: one "heavy" actor flooding /ingest with multi-row NDJSON
batches; N "light" actors each running short bursts of /change
inserts, each authenticating with a distinct bearer token so the
WorkloadController accounts them as separate identities.
Output: heavy throughput / 429 count, light p50/p95/p99/max latency.
Acceptance heuristic on local FS: light-actor p99 < 2 s while the
heavy actor saturates its own cap.
Sample run on local FS, cap=1, 4 light actors x 30 ops, 20 heavy
batches x 50 rows: light p99 = 710 ms, light errors = 0 (well under
the 2 s acceptance target). The test demonstrates the isolation
property — the heavy /ingest holds its own admission slot but
doesn't affect light actors since they have separate per-actor
state.
Usage:
cargo run --release -p omnigraph-server --example bench_actor_isolation -- \
--light-actors 4 --light-ops-per-actor 30 \
--heavy-batches 20 --heavy-rows-per-batch 50 \
--inflight-cap 1 \
--output .context/bench-results/after-pr2-phase2/actor-isolation.json
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fcb47620d3
mr-686: bundle PR 0/1a/1b foundation + PR 2 catalog/schema_source ArcSwap
Bundles the working-tree state from the prior session (PR 0 bench harness,
PR 1a audit_actor_id removal, PR 1b WriteQueueManager + writer integration)
together with the first half of PR 2's interior-mutability foundation
(catalog and schema_source wrapped in Arc<ArcSwap<...>>). The two streams
intermix in 7 of the same files, so splitting via git add -p was
impractical. Subsequent PR 2 steps land as separate atomic commits.
PR 0 — server-level concurrent /change bench harness
- crates/omnigraph-server/examples/bench_concurrent_http.rs (new)
- .context/bench-results/{baseline-main,after-pr1}/ (gitignored)
PR 1a — drop the audit_actor_id field, thread per-call
- removed Omnigraph::audit_actor_id and the swap-restore patterns in
mutation.rs, merge.rs, loader/mod.rs
- actor_id: Option<&str> threaded through MutationStaging::finalize,
mutate_with_current_actor, ingest_with_current_actor,
branch_merge_impl, branch_merge_on_current_target,
commit_prepared_updates*, record_merge_commit,
commit_updates_on_branch_with_expected
- apply_schema and ensure_indices_for_branch pass None (system-attributed)
PR 1b — per-(table_key, branch) write queue + revalidation + sidecar
- new crates/omnigraph/src/db/write_queue.rs with WriteQueueManager,
acquire/acquire_many, sorted+deduped acquisition; 6 unit tests
- Arc<WriteQueueManager> field on Omnigraph + db.write_queue() accessor
- MutationStaging::finalize split into stage_all (Phase A, no queue)
and StagedMutation::commit_all (Phase B, acquire_many + revalidate
pins + sidecar + commit_staged); guards held across publisher
- delete-only mutations now emit recovery sidecars; revalidation
extended to inline_committed tables
- branch_merge_on_current_target, apply_schema_with_lock, and
ensure_indices_for_branch acquire per-table queues for their
touched tables
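
A sketch of the sorted+deduped ordering that `acquire_many` is described as using. `QueueKey` and `acquisition_order` are hypothetical names, and the rationale in the comments is the usual lock-ordering argument rather than something the message spells out.

```rust
/// Hypothetical queue key: (table_key, branch).
pub type QueueKey = (String, String);

/// Compute the order in which per-(table_key, branch) queues are acquired.
/// Sorting gives every multi-key writer the same global order, which avoids
/// lock-order inversion; deduping avoids acquiring one non-re-entrant queue
/// twice in the same call.
pub fn acquisition_order(mut keys: Vec<QueueKey>) -> Vec<QueueKey> {
    keys.sort();
    keys.dedup();
    keys
}
```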
PR 2 Step B (partial) — catalog and schema_source via ArcSwap
- catalog: Catalog -> Arc<ArcSwap<Catalog>>
- schema_source: String -> Arc<ArcSwap<String>>
- public accessors return Arc<Catalog> / Arc<String>; readers bind
locally where the borrow has to outlive an expression
- new pub(crate) store_catalog / store_schema_source helpers replace
the field assignments in apply_schema and reload_schema_if_source_changed
- 117 tests across lifecycle/end_to_end/branching/runs pass; engine
lib + workspace compile clean
Coordinator wrap (Mutex) and the &mut self -> &self engine API
conversion follow in subsequent commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>