Commit graph

425 commits

Author SHA1 Message Date
Andrew Altshuler
2b5fb7197e
Merge pull request #180 from ModernRelay/feat/cluster-local-config
feat(cli): per-operator actor for cluster ops; pin omnigraph.yaml isolation
2026-06-10 23:57:31 +03:00
aaltshuler
3b2bf755ae fix(cli): address review — honor the one-thing contract, restore docs, untangle test phases
- resolve_cluster_actor uses load_config directly: load_cli_config also
  loads auth.env_file into the process env — a second thing, violating the
  documented 'exactly one thing' omnigraph.yaml contract for cluster ops.
- resolve_cli_actor gets its doc comment back (the inserted helper had
  absorbed the contiguous /// block).
- The actor-default test imports once as setup and asserts on apply alone,
  idempotently, instead of re-importing inside the assertion helper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:54:05 +03:00
aaltshuler
fbe9726ac7 test(cli): stop the S3 e2e scaffolding omnigraph.yaml into the crate dir
local_cli_s3_end_to_end_init_load_read_flow ran `omnigraph init` without a
current_dir, so init's project scaffold landed in crates/omnigraph-cli/ —
poisoning any later test that resolves a graph target from the cwd config
(query_lint_requires_schema_or_resolvable_graph_target fails determinis-
tically once the file exists). Only manifests when OMNIGRAPH_S3_TEST_BUCKET
is set, which is why local FS runs and CI's scoped rustfs job never caught
it. The init and load calls now run inside the test's tempdir.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:34:54 +03:00
aaltshuler
99f7f36864 docs(cluster): the precise omnigraph.yaml contract
The 'Relationship to omnigraph.yaml' section becomes the exact rule set:
cluster commands read the per-operator config for exactly one thing (the
cli.actor default when --as is omitted), a --cluster server reads it for
nothing, and pointing data-plane targets at derived roots is ergonomics,
not coupling. Operator guide and CLI reference updated to match.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:30:40 +03:00
aaltshuler
f7368b58a0 test(cli): pin --cluster boot isolation from cwd omnigraph.yaml
A --cluster server process whose cwd contains a MALFORMED omnigraph.yaml
boots and serves — proving mode-inference rule 0 returns before any config
search can run. New spawn_server_with_cluster_in support helper sets the
spawned server's cwd explicitly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:29:49 +03:00
aaltshuler
f3374ac6dc feat(cli): resolve cluster actor via the per-operator config cascade
Cluster FACTS stay unlayered (cluster.yaml only), but the operator's
identity is a per-operator fact — exactly the per-operator omnigraph.yaml's
permanent job, and the cascade every data-plane write already uses. cluster
apply/approve now resolve: --as flag wins and skips any config read
entirely (containers and CI stay config-free); without it, the standard cwd
search supplies cli.actor, with a malformed config failing loudly and
actionably ('pass --as to skip this lookup') rather than silently dropping
attribution. approve's no-actor error now names both sources.

Tests pin the contract from both sides: cli.actor is the no-flag default
for apply (echoed actor) and approve (approved_by), the flag overrides it,
a malformed omnigraph.yaml in cwd breaks nothing except the no-flag actor
lookup, and a conflicting well-formed one leaks nothing into cluster
outputs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:29:49 +03:00
Andrew Altshuler
b8300736be
Merge pull request #179 from ModernRelay/docs/cluster-operator-guide
docs(cluster): operator how-to guide for deploying and managing clusters
2026-06-10 22:25:35 +03:00
aaltshuler
97eb65e921 docs(cluster): operator how-to guide for deploying and managing clusters
New docs/user/cluster.md — the practical companion to cluster-config.md's
reference: zero-to-served walkthrough (validate/import/plan/apply, derived
roots, data loading, --cluster serving), the day-2 edit->plan->apply->restart
loop with a per-change-kind table, drift observation and convergence, the
approval gate for destructive changes, crash/lock/lost-ledger recovery, the
boot-refusal table with remedies, deployment patterns (replicas, backup
unit, CI gating), and the explicit not-yet list (hot reload, S3-hosted
cluster dirs, per-query exposure, pipelines). Linked from the user index,
the agent guide's topic map, and cross-linked from the reference.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:10:19 +03:00
Andrew Altshuler
13d5c52abc
Merge pull request #177 from ModernRelay/test/cluster-full-cycle-e2e
test(cli): comprehensive full-cycle cluster e2e with a live server
2026-06-10 19:17:04 +03:00
Andrew Altshuler
e8833ef980
Merge pull request #178 from ModernRelay/ci/pin-rustfs-beta8
ci: pin RustFS to 1.0.0-beta.8
2026-06-10 19:10:30 +03:00
aaltshuler
d8354ac213 test(cli): address review — assert schema-show success, document exit-code stance, add e2e opt-out
- The drift-heal verification now asserts `schema show` succeeded and
  produced a schema before checking the rogue field's absence (a failed
  command previously made the negative assertion vacuously pass).
- cluster_cli documents why it deliberately does not assert exit codes
  (blocked applies exit non-zero by contract while emitting the structured
  output callers assert on).
- The comprehensive lifecycle e2es honor OMNIGRAPH_SKIP_SYSTEM_E2E=1
  (graceful skip-with-message, the S3-gate pattern) for constrained
  sandboxes; requirements + suppression documented in testing.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:05:12 +03:00
aaltshuler
711e04a161 ci: pin RustFS to 1.0.0-beta.8
beta.4+ refuses the rustfsadmin/rustfsadmin test credentials unless
RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true is set — acceptable for the
ephemeral CI container and the local bootstrap script (which already passed
it). The three S3 suites were validated against the beta.8 binary locally
before this bump. The pin stays explicit, never `latest`, so future
upgrades remain deliberate.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:44:05 +03:00
aaltshuler
7d70811df1 test(cli): comprehensive full-cycle cluster e2e with a live server
Two system tests composing the whole Phase 1-5 surface with real binaries:

- local_cluster_full_lifecycle_declare_serve_evolve_delete: declare two
  graphs -> one apply creates and converges them -> the --cluster server
  serves both stored queries -> schema+query evolve in one apply (migration
  previewed in plan) -> restart serves the new shape -> out-of-band schema
  drift observed by refresh and converged back by apply (rogue field
  soft-dropped) -> approved graph delete -> restart serves the survivor and
  404s the tombstoned graph -> final plan empty. Catches composition
  regressions where each stage passes its own tests but the lifecycle
  breaks (the composite_flow.rs principle at the control-plane level).

- local_cluster_serving_enforces_applied_policy_bindings: applied policy
  bundles gate serving per their bindings over HTTP with bearer-resolved
  actors — the cluster-bound bundle owns graph_list (admin 200, reader 403,
  anonymous 401), the graph-bound bundle owns invoke_query (reader gets
  rows; denied invocation is the documented anti-probing 404).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:07:29 +03:00
Andrew Altshuler
af6a1096b0
Merge pull request #176 from ModernRelay/feat/server-cluster-boot-5b
feat(server,cluster): Phase 5 — omnigraph-server boots from cluster state
2026-06-10 18:00:57 +03:00
aaltshuler
711865e6f1 docs(cluster,server): the Phase 5 mode switch; retire applied-not-serving caveats
The standing caveat ('applied means recorded in the cluster catalog —
nothing more; the server still boots from omnigraph.yaml') retires: cluster
docs gain the 'Serving from the cluster' section (exclusivity, applied-
revision serving, fail-fast readiness, restart-to-pick-up, expose-all
bridge), server.md gains mode-inference rule 0 and the cluster-booted multi
mode, deployment.md the boot-source choice, and the CLI's apply note plus
the cli-reference cluster row (stale back to Stage 3A) now describe the full
convergence surface. RFC-005 flips to Landed with four implementation
deviations recorded.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:56:54 +03:00
aaltshuler
f3eb60fa4e test(cli): applied-means-serving system e2e
The Phase-5 contract end to end with real binaries: cluster import + apply
via the CLI, seed a row through the graph plane, boot omnigraph-server with
--cluster (no omnigraph.yaml anywhere), and the applied stored query serves
the row over HTTP through the multi-graph routes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:51:40 +03:00
aaltshuler
948a54daa7 feat(server): boot from cluster state via --cluster
RFC-005 §D1/§D2: omnigraph-server --cluster <dir> is rule 0 of the mode
inference — an exclusive boot source (hard error when combined with a graph
URI, --target, or --config) that never opens omnigraph.yaml, not even the
implicit current-directory search. The cluster branch reads the applied
revision through omnigraph-cluster's serving-snapshot API and feeds the
EXISTING multi-graph pipeline: GraphStartupConfig per recorded graph at its
derived root, stored queries built via QueryRegistry::from_specs from
verified blob content (expose-all — the §D5 bridge until Phase 6
policy-owned exposure), cluster-bound policy bundles as the server-level
Cedar engine and graph-bound bundles per graph, straight from the
content-addressed blob paths. Multiple bundles binding one scope refuse boot
(one-bundle-per-scope is the serving pipeline's shape; stacking is a later
slice). Everything downstream — parallel opens, query type-checking,
registry, routing, auth, OpenAPI — is reused unchanged; cluster mode is a
new source, not a new pipeline.

First server->cluster crate dependency: read-only types + one fn;
omnigraph-cluster stays HTTP-free. open_multi_graph_state goes pub for
integration tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:48:10 +03:00
aaltshuler
f5b43164b8 feat(cluster): pub read-only serving-snapshot API
RFC-005 §D2/§D4: read_serving_snapshot reads the applied revision as
everything a server needs to boot — graphs at derived roots, stored-query
sources read from the content-addressed catalog and re-hashed against the
recorded digests, policy blob paths with their applied applies_to bindings.
All-or-nothing: missing state, pending recovery sidecars, missing/tampered
blobs, pre-5A entries without bindings, and an empty graph set each refuse
the snapshot with a remedy; no partial serving. Lock-free by design — the
state file is replaced atomically, so the read is a consistent
point-in-time ledger.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:39:26 +03:00
Andrew Altshuler
bed36a8423
Merge pull request #175 from ModernRelay/feat/cluster-policy-bindings-5a
feat(cluster): Slice 5A — policy applies_to bindings in the applied revision
2026-06-10 16:57:26 +03:00
aaltshuler
6c98560dde docs(cluster): document policy binding metadata (5A)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:30:57 +03:00
aaltshuler
0b84b1adc3 feat(cluster): record policy applies_to bindings in the applied revision
Slice 5A of RFC-005: the state ledger becomes serving-sufficient for the
Phase-5 server boot. StateResource gains an optional applies_to (normalized
typed refs: cluster | graph.<id>), written by apply for every applied policy
create/update from the desired config's validated bindings.

The hole this closes: applies_to is not part of the policy file digest, so a
binding-only edit previously produced NO plan change at all (a 4C e2e even
asserted that — the gap, not a contract). Binding changes are now
first-class: a post-diff pass emits an Update with equal before/after
digests and a binding_change marker (visible in plan/apply JSON and human
output as [bindings]), classification/execution treat it as an ordinary
catalog-tier applied change (payload skips naturally — the blob is
unchanged), and convergence requires zero binding divergence, so stale
bindings can never report converged. Pre-5A ledger entries (no bindings
recorded) surface as the same backfill Update; one apply heals them, exactly
the remedy RFC-005's boot-error path names.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:30:33 +03:00
Andrew Altshuler
3e8f103804
docs(cluster): RFC-005 — server boots from cluster state (Phase 5 design) (#174)
The axiom-15 mode switch: omnigraph-server --cluster <dir> (mutually
exclusive with uri/--target/--config, zero omnigraph.yaml reads) serves the
APPLIED revision — graph set from state, query/policy content from the
content-addressed catalog at applied digests, cluster-scoped policy bundles
as the server-level Cedar engine. The load-bearing finding: state is not yet
serving-sufficient (policy applies_to bindings live only in cluster.yaml), so
slice 5A records binding metadata into the applied revision at apply time —
without it, boot-from-state silently becomes the merged read axiom 15
forbids. Fail-fast readiness table (missing state, pending sidecars, missing
blobs, unbound policies all refuse boot with remedies), the expose-all
mcp.expose bridge with its Phase 6 sunset, the operator migration path (exit
criterion 7), and 5A/5B/5C sequencing. The existing boot pipeline
(GraphStartupConfig -> registry -> routing/auth) is reused as-is — a new
source, not a new pipeline.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:22:12 +03:00
Andrew Altshuler
61da7bf406
docs(cluster): descope ETL pipelines to a separate project; keep the socket (#172)
Pipelines (scheduler, connectors, mapping, idempotency, run ledger) leave the
cluster control-plane rollout and become their own project with their own
RFC. This rollout guarantees only the socket, all of which already exists and
is enforced: the pipelines: config field is reserved (typed
future_phase_field rejection, test-covered), the pipeline.<name> typed
address and Pipeline resource kind are reserved in the resource model, and
axiom 13 fixes the contract any future implementation must satisfy
(definition reconciled, execution data-plane, fan-out statusful). The ETL
section in the high-level spec stands as the requirements record for that
project; exit criterion 9 defers to its RFC.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:53:16 +03:00
Andrew Altshuler
14b85a59de
Merge pull request #173 from ModernRelay/feat/cluster-graph-delete-4c
feat(cluster): Stage 4C — gated graph delete; Phase 4 complete
2026-06-10 14:53:11 +03:00
aaltshuler
c949a2b717 docs(cluster): document Stage 4C — Phase 4 complete
Approvals + gated graph deletion in the user docs, the approve command in the
CLI reference, RFC-004 flipped to Landed with its three implementation
deviations recorded (row-8 retire-and-repropose, --as instead of --actor/--by,
consumed artifacts rewritten in place rather than moved).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:44:12 +03:00
aaltshuler
87691fe9c7 test(cluster): failpoint coverage for delete crash windows
- Crash before the removal: root intact, approval file unconsumed, sidecar
  survives, no ack; the next run retires the stale intent (row 8) and the
  still-approved delete completes in the same run.
- Crash after the removal, before the state CAS: root gone, ledger
  byte-identical, the sidecar carries the approval id; the next run's sweep
  rolls the tombstone forward, consumes the approval, audits the recovery,
  and converges (row 7b).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:34:54 +03:00
aaltshuler
d1d04217ab feat(cluster): execute approved graph deletes in cluster apply
Stage 4C execution half (RFC-004 §D5/§D6 + sweep rows 7/7b/8): an approved
graph.<id> delete — and its riding schema/query deletes — classifies Applied
and executes LAST in the run, sidecar-fenced: pre-op manifest pin (best
effort; partial roots still delete), approval_id carried in the sidecar,
recursive root removal (NotFound tolerated), subtree tombstoned out of the
ledger with a tombstone observation, the approval consumed in the same state
CAS (ledger summary) and its artifact file rewritten with consumed_at only
after the CAS lands — a failed run consumes nothing and the approval stays
valid for the retry.

Sweep rows: already-tombstoned intents retire (7); a completed delete with a
stale ledger rolls forward — tombstone + approval consumption + audit entry
(7b, idempotent); a still-present root retires the stale intent with a
graph_delete_incomplete warning and the still-approved delete re-executes in
the same run (8) — prefix removal is idempotent, so retry IS the repair.

The multi-graph mixed e2e gets its conclusion: blocked without approval,
cluster approve graph.engineering --as andrew, converge, tombstone visible
in status. Phase 4's disposition matrix is now fully executable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:34:02 +03:00
aaltshuler
f4e9105272 feat(cluster): cluster approve — digest-bound approval artifacts
RFC-004 §D4, gate half: graph deletes (and their subtree) now classify
Blocked/approval_required instead of Deferred; the new cluster approve
command (requires the global --as actor) writes
__cluster/approvals/{ulid}.json bound to the desired config digest and the
change's before/after digests, so config or state drift invalidates the
artifact automatically (approval_stale warning, never authorizes). One gate
per subtree: compute_approvals lists only the graph-level delete, and
ApprovalRequirement gains a satisfied flag surfaced by plan. Consumption and
the delete executor land next — until then approved deletes stay blocked so
a gate-only build can never strip state without removing the root.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:30:05 +03:00
Andrew Altshuler
f799d4578c
Merge pull request #171 from ModernRelay/feat/cluster-schema-apply-4b
feat(cluster): Stage 4B — cluster-driven schema apply
2026-06-10 14:03:31 +03:00
aaltshuler
f217352c93 docs(cluster): document Stage 4B schema apply
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:14:20 +03:00
aaltshuler
80cae4e8e1 test(cluster): failpoint coverage for schema-apply crash windows
- Crash before the engine call: sidecar (carrying the --as actor) survives,
  live schema and ledger untouched, no ack; the next run's sweep retires the
  stale intent and the same run applies and converges.
- Crash after the engine call, before the state CAS: the manifest moved with
  the post-op pin in the sidecar, state.json byte-identical; the next run's
  sweep rolls the ledger forward with a schema_apply audit entry and the run
  converges.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:13:15 +03:00
aaltshuler
a1ba4dc413 feat(cluster): execute schema applies in cluster apply
Stage 4B (RFC-004 §D1/§D5): schema.<id> Update changes classify Applied and
execute after graph creates, sequentially and sidecar-fenced — read-write
open (the engine's own recovery runs first), pre-op manifest pin recorded,
apply_schema_as with allow_data_loss: false (soft drops only; hard drops wait
for 4C's approval artifacts), post-op pin rewritten into the sidecar, sidecar
retired only after the final state CAS. Queries gated on a same-plan schema
update unblock (the migration lands first in the same run); failures —
unsupported migrations, lock contention, user branches — surface as
schema_apply_failed with the engine's message, demote dependents via the
origin-aware demotion helper, and stop further graph-moving work.

Schema evolution is now fully cluster-driven (the defer -> manual schema
apply -> refresh loop is gone), and out-of-band schema drift is converged
back by apply as an ordinary soft migration (axiom 8: drift correction is
gated like any change; the recoverable tier needs no approval) — both pinned
by reworked e2es. The multi-graph mixed e2e's deferred row is now
delete-shaped, pre-staging the 4C surface.

Actor: cluster apply accepts the CLI's global --as via the new ApplyOptions /
apply_config_dir_with_options (apply_config_dir delegates unchanged); the
actor is echoed in ApplyOutput and recorded in sidecars and audit entries,
and threads to apply_schema_as so Cedar fires wherever a checker is
installed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:12:15 +03:00
aaltshuler
0571c05ebb feat(cluster): schema-apply recovery sidecar kind and sweep
RecoverySidecarKind::SchemaApply with digest-based sweep classification
(robust to unrelated manifest movement; version pins stay forensic):
ledger-consistent -> sidecar retired (RFC-004 rows 1+2); live digest matches
the intended schema, state stale -> roll forward with composite recompute and
a recovery_records audit entry (row 3); unverifiable or unexpected digests ->
pending, kept, graph-moving work blocked (rows 1-unopenable/6).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:05:42 +03:00
aaltshuler
ca63a9340b feat(cluster): embed schema migration previews in cluster plan
RFC-004 §D7's data-aware preview: for every schema update, plan opens the
live graph read-only and embeds the engine's migration plan (supported flag
+ typed steps) in the change record; the human renderer prints the steps.
Preview failures (unreachable graph, planner error) degrade to the digest
diff with a schema_preview_unavailable warning — planning never blocks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:04:19 +03:00
aaltshuler
b313075476 refactor(cluster): make plan_config_dir async
Mechanical conversion ahead of Stage 4B (plan will preview schema migrations
against live graphs): signature, CLI dispatch, and test callers. Zero
behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:02:12 +03:00
Andrew Altshuler
e6921157cc
Merge pull request #170 from ModernRelay/feat/cluster-graph-create-4a
feat(cluster): Stage 4A — graph create in cluster apply
2026-06-10 05:19:17 +03:00
aaltshuler
cb6c67f196 docs(cluster): document Stage 4A graph create
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 05:00:42 +03:00
aaltshuler
83d77bcb16 test(cluster): failpoint coverage for graph-create crash windows
- Crash before the init (row 1): sidecar survives, nothing moved, no ack;
  the next run's sweep removes the intent and the same run creates and
  converges.
- Crash after the init, before the state CAS (row 4): the graph exists with
  the post-init manifest pin in the sidecar, state.json byte-identical; the
  next run's sweep rolls the ledger forward with a recovery_records audit
  entry and the run converges.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:59:48 +03:00
aaltshuler
c3007369cd feat(cluster): execute graph creates in cluster apply
Stage 4A (RFC-004 §D1/§D5): graph.<id> Create — and its paired schema Create,
which the init carries — classify Applied and execute first in the run,
sequentially and sidecar-fenced: sidecar written before Omnigraph::init at
the derived root, rewritten with the post-init manifest pin, deleted only
after the final state CAS lands. Dependent queries and policies no longer
block on a graph create in the same plan — creates run first, so they apply
in the same run; a create failure demotes them to blocked
(dependency_not_applied) and stops further graph-moving work (loud partials),
with the sidecar left for the sweep to classify. Graphs with a kept recovery
sidecar (rows 5/6) classify Blocked/cluster_recovery_pending, and the sweep's
Drifted/Error statuses are never clobbered by a generic Blocked.

Schema source is re-read and digest-verified under the lock before the init
(the write_resource_payload TOCTOU posture). Plan previews the same
dispositions. e2e fallout updated: a fresh multi-graph config now converges
in one apply; a destroyed root is re-created as an EMPTY graph by the next
apply (declarative convergence — visible in plan, called out in docs); the
new cluster_e2e_declared_graph_created_by_apply pins the no-manual-init flow.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:58:56 +03:00
aaltshuler
bf8cc7a753 feat(cluster): graph-create recovery sidecars and sweep
RFC-004 §D2/§D3 for the graph_create kind. RecoverySidecar records intent
under __cluster/recoveries/{ulid}.json; the roll-forward-only sweep runs at
the start of apply/refresh/import under the state lock and classifies each
survivor by observation: root absent -> intent removed (row 1); outcome
already recorded -> retired (row 2); create completed but state stale ->
ledger rolled forward with a recovery_records audit entry (row 4); partial
root -> Error/graph_create_incomplete, kept, never auto-deleted (row 5);
unexpected schema -> Drifted/actual_applied_state_pending, kept (row 6).
Sweep mutations ride the command's existing CAS write; completed sidecars
are deleted only after that write lands. Read-only status/plan warn
(cluster_recovery_pending) without acting. The apply payload gate now counts
only payload-phase errors so kept-sidecar diagnostics don't abort the run
before their statuses persist.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:50:42 +03:00
aaltshuler
6fbf09d5c9 refactor(cluster): make apply_config_dir async
Mechanical conversion ahead of Stage 4A graph create (which calls the async
Omnigraph::init from inside apply): the fn signature, the CLI dispatch arm,
and every test caller (#[test] -> #[tokio::test]). Zero behavior change; all
60 lib tests and 3 failpoint tests green before and after.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:43:38 +03:00
Andrew Altshuler
26b26999fd
ci(codeowners): aaltshuler owns all paths; remove ragnorc (#169)
Engineering and docs roles both resolve to @aaltshuler; every path
(catch-all, crates/**, docs/**, repo-level docs) now requires their review.
CODEOWNERS and the doc tables regenerated from codeowners-roles.yml via
render-codeowners.py.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:34:17 +03:00
Andrew Altshuler
58c66a54a2
docs(cluster): RFC-004 — graph & schema apply design (Phase 4) (#168)
* docs(cluster): RFC-004 — graph & schema apply design (Phase 4)

The design the implementation spec's exit criteria require before
graph-moving cluster apply ships. Core positions:

- Cluster recovery is roll-forward-only: the engine's own sidecars make every
  graph-level operation atomic within the graph, so the cluster never rolls a
  graph back — its sidecars (__cluster/recoveries/{ulid}.json) classify and
  record, converging the ledger to observable reality (axiom 5) or surfacing
  a loud pending-repair condition. Eight-row decision matrix, every row
  testable with the Stage 3B failpoint harness.
- Irreversible operations (graph delete, allow_data_loss schema apply)
  consume digest-bound approval artifacts written by a new cluster approve
  command and retired into state.approval_records (axiom 11). A stale
  approval can never authorize a different change.
- cluster apply gains an actor, threaded to apply_schema_as so engine Cedar
  enforcement and commit attribution work unchanged; the cluster adds no
  policy engine of its own.
- Deterministic ordering (creates -> schema applies -> catalog -> deletes),
  per-resource apply groups, cross-graph atomicity explicitly not promised.
- Staged 4A graph create / 4B schema apply / 4C graph delete, each gated on
  per-matrix-row failpoint tests.

Answers exit criteria 2 and 4 fully, 1/5/6 partially; 3/7/8/9 deferred to
their phases (coverage table in the RFC). Linked from the dev index and the
implementation spec's Phase 4 section.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* docs(cluster): RFC-004 review fixes — graph_delete sweep rows, state_cas_base contract

Two greptile findings: (1) D3 row 2 could not be evaluated for graph_delete
(no manifest to version-check after prefix removal) and 'root absent, state
already tombstoned' fell into the stale row — split into rows 7 (delete's
analog of row 2) and 7b (the roll-forward), with expected_manifest_version
documented as always null for the delete kind. (2) state_cas_base is now
explicitly audit/diagnostics-only — the sweep never consults it; independent
state mutations are handled by the ordinary CAS like any concurrent write.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:34:14 +03:00
Andrew Altshuler
effb9cc068
Merge pull request #167 from ModernRelay/feat/cluster-stage3b
feat(cluster): Stage 3B — catalog payload verification + failpoint coverage
2026-06-10 03:17:11 +03:00
aaltshuler
16759b28b9 fix(cluster): RAII-guard the callback failpoint
ScopedFailPoint::with_callback gives cfg_callback the same Drop-based cleanup
as cfg actions; a panic while the point is active no longer leaks the callback
into the process-global registry where it would fire under later tests
(greptile review, PR #167).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:36:24 +03:00
aaltshuler
08ea659c9b build: commit Cargo.lock for omnigraph-cluster's optional fail dependency
The failpoints feature added fail = { workspace = true, optional = true } to
the crate manifest; the lockfile edge belongs with it (--locked CI gate).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:21:10 +03:00
aaltshuler
50543a8ce0 docs(cluster): record Stage 3B failpoint + verification coverage
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:15:13 +03:00
aaltshuler
211b37e6de test(cluster): failpoint tests for crash-mid-apply and state CAS race
The apply-side coverage the implementation spec's hard gate requires before
Phase 4 graph-moving apply:

- crash after the payload phase: state.json byte-identical, blobs inert on
  disk, lock released, no phantom statuses, nothing acknowledged; a plain
  re-run repairs via skip-if-exists blob reuse.
- CAS race: a cfg_callback rewrites state.json at the exact read->write
  window (the state.lock:false concurrent-writer scenario); apply surfaces
  state_cas_mismatch, acknowledges nothing, reports the persisted status
  snapshot, leaves the concurrent writer's state on disk; a re-run converges.

CI's failpoints step now runs both the engine and cluster suites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:14:06 +03:00
aaltshuler
21b531605f feat(cluster): failpoint infrastructure mirroring the engine
Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT
enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning
Diagnostic-typed injected errors, and two call sites in apply_config_dir:
cluster_apply.after_payload_phase (the crash point: blobs on disk, state
untouched) and cluster_apply.before_state_write (routes through the
persisted-statuses revert contract; a cfg_callback here can mutate state.json
to make the CAS check fail organically). Feature off compiles to Ok(()) —
zero behavior change. Tests live in a separate integration binary because the
fail registry is process-global. Also refresh the crate description (stale
'read-only' since Stage 3A).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:12:59 +03:00
aaltshuler
acb3f1cc14 test(cli): e2e for catalog payload drift self-heal loop
status warns read-only -> refresh persists drift and drops the digest ->
apply republishes the blob -> status clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:08:14 +03:00