Commit graph

12 commits

Author SHA1 Message Date
aaltshuler
16759b28b9 fix(cluster): RAII-guard the callback failpoint
ScopedFailPoint::with_callback gives cfg_callback the same Drop-based cleanup
as cfg actions; a panic while the point is active no longer leaks the callback
into the process-global registry where it would fire under later tests
(greptile review, PR #167).
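
The Drop-based cleanup described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's code: the registry, `ScopedCallback`, `hit`, and `is_registered` are hypothetical stand-ins for the `fail` registry and the real `ScopedFailPoint`.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, MutexGuard, OnceLock};

type Callback = Arc<dyn Fn() + Send + Sync>;

// Hypothetical stand-in for a process-global failpoint registry.
fn registry() -> MutexGuard<'static, HashMap<String, Callback>> {
    static REGISTRY: OnceLock<Mutex<HashMap<String, Callback>>> = OnceLock::new();
    REGISTRY
        .get_or_init(|| Mutex::new(HashMap::new()))
        .lock()
        // A panic elsewhere must not make the registry unusable.
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Guard that keeps a callback registered only while it is alive.
pub struct ScopedCallback {
    name: String,
}

impl ScopedCallback {
    pub fn with_callback(name: &str, f: impl Fn() + Send + Sync + 'static) -> Self {
        registry().insert(name.to_string(), Arc::new(f));
        ScopedCallback { name: name.to_string() }
    }
}

impl Drop for ScopedCallback {
    fn drop(&mut self) {
        // Runs on normal scope exit and while unwinding from a panic,
        // so the callback cannot outlive the test that installed it.
        registry().remove(&self.name);
    }
}

/// Fires the callback registered under `name`, if any.
pub fn hit(name: &str) {
    // Clone the callback out so the lock is released before it runs.
    let callback = registry().get(name).cloned();
    if let Some(callback) = callback {
        callback();
    }
}

pub fn is_registered(name: &str) -> bool {
    registry().contains_key(name)
}
```

A test that panics while the guard is alive still unwinds through `Drop`, so a later test in the same process sees an empty registry entry for that name.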

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:36:24 +03:00
aaltshuler
211b37e6de test(cluster): failpoint tests for crash-mid-apply and state CAS race
The apply-side coverage the implementation spec's hard gate requires before
Phase 4 graph-moving apply:

- crash after the payload phase: state.json byte-identical, blobs inert on
  disk, lock released, no phantom statuses, nothing acknowledged; a plain
  re-run repairs via skip-if-exists blob reuse.
- CAS race: a cfg_callback rewrites state.json at the exact read->write
  window (the state.lock:false concurrent-writer scenario); apply surfaces
  state_cas_mismatch, acknowledges nothing, reports the persisted status
  snapshot, leaves the concurrent writer's state on disk; a re-run converges.
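
The CAS race in the second bullet can be illustrated as follows. This is a rough sketch under assumptions: `cas_write_state`, `ApplyError`, and the `before_write` hook are hypothetical names, and the hook stands in for a cfg_callback at the read->write window.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq)]
pub enum ApplyError {
    StateCasMismatch,
    Io(String),
}

fn io_err(e: io::Error) -> ApplyError {
    ApplyError::Io(e.to_string())
}

/// Writes `new_state` only if the file still holds the bytes read at the start.
/// `before_write` is a test hook (hypothetical) that runs inside the window
/// between the initial read and the final write.
pub fn cas_write_state(
    path: &Path,
    new_state: &[u8],
    before_write: impl FnOnce(),
) -> Result<(), ApplyError> {
    // Snapshot taken when apply starts.
    let expected = fs::read(path).map_err(io_err)?;

    // A concurrent writer (or an injected callback) may act here.
    before_write();

    // Re-read and compare before publishing. Compare-then-write is not atomic
    // by itself; a real implementation would pair it with a lock or an
    // atomic rename.
    let current = fs::read(path).map_err(io_err)?;
    if current != expected {
        // Leave the concurrent writer's bytes untouched.
        return Err(ApplyError::StateCasMismatch);
    }
    fs::write(path, new_state).map_err(io_err)
}
```

Passing a hook that rewrites the file yields `StateCasMismatch` with the racing writer's bytes left on disk; a re-run with a no-op hook succeeds.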

CI's failpoints step now runs both the engine and cluster suites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:14:06 +03:00
aaltshuler
21b531605f feat(cluster): failpoint infrastructure mirroring the engine
Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT
enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning
Diagnostic-typed injected errors, and two call sites in apply_config_dir:
cluster_apply.after_payload_phase (the crash point: blobs on disk, state
untouched) and cluster_apply.before_state_write (routes through the
persisted-statuses revert contract; a cfg_callback here can mutate state.json
to make the CAS check fail organically). Feature off compiles to Ok(()) —
zero behavior change. Tests live in a separate integration binary because the
fail registry is process-global. Also refresh the crate description (stale
'read-only' since Stage 3A).
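
The "feature off compiles to Ok(())" shape might look roughly like this. It is a sketch, not the module itself: the `Diagnostic` fields, the `SKETCH_FAILPOINTS` variable, and `apply_sketch` are assumptions, and the feature-on branch uses an environment variable as a stand-in where the real module would consult the `fail` registry.

```rust
/// Hypothetical shape of a Diagnostic-typed injected error.
#[derive(Debug, PartialEq)]
pub struct Diagnostic {
    pub code: String,
    pub message: String,
}

// Feature on: the point can fire. The trigger below is a stand-in only.
#[cfg(feature = "failpoints")]
pub fn maybe_fail(name: &str) -> Result<(), Diagnostic> {
    let armed = std::env::var("SKETCH_FAILPOINTS").unwrap_or_default();
    if armed.split(';').any(|point| point == name) {
        return Err(Diagnostic {
            code: "injected_failure".to_string(),
            message: format!("failpoint {} fired", name),
        });
    }
    Ok(())
}

// Feature off: a constant Ok(()), so call sites cost nothing and behave
// exactly as if the check were absent.
#[cfg(not(feature = "failpoints"))]
#[inline(always)]
pub fn maybe_fail(_name: &str) -> Result<(), Diagnostic> {
    Ok(())
}

/// Simplified apply flow showing where the two call sites sit.
pub fn apply_sketch(log: &mut Vec<&'static str>) -> Result<(), Diagnostic> {
    log.push("payloads written");
    // Crash point: blobs on disk, state untouched.
    maybe_fail("cluster_apply.after_payload_phase")?;
    // Last stop before the state write.
    maybe_fail("cluster_apply.before_state_write")?;
    log.push("state written");
    Ok(())
}
```

Built without the feature, `apply_sketch` always reaches the state write; the call sites stay in the source but compile down to nothing.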

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:12:59 +03:00
aaltshuler
15868972ff feat(cluster): verify catalog payload blobs in status and refresh
Closes the Stage 3A product gap where a deleted or corrupted blob under
__cluster/resources/ went unnoticed forever (status reported converged and
apply could not repair it because the digests matched).

verify_catalog_payloads checks every query/policy digest in state against its
content-addressed blob (existence + full sha256 re-hash; graph/schema/unknown
addresses have no payloads and are skipped). status reports findings read-only
(warnings catalog_payload_missing/_mismatch; error catalog_payload_read_error
— an unverifiable catalog must not report healthy). refresh closes the
self-heal loop: missing/mismatched blobs mark the resource drifted and remove
its digest from state so the next plan proposes a create and the next apply
republishes; unreadable blobs keep the digest (no spurious republish), mark
error, and exit non-zero. Verification runs before graph observation so the
recomputed graph composite already excludes removed query digests.
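
The per-blob check and the refresh reaction described above can be sketched like this. All names are hypothetical (`PayloadFinding`, `verify_payload`, `refresh_action`), and the digest function is passed in by the caller because the standard library has no sha256; the real code re-hashes with sha256.

```rust
use std::fs;
use std::io::ErrorKind;
use std::path::Path;

#[derive(Debug, PartialEq)]
pub enum PayloadFinding {
    Verified,
    Missing,
    Mismatch,
    ReadError(String),
}

/// Checks one blob: existence first, then a full re-hash of its bytes.
/// `digest` is supplied by the caller (assumption: sha256 in practice).
pub fn verify_payload(
    blob_path: &Path,
    expected_digest: &str,
    digest: impl Fn(&[u8]) -> String,
) -> PayloadFinding {
    match fs::read(blob_path) {
        Ok(bytes) if digest(&bytes) == expected_digest => PayloadFinding::Verified,
        Ok(_) => PayloadFinding::Mismatch,
        Err(e) if e.kind() == ErrorKind::NotFound => PayloadFinding::Missing,
        // Any other I/O failure means the blob could not be verified at all.
        Err(e) => PayloadFinding::ReadError(e.to_string()),
    }
}

/// Maps a finding to (status label, keep the digest in state?).
/// The labels are illustrative, not the crate's actual status names.
pub fn refresh_action(finding: &PayloadFinding) -> (&'static str, bool) {
    match finding {
        PayloadFinding::Verified => ("unchanged", true),
        // Dropping the digest makes the next plan propose a create.
        PayloadFinding::Missing | PayloadFinding::Mismatch => ("drifted", false),
        // Unreadable is not the same as wrong: keep the digest, report error.
        PayloadFinding::ReadError(_) => ("error", true),
    }
}
```

Keeping the digest on a read error is what avoids a spurious republish when the blob may in fact be intact.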

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:07:08 +03:00
aaltshuler
5e1dede08f fix(cluster,cli): apply failure output — persisted statuses only, changes list printed
Two review findings (greptile, PR #165):

- ApplyOutput.resource_statuses on a failed state write now carries the
  pre-apply on-disk snapshot instead of the in-memory mutations that were
  never persisted, so automation reading the field independently of `ok`
  cannot see phantom applied/blocked statuses. Regression test forces the
  state write to fail via a read-only __cluster dir (unix-only, skips when
  permissions are not enforced).
- Human-mode `cluster apply` prints the classified changes list on failure
  too, so an operator debugging a partial apply without --json sees what was
  attempted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 00:35:03 +03:00
aaltshuler
1f8e5945cf feat(cluster): config-only apply with content-addressed catalog publish
apply_config_dir executes the query/policy subset of the plan: payloads are
written content-addressed under __cluster/resources/{query,policy}/... before
the state CAS (state is the publish point; orphaned blobs from a failed CAS
are inert and re-apply is the repair), then state.json is CAS-updated with
applied digests, Applied/Blocked statuses, and a revision bump. Graph/schema
changes are never executed here: schema content and graph lifecycle defer to
a later phase with loud warnings, while graph.<id> composite-digest updates
whose schema component is unchanged converge automatically via recomputation
from state's own components (without which apply could never converge).
Idempotent re-apply leaves state bytes and revision untouched.
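
A content-addressed, skip-if-exists blob write could be sketched as below. `publish_blob` and the `<root>/<kind>/<digest>` layout are assumptions for illustration; the actual layout under __cluster/resources/ is not spelled out in this message.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, ErrorKind, Write};
use std::path::{Path, PathBuf};

/// Writes `payload` under a path derived from its digest.
/// Returns Ok(true) if a new blob was written, Ok(false) if one already existed.
pub fn publish_blob(
    root: &Path,
    kind: &str,
    digest: &str,
    payload: &[u8],
) -> io::Result<bool> {
    // Hypothetical layout: <root>/<kind>/<digest>.
    let dir = root.join(kind);
    fs::create_dir_all(&dir)?;
    let path: PathBuf = dir.join(digest);

    // create_new fails with AlreadyExists instead of truncating, which gives
    // skip-if-exists reuse: a blob left behind by a failed run is kept as-is.
    match OpenOptions::new().write(true).create_new(true).open(&path) {
        Ok(mut file) => {
            file.write_all(payload)?;
            Ok(true)
        }
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}
```

Because nothing references a blob until the state CAS succeeds, a blob written by a failed apply is inert, and the re-run simply finds it in place. A production version would likely write to a temporary file and rename so a crash mid-write cannot leave a partial blob under the final name.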

PlanChange gains optional disposition/reason fields, populated by the same
classifier in cluster plan, so plan is an honest preview of what apply will
execute, derive, defer, or block.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-09 23:32:13 +03:00
aaltshuler
89b876c797 Add cluster state lock recovery 2026-06-09 22:31:46 +03:00
aaltshuler
d00d42274e Implement cluster refresh and import 2026-06-09 21:17:23 +03:00
aaltshuler
2f19656c0e fix(cluster): tighten state lock observations 2026-06-09 18:30:33 +03:00
aaltshuler
b046515e1c Merge origin/main into cluster-config-docs 2026-06-09 18:11:12 +03:00
aaltshuler
a7956ea5a9 Add cluster JSON state ledger status 2026-06-08 21:09:23 +03:00
aaltshuler
043b02e617 feat(cluster): add read-only validate and plan 2026-06-08 20:07:39 +03:00