ScopedFailPoint::with_callback gives cfg_callback the same Drop-based cleanup
as cfg actions; a panic while the point is active no longer leaks the callback
into the process-global registry where it would fire under later tests
(greptile review, PR #167).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The apply-side coverage the implementation spec's hard gate requires before
Phase 4 graph-moving apply:
- crash after the payload phase: state.json byte-identical, blobs inert on
disk, lock released, no phantom statuses, nothing acknowledged; a plain
re-run repairs via skip-if-exists blob reuse.
- CAS race: a cfg_callback rewrites state.json at the exact read->write
window (the state.lock:false concurrent-writer scenario); apply surfaces
state_cas_mismatch, acknowledges nothing, reports the persisted status
snapshot, leaves the concurrent writer's state on disk; a re-run converges.
CI's failpoints step now runs both the engine and cluster suites.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT
enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning
Diagnostic-typed injected errors, and two call sites in apply_config_dir:
cluster_apply.after_payload_phase (the crash point: blobs on disk, state
untouched) and cluster_apply.before_state_write (routes through the
persisted-statuses revert contract; a cfg_callback here can mutate state.json
to make the CAS check fail organically). Feature off compiles to Ok(()) —
zero behavior change. Tests live in a separate integration binary because the
fail registry is process-global. Also refresh the crate description (stale
'read-only' since Stage 3A).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Closes the Stage 3A product gap where a deleted or corrupted blob under
__cluster/resources/ went unnoticed forever (status reported converged and
apply could not repair it because the digests matched).
verify_catalog_payloads checks every query/policy digest in state against its
content-addressed blob (existence + full sha256 re-hash; graph/schema/unknown
addresses have no payloads and are skipped). status reports findings read-only
(warnings catalog_payload_missing/_mismatch; error catalog_payload_read_error
— an unverifiable catalog must not report healthy). refresh closes the
self-heal loop: missing/mismatched blobs mark the resource drifted and remove
its digest from state so the next plan proposes a create and the next apply
republishes; unreadable blobs keep the digest (no spurious republish), mark
error, and exit non-zero. Verification runs before graph observation so the
recomputed graph composite already excludes removed query digests.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two review findings (greptile, PR #165):
- ApplyOutput.resource_statuses on a failed state write now carries the
pre-apply on-disk snapshot instead of the in-memory mutations that were
never persisted, so automation reading the field independently of `ok`
cannot see phantom applied/blocked statuses. Regression test forces the
state write to fail via a read-only __cluster dir (unix-only, skips when
permissions are not enforced).
- Human-mode `cluster apply` prints the classified changes list on failure
too, so an operator debugging a partial apply without --json sees what was
attempted.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
apply_config_dir executes the query/policy subset of the plan: payloads are
written content-addressed under __cluster/resources/{query,policy}/... before
the state CAS (state is the publish point; orphaned blobs from a failed CAS
are inert and re-apply is the repair), then state.json is CAS-updated with
applied digests, Applied/Blocked statuses, and a revision bump. Graph/schema
changes are never executed here: schema content and graph lifecycle defer to
a later phase with loud warnings, while graph.<id> composite-digest updates
whose schema component is unchanged converge automatically via recomputation
from state's own components (without which apply could never converge).
Idempotent re-apply leaves state bytes and revision untouched.
PlanChange gains optional disposition/reason fields, populated by the same
classifier in cluster plan, so plan is an honest preview of what apply will
execute, derive, defer, or block.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>