omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-24 02:38:06 +02:00

Author	SHA1	Message	Date
Andrew Altshuler	7fd23c54a3	fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284) * fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap A `cluster apply` carrying a schema change against a graph that has non-main branches, or an unsupported "needs backfill" migration, armed a recovery sidecar before calling the engine, then left it behind when the engine rejected the apply pre-movement. The server refuses to boot while any sidecar is pending, and re-running apply re-armed a fresh sidecar — an unescapable crash loop. None of the engine rejections are bugs; the trap is in the apply/serve choreography. Three coordinated changes: 1. Preview before arming the sidecar. `cluster apply` now runs `preview_schema_apply_with_options` before `write_recovery_sidecar`, so parser/planner rejections (non-main branches, unsupported plan) fail loudly without leaving recovery work behind. The post-preview engine error path now deletes the sidecar when the live schema still matches the recorded digest (nothing moved), and keeps it only on real mid-movement failure — both branches covered by new engine-failpoint tests (cluster failpoints now enable omnigraph/failpoints). 2. Per-graph quarantine at serve time instead of whole-cluster refusal. A graph-attributed pending sidecar, an unopenable graph root, a query parse failure, or an unresolvable embedding provider now quarantines just that graph (logged loudly at every boot layer) while healthy graphs serve; `/graphs` lists only ready graphs and quarantined routes 404. Cluster-global problems (missing/unreadable state, malformed or unattributable sidecars, shared-catalog or cluster-policy errors, zero healthy graphs) stay fail-fast. `--require-all-graphs` / OMNIGRAPH_REQUIRE_ALL_GRAPHS=1 restores all-or-nothing boot. 3. Backfill embedding-provider profile metadata on apply. Mirrors the existing policy-binding backfill: a pre-5A ledger missing `embedding_profile` is now detected as a metadata-only change and backfilled by a no-op apply, instead of bricking serve with `embedding_provider_profile_missing` forever. Tests: trap (no sidecar after a rejected apply), both digest-cleanup branches, per-graph quarantine (cluster + server), embedding backfill. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs: resilient cluster boot + recovery-sidecar trap fix Amend RFC-005 D4 readiness posture (cluster-global fail-fast vs graph-local quarantine; deviation #5 for --require-all-graphs), add the v0.7.0 release note, and update the user cluster/server/deployment docs and the OMNIGRAPH_REQUIRE_ALL_GRAPHS env var. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * fix(cluster): surface sidecar-cleanup failures; document severity promotion Address Greptile review on PR #284: - The pre-movement sidecar cleanup fast-path discarded `delete_object`'s result, so a transient delete failure left the graph quarantined with no signal. Add `try_delete_object` (Result-returning) and emit a `recovery_sidecar_cleanup_failed` warning diagnostic on failure; the fire-and-forget `delete_object` now delegates to it. - Document why the serve-time loop promotes every `list_recovery_sidecars` diagnostic to a cluster-fatal error (the listing only emits genuine read/parse/version failures, as warnings, whose blast radius serving cannot prove) and note the promote-by-code path if that ever changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-19 03:34:15 +03:00
aaltshuler	4590c91f9d	rename compiler NanoError and fix cluster config warnings	2026-06-17 23:44:24 +03:00
aaltshuler	4f8c71fa23	Merge remote-tracking branch 'origin/main' into ragnorc/shaping-config-integration # Conflicts: # crates/omnigraph-cluster/src/lib.rs # crates/omnigraph-cluster/src/serve.rs # crates/omnigraph-server/src/lib.rs # crates/omnigraph-server/src/settings.rs # docs/user/clusters/config.md	2026-06-16 04:13:00 +03:00
aaltshuler	16e4a833c0	Wire cluster embedding providers	2026-06-16 04:02:08 +03:00
Andrew Altshuler	1bc0ea6b51	feat(cli): no-default-graph errors list candidate graphs (RFC-011 D7) (#245) When a server/cluster scope resolves with no --graph and no default_graph, the CLI auto-uses a sole graph (cluster) or errors listing the candidate graph ids (cluster catalog; multi-graph server via best-effort GET /graphs), never a silent pick. GraphClient::resolve becomes async; flat/single-graph servers and happy paths are unaffected.	2026-06-15 15:48:29 +03:00
Andrew Altshuler	6144bb18d6	feat(cli): cluster-managed maintenance addressing + init signpost (RFC-010 Slice 3) (#221) * feat(cluster): cluster_root_for_graph_uri detection helper (RFC-010 Slice 3) Public helper the CLI uses to refuse `init` into a cluster-managed location: given a graph storage URI of the cluster layout (`<root>/graphs/<id>.omni`), return the cluster root if `<root>` holds `__cluster/state.json`, else None. Cheap by construction — a URI that doesn't match the `<root>/graphs/<id>.omni` shape returns None with zero I/O, so ordinary `init` targets never probe storage. Works for file:// and s3:// via the storage adapter. Adds two ClusterStore accessors (`display_root`, `has_state`). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(cli): cluster-managed maintenance addressing + init signpost (RFC-010 Slice 3) Two cluster-graph-aware CLI behaviors, sharing the cluster-resolution path. Maintenance addressing. `optimize`/`repair`/`cleanup` gain `--cluster <dir|s3://…> --cluster-graph <id>`, which resolves the graph's storage URI from the served cluster snapshot (the same truth a `--cluster` server boots from — `read_serving_snapshot`) and opens it embedded. The operator no longer hand-types `<storage>/graphs/<id>.omni`. A distinct flag is required because the global `--graph` is `requires = server` and means a remote multi-graph id. clap enforces both-or-neither and exclusion with the positional URI / `--target`; an unserved graph errors loudly, pointing at `cluster apply`. init signpost. `init` refuses a cluster-managed positional path (the `<root>/graphs/<id>.omni` layout where `<root>` holds `__cluster/state.json`, detected by `cluster_root_for_graph_uri`) and points at `cluster apply` — graphs in an established cluster are created with ledger/recovery/approvals, not by hand. The check is gated on the path shape, so ordinary `init` does no extra I/O and existing pre-apply cluster-graph inits are unaffected. planes guard remediation now also mentions `--cluster … --cluster-graph …` (the two Slice-1 guard-string tests track it). Docs updated (cli-reference Command planes, maintenance.md, cluster.md §7); the stale "no S3-hosted cluster directories" limitation is dropped (RFC-006 landed it). Tests (cli_cluster.rs, reusing the apply-a-cluster fixture): resolve by id, unknown-id error, `--cluster` requires `--cluster-graph`, init refusal + signpost, and ordinary init still works. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> fix(cli): resolve cluster graphs from the state ledger, not the serving snapshot Addresses the Greptile review on #221. `read_serving_snapshot` does all-or-nothing serving validation — recovery-sidecar checks plus a digest verify of every catalog payload (query .gq, policy blobs). Using it to resolve a maintenance target coupled `optimize`/`repair`/`cleanup` to the readiness of unrelated resources: a single corrupt policy blob, or a pending recovery sweep, would block the command before it could touch the graph — worst for `repair`, the tool you reach for when the cluster is degraded. Add `omnigraph_cluster::resolve_graph_storage_uri(cluster, graph_id)`: read the state ledger, confirm the graph is in the applied revision, return `graph_root(id)` — the URI is deterministically derivable, no catalog validation. The CLI's cluster resolver now calls it. Test: `optimize --cluster … --cluster-graph …` still resolves after the catalog payloads (`__cluster/resources/`) are removed — the ledger-only path is not blocked by degraded/unrelated catalog state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-14 02:52:21 +03:00
aaltshuler	58855c0a7c	feat(cluster,server): inline policy content + config-free --cluster URI boot Two serving changes that complete RFC-006's read side: ServingPolicy carries the policy bundle CONTENT (digest-verified at snapshot read) instead of a blob path — the catalog may live on object storage, and the server must not re-read mutable state after the snapshot. The server grows a PolicySource enum: File for omnigraph.yaml deployments (unchanged), Inline for cluster boots, wired through PolicyEngine::load_{graph,server}_from_source. read_serving_snapshot_from_storage(uri) reads the applied revision straight from a storage root, and --cluster accepts a scheme-qualified URI (s3://bucket/prefix): config-free serving — a serving box needs only the URI and credentials; the ledger and catalog on the bucket ARE the deployment artifact. Bare paths keep the config-directory behavior. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:56:22 +03:00
aaltshuler	8dc2f15255	feat(cluster): the storage: root — state, catalog, and graph roots relocatable cluster.yaml gains an optional storage: URI deciding where everything the cluster STORES lives: the state ledger, lock, content-addressed catalog, recovery sidecars, approval artifacts, and the derived graph roots (<storage>/graphs/<id>.omni). Absent, it defaults to the config directory itself — the original layout, byte-compatible, so pre-existing clusters and the whole test suite are untouched. Declared configuration always stays in the working tree (Terraform's config-local/state-remote split); credentials are env-only, never in cluster.yaml. Every command resolves its store from the declared root (a bad root is a loud invalid_storage_root). Graph-root derivation, the delete executor (prefix delete via the adapter), the sweep's existence probes, the catalog payload write/verify/read paths, and the serving snapshot all flow through ClusterStore — the last raw-fs holdouts for stored state are gone, and the deny-list gains the rule that keeps it that way. Tests: default-layout byte-compat, a file:// root relocating the entire cluster (ledger+catalog+graphs under the new root, nothing under the config dir, serving snapshot follows), invalid-root validation. 98 in-crate + 9 failpoints + full workspace gate green. The s3:// flavor lands with PR 3's gated RustFS e2e. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 14:28:04 +03:00
aaltshuler	fd002abaa5	feat(cluster): port the storage backend to the engine StorageAdapter LocalStateBackend becomes ClusterStore: every stored byte — state ledger, lock, recovery sidecars, approval artifacts — now flows through the engine's StorageAdapter, making file:// and s3:// one code path. Behavior on the file backend is byte-compatible (layout, CAS semantics, diagnostics, lock release timing) and the entire pre-existing suite passes unchanged. Mechanics: the ledger CAS keeps its public sha256 vocabulary while the physical swap is token-conditioned (ETag If-Match on S3 via PR #186's primitives; content-token + temp/rename locally — the pre-port semantics); the lock is a create-only put (genuinely cross-machine on object stores) with deterministic drop-release locally and best-effort spawned release on S3; sidecars/approvals address by URI (SweepOutcome and the executors carry strings); sweep row-1 retirement joins the uniform deferred post-CAS cleanup. ClusterStore also gains the catalog-payload and graph-root methods that commit 2 wires in. Async ripple: status/force-unlock/serving-snapshot and the server's settings loader chain go async (CLI dispatch and ~20 test hosts follow, mechanically). tokio joins the cluster crate's runtime deps for the lock guard's handle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 14:11:14 +03:00
aaltshuler	db6fe03be1	refactor(cluster): move type definitions to types.rs Verbatim move of the public output/diagnostic types and the internal state/sidecar/approval models; previously-private types and their fields get pub(crate) (they were crate-visible by position before). lib.rs is now the command pipeline + public API. 95 tests green; full workspace gate green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:42:02 +03:00
aaltshuler	dc0a1fc5a5	refactor(cluster): move declared-config loading to config.rs Verbatim move of cluster.yaml parsing, query discovery, source digesting, header/id validation, path resolution, and live-graph observation. Two helpers that the cut swept along were relocated to their right homes (state-status helpers back to lib.rs, lock-file helpers to store.rs). 95 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:37:20 +03:00
aaltshuler	dd17c0c50f	refactor(cluster): move diffing and classification to diff.rs Verbatim move of diff_resources, binding-change diffing, blast radius, approval gating, ResourceKind, classify_changes, and demotion. 95 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:33:13 +03:00
aaltshuler	9c3e09e838	refactor(cluster): move the recovery sweep to sweep.rs Verbatim move of the sidecar classification (all RFC-004 D3 rows), tombstoning, and approval-consumption helpers. 95 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:30:55 +03:00
aaltshuler	00fc5cf537	refactor(cluster): move the serving snapshot to serve.rs Verbatim move of the Serving* types, read_serving_snapshot, and read_verified_payload; public re-exports preserved (the server's imports are unchanged). 95 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:29:44 +03:00
aaltshuler	5a8047e5d0	refactor(cluster): move the storage backend to store.rs Verbatim move of LocalStateBackend, StateSnapshot, StateLockGuard and their impls — the single home for stored-state I/O (state ledger, lock, recovery sidecars, approval artifacts), where the RFC-006 object-storage port lands next as a focused diff. Visibility bumps (pub(crate)) only; 95 tests green before and after. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:28:04 +03:00
aaltshuler	fbb86dee0e	refactor(cluster): move the in-source test suite to tests.rs Verbatim move (indentation preserved — embedded raw-string fixtures are content). lib.rs drops from 7,857 to ~4,750 lines; `use super::*` resolves to the crate root through the #[path] module declaration unchanged. 95 tests green before and after. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:25:53 +03:00
aaltshuler	4558454bc7	fix(cluster): address review — discovery reads each file exactly once resolve_query_decls hands its file contents to the caller; the per-query digest/typecheck pass reuses them instead of re-reading (a file with N queries was read N+1 times), which also closes the window where a file changing between enumeration and validation produced a confusing query_key_mismatch for a just-discovered name. Explicit-map declarations read as before. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 01:35:47 +03:00
aaltshuler	677320ceec	feat(cluster): Terraform-shaped query declaration — discover from files cluster.yaml's graphs.<id>.queries previously accepted only an explicit name->file map, forcing configs to re-enumerate every `query <name>` that the .gq files already declare (the SPIKE cookbook needed 66 entries for 6 files). The files ARE the declaration now: `queries: queries/` discovers every declaration in a directory's top-level *.gq (sorted), a list form takes explicit files, and the map stays for fine-grained control. Discovery is loud — unreadable/unparseable files and duplicate query names fail validation (query_parse_error, duplicate_query_name). Downstream is untouched: each discovered query is still an individually addressed resource with the containing file's digest. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:46:21 +03:00
aaltshuler	f5b43164b8	feat(cluster): pub read-only serving-snapshot API RFC-005 §D2/§D4: read_serving_snapshot reads the applied revision as everything a server needs to boot — graphs at derived roots, stored-query sources read from the content-addressed catalog and re-hashed against the recorded digests, policy blob paths with their applied applies_to bindings. All-or-nothing: missing state, pending recovery sidecars, missing/tampered blobs, pre-5A entries without bindings, and an empty graph set each refuse the snapshot with a remedy; no partial serving. Lock-free by design — the state file is replaced atomically, so the read is a consistent point-in-time ledger. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:39:26 +03:00
aaltshuler	0b84b1adc3	feat(cluster): record policy applies_to bindings in the applied revision Slice 5A of RFC-005: the state ledger becomes serving-sufficient for the Phase-5 server boot. StateResource gains an optional applies_to (normalized typed refs: cluster | graph.<id>), written by apply for every applied policy create/update from the desired config's validated bindings. The hole this closes: applies_to is not part of the policy file digest, so a binding-only edit previously produced NO plan change at all (a 4C e2e even asserted that — the gap, not a contract). Binding changes are now first-class: a post-diff pass emits an Update with equal before/after digests and a binding_change marker (visible in plan/apply JSON and human output as [bindings]), classification/execution treat it as an ordinary catalog-tier applied change (payload skips naturally — the blob is unchanged), and convergence requires zero binding divergence, so stale bindings can never report converged. Pre-5A ledger entries (no bindings recorded) surface as the same backfill Update; one apply heals them, exactly the remedy RFC-005's boot-error path names. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:30:33 +03:00
aaltshuler	d1d04217ab	feat(cluster): execute approved graph deletes in cluster apply Stage 4C execution half (RFC-004 §D5/§D6 + sweep rows 7/7b/8): an approved graph.<id> delete — and its riding schema/query deletes — classifies Applied and executes LAST in the run, sidecar-fenced: pre-op manifest pin (best effort; partial roots still delete), approval_id carried in the sidecar, recursive root removal (NotFound tolerated), subtree tombstoned out of the ledger with a tombstone observation, the approval consumed in the same state CAS (ledger summary) and its artifact file rewritten with consumed_at only after the CAS lands — a failed run consumes nothing and the approval stays valid for the retry. Sweep rows: already-tombstoned intents retire (7); a completed delete with a stale ledger rolls forward — tombstone + approval consumption + audit entry (7b, idempotent); a still-present root retires the stale intent with a graph_delete_incomplete warning and the still-approved delete re-executes in the same run (8) — prefix removal is idempotent, so retry IS the repair. The multi-graph mixed e2e gets its conclusion: blocked without approval, cluster approve graph.engineering --as andrew, converge, tombstone visible in status. Phase 4's disposition matrix is now fully executable. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:34:02 +03:00
aaltshuler	f4e9105272	feat(cluster): cluster approve — digest-bound approval artifacts RFC-004 §D4, gate half: graph deletes (and their subtree) now classify Blocked/approval_required instead of Deferred; the new cluster approve command (requires the global --as actor) writes __cluster/approvals/{ulid}.json bound to the desired config digest and the change's before/after digests, so config or state drift invalidates the artifact automatically (approval_stale warning, never authorizes). One gate per subtree: compute_approvals lists only the graph-level delete, and ApprovalRequirement gains a satisfied flag surfaced by plan. Consumption and the delete executor land next — until then approved deletes stay blocked so a gate-only build can never strip state without removing the root. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:30:05 +03:00
aaltshuler	a1ba4dc413	feat(cluster): execute schema applies in cluster apply Stage 4B (RFC-004 §D1/§D5): schema.<id> Update changes classify Applied and execute after graph creates, sequentially and sidecar-fenced — read-write open (the engine's own recovery runs first), pre-op manifest pin recorded, apply_schema_as with allow_data_loss: false (soft drops only; hard drops wait for 4C's approval artifacts), post-op pin rewritten into the sidecar, sidecar retired only after the final state CAS. Queries gated on a same-plan schema update unblock (the migration lands first in the same run); failures — unsupported migrations, lock contention, user branches — surface as schema_apply_failed with the engine's message, demote dependents via the origin-aware demotion helper, and stop further graph-moving work. Schema evolution is now fully cluster-driven (the defer -> manual schema apply -> refresh loop is gone), and out-of-band schema drift is converged back by apply as an ordinary soft migration (axiom 8: drift correction is gated like any change; the recoverable tier needs no approval) — both pinned by reworked e2es. The multi-graph mixed e2e's deferred row is now delete-shaped, pre-staging the 4C surface. Actor: cluster apply accepts the CLI's global --as via the new ApplyOptions / apply_config_dir_with_options (apply_config_dir delegates unchanged); the actor is echoed in ApplyOutput and recorded in sidecars and audit entries, and threads to apply_schema_as so Cedar fires wherever a checker is installed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:12:15 +03:00
aaltshuler	0571c05ebb	feat(cluster): schema-apply recovery sidecar kind and sweep RecoverySidecarKind::SchemaApply with digest-based sweep classification (robust to unrelated manifest movement; version pins stay forensic): ledger-consistent -> sidecar retired (RFC-004 rows 1+2); live digest matches the intended schema, state stale -> roll forward with composite recompute and a recovery_records audit entry (row 3); unverifiable or unexpected digests -> pending, kept, graph-moving work blocked (rows 1-unopenable/6). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:05:42 +03:00
aaltshuler	ca63a9340b	feat(cluster): embed schema migration previews in cluster plan RFC-004 §D7's data-aware preview: for every schema update, plan opens the live graph read-only and embeds the engine's migration plan (supported flag + typed steps) in the change record; the human renderer prints the steps. Preview failures (unreachable graph, planner error) degrade to the digest diff with a schema_preview_unavailable warning — planning never blocks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:04:19 +03:00
aaltshuler	b313075476	refactor(cluster): make plan_config_dir async Mechanical conversion ahead of Stage 4B (plan will preview schema migrations against live graphs): signature, CLI dispatch, and test callers. Zero behavior change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:02:12 +03:00
aaltshuler	c3007369cd	feat(cluster): execute graph creates in cluster apply Stage 4A (RFC-004 §D1/§D5): graph.<id> Create — and its paired schema Create, which the init carries — classify Applied and execute first in the run, sequentially and sidecar-fenced: sidecar written before Omnigraph::init at the derived root, rewritten with the post-init manifest pin, deleted only after the final state CAS lands. Dependent queries and policies no longer block on a graph create in the same plan — creates run first, so they apply in the same run; a create failure demotes them to blocked (dependency_not_applied) and stops further graph-moving work (loud partials), with the sidecar left for the sweep to classify. Graphs with a kept recovery sidecar (rows 5/6) classify Blocked/cluster_recovery_pending, and the sweep's Drifted/Error statuses are never clobbered by a generic Blocked. Schema source is re-read and digest-verified under the lock before the init (the write_resource_payload TOCTOU posture). Plan previews the same dispositions. e2e fallout updated: a fresh multi-graph config now converges in one apply; a destroyed root is re-created as an EMPTY graph by the next apply (declarative convergence — visible in plan, called out in docs); the new cluster_e2e_declared_graph_created_by_apply pins the no-manual-init flow. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:58:56 +03:00
aaltshuler	bf8cc7a753	feat(cluster): graph-create recovery sidecars and sweep RFC-004 §D2/§D3 for the graph_create kind. RecoverySidecar records intent under __cluster/recoveries/{ulid}.json; the roll-forward-only sweep runs at the start of apply/refresh/import under the state lock and classifies each survivor by observation: root absent -> intent removed (row 1); outcome already recorded -> retired (row 2); create completed but state stale -> ledger rolled forward with a recovery_records audit entry (row 4); partial root -> Error/graph_create_incomplete, kept, never auto-deleted (row 5); unexpected schema -> Drifted/actual_applied_state_pending, kept (row 6). Sweep mutations ride the command's existing CAS write; completed sidecars are deleted only after that write lands. Read-only status/plan warn (cluster_recovery_pending) without acting. The apply payload gate now counts only payload-phase errors so kept-sidecar diagnostics don't abort the run before their statuses persist. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:50:42 +03:00
aaltshuler	6fbf09d5c9	refactor(cluster): make apply_config_dir async Mechanical conversion ahead of Stage 4A graph create (which calls the async Omnigraph::init from inside apply): the fn signature, the CLI dispatch arm, and every test caller (#[test] -> #[tokio::test]). Zero behavior change; all 60 lib tests and 3 failpoint tests green before and after. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:43:38 +03:00
aaltshuler	21b531605f	feat(cluster): failpoint infrastructure mirroring the engine Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning Diagnostic-typed injected errors, and two call sites in apply_config_dir: cluster_apply.after_payload_phase (the crash point: blobs on disk, state untouched) and cluster_apply.before_state_write (routes through the persisted-statuses revert contract; a cfg_callback here can mutate state.json to make the CAS check fail organically). Feature off compiles to Ok(()) — zero behavior change. Tests live in a separate integration binary because the fail registry is process-global. Also refresh the crate description (stale 'read-only' since Stage 3A). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 02:12:59 +03:00
aaltshuler	15868972ff	feat(cluster): verify catalog payload blobs in status and refresh Closes the Stage 3A product gap where a deleted or corrupted blob under __cluster/resources/ went unnoticed forever (status reported converged and apply could not repair it because the digests matched). verify_catalog_payloads checks every query/policy digest in state against its content-addressed blob (existence + full sha256 re-hash; graph/schema/unknown addresses have no payloads and are skipped). status reports findings read-only (warnings catalog_payload_missing/_mismatch; error catalog_payload_read_error — an unverifiable catalog must not report healthy). refresh closes the self-heal loop: missing/mismatched blobs mark the resource drifted and remove its digest from state so the next plan proposes a create and the next apply republishes; unreadable blobs keep the digest (no spurious republish), mark error, and exit non-zero. Verification runs before graph observation so the recomputed graph composite already excludes removed query digests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 02:07:08 +03:00
aaltshuler	5e1dede08f	fix(cluster,cli): apply failure output — persisted statuses only, changes list printed Two review findings (greptile, PR #165): - ApplyOutput.resource_statuses on a failed state write now carries the pre-apply on-disk snapshot instead of the in-memory mutations that were never persisted, so automation reading the field independently of `ok` cannot see phantom applied/blocked statuses. Regression test forces the state write to fail via a read-only __cluster dir (unix-only, skips when permissions are not enforced). - Human-mode `cluster apply` prints the classified changes list on failure too, so an operator debugging a partial apply without --json sees what was attempted. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 00:35:03 +03:00
aaltshuler	1f8e5945cf	feat(cluster): config-only apply with content-addressed catalog publish apply_config_dir executes the query/policy subset of the plan: payloads are written content-addressed under __cluster/resources/{query,policy}/... before the state CAS (state is the publish point; orphaned blobs from a failed CAS are inert and re-apply is the repair), then state.json is CAS-updated with applied digests, Applied/Blocked statuses, and a revision bump. Graph/schema changes are never executed here: schema content and graph lifecycle defer to a later phase with loud warnings, while graph.<id> composite-digest updates whose schema component is unchanged converge automatically via recomputation from state's own components (without which apply could never converge). Idempotent re-apply leaves state bytes and revision untouched. PlanChange gains optional disposition/reason fields, populated by the same classifier in cluster plan, so plan is an honest preview of what apply will execute, derive, defer, or block. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-09 23:32:13 +03:00
aaltshuler	89b876c797	Add cluster state lock recovery	2026-06-09 22:31:46 +03:00
aaltshuler	d00d42274e	Implement cluster refresh and import	2026-06-09 21:17:23 +03:00
aaltshuler	2f19656c0e	fix(cluster): tighten state lock observations	2026-06-09 18:30:33 +03:00
aaltshuler	a7956ea5a9	Add cluster JSON state ledger status	2026-06-08 21:09:23 +03:00
aaltshuler	043b02e617	feat(cluster): add read-only validate and plan	2026-06-08 20:07:39 +03:00
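
Commit 6144bb18d6 describes `cluster_root_for_graph_uri` as cheap by construction: a URI that does not match the `<root>/graphs/<id>.omni` shape returns None before any storage probe. The sketch below illustrates that shape-first gating only. It is not the repository's implementation: `split_cluster_graph_uri`, `cluster_root_for`, and the `has_state` callback (standing in for the `__cluster/state.json` probe) are hypothetical names introduced here.

```rust
/// Split a graph storage URI of the shape `<root>/graphs/<id>.omni`
/// into `(root, id)`. Pure string work: no storage is touched.
/// (Hypothetical helper, not the crate's API.)
fn split_cluster_graph_uri(uri: &str) -> Option<(&str, &str)> {
    let trimmed = uri.trim_end_matches('/');
    // Last path segment must be `<id>.omni`.
    let (parent, leaf) = trimmed.rsplit_once('/')?;
    let id = leaf.strip_suffix(".omni")?;
    // The segment before it must be `graphs`.
    let root = parent.strip_suffix("/graphs")?;
    if id.is_empty() || root.is_empty() {
        return None;
    }
    Some((root, id))
}

/// Return the cluster root only when the shape matches AND the probe
/// reports that `<root>/__cluster/state.json` exists. `has_state` is an
/// assumed stand-in for the real storage check; it is never called for
/// URIs that fail the shape test.
fn cluster_root_for<F>(uri: &str, has_state: F) -> Option<String>
where
    F: FnOnce(&str) -> bool,
{
    let (root, _id) = split_cluster_graph_uri(uri)?;
    if has_state(root) {
        Some(root.to_string())
    } else {
        None
    }
}
```

Under these assumptions, `cluster_root_for("/tmp/mygraph.omni", probe)` returns None without calling `probe`, which is the property that keeps ordinary `init` targets free of extra I/O.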

38 commits
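
Commit bf8cc7a753 lists how the roll-forward-only sweep classifies a surviving `graph_create` sidecar by observation (rows 1, 2, 4, 5, 6). A minimal sketch of that decision table follows. The type names, the three observation fields, and the precedence between rows are assumptions made for illustration; only the row outcomes and the two diagnostic codes come from the commit message.

```rust
/// What the sweep observed at the derived graph root (assumed model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RootState {
    Absent,
    Partial,
    Complete,
}

/// Hypothetical observation bundle for one surviving sidecar.
struct Observation {
    root: RootState,
    ledger_records_outcome: bool,
    live_schema_matches_intent: bool,
}

/// Outcome per row of the commit message's table.
#[derive(Debug, PartialEq, Eq)]
enum SweepAction {
    RemoveIntent,               // row 1: root absent
    Retire,                     // row 2: outcome already recorded
    RollForward,                // row 4: create completed, state stale
    KeepError(&'static str),    // row 5: partial root, never auto-deleted
    KeepDrifted(&'static str),  // row 6: unexpected schema
}

/// Classify one sidecar. The ordering of checks is an assumption.
fn classify_graph_create(obs: &Observation) -> SweepAction {
    match obs.root {
        RootState::Absent => SweepAction::RemoveIntent,
        RootState::Partial => SweepAction::KeepError("graph_create_incomplete"),
        RootState::Complete if !obs.live_schema_matches_intent => {
            SweepAction::KeepDrifted("actual_applied_state_pending")
        }
        RootState::Complete if obs.ledger_records_outcome => SweepAction::Retire,
        RootState::Complete => SweepAction::RollForward,
    }
}

/// Kept sidecars are the ones that block further graph-moving work.
fn keeps_sidecar(action: &SweepAction) -> bool {
    matches!(action, SweepAction::KeepError(_) | SweepAction::KeepDrifted(_))
}
```

In this model only the two `Keep*` outcomes leave the sidecar in place, matching the commit's note that partial roots and unexpected schemas are kept rather than cleaned up automatically.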