omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-12 01:45:14 +02:00

Author	SHA1	Message	Date
Andrew Altshuler	c3ff076e89	Merge pull request #181 from ModernRelay/feat/container-cluster-mode feat(docker): cluster-mode container + AWS/Railway recipes	2026-06-10 23:57:34 +03:00
Andrew Altshuler	2b5fb7197e	Merge pull request #180 from ModernRelay/feat/cluster-local-config feat(cli): per-operator actor for cluster ops; pin omnigraph.yaml isolation	2026-06-10 23:57:31 +03:00
aaltshuler	f165145b63	docs(deploy): address review — consistent placeholders, complete ECS command The ECS day-2 apply gains its required --config flag (the image ships no omnigraph.yaml, so the CLI cannot locate the cluster dir without it), and the docker-exec example uses the <you> placeholder convention instead of a real-looking actor name. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:54:26 +03:00
aaltshuler	3b2bf755ae	fix(cli): address review — honor the one-thing contract, restore docs, untangle test phases - resolve_cluster_actor uses load_config directly: load_cli_config also loads auth.env_file into the process env — a second thing, violating the documented 'exactly one thing' omnigraph.yaml contract for cluster ops. - resolve_cli_actor gets its doc comment back (the inserted helper had absorbed the contiguous /// block). - The actor-default test imports once as setup and asserts on apply alone, idempotently, instead of re-importing inside the assertion helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:54:05 +03:00
aaltshuler	6b3ae7ac79	docs(deploy): AWS and Railway cluster-mode recipes The container contract (OMNIGRAPH_CLUSTER + mounted volume + token env), ECS/Fargate+EFS and Railway-volume walkthroughs, the in-container day-2 loop, and the honest constraints list (volume mandatory, no hot reload, single-writer apply, shared-volume replicas unvalidated). Operator guide links the recipes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:45:30 +03:00
aaltshuler	d3ae31be08	feat(docker): cluster-mode entrypoint and the CLI in the image OMNIGRAPH_CLUSTER boots the container from a mounted cluster directory's applied revision — checked first and exclusive (exit 64 when combined with OMNIGRAPH_TARGET_URI/CONFIG/TARGET), the entrypoint-level mirror of the server's mode-inference rule 0. The omnigraph CLI joins the image so the day-2 loop (cluster apply/approve/status, data loads by explicit URI) runs in-container via docker/ECS exec or railway shell — no omnigraph.yaml required, which the cluster-local-config PR pins. entrypoint_test gains the cluster case plus all three exclusivity refusals. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:44:54 +03:00
aaltshuler	fbe9726ac7	test(cli): stop the S3 e2e scaffolding omnigraph.yaml into the crate dir local_cli_s3_end_to_end_init_load_read_flow ran `omnigraph init` without a current_dir, so init's project scaffold landed in crates/omnigraph-cli/ — poisoning any later test that resolves a graph target from the cwd config (query_lint_requires_schema_or_resolvable_graph_target fails deterministically once the file exists). Only manifests when OMNIGRAPH_S3_TEST_BUCKET is set, which is why local FS runs and CI's scoped rustfs job never caught it. The init and load calls now run inside the test's tempdir. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:34:54 +03:00
aaltshuler	99f7f36864	docs(cluster): the precise omnigraph.yaml contract The 'Relationship to omnigraph.yaml' section becomes the exact rule set: cluster commands read the per-operator config for exactly one thing (the cli.actor default when --as is omitted), a --cluster server reads it for nothing, and pointing data-plane targets at derived roots is ergonomics, not coupling. Operator guide and CLI reference updated to match. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:30:40 +03:00
aaltshuler	f7368b58a0	test(cli): pin --cluster boot isolation from cwd omnigraph.yaml A --cluster server process whose cwd contains a MALFORMED omnigraph.yaml boots and serves — proving mode-inference rule 0 returns before any config search can run. New spawn_server_with_cluster_in support helper sets the spawned server's cwd explicitly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:29:49 +03:00
aaltshuler	f3374ac6dc	feat(cli): resolve cluster actor via the per-operator config cascade Cluster FACTS stay unlayered (cluster.yaml only), but the operator's identity is a per-operator fact — exactly the per-operator omnigraph.yaml's permanent job, and the cascade every data-plane write already uses. cluster apply/approve now resolve: --as flag wins and skips any config read entirely (containers and CI stay config-free); without it, the standard cwd search supplies cli.actor, with a malformed config failing loudly and actionably ('pass --as to skip this lookup') rather than silently dropping attribution. approve's no-actor error now names both sources. Tests pin the contract from both sides: cli.actor is the no-flag default for apply (echoed actor) and approve (approved_by), the flag overrides it, a malformed omnigraph.yaml in cwd breaks nothing except the no-flag actor lookup, and a conflicting well-formed one leaks nothing into cluster outputs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:29:49 +03:00
Andrew Altshuler	b8300736be	Merge pull request #179 from ModernRelay/docs/cluster-operator-guide docs(cluster): operator how-to guide for deploying and managing clusters	2026-06-10 22:25:35 +03:00
aaltshuler	97eb65e921	docs(cluster): operator how-to guide for deploying and managing clusters New docs/user/cluster.md — the practical companion to cluster-config.md's reference: zero-to-served walkthrough (validate/import/plan/apply, derived roots, data loading, --cluster serving), the day-2 edit->plan->apply->restart loop with a per-change-kind table, drift observation and convergence, the approval gate for destructive changes, crash/lock/lost-ledger recovery, the boot-refusal table with remedies, deployment patterns (replicas, backup unit, CI gating), and the explicit not-yet list (hot reload, S3-hosted cluster dirs, per-query exposure, pipelines). Linked from the user index, the agent guide's topic map, and cross-linked from the reference. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:10:19 +03:00
Andrew Altshuler	13d5c52abc	Merge pull request #177 from ModernRelay/test/cluster-full-cycle-e2e test(cli): comprehensive full-cycle cluster e2e with a live server	2026-06-10 19:17:04 +03:00
Andrew Altshuler	e8833ef980	Merge pull request #178 from ModernRelay/ci/pin-rustfs-beta8 ci: pin RustFS to 1.0.0-beta.8	2026-06-10 19:10:30 +03:00
aaltshuler	d8354ac213	test(cli): address review — assert schema-show success, document exit-code stance, add e2e opt-out - The drift-heal verification now asserts `schema show` succeeded and produced a schema before checking the rogue field's absence (a failed command previously made the negative assertion vacuously pass). - cluster_cli documents why it deliberately does not assert exit codes (blocked applies exit non-zero by contract while emitting the structured output callers assert on). - The comprehensive lifecycle e2es honor OMNIGRAPH_SKIP_SYSTEM_E2E=1 (graceful skip-with-message, the S3-gate pattern) for constrained sandboxes; requirements + suppression documented in testing.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:05:12 +03:00
aaltshuler	711e04a161	ci: pin RustFS to 1.0.0-beta.8 beta.4+ refuses the rustfsadmin/rustfsadmin test credentials unless RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true is set — acceptable for the ephemeral CI container and the local bootstrap script (which already passed it). The three S3 suites were validated against the beta.8 binary locally before this bump. The pin stays explicit, never `latest`, so future upgrades remain deliberate. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:44:05 +03:00
aaltshuler	7d70811df1	test(cli): comprehensive full-cycle cluster e2e with a live server Two system tests composing the whole Phase 1-5 surface with real binaries: - local_cluster_full_lifecycle_declare_serve_evolve_delete: declare two graphs -> one apply creates and converges them -> the --cluster server serves both stored queries -> schema+query evolve in one apply (migration previewed in plan) -> restart serves the new shape -> out-of-band schema drift observed by refresh and converged back by apply (rogue field soft-dropped) -> approved graph delete -> restart serves the survivor and 404s the tombstoned graph -> final plan empty. Catches composition regressions where each stage passes its own tests but the lifecycle breaks (the composite_flow.rs principle at the control-plane level). - local_cluster_serving_enforces_applied_policy_bindings: applied policy bundles gate serving per their bindings over HTTP with bearer-resolved actors — the cluster-bound bundle owns graph_list (admin 200, reader 403, anonymous 401), the graph-bound bundle owns invoke_query (reader gets rows; denied invocation is the documented anti-probing 404). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:07:29 +03:00
Andrew Altshuler	af6a1096b0	Merge pull request #176 from ModernRelay/feat/server-cluster-boot-5b feat(server,cluster): Phase 5 — omnigraph-server boots from cluster state	2026-06-10 18:00:57 +03:00
aaltshuler	711865e6f1	docs(cluster,server): the Phase 5 mode switch; retire applied-not-serving caveats The standing caveat ('applied means recorded in the cluster catalog — nothing more; the server still boots from omnigraph.yaml') retires: cluster docs gain the 'Serving from the cluster' section (exclusivity, applied- revision serving, fail-fast readiness, restart-to-pick-up, expose-all bridge), server.md gains mode-inference rule 0 and the cluster-booted multi mode, deployment.md the boot-source choice, and the CLI's apply note plus the cli-reference cluster row (stale back to Stage 3A) now describe the full convergence surface. RFC-005 flips to Landed with four implementation deviations recorded. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:56:54 +03:00
aaltshuler	f3eb60fa4e	test(cli): applied-means-serving system e2e The Phase-5 contract end to end with real binaries: cluster import + apply via the CLI, seed a row through the graph plane, boot omnigraph-server with --cluster (no omnigraph.yaml anywhere), and the applied stored query serves the row over HTTP through the multi-graph routes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:51:40 +03:00
aaltshuler	948a54daa7	feat(server): boot from cluster state via --cluster RFC-005 §D1/§D2: omnigraph-server --cluster <dir> is rule 0 of the mode inference — an exclusive boot source (hard error when combined with a graph URI, --target, or --config) that never opens omnigraph.yaml, not even the implicit current-directory search. The cluster branch reads the applied revision through omnigraph-cluster's serving-snapshot API and feeds the EXISTING multi-graph pipeline: GraphStartupConfig per recorded graph at its derived root, stored queries built via QueryRegistry::from_specs from verified blob content (expose-all — the §D5 bridge until Phase 6 policy-owned exposure), cluster-bound policy bundles as the server-level Cedar engine and graph-bound bundles per graph, straight from the content-addressed blob paths. Multiple bundles binding one scope refuse boot (one-bundle-per-scope is the serving pipeline's shape; stacking is a later slice). Everything downstream — parallel opens, query type-checking, registry, routing, auth, OpenAPI — is reused unchanged; cluster mode is a new source, not a new pipeline. First server->cluster crate dependency: read-only types + one fn; omnigraph-cluster stays HTTP-free. open_multi_graph_state goes pub for integration tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:48:10 +03:00
aaltshuler	f5b43164b8	feat(cluster): pub read-only serving-snapshot API RFC-005 §D2/§D4: read_serving_snapshot reads the applied revision as everything a server needs to boot — graphs at derived roots, stored-query sources read from the content-addressed catalog and re-hashed against the recorded digests, policy blob paths with their applied applies_to bindings. All-or-nothing: missing state, pending recovery sidecars, missing/tampered blobs, pre-5A entries without bindings, and an empty graph set each refuse the snapshot with a remedy; no partial serving. Lock-free by design — the state file is replaced atomically, so the read is a consistent point-in-time ledger. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:39:26 +03:00
Andrew Altshuler	bed36a8423	Merge pull request #175 from ModernRelay/feat/cluster-policy-bindings-5a feat(cluster): Slice 5A — policy applies_to bindings in the applied revision	2026-06-10 16:57:26 +03:00
aaltshuler	6c98560dde	docs(cluster): document policy binding metadata (5A) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:30:57 +03:00
aaltshuler	0b84b1adc3	feat(cluster): record policy applies_to bindings in the applied revision Slice 5A of RFC-005: the state ledger becomes serving-sufficient for the Phase-5 server boot. StateResource gains an optional applies_to (normalized typed refs: cluster | graph.<id>), written by apply for every applied policy create/update from the desired config's validated bindings. The hole this closes: applies_to is not part of the policy file digest, so a binding-only edit previously produced NO plan change at all (a 4C e2e even asserted that — the gap, not a contract). Binding changes are now first-class: a post-diff pass emits an Update with equal before/after digests and a binding_change marker (visible in plan/apply JSON and human output as [bindings]), classification/execution treat it as an ordinary catalog-tier applied change (payload skips naturally — the blob is unchanged), and convergence requires zero binding divergence, so stale bindings can never report converged. Pre-5A ledger entries (no bindings recorded) surface as the same backfill Update; one apply heals them, exactly the remedy RFC-005's boot-error path names. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:30:33 +03:00
Andrew Altshuler	3e8f103804	docs(cluster): RFC-005 — server boots from cluster state (Phase 5 design) (#174) The axiom-15 mode switch: omnigraph-server --cluster <dir> (mutually exclusive with uri/--target/--config, zero omnigraph.yaml reads) serves the APPLIED revision — graph set from state, query/policy content from the content-addressed catalog at applied digests, cluster-scoped policy bundles as the server-level Cedar engine. The load-bearing finding: state is not yet serving-sufficient (policy applies_to bindings live only in cluster.yaml), so slice 5A records binding metadata into the applied revision at apply time — without it, boot-from-state silently becomes the merged read axiom 15 forbids. Fail-fast readiness table (missing state, pending sidecars, missing blobs, unbound policies all refuse boot with remedies), the expose-all mcp.expose bridge with its Phase 6 sunset, the operator migration path (exit criterion 7), and 5A/5B/5C sequencing. The existing boot pipeline (GraphStartupConfig -> registry -> routing/auth) is reused as-is — a new source, not a new pipeline. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:22:12 +03:00
Andrew Altshuler	61da7bf406	docs(cluster): descope ETL pipelines to a separate project; keep the socket (#172) Pipelines (scheduler, connectors, mapping, idempotency, run ledger) leave the cluster control-plane rollout and become their own project with their own RFC. This rollout guarantees only the socket, all of which already exists and is enforced: the pipelines: config field is reserved (typed future_phase_field rejection, test-covered), the pipeline.<name> typed address and Pipeline resource kind are reserved in the resource model, and axiom 13 fixes the contract any future implementation must satisfy (definition reconciled, execution data-plane, fan-out statusful). The ETL section in the high-level spec stands as the requirements record for that project; exit criterion 9 defers to its RFC. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:53:16 +03:00
Andrew Altshuler	14b85a59de	Merge pull request #173 from ModernRelay/feat/cluster-graph-delete-4c feat(cluster): Stage 4C — gated graph delete; Phase 4 complete	2026-06-10 14:53:11 +03:00
aaltshuler	c949a2b717	docs(cluster): document Stage 4C — Phase 4 complete Approvals + gated graph deletion in the user docs, the approve command in the CLI reference, RFC-004 flipped to Landed with its three implementation deviations recorded (row-8 retire-and-repropose, --as instead of --actor/--by, consumed artifacts rewritten in place rather than moved). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:44:12 +03:00
aaltshuler	87691fe9c7	test(cluster): failpoint coverage for delete crash windows - Crash before the removal: root intact, approval file unconsumed, sidecar survives, no ack; the next run retires the stale intent (row 8) and the still-approved delete completes in the same run. - Crash after the removal, before the state CAS: root gone, ledger byte-identical, the sidecar carries the approval id; the next run's sweep rolls the tombstone forward, consumes the approval, audits the recovery, and converges (row 7b). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:34:54 +03:00
aaltshuler	d1d04217ab	feat(cluster): execute approved graph deletes in cluster apply Stage 4C execution half (RFC-004 §D5/§D6 + sweep rows 7/7b/8): an approved graph.<id> delete — and its riding schema/query deletes — classifies Applied and executes LAST in the run, sidecar-fenced: pre-op manifest pin (best effort; partial roots still delete), approval_id carried in the sidecar, recursive root removal (NotFound tolerated), subtree tombstoned out of the ledger with a tombstone observation, the approval consumed in the same state CAS (ledger summary) and its artifact file rewritten with consumed_at only after the CAS lands — a failed run consumes nothing and the approval stays valid for the retry. Sweep rows: already-tombstoned intents retire (7); a completed delete with a stale ledger rolls forward — tombstone + approval consumption + audit entry (7b, idempotent); a still-present root retires the stale intent with a graph_delete_incomplete warning and the still-approved delete re-executes in the same run (8) — prefix removal is idempotent, so retry IS the repair. The multi-graph mixed e2e gets its conclusion: blocked without approval, cluster approve graph.engineering --as andrew, converge, tombstone visible in status. Phase 4's disposition matrix is now fully executable. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:34:02 +03:00
aaltshuler	f4e9105272	feat(cluster): cluster approve — digest-bound approval artifacts RFC-004 §D4, gate half: graph deletes (and their subtree) now classify Blocked/approval_required instead of Deferred; the new cluster approve command (requires the global --as actor) writes __cluster/approvals/{ulid}.json bound to the desired config digest and the change's before/after digests, so config or state drift invalidates the artifact automatically (approval_stale warning, never authorizes). One gate per subtree: compute_approvals lists only the graph-level delete, and ApprovalRequirement gains a satisfied flag surfaced by plan. Consumption and the delete executor land next — until then approved deletes stay blocked so a gate-only build can never strip state without removing the root. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:30:05 +03:00
Andrew Altshuler	f799d4578c	Merge pull request #171 from ModernRelay/feat/cluster-schema-apply-4b feat(cluster): Stage 4B — cluster-driven schema apply	2026-06-10 14:03:31 +03:00
aaltshuler	f217352c93	docs(cluster): document Stage 4B schema apply Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:14:20 +03:00
aaltshuler	80cae4e8e1	test(cluster): failpoint coverage for schema-apply crash windows - Crash before the engine call: sidecar (carrying the --as actor) survives, live schema and ledger untouched, no ack; the next run's sweep retires the stale intent and the same run applies and converges. - Crash after the engine call, before the state CAS: the manifest moved with the post-op pin in the sidecar, state.json byte-identical; the next run's sweep rolls the ledger forward with a schema_apply audit entry and the run converges. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:13:15 +03:00
aaltshuler	a1ba4dc413	feat(cluster): execute schema applies in cluster apply Stage 4B (RFC-004 §D1/§D5): schema.<id> Update changes classify Applied and execute after graph creates, sequentially and sidecar-fenced — read-write open (the engine's own recovery runs first), pre-op manifest pin recorded, apply_schema_as with allow_data_loss: false (soft drops only; hard drops wait for 4C's approval artifacts), post-op pin rewritten into the sidecar, sidecar retired only after the final state CAS. Queries gated on a same-plan schema update unblock (the migration lands first in the same run); failures — unsupported migrations, lock contention, user branches — surface as schema_apply_failed with the engine's message, demote dependents via the origin-aware demotion helper, and stop further graph-moving work. Schema evolution is now fully cluster-driven (the defer -> manual schema apply -> refresh loop is gone), and out-of-band schema drift is converged back by apply as an ordinary soft migration (axiom 8: drift correction is gated like any change; the recoverable tier needs no approval) — both pinned by reworked e2es. The multi-graph mixed e2e's deferred row is now delete-shaped, pre-staging the 4C surface. Actor: cluster apply accepts the CLI's global --as via the new ApplyOptions / apply_config_dir_with_options (apply_config_dir delegates unchanged); the actor is echoed in ApplyOutput and recorded in sidecars and audit entries, and threads to apply_schema_as so Cedar fires wherever a checker is installed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:12:15 +03:00
aaltshuler	0571c05ebb	feat(cluster): schema-apply recovery sidecar kind and sweep RecoverySidecarKind::SchemaApply with digest-based sweep classification (robust to unrelated manifest movement; version pins stay forensic): ledger-consistent -> sidecar retired (RFC-004 rows 1+2); live digest matches the intended schema, state stale -> roll forward with composite recompute and a recovery_records audit entry (row 3); unverifiable or unexpected digests -> pending, kept, graph-moving work blocked (rows 1-unopenable/6). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:05:42 +03:00
aaltshuler	ca63a9340b	feat(cluster): embed schema migration previews in cluster plan RFC-004 §D7's data-aware preview: for every schema update, plan opens the live graph read-only and embeds the engine's migration plan (supported flag + typed steps) in the change record; the human renderer prints the steps. Preview failures (unreachable graph, planner error) degrade to the digest diff with a schema_preview_unavailable warning — planning never blocks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:04:19 +03:00
aaltshuler	b313075476	refactor(cluster): make plan_config_dir async Mechanical conversion ahead of Stage 4B (plan will preview schema migrations against live graphs): signature, CLI dispatch, and test callers. Zero behavior change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:02:12 +03:00
Andrew Altshuler	e6921157cc	Merge pull request #170 from ModernRelay/feat/cluster-graph-create-4a feat(cluster): Stage 4A — graph create in cluster apply	2026-06-10 05:19:17 +03:00
aaltshuler	cb6c67f196	docs(cluster): document Stage 4A graph create Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 05:00:42 +03:00
aaltshuler	83d77bcb16	test(cluster): failpoint coverage for graph-create crash windows - Crash before the init (row 1): sidecar survives, nothing moved, no ack; the next run's sweep removes the intent and the same run creates and converges. - Crash after the init, before the state CAS (row 4): the graph exists with the post-init manifest pin in the sidecar, state.json byte-identical; the next run's sweep rolls the ledger forward with a recovery_records audit entry and the run converges. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:59:48 +03:00
aaltshuler	c3007369cd	feat(cluster): execute graph creates in cluster apply Stage 4A (RFC-004 §D1/§D5): graph.<id> Create — and its paired schema Create, which the init carries — classify Applied and execute first in the run, sequentially and sidecar-fenced: sidecar written before Omnigraph::init at the derived root, rewritten with the post-init manifest pin, deleted only after the final state CAS lands. Dependent queries and policies no longer block on a graph create in the same plan — creates run first, so they apply in the same run; a create failure demotes them to blocked (dependency_not_applied) and stops further graph-moving work (loud partials), with the sidecar left for the sweep to classify. Graphs with a kept recovery sidecar (rows 5/6) classify Blocked/cluster_recovery_pending, and the sweep's Drifted/Error statuses are never clobbered by a generic Blocked. Schema source is re-read and digest-verified under the lock before the init (the write_resource_payload TOCTOU posture). Plan previews the same dispositions. e2e fallout updated: a fresh multi-graph config now converges in one apply; a destroyed root is re-created as an EMPTY graph by the next apply (declarative convergence — visible in plan, called out in docs); the new cluster_e2e_declared_graph_created_by_apply pins the no-manual-init flow. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:58:56 +03:00
aaltshuler	bf8cc7a753	feat(cluster): graph-create recovery sidecars and sweep RFC-004 §D2/§D3 for the graph_create kind. RecoverySidecar records intent under __cluster/recoveries/{ulid}.json; the roll-forward-only sweep runs at the start of apply/refresh/import under the state lock and classifies each survivor by observation: root absent -> intent removed (row 1); outcome already recorded -> retired (row 2); create completed but state stale -> ledger rolled forward with a recovery_records audit entry (row 4); partial root -> Error/graph_create_incomplete, kept, never auto-deleted (row 5); unexpected schema -> Drifted/actual_applied_state_pending, kept (row 6). Sweep mutations ride the command's existing CAS write; completed sidecars are deleted only after that write lands. Read-only status/plan warn (cluster_recovery_pending) without acting. The apply payload gate now counts only payload-phase errors so kept-sidecar diagnostics don't abort the run before their statuses persist. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:50:42 +03:00
aaltshuler	6fbf09d5c9	refactor(cluster): make apply_config_dir async Mechanical conversion ahead of Stage 4A graph create (which calls the async Omnigraph::init from inside apply): the fn signature, the CLI dispatch arm, and every test caller (#[test] -> #[tokio::test]). Zero behavior change; all 60 lib tests and 3 failpoint tests green before and after. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:43:38 +03:00
Andrew Altshuler	26b26999fd	ci(codeowners): aaltshuler owns all paths; remove ragnorc (#169) Engineering and docs roles both resolve to @aaltshuler; every path (catch-all, crates/, docs/, repo-level docs) now requires their review. CODEOWNERS and the doc tables regenerated from codeowners-roles.yml via render-codeowners.py. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:34:17 +03:00
Andrew Altshuler	58c66a54a2	docs(cluster): RFC-004 — graph & schema apply design (Phase 4) (#168) * docs(cluster): RFC-004 — graph & schema apply design (Phase 4) The design the implementation spec's exit criteria require before graph-moving cluster apply ships. Core positions: - Cluster recovery is roll-forward-only: the engine's own sidecars make every graph-level operation atomic within the graph, so the cluster never rolls a graph back — its sidecars (__cluster/recoveries/{ulid}.json) classify and record, converging the ledger to observable reality (axiom 5) or surfacing a loud pending-repair condition. Eight-row decision matrix, every row testable with the Stage 3B failpoint harness. - Irreversible operations (graph delete, allow_data_loss schema apply) consume digest-bound approval artifacts written by a new cluster approve command and retired into state.approval_records (axiom 11). A stale approval can never authorize a different change. - cluster apply gains an actor, threaded to apply_schema_as so engine Cedar enforcement and commit attribution work unchanged; the cluster adds no policy engine of its own. - Deterministic ordering (creates -> schema applies -> catalog -> deletes), per-resource apply groups, cross-graph atomicity explicitly not promised. - Staged 4A graph create / 4B schema apply / 4C graph delete, each gated on per-matrix-row failpoint tests. Answers exit criteria 2 and 4 fully, 1/5/6 partially; 3/7/8/9 deferred to their phases (coverage table in the RFC). Linked from the dev index and the implementation spec's Phase 4 section. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * docs(cluster): RFC-004 review fixes — graph_delete sweep rows, state_cas_base contract Two greptile findings: (1) D3 row 2 could not be evaluated for graph_delete (no manifest to version-check after prefix removal) and 'root absent, state already tombstoned' fell into the stale row — split into rows 7 (delete's analog of row 2) and 7b (the roll-forward), with expected_manifest_version documented as always null for the delete kind. (2) state_cas_base is now explicitly audit/diagnostics-only — the sweep never consults it; independent state mutations are handled by the ordinary CAS like any concurrent write. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:34:14 +03:00
Andrew Altshuler	effb9cc068	Merge pull request #167 from ModernRelay/feat/cluster-stage3b feat(cluster): Stage 3B — catalog payload verification + failpoint coverage	2026-06-10 03:17:11 +03:00
aaltshuler	16759b28b9	fix(cluster): RAII-guard the callback failpoint ScopedFailPoint::with_callback gives cfg_callback the same Drop-based cleanup as cfg actions; a panic while the point is active no longer leaks the callback into the process-global registry where it would fire under later tests (greptile review, PR #167). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 02:36:24 +03:00
aaltshuler	08ea659c9b	build: commit Cargo.lock for omnigraph-cluster's optional fail dependency The failpoints feature added fail = { workspace = true, optional = true } to the crate manifest; the lockfile edge belongs with it (--locked CI gate). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 02:21:10 +03:00


429 commits