omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-12 01:45:14 +02:00

Author	SHA1	Message	Date
aaltshuler	8d7aed065f	test(cluster,server): gated object-storage cluster e2e + CI wiring + docs s3_cluster.rs runs the full control-plane lifecycle against a real bucket (CI: containerized RustFS; locally the RustFS binary): import → lock released (pins the drop-time release regression caught on the first live smoke) → apply (graph roots + catalog on the bucket, nothing local) → serving snapshots from both the config dir and the bare URI → schema evolution → approved delete (prefix removal) → empty-cluster refusal. The server suite gains the config-free boot test: --cluster s3://… with zero local files serves a stored query over HTTP. CI: the rustfs job runs both suites; the classify filter covers the cluster store/serve modules and the new test files. The server smoke drops its name filter — every test in the s3 target is bucket-gated, and a filter matching nothing passes vacuously (which silently ran zero tests for a while). Docs: deployment.md gains the Bucket-no-volume shape as the preferred cloud deployment; cluster.md/server.md document --cluster <uri>; testing.md maps the new suite. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:56:40 +03:00
aaltshuler	58855c0a7c	feat(cluster,server): inline policy content + config-free --cluster URI boot Two serving changes that complete RFC-006's read side: ServingPolicy carries the policy bundle CONTENT (digest-verified at snapshot read) instead of a blob path — the catalog may live on object storage, and the server must not re-read mutable state after the snapshot. The server grows a PolicySource enum: File for omnigraph.yaml deployments (unchanged), Inline for cluster boots, wired through PolicyEngine::load_{graph,server}_from_source. read_serving_snapshot_from_storage(uri) reads the applied revision straight from a storage root, and --cluster accepts a scheme-qualified URI (s3://bucket/prefix): config-free serving — a serving box needs only the URI and credentials; the ledger and catalog on the bucket ARE the deployment artifact. Bare paths keep the config-directory behavior. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:56:22 +03:00
Andrew Altshuler	7af3697397	Merge pull request #193 from ModernRelay/refactor/cli-modularize refactor(cli): modularize main.rs and the test monolith — pure code movement	2026-06-11 15:37:28 +03:00
aaltshuler	d5e75df272	refactor(cli): split the test monolith into command-area suites tests/cli.rs (4,548 lines, 112 tests) becomes five area files — cli_cluster (24), cli_cluster_e2e (10, the spawned-binary lifecycle compositions), cli_data (49), cli_schema_config (16), cli_queries (13) — with the file-local helpers joining the existing tests/support harness. Verbatim moves + visibility bumps; 161 crate tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:16:51 +03:00
aaltshuler	916015c416	refactor(cli): split main.rs into cli/helpers/output modules Verbatim moves: the clap surface (every command/subcommand/arg struct) to cli.rs, resolution helpers (config/actor/graph/branch/query, remote HTTP, env/token, scaffolding) to helpers.rs, human/JSON formatting to output.rs, the in-source test mod to main_tests.rs via #[path]. main.rs (1,184 lines) keeps main() and the dispatch match. Visibility bumps only; 22 binary tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:14:27 +03:00
aaltshuler	127440d873	refactor(server): split lib.rs into handlers and settings modules Verbatim moves: route handlers + bearer-auth middleware + per-request authorization + the cluster-prefix OpenAPI rewrite go to handlers.rs; settings resolution (omnigraph.yaml/CLI/env, mode inference, bearer-token sources, runtime-state classification) and its in-source test mod go to settings.rs. lib.rs (1,158 lines) keeps the public types, app/router assembly, and serve(). The ApiDoc derive references handlers::-qualified paths; the one multi-line utoipa attribute the cut orphaned was relocated with its handler. 289 crate tests green, OpenAPI drift check included. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:08:25 +03:00
aaltshuler	b036073ec6	refactor(server): split the test monolith into area suites tests/server.rs (6,517 lines, 110 tests) becomes seven area files — auth_policy, data_routes, schema_routes, stored_queries, multi_graph, boot_settings, s3 — with shared helpers in tests/support/mod.rs. Verbatim moves + visibility bumps (pub on helpers, pub(super)->pub inside the matrix harness); cargo fix stripped the per-file unused imports. All 110 tests pass in their new homes (289 across the crate including lib and openapi). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 15:03:51 +03:00
aaltshuler	f6ae3e4fa3	fix(cluster): lock release must complete before a CLI process exits Caught by the first live s3 smoke: StateLockGuard's spawned async delete dies with the runtime when a short-lived CLI process exits right after the command — import's lock survived into the next command as state_lock_held. On the multi-thread runtime (the CLI, and the gated s3 tests) block_in_place waits for the delete to complete; current-thread runtimes keep the spawn fallback with force-unlock as the documented recovery, same as a crash. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 14:33:26 +03:00
aaltshuler	8dc2f15255	feat(cluster): the storage: root — state, catalog, and graph roots relocatable cluster.yaml gains an optional storage: URI deciding where everything the cluster STORES lives: the state ledger, lock, content-addressed catalog, recovery sidecars, approval artifacts, and the derived graph roots (<storage>/graphs/<id>.omni). Absent, it defaults to the config directory itself — the original layout, byte-compatible, so pre-existing clusters and the whole test suite are untouched. Declared configuration always stays in the working tree (Terraform's config-local/state-remote split); credentials are env-only, never in cluster.yaml. Every command resolves its store from the declared root (a bad root is a loud invalid_storage_root). Graph-root derivation, the delete executor (prefix delete via the adapter), the sweep's existence probes, the catalog payload write/verify/read paths, and the serving snapshot all flow through ClusterStore — the last raw-fs holdouts for stored state are gone, and the deny-list gains the rule that keeps it that way. Tests: default-layout byte-compat, a file:// root relocating the entire cluster (ledger+catalog+graphs under the new root, nothing under the config dir, serving snapshot follows), invalid-root validation. 98 in-crate + 9 failpoints + full workspace gate green. The s3:// flavor lands with PR 3's gated RustFS e2e. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 14:28:04 +03:00
aaltshuler	fd002abaa5	feat(cluster): port the storage backend to the engine StorageAdapter LocalStateBackend becomes ClusterStore: every stored byte — state ledger, lock, recovery sidecars, approval artifacts — now flows through the engine's StorageAdapter, making file:// and s3:// one code path. Behavior on the file backend is byte-compatible (layout, CAS semantics, diagnostics, lock release timing) and the entire pre-existing suite passes unchanged. Mechanics: the ledger CAS keeps its public sha256 vocabulary while the physical swap is token-conditioned (ETag If-Match on S3 via PR #186's primitives; content-token + temp/rename locally — the pre-port semantics); the lock is a create-only put (genuinely cross-machine on object stores) with deterministic drop-release locally and best-effort spawned release on S3; sidecars/approvals address by URI (SweepOutcome and the executors carry strings); sweep row-1 retirement joins the uniform deferred post-CAS cleanup. ClusterStore also gains the catalog-payload and graph-root methods that commit 2 wires in. Async ripple: status/force-unlock/serving-snapshot and the server's settings loader chain go async (CLI dispatch and ~20 test hosts follow, mechanically). tokio joins the cluster crate's runtime deps for the lock guard's handle. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 14:11:14 +03:00
aaltshuler	db6fe03be1	refactor(cluster): move type definitions to types.rs Verbatim move of the public output/diagnostic types and the internal state/sidecar/approval models; previously-private types and their fields get pub(crate) (they were crate-visible by position before). lib.rs is now the command pipeline + public API. 95 tests green; full workspace gate green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:42:02 +03:00
aaltshuler	dc0a1fc5a5	refactor(cluster): move declared-config loading to config.rs Verbatim move of cluster.yaml parsing, query discovery, source digesting, header/id validation, path resolution, and live-graph observation. Two helpers that the cut swept along were relocated to their right homes (state-status helpers back to lib.rs, lock-file helpers to store.rs). 95 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:37:20 +03:00
aaltshuler	dd17c0c50f	refactor(cluster): move diffing and classification to diff.rs Verbatim move of diff_resources, binding-change diffing, blast radius, approval gating, ResourceKind, classify_changes, and demotion. 95 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:33:13 +03:00
aaltshuler	9c3e09e838	refactor(cluster): move the recovery sweep to sweep.rs Verbatim move of the sidecar classification (all RFC-004 D3 rows), tombstoning, and approval-consumption helpers. 95 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:30:55 +03:00
aaltshuler	00fc5cf537	refactor(cluster): move the serving snapshot to serve.rs Verbatim move of the Serving* types, read_serving_snapshot, and read_verified_payload; public re-exports preserved (the server's imports are unchanged). 95 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:29:44 +03:00
aaltshuler	5a8047e5d0	refactor(cluster): move the storage backend to store.rs Verbatim move of LocalStateBackend, StateSnapshot, StateLockGuard and their impls — the single home for stored-state I/O (state ledger, lock, recovery sidecars, approval artifacts), where the RFC-006 object-storage port lands next as a focused diff. Visibility bumps (pub(crate)) only; 95 tests green before and after. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:28:04 +03:00
aaltshuler	fbb86dee0e	refactor(cluster): move the in-source test suite to tests.rs Verbatim move (indentation preserved — embedded raw-string fixtures are content). lib.rs drops from 7,857 to ~4,750 lines; `use super::*` resolves to the crate root through the #[path] module declaration unchanged. 95 tests green before and after. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:25:53 +03:00
aaltshuler	d702fd106a	feat(policy): from-source twins for the policy loaders PolicyConfig::from_source + PolicyEngine::load_graph_from_source / load_server_from_source — the path-based loaders delegate to them. Needed by callers whose policy bundles don't live on the local filesystem (the cluster catalog on object storage); kind-alignment validation stays loud through the new path. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:09:45 +03:00
aaltshuler	f48e69b999	feat(storage): versioned CAS, conditional replace, and prefix delete on StorageAdapter Three primitives the cluster's object-storage port (RFC-006) needs, on the engine's existing adapter rather than a parallel store: - read_text_versioned: content + an opaque backend version token (S3: the ETag from GET; local: content sha256 — ETags don't exist on a filesystem). - write_text_if_match: replace only when the token still matches. S3 maps to a conditional put (PutMode::Update / If-Match) — verified against RustFS beta.8 through the real object_store 0.12.5 path, no extra builder config needed; local compares content then swaps via temp+rename, the same single-machine semantics callers had before this trait (safe under their own lock protocol, not a cross-process barrier by itself). CAS-lost is Ok(None), never silent. - delete_prefix: recursive + idempotent (local remove_dir_all; S3 list + delete, with the non-atomicity documented for crash-retry callers). Gated S3 coverage: s3_adapter_conditional_writes_contract pins the conditional-write behavior the cluster ledger will depend on (red if a backend bump regresses it), and s3_schema_apply_migrates_live_graph closes the previously-untested schema-apply-on-S3 path before the cluster's schema executor leans on it. Engine gains the sha2 workspace dep. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 05:09:45 +03:00
aaltshuler	fa6af775c1	feat(cli)!: unified load command; deprecate ingest as an alias omnigraph load is now the single data-write command: - works against remote graphs (POSTs the server's /ingest endpoint with the same bearer/actor resolution as other remote commands) — previously load was the only data command forced to open Lance storage directly - --from <base> opts into fork-if-missing for --branch (the former ingest semantics); without --from a missing branch is an error, never a fork - --mode is now required: overwrite is destructive, so there is no implicit default (the old silent default was overwrite) - output gains base_branch/branch_created (and table sums on remote loads) omnigraph ingest stays as a deprecated alias (defaults preserved: --from main --mode merge) that prints a one-line warning to stderr, matching the read/change deprecation convention; removal in a later release. Docs updated in the same change: cli.md, cli-reference.md, policy.md, audit.md, execution.md (unified load section), AGENTS.md quick-flow, README.md. BREAKING CHANGE: scripts running omnigraph load without --mode must now pass it explicitly (previously defaulted to the destructive overwrite). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 04:18:00 +03:00
aaltshuler	90676ef52f	feat(server)!: POST /ingest forks only when 'from' is present Branch creation becomes opt-in by presence of the request's 'from' field. Previously the handler defaulted from to 'main' and always auto-created a missing branch — a typo'd branch name silently forked main and landed the data there, with the client none the wiser. Now a request without 'from' against a missing branch returns 404 branch-not-found and creates nothing; with 'from' set, fork-if-missing behaves as before. The BranchCreate authority is only consulted when a fork will actually happen. The handler calls the unified load_as directly (the deprecated ingest_as shim is no longer used in the server). IngestOutput.base_branch becomes nullable: it echoes the request's 'from' and is null when absent. OpenAPI regenerated; the CLI's local ingest arm moves to load_file_as + the new converter shape. BREAKING CHANGE: clients that relied on implicit fork-from-main with 'from' omitted must now pass from='main' explicitly. IngestOutput.base_branch is now nullable. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 04:05:29 +03:00
aaltshuler	c236a4c2df	refactor(loader): load_jsonl helpers take &Omnigraph and document their role The free helpers needlessly demanded &mut Omnigraph (every load API takes &self) and read as leftovers. Rather than rewriting their ~200 call sites across the test suites — which would have to re-derive the active-branch resolution at each site — keep the one convenience and make it honest: borrow immutably (&mut callers coerce, no churn) and document it as the active-branch shorthand over Omnigraph::load. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 03:57:41 +03:00
aaltshuler	e676c151bb	feat(engine): unify load/ingest — load_as gains an optional fork base load_as/load_file_as gain a base: Option<&str> parameter: with Some(base) a missing target branch is forked from base first (the former ingest semantics); with None the target branch must exist — staging fails on an unknown branch, so a typo'd name can never create one. LoadResult gains branch/base_branch/branch_created metadata (additive). The ingest family (ingest, ingest_as, ingest_file, ingest_file_as) becomes #[deprecated] shims over load_as that preserve the historical contract exactly (from: None still means fork from main; base recorded even when no fork happened). IngestResult and to_ingest_tables stay for the shims and the server until the removal release. The layered policy check is unchanged: Change on the target branch always, BranchCreate additionally when a fork actually happens (enforced inside branch_create_from_as with the actor threaded through). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 03:53:22 +03:00
aaltshuler	4558454bc7	fix(cluster): address review — discovery reads each file exactly once resolve_query_decls hands its file contents to the caller; the per-query digest/typecheck pass reuses them instead of re-reading (a file with N queries was read N+1 times), which also closes the window where a file changing between enumeration and validation produced a confusing query_key_mismatch for a just-discovered name. Explicit-map declarations read as before. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 01:35:47 +03:00
aaltshuler	677320ceec	feat(cluster): Terraform-shaped query declaration — discover from files cluster.yaml's graphs.<id>.queries previously accepted only an explicit name->file map, forcing configs to re-enumerate every `query <name>` that the .gq files already declare (the SPIKE cookbook needed 66 entries for 6 files). The files ARE the declaration now: `queries: queries/` discovers every declaration in a directory's top-level *.gq (sorted), a list form takes explicit files, and the map stays for fine-grained control. Discovery is loud — unreadable/unparseable files and duplicate query names fail validation (query_parse_error, duplicate_query_name). Downstream is untouched: each discovered query is still an individually addressed resource with the containing file's digest. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-11 00:46:21 +03:00
aaltshuler	3b2bf755ae	fix(cli): address review — honor the one-thing contract, restore docs, untangle test phases - resolve_cluster_actor uses load_config directly: load_cli_config also loads auth.env_file into the process env — a second thing, violating the documented 'exactly one thing' omnigraph.yaml contract for cluster ops. - resolve_cli_actor gets its doc comment back (the inserted helper had absorbed the contiguous /// block). - The actor-default test imports once as setup and asserts on apply alone, idempotently, instead of re-importing inside the assertion helper. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:54:05 +03:00
aaltshuler	fbe9726ac7	test(cli): stop the S3 e2e scaffolding omnigraph.yaml into the crate dir local_cli_s3_end_to_end_init_load_read_flow ran `omnigraph init` without a current_dir, so init's project scaffold landed in crates/omnigraph-cli/, poisoning any later test that resolves a graph target from the cwd config (query_lint_requires_schema_or_resolvable_graph_target fails deterministically once the file exists). Only manifests when OMNIGRAPH_S3_TEST_BUCKET is set, which is why local FS runs and CI's scoped rustfs job never caught it. The init and load calls now run inside the test's tempdir. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:34:54 +03:00
aaltshuler	f7368b58a0	test(cli): pin --cluster boot isolation from cwd omnigraph.yaml A --cluster server process whose cwd contains a MALFORMED omnigraph.yaml boots and serves — proving mode-inference rule 0 returns before any config search can run. New spawn_server_with_cluster_in support helper sets the spawned server's cwd explicitly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:29:49 +03:00
aaltshuler	f3374ac6dc	feat(cli): resolve cluster actor via the per-operator config cascade Cluster FACTS stay unlayered (cluster.yaml only), but the operator's identity is a per-operator fact — exactly the per-operator omnigraph.yaml's permanent job, and the cascade every data-plane write already uses. cluster apply/approve now resolve: --as flag wins and skips any config read entirely (containers and CI stay config-free); without it, the standard cwd search supplies cli.actor, with a malformed config failing loudly and actionably ('pass --as to skip this lookup') rather than silently dropping attribution. approve's no-actor error now names both sources. Tests pin the contract from both sides: cli.actor is the no-flag default for apply (echoed actor) and approve (approved_by), the flag overrides it, a malformed omnigraph.yaml in cwd breaks nothing except the no-flag actor lookup, and a conflicting well-formed one leaks nothing into cluster outputs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 22:29:49 +03:00
aaltshuler	d8354ac213	test(cli): address review — assert schema-show success, document exit-code stance, add e2e opt-out - The drift-heal verification now asserts `schema show` succeeded and produced a schema before checking the rogue field's absence (a failed command previously made the negative assertion vacuously pass). - cluster_cli documents why it deliberately does not assert exit codes (blocked applies exit non-zero by contract while emitting the structured output callers assert on). - The comprehensive lifecycle e2es honor OMNIGRAPH_SKIP_SYSTEM_E2E=1 (graceful skip-with-message, the S3-gate pattern) for constrained sandboxes; requirements + suppression documented in testing.md. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 19:05:12 +03:00
aaltshuler	7d70811df1	test(cli): comprehensive full-cycle cluster e2e with a live server Two system tests composing the whole Phase 1-5 surface with real binaries: - local_cluster_full_lifecycle_declare_serve_evolve_delete: declare two graphs -> one apply creates and converges them -> the --cluster server serves both stored queries -> schema+query evolve in one apply (migration previewed in plan) -> restart serves the new shape -> out-of-band schema drift observed by refresh and converged back by apply (rogue field soft-dropped) -> approved graph delete -> restart serves the survivor and 404s the tombstoned graph -> final plan empty. Catches composition regressions where each stage passes its own tests but the lifecycle breaks (the composite_flow.rs principle at the control-plane level). - local_cluster_serving_enforces_applied_policy_bindings: applied policy bundles gate serving per their bindings over HTTP with bearer-resolved actors — the cluster-bound bundle owns graph_list (admin 200, reader 403, anonymous 401), the graph-bound bundle owns invoke_query (reader gets rows; denied invocation is the documented anti-probing 404). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 18:07:29 +03:00
aaltshuler	711865e6f1	docs(cluster,server): the Phase 5 mode switch; retire applied-not-serving caveats The standing caveat ('applied means recorded in the cluster catalog — nothing more; the server still boots from omnigraph.yaml') retires: cluster docs gain the 'Serving from the cluster' section (exclusivity, applied- revision serving, fail-fast readiness, restart-to-pick-up, expose-all bridge), server.md gains mode-inference rule 0 and the cluster-booted multi mode, deployment.md the boot-source choice, and the CLI's apply note plus the cli-reference cluster row (stale back to Stage 3A) now describe the full convergence surface. RFC-005 flips to Landed with four implementation deviations recorded. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:56:54 +03:00
aaltshuler	f3eb60fa4e	test(cli): applied-means-serving system e2e The Phase-5 contract end to end with real binaries: cluster import + apply via the CLI, seed a row through the graph plane, boot omnigraph-server with --cluster (no omnigraph.yaml anywhere), and the applied stored query serves the row over HTTP through the multi-graph routes. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:51:40 +03:00
aaltshuler	948a54daa7	feat(server): boot from cluster state via --cluster RFC-005 §D1/§D2: omnigraph-server --cluster <dir> is rule 0 of the mode inference — an exclusive boot source (hard error when combined with a graph URI, --target, or --config) that never opens omnigraph.yaml, not even the implicit current-directory search. The cluster branch reads the applied revision through omnigraph-cluster's serving-snapshot API and feeds the EXISTING multi-graph pipeline: GraphStartupConfig per recorded graph at its derived root, stored queries built via QueryRegistry::from_specs from verified blob content (expose-all — the §D5 bridge until Phase 6 policy-owned exposure), cluster-bound policy bundles as the server-level Cedar engine and graph-bound bundles per graph, straight from the content-addressed blob paths. Multiple bundles binding one scope refuse boot (one-bundle-per-scope is the serving pipeline's shape; stacking is a later slice). Everything downstream — parallel opens, query type-checking, registry, routing, auth, OpenAPI — is reused unchanged; cluster mode is a new source, not a new pipeline. First server->cluster crate dependency: read-only types + one fn; omnigraph-cluster stays HTTP-free. open_multi_graph_state goes pub for integration tests. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:48:10 +03:00
aaltshuler	f5b43164b8	feat(cluster): pub read-only serving-snapshot API RFC-005 §D2/§D4: read_serving_snapshot reads the applied revision as everything a server needs to boot — graphs at derived roots, stored-query sources read from the content-addressed catalog and re-hashed against the recorded digests, policy blob paths with their applied applies_to bindings. All-or-nothing: missing state, pending recovery sidecars, missing/tampered blobs, pre-5A entries without bindings, and an empty graph set each refuse the snapshot with a remedy; no partial serving. Lock-free by design — the state file is replaced atomically, so the read is a consistent point-in-time ledger. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 17:39:26 +03:00
aaltshuler	0b84b1adc3	feat(cluster): record policy applies_to bindings in the applied revision Slice 5A of RFC-005: the state ledger becomes serving-sufficient for the Phase-5 server boot. StateResource gains an optional applies_to (normalized typed refs: cluster | graph.<id>), written by apply for every applied policy create/update from the desired config's validated bindings. The hole this closes: applies_to is not part of the policy file digest, so a binding-only edit previously produced NO plan change at all (a 4C e2e even asserted that: the gap, not a contract). Binding changes are now first-class: a post-diff pass emits an Update with equal before/after digests and a binding_change marker (visible in plan/apply JSON and human output as [bindings]), classification/execution treat it as an ordinary catalog-tier applied change (payload skips naturally: the blob is unchanged), and convergence requires zero binding divergence, so stale bindings can never report converged. Pre-5A ledger entries (no bindings recorded) surface as the same backfill Update; one apply heals them, exactly the remedy RFC-005's boot-error path names. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 15:30:33 +03:00
aaltshuler	87691fe9c7	test(cluster): failpoint coverage for delete crash windows - Crash before the removal: root intact, approval file unconsumed, sidecar survives, no ack; the next run retires the stale intent (row 8) and the still-approved delete completes in the same run. - Crash after the removal, before the state CAS: root gone, ledger byte-identical, the sidecar carries the approval id; the next run's sweep rolls the tombstone forward, consumes the approval, audits the recovery, and converges (row 7b). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:34:54 +03:00
aaltshuler	d1d04217ab	feat(cluster): execute approved graph deletes in cluster apply Stage 4C execution half (RFC-004 §D5/§D6 + sweep rows 7/7b/8): an approved graph.<id> delete — and its riding schema/query deletes — classifies Applied and executes LAST in the run, sidecar-fenced: pre-op manifest pin (best effort; partial roots still delete), approval_id carried in the sidecar, recursive root removal (NotFound tolerated), subtree tombstoned out of the ledger with a tombstone observation, the approval consumed in the same state CAS (ledger summary) and its artifact file rewritten with consumed_at only after the CAS lands — a failed run consumes nothing and the approval stays valid for the retry. Sweep rows: already-tombstoned intents retire (7); a completed delete with a stale ledger rolls forward — tombstone + approval consumption + audit entry (7b, idempotent); a still-present root retires the stale intent with a graph_delete_incomplete warning and the still-approved delete re-executes in the same run (8) — prefix removal is idempotent, so retry IS the repair. The multi-graph mixed e2e gets its conclusion: blocked without approval, cluster approve graph.engineering --as andrew, converge, tombstone visible in status. Phase 4's disposition matrix is now fully executable. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:34:02 +03:00
aaltshuler	f4e9105272	feat(cluster): cluster approve — digest-bound approval artifacts RFC-004 §D4, gate half: graph deletes (and their subtree) now classify Blocked/approval_required instead of Deferred; the new cluster approve command (requires the global --as actor) writes __cluster/approvals/{ulid}.json bound to the desired config digest and the change's before/after digests, so config or state drift invalidates the artifact automatically (approval_stale warning, never authorizes). One gate per subtree: compute_approvals lists only the graph-level delete, and ApprovalRequirement gains a satisfied flag surfaced by plan. Consumption and the delete executor land next — until then approved deletes stay blocked so a gate-only build can never strip state without removing the root. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 14:30:05 +03:00
aaltshuler	80cae4e8e1	test(cluster): failpoint coverage for schema-apply crash windows - Crash before the engine call: sidecar (carrying the --as actor) survives, live schema and ledger untouched, no ack; the next run's sweep retires the stale intent and the same run applies and converges. - Crash after the engine call, before the state CAS: the manifest moved with the post-op pin in the sidecar, state.json byte-identical; the next run's sweep rolls the ledger forward with a schema_apply audit entry and the run converges. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:13:15 +03:00
aaltshuler	a1ba4dc413	feat(cluster): execute schema applies in cluster apply Stage 4B (RFC-004 §D1/§D5): schema.<id> Update changes classify Applied and execute after graph creates, sequentially and sidecar-fenced — read-write open (the engine's own recovery runs first), pre-op manifest pin recorded, apply_schema_as with allow_data_loss: false (soft drops only; hard drops wait for 4C's approval artifacts), post-op pin rewritten into the sidecar, sidecar retired only after the final state CAS. Queries gated on a same-plan schema update unblock (the migration lands first in the same run); failures — unsupported migrations, lock contention, user branches — surface as schema_apply_failed with the engine's message, demote dependents via the origin-aware demotion helper, and stop further graph-moving work. Schema evolution is now fully cluster-driven (the defer -> manual schema apply -> refresh loop is gone), and out-of-band schema drift is converged back by apply as an ordinary soft migration (axiom 8: drift correction is gated like any change; the recoverable tier needs no approval) — both pinned by reworked e2es. The multi-graph mixed e2e's deferred row is now delete-shaped, pre-staging the 4C surface. Actor: cluster apply accepts the CLI's global --as via the new ApplyOptions / apply_config_dir_with_options (apply_config_dir delegates unchanged); the actor is echoed in ApplyOutput and recorded in sidecars and audit entries, and threads to apply_schema_as so Cedar fires wherever a checker is installed. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:12:15 +03:00
aaltshuler	0571c05ebb	feat(cluster): schema-apply recovery sidecar kind and sweep RecoverySidecarKind::SchemaApply with digest-based sweep classification (robust to unrelated manifest movement; version pins stay forensic): ledger-consistent -> sidecar retired (RFC-004 rows 1+2); live digest matches the intended schema, state stale -> roll forward with composite recompute and a recovery_records audit entry (row 3); unverifiable or unexpected digests -> pending, kept, graph-moving work blocked (rows 1-unopenable/6). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:05:42 +03:00
aaltshuler	ca63a9340b	feat(cluster): embed schema migration previews in cluster plan RFC-004 §D7's data-aware preview: for every schema update, plan opens the live graph read-only and embeds the engine's migration plan (supported flag + typed steps) in the change record; the human renderer prints the steps. Preview failures (unreachable graph, planner error) degrade to the digest diff with a schema_preview_unavailable warning — planning never blocks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:04:19 +03:00
aaltshuler	b313075476	refactor(cluster): make plan_config_dir async Mechanical conversion ahead of Stage 4B (plan will preview schema migrations against live graphs): signature, CLI dispatch, and test callers. Zero behavior change. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 13:02:12 +03:00
aaltshuler	83d77bcb16	test(cluster): failpoint coverage for graph-create crash windows - Crash before the init (row 1): sidecar survives, nothing moved, no ack; the next run's sweep removes the intent and the same run creates and converges. - Crash after the init, before the state CAS (row 4): the graph exists with the post-init manifest pin in the sidecar, state.json byte-identical; the next run's sweep rolls the ledger forward with a recovery_records audit entry and the run converges. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:59:48 +03:00
aaltshuler	c3007369cd	feat(cluster): execute graph creates in cluster apply Stage 4A (RFC-004 §D1/§D5): graph.<id> Create — and its paired schema Create, which the init carries — classify Applied and execute first in the run, sequentially and sidecar-fenced: sidecar written before Omnigraph::init at the derived root, rewritten with the post-init manifest pin, deleted only after the final state CAS lands. Dependent queries and policies no longer block on a graph create in the same plan — creates run first, so they apply in the same run; a create failure demotes them to blocked (dependency_not_applied) and stops further graph-moving work (loud partials), with the sidecar left for the sweep to classify. Graphs with a kept recovery sidecar (rows 5/6) classify Blocked/cluster_recovery_pending, and the sweep's Drifted/Error statuses are never clobbered by a generic Blocked. Schema source is re-read and digest-verified under the lock before the init (the write_resource_payload TOCTOU posture). Plan previews the same dispositions. e2e fallout updated: a fresh multi-graph config now converges in one apply; a destroyed root is re-created as an EMPTY graph by the next apply (declarative convergence — visible in plan, called out in docs); the new cluster_e2e_declared_graph_created_by_apply pins the no-manual-init flow. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:58:56 +03:00
aaltshuler	bf8cc7a753	feat(cluster): graph-create recovery sidecars and sweep RFC-004 §D2/§D3 for the graph_create kind. RecoverySidecar records intent under __cluster/recoveries/{ulid}.json; the roll-forward-only sweep runs at the start of apply/refresh/import under the state lock and classifies each survivor by observation: root absent -> intent removed (row 1); outcome already recorded -> retired (row 2); create completed but state stale -> ledger rolled forward with a recovery_records audit entry (row 4); partial root -> Error/graph_create_incomplete, kept, never auto-deleted (row 5); unexpected schema -> Drifted/actual_applied_state_pending, kept (row 6). Sweep mutations ride the command's existing CAS write; completed sidecars are deleted only after that write lands. Read-only status/plan warn (cluster_recovery_pending) without acting. The apply payload gate now counts only payload-phase errors so kept-sidecar diagnostics don't abort the run before their statuses persist. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:50:42 +03:00
aaltshuler	6fbf09d5c9	refactor(cluster): make apply_config_dir async Mechanical conversion ahead of Stage 4A graph create (which calls the async Omnigraph::init from inside apply): the fn signature, the CLI dispatch arm, and every test caller (#[test] -> #[tokio::test]). Zero behavior change; all 60 lib tests and 3 failpoint tests green before and after. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 04:43:38 +03:00
aaltshuler	16759b28b9	fix(cluster): RAII-guard the callback failpoint ScopedFailPoint::with_callback gives cfg_callback the same Drop-based cleanup as cfg actions; a panic while the point is active no longer leaks the callback into the process-global registry where it would fire under later tests (greptile review, PR #167). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 02:36:24 +03:00
aaltshuler	211b37e6de	test(cluster): failpoint tests for crash-mid-apply and state CAS race The apply-side coverage the implementation spec's hard gate requires before Phase 4 graph-moving apply: - crash after the payload phase: state.json byte-identical, blobs inert on disk, lock released, no phantom statuses, nothing acknowledged; a plain re-run repairs via skip-if-exists blob reuse. - CAS race: a cfg_callback rewrites state.json at the exact read->write window (the state.lock:false concurrent-writer scenario); apply surfaces state_cas_mismatch, acknowledges nothing, reports the persisted status snapshot, leaves the concurrent writer's state on disk; a re-run converges. CI's failpoints step now runs both the engine and cluster suites. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-06-10 02:14:06 +03:00

263 commits