The ECS day-2 apply gains its required --config flag (the image ships no
omnigraph.yaml, so the CLI cannot locate the cluster dir without it), and
the docker-exec example uses the <you> placeholder convention instead of a
real-looking actor name.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- resolve_cluster_actor uses load_config directly: load_cli_config also
loads auth.env_file into the process env — a second thing, violating the
documented 'exactly one thing' omnigraph.yaml contract for cluster ops.
- resolve_cli_actor gets its doc comment back (the inserted helper had
absorbed the contiguous /// block).
- The actor-default test imports once as setup and asserts on apply alone,
idempotently, instead of re-importing inside the assertion helper.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The container contract (OMNIGRAPH_CLUSTER + mounted volume + token env),
ECS/Fargate+EFS and Railway-volume walkthroughs, the in-container day-2
loop, and the honest constraints list (volume mandatory, no hot reload,
single-writer apply, shared-volume replicas unvalidated). Operator guide
links the recipes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
OMNIGRAPH_CLUSTER boots the container from a mounted cluster directory's
applied revision — checked first and exclusive (exit 64 when combined with
OMNIGRAPH_TARGET_URI/CONFIG/TARGET), the entrypoint-level mirror of the
server's mode-inference rule 0. The omnigraph CLI joins the image so the
day-2 loop (cluster apply/approve/status, data loads by explicit URI) runs
in-container via docker/ECS exec or railway shell — no omnigraph.yaml
required, which the cluster-local-config PR pins. entrypoint_test gains the
cluster case plus all three exclusivity refusals.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
local_cli_s3_end_to_end_init_load_read_flow ran `omnigraph init` without a
current_dir, so init's project scaffold landed in crates/omnigraph-cli/ —
poisoning any later test that resolves a graph target from the cwd config
(query_lint_requires_schema_or_resolvable_graph_target fails determinis-
tically once the file exists). Only manifests when OMNIGRAPH_S3_TEST_BUCKET
is set, which is why local FS runs and CI's scoped rustfs job never caught
it. The init and load calls now run inside the test's tempdir.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The 'Relationship to omnigraph.yaml' section becomes the exact rule set:
cluster commands read the per-operator config for exactly one thing (the
cli.actor default when --as is omitted), a --cluster server reads it for
nothing, and pointing data-plane targets at derived roots is ergonomics,
not coupling. Operator guide and CLI reference updated to match.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A --cluster server process whose cwd contains a MALFORMED omnigraph.yaml
boots and serves — proving mode-inference rule 0 returns before any config
search can run. New spawn_server_with_cluster_in support helper sets the
spawned server's cwd explicitly.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Cluster FACTS stay unlayered (cluster.yaml only), but the operator's
identity is a per-operator fact — exactly the per-operator omnigraph.yaml's
permanent job, and the cascade every data-plane write already uses. cluster
apply/approve now resolve: --as flag wins and skips any config read
entirely (containers and CI stay config-free); without it, the standard cwd
search supplies cli.actor, with a malformed config failing loudly and
actionably ('pass --as to skip this lookup') rather than silently dropping
attribution. approve's no-actor error now names both sources.
Tests pin the contract from both sides: cli.actor is the no-flag default
for apply (echoed actor) and approve (approved_by), the flag overrides it,
a malformed omnigraph.yaml in cwd breaks nothing except the no-flag actor
lookup, and a conflicting well-formed one leaks nothing into cluster
outputs.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
New docs/user/cluster.md — the practical companion to cluster-config.md's
reference: zero-to-served walkthrough (validate/import/plan/apply, derived
roots, data loading, --cluster serving), the day-2 edit->plan->apply->restart
loop with a per-change-kind table, drift observation and convergence, the
approval gate for destructive changes, crash/lock/lost-ledger recovery, the
boot-refusal table with remedies, deployment patterns (replicas, backup
unit, CI gating), and the explicit not-yet list (hot reload, S3-hosted
cluster dirs, per-query exposure, pipelines). Linked from the user index,
the agent guide's topic map, and cross-linked from the reference.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- The drift-heal verification now asserts `schema show` succeeded and
produced a schema before checking the rogue field's absence (a failed
command previously made the negative assertion vacuously pass).
- cluster_cli documents why it deliberately does not assert exit codes
(blocked applies exit non-zero by contract while emitting the structured
output callers assert on).
- The comprehensive lifecycle e2es honor OMNIGRAPH_SKIP_SYSTEM_E2E=1
(graceful skip-with-message, the S3-gate pattern) for constrained
sandboxes; requirements + suppression documented in testing.md.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
beta.4+ refuses the rustfsadmin/rustfsadmin test credentials unless
RUSTFS_ALLOW_INSECURE_DEFAULT_CREDENTIALS=true is set — acceptable for the
ephemeral CI container and the local bootstrap script (which already passed
it). The three S3 suites were validated against the beta.8 binary locally
before this bump. The pin stays explicit, never `latest`, so future
upgrades remain deliberate.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Two system tests composing the whole Phase 1-5 surface with real binaries:
- local_cluster_full_lifecycle_declare_serve_evolve_delete: declare two
graphs -> one apply creates and converges them -> the --cluster server
serves both stored queries -> schema+query evolve in one apply (migration
previewed in plan) -> restart serves the new shape -> out-of-band schema
drift observed by refresh and converged back by apply (rogue field
soft-dropped) -> approved graph delete -> restart serves the survivor and
404s the tombstoned graph -> final plan empty. Catches composition
regressions where each stage passes its own tests but the lifecycle
breaks (the composite_flow.rs principle at the control-plane level).
- local_cluster_serving_enforces_applied_policy_bindings: applied policy
bundles gate serving per their bindings over HTTP with bearer-resolved
actors — the cluster-bound bundle owns graph_list (admin 200, reader 403,
anonymous 401), the graph-bound bundle owns invoke_query (reader gets
rows; denied invocation is the documented anti-probing 404).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The standing caveat ('applied means recorded in the cluster catalog —
nothing more; the server still boots from omnigraph.yaml') retires: cluster
docs gain the 'Serving from the cluster' section (exclusivity, applied-
revision serving, fail-fast readiness, restart-to-pick-up, expose-all
bridge), server.md gains mode-inference rule 0 and the cluster-booted multi
mode, deployment.md the boot-source choice, and the CLI's apply note plus
the cli-reference cluster row (stale back to Stage 3A) now describe the full
convergence surface. RFC-005 flips to Landed with four implementation
deviations recorded.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The Phase-5 contract end to end with real binaries: cluster import + apply
via the CLI, seed a row through the graph plane, boot omnigraph-server with
--cluster (no omnigraph.yaml anywhere), and the applied stored query serves
the row over HTTP through the multi-graph routes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RFC-005 §D1/§D2: omnigraph-server --cluster <dir> is rule 0 of the mode
inference — an exclusive boot source (hard error when combined with a graph
URI, --target, or --config) that never opens omnigraph.yaml, not even the
implicit current-directory search. The cluster branch reads the applied
revision through omnigraph-cluster's serving-snapshot API and feeds the
EXISTING multi-graph pipeline: GraphStartupConfig per recorded graph at its
derived root, stored queries built via QueryRegistry::from_specs from
verified blob content (expose-all — the §D5 bridge until Phase 6
policy-owned exposure), cluster-bound policy bundles as the server-level
Cedar engine and graph-bound bundles per graph, straight from the
content-addressed blob paths. Multiple bundles binding one scope refuse boot
(one-bundle-per-scope is the serving pipeline's shape; stacking is a later
slice). Everything downstream — parallel opens, query type-checking,
registry, routing, auth, OpenAPI — is reused unchanged; cluster mode is a
new source, not a new pipeline.
First server->cluster crate dependency: read-only types + one fn;
omnigraph-cluster stays HTTP-free. open_multi_graph_state goes pub for
integration tests.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RFC-005 §D2/§D4: read_serving_snapshot reads the applied revision as
everything a server needs to boot — graphs at derived roots, stored-query
sources read from the content-addressed catalog and re-hashed against the
recorded digests, policy blob paths with their applied applies_to bindings.
All-or-nothing: missing state, pending recovery sidecars, missing/tampered
blobs, pre-5A entries without bindings, and an empty graph set each refuse
the snapshot with a remedy; no partial serving. Lock-free by design — the
state file is replaced atomically, so the read is a consistent
point-in-time ledger.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Slice 5A of RFC-005: the state ledger becomes serving-sufficient for the
Phase-5 server boot. StateResource gains an optional applies_to (normalized
typed refs: cluster | graph.<id>), written by apply for every applied policy
create/update from the desired config's validated bindings.
The hole this closes: applies_to is not part of the policy file digest, so a
binding-only edit previously produced NO plan change at all (a 4C e2e even
asserted that — the gap, not a contract). Binding changes are now
first-class: a post-diff pass emits an Update with equal before/after
digests and a binding_change marker (visible in plan/apply JSON and human
output as [bindings]), classification/execution treat it as an ordinary
catalog-tier applied change (payload skips naturally — the blob is
unchanged), and convergence requires zero binding divergence, so stale
bindings can never report converged. Pre-5A ledger entries (no bindings
recorded) surface as the same backfill Update; one apply heals them, exactly
the remedy RFC-005's boot-error path names.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The axiom-15 mode switch: omnigraph-server --cluster <dir> (mutually
exclusive with uri/--target/--config, zero omnigraph.yaml reads) serves the
APPLIED revision — graph set from state, query/policy content from the
content-addressed catalog at applied digests, cluster-scoped policy bundles
as the server-level Cedar engine. The load-bearing finding: state is not yet
serving-sufficient (policy applies_to bindings live only in cluster.yaml), so
slice 5A records binding metadata into the applied revision at apply time —
without it, boot-from-state silently becomes the merged read axiom 15
forbids. Fail-fast readiness table (missing state, pending sidecars, missing
blobs, unbound policies all refuse boot with remedies), the expose-all
mcp.expose bridge with its Phase 6 sunset, the operator migration path (exit
criterion 7), and 5A/5B/5C sequencing. The existing boot pipeline
(GraphStartupConfig -> registry -> routing/auth) is reused as-is — a new
source, not a new pipeline.
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Pipelines (scheduler, connectors, mapping, idempotency, run ledger) leave the
cluster control-plane rollout and become their own project with their own
RFC. This rollout guarantees only the socket, all of which already exists and
is enforced: the pipelines: config field is reserved (typed
future_phase_field rejection, test-covered), the pipeline.<name> typed
address and Pipeline resource kind are reserved in the resource model, and
axiom 13 fixes the contract any future implementation must satisfy
(definition reconciled, execution data-plane, fan-out statusful). The ETL
section in the high-level spec stands as the requirements record for that
project; exit criterion 9 defers to its RFC.
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Approvals + gated graph deletion in the user docs, the approve command in the
CLI reference, RFC-004 flipped to Landed with its three implementation
deviations recorded (row-8 retire-and-repropose, --as instead of --actor/--by,
consumed artifacts rewritten in place rather than moved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Crash before the removal: root intact, approval file unconsumed, sidecar
survives, no ack; the next run retires the stale intent (row 8) and the
still-approved delete completes in the same run.
- Crash after the removal, before the state CAS: root gone, ledger
byte-identical, the sidecar carries the approval id; the next run's sweep
rolls the tombstone forward, consumes the approval, audits the recovery,
and converges (row 7b).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Stage 4C execution half (RFC-004 §D5/§D6 + sweep rows 7/7b/8): an approved
graph.<id> delete — and its riding schema/query deletes — classifies Applied
and executes LAST in the run, sidecar-fenced: pre-op manifest pin (best
effort; partial roots still delete), approval_id carried in the sidecar,
recursive root removal (NotFound tolerated), subtree tombstoned out of the
ledger with a tombstone observation, the approval consumed in the same state
CAS (ledger summary) and its artifact file rewritten with consumed_at only
after the CAS lands — a failed run consumes nothing and the approval stays
valid for the retry.
Sweep rows: already-tombstoned intents retire (7); a completed delete with a
stale ledger rolls forward — tombstone + approval consumption + audit entry
(7b, idempotent); a still-present root retires the stale intent with a
graph_delete_incomplete warning and the still-approved delete re-executes in
the same run (8) — prefix removal is idempotent, so retry IS the repair.
The multi-graph mixed e2e gets its conclusion: blocked without approval,
cluster approve graph.engineering --as andrew, converge, tombstone visible
in status. Phase 4's disposition matrix is now fully executable.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RFC-004 §D4, gate half: graph deletes (and their subtree) now classify
Blocked/approval_required instead of Deferred; the new cluster approve
command (requires the global --as actor) writes
__cluster/approvals/{ulid}.json bound to the desired config digest and the
change's before/after digests, so config or state drift invalidates the
artifact automatically (approval_stale warning, never authorizes). One gate
per subtree: compute_approvals lists only the graph-level delete, and
ApprovalRequirement gains a satisfied flag surfaced by plan. Consumption and
the delete executor land next — until then approved deletes stay blocked so
a gate-only build can never strip state without removing the root.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Crash before the engine call: sidecar (carrying the --as actor) survives,
live schema and ledger untouched, no ack; the next run's sweep retires the
stale intent and the same run applies and converges.
- Crash after the engine call, before the state CAS: the manifest moved with
the post-op pin in the sidecar, state.json byte-identical; the next run's
sweep rolls the ledger forward with a schema_apply audit entry and the run
converges.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Stage 4B (RFC-004 §D1/§D5): schema.<id> Update changes classify Applied and
execute after graph creates, sequentially and sidecar-fenced — read-write
open (the engine's own recovery runs first), pre-op manifest pin recorded,
apply_schema_as with allow_data_loss: false (soft drops only; hard drops wait
for 4C's approval artifacts), post-op pin rewritten into the sidecar, sidecar
retired only after the final state CAS. Queries gated on a same-plan schema
update unblock (the migration lands first in the same run); failures —
unsupported migrations, lock contention, user branches — surface as
schema_apply_failed with the engine's message, demote dependents via the
origin-aware demotion helper, and stop further graph-moving work.
Schema evolution is now fully cluster-driven (the defer -> manual schema
apply -> refresh loop is gone), and out-of-band schema drift is converged
back by apply as an ordinary soft migration (axiom 8: drift correction is
gated like any change; the recoverable tier needs no approval) — both pinned
by reworked e2es. The multi-graph mixed e2e's deferred row is now
delete-shaped, pre-staging the 4C surface.
Actor: cluster apply accepts the CLI's global --as via the new ApplyOptions /
apply_config_dir_with_options (apply_config_dir delegates unchanged); the
actor is echoed in ApplyOutput and recorded in sidecars and audit entries,
and threads to apply_schema_as so Cedar fires wherever a checker is
installed.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RecoverySidecarKind::SchemaApply with digest-based sweep classification
(robust to unrelated manifest movement; version pins stay forensic):
ledger-consistent -> sidecar retired (RFC-004 rows 1+2); live digest matches
the intended schema, state stale -> roll forward with composite recompute and
a recovery_records audit entry (row 3); unverifiable or unexpected digests ->
pending, kept, graph-moving work blocked (rows 1-unopenable/6).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RFC-004 §D7's data-aware preview: for every schema update, plan opens the
live graph read-only and embeds the engine's migration plan (supported flag
+ typed steps) in the change record; the human renderer prints the steps.
Preview failures (unreachable graph, planner error) degrade to the digest
diff with a schema_preview_unavailable warning — planning never blocks.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mechanical conversion ahead of Stage 4B (plan will preview schema migrations
against live graphs): signature, CLI dispatch, and test callers. Zero
behavior change.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Crash before the init (row 1): sidecar survives, nothing moved, no ack;
the next run's sweep removes the intent and the same run creates and
converges.
- Crash after the init, before the state CAS (row 4): the graph exists with
the post-init manifest pin in the sidecar, state.json byte-identical; the
next run's sweep rolls the ledger forward with a recovery_records audit
entry and the run converges.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Stage 4A (RFC-004 §D1/§D5): graph.<id> Create — and its paired schema Create,
which the init carries — classify Applied and execute first in the run,
sequentially and sidecar-fenced: sidecar written before Omnigraph::init at
the derived root, rewritten with the post-init manifest pin, deleted only
after the final state CAS lands. Dependent queries and policies no longer
block on a graph create in the same plan — creates run first, so they apply
in the same run; a create failure demotes them to blocked
(dependency_not_applied) and stops further graph-moving work (loud partials),
with the sidecar left for the sweep to classify. Graphs with a kept recovery
sidecar (rows 5/6) classify Blocked/cluster_recovery_pending, and the sweep's
Drifted/Error statuses are never clobbered by a generic Blocked.
Schema source is re-read and digest-verified under the lock before the init
(the write_resource_payload TOCTOU posture). Plan previews the same
dispositions. e2e fallout updated: a fresh multi-graph config now converges
in one apply; a destroyed root is re-created as an EMPTY graph by the next
apply (declarative convergence — visible in plan, called out in docs); the
new cluster_e2e_declared_graph_created_by_apply pins the no-manual-init flow.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
RFC-004 §D2/§D3 for the graph_create kind. RecoverySidecar records intent
under __cluster/recoveries/{ulid}.json; the roll-forward-only sweep runs at
the start of apply/refresh/import under the state lock and classifies each
survivor by observation: root absent -> intent removed (row 1); outcome
already recorded -> retired (row 2); create completed but state stale ->
ledger rolled forward with a recovery_records audit entry (row 4); partial
root -> Error/graph_create_incomplete, kept, never auto-deleted (row 5);
unexpected schema -> Drifted/actual_applied_state_pending, kept (row 6).
Sweep mutations ride the command's existing CAS write; completed sidecars
are deleted only after that write lands. Read-only status/plan warn
(cluster_recovery_pending) without acting. The apply payload gate now counts
only payload-phase errors so kept-sidecar diagnostics don't abort the run
before their statuses persist.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Mechanical conversion ahead of Stage 4A graph create (which calls the async
Omnigraph::init from inside apply): the fn signature, the CLI dispatch arm,
and every test caller (#[test] -> #[tokio::test]). Zero behavior change; all
60 lib tests and 3 failpoint tests green before and after.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Engineering and docs roles both resolve to @aaltshuler; every path
(catch-all, crates/**, docs/**, repo-level docs) now requires their review.
CODEOWNERS and the doc tables regenerated from codeowners-roles.yml via
render-codeowners.py.
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* docs(cluster): RFC-004 — graph & schema apply design (Phase 4)
The design the implementation spec's exit criteria require before
graph-moving cluster apply ships. Core positions:
- Cluster recovery is roll-forward-only: the engine's own sidecars make every
graph-level operation atomic within the graph, so the cluster never rolls a
graph back — its sidecars (__cluster/recoveries/{ulid}.json) classify and
record, converging the ledger to observable reality (axiom 5) or surfacing
a loud pending-repair condition. Eight-row decision matrix, every row
testable with the Stage 3B failpoint harness.
- Irreversible operations (graph delete, allow_data_loss schema apply)
consume digest-bound approval artifacts written by a new cluster approve
command and retired into state.approval_records (axiom 11). A stale
approval can never authorize a different change.
- cluster apply gains an actor, threaded to apply_schema_as so engine Cedar
enforcement and commit attribution work unchanged; the cluster adds no
policy engine of its own.
- Deterministic ordering (creates -> schema applies -> catalog -> deletes),
per-resource apply groups, cross-graph atomicity explicitly not promised.
- Staged 4A graph create / 4B schema apply / 4C graph delete, each gated on
per-matrix-row failpoint tests.
Answers exit criteria 2 and 4 fully, 1/5/6 partially; 3/7/8/9 deferred to
their phases (coverage table in the RFC). Linked from the dev index and the
implementation spec's Phase 4 section.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
* docs(cluster): RFC-004 review fixes — graph_delete sweep rows, state_cas_base contract
Two greptile findings: (1) D3 row 2 could not be evaluated for graph_delete
(no manifest to version-check after prefix removal) and 'root absent, state
already tombstoned' fell into the stale row — split into rows 7 (delete's
analog of row 2) and 7b (the roll-forward), with expected_manifest_version
documented as always null for the delete kind. (2) state_cas_base is now
explicitly audit/diagnostics-only — the sweep never consults it; independent
state mutations are handled by the ordinary CAS like any concurrent write.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
ScopedFailPoint::with_callback gives cfg_callback the same Drop-based cleanup
as cfg actions; a panic while the point is active no longer leaks the callback
into the process-global registry where it would fire under later tests
(greptile review, PR #167).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The failpoints feature added fail = { workspace = true, optional = true } to
the crate manifest; the lockfile edge belongs with it (--locked CI gate).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>