Commit graph

253 commits

Author SHA1 Message Date
aaltshuler
db6fe03be1 refactor(cluster): move type definitions to types.rs
Verbatim move of the public output/diagnostic types and the internal
state/sidecar/approval models; previously-private types and their fields
get pub(crate) (they were crate-visible by position before). lib.rs is now
the command pipeline + public API. 95 tests green; full workspace gate
green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:42:02 +03:00
aaltshuler
dc0a1fc5a5 refactor(cluster): move declared-config loading to config.rs
Verbatim move of cluster.yaml parsing, query discovery, source digesting,
header/id validation, path resolution, and live-graph observation. Two
helpers that the cut swept along were relocated to their right homes
(state-status helpers back to lib.rs, lock-file helpers to store.rs). 95
tests green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:37:20 +03:00
aaltshuler
dd17c0c50f refactor(cluster): move diffing and classification to diff.rs
Verbatim move of diff_resources, binding-change diffing, blast radius,
approval gating, ResourceKind, classify_changes, and demotion. 95 tests
green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:33:13 +03:00
aaltshuler
9c3e09e838 refactor(cluster): move the recovery sweep to sweep.rs
Verbatim move of the sidecar classification (all RFC-004 D3 rows),
tombstoning, and approval-consumption helpers. 95 tests green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:30:55 +03:00
aaltshuler
00fc5cf537 refactor(cluster): move the serving snapshot to serve.rs
Verbatim move of the Serving* types, read_serving_snapshot, and
read_verified_payload; public re-exports preserved (the server's imports
are unchanged). 95 tests green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:29:44 +03:00
aaltshuler
5a8047e5d0 refactor(cluster): move the storage backend to store.rs
Verbatim move of LocalStateBackend, StateSnapshot, StateLockGuard and their
impls — the single home for stored-state I/O (state ledger, lock, recovery
sidecars, approval artifacts), where the RFC-006 object-storage port lands
next as a focused diff. Visibility bumps (pub(crate)) only; 95 tests green
before and after.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:28:04 +03:00
aaltshuler
fbb86dee0e refactor(cluster): move the in-source test suite to tests.rs
Verbatim move (indentation preserved — embedded raw-string fixtures are
content). lib.rs drops from 7,857 to ~4,750 lines; `use super::*` resolves
to the crate root through the #[path] module declaration unchanged. 95
tests green before and after.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:25:53 +03:00
aaltshuler
d702fd106a feat(policy): from-source twins for the policy loaders
PolicyConfig::from_source + PolicyEngine::load_graph_from_source /
load_server_from_source — the path-based loaders delegate to them. Needed by
callers whose policy bundles don't live on the local filesystem (the cluster
catalog on object storage); kind-alignment validation stays loud through the
new path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:09:45 +03:00
aaltshuler
f48e69b999 feat(storage): versioned CAS, conditional replace, and prefix delete on StorageAdapter
Three primitives the cluster's object-storage port (RFC-006) needs, on the
engine's existing adapter rather than a parallel store:

- read_text_versioned: content + an opaque backend version token (S3: the
  ETag from GET; local: content sha256 — ETags don't exist on a filesystem).
- write_text_if_match: replace only when the token still matches. S3 maps to
  a conditional put (PutMode::Update / If-Match) — verified against RustFS
  beta.8 through the real object_store 0.12.5 path, no extra builder config
  needed; local compares content then swaps via temp+rename, the same
  single-machine semantics callers had before this trait (safe under their
  own lock protocol, not a cross-process barrier by itself). CAS-lost is
  Ok(None), never silent.
- delete_prefix: recursive + idempotent (local remove_dir_all; S3 list +
  delete, with the non-atomicity documented for crash-retry callers).

Gated S3 coverage: s3_adapter_conditional_writes_contract pins the
conditional-write behavior the cluster ledger will depend on (red if a
backend bump regresses it), and s3_schema_apply_migrates_live_graph closes
the previously-untested schema-apply-on-S3 path before the cluster's schema
executor leans on it. Engine gains the sha2 workspace dep.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 05:09:45 +03:00
aaltshuler
fa6af775c1 feat(cli)!: unified load command; deprecate ingest as an alias
omnigraph load is now the single data-write command:
- works against remote graphs (POSTs the server's /ingest endpoint with the
  same bearer/actor resolution as other remote commands) — previously load
  was the only data command forced to open Lance storage directly
- --from <base> opts into fork-if-missing for --branch (the former ingest
  semantics); without --from a missing branch is an error, never a fork
- --mode is now required: overwrite is destructive, so there is no implicit
  default (the old silent default was overwrite)
- output gains base_branch/branch_created (and table sums on remote loads)

omnigraph ingest stays as a deprecated alias (defaults preserved: --from
main --mode merge) that prints a one-line warning to stderr, matching the
read/change deprecation convention; removal in a later release.

Docs updated in the same change: cli.md, cli-reference.md, policy.md,
audit.md, execution.md (unified load section), AGENTS.md quick-flow,
README.md.

BREAKING CHANGE: scripts running omnigraph load without --mode must now
pass it explicitly (previously defaulted to the destructive overwrite).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 04:18:00 +03:00
aaltshuler
90676ef52f feat(server)!: POST /ingest forks only when 'from' is present
Branch creation becomes opt-in by presence of the request's 'from' field.
Previously the handler defaulted from to 'main' and always auto-created a
missing branch — a typo'd branch name silently forked main and landed the
data there, with the client none the wiser. Now a request without 'from'
against a missing branch returns 404 branch-not-found and creates nothing;
with 'from' set, fork-if-missing behaves as before. The BranchCreate
authority is only consulted when a fork will actually happen.

The handler calls the unified load_as directly (the deprecated ingest_as
shim is no longer used in the server). IngestOutput.base_branch becomes
nullable: it echoes the request's 'from' and is null when absent. OpenAPI
regenerated; the CLI's local ingest arm moves to load_file_as + the new
converter shape.

BREAKING CHANGE: clients that relied on implicit fork-from-main with 'from'
omitted must now pass from='main' explicitly. IngestOutput.base_branch is
now nullable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 04:05:29 +03:00
aaltshuler
c236a4c2df refactor(loader): load_jsonl helpers take &Omnigraph and document their role
The free helpers needlessly demanded &mut Omnigraph (every load API takes
&self) and read as leftovers. Rather than rewriting their ~200 call sites
across the test suites — which would have to re-derive the active-branch
resolution at each site — keep the one convenience and make it honest:
borrow immutably (&mut callers coerce, no churn) and document it as the
active-branch shorthand over Omnigraph::load.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 03:57:41 +03:00
aaltshuler
e676c151bb feat(engine): unify load/ingest — load_as gains an optional fork base
load_as/load_file_as gain a base: Option<&str> parameter: with Some(base) a
missing target branch is forked from base first (the former ingest
semantics); with None the target branch must exist — staging fails on an
unknown branch, so a typo'd name can never create one. LoadResult gains
branch/base_branch/branch_created metadata (additive).

The ingest family (ingest, ingest_as, ingest_file, ingest_file_as) becomes
#[deprecated] shims over load_as that preserve the historical contract
exactly (from: None still means fork from main; base recorded even when no
fork happened). IngestResult and to_ingest_tables stay for the shims and
the server until the removal release.

The layered policy check is unchanged: Change on the target branch always,
BranchCreate additionally when a fork actually happens (enforced inside
branch_create_from_as with the actor threaded through).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 03:53:22 +03:00
aaltshuler
4558454bc7 fix(cluster): address review — discovery reads each file exactly once
resolve_query_decls hands its file contents to the caller; the per-query
digest/typecheck pass reuses them instead of re-reading (a file with N
queries was read N+1 times), which also closes the window where a file
changing between enumeration and validation produced a confusing
query_key_mismatch for a just-discovered name. Explicit-map declarations
read as before.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 01:35:47 +03:00
aaltshuler
677320ceec feat(cluster): Terraform-shaped query declaration — discover from files
cluster.yaml's graphs.<id>.queries previously accepted only an explicit
name->file map, forcing configs to re-enumerate every `query <name>` that
the .gq files already declare (the SPIKE cookbook needed 66 entries for 6
files). The files ARE the declaration now: `queries: queries/` discovers
every declaration in a directory's top-level *.gq (sorted), a list form
takes explicit files, and the map stays for fine-grained control.
Discovery is loud — unreadable/unparseable files and duplicate query names
fail validation (query_parse_error, duplicate_query_name). Downstream is
untouched: each discovered query is still an individually addressed
resource with the containing file's digest.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-11 00:46:21 +03:00
aaltshuler
3b2bf755ae fix(cli): address review — honor the one-thing contract, restore docs, untangle test phases
- resolve_cluster_actor uses load_config directly: load_cli_config also
  loads auth.env_file into the process env — a second thing, violating the
  documented 'exactly one thing' omnigraph.yaml contract for cluster ops.
- resolve_cli_actor gets its doc comment back (the inserted helper had
  absorbed the contiguous /// block).
- The actor-default test imports once as setup and asserts on apply alone,
  idempotently, instead of re-importing inside the assertion helper.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:54:05 +03:00
aaltshuler
fbe9726ac7 test(cli): stop the S3 e2e scaffolding omnigraph.yaml into the crate dir
local_cli_s3_end_to_end_init_load_read_flow ran `omnigraph init` without a
current_dir, so init's project scaffold landed in crates/omnigraph-cli/ —
poisoning any later test that resolves a graph target from the cwd config
(query_lint_requires_schema_or_resolvable_graph_target fails determinis-
tically once the file exists). Only manifests when OMNIGRAPH_S3_TEST_BUCKET
is set, which is why local FS runs and CI's scoped rustfs job never caught
it. The init and load calls now run inside the test's tempdir.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:34:54 +03:00
aaltshuler
f7368b58a0 test(cli): pin --cluster boot isolation from cwd omnigraph.yaml
A --cluster server process whose cwd contains a MALFORMED omnigraph.yaml
boots and serves — proving mode-inference rule 0 returns before any config
search can run. New spawn_server_with_cluster_in support helper sets the
spawned server's cwd explicitly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:29:49 +03:00
aaltshuler
f3374ac6dc feat(cli): resolve cluster actor via the per-operator config cascade
Cluster FACTS stay unlayered (cluster.yaml only), but the operator's
identity is a per-operator fact — exactly the per-operator omnigraph.yaml's
permanent job, and the cascade every data-plane write already uses. cluster
apply/approve now resolve: --as flag wins and skips any config read
entirely (containers and CI stay config-free); without it, the standard cwd
search supplies cli.actor, with a malformed config failing loudly and
actionably ('pass --as to skip this lookup') rather than silently dropping
attribution. approve's no-actor error now names both sources.

Tests pin the contract from both sides: cli.actor is the no-flag default
for apply (echoed actor) and approve (approved_by), the flag overrides it,
a malformed omnigraph.yaml in cwd breaks nothing except the no-flag actor
lookup, and a conflicting well-formed one leaks nothing into cluster
outputs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 22:29:49 +03:00
aaltshuler
d8354ac213 test(cli): address review — assert schema-show success, document exit-code stance, add e2e opt-out
- The drift-heal verification now asserts `schema show` succeeded and
  produced a schema before checking the rogue field's absence (a failed
  command previously made the negative assertion vacuously pass).
- cluster_cli documents why it deliberately does not assert exit codes
  (blocked applies exit non-zero by contract while emitting the structured
  output callers assert on).
- The comprehensive lifecycle e2es honor OMNIGRAPH_SKIP_SYSTEM_E2E=1
  (graceful skip-with-message, the S3-gate pattern) for constrained
  sandboxes; requirements + suppression documented in testing.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 19:05:12 +03:00
aaltshuler
7d70811df1 test(cli): comprehensive full-cycle cluster e2e with a live server
Two system tests composing the whole Phase 1-5 surface with real binaries:

- local_cluster_full_lifecycle_declare_serve_evolve_delete: declare two
  graphs -> one apply creates and converges them -> the --cluster server
  serves both stored queries -> schema+query evolve in one apply (migration
  previewed in plan) -> restart serves the new shape -> out-of-band schema
  drift observed by refresh and converged back by apply (rogue field
  soft-dropped) -> approved graph delete -> restart serves the survivor and
  404s the tombstoned graph -> final plan empty. Catches composition
  regressions where each stage passes its own tests but the lifecycle
  breaks (the composite_flow.rs principle at the control-plane level).

- local_cluster_serving_enforces_applied_policy_bindings: applied policy
  bundles gate serving per their bindings over HTTP with bearer-resolved
  actors — the cluster-bound bundle owns graph_list (admin 200, reader 403,
  anonymous 401), the graph-bound bundle owns invoke_query (reader gets
  rows; denied invocation is the documented anti-probing 404).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 18:07:29 +03:00
aaltshuler
711865e6f1 docs(cluster,server): the Phase 5 mode switch; retire applied-not-serving caveats
The standing caveat ('applied means recorded in the cluster catalog —
nothing more; the server still boots from omnigraph.yaml') retires: cluster
docs gain the 'Serving from the cluster' section (exclusivity, applied-
revision serving, fail-fast readiness, restart-to-pick-up, expose-all
bridge), server.md gains mode-inference rule 0 and the cluster-booted multi
mode, deployment.md the boot-source choice, and the CLI's apply note plus
the cli-reference cluster row (stale back to Stage 3A) now describe the full
convergence surface. RFC-005 flips to Landed with four implementation
deviations recorded.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:56:54 +03:00
aaltshuler
f3eb60fa4e test(cli): applied-means-serving system e2e
The Phase-5 contract end to end with real binaries: cluster import + apply
via the CLI, seed a row through the graph plane, boot omnigraph-server with
--cluster (no omnigraph.yaml anywhere), and the applied stored query serves
the row over HTTP through the multi-graph routes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:51:40 +03:00
aaltshuler
948a54daa7 feat(server): boot from cluster state via --cluster
RFC-005 §D1/§D2: omnigraph-server --cluster <dir> is rule 0 of the mode
inference — an exclusive boot source (hard error when combined with a graph
URI, --target, or --config) that never opens omnigraph.yaml, not even the
implicit current-directory search. The cluster branch reads the applied
revision through omnigraph-cluster's serving-snapshot API and feeds the
EXISTING multi-graph pipeline: GraphStartupConfig per recorded graph at its
derived root, stored queries built via QueryRegistry::from_specs from
verified blob content (expose-all — the §D5 bridge until Phase 6
policy-owned exposure), cluster-bound policy bundles as the server-level
Cedar engine and graph-bound bundles per graph, straight from the
content-addressed blob paths. Multiple bundles binding one scope refuse boot
(one-bundle-per-scope is the serving pipeline's shape; stacking is a later
slice). Everything downstream — parallel opens, query type-checking,
registry, routing, auth, OpenAPI — is reused unchanged; cluster mode is a
new source, not a new pipeline.

First server->cluster crate dependency: read-only types + one fn;
omnigraph-cluster stays HTTP-free. open_multi_graph_state goes pub for
integration tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:48:10 +03:00
aaltshuler
f5b43164b8 feat(cluster): pub read-only serving-snapshot API
RFC-005 §D2/§D4: read_serving_snapshot reads the applied revision as
everything a server needs to boot — graphs at derived roots, stored-query
sources read from the content-addressed catalog and re-hashed against the
recorded digests, policy blob paths with their applied applies_to bindings.
All-or-nothing: missing state, pending recovery sidecars, missing/tampered
blobs, pre-5A entries without bindings, and an empty graph set each refuse
the snapshot with a remedy; no partial serving. Lock-free by design — the
state file is replaced atomically, so the read is a consistent
point-in-time ledger.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 17:39:26 +03:00
aaltshuler
0b84b1adc3 feat(cluster): record policy applies_to bindings in the applied revision
Slice 5A of RFC-005: the state ledger becomes serving-sufficient for the
Phase-5 server boot. StateResource gains an optional applies_to (normalized
typed refs: cluster | graph.<id>), written by apply for every applied policy
create/update from the desired config's validated bindings.

The hole this closes: applies_to is not part of the policy file digest, so a
binding-only edit previously produced NO plan change at all (a 4C e2e even
asserted that — the gap, not a contract). Binding changes are now
first-class: a post-diff pass emits an Update with equal before/after
digests and a binding_change marker (visible in plan/apply JSON and human
output as [bindings]), classification/execution treat it as an ordinary
catalog-tier applied change (payload skips naturally — the blob is
unchanged), and convergence requires zero binding divergence, so stale
bindings can never report converged. Pre-5A ledger entries (no bindings
recorded) surface as the same backfill Update; one apply heals them, exactly
the remedy RFC-005's boot-error path names.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 15:30:33 +03:00
aaltshuler
87691fe9c7 test(cluster): failpoint coverage for delete crash windows
- Crash before the removal: root intact, approval file unconsumed, sidecar
  survives, no ack; the next run retires the stale intent (row 8) and the
  still-approved delete completes in the same run.
- Crash after the removal, before the state CAS: root gone, ledger
  byte-identical, the sidecar carries the approval id; the next run's sweep
  rolls the tombstone forward, consumes the approval, audits the recovery,
  and converges (row 7b).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:34:54 +03:00
aaltshuler
d1d04217ab feat(cluster): execute approved graph deletes in cluster apply
Stage 4C execution half (RFC-004 §D5/§D6 + sweep rows 7/7b/8): an approved
graph.<id> delete — and its riding schema/query deletes — classifies Applied
and executes LAST in the run, sidecar-fenced: pre-op manifest pin (best
effort; partial roots still delete), approval_id carried in the sidecar,
recursive root removal (NotFound tolerated), subtree tombstoned out of the
ledger with a tombstone observation, the approval consumed in the same state
CAS (ledger summary) and its artifact file rewritten with consumed_at only
after the CAS lands — a failed run consumes nothing and the approval stays
valid for the retry.

Sweep rows: already-tombstoned intents retire (7); a completed delete with a
stale ledger rolls forward — tombstone + approval consumption + audit entry
(7b, idempotent); a still-present root retires the stale intent with a
graph_delete_incomplete warning and the still-approved delete re-executes in
the same run (8) — prefix removal is idempotent, so retry IS the repair.

The multi-graph mixed e2e gets its conclusion: blocked without approval,
cluster approve graph.engineering --as andrew, converge, tombstone visible
in status. Phase 4's disposition matrix is now fully executable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:34:02 +03:00
aaltshuler
f4e9105272 feat(cluster): cluster approve — digest-bound approval artifacts
RFC-004 §D4, gate half: graph deletes (and their subtree) now classify
Blocked/approval_required instead of Deferred; the new cluster approve
command (requires the global --as actor) writes
__cluster/approvals/{ulid}.json bound to the desired config digest and the
change's before/after digests, so config or state drift invalidates the
artifact automatically (approval_stale warning, never authorizes). One gate
per subtree: compute_approvals lists only the graph-level delete, and
ApprovalRequirement gains a satisfied flag surfaced by plan. Consumption and
the delete executor land next — until then approved deletes stay blocked so
a gate-only build can never strip state without removing the root.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 14:30:05 +03:00
aaltshuler
80cae4e8e1 test(cluster): failpoint coverage for schema-apply crash windows
- Crash before the engine call: sidecar (carrying the --as actor) survives,
  live schema and ledger untouched, no ack; the next run's sweep retires the
  stale intent and the same run applies and converges.
- Crash after the engine call, before the state CAS: the manifest moved with
  the post-op pin in the sidecar, state.json byte-identical; the next run's
  sweep rolls the ledger forward with a schema_apply audit entry and the run
  converges.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:13:15 +03:00
aaltshuler
a1ba4dc413 feat(cluster): execute schema applies in cluster apply
Stage 4B (RFC-004 §D1/§D5): schema.<id> Update changes classify Applied and
execute after graph creates, sequentially and sidecar-fenced — read-write
open (the engine's own recovery runs first), pre-op manifest pin recorded,
apply_schema_as with allow_data_loss: false (soft drops only; hard drops wait
for 4C's approval artifacts), post-op pin rewritten into the sidecar, sidecar
retired only after the final state CAS. Queries gated on a same-plan schema
update unblock (the migration lands first in the same run); failures —
unsupported migrations, lock contention, user branches — surface as
schema_apply_failed with the engine's message, demote dependents via the
origin-aware demotion helper, and stop further graph-moving work.

Schema evolution is now fully cluster-driven (the defer -> manual schema
apply -> refresh loop is gone), and out-of-band schema drift is converged
back by apply as an ordinary soft migration (axiom 8: drift correction is
gated like any change; the recoverable tier needs no approval) — both pinned
by reworked e2es. The multi-graph mixed e2e's deferred row is now
delete-shaped, pre-staging the 4C surface.

Actor: cluster apply accepts the CLI's global --as via the new ApplyOptions /
apply_config_dir_with_options (apply_config_dir delegates unchanged); the
actor is echoed in ApplyOutput and recorded in sidecars and audit entries,
and threads to apply_schema_as so Cedar fires wherever a checker is
installed.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:12:15 +03:00
aaltshuler
0571c05ebb feat(cluster): schema-apply recovery sidecar kind and sweep
RecoverySidecarKind::SchemaApply with digest-based sweep classification
(robust to unrelated manifest movement; version pins stay forensic):
ledger-consistent -> sidecar retired (RFC-004 rows 1+2); live digest matches
the intended schema, state stale -> roll forward with composite recompute and
a recovery_records audit entry (row 3); unverifiable or unexpected digests ->
pending, kept, graph-moving work blocked (rows 1-unopenable/6).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:05:42 +03:00
aaltshuler
ca63a9340b feat(cluster): embed schema migration previews in cluster plan
RFC-004 §D7's data-aware preview: for every schema update, plan opens the
live graph read-only and embeds the engine's migration plan (supported flag
+ typed steps) in the change record; the human renderer prints the steps.
Preview failures (unreachable graph, planner error) degrade to the digest
diff with a schema_preview_unavailable warning — planning never blocks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:04:19 +03:00
aaltshuler
b313075476 refactor(cluster): make plan_config_dir async
Mechanical conversion ahead of Stage 4B (plan will preview schema migrations
against live graphs): signature, CLI dispatch, and test callers. Zero
behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 13:02:12 +03:00
aaltshuler
83d77bcb16 test(cluster): failpoint coverage for graph-create crash windows
- Crash before the init (row 1): sidecar survives, nothing moved, no ack;
  the next run's sweep removes the intent and the same run creates and
  converges.
- Crash after the init, before the state CAS (row 4): the graph exists with
  the post-init manifest pin in the sidecar, state.json byte-identical; the
  next run's sweep rolls the ledger forward with a recovery_records audit
  entry and the run converges.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:59:48 +03:00
aaltshuler
c3007369cd feat(cluster): execute graph creates in cluster apply
Stage 4A (RFC-004 §D1/§D5): graph.<id> Create — and its paired schema Create,
which the init carries — classify Applied and execute first in the run,
sequentially and sidecar-fenced: sidecar written before Omnigraph::init at
the derived root, rewritten with the post-init manifest pin, deleted only
after the final state CAS lands. Dependent queries and policies no longer
block on a graph create in the same plan — creates run first, so they apply
in the same run; a create failure demotes them to blocked
(dependency_not_applied) and stops further graph-moving work (loud partials),
with the sidecar left for the sweep to classify. Graphs with a kept recovery
sidecar (rows 5/6) classify Blocked/cluster_recovery_pending, and the sweep's
Drifted/Error statuses are never clobbered by a generic Blocked.

Schema source is re-read and digest-verified under the lock before the init
(the write_resource_payload TOCTOU posture). Plan previews the same
dispositions. e2e fallout updated: a fresh multi-graph config now converges
in one apply; a destroyed root is re-created as an EMPTY graph by the next
apply (declarative convergence — visible in plan, called out in docs); the
new cluster_e2e_declared_graph_created_by_apply pins the no-manual-init flow.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:58:56 +03:00
aaltshuler
bf8cc7a753 feat(cluster): graph-create recovery sidecars and sweep
RFC-004 §D2/§D3 for the graph_create kind. RecoverySidecar records intent
under __cluster/recoveries/{ulid}.json; the roll-forward-only sweep runs at
the start of apply/refresh/import under the state lock and classifies each
survivor by observation: root absent -> intent removed (row 1); outcome
already recorded -> retired (row 2); create completed but state stale ->
ledger rolled forward with a recovery_records audit entry (row 4); partial
root -> Error/graph_create_incomplete, kept, never auto-deleted (row 5);
unexpected schema -> Drifted/actual_applied_state_pending, kept (row 6).
Sweep mutations ride the command's existing CAS write; completed sidecars
are deleted only after that write lands. Read-only status/plan warn
(cluster_recovery_pending) without acting. The apply payload gate now counts
only payload-phase errors so kept-sidecar diagnostics don't abort the run
before their statuses persist.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:50:42 +03:00
aaltshuler
6fbf09d5c9 refactor(cluster): make apply_config_dir async
Mechanical conversion ahead of Stage 4A graph create (which calls the async
Omnigraph::init from inside apply): the fn signature, the CLI dispatch arm,
and every test caller (#[test] -> #[tokio::test]). Zero behavior change; all
60 lib tests and 3 failpoint tests green before and after.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 04:43:38 +03:00
aaltshuler
16759b28b9 fix(cluster): RAII-guard the callback failpoint
ScopedFailPoint::with_callback gives cfg_callback the same Drop-based cleanup
as cfg actions; a panic while the point is active no longer leaks the callback
into the process-global registry where it would fire under later tests
(greptile review, PR #167).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:36:24 +03:00
aaltshuler
211b37e6de test(cluster): failpoint tests for crash-mid-apply and state CAS race
The apply-side coverage the implementation spec's hard gate requires before
Phase 4 graph-moving apply:

- crash after the payload phase: state.json byte-identical, blobs inert on
  disk, lock released, no phantom statuses, nothing acknowledged; a plain
  re-run repairs via skip-if-exists blob reuse.
- CAS race: a cfg_callback rewrites state.json at the exact read->write
  window (the state.lock:false concurrent-writer scenario); apply surfaces
  state_cas_mismatch, acknowledges nothing, reports the persisted status
  snapshot, leaves the concurrent writer's state on disk; a re-run converges.

CI's failpoints step now runs both the engine and cluster suites.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:14:06 +03:00
aaltshuler
21b531605f feat(cluster): failpoint infrastructure mirroring the engine
Optional failpoints feature (dep:fail + fail/failpoints, deliberately NOT
enabling omnigraph/failpoints), a maybe_fail/ScopedFailPoint module returning
Diagnostic-typed injected errors, and two call sites in apply_config_dir:
cluster_apply.after_payload_phase (the crash point: blobs on disk, state
untouched) and cluster_apply.before_state_write (routes through the
persisted-statuses revert contract; a cfg_callback here can mutate state.json
to make the CAS check fail organically). Feature off compiles to Ok(()) —
zero behavior change. Tests live in a separate integration binary because the
fail registry is process-global. Also refresh the crate description (stale
'read-only' since Stage 3A).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:12:59 +03:00
aaltshuler
acb3f1cc14 test(cli): e2e for catalog payload drift self-heal loop
status warns read-only -> refresh persists drift and drops the digest ->
apply republishes the blob -> status clean.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:08:14 +03:00
aaltshuler
15868972ff feat(cluster): verify catalog payload blobs in status and refresh
Closes the Stage 3A product gap where a deleted or corrupted blob under
__cluster/resources/ went unnoticed forever (status reported converged and
apply could not repair it because the digests matched).

verify_catalog_payloads checks every query/policy digest in state against its
content-addressed blob (existence + full sha256 re-hash; graph/schema/unknown
addresses have no payloads and are skipped). status reports findings read-only
(warnings catalog_payload_missing/_mismatch; error catalog_payload_read_error
— an unverifiable catalog must not report healthy). refresh closes the
self-heal loop: missing/mismatched blobs mark the resource drifted and remove
its digest from state so the next plan proposes a create and the next apply
republishes; unreadable blobs keep the digest (no spurious republish), mark
error, and exit non-zero. Verification runs before graph observation so the
recomputed graph composite already excludes removed query digests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 02:07:08 +03:00
Andrew Altshuler
b6d228ff54
test(cli): cluster e2e hardening — lost-state recovery, out-of-band drift, root destruction, multi-graph convergence (#166)
Four lifecycle compositions over the spawned binary that pin spec claims no
single-command test proves:

- Lost ledger: delete state.json -> re-import from the live graph -> re-apply
  converges onto the same content-addressed blobs (axiom 5's reconstructable-
  state resilience edge, end to end).
- Out-of-band schema apply (the Sarah/Bob violation): refresh marks
  graph/schema Drifted with schema_mismatch, status and plan surface it, and
  cluster apply refuses to silently correct it — state keeps the LIVE schema
  digest (drift correction is gated, axiom 8).
- Destroyed graph root: refresh records graph_missing drift and drops
  graph/schema digests while preserving query/policy; plan proposes deferred
  creates only; apply moves nothing and the catalog stays intact.
- Two graphs (one live, one not yet created) + a graph-spanning policy + a
  cluster-scoped policy: a single apply yields all four dispositions at once
  (applied/derived/deferred/blocked, deterministically ordered), then the
  second graph appears, refresh observes it, and apply converges.

Helpers: init_named_cluster_graph generalizes init_cluster_derived_graph;
write_multi_graph_cluster_fixture builds the two-graph config.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 00:59:20 +03:00
aaltshuler
69b63c33ac Merge remote-tracking branch 'origin/main' into feat/cluster-apply-stage3a 2026-06-10 00:45:03 +03:00
aaltshuler
5e1dede08f fix(cluster,cli): apply failure output — persisted statuses only, changes list printed
Two review findings (greptile, PR #165):

- ApplyOutput.resource_statuses on a failed state write now carries the
  pre-apply on-disk snapshot instead of the in-memory mutations that were
  never persisted, so automation reading the field independently of `ok`
  cannot see phantom applied/blocked statuses. Regression test forces the
  state write to fail via a read-only __cluster dir (unix-only, skips when
  permissions are not enforced).
- Human-mode `cluster apply` prints the classified changes list on failure
  too, so an operator debugging a partial apply without --json sees what was
  attempted.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-10 00:35:03 +03:00
devin-ai-integration[bot]
2c578a60b2
(feat) convert engine call sites to &dyn TableStorage; demote legacy TableStore methods to pub(crate) (#86)
* MR-854: convert engine call sites to &dyn TableStorage; demote legacy methods

Phase 1b: every db.table_store.X(...) call site converts to
db.storage().X(...), reaching the storage layer through the sealed
TableStorage trait (returns &dyn TableStorage). Opaque SnapshotHandle
and StagedHandle replace bare lance::Dataset and Transaction in the
threaded values.

Phase 9: the inherent inline-commit methods on TableStore
(append_batch, merge_insert_batch{,es}, overwrite_batch,
create_btree_index, create_inverted_index) demote from pub to
pub(crate). Their only remaining direct users are table_store.rs
itself and the bulk loader's LoadMode::{Append, Overwrite, Merge}
concurrent fast-paths in loader::write_batch_to_dataset (no
two-phase shape in Lance 4.0.0 — closes after lance#6658 and #6666).

Docs:
- invariants.md \u00a7VI.23: drop "at the writer-trait surface"
  qualifier; staged primitives are now the only engine surface.
- runs.md: residual matrix shrinks to delete_where and
  create_vector_index (the two upstream-blocked residuals).
- forbidden_apis.rs: replace transitional language with the
  current allow-list shape (table_store.rs + loader concurrent
  fast-path only).

Files touched:
- changes/mod.rs, db/omnigraph.rs (+export/optimize/schema_apply/
  table_ops.rs), exec/{merge,mod,mutation,staging}.rs,
  loader/mod.rs, storage_layer.rs, table_store.rs,
  tests/forbidden_apis.rs, docs/{invariants,runs}.md.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* MR-854: replace test-only inline-commit append callers with local Lance helpers

After demoting TableStore::append_batch from pub to pub(crate), the
integration tests in tests/recovery.rs and tests/staged_writes.rs
that previously called store.append_batch(...) directly to simulate
HEAD-ahead-of-manifest drift can no longer access the inherent
method. Replace those calls with small in-test helpers that do a raw
Dataset::append (the same body the inherent method runs).

- tests/helpers/mod.rs gains lance_append_inline (shared helper).
- tests/staged_writes.rs gets a file-local lance_append_inline_local
  (staged_writes.rs does not import helpers::).
- tests/recovery.rs drops the unused TableStore import in the one
  function whose store binding became unused after the conversion.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* MR-854: retrigger CI for flaky Test Workspace job

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* MR-854: convert remaining table_store call sites in export.rs / read_blob

Two leftover `self.table_store.X` / `db.table_store.X` call sites were
missed in the initial sweep — flagged by Devin Review on PR #86. Both
now go through the trait surface:

- `entity_from_snapshot` (db/omnigraph/export.rs): switch from
  `db.table_store.open_snapshot_table` + `db.table_store.scan` to
  `db.storage().open_snapshot_at_table` + `db.storage().scan`.

- `read_blob` (db/omnigraph.rs): replace
  `snapshot.open(table_key)` + `self.table_store.first_row_id_for_filter`
  with `self.storage().open_snapshot_at_table` +
  `self.storage().first_row_id_for_filter`. The follow-up
  `take_blobs` call still needs an `Arc<Dataset>` (it's a Lance blob
  accessor not surfaced through the trait), so we hand off via
  `SnapshotHandle::into_arc()` with a comment.

After this commit, no engine code outside `table_store.rs` reaches the
inherent `TableStore` API — the docs/runs.md and docs/invariants.md
claim is now uniformly true.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* MR-854: post-rebase doc fixes (Lance 6.0.1, MR-A framing, into_dataset note)

Reviewer feedback on the rebased PR:

* docs/dev/writes.md residuals matrix: drop demoted methods from the trait-surface table (now `pub(crate)`); keep only the two genuine trait-surface residuals (`delete_where`, `create_vector_index`); reframe under MR-A (Lance v7.x bump) per docs/dev/lance.md.

* tests/forbidden_apis.rs: update transitional allow-list header to (a) drop the truncate_table mislabel (truncate_table is a Lance Dataset method, not a TableStore method — overwrite_batch's internal call), (b) reframe trait-surface residuals under MR-A / Lance #6666.

* crates/omnigraph/src/storage_layer.rs::SnapshotHandle::{into_arc, into_dataset}: add single-ref invariant doc — both consume Arc via try_unwrap-or-clone; sibling SnapshotHandle clones across an await point force a deep Dataset clone.

* Replace lance-4.0.0 version refs with lance-6.0.1 in active source/test/dev-doc comments (storage_layer.rs, table_store.rs, table_ops.rs, schema_apply.rs, merge.rs, recovery.rs, staged_writes.rs, consistency.rs, docs/dev/execution.md, docs/user/query-language.md). Historical refs in docs/releases/v0.4.1.md and the canonical "Lance 4.0.0 → 6.0.1 migration" line in docs/dev/lance.md left intact.

No engine code changes.

* MR-854: update docs/dev/invariants.md Storage trait row + gap entry

Reviewer feedback: the docs reorg landed; the invariant row now lives in
docs/dev/invariants.md with stable headings (no more numbered §VI.23).

Update two pieces to reflect MR-854 completion:

* Status table 'Storage trait' row: was 'full call-site migration ... incomplete';
  now 'engine call sites all route through db.storage() (MR-854); inline-commit
  inherent methods are pub(crate)-demoted; capability/stat surfaces are roadmap'.

* 'Known Gaps' 'Storage abstraction' entry: was 'older inherent TableStore call
  sites and inline residuals remain'; now names the closed scope (MR-854 — call
  sites migrated, methods demoted, loader fast-paths) and the remaining
  trait-surface residuals under MR-A (Lance v7.x bump) and Lance #6666.

Cross-links to docs/dev/lance.md and docs/dev/writes.md so the framing stays
co-located with the canonical Lance surface tracking.

* MR-854: remove dead inline-commit methods from the storage surface

The loader concurrent fast-path (write_batch_to_dataset) is only reached
for LoadMode::Overwrite — Append/Merge route through MutationStaging — so
its Append/Merge arms were unreachable. Collapse it to overwrite-only and
drop the now-unused mode params, which removes the only callers of:

- TableStorage::append_batch + TableStorage::merge_insert_batches (trait)
- TableStore::merge_insert_batch + merge_insert_batches (inherent)

create_btree_index / create_inverted_index had zero callers anywhere
(scalar index builds use the stage_* primitives). Remove both from the
trait and the inherent impl.

Inherent append_batch stays pub(crate): overwrite_batch and recovery
tests use it. Migrate the one trait-append_batch test caller
(seed_person_row) to stage_append + commit_staged. The merge_insert
FirstSeen-workaround rationale moves from the deleted merge_insert_batch
into stage_merge_insert (now the sole merge path). No behavior change.

Also corrects the inaccurate loader residual comment (the prior text
blamed Lance #6658/#6666, which are the delete and vector-index issues,
for keeping overwrite inline; a stage_overwrite primitive already exists
and schema_apply uses it).

* MR-854: seal db.storage() to staged-only; move residuals to InlineCommitResidual

Split the three remaining inline-commit writes (overwrite_batch,
delete_where, create_vector_index) off the TableStorage trait onto a new
sealed InlineCommitResidual trait, reachable only via the explicit
Omnigraph::storage_inline_residual() accessor. db.storage() now exposes
only staged primitives + reads, so engine code cannot couple a write
with a Lance HEAD advance through the default surface — MR-793 acceptance
§1 ("no public method commits as a side effect of writing") now holds by
construction, not by review + naming.

Call sites moved to storage_inline_residual(): loader overwrite
fast-path, the three mutation delete_where paths, the branch-merge
delete, and the vector-index build. Impl bodies are unchanged (same
delegation to the pub(crate) inherent methods); this is a pure surface
reshape with no behavior change.

The residual trait holds two genuinely upstream-blocked methods
(delete_where -> Lance #6658/v7.x, create_vector_index -> Lance #6666)
plus overwrite_batch, kept for the loader's cross-table bulk-overwrite
concurrency until its staged migration lands (tracked follow-up).

* MR-854 docs: describe the staged-only seal; fix stale Lance index URLs

- writes.md / invariants.md / AGENTS.md: the inline-commit residuals now
  live on InlineCommitResidual behind db.storage_inline_residual(), so
  acceptance §1 holds by construction rather than 'option (b)' per-method
  enumeration. Drop the inaccurate 'until Lance exposes
  Operation::Overwrite { fragments }' claim (that op exists; stage_overwrite
  already builds it) and reframe overwrite_batch as a removable legacy
  residual gated on the loader's bulk-overwrite concurrency.
- forbidden_apis.rs: rewrite the allow-list doc for the split surface.
- lance.md: the index spec pages moved from /format/table/index/ to
  /format/index/ in Lance 6.x (the old paths 404). Fix all 13 URLs.

* MR-854: fix stale lance-4.0.0 comment refs flagged in review

Addresses greptile (exec/merge.rs) and aaltshuler's stale-version blocker:
update lance-4.0.0 -> 6.0.1 in the comment/doc refs within this PR's
footprint (exec/merge.rs, exec/mutation.rs, docs/dev/writes.md). Also
corrects exec/merge.rs to cite lance#6666 (not #6658) for
build_index_metadata_from_segments — that is the vector-index segment-commit
API; #6658 is the two-phase delete. (Pre-existing 4.0.0 refs in untouched
files like architecture.md/storage.md are main's incomplete migration
cleanup, left out of scope.)

* fix(storage): stage loader overwrites

* fix(storage): stage empty schema rewrites

---------

Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Ragnor Comerford <hello@ragnor.co>
2026-06-09 23:03:08 +02:00
aaltshuler
d870eaaf3f test(cli): cluster lifecycle e2e — real-graph import/apply/refresh, schema-change loop, force-unlock retry
Three composition tests over the spawned binary against a real derived graph:

- import -> plan (dispositions) -> apply -> status -> refresh -> plan-empty,
  then a query edit round-trip. Pins that refresh and apply recompute the
  graph composite digest identically — divergence would silently re-open
  the plan forever and no single-command test would catch it.
- The Stage 3A operator workflow across the control/data-plane boundary:
  cluster apply defers a schema change, omnigraph schema apply executes it,
  cluster refresh observes it, the next cluster apply re-converges.
- Held lock refuses apply, force-unlock clears it, retried apply converges.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-09 23:44:49 +03:00
aaltshuler
bcef8444dd feat(cli): omnigraph cluster apply
Terraform-style: apply executes directly (cluster plan is the preview, now
annotated with apply dispositions). Human output prints per-change
dispositions, convergence, and the catalog-only caveat; --json emits the full
ApplyOutput. Exit is non-zero only on errors — deferred/blocked changes are
warnings with converged: false as the automation signal.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-09 23:34:48 +03:00
aaltshuler
1f8e5945cf feat(cluster): config-only apply with content-addressed catalog publish
apply_config_dir executes the query/policy subset of the plan: payloads are
written content-addressed under __cluster/resources/{query,policy}/... before
the state CAS (state is the publish point; orphaned blobs from a failed CAS
are inert and re-apply is the repair), then state.json is CAS-updated with
applied digests, Applied/Blocked statuses, and a revision bump. Graph/schema
changes are never executed here: schema content and graph lifecycle defer to
a later phase with loud warnings, while graph.<id> composite-digest updates
whose schema component is unchanged converge automatically via recomputation
from state's own components (without which apply could never converge).
Idempotent re-apply leaves state bytes and revision untouched.

PlanChange gains optional disposition/reason fields, populated by the same
classifier in cluster plan, so plan is an honest preview of what apply will
execute, derive, defer, or block.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-09 23:32:13 +03:00