The standing caveat ('applied means recorded in the cluster catalog —
nothing more; the server still boots from omnigraph.yaml') retires: cluster
docs gain the 'Serving from the cluster' section (exclusivity, applied-
revision serving, fail-fast readiness, restart-to-pick-up, expose-all
bridge), server.md gains mode-inference rule 0 and the cluster-booted multi
mode, deployment.md the boot-source choice, and the CLI's apply note plus
the cli-reference cluster row (stale back to Stage 3A) now describe the full
convergence surface. RFC-005 flips to Landed with four implementation
deviations recorded.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Approvals + gated graph deletion in the user docs, the approve command in the
CLI reference, RFC-004 flipped to Landed with its three implementation
deviations recorded (row-8 retire-and-repropose, --as instead of --actor/--by,
consumed artifacts rewritten in place rather than moved).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Four lifecycle compositions over the spawned binary that pin spec claims no
single-command test proves:
- Lost ledger: delete state.json -> re-import from the live graph -> re-apply
converges onto the same content-addressed blobs (axiom 5's reconstructable-
state resilience edge, end to end).
- Out-of-band schema apply (the Sarah/Bob violation): refresh marks
graph/schema Drifted with schema_mismatch, status and plan surface it, and
cluster apply refuses to silently correct it — state keeps the LIVE schema
digest (drift correction is gated, axiom 8).
- Destroyed graph root: refresh records graph_missing drift and drops
graph/schema digests while preserving query/policy; plan proposes deferred
creates only; apply moves nothing and the catalog stays intact.
- Two graphs (one live, one not yet created) + a graph-spanning policy + a
cluster-scoped policy: a single apply yields all four dispositions at once
(applied/derived/deferred/blocked, deterministically ordered), then the
second graph appears, refresh observes it, and apply converges.
Helpers: init_named_cluster_graph generalizes init_cluster_derived_graph;
write_multi_graph_cluster_fixture builds the two-graph config.
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
* docs(invariants): note the non-atomic manifest->commit-graph publish gap
Every graph publish commits __manifest then appends _graph_commits as two
separate writes; a crash between them leaves the manifest ahead of the commit
DAG. Live reads + durability are unaffected (reads resolve via the manifest) and
recovery does not repair it; impact is bounded to commit history / time-travel
by commit id / merge-base completeness. Pre-existing across all publishes, not
the optimize reconcile specifically. Documented as a Known Gap; the fix is a
commit-graph reconcilable from the manifest, not a recovery sidecar.
* fix(maintenance): route uncovered drift through repair
* fix(maintenance): harden repair review feedback
The maintenance.rs row referenced `optimize_reconciles_preexisting_manifest_head_drift`,
which never existed (leftover from the reconcile-drift heuristic removed in #141).
The actual second test is `optimize_defers_when_recovery_sidecar_is_pending`.
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* test(optimize): cover manifest publish + HEAD-drift reconcile
Red against the pre-fix optimize, which ran compact_files without
publishing the compacted version to __manifest:
- maintenance: optimize must publish so the manifest table_version
tracks the compacted Lance HEAD and a later schema apply succeeds;
and must reconcile a pre-existing manifest-behind-HEAD drift (forged
via raw Lance compaction) so strict writes commit again.
- end_to_end + composite_flow: post-optimize query / strict update /
reopen in the full lifecycle (the canonical flow previously omitted
post-optimize writes as a documented "known limitation").
- failpoints: a crash between compaction and the manifest publish rolls
forward on next open.
* fix(optimize): publish compaction to manifest and reconcile HEAD drift
optimize ran Lance compact_files without publishing the new version to
__manifest, so the manifest table_version lagged the Lance HEAD: reads
stayed pinned to the pre-compaction version, and the next schema apply or
strict update/delete failed its HEAD-vs-manifest precondition with
"stale view ... refresh and retry" (open-time recovery rollback inflated
the gap on retry).
optimize now publishes each compacted table's version under the
per-(table, main) write queue, guarded by a manifest CAS and a
SidecarKind::Optimize recovery sidecar (loose-match; roll-forward is safe
because compaction is content-preserving). When a table has nothing left
to compact but its Lance HEAD is already ahead of the manifest pin
(pre-fix drift, or a recovery restore commit), optimize reconciles the
manifest forward to HEAD (metadata-only, no sidecar). Caches and the
CSR/CSC graph index are invalidated after a publish.
Docs updated (maintenance, storage, branches-commits, writes, testing).
* test(recovery): rollback convergence + optimize-defer regressions
Red against the current code, landed before the fix:
- recovery: after the open-time sweep rolls a sidecar back, the manifest
must track Lance HEAD (no residual drift) so a follow-up schema apply
succeeds — the original "+1 per retry" loop. Today roll-back restores
without publishing, so the manifest lags HEAD and the apply fails its
HEAD-vs-manifest precondition.
- maintenance: optimize must refuse while a recovery sidecar is pending —
operating on an unrecovered graph could publish a partial write the
sweep would roll back.
Also removes optimize_reconciles_preexisting_manifest_head_drift: the
ad-hoc drift reconcile it covered is replaced by recovery-side convergence.
* fix(recovery): converge manifest on roll-back; optimize defers on pending recovery
Root of PR #141's review findings and the original "+1 per retry" loop:
a Lance HEAD ahead of the manifest was ambiguous (benign content-preserving
drift vs. a partial write a sidecar will roll back), and optimize's reconcile
guessed it benign. Close the class instead of guessing:
- Recovery roll-back now PUBLISHES the restored version (via a
push_table_update_at_head helper shared with roll-forward), so the manifest
tracks the Lance HEAD after recovery — symmetric with roll-forward. This
fixes the +1 loop (after one roll-back the retry's HEAD-vs-manifest
precondition passes) and removes the only remaining source of orphaned
drift. The audit still records the logical rolled-back-to version; the
manifest is published at the restore commit (identical content).
- optimize drops the ad-hoc drift reconcile and instead REFUSES when a
__recovery sidecar is pending, so it only ever operates on a recovered
graph (manifest == HEAD); its compaction publish can no longer commit a
partial write. With the reconcile gone, the blob-skip-vs-reconcile gap is
moot.
Updates the rollback recovery-test helper (manifest == HEAD after roll-back),
the failpoints assertions, and the user/dev docs.
* test(recovery): fix rollback assertion for manifest convergence
The roll-back-publishes change makes the manifest version advance after a
SchemaApply roll-back (to the old-schema content), so the
schema_apply_without_schema_staging_rolls_back_on_next_open assertion must
be `version > pre`, not `version == pre`. This update was dropped during
the commit churn and surfaced as a CI Test Workspace failure; the
old-schema-preserved intent stays covered by count_rows + _schema.pg + the
RolledBack convergence invariant.
The Run state machine was removed in MR-771 (v0.4.0); `docs/dev/runs.md`
and `crates/omnigraph/tests/runs.rs` have since documented and tested the
direct-publish write path, so the "runs" name was misleading.
- git mv docs/dev/runs.md → docs/dev/writes.md (reframe H1 + intro;
keep MR-771 history note)
- git mv crates/omnigraph/tests/runs.rs → tests/writes.rs (reframe header)
- repoint every runs.md / runs.rs reference across docs, AGENTS.md, and
source comments
- fix four pre-existing broken `docs/runs.md` links (the file never lived
at that path) to `docs/dev/writes.md`
- fix the stale v0.4.0 anchor to the live section
No behavior change: every source edit is a comment. Engine builds and the
renamed test passes 25/25; scripts/check-agents-md.sh passes.
The run-removal cleanup itself (run_registry.rs guard, __run__ prefix) is
deferred to MR-770.