diff --git a/AGENTS.md b/AGENTS.md index 3f5b711..812629a 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -237,7 +237,7 @@ omnigraph policy explain --actor act-alice --action change --branch main | Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables | | Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering | | Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) the open-time recovery sweep for the residual gap between (1) and (2). All three layers ship; the five migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`, `optimize_all_tables`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore` followed by a manifest publish of the restored version (so both directions converge to `manifest == HEAD` — no residual drift), and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). Continuous in-process recovery (no restart needed between Phase B failure and recovery) is the goal of a future background reconciler. Engine writes route through a sealed `TableStorage` trait exposing `stage_*` + `commit_staged` as the canonical staged-write surface; documented inline-commit residuals (`delete_where`, `create_vector_index`, plus legacy `append_batch` / `merge_insert_batches` / `overwrite_batch` / `create_*_index`) remain on the trait until upstream Lance ships a public two-phase API ([#6658](https://github.com/lance-format/lance/issues/6658), [#6666](https://github.com/lance-format/lance/issues/6666)) and the migration of every call site completes. | -| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency; **publishes each compacted table's new version to `__manifest`** (so the manifest tracks the Lance HEAD — required for reads to observe compaction and for schema apply / strict writes to pass their HEAD-vs-manifest precondition), under the per-`(table, main)` write queue with `SidecarKind::Optimize` recovery coverage; **refuses on an unrecovered graph** (errors if a `__recovery` sidecar is pending — recovery may roll back a partial write, so optimize requires `manifest == HEAD` going in); **skips blob-bearing tables** (reported via `TableOptimizeStats.skipped`, not silent), gated on `LANCE_SUPPORTS_BLOB_COMPACTION` until the upstream blob-v2 compaction-decode bug is fixed (see [docs/dev/invariants.md](docs/dev/invariants.md) Known Gaps) | +| Compaction (`compact_files`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency; **publishes each compacted table's new version to `__manifest`** (so the manifest tracks the Lance HEAD — required for reads to observe compaction, and so system-produced drift stays bounded; strict writes and schema apply separately *tolerate* benign uncovered `HEAD > manifest` drift via the pre-stage precondition rather than 409ing on it — see [docs/dev/writes.md](docs/dev/writes.md)), under the per-`(table, main)` write queue with `SidecarKind::Optimize` recovery coverage; **refuses on an unrecovered graph** (errors if a `__recovery` sidecar is pending — recovery may roll back a partial write, so optimize requires `manifest == HEAD` going in); **skips blob-bearing tables** (reported via `TableOptimizeStats.skipped`, not silent), gated on `LANCE_SUPPORTS_BLOB_COMPACTION` until the upstream blob-v2 compaction-decode bug is fixed (see [docs/dev/invariants.md](docs/dev/invariants.md) Known Gaps) | | Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy | | BTREE / inverted (FTS) / vector indexes | ✅ | `ensure_indices` builds them on every relevant column; idempotent; lazy across branches | | `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering | diff --git a/docs/dev/invariants.md b/docs/dev/invariants.md index 5ee4f17..592b83a 100644 --- a/docs/dev/invariants.md +++ b/docs/dev/invariants.md @@ -99,6 +99,7 @@ Use it this way: | Multi-table commit | Manifest CAS plus recovery sidecars; not a single Lance primitive | [writes.md](writes.md), [architecture.md](architecture.md) | | Constructive mutations | In-memory `MutationStaging`, one end-of-query table commit per touched table, then one manifest publish | [writes.md](writes.md), [execution.md](execution.md) | | Deletes | Inline-commit residual; delete-only queries allowed, mixed insert/update/delete rejected by D2 | [query-language.md](../user/query-language.md), [writes.md](writes.md) | +| Pre-stage write precondition | Strict writers tolerate benign `HEAD > manifest` drift with no sidecar (proceed), defer when a recovery sidecar pins the table, and reject a stale caller pin (`caller pin != fresh manifest pin`) with `ExpectedVersionMismatch`. The OCC fence is the fresh manifest pin, never the caller's snapshot pin. Correct *only* because uncovered drift is always content-preserving | [writes.md](writes.md) | | Branch delete | Manifest is the single authority, flipped atomically first; per-table forks + commit-graph branch are derived state, reclaimed best-effort (`force_delete_branch`) with the `cleanup` reconciler as the guaranteed backstop. Reusing a name whose reclaim failed before `cleanup` surfaces an actionable error | [branches-commits.md](../user/branches-commits.md), [maintenance.md](../user/maintenance.md) | | Schema validation | Type checks, required fields, defaults, edge endpoint checks, and edge cardinality are enforced on write paths | [schema-language.md](../user/schema-language.md), [execution.md](execution.md) | | Unique constraints | Intra-batch and write-path checks exist; full cross-version uniqueness is still a gap | [schema-language.md](../user/schema-language.md) | @@ -165,6 +166,11 @@ case is exceptional. snapshot. - New write paths that can advance Lance HEAD before manifest publish without a recovery sidecar. +- Advancing Lance HEAD with *non-content-preserving* data and no recovery + sidecar. The pre-stage write precondition tolerates uncovered `HEAD > manifest` + drift (proceeds without deferring) on the assumption it is content-preserving; + a path that breaks that assumption without registering a sidecar turns the + tolerance into a silent-data-loss vector. - Cross-query `BEGIN`/`COMMIT` transactions in the OSS engine. Use branches and merges for multi-query workflows. - Acknowledging writes before durable Lance and manifest persistence. diff --git a/docs/dev/testing.md b/docs/dev/testing.md index e737dea..e38fb73 100644 --- a/docs/dev/testing.md +++ b/docs/dev/testing.md @@ -20,13 +20,13 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav | `end_to_end.rs` | Full init → load → query/mutate flow | | `branching.rs` | Branch create / list / delete, lazy fork | | `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9×9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. | -| `writes.rs` | Direct-publish writes: cancellation, concurrent-writer CAS, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) | +| `writes.rs` | Direct-publish writes: cancellation, concurrent-writer CAS, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery); pre-stage drift tolerance (`strict_update_proceeds_on_benign_drift_without_sidecar`, `delete_proceeds_on_benign_drift_without_sidecar`, `strict_update_defers_when_sidecar_pins_table`) | | `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) — primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead | | `lifecycle.rs` | Graph lifecycle, schema state | | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) | | `changes.rs` | `diff_between` / `diff_commits` | | `consistency.rs` | Cross-table snapshot isolation, atomic publish | -| `schema_apply.rs` | Migration plan + apply, schema-apply lock | +| `schema_apply.rs` | Migration plan + apply, schema-apply lock; pre-stage drift tolerance (`additive_apply_proceeds_on_benign_drift_without_sidecar`, `additive_apply_defers_when_sidecar_pins_table`) | | `search.rs` | FTS / vector / hybrid (`bm25`, `nearest`, `rrf`) | | `traversal.rs` | `Expand`, variable-length hops, anti-join | | `aggregation.rs` | `count`, `sum`, `avg`, `min`, `max` | diff --git a/docs/dev/writes.md b/docs/dev/writes.md index d2c7c7e..aca6dfc 100644 --- a/docs/dev/writes.md +++ b/docs/dev/writes.md @@ -239,6 +239,63 @@ publisher commit produces exactly one winner. The residual above is about *our* abandoned commits in the failure path, not about concurrency races. +### Pre-stage write precondition: tolerate benign drift, defer sidecar-covered + +Strict writers (Update / Delete / SchemaRewrite, and the schema-apply +index rebuild) run a pre-stage precondition — `Omnigraph::ensure_writable_or_defer` +— before staging. Insert/Merge skip it (Lance's auto-rebase + the queue + +the publisher CAS handle their drift). + +A table's **Lance HEAD can legitimately sit ahead of its manifest pin** +between an in-place HEAD advance and the next manifest publish. Sources: +`optimize` compaction *before its publish*, a recovery `Dataset::restore`, +an old-binary optimize that never published, an *external* `compact_files`, +or a finalize→publisher residual. All of these are **content-preserving** +and carry **no recovery sidecar**. The only `HEAD > pin` state that is *not* +safe to write over is a real in-flight partial write, which the writer +protocol always covers with a `__recovery/{ulid}.json` sidecar (Phase A). + +So the precondition disambiguates `HEAD > pin` by sidecar presence rather +than rejecting it wholesale. The OCC fence is the **current** manifest pin, +re-read fresh on the conflict path — *not* the caller's snapshot pin, which +may be stale: + +- `HEAD == caller pin` → fresh, no drift → proceed (fast path, no extra read). +- `caller pin != current pin` → the caller's pre-write view is stale relative + to the live manifest: a normal OCC conflict. Fail with + `ExpectedVersionMismatch` **here**, before any staged commit or sidecar, so + the client refreshes and retries with no residue left behind. +- `caller pin == current pin`, `HEAD > pin`, **no sidecar** pins the table → + benign content-preserving drift → **proceed**; the writer's own + `commit_staged` + the publisher CAS reconcile the manifest at the commit + boundary. +- `caller pin == current pin`, `HEAD > pin`, **a sidecar** pins the table → + defer with an actionable "reopen the graph to run the recovery sweep" + error; never write onto state the open-time sweep may roll back. +- `HEAD < current pin` → the manifest cannot lead durable Lance state under + the commit protocol → loud invariant violation. + +This is the **consumer-side** complement to the producer-side convergence +above (recovery roll-back and `optimize` both publish so `manifest == HEAD`). +Convergence keeps *system-produced* drift bounded; the precondition is the +net for drift no sidecar covers — legacy old-binary optimize, external Lance +compaction — which heals at the point of use on the next strict write. + +> **Load-bearing invariant.** Tolerating uncovered drift is correct *only* +> because such drift is always content-preserving: a strict write (and schema +> apply, which reads source at the pinned version and rewrites onto HEAD) +> overwrites the drifted HEAD assuming its rows equal the pinned version's +> rows. A future code path that advances Lance HEAD with *different content* +> and no sidecar would turn this tolerance into a silent-data-loss vector — +> such a path must register a recovery sidecar. See +> [docs/dev/invariants.md](invariants.md). + +> **Observable change (Hyrum's Law).** A strict write or schema apply on a +> benign-drifted table now **succeeds** where it previously returned 409 +> "stale view … refresh and retry". Clients that depended on the 409 to detect +> compaction/recovery drift must not — that 409 is reserved for genuine OCC +> conflicts (stale handle / concurrent publisher). + ## Conflict shape Concurrent writers to the same `(table, branch)` produce exactly one diff --git a/docs/user/maintenance.md b/docs/user/maintenance.md index a835799..015322d 100644 --- a/docs/user/maintenance.md +++ b/docs/user/maintenance.md @@ -4,7 +4,7 @@ ## `optimize_all_tables(db)` — non-destructive -- Lance `compact_files()` on every node + edge table on `main`, then **publishes the compacted version to the `__manifest`** so the manifest's `table_version` tracks the compacted Lance HEAD. Reads pin the manifest version, so without this publish compaction would be invisible to readers *and* would break the HEAD-vs-manifest precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually compacted. +- Lance `compact_files()` on every node + edge table on `main`, then **publishes the compacted version to the `__manifest`** so the manifest's `table_version` tracks the compacted Lance HEAD. Reads pin the manifest version, so without this publish compaction would be invisible to readers until the next write. The publish advances the graph version (a system-attributed commit) only for tables that actually compacted. (Even if a graph is left with uncompacted `HEAD > manifest` drift — e.g. an old-binary optimize that never published, or an external `compact_files` — strict writes and schema apply now tolerate that benign drift and reconcile it on the next write; see the pre-stage write precondition in [docs/dev/writes.md](../dev/writes.md). The publish is still done so reads observe compaction immediately and system-produced drift stays bounded.) - Rewrites small fragments into fewer large ones; old fragments remain reachable via older manifests until `cleanup` runs. - Each table's compact→publish runs under its per-`(table, main)` write queue (serializing with concurrent mutations — compaction is a Lance `Rewrite` op that retryable-conflicts with a concurrent merge/update/delete on overlapping fragments). The Lance-HEAD-before-manifest-publish gap is covered by a `SidecarKind::Optimize` recovery sidecar (loose-match): a crash in that window rolls the compacted version forward on the next `Omnigraph::open` (compaction is content-preserving, so roll-forward is always safe). - **Requires a recovered graph.** `optimize` refuses (errors) when an unresolved recovery sidecar is present under `__recovery` — operating on an unrecovered graph could publish a partial write the open-time recovery sweep would roll back. Reopen the graph to run the recovery sweep, then re-run `optimize`. (Recovery roll-back now publishes its restored version, so a recovered graph always satisfies `manifest == Lance HEAD` going in; there is no leftover drift for `optimize` to interpret.)