mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-30 02:49:39 +02:00
* docs(dev): write-latency roadmap (validated cost model + layered fix)
Records the validated 6-LIST warm-write cost model, the two root causes
(un-GC'd _versions/; re-resolving latest by listing), and the layered fix
(GC + capture-once reuse), plus how commit-graph-table retirement feeds in.
Linked from docs/dev/index.md next to the RFC-013 docs.
* feat(engine)!: strand storage versioning — one internal-schema version, no in-place migration
Set MIN_SUPPORTED == CURRENT == 4: this binary reads exactly one `__manifest`
internal-schema version and refuses any older graph on open with a
rebuild-via-export/import message, instead of migrating it in place. Storage
format changes become a deliberate cutover, not a permanently-carried in-place
migration — the pre-release "complexity must be earned" contract.
Delete the entire in-place migration apparatus and everything that existed only
to support it: the `migrate_vN` arms + dispatcher + stamp-bump helpers + the
schema-version-floor tripwire; `migrate_on_open` (both open modes now refuse);
the legacy `_graph_commits.lance` readers + the v3 test fixtures + migration
tests + `migration.v3_to_v4.*` failpoints + the two surface guards that pinned
Lance variants only the migration matched on; and `state::merge_lineage_rows`.
Keep `read_stamp` / `stamp_current_version` / `set_stamp` /
`refuse_if_stamp_unsupported` — the seam a future one-shot converter plugs into.
`load_commit_cache_for_branch` now reads the `__manifest` projection
unconditionally (sub-v4 graphs are refused at open). Adds
`sub_current_graph_is_refused_on_open_with_rebuild_hint`.
The commit-graph TABLES are still created/used as branch-ref ledgers — their
retirement (CommitGraph -> pure `__manifest` projection) is the next commit.
BREAKING CHANGE: a graph created by omnigraph <= 0.7.2 (internal schema v3) is
refused on open. Rebuild it: `omnigraph export` with the old release, then
`omnigraph init` + `omnigraph load` with this one. Data, vectors, and blobs are
preserved; commit history and branches are not.
* feat(engine)!: retire `_graph_commits.lance` / `_graph_commit_actors.lance` — CommitGraph is a pure `__manifest` projection
Since RFC-013 Phase 7, graph lineage lives in `__manifest` (`graph_commit` /
`graph_head` rows) and branch authority is `__manifest` (branch create forks it
first). The two commit-graph datasets were vestigial: `_graph_commit_actors.lance`
was never written or read; `_graph_commits.lance` carried zero commit rows and
only mirrored the manifest's branch refs (a deny-list "parallel copy"). Retire
both.
- `CommitGraph` collapses to a pure projection: drops its Lance dataset handles
(`dataset`/`actor_dataset`) and all branch methods; `open`/`open_at_branch`/
`refresh`/`init` open NO dataset, building the cache from
`ManifestCoordinator::read_graph_lineage_at`. Removes ~1.4s of cold-open
dataset opens.
- `graph_coordinator`: `commit_graph` is now non-`Option` (always a valid
projection). `branch_create`/`branch_delete` go through `ManifestCoordinator`
only — a single atomic op, replacing the two-step manifest-fork +
commit-graph-fork + rollback. Deleted `create_commit_graph_branch`,
`reclaim_commit_graph_branch`, `ensure_commit_graph_initialized`, and every
`storage.exists(_graph_commits.lance)` gate.
- `optimize`: dropped `reconcile_commit_graph_orphans` and the two tables from
the internal-table compaction set (now `__manifest` only).
- `instrumentation`: `INTERNAL_TABLE_DIRS` no longer lists the two tables.
- Fresh graphs create neither table; `lineage_projection.rs` now asserts both
`.lance` dirs are absent. Deleted the obsolete commit-graph-branch-race
failpoint tests + their failpoint names, and updated the `maintenance`
optimize tests (one internal table, not three).
Review-pass fixes folded in:
- Removed two stale `omnigraph.rs` in-source tests the prior run missed (a
disk-full link failure masked them): one asserting `open` probes
`_graph_commits.lance` (the exists-gate this commit removes) — it was masked
earlier by a disk-full link failure.
- Corrected src comments referencing deleted code (`migrate_v3_to_v4`,
`append_commit`/`append_merge_commit`, the three-internal-table list,
the `_graph_commits` reconcile owner) in publisher/recovery/optimize/recovery_audit.
- Narrowed `set_stamp_for_test` to `cfg(test)` (its only caller is the refusal
test) — removes a dead-code warning in the failpoints build.
Branch create/delete atomicity improves (single atomic `__manifest` op). No
behavior change for reads or branches.
Follow-up (separate commit): the now-always-0 `IoCounts::commit_graph_reads` test
counter + its `IOTracker`, threaded through ~11 cost-test files.
* feat: surface the internal-schema (storage-format) version to operators
After stranding storage versioning (a sub-v4 graph is refused on open), operators
could only discover the storage-format version by hitting a refusal. Surface it:
- `omnigraph version` prints an `internal-schema <N>` line (the binary's CURRENT
storage-format version).
- `omnigraph snapshot` includes `internal_schema_version` — the GRAPH's per-branch
on-disk stamp, read via the new `Omnigraph::internal_schema_version_of`.
- `GET /healthz` includes `internal_schema_version` (server-scoped: the binary's
CURRENT, alongside `version`/`source_version`).
Wire: re-expose `INTERNAL_MANIFEST_SCHEMA_VERSION` as `pub` on `db::manifest`;
add `internal_schema_version: u32` to `SnapshotOutput` + `HealthOutput`;
`snapshot_payload` takes the per-graph version (the `Snapshot` does not carry it),
threaded through the embedded CLI + server snapshot callers. `openapi.json`
regenerated (two added int32 properties). Extends the existing healthz / snapshot /
version tests.
* docs(engine): gate internal-schema version at the graph level; record the per-branch read gap
PR reviewers flagged that the open path validates only main's internal-schema stamp, so a branch read could decode a branch stamped outside this binary's range. The stamp is a graph-wide storage-format property (the upgrade path is a whole-graph export/import), so with one binary version every branch is always CURRENT; divergence needs concurrent multi-version writers, an unsupported topology already in one-winner-CAS territory. Gating per-branch would add a second __manifest open per non-main branch read to defend a state we do not support, unearned complexity that regresses the warm-read budget.
Keep the graph-level gate, document it at the code site (refuse_if_internal_schema_unsupported), and record the read-only residual hole as a known gap in invariants.md to close only when multi-version write topologies become supported. Also clarify the sub-floor rebuild message to say "export with the older omnigraph binary that created it."
No behavior change: HEAD already gated at the graph level.
* test(cost): remove the dead commit_graph_reads IO counter
Phase B retired _graph_commits.lance / _graph_commit_actors.lance, so no commit-graph dataset is opened and the commit_graph IOTracker term is structurally always 0. Remove IoCounts::commit_graph_reads, its total_reads() term, the commit_graph IOTracker in OpProbes, and the now-dead commit_graph_wrapper field on QueryIoProbes (it had no accessor — nothing ever attached it). Drop the 7 trivially-true assert_eq!(commit_graph_reads, 0) checks in warm_read_cost.rs and the debug-print refs in write_cost{,_s3}.rs.
Lineage and actor rows now live in __manifest (RFC-013 Phase 7), so the internal_table_scans_are_flat_in_history gate folds into the single manifest_reads flat-assertion — the manifest scan already covers them. Harness-only; no production runtime impact.
* docs: align with the commit-graph retirement + strand storage versioning
Update the always-loaded and user-facing docs to match the landed state: graph lineage lives in __manifest, the _graph_commits.lance / _graph_commit_actors.lance tables are retired, and storage is strict-single-version (no in-place migration — a sub-CURRENT graph is refused with an export/import rebuild).
Fixed stale claims in invariants.md (the migration/atomicity known-gap entry, the Truth Matrix branch-delete row, the read-path/optimize internal-table scope), lance.md (the migrate_v1_to_v2 PK bullet now reflects init-time set; removed the two deleted v3->v4 migration surface guards), testing.md (dropped the deleted migration failpoint tests; manifest-only internal-table term), writes.md (rewrote the Migration-code section to the strand model), storage.md / maintenance.md / constants.md (retired tables out of the layout, internal-table compaction scope, and the constants cheat-sheet), and AGENTS.md. Marked the retirement DONE in the RFC-013 handoff/roadmap and banner-noted the historical RFC analysis.
Added docs/user/operations/upgrade.md (the export/import rebuild recipe) and docs/dev/versioning.md (the four-axis compatibility policy: release lockstep / wire additive / storage strict-single-version / Lance pinned), cross-linked from the audience indexes and the AGENTS.md topic map, and rewrote the in-progress v0.8.0 release note for the strand model + version surfacing. check-agents-md.sh passes (65 links, 62 docs).
* test(manifest): cover the v3-refusal→export/import rebuild cycle and branch stamp inheritance
Two coverage additions from PR review (P1):
(a) sub_current_graph_is_refused_then_rebuilt_via_export_import — the full operator narrative in one flow: load → export → a sub-CURRENT graph (stamp rewound below CURRENT) is refused with the export nudge → fresh init + load(export) → data present and the rebuilt graph opens. The refusal is stamp-only (read before any data), so a stamp-rewound graph is a faithful stand-in for a real older-release graph without a second binary; vector/blob fidelity stays covered by tests/export.rs.
(b) branch_inherits_main_internal_schema_stamp — proves a branch cannot diverge from main's stamp under single-binary operation (create_branch forks main's __manifest, the publisher does not re-stamp), which is why the graph-level (main-only) stamp gate is sufficient for supported inputs. A divergent branch stamp needs concurrent multi-version writers, the unsupported topology recorded as a known gap.
649 lines
44 KiB
Markdown
649 lines
44 KiB
Markdown
# Handoff: finishing RFC-013 (write-path latency + correctness)
|
||
|
||
> **Status update (strand-and-retire work):** the `_graph_commits.lance` /
|
||
> `_graph_commit_actors.lance` datasets are **retired** (Phase B — lineage lives in
|
||
> `__manifest`; the `CommitGraph` is a pure projection). References to those tables
|
||
> below are historical: `optimize` now compacts **`__manifest` only** among the
|
||
> internal tables, and the per-write `_graph_commits` scan term is gone. See
|
||
> [versioning.md](versioning.md) and [invariants.md](invariants.md).
|
||
|
||
**Status:** living handoff. **Source of truth is [`rfc-013-write-path-latency.md`](rfc-013-write-path-latency.md)** —
|
||
this doc is the *current-state map + the decisions/validation from the latest work cycle
|
||
+ the concrete next actions*. When they disagree, the RFC wins (and fix this doc).
|
||
|
||
**Audience:** the engineer/agent who picks up RFC-013 next.
|
||
|
||
> **Two threads, one RFC.** RFC-013 has been worked in two overlapping lines by different
|
||
> cycles, with different sub-numbering. Don't let the numbering confuse you:
|
||
> - **Thread A — per-write call-count / RTT collapse + concurrency correctness** (steps 3b /
|
||
> 4 / 5, Design A / `PublishPlan`, #297 / #254 / #296 / #298). This is **§1–§10 below** (the
|
||
> original body of this doc). Read it for the concurrency model and the convergent fix.
|
||
> - **Thread B — `__manifest` scan amplification + the unbounded `_versions/` chain** (the
|
||
> investigation framed as **PR1 / PR2 / PR3 / PR4**). This is **§A below (read it first)** —
|
||
> it is the most recent cycle (2026-06-26) and where the live branch sits.
|
||
>
|
||
> They meet at the internal-table maintenance area (Thread A's "step 2a/2b" == Thread B's
|
||
> "compaction is done / PR1 = bound the chain"). §A maps the two framings.
|
||
|
||
---
|
||
|
||
## §A. Current state — the `__manifest` scan + version-chain thread (2026-06-26) — READ FIRST
|
||
|
||
The latest cycle attacked **write latency that grows with graph age on object storage** and a
|
||
**hard open failure on a large `_versions/` chain**. The working plan (survives context resets,
|
||
**not in the repo**) is `~/.claude/plans/in-the-mean-time-humble-reef.md` — pull it for the full
|
||
PR1/PR2/PR3/PR4 decomposition, the assumptions-validated-against-code list, and the critical-files
|
||
map. This section is the durable summary.
|
||
|
||
### A.1 The problem (root cause)
|
||
|
||
Two interacting terms, both centered on the internal `__manifest` Lance dataset (the cross-table
|
||
catalog; one dataset, 217 tables on a real graph):
|
||
|
||
- **Term 1 — repeated full `__manifest` scans per write.** Each read of `__manifest` was a bare
|
||
`dataset.scan()` (no filter/projection/index), cost **O(fragments F)**. A publish did **4** such
|
||
scans (OCC re-capture + `load_publish_state` + the `use_index(false)` merge-insert join +
|
||
post-publish read-back). `F` grows **+1 per write**, so per-write cost climbs with history.
|
||
- **Term 2 — unbounded `_versions/` chain.** `cleanup` version-GC **excludes** the internal tables
|
||
(`all_table_keys` is node/edge-only, `db/omnigraph/optimize.rs`), so `__manifest/_versions/`
|
||
grows without bound (768 objects measured). Lance **lists** that prefix on every open; once large,
|
||
a store like RustFS times out serving the list page → branch ops take minutes or **fail outright**.
|
||
Unfixable via the public CLI today (`cleanup --keep` can't target `__manifest`).
|
||
|
||
Validated three ways: a multi-agent workflow (readers + adversarial refutation), Lance 7.0.0 source,
|
||
and live branch-op probes on a RustFS mirror. **Key Lance fact (closed the open-path investigation):**
|
||
on standard S3-class stores (R2/RustFS, not S3 Express) Lance sets `list_is_lexically_ordered = true`,
|
||
so it **never** uses `latest_version_hint.json` — it always lists `_versions/`. So the version hint is
|
||
**not** a lever; only **bounding the chain (PR1)** fixes the open path. (Corollary: on production R2 the
|
||
open is one cheap list page regardless of chain length, so the chain barely affects R2 open latency —
|
||
the RustFS *failure* is a RustFS list-page limit; the R2 ~16s write latency is Term-1 fragment
|
||
amplification = PR2's target.)
|
||
|
||
### A.2 What is LANDED on `main`
|
||
|
||
- **Step 2a** — `optimize` compacts the internal tables too (`__manifest` / `_graph_commits` /
|
||
`_graph_commit_actors`), so a *periodically-compacted* graph keeps Term-1 flat. (Cleanup/version-GC
|
||
of them is the still-open PR1.)
|
||
- **Phase 7 / #299** (`1c5cb874`) — graph lineage lives in `__manifest` (`graph_commit` +
|
||
`graph_head:<branch>` rows in the same publish merge-insert; `_graph_commits` is now a projection;
|
||
v3→v4 internal-schema migration; schema-version floor). This removed the per-write commit-graph
|
||
scan and closed the manifest→commit-graph atomicity + commit-graph-parent-under-concurrency gaps.
|
||
**This is the base everything below builds on.**
|
||
|
||
### A.3 What is IN FLIGHT — branch `ragnorc/read-lance-table-docs` = **PR #307** (OPEN, base `main`)
|
||
|
||
Ten commits ahead of `origin/main`. PR #307 includes the scan-halving work, the fold
|
||
correctness fix, and PR2.1's ground-truth cost harness.
|
||
|
||
1. **PR2 — halve per-write `__manifest` scans** (`4ac3cde4` + tripwire `52a7e0cd`). Three moves:
|
||
- **#1a probe-gate the OCC re-capture** (`occ_snapshot_for_branch`, mirrors the read path's
|
||
`resolve_target_inner`): replace the cold re-scan with a cheap incarnation **probe**; reuse the
|
||
warm coordinator on match, cold-scan only on mismatch.
|
||
- **#1b in-memory post-publish fold** (`fold_inputs` in `publisher.rs`; `PublishOutcome.known_state`):
|
||
build the new `ManifestState` from `existing_versions ∪ pending rows` instead of re-scanning.
|
||
- **#1c projection** on `read_manifest_scan` (drop `base_objects` always, `object_id` off the
|
||
table-state path).
|
||
- Net: per-write `__manifest` scans **4 → 2**; the two inherent publisher scans
|
||
(`load_publish_state` + the `use_index(false)` merge-join) remain O(F).
|
||
2. **Fold correctness fix** (`5537cd95` test, `245cb26d` fix) — a reviewer (Cursor Bugbot High +
|
||
Codex P2) caught that `#1b`'s fold dropped a **same-version owner-branch handoff**: a `table_version`
|
||
row UPDATEd in place at the same Lance version with a new `table_branch` (merge-insert `UpdateAll`
|
||
on the deterministic `version_object_id`) was appended after `existing_versions`, and
|
||
`assemble_manifest_state` kept the stale first entry, so the warm coordinator held the wrong fork
|
||
until refresh. Fix: key the fold's version entries by `(table_key, table_version)` so a pending row
|
||
**replaces** the existing one (mirroring `UpdateAll`). Test-first repro in
|
||
`db/manifest/tests.rs::test_post_publish_fold_reflects_owner_branch_handoff` (red→green).
|
||
3. **PR2.1 — ground-truth cost harness** (`fd73f01b`, `59d9ff39`, `3cd2b2c1`, `383022e8`, `9f1e5b6e`).
|
||
Rebuilt `tests/helpers/cost.rs` on lance's IO-counted idiom (`incremental_stats()` deltas; one
|
||
`IOTracker` per class). Added `cost_harness(body)` / `GraphIoMeter`: it installs one `__manifest`
|
||
tracker **before the coordinator opens**, so the tracker rides every handle (init + each
|
||
post-publish reassignment at `db/manifest.rs:590`). `manifest_reads` is now **ground truth**
|
||
(handle-age-irrelevant), closing the blind spot where a per-op tracker installed at measure time
|
||
could not see reads on the long-lived warm handle. `last_manifest_reads()` dumps the read log for
|
||
`assert_io_eq!`-style failure diagnostics. Outside `cost_harness`, `measure` falls back to
|
||
fresh-per-op, so `write_cost_s3.rs` is untouched. (Kept the bespoke `PrefixCounter` for the
|
||
opener/scan split — lance does the same with `throttle_store`/`failing_store`, and the
|
||
request-log alternative would couple to unstable debug method-strings.)
|
||
|
||
### A.4 The accurate measurement (PR2.1's payoff — what it told us)
|
||
|
||
The old (fresh-only) harness **undercounted writes**: `#1a`'s probe rides the warm handle, and its
|
||
reads escaped the per-op tracker (they showed only as `version_probes=1`). Ground truth counts them
|
||
and reveals **a write's freshness probe does ~3 `__manifest` object-store RPCs** (a *read*'s probe is
|
||
a 0-IO cache hit). So, apples-to-apples (both ground truth), per-write `__manifest` ops:
|
||
|
||
| | depth 10 | depth 100 | slope |
|
||
|---|---|---|---|
|
||
| Pre-PR2 (4 cold scans) | 50 | 410 | +4/write |
|
||
| Post-PR2 (ground truth) | 28 | 208 | +2/write |
|
||
|
||
- PR2 roughly **halved** the per-write manifest work **and its growth slope** (+4 → +2/write).
|
||
- The **compacted/maintained floor is ~5 RPCs/write, flat in history** — the 3-RPC probe now dominates
|
||
it (it is O(1), not O(F)). So `#1a` made the OCC re-capture O(1), it did not make it free.
|
||
- For latency: a periodically-compacted graph has bounded, history-independent per-write manifest
|
||
cost; an unmaintained graph still grows at half the rate (PR1 flattens the residual). The probe and
|
||
RFC #7264 are the levers for the compacted floor. (The harness measures op *count*, the latency proxy
|
||
on object stores; the ~16s R2 figure is the open-path chain = PR1, separate.)
|
||
- Regression guard: `write_cost.rs::manifest_reads_capture_warm_probe` (fresh=11 vs ground-truth=14)
|
||
goes red if the ground-truth wiring reverts.
|
||
|
||
### A.5 What is LEFT (priority order) — Thread B
|
||
|
||
1. **PR1a — manual `__manifest`-only cleanup** *(available now, no new invariant; HIGHEST priority —
|
||
it is the only thing that fixes the hard open **failures**)*. Add `all_table_keys_internal()` +
|
||
`cleanup_internal_tables()` reusing the generic `cleanup_all_tables` loop (`optimize.rs`); refuse on
|
||
a pending recovery sidecar. Safe **only on a quiesced graph** (no concurrent writers → no Q8
|
||
resurrection race). Shrinks `_versions/` (768 → keep-N). This is RFC **step 2b's available half**.
|
||
Pair with a **V2-naming surface guard** (protects the one-page open fast-path).
|
||
2. **PR3 (the available half) — branch-op de-amplification.** Branch **merge** candidate-scoping
|
||
(avoid 3 full cross-branch snapshots + union-all-keys upfront, `exec/merge.rs`); **parallelize** the
|
||
branch-delete loop (`ensure_branch_delete_safe` snapshots every other branch — O(branches)). Each
|
||
per-branch scan is already cheaper post-PR2 (#1c projection).
|
||
3. **Design-gated / deferred:**
|
||
- **PR1b — the Q8 durable boundary watermark** for SAFE automated/cadenced GC under live writers
|
||
(Lance version create is a bare `PutMode::Create` with no monotonic guard → a stalled writer can
|
||
resurrect a GC'd version = silent lost write on R2/S3). Invariant-level, partially MTT-redundant.
|
||
**This is the same design point as Thread A's "step 2b / Q8 watermark" in §8.** Design deliberately
|
||
or wait for RFC #7264.
|
||
- **PR3 branch-delete O(1)** — needs a cross-branch dependency index (the `table_branch` dependency
|
||
is genuinely cross-branch with no index today).
|
||
- **PR4 / RFC #7264** — Lance native branch-aware `BatchCreateTableVersions`; manifest read → O(1),
|
||
per-write fragment append gone; retires most of PR1/PR2. Upstream-blocked.
|
||
4. **Low-leverage:** ~~retire the vestigial `_graph_commits`/`_graph_commit_actors` datasets~~ **DONE**
|
||
(Phase B of the strand-and-retire work — the tables are gone, branch authority is `__manifest`, the
|
||
`CommitGraph` is a pure `__manifest` projection). Still open: a bitmap index on `__manifest` (no builder
|
||
exists; `use_index(false)` means it can't serve the CAS join anyway — a `graph_head:<branch>` point-lookup
|
||
is the better variant).
|
||
|
||
### A.6 Critical files (Thread B)
|
||
|
||
- `db/manifest/state.rs` — `read_manifest_state` / `read_manifest_scan` / `assemble_manifest_state` (the
|
||
shared reduction both the fold and the scan feed).
|
||
- `db/manifest/publisher.rs` — `fold_inputs` / `PublishOutcome` / `is_owner_branch_handoff` (publisher.rs:267,
|
||
the same-version handoff the fold must honor) / the merge-insert CAS.
|
||
- `db/manifest.rs` — `commit_changes_with_lineage` (adopts the fold; `self.dataset = dataset` reassignment
|
||
at :590, the reason the cost tracker must be installed before open) + the probes.
|
||
- `db/omnigraph.rs` — `occ_snapshot_for_branch` (#1a), `resolved_branch_target`, `ensure_branch_delete_safe` (PR3).
|
||
- `exec/staging.rs` `commit_all`; `exec/merge.rs` (PR3); `db/omnigraph/optimize.rs` (`all_table_keys`,
|
||
`cleanup_all_tables` — PR1).
|
||
- `tests/helpers/cost.rs` (the harness), `tests/write_cost.rs` / `warm_read_cost.rs` / `write_cost_s3.rs`,
|
||
`tests/writes.rs` / `consistency.rs` / `composite_flow.rs` (must stay green).
|
||
|
||
### A.7 Gotchas (Thread B, learned this cycle)
|
||
|
||
- **A per-op object-store wrapper cannot see a long-lived handle's reads.** That was the measurement
|
||
blind spot. The fix is to install the tracker before the handle opens (`cost_harness`), not at measure
|
||
time. A write's warm-handle probe is **3 RPCs** that hid behind `version_probes=1`.
|
||
- **`cost_harness` must wrap the WHOLE test body** (the graph must open inside it), and the body future
|
||
must be **`Box::pin`-ed** — wrapping a whole test body in another async layer overflows the test
|
||
thread's stack (these cost tests already raise `recursion_limit`).
|
||
- **The fold must mirror merge-insert identity.** `version_object_id(table_key, version)` is
|
||
deterministic, so a same-version handoff is an in-place `UpdateAll`; the in-memory fold must key by
|
||
`(table_key, version)` and replace, or the warm coordinator desyncs from a fresh re-scan. The
|
||
byte-identity guard is `writes.rs::post_publish_fold_matches_fresh_reopen`.
|
||
- **`lance-io` `test-util`** is enabled in dev-deps (gives `IoStats.requests` + `assert_io_eq!`,
|
||
diagnostics only); production builds exclude dev-deps so they never see it.
|
||
|
||
### A.8 Immediate next action
|
||
|
||
The natural next PR is **PR1a** (no design gate, fixes the RustFS open failures). Run and confirm
|
||
the relevant test gate before starting or stacking that follow-up.
|
||
|
||
---
|
||
|
||
## 0. TL;DR — where we are and what's next
|
||
|
||
RFC-013 makes the write path fast **and** correct on object storage (217 Lance tables
|
||
under one `__manifest` catalog, on R2/S3). It is sequenced as steps; read §9 of the RFC
|
||
for the canonical list. Current reality:
|
||
|
||
**Landed on `main`:**
|
||
- **Step 1** — Tier-1 cost gate + the shared `helpers::cost` harness (#288).
|
||
- **Step 3a** — opener bypass: write opens go direct (`Dataset::open` by URI + version)
|
||
instead of the Lance-namespace builder (#288). **This already banked the dominant
|
||
depth win** — see §2 below; it reframes everything.
|
||
- **Step 2a** — internal-table compaction: `optimize` now compacts `__manifest` /
|
||
`_graph_commits` / `_graph_commit_actors` (#291). Plus the RFC latency-model
|
||
correction (#292).
|
||
- **Step 4 / Phase 7** — graph lineage moved into `__manifest` (#299 `1c5cb874`):
|
||
`graph_commit` + mutable `graph_head:<branch>` in the publish merge-insert,
|
||
`_graph_commits` now a projection. **The base for the live branch (§A).**
|
||
- **Optimize-vs-write race** — optimize survives a cross-process write race on the
|
||
same table (#297, **LANDED** — origin/main `6d4606a8`; see §6 for why it's not
|
||
redundant with Design A). Step 3b stacks on top of this.
|
||
|
||
**Open PRs (land these; relationships in §7):**
|
||
- **#296** `correctness-by-design-fix` — recovery roll-forward converges on a concurrent
|
||
manifest advance (the fix for the flaky `iss-schema-apply-reopen-recovery-race`).
|
||
**MERGED to main and integrated into this branch** — the converge helper now threads
|
||
Phase-7's manifest-CAS recovery `graph_commit_id` (see `converge_or_defer_roll_forward`).
|
||
- **#295** `docs/rfc-013-step-3b` — the step-3b RFC doc.
|
||
- **#254** `ragnorc/bug-4-schema-apply-occ` — schema-apply vs optimize false-fail
|
||
(same op-class family as #297, logical side).
|
||
|
||
**Step 3b is DONE** (capture-once `WriteTxn`, schema-once + open-collapse; see §4) on
|
||
`rfc-013-step-3b-writetxn-v2`. **Phase 7 (step 4) has since LANDED on `main` (#299 `1c5cb874`)**
|
||
— lineage now lives in `__manifest` (see §A.2). **Next for Thread A: the big one — Design A /
|
||
`PublishPlan` unification (step 5)** — see §5, the convergent fix for the bug *class* this area
|
||
keeps generating, which also absorbs 3b's deferred session-aware write opens. **Next for Thread B
|
||
(the live branch, §A): PR1a** (manual `__manifest` cleanup — fixes the RustFS open failures).
|
||
|
||
---
|
||
|
||
## 1. The corrected mental model (read this before touching anything)
|
||
|
||
Three reframes from the latest cycle that the older RFC prose may not fully reflect:
|
||
|
||
### 1a. 3a already won the depth fight → the residual is constant-factor + RTT
|
||
Before 3a, the write re-opened each table through the lance-namespace builder ~13×, and
|
||
that path was **O(depth)** (it re-opened `__manifest` + `list_table_versions` per open —
|
||
**not** a Lance back-walk; the root cause was OmniGraph's own namespace round-trips, not
|
||
Lance — validated against Lance source). 3a swapped it for the direct opener, which is
|
||
**O(1)** (`from_uri(loc).with_version(N)` = arithmetic path + one HEAD). So:
|
||
|
||
- The dominant **O(depth) data-table** term is **gone**.
|
||
- Step 2a flattened the secondary **internal-table** scan term.
|
||
- What remains is the **~110-hop serial backbone × RTT + compute** — a constant in
|
||
depth. The latency model is **`wall = (serial_hops + ops/effective_concurrency)·RTT
|
||
+ compute`**; on a capped store (R2) the op-count term re-enters wall-clock, on an
|
||
unlimited store it parallelizes away. Measured: prod one-row write 27→15.76s after
|
||
2a; the remaining 15.76s is the serial backbone — **step 3b's target**, not step 2's.
|
||
- Step 3b's win is therefore the **call-count/RTT collapse** (redundant opens, the
|
||
flat-46 schema reads), NOT a depth slope. Don't expect a depth-slope improvement from
|
||
3b; gate it on the constant-factor (S3 round-trips), not a curve.
|
||
|
||
### 1b. Two op classes, two commit models (the §6.6 principle)
|
||
Every concurrency bug in this area is **one op class using the other's commit model**:
|
||
|
||
| class | examples | commutes? | correct commit model |
|
||
|---|---|---|---|
|
||
| **maintenance** | compaction (`Rewrite`), `optimize_indices` | yes (content-preserving) | Lance native rebase + app reopen/replan on real overlap + **monotonic manifest fast-forward** — no epoch, no read-set |
|
||
| **logical mutation** | load / mutate / merge / delete | no (lost-update, write-skew) | strict cross-process OCC: read-set + write-set CAS under the `writer_epoch` fence |
|
||
|
||
Applying strict OCC + equality-CAS uniformly is the mistake: too strong for maintenance
|
||
(false conflicts — #297's bug), too weak for logical cross-process (§6.5 corruption).
|
||
|
||
### 1c. The root liability (what keeps generating these bugs)
|
||
Lance gives **per-table atomic commits** but **no cross-table/cross-step atomicity**, so
|
||
every multi-commit op advances per-table Lance HEAD **before** the manifest references it
|
||
(the "A-before-B window"). The resulting `HEAD vs manifest` delta is **ambiguous**
|
||
(external drift? my own in-flight work? a crashed writer?), and **many uncoordinated code
|
||
paths each re-interpret it** (4 writers + the maintenance path + recovery + the write-path
|
||
drift guard). Each interpreter is a fresh chance to misclassify. That is the bug class:
|
||
- §6.5 cross-process logical corruption,
|
||
- #297's own-HEAD-drift misclassification,
|
||
- the flaky write-path "HEAD ahead of manifest, run repair" guard,
|
||
- the recovery classifier edges.
|
||
|
||
**The convergent fix is Design A (one publish authority — step 5); Lance MTT eventually
|
||
retires the window entirely.** See §5.
|
||
|
||
### 1d. The second facet: the write base is a stale pin (no probe)
|
||
The READ path resolves its base behind a freshness probe (`resolve_target_inner`
|
||
omnigraph.rs:~1072 → `probe_latest_incarnation` → `refresh_manifest_only`); the WRITE path
|
||
does NOT (`resolved_branch_target` omnigraph.rs:~778 returns the warm `coord.snapshot()` for
|
||
the bound branch, no probe). So a long-lived server's write base lags the live manifest. That
|
||
single staleness feeds **two distinct failure modes**, both surfaced this cycle:
|
||
|
||
1. **Stale validation *reads* → integrity under-enforced.** Write-path RI checks read
|
||
committed state off the stale base. 3b's collapse #1 made it worse for edge `@card`:
|
||
`edge_cardinality_read_handle` (mutation.rs:~614) scans the pinned `txn.base` instead of
|
||
live HEAD (was live HEAD pre-3b), so a concurrent edge committed after `txn` capture is
|
||
uncounted → a `@card` max can be exceeded (cursor **High** / codex **P1** on #298,
|
||
**VALID**). **#298 fix: restore the live-HEAD read for that scan** (un-regress; gate-safe —
|
||
the `data_open_count` gate is a node insert) + a deterministic regression test (commit A's
|
||
edge, then B validates → must see A) + correct the wrong "pinned base == live HEAD" doc
|
||
comment (mutation.rs:~605-613, which assumes a single writer). The *structural* liability
|
||
underneath: there is **no unified write-validation read-set** — endpoint
|
||
(`ensure_node_id_exists`, warm `snapshot_for_branch`), cardinality (mutation: pinned
|
||
`txn.base`; loader: warm `snapshot_for_branch` — the SAME check forks per write path),
|
||
commit drift guard (live `fresh_snapshot_for_branch`), and uniqueness
|
||
(`enforce_unique_constraints_intra_batch`, intra-batch only — cross-version uniqueness is a
|
||
documented gap). Three freshness levels chosen ad hoc, none re-validated at commit → the
|
||
§7.1 TOCTOU class, and each new constraint forks the pattern again.
|
||
|
||
2. **Stale OCC *pin* → false-fail on a maintenance advance.** A served strict update/delete
|
||
pins the stale base version, then false-fails `ExpectedVersionMismatch` after an external
|
||
`optimize` advanced `__manifest` — even though the advance was content-preserving
|
||
compaction the logical write should fast-forward past (invariant 7). It's the **write-side
|
||
mirror of #297/§6.6** (#297 made optimize fast-forward past a logical write; this is a
|
||
logical write that must fast-forward past optimize). A served read clears it (the read
|
||
probes the shared coordinator). Validated repro on prod (omnigraph.ragnor.co) +
|
||
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes`
|
||
(`#[ignore]` on branch `fix/write-path-stale-view-probe`). **The naive "just probe" fix is
|
||
proven wrong** — a blanket probe silently refreshes past *logical* advances too, breaking
|
||
`consistency::stale_handle_public_mutation_must_refresh_then_retry` (the deliberate
|
||
cross-process lost-update OCC primitive). The fix must **discriminate by op class**.
|
||
|
||
**Both fold into Design A (step 5), same as §1c.** `open_txn`'s one warm probe makes the base
|
||
fresh (absorbs maintenance advances cheaply); the **op-class-aware strict precondition** —
|
||
derive from Lance's per-version transaction metadata (all `Rewrite`/`ReserveFragments` =
|
||
maintenance → fast-forward the pin; any `Append`/`Update`/`Delete`/`Merge` = logical → fail
|
||
loudly; NO parallel marker, invariant 1/15) — is the correctness fence for anything that lands
|
||
after. And the §7.1 read-set-in-CAS unifies the validation read-set + re-validates it under the
|
||
`graph_head` contention. So **the stale-view false-fail, the cardinality/validation-read-set
|
||
liability, and #297's mirror are one bug** (the write base is a stale, un-probed, un-classified
|
||
pin) with **one home: the single PublishPlan delta-interpreter** (§1c + §5). Strong corroboration
|
||
of Design A — three symptoms, one fix.
|
||
|
||
---
|
||
|
||
## 2. Validated facts — do NOT re-derive these
|
||
|
||
Established this cycle against **Lance 7.0.0 source**
|
||
(`~/.cargo/registry/src/index.crates.io-*/lance-7.0.0`) and current engine code. Cited so
|
||
you can trust them without re-investigating.
|
||
|
||
**Lance (upstream):**
|
||
- `from_uri(loc).with_version(N).load()` and `checkout_version(N)` are **O(1)** (computed
|
||
V2 path `_versions/{u64::MAX-N:020}.manifest` + one HEAD; no listing/back-walk).
|
||
(`lance-table/src/io/commit.rs` `default_resolve_version`.)
|
||
- A shared `Arc<Session>` (`DatasetBuilder::with_session`) warms metadata/index caches
|
||
keyed by `(URI, version, e_tag)`. Caveat: the *first* manifest read on open is uncached
|
||
— the Session warms the *scan/index* metadata, not the first open. **`WriteParams` *does*
|
||
carry a `session` field** (`lance/src/dataset/write.rs`), but it only matters on the
|
||
`WriteDestination::Uri` arm; OmniGraph's staged path always drives off an **already-open
|
||
`Dataset`**, and Lance takes the store/session from that handle. So to attach the shared
|
||
Session to a write base, open read-style (`open_table_dataset` → `from_uri().with_version()
|
||
.with_session()`) and drive the staged write off that handle.
|
||
- A held `Arc<Dataset>` at a pinned version is `Send + Sync`, immutable, safe to reuse for
|
||
many scans/count/staged-write base in one txn (OmniGraph's `TableHandleCache` already
|
||
relies on this).
|
||
- **No compaction `RetryExecutor`** (only Delete/MergeInsert/Update have one).
|
||
`commit_compaction` commits a fixed `Rewrite` via `apply_commit` direct. In
|
||
`commit_transaction`, a semantic `RetryableCommitConflict` **escapes the retry loop**
|
||
via `?` at `io/commit.rs:979`; the loop only retries the OCC `CommitConflict`
|
||
(`:1096`), and even that re-rebases the *same* transaction (never re-plans). ⇒
|
||
**compaction needs app-level reopen+REPLAN; you cannot "set conflict_retries" and let
|
||
Lance own it.**
|
||
- `check_rewrite_txn`: a `Rewrite` rebases **cleanly** past a concurrent `Append`/disjoint
|
||
`Update`/`Delete` (preserving both); only a same-fragment overlap yields a retryable
|
||
conflict. ⇒ the common concurrent insert/update/delete is rebased for free; the app
|
||
retry fires only on real overlap.
|
||
|
||
**Engine (internal):**
|
||
- Read path (post-#268) already has the capture-once machinery: `Snapshot` (`db/manifest.rs`),
|
||
warm `GraphCoordinator` behind a `latest_version_id`/incarnation probe, a held
|
||
`TableHandleCache` keyed `(table,branch,version,e_tag)`, **one shared `Session` per
|
||
graph** (`read_caches.session`). **Writes bypass all of it by construction**
|
||
(`resolved_branch_target` returns `read_caches: None`; the 3a write opener attaches no
|
||
session and opens by latest, not pinned version).
|
||
- A single write opens each table **3–4×** (accumulation → staging reopen → commit
|
||
drift-guard → publish prepare), each a fresh cold open. `validate_schema_contract`
|
||
(`db/schema_state.rs`, via `ensure_schema_state_valid`) runs uncached (~3 `read_text`
|
||
+ 2 `exists`) at every resolve point (~the flat-46). Both are constant-factor, flat in
|
||
depth — 3b's targets.
|
||
- Strict-op guards are the lost-update floor (3 layers: pre-stage `ensure_expected_version`
|
||
`table_store.rs`; commit-time strict drift `exec/staging.rs`; publisher CAS
|
||
`publisher.rs`). Capture-once **supplies** the pinned operand — never remove a guard.
|
||
- Fork-on-first-write authority reads (`classify_fork_ref` → `fresh_snapshot_for_branch`)
|
||
must stay **fresh** (not served from a pinned base).
|
||
- Cost harness: `helpers::cost` (`measure`/`measure_with_staged`/`IoCounts`/`assert_flat`/
|
||
`local_graph`/`s3_graph`). The schema-once assert can reuse `CountingStorageAdapter`
|
||
(`warm_read_cost.rs::warm_query_validates_schema_contract_once`) with **zero** prod
|
||
change; an open-count assert wants a small `open_count` AtomicU64 in `QueryIoProbes`
|
||
(copy the `probe_count`/`record_probe` pattern). The forbidden-API guard
|
||
(`tests/forbidden_apis.rs`) makes an instrumentation-level counter complete.
|
||
|
||
---
|
||
|
||
## 3. The #297 cycle (this branch) — what it is, and the lesson
|
||
|
||
`fix-optimize-concurrency-race` (5 commits): a CLI `optimize` racing a served write on the
|
||
same table failed (Lance Rewrite lost, or the equality-CAS publish lost). Fix: unify both
|
||
compaction paths on the internal path's **reopen+replan** shape, with a **two-level retry**
|
||
— outer loop reopens+replans on a real Lance overlap; inner Phase-C loop makes the manifest
|
||
publish a **monotonic fast-forward** (advance to compacted version `N`, or no-op when the
|
||
manifest already moved to `≥ N`), never the strict equality CAS. Sidecar written once;
|
||
in-process queue kept as a contention reducer (not the cross-process guard); no `writer_epoch`.
|
||
|
||
**Two review rounds surfaced two follow-on bugs I introduced with the retry loop** — both
|
||
fixed, both regression-tested (own-HEAD-drift via negative control):
|
||
1. **Own-HEAD-drift misclassification** (`56d004e0`): the drift guard re-ran every
|
||
iteration and, after a partial Phase-B commit (auto_cleanup strip or compact, then a
|
||
later op conflicts), saw `HEAD > manifest` from *our own* covered work and deleted the
|
||
sidecar + returned `skipped_for_drift` (stranding uncovered drift). Fix: track
|
||
`head_advanced`; the drift guard fires only when `!head_advanced`.
|
||
2. **Publish exhaustion spurious error** (`e9d16a2c`): the publish loop returned `Err` on
|
||
its final retry even if the conflict meant a concurrent writer already published `≥ N`
|
||
(postcondition met). Fix: re-check `current >= state.version` on exhaustion.
|
||
|
||
**The lesson (write it on the wall):** *wrapping a sequence of side-effecting commits in a
|
||
retry silently converts every "checked once, before any side effect" precondition into
|
||
"re-checked after partial side effects."* That's a distinct bug class; it needs
|
||
fault-injection tests **at each commit boundary**, not just end-to-end concurrency tests.
|
||
(The `optimize.before_compact` / `optimize.inject_reindex_conflict` failpoints exist for
|
||
exactly this.)
|
||
|
||
**Temporary mechanism flag:** `head_advanced` is an in-memory proxy for "is this HEAD
|
||
movement mine." Under Design A the authority answers that from the plan/sidecar **identity**
|
||
— so `head_advanced` is the part that gets *replaced*, while the monotonic-publish +
|
||
reopen/replan **semantics** are permanent. (Noted in RFC §6.6.)
|
||
|
||
---
|
||
|
||
## 4. DONE: Step 3b — capture-once `WriteTxn` (shipped on `rfc-013-step-3b-writetxn-v2`)
|
||
|
||
**Delivered:** on the **table-touch hot path**, a single `mutate`/`load` validates the schema
|
||
contract **once** and opens each touched data table **at most once** — a constant-factor/RTT
|
||
win (not a depth-slope win; 1a). Two cost gates in `write_cost.rs` lock it (both on a node
|
||
insert): `write_validates_schema_contract_once` (3 `read_text` / 2 `exists`, was 12/9) and
|
||
`keyed_insert_opens_table_at_most_once` (`data_open_count <= 1`, was 4). The carrier is the
|
||
minimal `WriteTxn { branch, base }`, threaded as `Option<&WriteTxn>` (`Some` on the hot
|
||
mutate/load path, `None` byte-identical everywhere else); it **converges into** step 5's
|
||
`PublishPlan`.
|
||
|
||
**Not "once" everywhere (scope, not regression):** edge endpoint / cardinality RI validation
|
||
(`ensure_node_id_exists`, the loader's RI + cardinality) still resolves through
|
||
`snapshot_for_branch` and re-validates the schema — and reads **warm**, not live. Threading
|
||
`txn.base` there to make it "once" would re-introduce the stale-read class the #298 cardinality
|
||
fix removed (it now reads live HEAD). Doing schema-once *and* fresh reads for those validations
|
||
needs the unified, re-checked read-set — **step 4 §7.1** (§1d). So #298 **un-regresses
|
||
cardinality only; it does not close write-validation freshness.** No edge-insert/load schema-once
|
||
gate yet (only the node gates above).
|
||
|
||
Commits (off merged-#297 main):
|
||
- **Stage 0** — scope `open_count` → `data_open_count`/`internal_open_count` by URI class
|
||
(the review fix: `open_dataset_tracked` also opens `__manifest`/`_graph_commits`, so the
|
||
raw counter conflated them and the gate was unreachable). Re-baselined RED 4.
|
||
- **Commit A (schema-once)** — capture `txn` once at entry (the single validation); the 4
|
||
validation sites collapse: S1 (entry `ensure_schema_state_valid`) removed; S3a
|
||
(`open_for_mutation_on_branch`) + S3b (`prepare_updates_for_commit`) source `txn.base`;
|
||
S4 (`commit_all`) uses new `fresh_snapshot_for_branch_unchecked` (the OCC manifest re-read
|
||
minus the schema re-validation). `fresh_snapshot_for_branch{,_unchecked}` now read the
|
||
manifest directly via `ManifestCoordinator` (drops a spurious commit-graph `exists` probe;
|
||
same `Snapshot`).
|
||
- **Commit B (open collapse 4→1)** — #1 accumulation open ELIMINATED (the node path discarded
|
||
the handle; read `txn.base.entry().table_version`); #2 staging open KEPT (the one open);
|
||
#3 commit drift-guard reads live HEAD via `entry.dataset.dataset().latest_version_id()` (a
|
||
cheap manifest-pointer probe off the staged handle, not a fresh open); #4 index build reuses
|
||
the `commit_staged` handle threaded through `CommittedMutation`/`prepare_updates_for_commit`.
|
||
- **Commit B.1 + cleanup** — named the two positional returns (`OpenedForMutation`,
|
||
`CommittedMutation`) + a `debug_assert` pinning the open-skip contract; **removed the
|
||
unearned `WriteTxn.session` field** (the collapse uses skip/probe/reuse, not a session).
|
||
|
||
**RFC §4.1 corrections — how they resolved:**
|
||
1. *Thread the evolving handle, not a version-keyed cache* → realized as collapse #4 (carry
|
||
the `commit_staged` handle forward into the index build).
|
||
2. *Don't forbid re-resolution* → honored: the commit-time OCC re-read
|
||
(`fresh_snapshot_for_branch_unchecked` — fresh manifest, only schema-revalidation dropped)
|
||
and the fork-authority reads stay fresh.
|
||
3. *Minimal carrier* → `WriteTxn { branch, base }` (even the `session` from the original
|
||
sketch was dropped as unearned).
|
||
|
||
**Deferred to step 5 (NOT in this PR):** session-aware write base opens. The one remaining
|
||
open (#2) stays a HEAD open; warming the shared `Session` across writes is an object-store
|
||
(S3) phenomenon invisible on local FS, so it earns its own `write_cost_s3.rs` gate in step 5,
|
||
where `txn` becomes the non-optional publish carrier. No new concurrency test was needed here:
|
||
#2 stays a HEAD open (no pinned+session base introduced), so the publisher CAS + #3 live-HEAD
|
||
probe fences are unchanged (covered by the green `writes.rs`/`consistency.rs`).
|
||
|
||
**Guardrails (don't regress):** schema validation is deliberately uncached for drift
|
||
detection — collapse to 1 *per write*, never cache across writes on a long-lived handle
|
||
(`lifecycle::long_lived_handle_rejects_schema_*`). The commit-time fresh read is OCC
|
||
machinery, not redundancy. Keep all 3 strict-op guards. Keep fork-authority reads fresh.
|
||
Pin the *correct* branch (server-bound-to-main writing a feature branch falls to a fresh
|
||
open). A branch `rfc-013-step-3b-writetxn` exists off an earlier main; rebase onto the
|
||
post-#297 main before starting.
|
||
|
||
---
|
||
|
||
## 5. Design A — the `PublishPlan` unification (step 5) = the convergent fix
|
||
|
||
**This is the real fix for the bug class in §1c.** Collapse the four hand-rolled writers +
|
||
the maintenance path into **one `publish(txn, plan)` authority** where the CAS + bounded
|
||
retry is **unconditional and unbypassable** (no caller can "hold the queue → skip the CAS").
|
||
Properties:
|
||
- **One interpreter of the `HEAD vs manifest` delta** — and "is this my work?" is answered
|
||
by the plan/sidecar **identity**, not a re-derived comparison. The own-HEAD-drift bug, the
|
||
§6.5 writers, the write-path guard — all close *by construction*.
|
||
- **Recovery = the same `PublishPlan` re-applied** — the crash-recovery interpreter and the
|
||
live interpreter become the same code (`iss-merge-recovery-partial-rollforward` gone).
|
||
- Each `TableAction` commits by its **class** (§1b): `Rewrite` = maintenance (Lance rebase
|
||
+ reopen/replan + monotonic fast-forward, **no epoch**); load/mutate = logical (strict OCC
|
||
+ `writer_epoch`).
|
||
|
||
**Why it composes with Lance MTT (don't over-build):**
|
||
- The **unification itself is convergent** — when MTT lands, it slots *underneath* the same
|
||
authority; nothing wasted. Build this.
|
||
- The **`writer_epoch`** is the one MTT-redundant piece (MTT's commit-handler lease subsumes
|
||
a cross-process fence). Build it *last and minimally*, gated on actually deploying
|
||
multi-writer topologies. Per the deny-list, don't reimplement what the substrate will own.
|
||
|
||
**Sequencing judgment (this cycle's strongest signal):** the bug density here (this PR alone
|
||
= 3 review rounds, all "a writer re-interprets the delta") means the current N-writers interim
|
||
is high integrated-over-time liability. **Consider pulling the *convergent half* of step 5
|
||
(the single authority + recovery-as-plan) forward — possibly ahead of 3b** — because it stops
|
||
the bug class rather than patching instances. #297 + #254 are the *de-risking inputs*: they
|
||
validate the maintenance-class and logical-class commit models in isolation first, so Design
|
||
A implements a known spec rather than designing under refactor pressure. Do NOT build more
|
||
substrate-shaped scaffolding (custom WAL / job queue / second coordination table) to paper
|
||
over the window — strictly higher liability than either Design A or waiting for MTT.
|
||
|
||
**Deeper-than-A (post-MTT or as Lance exposes uncommitted variants):** all-uncommitted-fragments
|
||
+ one manifest commit would shrink the A-before-B window itself, blocked today by Lance not
|
||
exposing uncommitted variants for `compact_files` / `optimize_indices` / vector index (#6666
|
||
open; delete #6658 shipped). Track, don't build yet.
|
||
|
||
### 5.1 Step-5 design constraints inherited from the #295 spec review
|
||
3b shipped a **minimal** `WriteTxn { branch, base }` (schema-once + open-collapse via
|
||
eliminate/probe/thread) and **deferred** the full §4.1 opener-unification — the pinned-base
|
||
opener, the shared-`Session` open, the write-local **handle cache**, and the strict-op
|
||
conflict-timing move — to step 5. So the greptile-bot comments on the #295 *spec* were **moot
|
||
for #298** (which built none of those constructs) but are **load-bearing constraints for step
|
||
5** when it builds them. Bank them:
|
||
1. **Handle cache must be `Send + Sync`** (`Mutex<HashMap<…, Dataset>>`, not `RefCell`) if
|
||
`WriteTxn::open(&self)` is shared across concurrent stage futures — a `RefCell` compiles
|
||
but panics when two stages poll. Or make it `&mut self` (no parallel-stage sharing). This
|
||
is the deny-list "in-process-only `Dataset` impls — `Send + Sync`" item.
|
||
2. **The strict-op timing move needs an explicit retry contract.** If step 5 moves
|
||
strict-op conflict detection from open-time `ensure_expected_version` to commit-time CAS
|
||
(the §4.1 pinned-base design), it MUST specify: the txn is **discarded after any commit**
|
||
(success or conflict — the handle cache is commit-invalidated), and the retry **re-opens a
|
||
fresh `WriteTxn` at the new HEAD** (never re-stages against the stale pinned base — that
|
||
reproduces the lost-update). **This is the same retry/refresh contract as the stale-view
|
||
false-fail (§1d.2)** — the op-class-aware precondition + "fresh base on retry" are one
|
||
design point. Today (#298) strict ops keep open-at-HEAD + `ensure_expected_version`, so the
|
||
contract is unchanged; step 5 owns it the moment it pins strict reads to the base.
|
||
3. **The opener-equivalence test must be non-trivial.** A differential test that only passes
|
||
when `HEAD == base` proves nothing about pinning. To actually prove "`WriteTxn::open`
|
||
returns the pinned base, not HEAD," the test must **advance the branch HEAD externally
|
||
(direct Lance write), then assert the txn open still reads the base version** — and that a
|
||
strict write then fails `ExpectedVersionMismatch` at commit (verifying the timing move).
|
||
|
||
---
|
||
|
||
## 6. Why #297 is still needed even if you do Design A
|
||
- Design A **relocates** #297's maintenance-class commit logic into the authority's
|
||
`TableAction::Rewrite` path; it does not eliminate it. #297 is the *validated spec + tests*.
|
||
- The two regression tests + §6.6 are the **contract** Design A must keep green.
|
||
- The prod bug is **live**; Design A is the largest write-path change in the RFC. Don't hold a
|
||
correctness fix hostage to a big refactor, and don't do a big refactor under bug-fix urgency.
|
||
- Genuinely throwaway under Design A: only the loop's *location* + the `head_advanced` proxy
|
||
(~a dozen lines). Everything else relocates or persists. **#297 LANDED.**
|
||
|
||
---
|
||
|
||
## 7. Open PRs and their relationships
|
||
- **#297** — maintenance-class fix (optimize vs write). **LANDED** (origin/main `6d4606a8`);
|
||
step 3b stacks on it.
|
||
- **#254** — logical-class fix (schema-apply vs optimize false-fail). Same op-class family;
|
||
both are de-risking inputs for Design A's per-class commit models.
|
||
- **#296** — recovery roll-forward converges on concurrent manifest advance. The fix
|
||
for the flaky `iss-schema-apply-reopen-recovery-race`. It touches `recovery.rs` and is
|
||
*aligned* with #297's "postcondition is the state, not winning the CAS" principle. **#296
|
||
landed on main first and is merged into this branch:** the converge helper
|
||
(`converge_or_defer_roll_forward`) was reconciled with Phase-7's manifest-CAS roll-forward —
|
||
on convergence the audit references the winner's folded `graph_commit_id` (the current
|
||
`graph_head`), not a freshly minted one.
|
||
- **#295** — the step-3b RFC doc (apply §4's three corrections to it).
|
||
|
||
---
|
||
|
||
## 8. Remaining RFC steps after 3b (RFC §9 is canonical)
|
||
- **#298 follow-up (do on the 3b PR, before merge): the edge-`@card` stale-read regression**
|
||
(§1d.1). Restore the live-HEAD cardinality scan, add the deterministic regression test, fix
|
||
the wrong doc comment. Small, gate-safe, un-regresses an integrity check (invariant 9). The
|
||
residual concurrent TOCTOU is the §7.1 gap (step 4) — un-widen here, don't over-reach.
|
||
- **Step 4 / Phase 7** (`iss-991`): **LANDED on `main` as #299 (`1c5cb874`).** Lineage now lives
|
||
in `__manifest` (`graph_commit` + mutable `graph_head:<branch>` in the same merge-insert;
|
||
`_graph_commits` is a projection). Removed the per-write `commit_graph.refresh`; closed the
|
||
manifest→commit-graph atomicity + commit-graph-parent-under-concurrency gaps. *(Historical note,
|
||
kept for the §7.1 framing it carried:)* it
|
||
carries the §7.1 *concurrent* write-skew fix (needs the `graph_head` contention row) —
|
||
**frame §7.1 as "unify the entire write-validation read-set" (endpoint + cardinality +
|
||
cross-version uniqueness), not merely "add `graph_head`"** (§1d.1): the bespoke
|
||
`edge_cardinality_read_handle` and the mutation-vs-loader freshness fork dissolve into one
|
||
pinned read-set re-validated under the `graph_head` contention, or the liability survives as
|
||
a second special-case.
|
||
- **Step 5 / Design A** — §5 above. **Acceptance item: the served-strict-write stale-view
|
||
false-fail** (§1d.2) — the op-class-aware precondition + `open_txn` probe. The contract is
|
||
two tests passing *together*: un-ignore
|
||
`writes.rs::served_strict_delete_after_external_optimize_advance_auto_refreshes` (goes green)
|
||
*while* `consistency::stale_handle_public_mutation_must_refresh_then_retry` stays green
|
||
(maintenance fast-forwards; logical fails loudly). Self-contained enough to ship standalone
|
||
like #297 if prod pain is acute; otherwise fold into the single PublishPlan delta-interpreter.
|
||
- **Step 2b** — internal-table cleanup + the Q8 monotonic watermark (a Lance boundary tag).
|
||
Deferred: only the secondary version-count/space term, touches the read/open path, and is
|
||
MTT-redundant. Land when version-count cost bites.
|
||
- **§7.1 sequential write-skew** (`iss-overwrite-orphans-committed-edges`) — inbound-RI
|
||
validation on node removal; independent, ships anytime.
|
||
- **#20** — the prod per-write `storage.ops` span metric (RFC §5.3), still owed.
|
||
- Branch ops: Lance `Clone` for create (`iss-691`).
|
||
|
||
---
|
||
|
||
## 9. Gotchas / traps (learned the hard way)
|
||
- **In-process queue ≠ cross-process lock.** Any "I hold the queue → skip the retry/CAS"
|
||
reasoning is a bug across processes. This is the recurring trap.
|
||
- **Monotonic publish must be `≥`-conditional, never "no assertion."** The `__manifest`
|
||
merge-insert is unconditional `UpdateAll` keyed on `object_id` (`publisher.rs:379`), so
|
||
the equality (or monotonic) pre-check is the *only* guard — dropping it lets `UpdateAll`
|
||
regress a newer version = lost write.
|
||
- **The drift guard interprets an ambiguous delta.** Re-evaluating it in a retry over
|
||
self-mutated state is how #297's follow-on bug happened. Gate any HEAD-vs-manifest
|
||
interpretation on "have *we* committed yet."
|
||
- **`compact_files` fires Lance's auto_cleanup GC hook** (commits with
|
||
`skip_auto_cleanup=false`, no override) — optimize strips stale `lance.auto_cleanup.*`
|
||
config before compacting to stay non-destructive on upgraded graphs. The strip is a
|
||
separate commit (relevant to the partial-commit retry trap).
|
||
- **Lance rebases the common concurrent case for free** — so the data-table conflict usually
|
||
surfaces as the manifest fast-forward, not a Lance error. The Lance-Rewrite-overlap path is
|
||
rare and needs failpoint injection to test.
|
||
|
||
---
|
||
|
||
## 10. Verification (the gate)
|
||
- `cargo test --workspace --locked` — the canonical gate (matches CI).
|
||
- `cargo test -p omnigraph-engine --features failpoints --test failpoints optimize` —
|
||
the optimize concurrency/recovery tests.
|
||
- `cargo test -p omnigraph-engine --test write_cost` / `write_cost_s3` (bucket-gated) —
|
||
cost gates (3b adds the schema-once + open-count asserts here).
|
||
- `cargo test -p omnigraph-engine --test maintenance` — optimize/repair/cleanup.
|
||
- Re-read [`invariants.md`](invariants.md), [`lance.md`](lance.md), [`testing.md`](testing.md)
|
||
before each change (always-on requirement).
|
||
|
||
Lance source for re-validation:
|
||
`/Users/ragnor/.cargo/registry/src/index.crates.io-*/lance-7.0.0` (key files: `io/commit.rs`,
|
||
`io/commit/conflict_resolver.rs`, `dataset/optimize.rs`, `dataset/write/retry.rs`,
|
||
`dataset/builder.rs`).
|