mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-24 02:38:06 +02:00
* feat(engine): compact the internal __manifest/_graph_commits tables in optimize
`optimize` iterated node/edge catalog tables only, so the two internal system
tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and
were never compacted -- making every write's metadata scan O(fragments), which
grows forever on a long-lived graph (RFC-013 step 2).
`optimize_all_tables` now also compacts both internal tables via a new
`compact_internal_table`. They are not catalog-tracked (readers open them at
their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`:
compact in place, no manifest publish (nothing to publish to), no recovery
sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no
optimize_indices (they carry no Lance index, only object_id's unenforced-PK
metadata). No application lock: Lance's compact_files auto-retries its Rewrite
against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append
is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a
coordinator refresh afterwards makes the warm handle observe the compacted HEAD.
Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits
into __manifest -- a one-call throwaway for the full interim win; __manifest
compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup
(version GC) of the internal tables is deliberately NOT included here: it needs
the Q8 cleanup-resurrection watermark first (deferred).
maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds
optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar,
graph coherent for reads + strict writes after).
* test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance)
`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance
gate staged in PR #288. With internal-table compaction landed, a write's
__manifest/_graph_commits scan is flat in commit-history depth on a compacted
graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the
pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before
measuring and runs green every-PR.
* docs: RFC-013 step 2 internal-table compaction landed
- invariants.md: close the compaction half of the read-path-rederivation known
gap (optimize now compacts the internal tables; cleanup half still deferred).
- maintenance.md: optimize covers __manifest/_graph_commits (no publish, no
sidecar); not yet in cleanup.
- rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8
watermark, deferred — debated; MTT-overlap + hot-path liability).
- testing.md: the internal-table LOCK is now green every-PR.
* fix(engine): guard absent _graph_commits + always compact internal tables
Addresses PR #291 review findings:
- Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction,
but a graph can validly have none (the coordinator opens it as `Option`, gated on
`storage.exists`, for graphs predating the commit graph). `Dataset::open` on the
absent table errored and failed the whole optimize. Guard the `_graph_commits`
compaction with the same `storage_adapter().exists()` check the coordinator uses;
`__manifest` always exists so it stays unguarded. Regression test
`optimize_tolerates_absent_graph_commits_table` (empty graph so no publish
recreates the table before the guard).
- Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table
compaction for a schema with no node/edge types. Removed it so the internal
tables are compacted regardless of the data-table set.
- Codex (auto-cleanup, P1): documented — `compact_files` commits with a default
`CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override,
so on a graph storing an *on* auto_cleanup config the commit would fire version
GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are
safe; the only exposure is pre-fix upgraded graphs, identical to the existing
data-table optimize path, with step 2b's watermark as the comprehensive guard.
Added a comment in `compact_internal_table` recording this.
* fix(engine): retry publish on RetryableCommitConflict (compaction vs publish)
Step 2 compacts `__manifest` with no app-level lock (Lance OCC arbitrates,
validated against LanceDB + the lance-7.0.0 conflict resolver). compact_files'
`Operation::Rewrite` auto-retries 20x (CommitConfig default num_retries=20), so a
live publish usually wins the race and the compaction rebases. But the publish
runs its merge-insert with conflict_retries(0) = one rebase attempt; if the
compaction commits first AND the merge touched a fragment the Rewrite rewrote,
Lance preempts the publish with `Error::RetryableCommitConflict` — a DIFFERENT
variant from the row-level `TooMuchWriteContention` the publisher already retries.
Left unhandled, that surfaces a transient error to the caller, i.e. a maintenance
compaction (physical op) failing a live write (logical op) — invariant 7.
Map `LanceError::RetryableCommitConflict` to a new
`ManifestConflictDetails::RetryableCommitConflict` and treat it as retryable in the
publisher's outer loop (reload fresh state + re-merge), alongside
RowLevelCasContention. `ExpectedVersionMismatch` still propagates (a genuine
expectation break must not be blindly retried). This also hardens multi-process
concurrent writers generally, not just compaction.
Normal publishes are insert-only (new object_ids -> new fragments, disjoint from
rewritten old ones), so the conflict is rare; the guard covers the
same-fragment-update edge and multi-process writers. Unit tests in publisher.rs
pin the mapping + the retry-predicate contract.
* revert: publisher RetryableCommitConflict handling (it was the wrong side)
Reverts d138902e. Validated against lance-7.0.0: the publisher's merge-insert runs
with conflict_retries(0), and execute_with_retry converts an exhausted retryable
commit conflict to TooMuchWriteContention before the caller sees it
(write/retry.rs ~95-130). So map_lance_publish_error NEVER receives
RetryableCommitConflict from merge_rows — it receives TooMuchWriteContention, which
the publisher already maps to RowLevelCasContention and retries. The reverted
mapping was therefore dead on the real path and its unit test was synthetic.
The actual exposure is the *compaction* side: compact_files -> commit_compaction ->
apply_commit directly (no execute_with_retry), so a Rewrite-vs-Merge check_txn
conflict propagates raw and optimize can fail on a live graph. That is fixed
app-side in compact_internal_table in the following commit.
* fix(engine): make internal-table compaction correct by construction
Address three findings from review of the step-2 internal-table compaction:
- Non-destructive by construction: before compacting an internal table,
strip any stored `lance.auto_cleanup.*` config off it. `compact_files`
commits with a default `CommitConfig` (skip_auto_cleanup=false) and
`CompactionOptions` exposes no override, so on a graph created by an older
binary (on-by-default GC hook) the compaction commit would fire Lance's
auto-cleanup and silently prune `__manifest`-pinned versions. Current
binaries store no such config; the strip is the upgrade-path safety net so
`optimize` can never GC versions.
- App-level compaction retry: `compact_files` does NOT auto-retry a semantic
conflict against a concurrent live writer (Rewrite vs Update/Merge/Delete
propagates raw from apply_commit; Lance prescribes app-rerun). Wrap the
internal-table compaction in a bounded retry loop that reopens fresh and
replans on a retryable Lance conflict, so a maintenance compaction (a
physical op) never fails a live write (a logical op) — invariant 7.
- Compact all three internal tables, not two: `_graph_commit_actors` grows
one fragment per commit on the authenticated write path, the same O(depth)
scan as `__manifest`/`_graph_commits`. Drive the sweep from one
source-of-truth list with per-table existence guards (the two commit-graph
tables are optional). Make `graph_commit_actors_uri` pub(crate).
Tests: the `internal_table_scans_are_flat_in_history` LOCK now runs the
authenticated (actorful) write path so it covers `_graph_commit_actors` via
the shared commit-graph IO wrapper (new `commit_many_as`/`measure_insert_as`
helpers); `optimize_clears_stale_auto_cleanup_and_preserves_versions` pins
the non-destructive guarantee (config cleared + no version GC); a unit test
pins the retryable-conflict classifier; the empty-graph stats count is 7
(the actor table is created at init).
* docs: internal-table compaction covers all 3 tables, non-destructive, retried
Sync the RFC-013 step-2a section and the maintenance guide with the
correctness-by-design refinements:
- optimize compacts `__manifest`, `_graph_commits`, AND `_graph_commit_actors`
(the actor table grows on the authenticated write path).
- optimize is non-destructive by construction — it never GCs versions, and
strips stale `lance.auto_cleanup.*` config so an upgraded graph's commit-time
GC hook cannot fire during compaction.
- internal-table compaction rebases and retries against concurrent live
writers rather than failing the operator's optimize or the live write.
- the cost LOCK is the authenticated-path acceptance test.
* fix(engine): refresh coordinator after a config-strip with no compaction work
`compact_internal_table` returns early when `plan_compaction` finds no work,
but `clear_stale_auto_cleanup_config` may have already committed a config-strip
that advanced Lance HEAD. The early return skipped the coordinator refresh that
the successful-compaction path performs, leaving warm `__manifest`/commit-graph
handles pinned to the pre-strip version until the next read's version probe
healed them. No correctness bug (the probe self-heals, and a stale-handle write
would retry via publisher CAS), but the refresh makes coherence deterministic
rather than probe-dependent. Refresh iff the config-strip actually committed.
* docs(engine): correct compact_internal_table doc — compact_files does not auto-retry
The function doc claimed "Lance's compact_files auto-retries its Operation::Rewrite
against any concurrent writer" — wrong, and contradicting the is_retryable_lance_conflict
doc just below it and the explicit retry loop that exists precisely because compact_files
does NOT auto-retry semantic conflicts (Rewrite vs Update/Merge/Delete propagates raw
through apply_commit). Also move the orphaned description from above the retry-budget
const onto the function, and include the third internal table.
* test(engine): optimize must clear stale auto_cleanup on DATA tables too (red)
Regression test for a destructive bug on the data-table optimize path: on an
upgraded graph whose node/edge table still carries pre-v7 lance.auto_cleanup.*
config, `optimize`'s compact_files/optimize_indices commits fire Lance's version
GC and prune __manifest-pinned data-table versions. Mirrors the internal-table
auto_cleanup test on a Person table (force-repair realigns the config-induced
drift so optimize doesn't skip the table). Red against the current code: the
data-table path does not strip the config. The fix lands in the next commit.
* fix(engine): clear stale auto_cleanup on the data-table optimize path too
The auto_cleanup scrub previously only protected the internal tables; the
data-table path (optimize_one_table) ran compact_files/optimize_indices with a
default CommitConfig (skip_auto_cleanup=false) and no override, so on an upgraded
graph those commits could fire Lance's version-GC hook and prune __manifest-pinned
node/edge versions — making the "non-destructive" contract false for data tables.
Strip the config before the HEAD-advancing commits, capturing version_before first
so the strip's own commit still triggers the Phase-C manifest publish (no uncovered
drift). No retry loop needed: the data-table path holds the per-table write queue.
Covered by the existing Optimize recovery sidecar. Turns the prior commit's test green.
Also: switch clear_stale_auto_cleanup_config off the deprecated delete_config_keys
to update_config(None values), and correct two now-inaccurate doc comments —
compaction is "one or more content-preserving commits" (compact_files can emit a
ReserveFragments before the Rewrite), not "a single atomic commit"; the sidecar-free
property rests on content-preservation + read-at-HEAD, not single-commit atomicity.
* docs: optimize is non-destructive on all tables; correct atomicity/retry claims
- non-destructive guarantee now spans data + internal tables (the auto_cleanup
strip runs on both paths), not just the internal ones.
- "single atomic Lance commit" was inaccurate: compaction can emit a
ReserveFragments commit before the Rewrite; the no-sidecar property rests on
content-preservation + read-at-HEAD, not single-commit atomicity.
- "retries rather than failing" softened to the truth: a *bounded* retry on the
internal path; sustained contention surfaces a loud conflict error (bounded +
observable, not an infinite loop). The data path holds the per-table queue and
never contends.
1240 lines
81 KiB
Markdown
1240 lines
81 KiB
Markdown
# RFC-013: Write-path latency — capture-once `WriteTxn`, manifest-authoritative publish, bounded history, and a measured cost contract
|
||
|
||
**Status:** Proposed
|
||
**Author(s):** write-path latency investigation (handoff + multi-agent validation)
|
||
**Date:** 2026-06-19
|
||
**Audience:** engine / storage maintainers
|
||
**Builds on:**
|
||
[rfc-009-unify-access-paths.md](rfc-009-unify-access-paths.md) (`GraphClient` — embedded ≡ remote),
|
||
the query-latency work (PR #268, read-path warm-up — the read-side twin of this change),
|
||
the iss-991 handoff (manifest-authoritative graph lineage / Phase 7),
|
||
[writes.md](writes.md), [execution.md](execution.md), [invariants.md](invariants.md).
|
||
**Tracking (dev graph `modernrelay`):** primary `iss-write-s3-roundtrip-amplification`; depth term `iss-991`; substrate seam `iss-863`/`iss-864`; branch-create `iss-691`; recovery `iss-856`/`iss-recovery-sweep-live-writer-rollback`/`iss-merge-recovery-partial-rollforward`; MemWAL `iss-681`; read twin `gap-read-path-rederivation`.
|
||
|
||
> Status maintained by maintainers: `Proposed` while open, `Accepted` on merge.
|
||
|
||
---
|
||
|
||
## Summary
|
||
|
||
On object-store-backed clusters a single trivial write (one edge, one branch op)
|
||
issues **hundreds of mostly-sequential object-store round-trips**, and that count
|
||
**grows without bound with the graph's commit history**, so a long-lived graph
|
||
degrades to minutes per edge. The cost is invisible on a local filesystem
|
||
(µs/call) and to correctness tests (results are right, just slow), and it was
|
||
never measured because nothing in the suite counts *object-store round-trips per
|
||
logical operation*.
|
||
|
||
This RFC specifies the optimal write path from first principles — **a write is a
|
||
pure function of one version-pinned snapshot, published in a single
|
||
manifest-atomic CAS** — and the **cost contract that makes its O(1)-in-history
|
||
guarantee provable and non-regressable** (deterministic IO-counted tests on every
|
||
PR). It collapses four hand-rolled writers into one `GraphPublishAuthority`,
|
||
moves graph lineage into the manifest (so the per-write `_graph_commits` scan
|
||
disappears), brings the internal metadata tables into compaction (so the
|
||
per-write `__manifest` scan stops growing), takes recovery off the hot path, and
|
||
adds an epoch fence for multi-writer safety. None of it is a substrate rewrite —
|
||
the manifest-CAS model is already correct and is exactly what Lance native
|
||
multi-table transactions (lance#7264) will later formalize; this RFC builds the
|
||
seam to that future and pays down the write path onto it.
|
||
|
||
**The dominant fix is demonstrated, not proposed:** a one-line opener-bypass
|
||
prototype (open writes direct-by-URI instead of through the lance-namespace builder)
|
||
flattens the depth-dominant term `31 + 12·depth → flat 4` and cuts a depth-80 edge
|
||
**2.7×** (1618 → 593 ops), measured end-to-end and functionally correct on
|
||
main/branch/node paths (§2.4). It is shippable as a standalone PR first (§9 step
|
||
3a); the rest of the RFC is the constant-factor + correctness + internal-residual
|
||
work layered on the same seam.
|
||
|
||
---
|
||
|
||
## 0. Validation ledger (read this first)
|
||
|
||
Every claim is tagged: **[M]** measured by me this cycle, **[S]** verified in
|
||
v0.7.0 source (`file:line` given), **[U]** verified against upstream
|
||
Lance/LanceDB/SlateDB source or docs, **[G]** tracked in the dev graph (slug
|
||
given), **[I]** inferred/reasoned.
|
||
|
||
A correction from the originating handoff: it hypothesized that **Cloudflare R2
|
||
walks the full manifest listing on every open** (a prod-only amplifier absent
|
||
from AWS). **This is false for the pinned Lance 7.0.0 [U].** R2 is treated as
|
||
lexically ordered (`list_is_lexically_ordered = !is_s3_express`,
|
||
`lance-io/.../providers/aws.rs:183`), so R2 gets the O(1) head-only manifest fast
|
||
path, same as AWS; only S3-Express buckets are excluded, and even those are O(1)
|
||
via the v7 `latest_version_hint.json`. There is no R2-list config fix because
|
||
there is no R2-list problem.
|
||
|
||
**The depth term — corrected attribution.** Two measurements, one
|
||
instrumentation-blind, one complete:
|
||
|
||
*(a) IOTracker probe [M] — internal tables only.* A throwaway probe (the
|
||
`warm_read_cost` harness applied to a single insert to `main`, swept across
|
||
commit depth) counted the two internal tables: `__manifest` ≈ 14 + 2·depth,
|
||
`_graph_commits` ≈ 9 + 2·depth → ≈ 23 + 4·depth, `write_iops = 1`. **But this
|
||
probe is structurally blind to the write path's per-table *data* opens** — they
|
||
bypass the instrumented opener (`table_wrapper`), so it reports `probes=0` for the
|
||
data tables. It measured the *minority* of the cost.
|
||
|
||
*(b) Network-proxy measurement [M] — all RPCs, fresh graph.* A counting proxy in
|
||
front of `rustfs` (sees every object-store RPC, under `--mode merge` — the
|
||
production path), on a brand-new graph (400 seed nodes, one committing merge per
|
||
checkpoint), classified by S3 key:
|
||
|
||
| commit depth | data `_versions` | `__manifest` | `_graph_commits` | node (RI) | schema | TOTAL | `write_iops` |
|
||
|---:|---:|---:|---:|---:|---:|---:|---:|
|
||
| 0 | 31 | 29 | 13 | 6 | 46 | 156 | 1 |
|
||
| 5 | 121 | 44 | 23 | 6 | 46 | 268 | 1 |
|
||
| 10 | 181 | 59 | 33 | 6 | 46 | 358 | 1 |
|
||
| 20 | 301 | 89 | 53 | 6 | 46 | 538 | 1 |
|
||
| 40 | 541 | 149 | 93 | 6 | 46 | 898 | 1 |
|
||
| 80 | 1021 | 269 | 173 | 6 | 46 | 1618 | 1 |
|
||
|
||
Slopes: **data table +12/depth (~67%)**, `__manifest` +3/depth, `_graph_commits`
|
||
+2/depth → **TOTAL ≈ 156 + 18·depth**, `write_iops` flat at 1. The IOTracker probe
|
||
(a) saw only the +4/depth internal subset — blind to the data-table opens, the
|
||
dominant ~67%.
|
||
|
||
**Constant-factor finding [M]: the schema contract is a flat 46 reads/write** — not
|
||
depth-scaling, but **29% of the depth-0 cost (46/156)**, from
|
||
`validate_schema_contract` re-running uncached on every resolve (`omnigraph.rs:561`).
|
||
A depth-slope gate will *not* catch it; WriteTxn's resolve/validate-once kills it,
|
||
and the §5.1 fitness assert (`validate_schema_contract_calls == 1`) is what pins it
|
||
(constant-factor delta, §6).
|
||
|
||
The dominant term is **the written table's open routed through the lance-namespace
|
||
builder ~13× per write** — now source-traced. The **write** path opens via
|
||
`DatasetBuilder::from_namespace` (`namespace.rs:174`, from `open_table_head_for_write`
|
||
`table_store.rs:181` / `namespace.rs:544`). Lance's builder calls the namespace's
|
||
`describe_table` once and uses only `response.location` (`lance` `builder.rs:130-178`)
|
||
— but omnigraph's `describe_table` **opens the whole dataset** just to produce that
|
||
location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`), and `.load()`
|
||
then **resolves the latest version again** — a **double latest-resolution per
|
||
open**, ~13× per write, nothing cached. Crucially, latest-resolution is **not
|
||
inherently O(depth)**: the namespace path is O(depth) because it **misses the V2
|
||
lexical / `latest_version_hint.json` fast path** that the direct opener engages
|
||
(most likely because `load_table_from_namespace` attaches no shared `Session`/store
|
||
params, `namespace.rs:174` — inferred, not traced). The **read** path skips all of
|
||
it — `from_uri(location).with_version(N)`, one HEAD, O(1) — which is why reads are
|
||
flat (+12/depth on the data table, §0(b)). **Proven on omnigraph's own table [M]:**
|
||
a direct `Dataset::open` of the *same physical* 85-version edge table = **2 ops
|
||
(O(1))**, the `from_namespace` open of that identical table = the O(depth) sweep —
|
||
same bytes, two open paths. `checkout_version` is also O(1) — **exonerated**, not a
|
||
back-walk. So `from_uri().with_version(N)` is the O(1) primitive and step 3 makes
|
||
each open O(1) *intrinsically* (cleanup then becomes hygiene/interim, not
|
||
load-bearing for read cost — §2.3). **Mode-independent [U]:** `append` ≡ `merge` ≡ +12/depth, so §0(a)
|
||
measuring a single insert was *not* the defect — the defect is the namespace open
|
||
path, not the verb. **Using `from_namespace` per-open is a misuse of Lance's
|
||
design** (the namespace is a catalog/discovery layer — resolve once, then open the
|
||
dataset directly, `lance-namespace` `operations/index.md` **[U]**); the read path
|
||
already bypasses it (PR #268 Fix 2 — see §2.4).
|
||
|
||
**Corrected conclusion.** The depth blow-up is in omnigraph's DB layer and is
|
||
**data-table-dominated**: the redundant per-table opens (fixed by §9 step 3 —
|
||
WriteTxn open-once-by-pinned-version — plus scheduled *version cleanup* of the
|
||
node/edge tables) are ~70% of it; the uncompacted internal tables (§9 step 2) are
|
||
the secondary ~30%. Both the originating R2 hypothesis and the earlier "entirely
|
||
the internal tables" framing are wrong. The exact Lance call doing the data-table
|
||
chain re-read (`checkout_version` back-walk vs merge-insert conflict replay) is the
|
||
one unpinned item — see §12. Reads, by contrast, are flat in depth
|
||
(`warm_read_cost.rs`, PR #268). This is the O(history)-per-write →
|
||
O(N²)-cumulative behavior the production incident hit.
|
||
|
||
---
|
||
|
||
## 1. Problem & measurements
|
||
|
||
On object storage every call is a 10–100 ms RPC, there is no cheap stat, and
|
||
sequential RPCs serialize. A long-lived production graph on R2, originating handoff
|
||
**[M]**:
|
||
|
||
| operation | prod (R2) | local `file://` |
|
||
|---|---|---|
|
||
| one-edge `load --mode merge` → main | ~3 min (90 s workflow timeout) | <1 s |
|
||
| `branch create --from main` | 120 s | <1 s |
|
||
| one-row `load` → a branch | 204 s | <1 s |
|
||
| `branch delete` | 216 s | <1 s |
|
||
| warm read / `/healthz` | fast (0.2–2 s) | fast |
|
||
|
||
`iss-write-s3-roundtrip-amplification` **[G]** independently records the same:
|
||
cross-region single insert ~46 s, 5-node mutation ~110 s, vs ~390 ms for a
|
||
no-storage `/healthz`. Its acceptance criteria are this RFC's goal: *"a single
|
||
insert issues O(1)-to-few S3 round-trips, not O(number of tables); bulk mutations
|
||
amortize the manifest commit."*
|
||
|
||
The cost decomposes into terms; the dominant one scales with history (§0):
|
||
|
||
1. **Per-table opens through the O(depth) lance-namespace builder (DOMINANT,
|
||
O(tables × depth)).** Each stage opens via `DatasetBuilder::from_namespace`
|
||
(`namespace.rs:174`); its `describe_table` opens the whole dataset just to return
|
||
a location (`open_head` → `Dataset::open`, `namespace.rs:362`/`:112`) and
|
||
`.load()` resolves latest **again** — a double latest-resolution per open,
|
||
O(depth) on the repro store, ~13× per write with nothing caching it **[S]**
|
||
(§2.2). The read path's direct `from_uri().with_version(N)` is O(1). →
|
||
**+12 reads/depth, ~70% of the slope [M]**. Fixed by opening once, by pinned
|
||
version via the direct opener (§9 step 3); node/edge version *cleanup* bounds it
|
||
further.
|
||
2. **Per-write `__manifest` scan (O(history), secondary).** Every publish
|
||
full-scans the uncompacted `__manifest` (`load_publish_state` →
|
||
`read_manifest_scan`, `state.rs:133-141`) **[S]**; the internal tables are
|
||
never compacted/cleaned (`optimize` iterates node/edge only,
|
||
`optimize.rs:895-904`) **[S]**. +3.1 reads/depth **[M]**.
|
||
3. **Per-write `_graph_commits` refresh (O(history), secondary).**
|
||
`record_graph_commit` reloads the entire commit cache before each append
|
||
(`commit_graph.rs:136-164`) **[S]**; never compacted/cleaned. +2.1 reads/depth
|
||
**[M]**. The "read-path anti-pattern, now on writes" (`iss-991` handoff **[G]**).
|
||
|
||
Terms 2+3 are the secondary ~30%; term 1 dominates. Plus per-write fixed taxes: a `list_dir("__recovery/")` (`loader/mod.rs:197`,
|
||
`exec/mutation.rs:725`, `exec/merge.rs:1090`) **[S]**, and the publisher CAS
|
||
retry budget (`PUBLISHER_RETRY_BUDGET = 5`, `publisher.rs:51`) **[S]**.
|
||
|
||
Branch ops compound it: `branch create` is a per-table sequential fork loop
|
||
(`fork_branch_from_state`, `table_store.rs:282`); `branch delete` opens a
|
||
snapshot per *other* branch (`ensure_branch_delete_safe`, `omnigraph.rs:1317`)
|
||
and force-deletes per forked table sequentially (`cleanup_deleted_branch_tables`,
|
||
`omnigraph.rs:1359`) **[S]**.
|
||
|
||
---
|
||
|
||
## 2. Root cause (validated)
|
||
|
||
### 2.1 The write re-derives its world from storage every stage
|
||
|
||
`loader/mod.rs:400` captures a `snapshot` once, but downstream stages **ignore
|
||
it** and re-resolve **[S]**:
|
||
|
||
- `open_for_mutation_on_branch` (`table_ops.rs:505`) re-calls
|
||
`resolved_branch_target` **per table** (`:512`), which runs
|
||
`ensure_schema_state_valid` (a full schema-contract storage read with no cache,
|
||
`omnigraph.rs:561-568`) and then opens **by head** via
|
||
`open_dataset_head_for_write` (`:522`/`:559`), asserting head == pinned only
|
||
*after* the open.
|
||
- `fresh_snapshot_for_branch` (`omnigraph.rs:771`) always does fresh I/O; the
|
||
fork authority path re-reads the live manifest (`table_ops.rs:574`).
|
||
- The captured snapshot is used only for membership/fork checks, never for the
|
||
actual opens.
|
||
|
||
The drift guards, CAS retries, and recovery scans are **compensating machinery**
|
||
for the staleness this self-inflicts. The `Snapshot`/coordinator primitive
|
||
already exists; it is treated as cheap-to-reacquire rather than as the
|
||
operation's authoritative identity.
|
||
|
||
### 2.2 The depth terms — data-table re-reads dominate, internal tables secondary
|
||
|
||
Confirmed in code and measurement (§0). The **dominant** term is §2.1's per-table
|
||
opens: ~13 opens per write through the lance-namespace builder
|
||
(`DatasetBuilder::from_namespace`, `namespace.rs:174`). The builder calls the
|
||
namespace's `describe_table` (`lance` `builder.rs:130-178`), and omnigraph's
|
||
`describe_table` opens the whole dataset just to return a location (`open_head` →
|
||
`Dataset::open`, `namespace.rs:362`/`:112`); `.load()` then resolves latest again —
|
||
a **double latest-resolution per open**, O(depth) on the repro store — so cost
|
||
grows with the table's version count (+12 reads/depth, ~70%). The **read** path
|
||
opens direct `from_uri().with_version(N)` (`namespace.rs:112` / `SubTableEntry::open`)
|
||
— O(1) — and native pylance is flat 6 ops at any depth **[U]**, so this is
|
||
omnigraph's *namespace-open* pattern, not Lance; `checkout_version` is O(1) and not
|
||
implicated. (The heavier `list_table_versions` — `versions()` + a checkout per
|
||
version, `namespace.rs:395-427` — is **not** on this path; it is test-only today, a
|
||
separate latent O(depth): §10 follow-up.) The **secondary** terms are the two
|
||
internal tables: `load_publish_state` and
|
||
`commit_graph.refresh` each full-scan a table that gains a fragment per write and
|
||
is never compacted (+5 reads/depth, ~30%). This is the `gap-read-path-rederivation`
|
||
**[G]** failure mode — "cost grows with fragment count" — on the *write* path,
|
||
where PR #268 never reached. `invariants.md` documents the internal-table half:
|
||
*"the internal metadata tables (`__manifest`, `_graph_commits`) are still not
|
||
compacted, so the probe and refresh cost still grows with fragment count."*
|
||
|
||
### 2.3 The `skip_auto_cleanup` interaction — and compaction ≠ cleanup
|
||
|
||
v0.7.0 sets `skip_auto_cleanup: true` deliberately (`table_store.rs` 10 sites +
|
||
`publisher.rs:392`) **[S]** — load-bearing, because Lance 7's on-by-default
|
||
`auto_cleanup` would GC `__manifest`-pinned snapshot versions (`lance.md` audit)
|
||
**[U]**. Two distinct levers were turned off and must be replaced *separately*:
|
||
**compaction** (`compact_files`) rewrites small fragments into fewer larger ones
|
||
but does **not** prune old versions; **cleanup** (`cleanup_old_versions`) prunes
|
||
old versions. Measured on a ~85-version graph **[M]**: `optimize`/compaction
|
||
*added a version* (data-table reads 1035 → 1083, frags 81→1 — **no help** against
|
||
the depth term); `cleanup --keep 3` dropped it 1035 → 63 (89 versions pruned across
|
||
7 tables, **16×**). So only *cleanup* bounds the version-chain length. Note today's
|
||
`cleanup`/`optimize` cover **node/edge tables only** (the "7 tables"; internal
|
||
`__manifest`/`_graph_commits` are excluded, `optimize.rs:895-904` **[M]**) — so
|
||
bounding the internal +5/depth residual needs them **added** to the key set (§9 step
|
||
2's code change). Operationally: `cleanup` aborts on a remote store without `--yes`
|
||
(the
|
||
scheduled job must pass it). Relation to step 3: while the namespace open is still
|
||
on the write path, cleanup **caps** the dominant term — a real interim mitigation;
|
||
once step 3 opens direct-by-version (O(1) regardless of version count, §2.4),
|
||
cleanup is **storage hygiene + internal-table sprawl**, not load-bearing for read
|
||
cost. The correct replacement is *scheduled* compaction **and** version cleanup
|
||
(§9 step 2), **not** re-enabling `auto_cleanup`. Without it, version history (and
|
||
per-write cost) grows forever.
|
||
|
||
### 2.4 Lance namespace: proper use (why the fix is bypass, not patch)
|
||
|
||
The upstream Lance Namespace is a **catalog / discovery layer** — "table
|
||
discovery, resolving table locations, and coordinating commits" — whose intended
|
||
division of labor is *"the namespace provides basic information about the table,
|
||
[then] the Lance SDK … fulfill[s] the other operations"* (`lance-namespace`
|
||
`namespace/index.md`, `operations/index.md`) **[U]**. It is meant to be consulted
|
||
to *resolve a table once*, after which you operate on the `Dataset` directly — **not
|
||
consulted on every per-table open on a hot path.** `DatasetBuilder::from_namespace`
|
||
itself reflects this: it calls `describe_table` only to extract `location`, then
|
||
reduces to a `from_uri` builder (`lance` `builder.rs:130-178`). For a system that
|
||
*already holds* each table's location + version (omnigraph's `__manifest` does, via
|
||
`SubTableEntry`), routing per-open resolution back through the namespace is the
|
||
anti-pattern — and it aligns with this project's invariant 1 ("resolve latest state
|
||
through the substrate's cheap primitive instead of re-scanning") and the deny-list
|
||
"cold re-derivation on the hot path."
|
||
|
||
So the fix is **bypass, not patch**: open writes by URI + pinned version
|
||
(`from_uri(location).with_version(N)`) — exactly what the **read** path already does
|
||
(PR #268 Fix 2; the read path's own comment notes the namespace open "would
|
||
full-scan `__manifest` twice per open (`describe_table` + `describe_table_version`)"),
|
||
so this completes #268's open-by-location migration on the write side (§9 step 3).
|
||
The **custom namespace impl stays** — it is still the right home for legitimate
|
||
*catalog* operations (`describe_table` / `table_exists` / `list_table_versions` /
|
||
`create_table_version` / managed-versioning commit coordination); only the
|
||
per-open *resolution* leaves it. Two Lance facts make this safe and final: opening
|
||
by explicit version is `default_resolve_version` = a single HEAD, O(1) (`lance`
|
||
`commit.rs:939-981`), and Lance's own latest-resolution cost work (version-hint, PR
|
||
#6752) confirms the latest path is the expensive one to avoid. **Proven on
|
||
omnigraph's own table [M]:** a direct `Dataset::open` of the *same physical*
|
||
85-version edge table is 2 ops (O(1)), while the `from_namespace` open of that
|
||
identical table is the O(depth) sweep — so latest-resolution is not inherently
|
||
O(depth); the namespace path is O(depth) only because it misses the fast path the
|
||
direct opener engages (likely the un-threaded `Session`). Step 3 therefore makes
|
||
each write open O(1) on its own — so node/edge `cleanup` (§2.3) is an **interim
|
||
mitigation + storage hygiene**, not load-bearing for read cost once step 3 ships.
|
||
|
||
**End-to-end proof [M] — the one-line opener bypass, measured.** A prototype
|
||
patched `open_dataset_head_for_write` (`table_store.rs:174`) to open directly by URI
|
||
(bypassing `from_namespace` — exactly step 3 / Alternative B), rebuilt v0.7.0, and
|
||
re-ran the depth sweep on a fresh graph:
|
||
|
||
| depth | data `edgeVER` baseline | data patched | TOTAL baseline | TOTAL patched |
|
||
|---:|---:|---:|---:|---:|
|
||
| 0 | 31 | **4** | 156 | 121 |
|
||
| 10 | 181 | **4** | 358 | 173 |
|
||
| 20 | 301 | **4** | 538 | 233 |
|
||
| 40 | 541 | **4** | 898 | 353 |
|
||
| 80 | 1021 | **4** | 1618 | **593** |
|
||
|
||
The dominant data-table term collapses `31 + 12·depth → flat 4` (O(1) in history),
|
||
the total slope drops `+18/depth → +5/depth` (the residual +5 is exactly the two
|
||
internal tables — step 2's scope), and at depth 80 a single edge drops **1618 → 593
|
||
ops (2.7×)** from this one change alone, before step 2 / Phase 7. Functional
|
||
correctness verified on the hot paths: main edge merge, branch create + write +
|
||
read-back, node merge (managed-versioning still correct) — the direct opener already
|
||
handles `checkout_branch` for non-main, so the namespace layer was not load-bearing
|
||
for write correctness on these paths. **Caveat:** the prototype did **not** exercise
|
||
schema-apply, branch merge, fork-on-first-write to a new table on a branch, overwrite
|
||
mode, or concurrent writers — a production step 3 must pass the full
|
||
`merge_truth_table`/recovery/failpoint suite (the namespace may do
|
||
managed-versioning work that matters there). It proves the thesis + hot-path
|
||
correctness, not drop-in completeness.
|
||
|
||
**Step 2 also proven [M].** On the step-3-patched binary at depth ~87, compacting
|
||
the internal tables to 1 fragment each (content-preserving) collapsed their scans:
|
||
`__manifest` 285 → 32 (8.9×), `_graph_commits` 177 → 11 (16×); the step-3 data term
|
||
stayed flat at 4. So **both depth terms are now empirically eliminated** — a depth-87
|
||
single edge drops **~1720 → 198 ops (~8.7×; ≈258 s → ≈30 s at 150 ms/RTT)** with
|
||
both fixes. The internal term is **fragment-scan growth** (`read_manifest_scan` /
|
||
`commit_graph.refresh` read all fragments of the *latest* version), so the fix is
|
||
**compaction** (merge fragments) — distinct from the data table's version-chain term
|
||
that step 3 / version-cleanup handle. `optimize`'s `all_table_keys`
|
||
(`optimize.rs:895-904`) excludes the internal tables, so step 2 is a real code
|
||
change, not just scheduling.
|
||
|
||
---
|
||
|
||
## 3. First principles
|
||
|
||
On object storage the only objective function is **minimize the number of
|
||
*sequential* round-trips per logical operation, and make that number invariant to
|
||
graph age, history depth, and table count** — under the hard floor of SI,
|
||
durability, atomicity, and loud integrity. Three generating principles fall out,
|
||
each mapped to a validated failure:
|
||
|
||
1. **Pin once, derive the rest (MVCC / invariant 15).** A write is a pure
|
||
function of one immutable, fully-pinned snapshot
|
||
`{branch, manifest_version, per-table (location, version, e_tag), schema_hash,
|
||
writer_epoch}`, resolved exactly once; every stage reads only from it
|
||
(open-by-pinned-version, O(1), cacheable); the only contact with "current" is
|
||
the final CAS. → fixes §2.1.
|
||
2. **One source of truth, one commit (invariant 2).** Visibility + lineage +
|
||
version bumps are **one atomic manifest write**; the commit graph, indexes,
|
||
and topology are *projections* of the manifest, never second authorities to
|
||
keep in sync. → fixes the §2.2 `_graph_commits` term (iss-991 Phase 7).
|
||
3. **The plan is the contract (correct-by-construction recovery).** The writer
|
||
serializes its *complete* publish intent **before any HEAD moves**; the live
|
||
commit and crash-recovery execute the *identical* plan, so they cannot
|
||
diverge. → fixes the partial-publish bug class structurally
|
||
(`iss-merge-recovery-partial-rollforward`, PR #277).
|
||
|
||
The optimal single-edge write under these: **~2–3 sequential hops, O(1) in size**
|
||
— 1 warm probe (0 if the coordinator is unchanged), 1 parallel stage of fragment
|
||
writes, 1 manifest CAS — regardless of 5 tables or 500, 10 commits or 10M.
|
||
Lance's own `test_commit_iops` (read 1 / write 2 / stages 3) **[U]** proves the
|
||
per-table primitive already hits this; the job is to make the *graph* write
|
||
inherit it.
|
||
|
||
This is not speculative: it is exactly what the two reference object-storage
|
||
databases do. **LanceDB** threads a pinned `Arc<Dataset>` + shared `Session` and
|
||
commits with one CAS off a captured `read_version`, never re-resolving "latest"
|
||
under default consistency **[U]**. **SlateDB** captures a snapshot, treats a
|
||
monotonic-ID manifest (no pointer file) as the *sole* authority, commits with one
|
||
conditional-PUT, recovers on open (never per-write), fences with a monotonic
|
||
`writer_epoch`, and compacts on a schedule **[U]**.
|
||
|
||
---
|
||
|
||
## 4. Reference-level design
|
||
|
||
### 4.1 The interface — one publish authority, one declarative plan
|
||
|
||
The deepest structural flaw is **four hand-rolled writers** (`load_as`,
|
||
`mutate_as`, `apply_schema_as`, `branch_merge_as`), each re-implementing open →
|
||
stage → commit → sidecar → lineage. There is **one publish machine**; the verbs
|
||
are different declarative plans fed to it.
|
||
|
||
```rust
|
||
// The pinned, immutable operation identity — resolved ONCE.
|
||
struct WriteTxn {
|
||
branch: BranchRef,
|
||
base: PinnedSnapshot, // {manifest_version, per-table (loc,version,e_tag), schema_hash, writer_epoch}
|
||
session: Arc<Session>, // shared per-graph; warms metadata/index caches across opens
|
||
handles: HandleCache, // open-by-version; each table opened once, reused across stages
|
||
}
|
||
|
||
// A typed, declarative publish plan — the COMPLETE "what", built before any HEAD moves.
|
||
enum TableAction {
|
||
Append(Stream), Upsert(Batch), Overwrite(Image), Delete(Pred),
|
||
Fork { from_version: u64 }, Register(Schema), Tombstone,
|
||
}
|
||
struct PublishPlan {
|
||
base: PinnedSnapshot,
|
||
actions: Map<TableKey, Vec<TableAction>>,
|
||
lineage: GraphCommitIntent, // parent = base.head; rides the SAME manifest CAS (Phase 7)
|
||
expected: Expectations, // per-table versions + graph_head + writer_epoch
|
||
}
|
||
|
||
impl GraphPublishAuthority {
|
||
async fn open_txn(&self, branch: BranchRef) -> WriteTxn; // 1 warm probe
|
||
async fn publish(&self, txn: &WriteTxn, plan: PublishPlan) -> PublishedSnapshot; // stage∥ → 1 CAS
|
||
}
|
||
```
|
||
|
||
Properties that make it optimal:
|
||
|
||
- **Stages take `&WriteTxn`/`&PublishPlan`, never storage** — re-resolution and
|
||
open-latest are *unrepresentable*. Invariants 2/3/15 hold by construction.
|
||
- **The recovery sidecar *is* the serialized `PublishPlan`.** Phase C and
|
||
recovery both call `plan.apply()` — a merge that bumps tables A+B can never
|
||
roll A forward and silently drop B. The
|
||
`iss-merge-recovery-partial-rollforward` bug class is gone by design.
|
||
- **One CAS.** `publish` issues exactly one conditional `__manifest`
|
||
merge-insert carrying every touched-table version + the `graph_commit` /
|
||
`graph_head` lineage rows + the `writer_epoch` check.
|
||
- **Verbs are thin lowerings.** `load`/`mutate`/`schema apply`/`branch merge`
|
||
each build a `PublishPlan` and call `publish`. Four copies → one machine; the
|
||
public `load_as`/`mutate_as` API is unchanged (it lowers internally).
|
||
|
||
The cost contract becomes part of `publish`'s documented API:
|
||
|
||
> `publish(txn, plan)` costs `opens ≤ |plan.touched_tables|` (0 warm),
|
||
> `stages ≤ 3`, `manifest_ops = O(1)` — **invariant to history depth and table
|
||
> count.**
|
||
|
||
### 4.2 Supporting mechanics (each validated this cycle)
|
||
|
||
| Mechanic | Design | Validation |
|
||
|---|---|---|
|
||
| Open by pinned version | `from_uri(location).with_version(N)` + shared `Session` + warm handle cache — the O(1) opener *reads* already use (`instrumentation::open_table_dataset:112`, `SubTableEntry::open` `db/manifest.rs:200`). **NOT** the write path's `from_namespace` builder (`namespace.rs:174`), whose `describe_table` + `.load()` do an O(depth) double latest-resolution (§2.2 — the dominant cost), and **NOT** `open_dataset_at_state` (opens head then checks out, `table_store.rs:232`, not O(1)). | #0 **[S]** |
|
||
| Strict-op SI | Update/Delete/SchemaRewrite open by pinned version (consistent read base) and the publish CAS rejects a *same-table* advance. Insert/Merge rely on Lance's natural rebase. **Do not remove the open guards wholesale** — that is a silent lost-update. | #5 **[S]** |
|
||
| Fork × pinned-version | Fork already opens source at the pinned version and creates the target from it; the live-manifest authority re-read before fork stays (not defeated by the pin). | #6 **[S]** |
|
||
| Open-once via the direct opener (**THE dominant depth fix**) | Reuse is **intra-transaction** (open each table once, by pinned version, thread it — kills the ~13 namespace-builder opens, the O(depth) double latest-resolution / +12/depth term, §0/§2.2). A commit invalidates its own entry, so no cross-write warm cache. Thread the shared per-graph `Session` through write opens (it is *not* today — `load_table_from_namespace` attaches no session, `namespace.rs:174`). | #9 **[S]** |
|
||
| Lineage in the manifest (Phase 7) | Publish `graph_commit` + mutable `graph_head:<branch>` rows in the same `__manifest` merge-insert with a branch-head CAS; `_graph_commits` becomes a projection. Removes the per-write `commit_graph.refresh` and closes the "manifest→commit-graph atomicity" + "commit-graph parent under concurrency" gaps. | `iss-991` **[G]**, **[S]** |
|
||
| Bounded history (compaction **and** cleanup) | Bring the internal table(s) into the `optimize` loop AND schedule version *cleanup* of node/edge tables — compaction rewrites fragments, only cleanup prunes the version chain that §2.2's dominant term re-reads (§2.3). No blob/PK/CAS blocker (`__manifest` has no blob column, `state.rs:44-72`; the unenforced PK is orthogonal to a content-preserving Rewrite). Post-Phase-7 there is only **one** internal table to compact. | #8 **[S]** |
|
||
| Recovery off the hot path | Move the per-write `list_dir("__recovery/")` to coordinator-open + the CAS-conflict path, guarded by a sidecar-age grace window (the sidecar carries `created_at` micros + a ULID, `recovery.rs:762`/`:1522`). | #4, `iss-856`/`iss-recovery-sweep-live-writer-rollback` **[G][S]** |
|
||
| Epoch fence | Monotonic `writer_epoch` in `__manifest`, CAS-claimed at writer init, checked on every publish. Fences a whole zombie *writer* deterministically (no TTL); closes the multi-process exposure and the Lance-MTT TTL-lease gap. | SlateDB `FenceableTransactionalObject` **[U]** |
|
||
| Branch create | Lance `Clone` instead of the per-table fork loop (O(tables)→O(1) sequential). | `iss-691` **[G]** |
|
||
| Branch delete | Run the per-other-branch safety check and the per-table reclaim loops concurrently (`buffer_unordered`); read branch sets from in-memory coordinator state. | **[S]** |
|
||
|
||
---
|
||
|
||
## 5. The cost contract — measurement & enforcement
|
||
|
||
The bug class is invisible to correctness tests, to local-FS tests, and to
|
||
wall-clock benches. You can only prevent a regression in a quantity you **define
|
||
precisely, measure deterministically, and bound on every PR.** The quantity is
|
||
*sequential object-store round-trips per logical operation, as a function of
|
||
history depth and table count.* OmniGraph already has the correct pattern for
|
||
**reads** (`warm_read_cost.rs`, `IOTracker`, swept to depth 20); this RFC extends
|
||
it across the write/branch/open surface. This is exactly how Lance and SlateDB
|
||
enforce it **[U]**.
|
||
|
||
### 5.1 Tier 1 — deterministic IO-counted gate (every PR)
|
||
|
||
Ordinary `cargo test`, hermetic (in-memory / tempdir + `IOTracker`), no S3, no
|
||
wall-clock. Two shapes:
|
||
|
||
```rust
|
||
// (A) cost-invariant-to-HISTORY — the load-bearing gate. Gate the MERGE verb (the prod path).
|
||
for depth in [10, 100, 1000] { // REAL commit history, not row count
|
||
build_history(depth);
|
||
reset_counters();
|
||
let s = measured_merge(); // --mode merge, the read-modify production path
|
||
// PRIMARY — the dominant term (§0): the written table's data opens/reads, flat in depth.
|
||
assert!(s.data_table_opens <= touched_tables); // open each table ONCE, by pinned version
|
||
assert!(s.data_table_reads_per_open <= K_OPEN); // each open O(1) in the table's version count
|
||
// SECONDARY — internal-table scans flat in depth (compaction + cleanup).
|
||
assert!(s.manifest_ops <= K_MANIFEST); // small CONSTANT, NOT a function of depth
|
||
assert!(s.lineage_ops <= K_LINEAGE);
|
||
assert!(s.stages <= 3); // bounded sequential hops
|
||
}
|
||
assert_flat_across_depths(); // ALL terms — esp. data-table opens — flat in N
|
||
|
||
// (B) fitness functions — architectural invariants AS tests
|
||
assert_eq!(validate_schema_contract_calls(write), 1); // resolve-once
|
||
assert_eq!(coordinator_resolutions(write), 1); // O(1) resolution
|
||
assert_eq!(recovery_listdir_calls(steady_state_write), 0);
|
||
```
|
||
|
||
**Prerequisite, not a follow-up: route ALL opens (read + write) through the one
|
||
instrumented opener BEFORE the gate is meaningful.** Today the write path's data
|
||
opens bypass `table_wrapper` (the §0(a) blind spot), so a gate that asserts only
|
||
`manifest_ops`/`lineage_ops` would **pass a still-broken build** — one that
|
||
compacts the internal tables (§9 step 2) but keeps the dominant ~13× namespace-open
|
||
sweep (§2.2). The gate MUST count data-table opens/reads (the dominant term), which
|
||
requires the routing change first. The data term is **mode-independent** (append ≡
|
||
merge ≡ +12/depth **[U]**), so either verb exercises it; gate the **merge** verb
|
||
as the production path. **Fixture caveat [U]:** use *valid* edge endpoints — a
|
||
write to a non-existent endpoint fails RI validation and rolls back at ~192 ops
|
||
with **zero chain reads**, so a bad-endpoint fixture silently measures the rollback
|
||
path and would pass falsely.
|
||
|
||
The load-bearing rule both Lance and SlateDB mostly miss: **assert the constant is
|
||
flat across N, not just small at one N.** A shallow fixture cannot catch an
|
||
O(history) cost (the §0(b) table is the red baseline). Add a `num_stages`
|
||
(sequential-hop) assertion via a `ThrottledStore` wrapper (Lance's
|
||
`test_commit_iops` setup) so an O(N) listing also blows a wall-time budget.
|
||
|
||
### 5.2 Tier 2 — wall-clock trend (post-merge / nightly, never a PR gate)
|
||
|
||
A `ThrottledStore` criterion bench injecting cross-region RTT (50/150 ms/op — the
|
||
incident's regime) for single-insert and branch-op latency, with a threshold
|
||
alert (Bencher.dev `--err` / github-action-benchmark `fail-on-alert`). Both
|
||
reference DBs keep wall-clock out of the PR gate (too noisy on shared runners)
|
||
and use it only as a trend.
|
||
|
||
### 5.3 Close the loop — production metric
|
||
|
||
Emit `storage.ops` and `storage.stages` per logical operation as a span/counter
|
||
(cheap always-on atomics; the heavy per-table attributing wrapper stays
|
||
test-only behind a `test-util`-style feature, zero release cost). The number
|
||
asserted in CI is the number observed in prod — `iss-write-s3-roundtrip-amplification`'s
|
||
cross-region signal becomes a direct readout.
|
||
|
||
### 5.4 Process discipline — test-first for performance
|
||
|
||
Write the depth-sweep cost-budget test **first**: it goes **red today** (§0), the
|
||
WriteTxn + Phase-7 + compaction work turns it **green** (flat in N), and the
|
||
red→green is the proof. This is CLAUDE.md rule 12 applied to cost, and the
|
||
originating handoff's sequencing (§8/§9: land the tests before the fix so the win
|
||
is measured and locked). Add the policy (extend invariant 15 + testing.md "Cost
|
||
budget tests"): *any change touching the read/write/branch/open path MUST add or
|
||
extend a cost-budget test asserting the metric is flat at history depth.*
|
||
|
||
### 5.5 The correctness contract — concurrency tests (the safety twin of the cost gate)
|
||
|
||
The cost gate proves *fast*; these prove *safe*. §6.5's multi-writer cliff slipped
|
||
the suite for the same structural reason the latency bug did — **nothing runs the
|
||
schedule that triggers it**: the suite is single-process with the in-process queue
|
||
(the bug is cross-process), uses local/in-memory stores (no object-store
|
||
cross-process CAS), and its recovery tests cover restart-time sweep, not
|
||
live-writer rollback. **These four must land before `PublishPlan`/epoch merge
|
||
(steps 5):**
|
||
|
||
1. **Cross-process multi-writer on a real/emulated object store** (the *corruption*
|
||
case) — N independent engine **processes** writing the same `(table, branch)`;
|
||
assert all commit-or-cleanly-retry (no lost updates, no stuck "needs recovery,"
|
||
no HEAD-ahead-of-manifest). **A single-process failpoint test cannot reproduce
|
||
the corruption** (in-process degrades to clean OCC, §6.5) — this genuinely needs
|
||
a multi-process harness (empirically 1/12 today). State that so nobody writes a
|
||
single-process test expecting it to fail.
|
||
2. **Deterministic in-process interleaving (failpoint) — WRITTEN, passes [M].** Two→
|
||
eight handles, sleep failpoint at the `commit_staged`→publish window
|
||
(`loader/mod.rs:605`); resume losers and assert they retry cleanly. This
|
||
demonstrates the **benign** path (N=8 → 2 commit, 6 clean OCC retries) — it is the
|
||
regression guard for "in-process stays clean," *not* a reproduction of the
|
||
cross-process cliff.
|
||
3. **Live-writer recovery** (`iss-recovery-sweep-live-writer-rollback`) — a
|
||
concurrent open must not roll back a live in-flight publish (the grace window).
|
||
4. **Formal model** — a Quint/TLA+ model of `{two writers, interleave commit_staged
|
||
and manifest-CAS}` (`iss-934`); it finds the §6.5 cliff immediately.
|
||
5. **Cross-table write-skew — WRITTEN, red, and driven red→green in-process [M].**
|
||
Failpoint `loader.post_ri_pre_stage` (between RI-validation and staging): writer B
|
||
validates "Bob exists" and parks; writer A `overwrite`s `node:Person` dropping Bob
|
||
(non-cascading); B commits `Knows(Bob→Alice)` → committed orphan. The red test for
|
||
the §7.1 fix. **Acceptance is a single-process gate** — unlike the §6.5 HEAD-ahead
|
||
corruption (which genuinely needs the multi-process harness), this skew reproduces
|
||
*deterministically in one process*: the parked edge writer's snapshot really does
|
||
pin `edge:Knows:1` before the overwrite commits, so the overlap is real with two
|
||
in-process handles. The fix went red→green in-process behind a shared head row
|
||
(§7.1). Only #1–#4 (HEAD-ahead/epoch corruption) need cross-process scheduling.
|
||
|
||
Plus one **disambiguating run** owed (§6.5 confound): separate-handles in-process
|
||
on S3 — to confirm the corruption is the process boundary, not the store.
|
||
|
||
This mirrors the cost gate's discipline (assert across the dimension the suite
|
||
otherwise never exercises) — there, history depth; here, concurrent cross-process
|
||
schedules.
|
||
|
||
---
|
||
|
||
## 6. What is already right vs. the deltas
|
||
|
||
**Already correct — do not rewrite.** The in-memory `MutationStaging` accumulator,
|
||
the recovery sidecar mechanism, the per-(table,branch) write queue, D2, the sealed
|
||
`TableStorage` trait, and the read-path warm-up (PR #268) all stay. This is **not**
|
||
a substrate rewrite.
|
||
|
||
**One claim to soften — manifest-CAS is atomic *per publish*, not unconditionally
|
||
cross-table-serializing [M].** The manifest CAS (the reference impl of the
|
||
lance#7264 "Alternative A") makes each publish atomic and serializes any two writers
|
||
whose write-sets **share a `__manifest` row** — overlapping or same-table, which is
|
||
exactly why §6.5's same-table cases and the cascading-delete case retry cleanly. But
|
||
two writers touching **disjoint** tables write disjoint per-`object_id` rows, so Lance
|
||
sees no conflict and **both commit** (proven [M], §7.1). The genuinely-atomic
|
||
cross-table commit §13 contrasts with Delta is the **target** (§4.1's single
|
||
merge-insert over a shared head row), **not current state**. So "do not rewrite the
|
||
CAS" holds for the *commit primitive*, but the cross-table-serialization §7.1 needs
|
||
is a real addition (the shared `graph_head` row), not something the current CAS
|
||
already provides.
|
||
|
||
**The deltas (each a validated, localized gap):**
|
||
|
||
| # | Delta | Mechanism | Tracking |
|
||
|---|---|---|---|
|
||
| 1 | Snapshot re-derived per stage | capture-once `WriteTxn`, thread by ref | `iss-write-s3-roundtrip-amplification` |
|
||
| 2 | Write opens via `from_namespace` re-resolve the data-table ~13×/write, missing the fast path (**DOMINANT, +12/depth**) | open each table **once, direct `from_uri().with_version(N)`** (bypass namespace, §2.4) + shared Session | `iss-write-s3-roundtrip-amplification`, #0 |
|
||
| 3 | Lineage = 2nd authority, O(history) refresh (secondary) | Phase 7: lineage into `__manifest` | `iss-991` |
|
||
| 4 | `__manifest`/`_graph_commits` excluded from optimize/cleanup (`optimize.rs:895-904`; prototype pruned "7 tables" = node/edge only **[M]**) — the +5/depth residual after step 3 | **add them to `all_table_keys`** (a code change) + scheduled compaction/cleanup | `gap-read-path-rederivation` (write twin) |
|
||
| 5 | `list_dir("__recovery/")` per write | move to open + conflict, grace window | `iss-856`, `iss-recovery-sweep-live-writer-rollback` |
|
||
| 6 | 4 hand-rolled writers, commit↔recovery drift | one `PublishPlan` executed by both | `iss-merge-recovery-partial-rollforward` (PR #277) |
|
||
| 7 | No writer epoch (multi-process exposure) | `writer_epoch` in `__manifest` | — (new) |
|
||
| 8 | branch create = O(tables) fork loop | Lance `Clone` | `iss-691` |
|
||
| 9 | branch delete = sequential loops | concurrent `buffer_unordered` | — (new) |
|
||
| 10 | No write/branch cost gate (must count **data-table** opens; route all opens through the instrumented opener first) | Tier-1 IO-counted tests, merge verb | — (new) |
|
||
| 11 | Schema contract re-validated uncached per resolve (**flat 46 reads/write — 29% of depth-0 cost; constant, not depth**) | resolve/validate-once in `WriteTxn`; §5.1 `validate_schema_contract_calls==1` (the depth gate misses it) | `iss-write-s3-roundtrip-amplification` |
|
||
|
||
---
|
||
|
||
## 6.5 Concurrency correctness — the multi-writer cliff (proven [M])
|
||
|
||
The latency fixes are about *speed*; a separate, proven finding is about *safety*.
|
||
A multi-writer experiment **[M]** shows concurrent same-branch writers behave very
|
||
differently by topology:
|
||
|
||
| topology | concurrency | outcome |
|
||
|---|---|---|
|
||
| single server (shared in-proc queue, `loader/mod.rs:426`) | 12 | **12 / 12 commit** (clean) |
|
||
| in-process, separate handles, interleave failpoint at `commit_staged`→publish (`loader/mod.rs:605`) | 8 | **2 / 8 commit; the other 6 are clean retryable OCC** |
|
||
| multi-process (separate CLIs / S3, no shared queue) | 2 / 3 / 5 / 12 | **1 / N commit; the rest CORRUPT** |
|
||
|
||
**Two distinct failure modes — and the corruption is strictly cross-process:**
|
||
|
||
- **In-process → benign.** Even with *separate handles, no shared queue, high
|
||
contention*, losers fail with `stale view of 'edge:Knows': expected manifest table
|
||
version 5 but current is 7 — refresh and retry` — a **clean, retryable OCC
|
||
conflict; graph state stays consistent.** The publisher CAS is doing its job.
|
||
- **Cross-process → corruption.** `Lance HEAD version N+1 ahead of manifest version
|
||
N; a pending recovery sidecar requires rollback`. **Mechanism:** a losing writer
|
||
advances the table's Lance HEAD (`commit_staged`) *before* the manifest CAS; when
|
||
the CAS loses, HEAD is ahead of the manifest — a partial commit the per-write heal
|
||
**defers** (`recovery.rs:978-988`; only the open-time sweep rolls back), so a
|
||
*live* writer hitting it **fails instead of healing**. Self-heals on the next
|
||
read-write reopen (not permanently bricked), but during a burst throughput
|
||
collapses to one survivor. Reachable at **concurrency = 2** cross-process.
|
||
|
||
So in-process safety **already comes from the publisher CAS** (clean OCC); the
|
||
corruption needs the process boundary. *(Confound, stated honestly: the in-process
|
||
interleave ran on local-FS and the cross-process on S3-via-proxy — but
|
||
single-server-on-S3 was also clean (12/12), giving two independent "in-process
|
||
clean" points vs one "cross-process corrupt," triangulating on the process
|
||
boundary, not the store. One disambiguating run — separate-handles in-process on S3
|
||
— would move this from triangulated to proven; §5.5.)*
|
||
|
||
**Scoping (matters for urgency):** **single-server prod is serialized-correct, just
|
||
slow** — the in-process `(table,branch)` queue serializes same-branch writes (all 12
|
||
commit, no lost updates); the production incident was the *latency* (serialized
|
||
O(depth) writes → 90 s timeout), **not** corruption. The corruption hazard is
|
||
**latent**: it appears the moment a second writer exists (server replica,
|
||
CLI-alongside-server, multi-writer scale-out). **So: single-server today =
|
||
serialized-correct (slow; fixed by steps 2/3); multi-writer = UNSAFE until
|
||
`writer_epoch` lands.**
|
||
|
||
**The fix is the existing RFC, no new design.** The `A`-before-`B` window
|
||
(Lance HEAD moves before the manifest references it) is inherent to Lance's
|
||
per-table-lineage model — you cannot eliminate it, only fence and recover it: the
|
||
**`writer_epoch`** (delta #7) is a leader-lease via cross-process CAS so two writers
|
||
are never in the `commit_staged`→manifest-CAS window across processes (it removes
|
||
the concurrent-race dimension); the **`PublishPlan`=sidecar** (delta #6) makes a
|
||
single crashed writer roll forward/back deterministically (the crash dimension); and
|
||
**recovery off the hot path + grace window** (delta #5, Q2) is the exact reason the
|
||
live writers failed rather than self-healed (`iss-recovery-sweep-live-writer-rollback`).
|
||
This is the standard WAL-replay + leader-lease shape (confirmed against SlateDB's
|
||
`FenceableTransactionalObject` and Kleppmann's fencing-token canon, §10). **This
|
||
finding promotes #6/#7 from "nice correctness work" to the load-bearing guard that
|
||
gates multi-writer topologies — and it is the motivating case for them.**
|
||
|
||
---
|
||
|
||
## 7. Invariants & deny-list check
|
||
|
||
Touches and *strengthens* (does not weaken) invariants in
|
||
[invariants.md](invariants.md):
|
||
|
||
- **§2 (manifest-atomic visibility):** preserved; lineage now rides the same CAS
|
||
(strengthens — closes the "manifest→commit-graph atomicity" gap).
|
||
- **§3 (one snapshot per op):** enforced *by construction* via `&WriteTxn`.
|
||
- **§4 (publish at one boundary):** unchanged — still one manifest publish.
|
||
- **§5 (recovery part of the commit protocol):** preserved; the sidecar *is* the
|
||
`PublishPlan` (strengthens — commit and recovery cannot diverge). The grace
|
||
window addresses the documented "recovery serialized against live writers
|
||
in-process only" gap.
|
||
- **§7 (indexes derived) / §15 (one source of truth, cheaply derived):** this RFC
|
||
is the write-side application of §15 — bound cost to the working set, not
|
||
history. The commit graph becomes derived (strengthens).
|
||
- **§5 strict-op SI:** preserved (#5 validation — open guards kept for
|
||
read-modify-write).
|
||
|
||
**Deny-list:** does *not* hit "cold re-derivation on the hot path" (it removes
|
||
two instances), "state that drifts" (lineage stops being a second authority), or
|
||
"acks before durable persistence." The `writer_epoch` is the closing move on the
|
||
"local `write_text_if_match` is not a cross-process CAS" / multi-process gaps —
|
||
add it before admitting multi-process write topologies.
|
||
|
||
No invariant is weakened. Two Known Gaps **close** (manifest→commit-graph
|
||
atomicity; commit-graph parent under concurrency, via Phase 7); one
|
||
(read-path-rederivation) gets its **write twin** filed and addressed.
|
||
|
||
### 7.1 Scope of the correctness claims (literature review, §13)
|
||
|
||
The "correct by construction" framing (§3, §4.1) is **precise but bounded** — the
|
||
DB-canon review flags three places not to over-claim:
|
||
|
||
- **Per-table serializability, not graph-wide — but the gap is narrow and now
|
||
measured [M].** Three deterministic cases (failpoint `loader.post_ri_pre_stage`,
|
||
placed between RI-validation and staging; red test in `tests/failpoints.rs`):
|
||
- **Cross-table *disjoint* → genuine skew, VIOLATED.** A **non-cascading endpoint
|
||
removal** — `node:Person` *overwrite* dropping Bob, touching only the node table
|
||
— concurrent with an edge insert `Knows(Bob→Alice)`: both commit (write-set-only
|
||
CAS, RI validated once pre-commit and never re-checked at publish) → **committed
|
||
orphan**. (= `iss-ri-write-skew-dangling-edges` + the concurrent face of
|
||
`iss-overwrite-orphans-committed-edges`.)
|
||
- **Cross-table *overlapping* → incidentally protected.** `delete`-based removal
|
||
**cascades** into `edge:Knows`, so the write-sets overlap, the per-table CAS
|
||
engages, and the loser fails **cleanly** (stale-view OCC retry); invariant held.
|
||
- **Same-table → NOT a separate skew.** Cardinality / `@unique(src)` have
|
||
overlapping write-sets, so the per-table CAS holds the constraint; the loser's
|
||
failure is the **HEAD-ahead corruption already scoped to #6/#7** (epoch +
|
||
PublishPlan), not a consistency hole. *(This corrects an earlier
|
||
over-generalization: cardinality/uniqueness do not share the read-set gap.)*
|
||
|
||
So the skew is **reachable only for the non-cascading-overwrite × disjoint-edge-insert
|
||
shape** — operation-specific, not constraint-specific.
|
||
|
||
**The scoped fix alone is a no-op — proven [M], and the reason is mechanical.**
|
||
Feeding the endpoint node-table versions into the edge's publish *expected* set
|
||
(`check_expected_table_versions`, `publisher.rs:353`) was prototyped exactly; debug
|
||
confirmed the pins reach the check, **and both writers still committed — the orphan
|
||
persisted.** Every publish writes a *unique per-`object_id` row* into `__manifest`
|
||
(merge key `object_id = version_object_id(table, version)`). Two disjoint-table
|
||
writers (`node:Person` vs `edge:Knows`) touch **no common row**, so Lance's
|
||
row-level merge-insert CAS commits both with **no conflict**, the publisher's retry
|
||
loop **never fires**, and `check_expected_table_versions` — a **non-atomic
|
||
pre-check, not part of the CAS** — is evaluated exactly once against the stale
|
||
pre-both manifest and passes for both. The read-set pin only bites if the loser is
|
||
**forced to retry and re-evaluate against fresh state**, which requires a *shared
|
||
contention row* every publish touches. Adding a stand-in global head row
|
||
(`UpdateAll`-touched by every publish) makes the disjoint writers overlap → Lance
|
||
conflict → publisher retry → the reloaded pin (`edge:Knows:1` vs current `5`)
|
||
rejects the stale writer → no orphan (red→green, failpoint suite 52/52). **That
|
||
shared row is exactly Phase-7's `graph_head:<branch>`.**
|
||
|
||
**Consequence — §7.1 is NOT a standalone single-server PR** (correct earlier text
|
||
that called it "single-server-live, not deferrable" — it *is* urgent and
|
||
epoch-independent, but it cannot ship against today's per-`object_id` manifest
|
||
without a contention point). Land it one of three ways: **(a)** with Phase 7
|
||
(step 4), reusing `graph_head:<branch>` as the contention row; **(b)** behind a
|
||
minimal per-branch head row ahead of Phase 7 (~15 lines, as prototyped); or
|
||
**(c)** as commit-time re-validation — still must win a serialization point first.
|
||
**Recommended: (c) behind a per-branch head row.** The CAS-map approach carries the
|
||
two costs §11 anticipated — *table-granularity false conflicts* (any `Person`
|
||
overwrite conflicts with any concurrent `edge:Knows` insert, even different rows —
|
||
needs a row-granularity read-set) and *scope* (a global head serializes the whole
|
||
graph; per-branch `graph_head` is the right granularity). Commit-time re-validation
|
||
is precise (no false positives) **and** reuses the same serialization point, so once
|
||
the head row exists it strictly dominates the CAS-map. Either way the head row
|
||
imposes an inherent trade — same-branch writers serialize cross-process (throughput
|
||
ceiling 1/branch, bounded by `PUBLISHER_RETRY_BUDGET`) — **now a correctness
|
||
requirement, not just a Phase-7 side effect** (§11).
|
||
|
||
**Two faces, two fixes — do not bundle them.** The above addresses only the
|
||
*concurrent* face (overlapping snapshots, `iss-ri-write-skew-dangling-edges`). The
|
||
*sequential* face (`iss-overwrite-orphans-committed-edges`) — an overwrite drops a
|
||
node that **already has a committed inbound edge**, with *zero* concurrency —
|
||
**cannot** be caught by read-set-in-CAS: the later writer's snapshot legitimately
|
||
post-dates the edge, so its pin matches and it commits. That is a pure
|
||
**inbound-RI-validation** gap: when an overwrite/delete removes node endpoints,
|
||
re-check that no live edge references them. A validation concern, not a CAS one;
|
||
it needs no contention row and ships independently.
|
||
*(Note: `iss-984` is a different bug — remote branch-merge idempotency — not this.)*
|
||
- **Recovery: roll-forward is by-construction; roll-back is not.** "Commit and
|
||
recovery replay the identical plan" holds for the **redo** direction (shared
|
||
`plan.apply()`). The undo classifier (NoMovement / UnexpectedAtP1 /
|
||
UnexpectedMultistep / IncompletePhaseB) lives *outside* the shared executor, only
|
||
at open-time — that's where ARIES-style divergence risk concentrates and where the
|
||
§5.5 failpoint coverage is owed.
|
||
- **The fence and the cross-file atomicity rest on a linearizable conditional-put.**
|
||
Kleppmann's fencing-token guarantee, the manifest CAS, and the epoch all require a
|
||
linearizable register — true on S3/R2 (If-Match) but **not** on the local-FS path
|
||
(`write_text_if_match` is content-token compare-then-replace, ABA-prone —
|
||
`invariants.md` Known Gap). **Precondition to state up front: every "deterministic
|
||
fence" / "atomic CAS" claim holds *on a store with linearizable conditional-put*;
|
||
the epoch must not use the local-FS path.** Delta Lake §3.2.2 treats the
|
||
object-store consistency model (read-after-write + put-if-absent) as a first-class
|
||
design parameter; so should this RFC.
|
||
|
||
---
|
||
|
||
## 8. Relationship to Lance MTT (the seam, not a dependency)
|
||
|
||
`GraphPublishAuthority.publish(txn, plan)` is exactly the adapter to a future
|
||
Lance `catalog.transaction()`. lance#7264 ("Multi-Table Transactions via
|
||
Branching") is real and OmniGraph is its reference "Alternative A"
|
||
(fast-forward-main + WAL + roll-forward recovery) **[U]**, but it is a 5-day-old
|
||
discussion with two unbuilt dependencies (lance#7263 branch merge/rebase,
|
||
lance#7185 UUID branch paths), an unresolved central choice (it *favors*
|
||
pointer-swap — the opposite identity model from OmniGraph), and an open soundness
|
||
question (TTL lease needs an epoch). **Build the seam now on its own merits; do
|
||
not schedule around MTT landing.** When it ships, `publish`'s *body* swaps
|
||
(stage→CAS→sidecar → `catalog.transaction()`) while `WriteTxn`/`PublishPlan` and
|
||
every verb lowering stay. `iss-863`/`iss-864` **[G]** already scope this spike.
|
||
|
||
The MemWAL/LSM ingest tier (`iss-681` **[G]**, `dec-adopt-lance-v7-memwal`) is
|
||
**complementary, not competing, and not in flight** (the `memwal-benefit-analysis`
|
||
branch is an empty placeholder; the real analysis is commit `c9a81266`). MemWAL
|
||
sits *below* the manifest publisher (per-table durability, opt-in, intra-table);
|
||
`WriteTxn` owns the cross-table CAS. Build `WriteTxn` first.
|
||
|
||
---
|
||
|
||
## 9. Sequencing
|
||
|
||
Ordered by leverage and dependency. **The dominant depth term is the redundant
|
||
data-table opens (step 3), not the internal tables (step 2)** — §0; both must land
|
||
to flatten the curve.
|
||
|
||
1. **Measure first (Tier-1 gate). ✅ LANDED (gate + harness).** *Prerequisite (1a):*
|
||
the write opener (`open_dataset_head`) is routed through the instrumented
|
||
`open_dataset_tracked` so the gate can count data-table opens (§5.1). The
|
||
write cost-budget tests live in `crates/omnigraph/tests/write_cost.rs` on a
|
||
**shared, store-agnostic harness** (`tests/helpers/cost.rs`: `measure`/`IoCounts`/
|
||
`assert_flat`/`local_graph`/`s3_graph`) that `warm_read_cost.rs` and
|
||
`write_cost_s3.rs` also consume — one vocabulary, no duplicated `IOTracker`
|
||
plumbing. The local gate ships green every-PR guards + the RED `#[ignore]`'d
|
||
internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod
|
||
`storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener
|
||
LOCK (step 3a's red→green, S3-only per the §9-3a measurement note).
|
||
2. **Bound history — bring the INTERNAL tables into optimize/cleanup.** Split into
|
||
a compaction half (the latency win, safe) and a cleanup half (version GC, needs
|
||
the Q8 watermark). Validated (Lance docs + source): compaction *preserves*
|
||
versions and is the only term needed to flatten the per-write metadata scan;
|
||
cleanup is the separate version-deleting op that opens the Q8 hole.
|
||
- **2a. Internal-table compaction. ✅ LANDED.** `optimize` now compacts all
|
||
three internal tables — `__manifest`, `_graph_commits`, **and
|
||
`_graph_commit_actors`** (the actor table grows one fragment per commit on the
|
||
authenticated write path, so it carries the same O(depth) scan as the other
|
||
two and is compacted from one source-of-truth list with per-table existence
|
||
guards). `compact_internal_table` is a separate simpler path than
|
||
`optimize_one_table`: no manifest publish, no recovery sidecar. The sidecar-free
|
||
property does **not** rest on single-commit atomicity (`compact_files` can emit a
|
||
`ReserveFragments` commit before the `Rewrite`, and the auto-cleanup strip is a
|
||
further commit) — it holds because each of those commits is content-preserving
|
||
and the table is read at HEAD, so a crash leaves it readable and content-identical
|
||
and the next `optimize` re-plans. **Non-destructive by construction:** compaction
|
||
preserves versions, and before compacting it strips any stale `lance.auto_cleanup.*`
|
||
config off the table, so a graph created by an older binary (on-by-default GC
|
||
hook) cannot have the commit-time hook silently prune `__manifest`-pinned
|
||
versions during an `optimize` (current binaries store no such config; the
|
||
strip is the upgrade-path safety net). **The same strip now also runs on the
|
||
data-table path** (`optimize_one_table`), inside the Optimize sidecar window —
|
||
so `optimize` is non-destructive on node/edge tables too, not just the internal
|
||
ones (the data-table path was a pre-existing gap, since `compact_files`/
|
||
`optimize_indices` there also commit with the auto-cleanup hook enabled). **Concurrency:**
|
||
no app lock on the internal path — and `compact_files` does *not* auto-retry a
|
||
semantic conflict against a concurrent live writer (Lance prescribes app-rerun for
|
||
`Rewrite` vs `Update`/`Merge`), so `compact_internal_table` runs a *bounded*
|
||
retry loop that reopens fresh and replans on a retryable Lance conflict (the
|
||
canonical Lance-consumer pattern); transient contention does not fail the live
|
||
publisher or the operator's `optimize`, but sustained contention past the budget
|
||
surfaces a loud conflict error (bounded + observable, not an infinite loop). The
|
||
data-table path instead holds the per-table write queue, so it never contends. A
|
||
coordinator `refresh` after the compaction restores cache coherence. The
|
||
`internal_table_scans_are_flat_in_history` LOCK is now green on the
|
||
**authenticated** write path: on a compacted graph a write's
|
||
`__manifest`/`_graph_commits`+`_graph_commit_actors` scan is flat in history
|
||
(measured `__manifest` 4→2, commit-graph+actors 10→2 across depth 10→100).
|
||
Compacts all three tables even though Phase 7 (`iss-991`) will later fold
|
||
`_graph_commits` into `__manifest` (one-call throwaway; full interim win until
|
||
then). **2a is also the hard prerequisite for Phase 7** (its `graph_head` CAS
|
||
contention is only acceptable once `__manifest` compaction bounds the
|
||
publisher's `load_publish_state` scan).
|
||
- **2b. Internal-table cleanup + Q8 watermark — DEFERRED** (debated; not bundled
|
||
with 2a). Cleanup is the version-deleting op that hits cleanup-resurrection
|
||
(§12 Q8: Lance's version CAS has no monotonic guard), so it must land **with**
|
||
a durable monotonic watermark (a Lance boundary tag — durable across cleanup,
|
||
`cleanup.rs` `is_tagged`). Deferred because it touches the read/open path
|
||
(a tag-floor clamp on every coordinator open), is the MTT-redundant part (MTT
|
||
may replace `__manifest`), and only buys the secondary version-count/space term
|
||
— whereas 2a delivers the dominant per-write scan win with zero resurrection
|
||
risk. Land it when the version-count cost bites or the Lance MTT timeline
|
||
clarifies. (`gap-read-path-rederivation` write twin.)
|
||
3. **The opener fix — a shippable lead + the structural follow-on.**
|
||
- **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.**
|
||
`TableStore::open_dataset_head_for_write` now delegates to the direct
|
||
`open_dataset_head` opener (`Dataset::open` by URI + `checkout_branch`, routed
|
||
through `instrumentation::open_dataset_tracked` so the cost gate can count it;
|
||
no-op in prod) instead of the `from_namespace` builder. Measured end-to-end on
|
||
the prototype: data term `31 + 12·depth → flat 4`, total `+18 → +5/depth`,
|
||
depth-80 **2.7×** (§2.4), functionally correct on main/branch/node.
|
||
**Acceptance:** the full `cargo test --workspace --locked` suite passes under the
|
||
bypass (the `tests/` integration + `merge_truth_table` + recovery/failpoint
|
||
suites the prototype's `--lib` run didn't cover — schema-apply, branch merge,
|
||
fork-on-first-write, overwrite). **Namespace retired to test-only:** with both
|
||
reads (Fix 2) and now writes bypassing it, *nothing in production routes through
|
||
the Lance namespace* — confirming §2.4's premise. The dead per-table open chain
|
||
(`load_table_from_namespace`, `open_table_head_for_write`) was deleted and the
|
||
`StagedTableNamespace` contract apparatus gated `#[cfg(test)]`, mirroring the
|
||
already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). **Measurement
|
||
note (corrected):** the opener win is **S3-only** — local FS resolves latest with
|
||
one cheap `read_dir` regardless of opener, so the namespace-vs-direct difference
|
||
is invisible there (the local data-table read count *does* grow with depth, but
|
||
that is the merge-insert/RI scan over O(depth) *fragments*, a compaction term,
|
||
not the opener; depth-100 = 92 ops identically before and after the bypass). The
|
||
opener LOCK therefore lives in the bucket-gated `write_cost_s3.rs`, not the local
|
||
`write_cost.rs`.
|
||
- **3b. Full `WriteTxn` (capture-once + intra-txn handle reuse + shared Session).**
|
||
Formalize 3a's open-once into the pinned, threaded `WriteTxn` (re-resolution
|
||
*unrepresentable*, invariant 3) and kill the flat-46 schema-read constant
|
||
(resolve/validate-once, §0/§6). (`iss-write-s3-roundtrip-amplification`.)
|
||
4. **Phase 7 — lineage into the manifest.** Removes the per-write
|
||
`commit_graph.refresh`; commit graph becomes a projection. (`iss-991`.)
|
||
**Hard dependency: step 2 must land first (Q1, §12)** — each publisher retry
|
||
re-runs the O(history) `load_publish_state` scan, so the `graph_head` CAS
|
||
contention Phase 7 introduces is acceptable only once compaction bounds that
|
||
scan. Acceptance includes the Q1 concurrent-same-branch-writer gate.
|
||
**Carries the §7.1 concurrent write-skew fix.** The `graph_head:<branch>` row is
|
||
the shared contention point the cross-table read-set-in-CAS needs — proven [M]
|
||
that the read-set fix is a no-op without it (§7.1). So the concurrent face of the
|
||
write-skew lands *with* this step (or, if §7.1 must ship earlier, behind a minimal
|
||
per-branch head row — ~15 lines — or as commit-time re-validation). The
|
||
*sequential* face (`iss-overwrite-orphans-committed-edges`) is independent:
|
||
inbound-RI validation on node removal, no head row, ships anytime.
|
||
5. **`PublishPlan` unification + recovery off the hot path + epoch fence — the
|
||
multi-writer safety guard.** Collapse the four writers; move the `__recovery` list
|
||
to open/conflict; add the `writer_epoch` leader-lease. **Motivated by the proven
|
||
§6.5 cliff** (multi-process same-branch writers corrupt at concurrency = 2) — this
|
||
is the guard that makes multi-writer topologies safe, not optional polish.
|
||
**Gated by the §5.5 correctness contract** (the four concurrency tests must land
|
||
with it). `writer_epoch` must be a true cross-process conditional CAS — **not**
|
||
the local-FS `write_text_if_match` path (§7.1). (`iss-856`,
|
||
`iss-merge-recovery-partial-rollforward`, `iss-recovery-sweep-live-writer-rollback`,
|
||
`iss-934`.)
|
||
6. **Branch ops.** Lance `Clone` for create (`iss-691`); concurrent delete loops.
|
||
7. **Freeze** investment in publisher/sidecar/fork internals; pursue the MTT
|
||
seam (`iss-863`/`iss-864`) as the strategic exit.
|
||
|
||
**Land PR #277 first** — it closes `iss-merge-recovery-partial-rollforward` and is
|
||
the producer-side half of the `PublishPlan` discipline; the heal-relocation in
|
||
step 5 must preserve its merge pre-snapshot heal (`exec/merge.rs:1084-1090`) and
|
||
its open-time `IncompletePhaseB → RollBack` (which the per-write heal never
|
||
performed anyway).
|
||
|
||
---
|
||
|
||
## 10. Cross-reference map (the ties)
|
||
|
||
**Dev-graph items (modernrelay) — what this RFC ties together:**
|
||
|
||
- Primary: `iss-write-s3-roundtrip-amplification` (the bug).
|
||
- Depth term / Phase 7 (commit graph → manifest-derived projection): `iss-991`
|
||
(related: `iss-707` structured commit-graph lineage; `iss-934` Quint
|
||
multi-table-publish verification). Read twin: `gap-read-path-rederivation`.
|
||
- Substrate seam: `iss-863`, `iss-864`. Decision: `dec-adopt-lance-v7-memwal`
|
||
(`iss-681`).
|
||
- Recovery: `iss-856`, `iss-recovery-sweep-live-writer-rollback`,
|
||
`iss-merge-recovery-partial-rollforward`, `iss-903`, `iss-load-not-crash-safe`.
|
||
- Residual migration: `iss-950` (MR-A staged delete, retires D2), `iss-848`
|
||
(index-coverage reconciler, owns `create_vector_index`).
|
||
- Branch/load: `iss-691`, `iss-677`, `iss-895`, `iss-topology-cross-branch-cache`,
|
||
`iss-841`, `iss-982`, `iss-423`, `iss-989`.
|
||
- Concurrency correctness (survives MTT) — **two faces, two different fixes [M]**
|
||
(§7.1): `iss-ri-write-skew-dangling-edges` (the *concurrent* face; fix =
|
||
read-set-in-CAS **+ a shared `graph_head` contention row**, so it's coupled to
|
||
step 4 / a minimal head row / commit-time re-validation — NOT a standalone PR) and
|
||
`iss-overwrite-orphans-committed-edges` (the *sequential* face; fix =
|
||
**inbound-RI validation on node removal**, ships independently, no contention row).
|
||
*(`iss-984` — remote branch-merge idempotency — is unrelated; not a write-skew.)*
|
||
- Blockers: `blk-lance-6658` (shipped 7.0.0), `blk-lance-6666` (open, vector
|
||
index two-phase), `blk-lance-blob-compaction`.
|
||
- Epics: `epc-bulk-data-plane`, `epc-lance-v7-migration`, `epc-783` (reliability
|
||
harness), `epc-929` (Quint verification).
|
||
|
||
**Proposed new dev-graph wiring (not yet written):**
|
||
|
||
- New **Epic** `epc-write-path-latency` — owns the cluster of orphaned issues
|
||
above (none currently has an epic).
|
||
- New **Gap** `gap-write-path-rederivation` — the write twin of
|
||
`gap-read-path-rederivation` (current: write re-derives snapshot + scans
|
||
uncompacted internal tables per write; target: capture-once + bounded history).
|
||
- New **Issues**: write-side cost-budget gate + prod metric (step 1; prereq 1a
|
||
routes all opens through the instrumented opener); **opener bypass — open writes
|
||
direct-by-URI, standalone (step 3a, [M] the dominant fix, completes PR #268 Fix 2
|
||
on the write path, §2.4)**; full `WriteTxn` capture-once (step 3b); **add
|
||
`__manifest`/`_graph_commits` to `all_table_keys`** for compaction+cleanup (step 2
|
||
— a code change, `optimize.rs:895-904`); `PublishPlan` unification + epoch
|
||
(step 5); branch-delete concurrency (step 6).
|
||
- **Per-table namespace retired to test-only (step 3a landed).** With reads (Fix 2)
|
||
and now writes (step 3a) both opening direct-by-URI, *nothing in production routes
|
||
through the per-table `StagedTableNamespace`*. The dead open chain
|
||
(`load_table_from_namespace`, `open_table_head_for_write`) was deleted; the
|
||
`StagedTableNamespace` struct/impl/factory are now `#[cfg(test)]`, mirroring the
|
||
already-`#[cfg(test)]` read namespace (`BranchManifestNamespace`). Both are retained
|
||
only to validate the `LanceNamespace` contract in unit tests. *Production catalog /
|
||
managed-versioning commit coordination for `__manifest` itself goes through a
|
||
**separate** namespace (`GraphNamespacePublisher`), unaffected by this change.* The
|
||
former follow-up to harden `StagedTableNamespace::list_table_versions`
|
||
(`checkout_version` per version, O(depth)) is now purely a test-hygiene note — no
|
||
prod caller can hit it; if any future version-list / time-travel feature needs
|
||
per-table version enumeration, build `TableVersion`s from `versions()` metadata
|
||
directly rather than resurrecting the namespace open path.
|
||
- New **Decision** `dec-writetxn-manifest-authoritative-publish` — records this
|
||
RFC's design choice and the MTT-seam stance.
|
||
|
||
**Key source locations (v0.7.0):**
|
||
`omnigraph.rs:561-568,739-779,1317-1389`; `table_ops.rs:505-609`;
|
||
`table_store.rs:157-280,282-341,797`; `loader/mod.rs:197,400,485,557`;
|
||
`exec/mutation.rs:725`; `exec/merge.rs:1084-1090`;
|
||
`db/manifest/publisher.rs:51,93-124,356-371,385,432-440,448-490`;
|
||
`exec/mutation.rs:640-673` (D2 rule); `db/manifest/state.rs:44-72,133-141`; `db/manifest/layout.rs:22-26`;
|
||
`db/manifest/namespace.rs:111-112` (read open, O(1)),`:357-385`/`:362` (`describe_table` → redundant `Dataset::open` — the write-path double-open),`:158-186,544-550` (write open via `from_namespace`),`:395-427` (`list_table_versions` per-version checkout — test-only O(depth), the §10 follow-up);
|
||
`db/manifest/recovery.rs:762,978-988,1522`; `db/commit_graph.rs:136-164,213-272`;
|
||
`db/omnigraph/optimize.rs:240,517,895-904`; `instrumentation.rs:37,112-131`;
|
||
`runtime_cache.rs:202-283`; `tests/warm_read_cost.rs` (the read-side gate to mirror).
|
||
|
||
**Upstream:** lance#7264/#7263/#7185 (MTT); Lance `with_version` O(1) open
|
||
(`from_namespace` → `describe_table`, `builder.rs:130-178`; `default_resolve_version`
|
||
= one HEAD, `commit.rs:939-981`; version-hint PR #6752),
|
||
`list_is_lexically_ordered = !is_s3_express` (`aws.rs:183`),
|
||
`IOTracker`/`assert_io_*`/`num_stages`, `test_commit_iops`,
|
||
`test_commit_uses_version_hint_on_non_lexical_store`; **lance-namespace** design
|
||
(`namespace/index.md`, `operations/index.md` — catalog/discovery layer, resolve
|
||
once); LanceDB `io_tracking.rs`, `test_reload_resets_consistency_timer`; SlateDB
|
||
`FenceableTransactionalObject` (epoch fence), `InstrumentedObjectStore`,
|
||
monotonic-ID manifest.
|
||
|
||
**Reproduce the §0(b) network measurement:** `rustfs` (S3-compat) on `:9000`
|
||
behind a ~90-LoC Go counting proxy on `:9100` (adds `LATENCY_MS`, preserves the
|
||
SigV4 `Host` header, `/__ctl/reset` + `/__ctl/stat`); an omnigraph cluster on
|
||
`s3://…/cluster` through the proxy. Single-write breakdown: reset the proxy log,
|
||
`load --mode merge` one edge, classify by S3 key. Depth slope: write N× to main,
|
||
diff the per-write log at depth D vs D+20 by table. Native baseline: pylance 7.0.0
|
||
`write_dataset(mode="append")` in a loop → flat 6 ops/append at any depth.
|
||
|
||
---
|
||
|
||
## 11. Drawbacks, alternatives, reversibility
|
||
|
||
**Drawbacks.** Phase 7 makes disjoint-table same-branch writers contend on the
|
||
`graph_head:<branch>` row (they don't today) — bounded by the Lance retry budget,
|
||
inherent to a linear per-branch DAG, gated on a measured concurrency test and on
|
||
step 2 landing first (§12 Q1, resolved). **Reframe [M]: this contention is
|
||
load-bearing for correctness, not merely a throughput tax.** The §7.1 write-skew is
|
||
*unreachable only because* the shared head row forces disjoint cross-table writers to
|
||
overlap, conflict, retry, and re-evaluate their read-set pins against fresh state
|
||
(proven — without it the scoped CAS fix is a no-op). So §7.1 and the head row are
|
||
**coupled**: the "drawback" is exactly what buys the cross-table invariant, and the
|
||
throughput ceiling (1 writer/branch, bounded by `PUBLISHER_RETRY_BUDGET`) is a
|
||
**correctness requirement** the moment §7.1 ships, not an optional Phase-7 side
|
||
effect. `PublishPlan` is a non-trivial refactor of four writers; it must land behind
|
||
the cost gate and the `merge_truth_table`/recovery/failpoint suites.
|
||
|
||
**Alternatives.** (A) *Caching band-aid only* — memoize schema validation, cache
|
||
opens within a request: ~30–50% fewer round-trips but leaves open-by-latest and
|
||
the O(history) terms. Mitigation, not a fix. (B) *Opener bypass only* (open
|
||
direct-by-URI+version, no full txn) — **kills the dominant depth term, now measured
|
||
[M]**: a one-line patch flattened the data term `31+12·depth → flat 4` and cut a
|
||
depth-80 edge **2.7×** (§2.4), leaving only the secondary internal-table term and
|
||
the writer unification. (C) *Full design (this RFC)* — correctness by construction.
|
||
(D) *Wait for Lance MTT* — future exit, not a current dependency (§8).
|
||
**Recommend: ship B as a standalone PR first (behind the step-1 gate), then C for
|
||
the constant-factor + correctness, then step 2 for the internal residual; D as the
|
||
strategic end-state.** B is the demonstrated dominant fix, not a partial one.
|
||
|
||
**Reversibility.** The interface (`WriteTxn`/`PublishPlan`) is internal and
|
||
reversible. Phase 7's new `__manifest` object types (`graph_commit`,
|
||
`graph_head`) are an **on-disk format addition** — additive (old binaries skip
|
||
unknown `object_type`s) but near-permanent; it earns its own validation pass
|
||
(forward/back-compat, the validation checklist in the `iss-991` handoff). The
|
||
`writer_epoch` is likewise a durable manifest field. Everything else (compaction
|
||
scheduling, recovery relocation, branch concurrency, the cost gate) is cheap to
|
||
undo.
|
||
|
||
---
|
||
|
||
## 12. Resolved questions (was: unresolved)
|
||
|
||
All five original open questions were investigated read-only against post-#277/#284
|
||
`origin/main`, upstream Lance 7.0.0, and the dev graph; each is resolved below. One
|
||
new item (Q6), surfaced by peer review, remains genuinely open.
|
||
|
||
1. **`graph_head` CAS contention → RESOLVED, gated on step 2 + a concurrency test.**
|
||
Retry is publisher-owned; Lance's internal rebase-retry is disabled
|
||
(`conflict_retries(0)`, `publisher.rs:385`) → no double-retry. Row-CAS is true
|
||
one-winner (`TooMuchWriteContention` → retryable, `publisher.rs:432-440`),
|
||
bounded by `PUBLISHER_RETRY_BUDGET = 5`. **But each retry re-runs the O(history)
|
||
`load_publish_state` scan (`publisher.rs:455`)**, so `graph_head` contention
|
||
multiplies the manifest term — **step 2 (compaction) is a hard prerequisite for
|
||
step 4 (Phase 7)**. Same-branch is the real workload (the incident is concurrent
|
||
`main` writes). Residual: a measured gate before Phase 7 — N≈100 concurrent
|
||
same-branch writers, assert bounded retry + O(working-set) re-scan + P99 within
|
||
SLA. Fallback: batched-lineage, or Alternative B (defer lineage-in-manifest).
|
||
2. **Recovery grace-window → RESOLVED.** PR #284 is **unrelated** (cluster-apply
|
||
trap; zero `recovery.rs` changes). The dangerous rollback classifications
|
||
(NoMovement / UnexpectedAtP1 / UnexpectedMultistep / #277's IncompletePhaseB)
|
||
fire only at the open-time Full sweep; the per-write heal defers all rollback
|
||
(`recovery.rs:978-988`), so moving the heal off the hot path doesn't break #277.
|
||
A sidecar-age grace window (defer sidecars younger than T_grace, loud typed
|
||
skip, `repair` override) on the existing `created_at`/ULID
|
||
(`recovery.rs:762`/`:1522`) is the sound interim; the permanent fix is the
|
||
in-process background reconciler `iss-856`. Lands step 5 with a failpoint test.
|
||
3. **Epoch fence × publisher CAS → RESOLVED (by construction).** With Lance retry
|
||
off (Q1), the publisher loop is the only retry layer. Model `writer_epoch` as a
|
||
**pre-publish hard-fail gate** beside `check_expected_table_versions`
|
||
(`publisher.rs:462`) but non-retryable (a stale epoch is a protocol violation,
|
||
not a race). No double-retry; the epoch gate and the row-CAS loop are
|
||
sequential. SlateDB `FenceableTransactionalObject` is the precedent.
|
||
4. **Compaction cadence → RESOLVED.** Not `auto_cleanup` (GCs pinned versions).
|
||
Not foreground every-N-writes (deny-list job-queue + invariant 7 + cost cliff).
|
||
Minimum (step 2): extend `optimize`/`cleanup` to the internal tables AND node/edge
|
||
version cleanup — no special-casing (`SidecarKind::Optimize` already covers a
|
||
mid-compaction crash). Follow-up: an `iss-856`-shaped background reconciler
|
||
triggered by a cheap fragment-count probe (work off the hot path; a reconciler,
|
||
not a job queue — deny-list-clean; SlateDB's epoch-coordinated compactor is the
|
||
precedent). CLI `omnigraph optimize` stays the operator override.
|
||
5. **`PublishPlan` residuals → RESOLVED.** Both `delete_where` and
|
||
`create_vector_index` are representable as `TableAction` variants with existing
|
||
sidecar coverage (`SidecarKind::Mutation`/`EnsureIndices`) and are
|
||
content-preserving (roll-forward safe). `TableAction::Delete` migrates to staged
|
||
two-phase via MR-A / `iss-950` (now unblocked — `blk-lance-6658` shipped); **D2
|
||
retires then** (`enforce_no_mixed_destructive_constructive`,
|
||
`exec/mutation.rs:640-673`). `TableAction::CreateVectorIndex` stays inline until
|
||
`blk-lance-6666` ships (`iss-848` reconciler path).
|
||
|
||
**Resolved post-review:**
|
||
|
||
6. **The exact mechanism of the data-table chain re-read → RESOLVED (§0, §2.4).**
|
||
Pinned by Lance-source trace + proxy + pylance isolation **[U]**: it is **not**
|
||
`checkout_version` (O(1)) and **not** merge-insert conflict replay. The write
|
||
open goes through `DatasetBuilder::from_namespace` (`namespace.rs:174`), whose
|
||
`describe_table` opens the whole dataset just to return a location
|
||
(`namespace.rs:362`/`:112`) and whose `.load()` resolves latest **again** — a
|
||
double latest-resolution per open, ~13× per write, nothing cached. The open
|
||
resolves latest **without the V2 lexical / version-hint fast path** the direct
|
||
opener uses (likely the un-threaded `Session`/store params,
|
||
`load_table_from_namespace` `namespace.rs:174` — inferred, not traced), so it is
|
||
O(depth) where a direct `from_uri().with_version(N)` is O(1). **The mechanism
|
||
question is now academic for the fix:** bypassing `from_namespace` makes the open
|
||
flat regardless of the precise sub-mechanism (un-threaded `Session` / double
|
||
resolve / missed hint) — the bypass is the answer. (`list_table_versions` is
|
||
**not** on this path — test-only; §10 follow-up.) `checkout_version` stays
|
||
exonerated.
|
||
|
||
**Resolved end-to-end [M]:**
|
||
|
||
7. **End-to-end prototype of step 3 → DONE, measured (§2.4 before/after).** A
|
||
prototype patched the opener (`open_dataset_head_for_write`, `table_store.rs:174`)
|
||
to bypass `from_namespace` and open direct-by-URI, rebuilt v0.7.0, and re-ran the
|
||
sweep: the data term collapsed `31 + 12·depth → flat 4`, total `+18/depth →
|
||
+5/depth` (residual = the two internal tables, step 2), depth-80 **1618 → 593 ops
|
||
(2.7×)** — functionally correct on main edge merge, branch create+write+read, and
|
||
node merge. So step 3's "closes the dominant term outright" is **measured, not
|
||
inferred**, and the opener bypass is **shippable standalone** (§9 step 3a).
|
||
**Remaining (not blockers for step 3a's thesis):** the prototype did not cover
|
||
schema-apply / branch merge / fork-on-first-write / overwrite / concurrent — a
|
||
production opener change must pass the full `merge_truth_table`/recovery/failpoint
|
||
suite; and the internal-table cleanup demo (step 2) + the concurrency
|
||
fault-injection harness (steps 4/5) are still owed.
|
||
|
||
**Newly surfaced (open):**
|
||
|
||
8. **CAS-resurrection after cleanup → CONFIRMED VULNERABLE [S]; boundary watermark
|
||
is a HARD PREREQUISITE for step 2.** SlateDB found this race (RFC-0026 / issue
|
||
#352): a writer that stalls between computing manifest id `N+1` and creating it
|
||
can, *after GC deletes `N+1`*, re-create it and observe **false success**.
|
||
Lance 7.0.0 was traced directly and is **not immune**: version creation is a plain
|
||
`put_opts(naming_scheme.manifest_path(base, version), PutMode::Create)` /
|
||
`rename_if_not_exists` (`lance-table commit.rs:1421-1437`, `:1358`) on a
|
||
version-numbered, **pruneable** path, with **no monotonic/boundary/watermark
|
||
guard** anywhere in the manifest/commit/dataset path; `cleanup_old_versions` is
|
||
**timestamp-based** (`cleanup.rs:1086`), so it deletes the very file the only
|
||
guard (AlreadyExists→rebase) relies on. A stalled publisher whose target version
|
||
was pruned by step-2 cleanup gets a `PutMode::Create` **success on a non-existent
|
||
version → false success.** Severity by store: **R2/S3 (lexical, prod) = a silent
|
||
lost write** (the resurrected version doesn't win V2 latest-resolution, so data
|
||
lands on a dead branch while the publisher believes it committed); non-lexical =
|
||
the version hint (`commit.rs:1439`) is overwritten to the stale version and
|
||
trusted (worse, but not the prod path). **Action:** step 2 ships **only with** a
|
||
durable **boundary/floor watermark** (GC advances it before deleting; every writer
|
||
rejects `id <= boundary` after a "successful" create — SlateDB's fix), which also
|
||
bounds any list-then-read-latest fallback. This was "lowest-risk earliest item";
|
||
it is now gated (§9 step 2).
|
||
|
||
---
|
||
|
||
## 13. External validation (subagents + literature)
|
||
|
||
Validated read-only against OSS prior art and the DB/distributed-systems canon:
|
||
|
||
- **SlateDB** (canonical object-store LSM) — tenet-by-tenet ✅ on capture-once
|
||
snapshot, monotonic-ID manifest (no pointer file — *explicitly rejected* in their
|
||
RFC-0001), the **epoch fence** (exact match: `FenceableTransactionalObject`,
|
||
hard-fail, TTL-lease *explicitly rejected* — adopt as specified), background
|
||
epoch-coordinated compaction/GC, and recovery-on-open. **Adopt-items OmniGraph is
|
||
missing / under-specifies:** (1) the **boundary-file** CAS-resurrection guard (Q8);
|
||
(2) **group-commit batching** — coalesce pending `PublishPlan`s into one manifest
|
||
CAS, directly mitigating the Q1 / §6.5 contention; (3) SlateDB *peels* compaction
|
||
state *out* of the manifest (RFC-0013) — the **opposite** of Phase 7's fold-*in*;
|
||
§11 should defend "fold-in (lineage must be atomic with visibility) beats peel-out
|
||
for us"; (4) **write back-pressure** when cleanup lags (`l0_max_ssts`). **Citation
|
||
correction:** SlateDB has the per-RPC counter (`InstrumentedObjectStore`) but
|
||
**not** the flatness-across-history gate — the depth-swept Tier-1 gate (§5.1) is
|
||
OmniGraph-novel; cite it that way.
|
||
- **Literature** — OCC/MVCC (Kung-Robinson 1981; DDIA ch.7), ARIES redo/undo, the
|
||
fencing-token canon (Kleppmann — whose motivating example *is* OmniGraph's
|
||
S3-read-modify-write-paused-past-lease scenario), and the lakehouse genre (Delta
|
||
Lake VLDB 2020, Iceberg spec, Neon). The spine — OCC-over-MVCC + one atomic
|
||
manifest CAS + WAL-of-intent recovery + monotonic-epoch fence — is canon-blessed,
|
||
and OmniGraph **exceeds** Delta/Iceberg on the axis that matters (both are
|
||
explicitly *single-table*-transactional; the manifest CAS delivers the atomic
|
||
*cross-table* commit Delta only speculates about). The three scoping caveats are in
|
||
§7.1.
|
||
- **HelixDB** (embedded LMDB graph DB) — too different a substrate to validate the
|
||
object-store machinery (LMDB's `commit()` subsumes tenets #2–#8 for free), but it
|
||
**corroborates tenet #1** (capture-once, thread-by-reference, re-resolution
|
||
unrepresentable — its `&mut RwTxn`-threaded traversal is the embedded twin of
|
||
`WriteTxn`) and confirms the bug class is **substrate-induced**. Portable idea for
|
||
the roadmapped traversal work: adjacency as a *persisted, sorted,
|
||
label-partitioned projection* keyed by `(node, label)` (vs the cold-rebuilt
|
||
`TypeIndex`).
|