(feat): compact the internal manifest/commit-graph tables in optimize (#291)

* feat(engine): compact the internal __manifest/_graph_commits tables in optimize

`optimize` iterated node/edge catalog tables only, so the two internal system
tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and
were never compacted -- making every write's metadata scan O(fragments), which
grows forever on a long-lived graph (RFC-013 step 2).

`optimize_all_tables` now also compacts both internal tables via a new
`compact_internal_table`. They are not catalog-tracked (readers open them at
their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`:
compact in place, no manifest publish (nothing to publish to), no recovery
sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no
optimize_indices (they carry no Lance index, only object_id's unenforced-PK
metadata). No application lock: Lance's compact_files auto-retries its Rewrite
against any concurrent writer (the canonical LanceDB pattern; Rewrite vs Append
is compatible, vs Update a retryable same-fragment conflict Lance rebases), and a
coordinator refresh afterwards makes the warm handle observe the compacted HEAD.

Compacts both tables even though Phase 7 (iss-991) will later fold _graph_commits
into __manifest -- a one-call throwaway for the full interim win; __manifest
compaction is also the prerequisite for Phase 7's graph_head contention. Cleanup
(version GC) of the internal tables is deliberately NOT included here: it needs
the Q8 cleanup-resurrection watermark first (deferred).

maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds
optimize_compacts_internal_tables (sheds fragments, leaks no recovery sidecar,
graph coherent for reads + strict writes after).

* test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance)

`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance
gate staged in PR #288. With internal-table compaction landed, a write's
__manifest/_graph_commits scan is flat in commit-history depth on a compacted
graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the
pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before
measuring and runs green every-PR.

* docs: RFC-013 step 2 internal-table compaction landed

- invariants.md: close the compaction half of the read-path-rederivation known
  gap (optimize now compacts the internal tables; cleanup half still deferred).
- maintenance.md: optimize covers __manifest/_graph_commits (no publish, no
  sidecar); not yet in cleanup.
- rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8
  watermark, deferred — debated; MTT-overlap + hot-path liability).
- testing.md: the internal-table LOCK is now green every-PR.

* fix(engine): guard absent _graph_commits + always compact internal tables

Addresses PR #291 review findings:

- Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction,
  but a graph can validly have none (the coordinator opens it as `Option`, gated on
  `storage.exists`, for graphs predating the commit graph). `Dataset::open` on the
  absent table errored and failed the whole optimize. Guard the `_graph_commits`
  compaction with the same `storage_adapter().exists()` check the coordinator uses;
  `__manifest` always exists so it stays unguarded. Regression test
  `optimize_tolerates_absent_graph_commits_table` (empty graph so no publish
  recreates the table before the guard).

- Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table
  compaction for a schema with no node/edge types. Removed it so the internal
  tables are compacted regardless of the data-table set.

- Codex (auto-cleanup, P1): documented — `compact_files` commits with a default
  `CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override,
  so on a graph storing an *on* auto_cleanup config the commit would fire version
  GC. Both internal tables are created with `auto_cleanup: None`, so new graphs are
  safe; the only exposure is pre-fix upgraded graphs, identical to the existing
  data-table optimize path, with step 2b's watermark as the comprehensive guard.
  Added a comment in `compact_internal_table` recording this.

* fix(engine): retry publish on RetryableCommitConflict (compaction vs publish)

Step 2 compacts `__manifest` with no app-level lock (Lance OCC arbitrates,
validated against LanceDB + the lance-7.0.0 conflict resolver). compact_files'
`Operation::Rewrite` auto-retries 20x (CommitConfig default num_retries=20), so a
live publish usually wins the race and the compaction rebases. But the publish
runs its merge-insert with conflict_retries(0) = one rebase attempt; if the
compaction commits first AND the merge touched a fragment the Rewrite rewrote,
Lance preempts the publish with `Error::RetryableCommitConflict` — a DIFFERENT
variant from the row-level `TooMuchWriteContention` the publisher already retries.
Left unhandled, that surfaces a transient error to the caller, i.e. a maintenance
compaction (physical op) failing a live write (logical op) — invariant 7.

Map `LanceError::RetryableCommitConflict` to a new
`ManifestConflictDetails::RetryableCommitConflict` and treat it as retryable in the
publisher's outer loop (reload fresh state + re-merge), alongside
RowLevelCasContention. `ExpectedVersionMismatch` still propagates (a genuine
expectation break must not be blindly retried). This also hardens multi-process
concurrent writers generally, not just compaction.

Normal publishes are insert-only (new object_ids -> new fragments, disjoint from
rewritten old ones), so the conflict is rare; the guard covers the
same-fragment-update edge and multi-process writers. Unit tests in publisher.rs
pin the mapping + the retry-predicate contract.

* revert: publisher RetryableCommitConflict handling (it was the wrong side)

Reverts d138902e. Validated against lance-7.0.0: the publisher's merge-insert runs
with conflict_retries(0), and execute_with_retry converts an exhausted retryable
commit conflict to TooMuchWriteContention before the caller sees it
(write/retry.rs ~95-130). So map_lance_publish_error NEVER receives
RetryableCommitConflict from merge_rows — it receives TooMuchWriteContention, which
the publisher already maps to RowLevelCasContention and retries. The reverted
mapping was therefore dead on the real path and its unit test was synthetic.

The actual exposure is the *compaction* side: compact_files -> commit_compaction ->
apply_commit directly (no execute_with_retry), so a Rewrite-vs-Merge check_txn
conflict propagates raw and optimize can fail on a live graph. That is fixed
app-side in compact_internal_table in the following commit.

* fix(engine): make internal-table compaction correct by construction

Address three findings from review of the step-2 internal-table compaction:

- Non-destructive by construction: before compacting an internal table,
  strip any stored `lance.auto_cleanup.*` config off it. `compact_files`
  commits with a default `CommitConfig` (skip_auto_cleanup=false) and
  `CompactionOptions` exposes no override, so on a graph created by an older
  binary (on-by-default GC hook) the compaction commit would fire Lance's
  auto-cleanup and silently prune `__manifest`-pinned versions. Current
  binaries store no such config; the strip is the upgrade-path safety net so
  `optimize` can never GC versions.

- App-level compaction retry: `compact_files` does NOT auto-retry a semantic
  conflict against a concurrent live writer (Rewrite vs Update/Merge/Delete
  propagates raw from apply_commit; Lance prescribes app-rerun). Wrap the
  internal-table compaction in a bounded retry loop that reopens fresh and
  replans on a retryable Lance conflict, so a maintenance compaction (a
  physical op) never fails a live write (a logical op) — invariant 7.

- Compact all three internal tables, not two: `_graph_commit_actors` grows
  one fragment per commit on the authenticated write path, the same O(depth)
  scan as `__manifest`/`_graph_commits`. Drive the sweep from one
  source-of-truth list with per-table existence guards (the two commit-graph
  tables are optional). Make `graph_commit_actors_uri` pub(crate).

Tests: the `internal_table_scans_are_flat_in_history` LOCK now runs the
authenticated (actorful) write path so it covers `_graph_commit_actors` via
the shared commit-graph IO wrapper (new `commit_many_as`/`measure_insert_as`
helpers); `optimize_clears_stale_auto_cleanup_and_preserves_versions` pins
the non-destructive guarantee (config cleared + no version GC); a unit test
pins the retryable-conflict classifier; the empty-graph stats count is 7
(the actor table is created at init).

* docs: internal-table compaction covers all 3 tables, non-destructive, retried

Sync the RFC-013 step-2a section and the maintenance guide with the
correctness-by-design refinements:

- optimize compacts `__manifest`, `_graph_commits`, AND `_graph_commit_actors`
  (the actor table grows on the authenticated write path).
- optimize is non-destructive by construction — it never GCs versions, and
  strips stale `lance.auto_cleanup.*` config so an upgraded graph's commit-time
  GC hook cannot fire during compaction.
- internal-table compaction rebases and retries against concurrent live
  writers rather than failing the operator's optimize or the live write.
- the cost LOCK is the authenticated-path acceptance test.

* fix(engine): refresh coordinator after a config-strip with no compaction work

`compact_internal_table` returns early when `plan_compaction` finds no work,
but `clear_stale_auto_cleanup_config` may have already committed a config-strip
that advanced Lance HEAD. The early return skipped the coordinator refresh that
the successful-compaction path performs, leaving warm `__manifest`/commit-graph
handles pinned to the pre-strip version until the next read's version probe
healed them. No correctness bug (the probe self-heals, and a stale-handle write
would retry via publisher CAS), but the refresh makes coherence deterministic
rather than probe-dependent. Refresh iff the config-strip actually committed.

* docs(engine): correct compact_internal_table doc — compact_files does not auto-retry

The function doc claimed "Lance's compact_files auto-retries its Operation::Rewrite
against any concurrent writer" — wrong, and contradicting the is_retryable_lance_conflict
doc just below it and the explicit retry loop that exists precisely because compact_files
does NOT auto-retry semantic conflicts (Rewrite vs Update/Merge/Delete propagates raw
through apply_commit). Also move the orphaned description from above the retry-budget
const onto the function, and include the third internal table.

* test(engine): optimize must clear stale auto_cleanup on DATA tables too (red)

Regression test for a destructive bug on the data-table optimize path: on an
upgraded graph whose node/edge table still carries pre-v7 lance.auto_cleanup.*
config, `optimize`'s compact_files/optimize_indices commits fire Lance's version
GC and prune __manifest-pinned data-table versions. Mirrors the internal-table
auto_cleanup test on a Person table (force-repair realigns the config-induced
drift so optimize doesn't skip the table). Red against the current code: the
data-table path does not strip the config. The fix lands in the next commit.

* fix(engine): clear stale auto_cleanup on the data-table optimize path too

The auto_cleanup scrub previously only protected the internal tables; the
data-table path (optimize_one_table) ran compact_files/optimize_indices with a
default CommitConfig (skip_auto_cleanup=false) and no override, so on an upgraded
graph those commits could fire Lance's version-GC hook and prune __manifest-pinned
node/edge versions — making the "non-destructive" contract false for data tables.
Strip the config before the HEAD-advancing commits, capturing version_before first
so the strip's own commit still triggers the Phase-C manifest publish (no uncovered
drift). No retry loop needed: the data-table path holds the per-table write queue.
Covered by the existing Optimize recovery sidecar. Turns the prior commit's test green.

Also: switch clear_stale_auto_cleanup_config off the deprecated delete_config_keys
to update_config(None values), and correct two now-inaccurate doc comments —
compaction is "one or more content-preserving commits" (compact_files can emit a
ReserveFragments before the Rewrite), not "a single atomic commit"; the sidecar-free
property rests on content-preservation + read-at-HEAD, not single-commit atomicity.

* docs: optimize is non-destructive on all tables; correct atomicity/retry claims

- non-destructive guarantee now spans data + internal tables (the auto_cleanup
  strip runs on both paths), not just the internal ones.
- "single atomic Lance commit" was inaccurate: compaction can emit a
  ReserveFragments commit before the Rewrite; the no-sidecar property rests on
  content-preservation + read-at-HEAD, not single-commit atomicity.
- "retries rather than failing" softened to the truth: a *bounded* retry on the
  internal path; sustained contention surfaces a loud conflict error (bounded +
  observable, not an infinite loop). The data path holds the per-table queue and
  never contends.
This commit is contained in:
Ragnor Comerford 2026-06-21 16:38:20 +02:00 committed by GitHub
parent fff441196c
commit f2b792e0ae
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
12 changed files with 600 additions and 46 deletions

View file

@ -846,23 +846,60 @@ to flatten the curve.
internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod
`storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener
LOCK (step 3a's red→green, S3-only per the §9-3a measurement note).
2. **Bound history — bring the INTERNAL tables into optimize/cleanup (a code
change, not just scheduling).** Today `optimize`/`cleanup` iterate **node/edge
keys only** (`optimize.rs:895-904`) — confirmed: the prototype's `cleanup --keep 3`
pruned "7 tables" = the node/edge data tables; `__manifest`/`_graph_commits` were
untouched **[M]**. So the residual +5/depth internal slope (§0b) is **not** fixed
by today's tooling — step 2 is a real `all_table_keys` change to add the internal
tables, then schedule compaction+cleanup (pass `--yes`; cleanup aborts on remote
otherwise). The pruning mechanism is proven on a data table (1035→63, 16× **[M]**);
the internal tables need the same inclusion. **Proven [M]:** compacting the
internal tables collapsed their scans `__manifest` 285→32, `_graph_commits`
177→11; with step 3 a depth-87 edge drops **~1720 → 198 ops** (§2.4). (Separately,
node/edge cleanup **caps** the dominant data-table term as an interim *before*
step 3 — after step 3 that term is flat regardless.) **HARD PREREQUISITE:** the
Q8 boundary watermark must land **with** this step — Lance's version CAS is
confirmed vulnerable to cleanup-resurrection (§12 Q8, a silent lost write on
R2/S3), so scheduling cleanup without the watermark trades a latency bug for a
correctness bug. (`gap-read-path-rederivation` write twin.)
2. **Bound history — bring the INTERNAL tables into optimize/cleanup.** Split into
a compaction half (the latency win, safe) and a cleanup half (version GC, needs
the Q8 watermark). Validated (Lance docs + source): compaction *preserves*
versions and is the only term needed to flatten the per-write metadata scan;
cleanup is the separate version-deleting op that opens the Q8 hole.
- **2a. Internal-table compaction. ✅ LANDED.** `optimize` now compacts all
three internal tables — `__manifest`, `_graph_commits`, **and
`_graph_commit_actors`** (the actor table grows one fragment per commit on the
authenticated write path, so it carries the same O(depth) scan as the other
two and is compacted from one source-of-truth list with per-table existence
guards). `compact_internal_table` is a separate simpler path than
`optimize_one_table`: no manifest publish, no recovery sidecar. The sidecar-free
property does **not** rest on single-commit atomicity (`compact_files` can emit a
`ReserveFragments` commit before the `Rewrite`, and the auto-cleanup strip is a
further commit) — it holds because each of those commits is content-preserving
and the table is read at HEAD, so a crash leaves it readable and content-identical
and the next `optimize` re-plans. **Non-destructive by construction:** compaction
preserves versions, and before compacting it strips any stale `lance.auto_cleanup.*`
config off the table, so a graph created by an older binary (on-by-default GC
hook) cannot have the commit-time hook silently prune `__manifest`-pinned
versions during an `optimize` (current binaries store no such config; the
strip is the upgrade-path safety net). **The same strip now also runs on the
data-table path** (`optimize_one_table`), inside the Optimize sidecar window —
so `optimize` is non-destructive on node/edge tables too, not just the internal
ones (the data-table path was a pre-existing gap, since `compact_files`/
`optimize_indices` there also commit with the auto-cleanup hook enabled). **Concurrency:**
no app lock on the internal path — and `compact_files` does *not* auto-retry a
semantic conflict against a concurrent live writer (Lance prescribes app-rerun for
`Rewrite` vs `Update`/`Merge`), so `compact_internal_table` runs a *bounded*
retry loop that reopens fresh and replans on a retryable Lance conflict (the
canonical Lance-consumer pattern); transient contention does not fail the live
publisher or the operator's `optimize`, but sustained contention past the budget
surfaces a loud conflict error (bounded + observable, not an infinite loop). The
data-table path instead holds the per-table write queue, so it never contends. A
coordinator `refresh` after the compaction restores cache coherence. The
`internal_table_scans_are_flat_in_history` LOCK is now green on the
**authenticated** write path: on a compacted graph a write's
`__manifest`/`_graph_commits`+`_graph_commit_actors` scan is flat in history
(measured `__manifest` 4→2, commit-graph+actors 10→2 across depth 10→100).
Compacts all three tables even though Phase 7 (`iss-991`) will later fold
`_graph_commits` into `__manifest` (one-call throwaway; full interim win until
then). **2a is also the hard prerequisite for Phase 7** (its `graph_head` CAS
contention is only acceptable once `__manifest` compaction bounds the
publisher's `load_publish_state` scan).
- **2b. Internal-table cleanup + Q8 watermark — DEFERRED** (debated; not bundled
with 2a). Cleanup is the version-deleting op that hits cleanup-resurrection
(§12 Q8: Lance's version CAS has no monotonic guard), so it must land **with**
a durable monotonic watermark (a Lance boundary tag — durable across cleanup,
`cleanup.rs` `is_tagged`). Deferred because it touches the read/open path
(a tag-floor clamp on every coordinator open), is the MTT-redundant part (MTT
may replace `__manifest`), and only buys the secondary version-count/space term
— whereas 2a delivers the dominant per-write scan win with zero resurrection
risk. Land it when the version-count cost bites or the Lance MTT timeline
clarifies. (`gap-read-path-rederivation` write twin.)
3. **The opener fix — a shippable lead + the structural follow-on.**
- **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.**
`TableStore::open_dataset_head_for_write` now delegates to the direct