docs: RFC-013 step 2 internal-table compaction landed

- invariants.md: close the compaction half of the read-path-rederivation known gap (optimize now compacts the internal tables; cleanup half still deferred). - maintenance.md: optimize covers __manifest/_graph_commits (no publish, no sidecar); not yet in cleanup. - rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8 watermark, deferred — debated; MTT-overlap + hot-path liability). - testing.md: the internal-table LOCK is now green every-PR.
2026-06-24 02:38:06 +02:00 · 2026-06-20 17:29:06 +02:00 · 2026-06-20 17:29:06 +02:00 · 8db8937a6a
commit 8db8937a6a
parent 76b66adda0
4 changed files with 38 additions and 23 deletions
--- a/docs/dev/invariants.md
+++ b/docs/dev/invariants.md
@ -285,11 +285,14 @@ them explicit.
  because Lance branch names can be deleted/recreated at the same version number;
  the manifest e_tag is carried into synthetic snapshot ids when available, and
  a detected same-branch manifest refresh clears read caches as the fallback for
-  e_tag-less table locations/topology. Remaining: the internal metadata tables
-  (`__manifest`, `_graph_commits`) are still not compacted, so the probe and
-  refresh cost still grows with fragment count on a long-lived graph (the
-  `optimize`-covers-internal-tables follow-up); the commit graph is not yet
-  reconcilable from the manifest; and the traversal id-map is still rebuilt.
+  e_tag-less table locations/topology. Remaining: `optimize` now compacts the
+  internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2),
+  so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat
+  in history; but they are not yet brought into `cleanup` (version GC), so the
+  `_versions/` chain still grows until an explicit cleanup (the cleanup half is
+  deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
+  graph is not yet reconcilable from the manifest; and the traversal id-map is
+  still rebuilt.
 - **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
  the commit-graph head from storage before appending, so a same-branch write
  after an external commit no longer forks the commit DAG by parenting off a
--- a/docs/dev/rfc-013-write-path-latency.md
+++ b/docs/dev/rfc-013-write-path-latency.md
@ -846,23 +846,34 @@ to flatten the curve.
   internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod
   `storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener
   LOCK (step 3a's red→green, S3-only per the §9-3a measurement note).
-2. **Bound history — bring the INTERNAL tables into optimize/cleanup (a code
-   change, not just scheduling).** Today `optimize`/`cleanup` iterate **node/edge
-   keys only** (`optimize.rs:895-904`) — confirmed: the prototype's `cleanup --keep 3`
-   pruned "7 tables" = the node/edge data tables; `__manifest`/`_graph_commits` were
-   untouched **[M]**. So the residual +5/depth internal slope (§0b) is **not** fixed
-   by today's tooling — step 2 is a real `all_table_keys` change to add the internal
-   tables, then schedule compaction+cleanup (pass `--yes`; cleanup aborts on remote
-   otherwise). The pruning mechanism is proven on a data table (1035→63, 16× **[M]**);
-   the internal tables need the same inclusion. **Proven [M]:** compacting the
-   internal tables collapsed their scans `__manifest` 285→32, `_graph_commits`
-   177→11; with step 3 a depth-87 edge drops **~1720 → 198 ops** (§2.4). (Separately,
-   node/edge cleanup **caps** the dominant data-table term as an interim *before*
-   step 3 — after step 3 that term is flat regardless.) **HARD PREREQUISITE:** the
-   Q8 boundary watermark must land **with** this step — Lance's version CAS is
-   confirmed vulnerable to cleanup-resurrection (§12 Q8, a silent lost write on
-   R2/S3), so scheduling cleanup without the watermark trades a latency bug for a
-   correctness bug. (`gap-read-path-rederivation` write twin.)
+2. **Bound history — bring the INTERNAL tables into optimize/cleanup.** Split into
+   a compaction half (the latency win, safe) and a cleanup half (version GC, needs
+   the Q8 watermark). Validated (Lance docs + source): compaction *preserves*
+   versions and is the only term needed to flatten the per-write metadata scan;
+   cleanup is the separate version-deleting op that opens the Q8 hole.
+   - **2a. Internal-table compaction. ✅ LANDED.** `optimize` now compacts
+     `__manifest` and `_graph_commits` (`compact_internal_table`, a separate simpler
+     path than `optimize_one_table`: no manifest publish, no recovery sidecar — a
+     single atomic Lance commit; no app lock — Lance OCC auto-retries the Rewrite,
+     the canonical LanceDB pattern; a coordinator `refresh` after for cache
+     coherence). The `internal_table_scans_are_flat_in_history` LOCK is now green:
+     on a compacted graph a write's `__manifest`/`_graph_commits` scan is flat in
+     history (measured `__manifest` 4→2, `_graph_commits` 7→3 across depth 10→100,
+     vs the pre-2a RED 34→214 / 29→207). Compacts both tables even though Phase 7
+     (`iss-991`) will later fold `_graph_commits` into `__manifest` (one-call
+     throwaway; full interim win until then). **2a is also the hard prerequisite
+     for Phase 7** (its `graph_head` CAS contention is only acceptable once
+     `__manifest` compaction bounds the publisher's `load_publish_state` scan).
+   - **2b. Internal-table cleanup + Q8 watermark — DEFERRED** (debated; not bundled
+     with 2a). Cleanup is the version-deleting op that hits cleanup-resurrection
+     (§12 Q8: Lance's version CAS has no monotonic guard), so it must land **with**
+     a durable monotonic watermark (a Lance boundary tag — durable across cleanup,
+     `cleanup.rs` `is_tagged`). Deferred because it touches the read/open path
+     (a tag-floor clamp on every coordinator open), is the MTT-redundant part (MTT
+     may replace `__manifest`), and only buys the secondary version-count/space term
+     — whereas 2a delivers the dominant per-write scan win with zero resurrection
+     risk. Land it when the version-count cost bites or the Lance MTT timeline
+     clarifies. (`gap-read-path-rederivation` write twin.)
 3. **The opener fix — a shippable lead + the structural follow-on.**
   - **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.**
     `TableStore::open_dataset_head_for_write` now delegates to the direct
--- a/docs/dev/testing.md
+++ b/docs/dev/testing.md
@ -26,7 +26,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
 | `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
 | `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
 | `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
-| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — the RED `internal_table_scans_are_flat_in_history` LOCK, `#[ignore]`'d until internal-table compaction lands) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
+| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
 | `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
 | `lifecycle.rs` | Graph lifecycle, schema state |
 | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
--- a/docs/user/operations/maintenance.md
+++ b/docs/user/operations/maintenance.md
@ -6,6 +6,7 @@

 - Compacts every node + edge table on `main`, then reindexes them, then **publishes the resulting version to the `__manifest`** so the manifest's recorded version tracks the compacted-and-reindexed state. Reads pin the manifest version, so without this publish the work would be invisible to readers *and* would break the version precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually changed.
 - Rewrites small fragments into fewer large ones; old fragments remain reachable via older versions until `cleanup` runs.
+- **Also compacts the internal system tables** `__manifest` and `_graph_commits` (RFC-013 step 2), which accumulate one fragment per commit and otherwise make every write's metadata scan grow with history. These take a simpler path than data tables: they are not `__manifest`-tracked (readers open them at their latest version), so compaction just advances their version in place — **no manifest publish and no recovery sidecar** (a single atomic Lance commit). They appear in the returned stats under `table_key` `"__manifest"` / `"_graph_commits"`. They are **not yet covered by `cleanup`**, so their version chain still grows until the cleanup half lands (it requires a cleanup-resurrection safeguard first); run `optimize` on a cadence to keep per-write metadata scans flat.
 - **Reindex (index coverage maintenance).** A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the index was built (e.g. by `load --mode merge`, whose commit does not rebuild an already-existing index) are scanned unindexed, and compaction itself rewrites fragments out of an index's coverage. `optimize` runs Lance's incremental `optimize_indices` after compaction to fold those fragments back in (a delta merge, not a full retrain), restoring full coverage so equality/range/traversal predicates stay index-accelerated. This is why a table with **no compaction work but stale index coverage still commits** a new version under `optimize`. Run `optimize` on a cadence at least as frequent as your freshness window so recently-loaded rows do not linger in the unindexed flat-scan tail.
 - **Create declared-but-missing indexes (the index reconciler).** `@index`/`@key` declares intent; `schema apply` records it but builds nothing, and `load`/`mutate` defer a column that cannot be built yet (a `Vector` column with no trainable vectors). `optimize` materializes any such declared-but-unbuilt index over the compacted layout — so it is the convergence path for an `@index` added after data exists, or a vector index whose embeddings arrived via a later `embed`. A column still not buildable (no vectors yet) is reported on the table's stat as `pending_indexes` (visible in `--json`), not treated as a failure; the next `optimize` retries. So `optimize` is the single operator-facing index reconciler: it compacts, restores coverage, **and** builds declared-but-missing indexes.
 - Each table's compact→reindex→publish serializes with concurrent mutations on the same table. A crash mid-operation is recovered automatically on the next open (both compaction and reindex are content-preserving, so roll-forward is always safe).