From 8db8937a6a6ceed6ee8a7e2066090f0b2fc23156 Mon Sep 17 00:00:00 2001 From: Ragnor Comerford Date: Sat, 20 Jun 2026 17:29:06 +0200 Subject: [PATCH] docs: RFC-013 step 2 internal-table compaction landed MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - invariants.md: close the compaction half of the read-path-rederivation known gap (optimize now compacts the internal tables; cleanup half still deferred). - maintenance.md: optimize covers __manifest/_graph_commits (no publish, no sidecar); not yet in cleanup. - rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8 watermark, deferred — debated; MTT-overlap + hot-path liability). - testing.md: the internal-table LOCK is now green every-PR. --- docs/dev/invariants.md | 13 +++++--- docs/dev/rfc-013-write-path-latency.md | 45 ++++++++++++++++---------- docs/dev/testing.md | 2 +- docs/user/operations/maintenance.md | 1 + 4 files changed, 38 insertions(+), 23 deletions(-) diff --git a/docs/dev/invariants.md b/docs/dev/invariants.md index eb6821a..3195bd0 100644 --- a/docs/dev/invariants.md +++ b/docs/dev/invariants.md @@ -285,11 +285,14 @@ them explicit. because Lance branch names can be deleted/recreated at the same version number; the manifest e_tag is carried into synthetic snapshot ids when available, and a detected same-branch manifest refresh clears read caches as the fallback for - e_tag-less table locations/topology. Remaining: the internal metadata tables - (`__manifest`, `_graph_commits`) are still not compacted, so the probe and - refresh cost still grows with fragment count on a long-lived graph (the - `optimize`-covers-internal-tables follow-up); the commit graph is not yet - reconcilable from the manifest; and the traversal id-map is still rebuilt. + e_tag-less table locations/topology. Remaining: `optimize` now compacts the + internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2), + so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat + in history; but they are not yet brought into `cleanup` (version GC), so the + `_versions/` chain still grows until an explicit cleanup (the cleanup half is + deferred — it needs the Q8 cleanup-resurrection watermark first). The commit + graph is not yet reconcilable from the manifest; and the traversal id-map is + still rebuilt. - **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes the commit-graph head from storage before appending, so a same-branch write after an external commit no longer forks the commit DAG by parenting off a diff --git a/docs/dev/rfc-013-write-path-latency.md b/docs/dev/rfc-013-write-path-latency.md index fa4abf3..eaece00 100644 --- a/docs/dev/rfc-013-write-path-latency.md +++ b/docs/dev/rfc-013-write-path-latency.md @@ -846,23 +846,34 @@ to flatten the curve. internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod `storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener LOCK (step 3a's red→green, S3-only per the §9-3a measurement note). -2. **Bound history — bring the INTERNAL tables into optimize/cleanup (a code - change, not just scheduling).** Today `optimize`/`cleanup` iterate **node/edge - keys only** (`optimize.rs:895-904`) — confirmed: the prototype's `cleanup --keep 3` - pruned "7 tables" = the node/edge data tables; `__manifest`/`_graph_commits` were - untouched **[M]**. So the residual +5/depth internal slope (§0b) is **not** fixed - by today's tooling — step 2 is a real `all_table_keys` change to add the internal - tables, then schedule compaction+cleanup (pass `--yes`; cleanup aborts on remote - otherwise). The pruning mechanism is proven on a data table (1035→63, 16× **[M]**); - the internal tables need the same inclusion. **Proven [M]:** compacting the - internal tables collapsed their scans `__manifest` 285→32, `_graph_commits` - 177→11; with step 3 a depth-87 edge drops **~1720 → 198 ops** (§2.4). (Separately, - node/edge cleanup **caps** the dominant data-table term as an interim *before* - step 3 — after step 3 that term is flat regardless.) **HARD PREREQUISITE:** the - Q8 boundary watermark must land **with** this step — Lance's version CAS is - confirmed vulnerable to cleanup-resurrection (§12 Q8, a silent lost write on - R2/S3), so scheduling cleanup without the watermark trades a latency bug for a - correctness bug. (`gap-read-path-rederivation` write twin.) +2. **Bound history — bring the INTERNAL tables into optimize/cleanup.** Split into + a compaction half (the latency win, safe) and a cleanup half (version GC, needs + the Q8 watermark). Validated (Lance docs + source): compaction *preserves* + versions and is the only term needed to flatten the per-write metadata scan; + cleanup is the separate version-deleting op that opens the Q8 hole. + - **2a. Internal-table compaction. ✅ LANDED.** `optimize` now compacts + `__manifest` and `_graph_commits` (`compact_internal_table`, a separate simpler + path than `optimize_one_table`: no manifest publish, no recovery sidecar — a + single atomic Lance commit; no app lock — Lance OCC auto-retries the Rewrite, + the canonical LanceDB pattern; a coordinator `refresh` after for cache + coherence). The `internal_table_scans_are_flat_in_history` LOCK is now green: + on a compacted graph a write's `__manifest`/`_graph_commits` scan is flat in + history (measured `__manifest` 4→2, `_graph_commits` 7→3 across depth 10→100, + vs the pre-2a RED 34→214 / 29→207). Compacts both tables even though Phase 7 + (`iss-991`) will later fold `_graph_commits` into `__manifest` (one-call + throwaway; full interim win until then). **2a is also the hard prerequisite + for Phase 7** (its `graph_head` CAS contention is only acceptable once + `__manifest` compaction bounds the publisher's `load_publish_state` scan). + - **2b. Internal-table cleanup + Q8 watermark — DEFERRED** (debated; not bundled + with 2a). Cleanup is the version-deleting op that hits cleanup-resurrection + (§12 Q8: Lance's version CAS has no monotonic guard), so it must land **with** + a durable monotonic watermark (a Lance boundary tag — durable across cleanup, + `cleanup.rs` `is_tagged`). Deferred because it touches the read/open path + (a tag-floor clamp on every coordinator open), is the MTT-redundant part (MTT + may replace `__manifest`), and only buys the secondary version-count/space term + — whereas 2a delivers the dominant per-write scan win with zero resurrection + risk. Land it when the version-count cost bites or the Lance MTT timeline + clarifies. (`gap-read-path-rederivation` write twin.) 3. **The opener fix — a shippable lead + the structural follow-on.** - **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.** `TableStore::open_dataset_head_for_write` now delegates to the direct diff --git a/docs/dev/testing.md b/docs/dev/testing.md index 941cec6..ac5d4f0 100644 --- a/docs/dev/testing.md +++ b/docs/dev/testing.md @@ -26,7 +26,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav | `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: ` sentinel exempts reviewed lines | | `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands | | `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below | -| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — the RED `internal_table_scans_are_flat_in_history` LOCK, `#[ignore]`'d until internal-table compaction lands) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below | +| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below | | `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary | | `lifecycle.rs` | Graph lifecycle, schema state | | `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) | diff --git a/docs/user/operations/maintenance.md b/docs/user/operations/maintenance.md index e2a88eb..72dc853 100644 --- a/docs/user/operations/maintenance.md +++ b/docs/user/operations/maintenance.md @@ -6,6 +6,7 @@ - Compacts every node + edge table on `main`, then reindexes them, then **publishes the resulting version to the `__manifest`** so the manifest's recorded version tracks the compacted-and-reindexed state. Reads pin the manifest version, so without this publish the work would be invisible to readers *and* would break the version precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually changed. - Rewrites small fragments into fewer large ones; old fragments remain reachable via older versions until `cleanup` runs. +- **Also compacts the internal system tables** `__manifest` and `_graph_commits` (RFC-013 step 2), which accumulate one fragment per commit and otherwise make every write's metadata scan grow with history. These take a simpler path than data tables: they are not `__manifest`-tracked (readers open them at their latest version), so compaction just advances their version in place — **no manifest publish and no recovery sidecar** (a single atomic Lance commit). They appear in the returned stats under `table_key` `"__manifest"` / `"_graph_commits"`. They are **not yet covered by `cleanup`**, so their version chain still grows until the cleanup half lands (it requires a cleanup-resurrection safeguard first); run `optimize` on a cadence to keep per-write metadata scans flat. - **Reindex (index coverage maintenance).** A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the index was built (e.g. by `load --mode merge`, whose commit does not rebuild an already-existing index) are scanned unindexed, and compaction itself rewrites fragments out of an index's coverage. `optimize` runs Lance's incremental `optimize_indices` after compaction to fold those fragments back in (a delta merge, not a full retrain), restoring full coverage so equality/range/traversal predicates stay index-accelerated. This is why a table with **no compaction work but stale index coverage still commits** a new version under `optimize`. Run `optimize` on a cadence at least as frequent as your freshness window so recently-loaded rows do not linger in the unindexed flat-scan tail. - **Create declared-but-missing indexes (the index reconciler).** `@index`/`@key` declares intent; `schema apply` records it but builds nothing, and `load`/`mutate` defer a column that cannot be built yet (a `Vector` column with no trainable vectors). `optimize` materializes any such declared-but-unbuilt index over the compacted layout — so it is the convergence path for an `@index` added after data exists, or a vector index whose embeddings arrived via a later `embed`. A column still not buildable (no vectors yet) is reported on the table's stat as `pending_indexes` (visible in `--json`), not treated as a failure; the next `optimize` retries. So `optimize` is the single operator-facing index reconciler: it compacts, restores coverage, **and** builds declared-but-missing indexes. - Each table's compact→reindex→publish serializes with concurrent mutations on the same table. A crash mid-operation is recovered automatically on the next open (both compaction and reindex are content-preserving, so roll-forward is always safe).