mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-24 02:38:06 +02:00
docs: RFC-013 step 2 internal-table compaction landed
- invariants.md: close the compaction half of the read-path-rederivation known gap (optimize now compacts the internal tables; cleanup half still deferred). - maintenance.md: optimize covers __manifest/_graph_commits (no publish, no sidecar); not yet in cleanup. - rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8 watermark, deferred — debated; MTT-overlap + hot-path liability). - testing.md: the internal-table LOCK is now green every-PR.
This commit is contained in:
parent
76b66adda0
commit
8db8937a6a
4 changed files with 38 additions and 23 deletions
|
|
@ -285,11 +285,14 @@ them explicit.
|
|||
because Lance branch names can be deleted/recreated at the same version number;
|
||||
the manifest e_tag is carried into synthetic snapshot ids when available, and
|
||||
a detected same-branch manifest refresh clears read caches as the fallback for
|
||||
e_tag-less table locations/topology. Remaining: the internal metadata tables
|
||||
(`__manifest`, `_graph_commits`) are still not compacted, so the probe and
|
||||
refresh cost still grows with fragment count on a long-lived graph (the
|
||||
`optimize`-covers-internal-tables follow-up); the commit graph is not yet
|
||||
reconcilable from the manifest; and the traversal id-map is still rebuilt.
|
||||
e_tag-less table locations/topology. Remaining: `optimize` now compacts the
|
||||
internal metadata tables (`__manifest`, `_graph_commits`) too (RFC-013 step 2),
|
||||
so a *periodically-optimized* graph keeps the probe/refresh/per-write scan flat
|
||||
in history; but they are not yet brought into `cleanup` (version GC), so the
|
||||
`_versions/` chain still grows until an explicit cleanup (the cleanup half is
|
||||
deferred — it needs the Q8 cleanup-resurrection watermark first). The commit
|
||||
graph is not yet reconcilable from the manifest; and the traversal id-map is
|
||||
still rebuilt.
|
||||
- **Commit-graph parent under concurrency:** `record_graph_commit` now refreshes
|
||||
the commit-graph head from storage before appending, so a same-branch write
|
||||
after an external commit no longer forks the commit DAG by parenting off a
|
||||
|
|
|
|||
|
|
@ -846,23 +846,34 @@ to flatten the curve.
|
|||
internal-table LOCK (step 2's red→green acceptance). *Still owed:* the prod
|
||||
`storage.ops` span metric (§5.3) and the bucket-gated `write_cost_s3.rs` opener
|
||||
LOCK (step 3a's red→green, S3-only per the §9-3a measurement note).
|
||||
2. **Bound history — bring the INTERNAL tables into optimize/cleanup (a code
|
||||
change, not just scheduling).** Today `optimize`/`cleanup` iterate **node/edge
|
||||
keys only** (`optimize.rs:895-904`) — confirmed: the prototype's `cleanup --keep 3`
|
||||
pruned "7 tables" = the node/edge data tables; `__manifest`/`_graph_commits` were
|
||||
untouched **[M]**. So the residual +5/depth internal slope (§0b) is **not** fixed
|
||||
by today's tooling — step 2 is a real `all_table_keys` change to add the internal
|
||||
tables, then schedule compaction+cleanup (pass `--yes`; cleanup aborts on remote
|
||||
otherwise). The pruning mechanism is proven on a data table (1035→63, 16× **[M]**);
|
||||
the internal tables need the same inclusion. **Proven [M]:** compacting the
|
||||
internal tables collapsed their scans `__manifest` 285→32, `_graph_commits`
|
||||
177→11; with step 3 a depth-87 edge drops **~1720 → 198 ops** (§2.4). (Separately,
|
||||
node/edge cleanup **caps** the dominant data-table term as an interim *before*
|
||||
step 3 — after step 3 that term is flat regardless.) **HARD PREREQUISITE:** the
|
||||
Q8 boundary watermark must land **with** this step — Lance's version CAS is
|
||||
confirmed vulnerable to cleanup-resurrection (§12 Q8, a silent lost write on
|
||||
R2/S3), so scheduling cleanup without the watermark trades a latency bug for a
|
||||
correctness bug. (`gap-read-path-rederivation` write twin.)
|
||||
2. **Bound history — bring the INTERNAL tables into optimize/cleanup.** Split into
|
||||
a compaction half (the latency win, safe) and a cleanup half (version GC, needs
|
||||
the Q8 watermark). Validated (Lance docs + source): compaction *preserves*
|
||||
versions and is the only term needed to flatten the per-write metadata scan;
|
||||
cleanup is the separate version-deleting op that opens the Q8 hole.
|
||||
- **2a. Internal-table compaction. ✅ LANDED.** `optimize` now compacts
|
||||
`__manifest` and `_graph_commits` (`compact_internal_table`, a separate simpler
|
||||
path than `optimize_one_table`: no manifest publish, no recovery sidecar — a
|
||||
single atomic Lance commit; no app lock — Lance OCC auto-retries the Rewrite,
|
||||
the canonical LanceDB pattern; a coordinator `refresh` after for cache
|
||||
coherence). The `internal_table_scans_are_flat_in_history` LOCK is now green:
|
||||
on a compacted graph a write's `__manifest`/`_graph_commits` scan is flat in
|
||||
history (measured `__manifest` 4→2, `_graph_commits` 7→3 across depth 10→100,
|
||||
vs the pre-2a RED 34→214 / 29→207). Compacts both tables even though Phase 7
|
||||
(`iss-991`) will later fold `_graph_commits` into `__manifest` (one-call
|
||||
throwaway; full interim win until then). **2a is also the hard prerequisite
|
||||
for Phase 7** (its `graph_head` CAS contention is only acceptable once
|
||||
`__manifest` compaction bounds the publisher's `load_publish_state` scan).
|
||||
- **2b. Internal-table cleanup + Q8 watermark — DEFERRED** (debated; not bundled
|
||||
with 2a). Cleanup is the version-deleting op that hits cleanup-resurrection
|
||||
(§12 Q8: Lance's version CAS has no monotonic guard), so it must land **with**
|
||||
a durable monotonic watermark (a Lance boundary tag — durable across cleanup,
|
||||
`cleanup.rs` `is_tagged`). Deferred because it touches the read/open path
|
||||
(a tag-floor clamp on every coordinator open), is the MTT-redundant part (MTT
|
||||
may replace `__manifest`), and only buys the secondary version-count/space term
|
||||
— whereas 2a delivers the dominant per-write scan win with zero resurrection
|
||||
risk. Land it when the version-count cost bites or the Lance MTT timeline
|
||||
clarifies. (`gap-read-path-rederivation` write twin.)
|
||||
3. **The opener fix — a shippable lead + the structural follow-on.**
|
||||
- **3a. Opener bypass (standalone PR, THE dominant fix — [M] proven). ✅ LANDED.**
|
||||
`TableStore::open_dataset_head_for_write` now delegates to the direct
|
||||
|
|
|
|||
|
|
@ -26,7 +26,7 @@ The engine's `tests/` is the principal coverage surface; most graph-shaped behav
|
|||
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs, nor open datasets directly (`Dataset::open` / `DatasetBuilder::from_uri`/`from_namespace`) — reads route through `Snapshot::open` and the held-handle cache; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
|
||||
| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
|
||||
| `warm_read_cost.rs` | Cost-budget tests for the warm read path (query-latency work), measured at the object-store boundary with Lance `IOTracker` (the LanceDB IO-counted pattern): a warm same-branch read does 0 manifest opens, 0 commit-graph opens, 1 version probe, validates the schema once (Fix 1 / finding A / Fix 2 at commit-history depth); stale same-branch reads perform exactly 2 probes and refresh manifest-only; recreated non-main branches with the same Lance version refresh by incarnation; recreated branch-owned table handles are distinguished by table e_tag or refresh-time cache clearing; recreated traversal topology is protected by synthetic snapshot-id incarnation or refresh-time cache clearing; a warm *repeat* read does 0 table opens via the held-handle cache and a write re-opens only the changed table at its new version/e_tag (Fix 3/6A). See "Cost-budget tests" below |
|
||||
| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — the RED `internal_table_scans_are_flat_in_history` LOCK, `#[ignore]`'d until internal-table compaction lands) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
|
||||
| `write_cost.rs` | Cost-budget tests for the WRITE path (RFC-013), the latency twin of `warm_read_cost.rs` on the **shared `helpers::cost` harness** (`measure`/`IoCounts`/`assert_flat`/`local_graph`). Runs on **local FS**; gates the **internal-table** term (`__manifest`/`_graph_commits` scans flat in commit-history depth — `internal_table_scans_are_flat_in_history`, now **green every-PR** since RFC-013 step 2 brought the internal tables into `optimize`; the test compacts at each depth before measuring) plus green every-PR guards (single-insert `data_writes` bounded, a per-write read-op ceiling that fails the moment a round-trip is added, and a `measure_with_staged` fitness assert that a keyed insert routes through `stage_merge_insert` once with no `stage_append`/vector-index build). The **data-table opener** term is S3-only — see `write_cost_s3.rs` and the backend-split note in "Cost-budget tests" below |
|
||||
| `helpers/cost.rs` | The shared cost-budget harness (not a test): `IoCounts`/`StagedCounts` (counts by table class), `measure`/`measure_with_staged` (the one place the `with_query_io_probes` + `MergeWriteProbes` task-local + `IOTracker` wiring lives), `assert_flat(curve, select, slack, what)`, and store-agnostic `local_graph`/`s3_graph` fixtures. `warm_read_cost.rs`, `write_cost.rs`, and `write_cost_s3.rs` all consume it so a cost test body is written once and reads in one vocabulary |
|
||||
| `lifecycle.rs` | Graph lifecycle, schema state |
|
||||
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
|
||||
|
|
|
|||
|
|
@ -6,6 +6,7 @@
|
|||
|
||||
- Compacts every node + edge table on `main`, then reindexes them, then **publishes the resulting version to the `__manifest`** so the manifest's recorded version tracks the compacted-and-reindexed state. Reads pin the manifest version, so without this publish the work would be invisible to readers *and* would break the version precondition of the next schema apply / strict update/delete ("stale view … refresh and retry"). The publish advances the graph version (a system-attributed commit) only for tables that actually changed.
|
||||
- Rewrites small fragments into fewer large ones; old fragments remain reachable via older versions until `cleanup` runs.
|
||||
- **Also compacts the internal system tables** `__manifest` and `_graph_commits` (RFC-013 step 2), which accumulate one fragment per commit and otherwise make every write's metadata scan grow with history. These take a simpler path than data tables: they are not `__manifest`-tracked (readers open them at their latest version), so compaction just advances their version in place — **no manifest publish and no recovery sidecar** (a single atomic Lance commit). They appear in the returned stats under `table_key` `"__manifest"` / `"_graph_commits"`. They are **not yet covered by `cleanup`**, so their version chain still grows until the cleanup half lands (it requires a cleanup-resurrection safeguard first); run `optimize` on a cadence to keep per-write metadata scans flat.
|
||||
- **Reindex (index coverage maintenance).** A scalar/FTS/vector index only covers the fragments it was built over. Rows appended after the index was built (e.g. by `load --mode merge`, whose commit does not rebuild an already-existing index) are scanned unindexed, and compaction itself rewrites fragments out of an index's coverage. `optimize` runs Lance's incremental `optimize_indices` after compaction to fold those fragments back in (a delta merge, not a full retrain), restoring full coverage so equality/range/traversal predicates stay index-accelerated. This is why a table with **no compaction work but stale index coverage still commits** a new version under `optimize`. Run `optimize` on a cadence at least as frequent as your freshness window so recently-loaded rows do not linger in the unindexed flat-scan tail.
|
||||
- **Create declared-but-missing indexes (the index reconciler).** `@index`/`@key` declares intent; `schema apply` records it but builds nothing, and `load`/`mutate` defer a column that cannot be built yet (a `Vector` column with no trainable vectors). `optimize` materializes any such declared-but-unbuilt index over the compacted layout — so it is the convergence path for an `@index` added after data exists, or a vector index whose embeddings arrived via a later `embed`. A column still not buildable (no vectors yet) is reported on the table's stat as `pending_indexes` (visible in `--json`), not treated as a failure; the next `optimize` retries. So `optimize` is the single operator-facing index reconciler: it compacts, restores coverage, **and** builds declared-but-missing indexes.
|
||||
- Each table's compact→reindex→publish serializes with concurrent mutations on the same table. A crash mid-operation is recovered automatically on the next open (both compaction and reindex are content-preserving, so roll-forward is always safe).
|
||||
|
|
|
|||
Loading…
Add table
Add a link
Reference in a new issue