mirror of
https://github.com/ModernRelay/omnigraph.git
synced 2026-06-24 02:38:06 +02:00
* feat(engine): compact the internal __manifest/_graph_commits tables in optimize
`optimize` iterated node/edge catalog tables only, so the two internal system
tables (`__manifest`, `_graph_commits`) accumulated one fragment per commit and
were never compacted -- making every write's metadata scan O(fragments), which
grows forever on a long-lived graph (RFC-013 step 2).
`optimize_all_tables` now also compacts both internal tables via a new
`compact_internal_table`. They are not catalog-tracked (readers open them at
their latest Lance HEAD), so it is a much simpler path than `optimize_one_table`:
compact in place, no manifest publish (nothing to publish to), no recovery
sidecar (a single atomic Lance commit -- no HEAD-before-publish gap), and no
optimize_indices (they carry no Lance index, only object_id's unenforced-PK
metadata). No application lock: Lance's compact_files auto-retries its Rewrite
against any concurrent writer (the canonical LanceDB pattern: Rewrite vs Append
is compatible, and Rewrite vs Update on the same fragment is a retryable
conflict that Lance rebases), and a coordinator refresh afterwards makes the
warm handle observe the compacted HEAD.
Both tables are compacted even though Phase 7 (iss-991) will later fold
_graph_commits into __manifest -- a one-call throwaway that buys the full interim
win; __manifest compaction is also the prerequisite for Phase 7's graph_head
contention. Cleanup (version GC) of the internal tables is deliberately NOT
included here: it needs the Q8 cleanup-resurrection watermark first (deferred).
maintenance.rs: optimize now returns 6 stats (4 data + 2 internal); adds
optimize_compacts_internal_tables, which checks that compaction sheds fragments,
leaks no recovery sidecar, and leaves the graph coherent for reads and strict
writes afterwards.
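
The compact-in-place path and the stats count described above can be sketched with a toy model. This is an illustration only: `Table`, `CompactionStat`, and this `optimize_all_tables` are hypothetical stand-ins built on plain vectors, not the engine's types or Lance's API, and the "4 data tables" in the usage note simply mirrors the 4 + 2 figure quoted in the message.

```rust
// Toy model only: every type and function here is a hypothetical stand-in.

/// Per-table result, loosely mirroring "one stat per compacted table".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CompactionStat {
    pub table: String,
    pub fragments_before: usize,
    pub fragments_after: usize,
}

/// A table modeled as a list of fragments, each holding some row ids.
pub struct Table {
    pub name: String,
    pub fragments: Vec<Vec<u64>>,
}

impl Table {
    /// Build a table where every commit appended exactly one one-row fragment.
    pub fn with_commits(name: &str, commits: usize) -> Self {
        let fragments = (0..commits as u64).map(|i| vec![i]).collect();
        Table { name: name.to_string(), fragments }
    }

    /// Compact in place: merge all fragments into one.
    /// No publish step and no recovery sidecar are modeled, matching the
    /// simpler internal-table path described in the text.
    pub fn compact_in_place(&mut self) -> CompactionStat {
        let before = self.fragments.len();
        if before > 1 {
            let merged: Vec<u64> = self.fragments.drain(..).flatten().collect();
            self.fragments.push(merged);
        }
        CompactionStat {
            table: self.name.clone(),
            fragments_before: before,
            fragments_after: self.fragments.len(),
        }
    }
}

/// Compact the data tables first, then the internal tables, returning one
/// stat per table in that order.
pub fn optimize_all_tables(data: &mut [Table], internal: &mut [Table]) -> Vec<CompactionStat> {
    data.iter_mut()
        .chain(internal.iter_mut())
        .map(|t| t.compact_in_place())
        .collect()
}
```

With four data tables and the two internal tables as input, the returned vector has six entries, and each internal table ends with a single fragment holding all of its rows.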
* test(engine): un-ignore the internal-table scan LOCK (step 2 acceptance)
`internal_table_scans_are_flat_in_history` was the RED, #[ignore]'d acceptance
gate staged in PR #288. With internal-table compaction landed, a write's
__manifest/_graph_commits scan is flat in commit-history depth on a compacted
graph (measured __manifest 4->2, _graph_commits 7->3 across depth 10->100, vs the
pre-step-2 RED 34->214 / 29->207). The test now compacts at each depth before
measuring and runs green on every PR.
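
The shape of that acceptance check can be illustrated with a small cost model. It assumes, purely for illustration, that a scan costs one unit per fragment and that each commit adds one fragment. `measured_scan` and `is_flat` are hypothetical helpers and are not taken from the real test.

```rust
// Hypothetical cost model: one scan unit per fragment, one fragment per commit.

/// Simulate `depth` commits, optionally compacting before measuring, and
/// return the modeled scan cost (the number of fragments left to read).
pub fn measured_scan(depth: usize, compact_first: bool) -> usize {
    let mut fragments = 0usize;
    for _ in 0..depth {
        fragments += 1; // each commit appends one fragment
    }
    if compact_first && fragments > 1 {
        fragments = 1; // compaction merges everything into one fragment
    }
    fragments
}

/// Hypothetical flatness predicate: the deep measurement may exceed the
/// shallow one by at most `slack` units.
pub fn is_flat(shallow: usize, deep: usize, slack: usize) -> bool {
    deep <= shallow + slack
}
```

Under this model the compacted measurement is identical at depth 10 and depth 100, while the uncompacted one grows with depth. The figures quoted in the message (4->2 versus 34->214 for `__manifest`) pass and fail the same predicate respectively.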
* docs: RFC-013 step 2 internal-table compaction landed
- invariants.md: close the compaction half of the read-path-rederivation known
gap (optimize now compacts the internal tables; cleanup half still deferred).
- maintenance.md: optimize covers __manifest/_graph_commits (no publish, no
sidecar); not yet in cleanup.
- rfc-013 §9: split step 2 into 2a (compaction, landed) and 2b (cleanup + Q8
watermark, deferred — debated; MTT-overlap + hot-path liability).
- testing.md: the internal-table LOCK is now green on every PR.
* fix(engine): guard absent _graph_commits + always compact internal tables
Addresses PR #291 review findings:
- Greptile (P1): optimize unconditionally opened `_graph_commits` for compaction,
but a graph can validly have none (the coordinator opens it as `Option`, gated on
`storage.exists`, for graphs predating the commit graph). `Dataset::open` on the
absent table errored and failed the whole optimize. Guard the `_graph_commits`
compaction with the same `storage_adapter().exists()` check the coordinator uses;
`__manifest` always exists so it stays unguarded. Regression test
`optimize_tolerates_absent_graph_commits_table` (empty graph so no publish
recreates the table before the guard).
- Cursor (low): the `table_tasks.is_empty()` early return skipped internal-table
compaction for a schema with no node/edge types. Removed it so the internal
tables are compacted regardless of the data-table set.
- Codex (auto-cleanup, P1): documented. `compact_files` commits with a default
`CommitConfig` (no skip_auto_cleanup) and `CompactionOptions` exposes no override,
so on a graph whose stored auto_cleanup config is enabled, the commit would
trigger version GC. Both internal tables are created with `auto_cleanup: None`,
so new graphs are safe; the only exposure is pre-fix upgraded graphs, which is
identical to the existing data-table optimize path, with step 2b's watermark as
the comprehensive guard. Added a comment in `compact_internal_table` recording
this.
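
The first two review fixes (the existence guard and the removed early return) can be sketched as a planning function. This is a minimal sketch under stated assumptions: `Storage`, `internal_tables_to_compact`, and `plan_optimize` are hypothetical names, and the real code checks existence through the engine's storage adapter rather than an in-memory set.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for the storage adapter; only `exists` is modeled.
pub struct Storage {
    tables: HashSet<String>,
}

impl Storage {
    pub fn new(tables: &[&str]) -> Self {
        Storage { tables: tables.iter().map(|t| t.to_string()).collect() }
    }

    pub fn exists(&self, table: &str) -> bool {
        self.tables.contains(table)
    }
}

/// Internal tables to compact: `__manifest` is unguarded because it always
/// exists; `_graph_commits` is included only when it is present.
pub fn internal_tables_to_compact(storage: &Storage) -> Vec<&'static str> {
    let mut out = vec!["__manifest"];
    if storage.exists("_graph_commits") {
        out.push("_graph_commits");
    }
    out
}

/// Full compaction plan. There is deliberately no early return when
/// `data_tables` is empty, so the internal tables are always planned.
pub fn plan_optimize(storage: &Storage, data_tables: &[&str]) -> Vec<String> {
    let mut plan: Vec<String> = data_tables.iter().map(|t| t.to_string()).collect();
    plan.extend(internal_tables_to_compact(storage).into_iter().map(String::from));
    plan
}
```

For a graph with no `_graph_commits` table and no node/edge types, the plan still contains `__manifest` and nothing else, which is the case the regression test is described as covering.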
* fix(engine): retry publish on RetryableCommitConflict (compaction vs publish)
Step 2 compacts `__manifest` with no app-level lock (Lance OCC arbitrates,
validated against LanceDB + the lance-7.0.0 conflict resolver). compact_files'
`Operation::Rewrite` auto-retries 20x (CommitConfig default num_retries=20), so a
live publish usually wins the race and the compaction rebases. But the publish
runs its merge-insert with conflict_retries(0), i.e. a single rebase attempt; if
the compaction commits first AND the merge touched a fragment the Rewrite
rewrote, Lance preempts the publish with `Error::RetryableCommitConflict`, a
DIFFERENT variant from the row-level `TooMuchWriteContention` the publisher
already retries.
Left unhandled, that surfaces a transient error to the caller, i.e. a maintenance
compaction (physical op) failing a live write (logical op) — invariant 7.
Map `LanceError::RetryableCommitConflict` to a new
`ManifestConflictDetails::RetryableCommitConflict` and treat it as retryable in the
publisher's outer loop (reload fresh state + re-merge), alongside
RowLevelCasContention. `ExpectedVersionMismatch` still propagates (a genuine
expectation break must not be blindly retried). This also hardens multi-process
concurrent writers generally, not just compaction.
Normal publishes are insert-only (new object_ids -> new fragments, disjoint from
rewritten old ones), so the conflict is rare; the guard covers the
same-fragment-update edge and multi-process writers. Unit tests in publisher.rs
pin the mapping + the retry-predicate contract.
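
The retry-predicate contract described in this entry (which the next entry reverts) can be sketched as follows. The variant names come from the text; the enum layout, `is_retryable`, and `publish_with_retry` are hypothetical simplifications, and the real outer loop also reloads fresh state and re-merges between attempts.

```rust
// Hypothetical simplification of the conflict classification described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ManifestConflictDetails {
    RowLevelCasContention,
    RetryableCommitConflict,
    ExpectedVersionMismatch,
}

/// Retry predicate: both contention variants are retryable, while a genuine
/// expectation break must propagate to the caller.
pub fn is_retryable(details: ManifestConflictDetails) -> bool {
    matches!(
        details,
        ManifestConflictDetails::RowLevelCasContention
            | ManifestConflictDetails::RetryableCommitConflict
    )
}

/// Outer publish loop: calls `attempt` with the zero-based attempt number,
/// retrying retryable conflicts until `max_attempts` is exhausted.
/// At least one attempt is always made.
pub fn publish_with_retry<F>(
    max_attempts: usize,
    mut attempt: F,
) -> Result<u64, ManifestConflictDetails>
where
    F: FnMut(usize) -> Result<u64, ManifestConflictDetails>,
{
    let mut n = 0;
    loop {
        match attempt(n) {
            Ok(version) => return Ok(version),
            // The real publisher would reload fresh state and re-merge here.
            Err(e) if is_retryable(e) && n + 1 < max_attempts => n += 1,
            Err(e) => return Err(e),
        }
    }
}
```

An attempt that fails twice with `RetryableCommitConflict` and then succeeds returns the successful version, whereas `ExpectedVersionMismatch` is returned after a single call.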
* revert: publisher RetryableCommitConflict handling (it was the wrong side)
Reverts
| File |
|---|
| architecture.md |
| branch-protection.md |
| bug-case-fix.md |
| ci.md |
| cluster-axioms.md |
| cluster-config-implementation-spec.md |
| cluster-config-specs.md |
| docs-issues.md |
| execution.md |
| index.md |
| invariants.md |
| lance.md |
| merge.md |
| rfc-001-queries-envelope-mcp.md |
| rfc-002-config-cli-architecture.md |
| rfc-003-mcp-server-surface.md |
| rfc-004-cluster-graph-schema-apply.md |
| rfc-005-server-cluster-boot.md |
| rfc-007-operator-config.md |
| rfc-008-deprecate-omnigraph-yaml.md |
| rfc-009-unify-access-paths.md |
| rfc-010-cli-planes-restructure.md |
| rfc-011-cli-refactoring.md |
| rfc-012-embedding-provider-config.md |
| rfc-013-write-path-latency.md |
| schema-lint-v1-plan.md |
| testing.md |
| writes.md |