omnigraph

mirror of https://github.com/ModernRelay/omnigraph.git synced 2026-06-24 02:38:06 +02:00

Ragnor Comerford 6d4606a830 fix(engine): optimize survives a cross-process write race on the same table (#297)

* test(engine): cross-process optimize-vs-write race - RED

  Two regression tests for the prod bug: a direct `optimize` process racing a served write on the same table fails, because the in-process write queue does not serialize across processes and the data-table optimize path has no retry.

  - optimize_survives_concurrent_insert_advancing_manifest: a concurrent insert advances the manifest while optimize is paused between compact and publish; optimize's equality-CAS publish then fails "expected X but current Y".
  - optimize_survives_concurrent_delete_before_compaction: a concurrent delete commits before optimize compacts; Lance rebases the compaction past it cleanly, so optimize again fails the publish CAS (the genuine Lance Rewrite-vs-Rewrite overlap is rarer and shares the internal path's retry).

  Both fail today with ExpectedVersionMismatch. Adds the `optimize.before_compact` failpoint seam + a wait_for_sidecar helper; serializes the optimize failpoint tests (shared failpoint name). The fix lands next.

* fix(engine): optimize survives a cross-process write race on the same table

  The data-table optimize path trusted the in-process write queue and skipped a retry, so a CLI `optimize` racing a served write (separate processes = separate queues) failed: either the Lance Rewrite lost ("preempted by concurrent Update") or the manifest publish lost the strict equality CAS ("expected X but current Y").

  Unify both compaction paths on the internal path's reopen+replan shape, with a two-level retry that matches the two failure points:

  - Outer loop (reopen+replan): a genuine Lance Rewrite-vs-Update/Delete same-fragment conflict means our compaction did not commit, so reopen at the new HEAD and re-plan. Lance rebases the common disjoint case (a concurrent insert/delete on other fragments) for free, so this fires only on real overlap.
  - Inner loop (Phase C, monotonic publish): the manifest advanced between our compaction and our publish. The compaction is already committed at Lance HEAD N, so we must NOT reopen (that trips the HEAD>manifest drift guard on our own work). Re-read the current manifest version C: if C >= N the manifest already includes our compaction (versions are linear), so this is a no-op; else fast-forward to N. Monotonic, not the strict equality CAS that manufactured the conflict.

  The Phase-A sidecar is written once and reused across reopen attempts (every Phase-B commit is content-preserving, so recovery rolls the observed HEAD forward or safely rolls the compaction back). The in-process queue is kept; it is now an in-process contention reducer, not the cross-process correctness guard.

  Shares the COMPACTION_RETRY_BUDGET constant + is_retryable_lance_conflict with the internal path; adds is_retryable_manifest_conflict for the publish loop. No writer_epoch. Turns the prior commit's two race tests green.

* docs(rfc-013): two-op-class principle + the found+fixed optimize-vs-write race

  §6.6 records the maintenance vs logical op-class distinction (maintenance commutes → Lance rebase + reopen/replan + monotonic manifest fast-forward, no writer_epoch; logical → strict cross-process OCC + epoch) and the prod optimize-vs-served-write race that motivated it, now landed. Adds the matching mechanic row to §4.2.

* fix(engine): retry must not misclassify optimize's own HEAD drift

  Review catch on the cross-process optimize fix: the outer retry loop re-ran the `lance_head > manifest` drift guard every iteration. After a partial Phase-B commit (the auto_cleanup strip or compaction commits, then a later op hits a retryable conflict), the reopened attempt saw HEAD ahead of the manifest (from OUR own sidecar-covered work, not an external writer) and deleted the sidecar + returned `skipped_for_drift`, stranding uncovered drift that then needs `repair`.

  Track `head_advanced` (did one of our Phase-B ops already commit). The drift guard now fires only when `!head_advanced` (genuine pre-existing external drift); once we have advanced HEAD, a reopened HEAD>manifest is our work that the monotonic publish fast-forwards. The no-op early-return likewise publishes prior committed work instead of dropping it when `head_advanced`.

  Regression test `optimize_retry_does_not_misclassify_own_head_drift` injects one retryable reindex conflict after the compaction commits (new `optimize.inject_reindex_conflict` seam); red→green verified by negative control (reverting the gate reproduces `skipped_for_drift: Some(DriftNeedsRepair)`).

  Also de-flake `optimize_survives_concurrent_insert_advancing_manifest`: pause at `before_compact` (not post-compact) so the concurrent insert lands while HEAD == manifest; otherwise it could race optimize's committed-but-unpublished compaction and hit the write-path "HEAD ahead of manifest" guard.

* fix(engine): optimize publish converges on retry-budget exhaustion

  Review catch (greptile): the monotonic Phase-C publish loop returned an error on its final iteration's retryable manifest conflict, even though that conflict can itself mean a concurrent writer published a version that already includes our (content-preserving) compaction, i.e. the postcondition ("the manifest reflects our compaction") is already met. Recovery covered it (no data loss), but the operator saw a spurious error and had to re-run.

  Restructure the loop to re-read `current` on every retryable conflict and, on budget exhaustion, do a final `current >= state.version` convergence check before surfacing the error: the §6.6 "postcondition is the state, not winning the CAS" principle. Factor the repeated current-version read into `current_manifest_version`.

2026-06-22 13:05:28 +02:00
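The monotonic Phase-C publish described in the commit message above can be illustrated with a small, self-contained sketch. This is not the engine's code: `Manifest`, `ManifestConflict`, `publish_monotonic`, the `before_cas` hook, and the budget value of 3 are all hypothetical names and values chosen for illustration, and the toy manifest tracks a single integer version. It only shows the shape of the idea: re-read the current version on every attempt, treat `current >= committed_head` as success, and run one last convergence check when the retry budget runs out.

```python
class ManifestConflict(Exception):
    """Hypothetical stand-in for a retryable manifest publish conflict."""


class Manifest:
    """Toy in-memory manifest holding one table's published version."""

    def __init__(self, version):
        self.version = version

    def current_version(self):
        return self.version

    def compare_and_set(self, expected, new):
        # Strict equality CAS: fails if anyone published in between.
        if self.version != expected:
            raise ManifestConflict(
                f"expected {expected} but current {self.version}"
            )
        self.version = new


RETRY_BUDGET = 3  # placeholder value, not the engine's constant


def publish_monotonic(manifest, committed_head, budget=RETRY_BUDGET,
                      before_cas=None):
    """Make the manifest reflect `committed_head` without reopening.

    `before_cas` is a test-only hook (an assumption of this sketch) that
    runs between the version read and the CAS, to simulate a racing writer.
    """
    for _ in range(budget):
        current = manifest.current_version()
        if current >= committed_head:
            # Versions are linear, so our compaction is already included.
            return "already_included"
        if before_cas is not None:
            before_cas()
        try:
            manifest.compare_and_set(current, committed_head)
            return "fast_forwarded"
        except ManifestConflict:
            continue  # re-read the current version and try again
    # Budget exhausted: the postcondition is the state, not winning the CAS.
    if manifest.current_version() >= committed_head:
        return "already_included"
    raise ManifestConflict("retry budget exhausted before convergence")
```

In this sketch a racing writer that publishes a version at or beyond the compaction's HEAD turns the final conflict into a success rather than an error, which is the behaviour the last squashed commit describes.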
architecture.md	feat!: delete the legacy OmnigraphConfig + config migrate; finish the omnigraph.yaml docs sweep (#252)	2026-06-15 22:31:29 +03:00
branch-protection.md	chore: remove CODEOWNERS chassis and the code-owner review gate	2026-06-18 02:55:27 +03:00
bug-case-fix.md	fix(engine): preserve identifier case in filter pushdown (#283) (#285)	2026-06-19 18:42:56 +03:00
ci.md	chore: remove CODEOWNERS chassis and the code-owner review gate	2026-06-18 02:55:27 +03:00
cluster-axioms.md	docs(cluster): axiom 15 — single ownership, mode-switch migration, per-operator layer (#164)	2026-06-10 00:44:51 +03:00
cluster-config-implementation-spec.md	docs(cluster): RFC-005 — server boots from cluster state (Phase 5 design) (#174)	2026-06-10 15:22:12 +03:00
cluster-config-specs.md	docs(user): restructure user docs into topic sections (Phase 1) (#223)	2026-06-14 13:52:14 +03:00
docs-issues.md	docs(dev): update coherence ledger — cookbooks drift resolved, omnigraph-ts mechanism (#294)	2026-06-21 00:11:48 +03:00
execution.md	fix(embedding): address PR review feedback (RFC-012 Phase 2)	2026-06-15 18:37:34 +02:00
index.md	docs(user): coherence cleanup aligned with 0.7.1 (#293)	2026-06-21 00:02:34 +03:00
invariants.md	(feat): compact the internal manifest/commit-graph tables in optimize (#291)	2026-06-21 16:38:20 +02:00
lance.md	test(engine): pin Lance 7 immutable-PK behavior + sharpen native-namespace alignment notes (#240)	2026-06-15 11:33:25 +02:00
merge.md	docs: split user and developer docs (#93)	2026-05-15 03:45:22 +03:00
rfc-001-queries-envelope-mcp.md	docs(user): restructure user docs into topic sections (Phase 1) (#223)	2026-06-14 13:52:14 +03:00
rfc-002-config-cli-architecture.md	docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus	2026-06-12 17:33:11 +03:00
rfc-003-mcp-server-surface.md	Stored-query registry foundation + config/CLI RFC-002 (#128)	2026-06-01 22:50:31 +02:00
rfc-004-cluster-graph-schema-apply.md	docs(cluster): document Stage 4C — Phase 4 complete	2026-06-10 14:44:12 +03:00
rfc-005-server-cluster-boot.md	fix(cluster): stop cluster-apply crash-loops from the recovery-sidecar trap (#284)	2026-06-19 03:34:15 +03:00
rfc-007-operator-config.md	docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus	2026-06-12 17:33:11 +03:00
rfc-008-deprecate-omnigraph-yaml.md	docs(rfc): RFC-009 — unify CLI access paths; align the RFC corpus	2026-06-12 17:33:11 +03:00
rfc-009-unify-access-paths.md	feat: canonical POST /load, deprecate /ingest (RFC-009 Phase 5) (#222)	2026-06-14 03:32:16 +03:00
rfc-010-cli-planes-restructure.md	docs(rfc): RFC-010 — apply verification-comment current-state fixups (#215)	2026-06-13 22:24:09 +03:00
rfc-011-cli-refactoring.md	feat(cli): add read-only `profile list` / `profile show` (RFC-011 D8) (#255)	2026-06-15 23:33:01 +03:00
rfc-012-embedding-provider-config.md	Wire cluster embedding providers	2026-06-16 04:02:08 +03:00
rfc-013-write-path-latency.md	fix(engine): optimize survives a cross-process write race on the same table (#297)	2026-06-22 13:05:28 +02:00
schema-lint-v1-plan.md	schema-lint chassis v1.0: DropProperty Soft + code-tagged diagnostics (MR-694) (#90)	2026-05-16 16:30:03 +03:00
testing.md	(feat): compact the internal manifest/commit-graph tables in optimize (#291)	2026-06-21 16:38:20 +02:00
writes.md	fix(engine): stop branch-merge fast-forward OOM on embedding tables (#277)	2026-06-19 00:15:06 +02:00