Commit graph

182 commits

Author SHA1 Message Date
Ragnor Comerford
8726ffe0a3
release: bump version to 0.4.1 2026-05-02 23:20:50 +02:00
Andrew Altshuler
e041130de3
docs: rename omnigraph-starters to omnigraph-cookbooks (#71)
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 23:37:09 +03:00
Ragnor Comerford
2e20f4d69f
Merge pull request #70 from ModernRelay/ragnorc/mr793-phases-1-6
TableStorage trait + staged-write surface
2026-05-02 20:05:52 +02:00
Ragnor Comerford
151a1798b5
runs: enumerate inline-commit residuals on TableStorage as a residuals matrix
Closes MR-793 acceptance §1 via option (b): every inline-commit method
remaining on the trait surface is named, the upstream blocker or
internal phase that closes it is cited, and the call-site residual
comment is mandated.

Reframes the criterion text in the MR-793 ticket comment from "either
full sealing OR all residuals enumerated" — this commit ships the
"enumerated" path. The "full sealing" path (Phase 1b + Phase 9 + the
two Lance upstream tickets) closes the matrix entirely.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:46:07 +02:00
Ragnor Comerford
c9a81266e4
lance: confirm MemWAL is opt-in, intra-table, no overlap with MR-847
Fetched https://lance.org/format/table/mem_wal/ in full via npx mdrip.
The "Overview / Details / Implementation" sidebar items turned out to
be anchor sections on the same URL, not separate pages.

Key findings (relevant to MR-847's recovery reconciler design):

* MemWAL is opt-in. Requires (1) unenforced primary key in schema,
  (2) explicit shard config, (3) writers using the LSM-tree write
  path. omnigraph does NOT enable it; we use direct write_fragments +
  commit(Operation::Append).

* MemWAL is intra-table — addresses streaming-write throughput for
  one Lance base table via MemTables → flushed MemTables → async
  merge. It does not coordinate across multiple tables.

* MemWAL's recovery is intra-table: WAL replay reconstructs MemTable
  state for one table. It does NOT help with omnigraph's cross-table
  manifest-pinned-vs-Lance-HEAD drift class.

Conclusion: MR-847's recovery reconciler design is unaffected. The
two operate at different abstraction layers.

Borrowable: MemWAL's epoch-based fencing pattern is structurally
similar to a future multi-coordinator sidecar protocol; noted on
MR-847 for if MR-668 (multi-process) ever lands.
2026-05-02 19:44:37 +02:00
Ragnor Comerford
5afde54d69
agents: stop overclaiming atomic multi-table publish — describe the three layers honestly
External reviewer flagged that the capability matrix's "Atomic
multi-dataset publish" cell implied Lance gives us a single primitive
for cross-table atomicity. It doesn't. The real contract is three
layers stacked:

  (1) per-table Lance `commit_staged` for the data write
  (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for
      cross-table ordering
  (3) recovery-on-open reconciler for the residual gap between (1)
      and (2) — NOT YET SHIPPED, tracked in MR-847.

Until MR-847 lands, a failure between per-table `commit_staged` and
the manifest publish leaves drift on the partially-committed tables
(the "Phase B → Phase C residual" documented in `docs/runs.md`).

Also enumerate the legacy inline-commit residuals (`append_batch`,
`merge_insert_batches`, `overwrite_batch`, `create_*_index`) alongside
`delete_where` and `create_vector_index` — they remain on the trait
pending Phase 1b call-site conversion + Phase 9 demotion.

End the row with an explicit DO NOT: future agents reading the
capability matrix should not describe atomicity as "fully upheld"
until MR-847 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:35:34 +02:00
Ragnor Comerford
bf7d716d1b
forbidden_apis: add .restore( and document why .append( / .delete( are excluded
Cubic flagged that the guard misses `ds.append() / ds.delete() / ds.restore()`.

`.restore(` is added — Lance-specific (no false positives in the
workspace).

`.append(` and `.delete(` stay excluded with a documenting comment:
* `.append(` over-matches `Vec::append`, `String::append`, every
  `arrow_array::xxxArrayBuilder::append` (30+ legit uses across
  `exec/mutation.rs`, `loader/jsonl.rs`, `exec/projection.rs`).
* `.delete(` over-matches `ObjectStore::delete` (used in `storage.rs`,
  `db/schema_state.rs`, `db/omnigraph.rs:1277` for staging-file
  cleanup) and would require many `// forbidden-api-allow:` sentinels
  for legitimate uses.

The remaining bypass route — engine code that imports `lance::Dataset`
and calls `ds.append(reader, params)` — is bounded by:
1. The trait surface itself (sealed, only-callable-via-trait once
   Phase 1b call-site conversion completes).
2. The PR-review process catching new `lance::Dataset` imports in
   non-storage files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:25:52 +02:00
Ragnor Comerford
9b0920b5da
address PR #70 bot review (Cubic + Cursor): 7 inline + failpoint test + invariants notes
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
  / `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
  `Dataset::add_columns` / `update_columns` / `drop_columns` /
  `truncate_table` / `restore` and the bare `.merge_insert(` /
  `.add_columns(` / `.update_columns(` / `.drop_columns(` /
  `.truncate_table(` method patterns. Deliberately avoid `.append(` /
  `.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
  arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
  `graph_coordinator.rs` — they're manifest-layer infra that legitimately
  uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
  `None`) to `open_dataset_head_for_write` for consistency with the
  sibling `indexed_tables` block. Schema apply rejects non-main
  branches at the lock-acquire step today, so behavior is unchanged;
  this is a defensive consistency fix that survives a future relaxation
  of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
  claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
  doc: softened the "stage_* + commit_staged are the only paths" /
  "trait funnels every write" overclaim. Inline-commit residuals
  (`delete_where`, `create_vector_index`) remain on the trait pending
  upstream Lance work (#6658, #6666); legacy `append_batch` etc.
  remain pending Phase 1b / Phase 9. Module doc now describes the
  current transitional state honestly.

Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
  but returned only `DeleteState`, dropping the post-delete dataset.
  Future callers migrating from the inherent `&mut Dataset` API would
  lose the post-delete dataset state needed for indexing /
  `table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
  matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
  the unused `Scanner` import (the marker existed only to suppress an
  unused-import warning — fixing the import is the cleaner answer).

Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
  with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
  between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
  triggers the failpoint via a schema-apply that adds a new node type;
  proves that existing tables are unaffected (Person mutation succeeds
  after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
  drift on tables outside the failed `added_tables` iteration.

`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
  writer-trait surface for inserts / updates / scalar-index builds /
  merge_insert / overwrite after MR-793 PR #70. Per-table
  commit_staged → manifest publish window remains; closing requires
  MR-847's recovery-on-open reconciler. `delete_where` and
  `create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
  primitives are the building blocks; the reconciler task itself is
  MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
  impl on `TableStore` with opaque-handle signatures; no test impl
  yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:47:07 +02:00
Ragnor Comerford
b87be5e9f0
agents: read every Lance page even slightly relevant, not just the obvious match
Behavior is interlocked across Lance pages — transactions reference
index lifecycle, index lifecycle references compaction, compaction
references row-id lineage. Skipping a "slightly relevant" page is how
alignment misses happen. The index alone is not a substitute for
reading the pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:44:41 +02:00
Ragnor Comerford
17bf978d0e
MR-793 follow-up: lance docs alignment audit + mandate full-page fetch via mdrip
* AGENTS.md / docs/lance.md: agents must use `npx mdrip` (not summarizing
  WebFetch) when consulting Lance docs. WebFetch routinely drops
  load-bearing details — `pub(crate)` blockers, sub-specs behind nav hubs,
  default flags. Lesson learned during the MR-793 alignment audit.
* docs/lance.md: add "Last alignment audit: 2026-05-02" stanza
  documenting MemWAL gap, lance#6666 companion ticket, stable-row-ID
  status (experimental, may unblock MR-848), FRI as documented
  compaction-friendly alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:41:32 +02:00
Ragnor Comerford
3135ff5d19
MR-793 phases 1-6: TableStorage trait + staged-write surface for engine writers
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.

Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
  sealed (only in-tree types can impl), single impl on `TableStore`
  delegating to existing inherent methods; `Omnigraph::storage()`
  accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
  `stage_create_btree_index`, `stage_create_inverted_index` —
  implementing the simple branch of Lance's `CreateIndexBuilder::execute`
  (scalar indices only; vector indices stay inline because
  `build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
  Six new tests in `tests/staged_writes.rs` pin both the new primitives
  and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
  walks engine source, fails on direct lance::* inline-commit API use
  outside `table_store.rs` / `db/manifest/`. Skips comment lines and
  honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
  `stage_create_*_index` + `commit_staged` instead of
  `create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
  named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
  merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
  delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
  use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
  inline because `InsertBuilder::execute_uncommitted` rejects empty
  data. The narrow inline window is bounded by `__schema_apply_lock__`.

Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
  (incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
  — 5 tests, all green.
* `cargo test --workspace` — green.

Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
  (mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
  residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
  removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
  (depends on Phase 1b).

Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
  vector-index builds can be staged outside the lance crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:03:15 +02:00
Ragnor Comerford
6f60c0cbcf
Merge pull request #68 from ModernRelay/ragnorc/mr794-rewire
MR-794 step 2: in-memory accumulator rewire for mutate_as + load
2026-05-01 23:06:10 +02:00
Ragnor Comerford
044ed46019
chore: scrub Linear ticket numbers and review-bot mentions from code comments
OmniGraph is OSS; internal Linear ticket references and code-review-bot
mentions in source-code comments don't help external readers and leak
internal tooling. Replace ticket numbers (MR-XXX) with descriptive
prose, drop linear.app URLs, and remove inline mentions of
Cursor/Bugbot/Cubic/Codex review threads.

Scope is limited to source-code comments (`crates/`). Docs under
`docs/` keep their MR-XXX references — those are part of the
established change-history narrative for in-repo docs and don't
require a Linear account to find context for.

No behavior changes; no public API changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 22:45:38 +02:00
Ragnor Comerford
ea16c74329
MR-794 step 2: address Cursor Bugbot follow-ups on commits 3223b51 + 052b6e6
Four code/doc fixes from the latest Cursor Bugbot pass:

* **Misplaced doc comment in table_store.rs (Medium):** the doc block
  intended for `scan_pending_batches` was, after my earlier edit,
  attached to `collect_string_column_values` because the new helper
  was inserted between the original docblock and `scan_pending_batches`.
  Move the docblock back onto its function and add a note about the
  shared SQL-dialect contract with the Lance scanner (the predicate
  goes to both, which is fine for `predicate_to_sql`'s plain comparison
  shapes today; future Lance-specific scanner extensions in the filter
  would need translation).

* **Missing null check on committed `id` column (Low):** the
  committed-side loop in `collect_node_ids_with_pending` (and the
  parallel non-pending `collect_node_ids`) read `id_col.value(i)`
  without `is_valid(i)` first. `id` is the @key column on every node
  type and non-nullable by schema, so this is unreachable today, but
  the inconsistency with the pending-side `is_valid` guard is worth
  closing for symmetry / defense.

* **Misleading comment in count_pending_src_with_dedupe (Low):** the
  comment claimed "fall back to naive counting" but the code did
  `continue`. Fix: it's unreachable in practice (the pending-side
  schema always contains the key when the caller passes one), so
  failing loudly with a typed error if it ever does fire is correct
  — silently skipping the batch would let `@card` violations slip
  past validation.

* **PendingTable.schema mismatch surfaces too late (Medium):**
  PendingTable captures the schema from the first batch and never
  updates it. On a blob-bearing table, `insert` produces a full-schema
  batch and `update` (without assigning every blob) produces a
  subset-schema batch. Pre-fix the mismatch surfaced inside
  finalize/MemTable construction — distant from the offending op.
  Post-fix `MutationStaging::append_batch` validates the new batch's
  schema against the existing accumulator's schema and returns a
  typed error directing the caller to split the mutation. Error
  fires at the offending op, not at end-of-query. New helper
  `schemas_compatible` compares field name + data_type pairs;
  nullability and field metadata differences stay tolerated (downstream
  concat already permits those).

Cubic Cursor Bugbot finding #5 (cascade delete edge re-open) self-resolved
in the bot's own analysis ("logic appears sound on re-examination") —
no action.

New test on tests/runs.rs:

* append_batch_rejects_mismatched_schema_in_blob_table_at_offending_op
  — pins the early-error path. Builds a blob-bearing schema, runs an
  `insert + update` query where the update doesn't assign the blob,
  asserts the error fires at the second op with the "Split the
  mutation" message and the manifest is unchanged.

Local: tests/runs.rs 24/24 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:50:13 +02:00
Ragnor Comerford
675568ce85
ci: fold failpoints test into Test Workspace job
The standalone test_failpoints_feature job took 21min on first run
(cold cache; the omnigraph-engine crate has lance + datafusion deps
that make any fresh build expensive). Folding into Test Workspace
shares the warm cache so the failpoints invocation is incremental —
~30s vs 21min on subsequent runs, and within the workspace job's
existing budget.

The failpoints feature is gated behind a Cargo flag and only adds
the small `fail` crate dep + a few feature-gated code paths; it
doesn't change the dep tree of any other crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 21:15:14 +02:00
Ragnor Comerford
052b6e680f
MR-794 step 2: address PR #68 follow-up review (Cubic) — pending dedupe + projection guard + CI
Three new findings from Cubic on commit 3223b51:

* **Pending edge cardinality counted within-input duplicates** (P2):
  count_src_per_edge's pending walk added every row to the count,
  including duplicate rows that finalize will collapse via
  dedupe_merge_batches_by_id. A LoadMode::Merge with the same edge id
  twice would over-count → spurious @card violation. Fix: when
  dedupe_key_column is Some, walk pending in reverse, track seen keys
  via HashSet, count only the kept (last-occurrence) rows. Mirrors
  finalize-time dedupe so cardinality counts what stage_merge_insert
  actually publishes.

* **scan_with_pending silently disabled merge-shadow when projection
  omitted key_column** (P2): if a caller passed Some("id") as
  key_column but their projection didn't include "id", the
  filter_out_rows_where_string_in helper passed batches through
  unchanged — silently degrading to union semantics. Fix: validate
  up front that projection contains key_column when both are Some;
  return a typed Lance error otherwise. Tightened the helper too:
  missing column is now an internal error (was a silent passthrough).

* **Cascade-vs-explicit delete test was too weak** (P2): asserted
  only that edge count decreased after delete. The cascade alone
  could satisfy that even if the explicit second-delete silently
  no-op'd. Strengthened: assert post_knows == 0, which only holds
  when both ops landed (Bob→Diana would survive if op-2 no-op'd).

CI gap: also added test_failpoints_feature job to .github/workflows/ci.yml.
The workspace test runs without --features failpoints (the feature is
behind a Cargo flag), so the failpoints test suite was never exercised
by CI before now. The new job builds + runs
`cargo test -p omnigraph-engine --features failpoints --test failpoints`
on every full CI run, mirroring the test_aws_feature pattern.

New tests on tests/runs.rs:

* load_merge_mode_dedupes_within_pending_for_cardinality_count
  (Cubic P2 #2 — pending-vs-pending dedup, distinct from the
  load_merge_mode_dedupes_edge_for_cardinality_count test which
  covers committed-vs-pending dedup).
* scan_with_pending_rejects_key_column_missing_from_projection
  (Cubic P2 #3 — verifies the up-front validation rejects bad
  callers and that the happy path still works correctly).

Local test results:

* tests/runs.rs: 23/23 passed
* tests/failpoints.rs --features failpoints: 7/7 passed (includes the
  two new finalize→publisher residual tests landed in 3223b51).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 20:47:45 +02:00
Aaron Goh
68c9d1bc91 docs: add community Slack link
Add a Community section to the README with the Omnigraph Slack invite so contributors and users have a clear place to ask questions and share feedback.
2026-05-01 14:39:21 +02:00
Ragnor Comerford
3223b51cf1
MR-794 step 2: address PR #68 review — merge semantics, cardinality, residual
Five fixes from PR #68 review (Cursor Bugbot + Codex + Cubic):

* **scan_with_pending gains merge-shadow semantics** (Codex P1, Cubic P1#1):
  new `key_column: Option<&str>` parameter. When set, committed rows
  whose key value appears in any pending batch are excluded from the
  scan — making `scan_with_pending` correctly merge-semantic for chained
  updates instead of naively unioning. execute_update calls with
  Some("id"). Without this, a chained `update where age > 30` could
  match a row whose pending value already moved out of range.

* **Multi-delete on same table no longer trips ExpectedVersionMismatch**
  (Cursor Bugbot HIGH): open_table_for_mutation routes through
  reopen_for_mutation when staging.inline_committed has the table,
  using the post-inline-commit Lance version captured at record_inline
  time. The legacy open_for_mutation_on_branch fence (Lance HEAD ==
  manifest pinned) is correct cross-writer but wrong intra-query when
  deletes have already advanced HEAD on this table. Branch goes away
  when Lance ships two-phase delete (lance-format/lance#6658).

* **Cardinality validation consolidated** (Cursor LOW + Codex P2 +
  Cubic P1#2 + Cubic P2): new exec/staging::count_src_per_edge +
  enforce_cardinality_bounds shared by mutation and loader paths.
  Restores the missing min-cardinality check on the engine path.
  Loader Merge mode passes Some("id") to dedupe edges being updated
  by id (not double-count committed + pending). Loader Append mode
  and engine path pass None (ULID-generated ids never collide).

* **Dead count_rows_with_pending removed** (Cursor LOW): never called.

* **Misleading concat-helper comment fixed** (Cubic P3): claimed
  schema normalization the helper doesn't implement. Updated to match
  reality.

* **Documentation honesty** (Cubic P1#3): MR-794 narrows but doesn't
  eliminate the "Lance HEAD ahead of __manifest" drift class. Drift is
  unreachable for op-execution failures (the partial_failure test pins
  this), but a residual remains at the finalize→publisher boundary
  because Lance has no multi-dataset commit primitive: per-table
  commit_staged calls run sequentially before manifest commit. Updated
  docs/runs.md, docs/invariants.md §VI.25, docs/releases/v0.4.1.md to
  scope the claim precisely.

* **Failpoint test pinning the residual**: new
  mutation.post_finalize_pre_publisher failpoint + two tests in
  tests/failpoints.rs that confirm the documented residual behavior.
  Catches future regressions that widen the residual.

Test additions on tests/runs.rs:

* chained_updates_with_overlapping_predicate_respects_intermediate_value
* multi_statement_delete_on_same_node_table
* cascade_delete_node_then_explicit_delete_edge_on_same_table
* mutation_insert_edge_enforces_min_cardinality
* load_merge_mode_dedupes_edge_for_cardinality_count

113/113 engine integration tests pass (runs + end_to_end + consistency
+ staged_writes + validators). Failpoints feature build runs in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:47:55 +02:00
Ragnor Comerford
a61e82f47a
MR-794 step 2: docs — runs/invariants/architecture/execution + cleanup
Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.

MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
  section; document the in-memory accumulator + D₂ rule + the
  LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
  upheld for inserts/updates. Within-query read-your-writes is now
  load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
  accumulator (MR-794)" subsection with per-op flow; refresh the
  engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
  for the staged-write path; updated the LoadMode table to call
  out per-mode commit semantics; rewrote load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
  contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
  test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
  description and the L2 capability matrix row.

Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
  from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
  remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
  legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
  with omnigraph commit list/show.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:43:19 +02:00
Ragnor Comerford
82350bdc4a
MR-794 step 2: tests — flip partial_failure + add coalesce/D₂/load tests
* Flip partial_failure_observably_rolls_back_but_blocks_next_mutation_on_same_table
  → partial_failure_leaves_target_queryable_and_unblocks_next_mutation.
  Asserts the next mutation succeeds (no drift) instead of expecting
  the legacy ExpectedVersionMismatch.
* New: mutation_rejects_mixed_insert_and_delete_at_parse_time —
  D₂ check fires before any I/O.
* New: mixed_insert_and_update_on_same_person_coalesces_to_one_merge —
  verifies dedupe-by-id last-write-wins in MutationStaging::finalize.
* New: multiple_appends_to_same_edge_coalesce_to_one_append —
  two-statement edge insert publishes exactly once.
* New: multi_statement_inserts_publish_exactly_once — manifest
  version advances by exactly 1 per multi-statement query.
* New: load_with_bad_edge_reference_unblocks_next_load — RI
  violation aborts before publish; next load succeeds.
* New: load_with_cardinality_violation_unblocks_next_load — uses
  a custom WorksAt @card(0..1) schema; cardinality violation aborts
  before publish; next load succeeds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:43:00 +02:00
Ragnor Comerford
bbf610ea9b
MR-794 step 2: rewire loader for Append/Merge — Overwrite stays inline
load_jsonl_reader dispatches on mode: Append/Merge use the
MutationStaging accumulator (per-type batch staging, single stage_* +
commit_staged per touched table at end-of-load, publisher CAS).
Overwrite keeps the legacy concurrent inline-commit path
(truncate-then-append doesn't fit the staged shape; overwrite has no
in-flight read-your-writes requirement).

* New helpers collect_node_ids_with_pending and
  validate_edge_cardinality_with_pending_loader — loader analogs
  of the engine's pending-aware validators.
* Phase 2c (RI) and Phase 3 (cardinality) consult pending batches
  for Append/Merge so a mid-load failure aborts the load before any
  Lance write reaches HEAD.

A failed Append/Merge load no longer advances Lance HEAD on staged
tables — the next load on the same tables proceeds normally with no
ExpectedVersionMismatch. Overwrite mode's drift residual is unchanged
from today's behavior; documented in docs/runs.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:47 +02:00
Ragnor Comerford
e6f48ba24d
MR-794 step 2: rewire mutation.rs to in-memory accumulator + D₂
* Replace mutation.rs's MutationStaging.latest with the new
  pending + inline_committed shape from exec::staging. Inserts
  and updates push batches into pending; deletes still inline-commit
  via record_inline.
* Rewrite execute_insert, execute_update, execute_delete*,
  validate_edge_insert_endpoints, ensure_node_id_exists for the
  new shape. Edge cardinality validates against committed scan +
  in-memory pending walk (validate_edge_cardinality_with_pending).
* D₂ parse-time check: a query is either insert/update-only or
  delete-only. Mixed → reject before any I/O.
* Drop CoordinatorRestoreGuard and the swap_coordinator_for_branch /
  restore_coordinator dance from mutate_with_current_actor. Branch
  is threaded explicitly through execute_named_mutation and the
  per-op functions. (merge.rs keeps its own swap pattern.)
* apply_assignments updated to copy unassigned blob columns from
  the scan when present, enabling full-schema update batches; for
  blob-bearing tables we still project away the blob columns at scan
  time (Lance's filter pushdown panics otherwise) and accept the
  narrow-schema output for the v1 path.

A failed mid-query op no longer advances Lance HEAD on staged tables —
the next mutation proceeds normally with no ExpectedVersionMismatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:37 +02:00
Ragnor Comerford
cdfbccbfdc
MR-794 step 2: scaffold MutationStaging accumulator + scan_with_pending
Add the scaffolding for the in-memory staged-write rewire — no behavior
change yet:

* New crates/omnigraph/src/exec/staging.rs with MutationStaging,
  PendingTable, PendingMode, StagedTablePath, plus the end-of-query
  finalize() that issues one stage_* + commit_staged per pending
  table (Merge mode dedupes by id, last-write-wins).
* TableStore::scan_with_pending and count_rows_with_pending helpers —
  Lance scan committed + DataFusion MemTable scan pending, concat.
  Sidesteps the Scanner::with_fragments filter-pushdown limitation
  documented on scan_with_staged.
* Add datafusion = "52" to workspace + omnigraph-engine deps for
  MemTable (transitively pulled by Lance already).

Engine code still uses the legacy MutationStaging shape; the rewire
lands in subsequent commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:21 +02:00
Ragnor Comerford
facab010e5
Merge pull request #67 from ModernRelay/ragnorc/staged-writes-redo
MR-794 step 1: re-land staged-write primitives in TableStore
2026-05-01 01:13:47 +02:00
Ragnor Comerford
7c09220210
MR-794 step 1: fix u32 cast + pin scan_with_staged filter limitation
Two CI failures, both addressed:

(1) u32/u64 type mismatch in stage_append (compile error):
ds.manifest.max_fragment_id is Option<u32>, but Lance's Fragment::id
and the commit-time renumbering counter in
Transaction::fragments_with_ids operate on u64. Cast max_fragment_id
to u64 before the arithmetic.

(2) scan_with_staged_pushes_filter_through_committed_and_staged failed
because Lance's stats-based fragment pruning drops uncommitted staged
fragments from filtered scans — they lack the per-column statistics
that committed fragments carry. With filter `age >= 30` and a staged
dave (age=35), dave is silently absent from the result.
scanner.use_stats(false) does not bypass this in lance 4.0.0
(verified locally).

Rather than chase Lance internals further, document the limitation:
- stage_merge_insert / scan_with_staged docstring updated to flag the
  filter contract as incomplete on staged fragments.
- Test renamed to scan_with_staged_with_filter_silently_drops_staged_rows
  and flipped to assert the actual behavior, with a clear note pointing
  at the design pivot (.context/mr-794-step2-design.md §1.1) and
  instructions for whoever sees the assertion fail in the future.
- Test also asserts that unfiltered scan_with_staged returns all rows —
  confirms the issue is specifically filter pushdown, not fragment
  scanning per se.

The engine's MR-794 step 2+ design (in-memory pending-batch
accumulation + DataFusion MemTable for read-your-writes) sidesteps
this entirely; production code is unaffected. scan_with_staged stays
on the public surface for primitive-level testing and for callers
that don't need filter pushdown.

All 8 staged_writes tests + 10 runs + 63 end_to_end + consistency
green locally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 01:03:27 +02:00
Ragnor Comerford
85cfffeaf8
MR-794 step 1: assign real fragment IDs to staged appends
CI exposed the actual root cause behind the three staged_writes
test failures: Lance's InsertBuilder::execute_uncommitted produces
fragments with id=0 as a "Temporary ID" (lance-4.0.0
dataset/write.rs:1044, with the assertion at line 1712). Real IDs
get assigned at commit time by Transaction::fragments_with_ids
(transaction.rs:1456). Because we expose pre-commit fragments to
scan_with_staged via Scanner::with_fragments, two fragments collide
on id=0 in the combined list — the staged fragment with the seed
fragment, or two staged fragments with each other.

Lance's scanner mishandles the collision. Symptoms observed in
the three failing tests:
- chained_stage_appends: only 1 distinct _rowid (other fragments
  silently dropped)
- count_rows_with_staged_matches_scan: range overflow ("Invalid read
  of range 0..2 for fragment 0 with 1 addressable rows")
- scan_with_staged_pushes_filter: duplicate carol + missing dave
  (one fragment read twice, the other not at all)

Fix: assign real fragment IDs in stage_append, mirroring Lance's
commit-time logic. Use ds.manifest.max_fragment_id + 1 as the base,
incremented by the prior_stages fragment count so chained
stage_appends produce distinct IDs. The row_id_meta assignment
stays — both are needed for the scanner to correctly map row IDs
through the combined fragment list.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 00:18:47 +02:00
Ragnor Comerford
61b3f5090b
MR-794 step 1: thread row-ID offset + add commit_staged + filter tests
Three follow-ups to the staged-writes primitives, all caught by the
"are we missing tests?" review:

(1) Path A row-ID threading (Gap 1, real bug):
stage_append now takes prior_stages: &[StagedWrite] and offsets the
assigned row IDs by the sum of prior stages' physical_rows. Without
this, two stage_appends against the same dataset both started at
ds.manifest.next_row_id, producing fragments with overlapping _rowid
ranges. This would have fired in Step 2+ on any multi-statement
mutation like `insert Knows ...; insert Knows ...` (multiple appends
to the same edge table — allowed under D₂′). The slice mirrors
scan_with_staged's API shape; the same slice is passed to both stage
and scan. Documented contract: only stage_append results in
prior_stages (D₂′ guarantees this upstream).

(2) commit_staged round-trip tests (Gap 2):
Two tests covering stage_append + commit_staged and stage_merge_insert
+ commit_staged. Validate that Lance's commit-time row-ID assignment
works correctly even after our pre-commit row_id_meta assignment in
the append path — the two assignments diverge but neither is observed
across the boundary.

(3) Filter pushdown test (Gap 3):
scan_with_staged with a SQL filter applies it across both committed
and staged fragments. Validates the MR-794 ticket's claim that Lance's
with_fragments preserves filter/vector/FTS pushdown (Lance tests
test_scalar_index_respects_fragment_list etc.).

Also adds chained_stage_appends_have_distinct_row_ids which directly
demonstrates the Gap 1 fix by projecting _rowid and asserting no
duplicates across 1 committed + 2 staged rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:59:59 +02:00
Ragnor Comerford
714f1f0c0a
MR-794 step 1: assign row_id_meta to stage_append fragments
CI exposed a real Step 1 bug surfaced by the new staged_writes tests:
stage_append → scan_with_staged fails on stable_row_id datasets with
"Missing row id meta" (lance-4.0.0/src/dataset/rowids.rs:22).

Root cause: InsertBuilder::execute_uncommitted produces fragments with
row_id_meta = None. Lance's commit phase normally populates it via
Transaction::assign_row_ids, but scan_with_staged reads the staged
fragments BEFORE commit. MergeInsertBuilder::execute_uncommitted dodges
this by populating row_id_meta inline (transaction.rs:1618) — that's
why the two merge-side tests in tests/staged_writes.rs passed and the
two append-side tests failed.

The bug was always present in the primitive — PR #66 shipped it the
same way. PR #66 had no tests calling stage_append, so neither CI nor
the bot reviewers caught it. Step 2+ would have hit it on the first
mutation that did "insert + insert with FK validation," but the failure
would have looked like a MutationStaging wiring bug; localizing it
here saves the next session the chase.

Fix: assign row_id_meta on the cloned fragments returned in
StagedWrite.new_fragments. Mirrors the relevant arm of Lance's
Transaction::assign_row_ids (transaction.rs:2682) for the
row_id_meta = None case. The transaction's internal fragment copy stays
untouched — Lance assigns its own IDs at commit time, and the two ID
assignments don't have to agree because no caller threads _rowid from
the staged scan into the commit path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:49:14 +02:00
Ragnor Comerford
2fe2669017
MR-794 step 1: import arrow_array::Array in staged_writes test
CI failed compiling tests/staged_writes.rs — `.len()` is on the Array
trait, not on the concrete StringArray/Int32Array types. Add the
trait import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:19:05 +02:00
Ragnor Comerford
4c5fa3d8b8
MR-794 step 1: address PR #67 Codex P1 — document chained-merge contract
Codex flagged that combine_committed_with_staged can return duplicates
on chained stage_merge_inserts: each call's MergeInsertBuilder runs
against the committed view (it does not see prior staged fragments), so
two staged merges whose source rows share keys both produce
Operation::Update transactions whose new_fragments contain the shared
row. The combined scan returns it twice.

The bug is intrinsic to Lance's API: there is no public way to make
MergeInsertBuilder see uncommitted fragments. Fixing the primitive
itself requires either a Lance API extension or in-memory pre-merge
logic, neither in scope for v1.

The v1 fix is a parse-time companion (D₂′) added with the engine rewire
in MR-794 step 2+: per touched table, ops must be all stage_append OR
exactly one stage_merge_insert. Multi-table queries and append-chains
remain safe; only chained merges on a single table are rejected.

This commit:
- Documents the contract on stage_merge_insert and
  combine_committed_with_staged so callers know the invariant the
  primitive relies on.
- Adds tests/staged_writes.rs with four primitive-level tests:
  - stage_append + scan_with_staged shows committed + staged
  - stage_merge_insert dedupes superseded committed fragments
    (regression for the removed_fragment_ids fix that PR #66's
    730631c added)
  - count_rows_with_staged matches scan
  - chained stage_merge_insert with shared key documents the
    duplicate-row behavior; assertion pins it so a future change either
    preserves the contract or consciously fixes it (and updates the test)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 23:10:19 +02:00
Ragnor Comerford
6dc4167291
MR-794 step 1: address PR #66 review — track removed_fragment_ids
Three independent automated reviews (Cubic P1, Cursor High, Codex P1)
flagged a real correctness bug in stage_merge_insert: Operation::Update
returns three fields — removed_fragment_ids, updated_fragments,
new_fragments — and we were collecting only the latter two into
StagedWrite.new_fragments while discarding removed_fragment_ids.

That breaks read-your-writes for any merge_insert that rewrites an
existing fragment: scan_with_staged combines the dataset's full committed
manifest with the staged new_fragments, so the *original* committed
fragment (which the rewrite supersedes) and its rewritten version both
end up in the Scanner's fragment list. Result: duplicate rows.

Fix:

- StagedWrite gains `removed_fragment_ids: Vec<u64>` populated from
  Operation::Update; empty for Operation::Append (which never supersedes
  existing fragments).
- scan_with_staged / count_rows_with_staged take `&[StagedWrite]` instead
  of `&[Fragment]` so they have access to both fields.
- A new `combine_committed_with_staged` helper composes the visible
  fragment list as `committed - removed + new`, deduping by fragment ID.

Also addresses cubic's P3 doc-fab note: the StagedWrite doc comment
claimed the type was "used by MutationStaging and the loader" but those
callers don't exist in this PR (they're MR-794 step 2+). Reword to
"defined here for later integration" so the doc doesn't lie about the
current state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 19:19:39 +02:00
Ragnor Comerford
3601002440
MR-794 step 1: add staged-write primitives to TableStore
Lance's distributed-write API splits "write fragment files" from "advance
HEAD": write_fragments returns a Transaction with FragmentMetadata; a
later CommitBuilder::execute(transaction) commits via the manifest CAS.
The same shape exists for merge_insert via
MergeInsertBuilder::execute_uncommitted. Scanner::with_fragments(staged)
lets in-flight reads see uncommitted staged data.

Adds wrappers for these primitives:

- StagedWrite carries the uncommitted Transaction plus the new Fragments
  (extracted for read-your-writes via Scanner::with_fragments).
- TableStore::stage_append wraps InsertBuilder::execute_uncommitted.
- TableStore::stage_merge_insert wraps MergeInsertBuilder::execute_uncommitted.
- TableStore::commit_staged wraps CommitBuilder::execute.
- TableStore::scan_with_staged / count_rows_with_staged thread the staged
  fragments into a Scanner alongside the dataset's committed fragments.

The MutationStaging integration that uses these primitives is the next
step in MR-794 — it requires a coordinated rewrite of execute_insert /
execute_update / execute_delete plus the load_jsonl_reader path, plus
end-of-query commit logic. Doc comment on MutationStaging is updated to
reference MR-794 and these primitives so the followup is well-anchored.

The current MR-771 limitation in docs/runs.md ("mid-query partial failure
leaves Lance HEAD ahead of __manifest") still applies until the
follow-up lands; the primitives are the building blocks but not yet the
fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 19:19:38 +02:00
Ragnor Comerford
b73813e525
Merge pull request #65 from ModernRelay/ragnorc/demote-run
MR-771: demote Run to direct-publish via expected_table_versions CAS
2026-04-30 19:08:17 +02:00
Ragnor Comerford
1a906403cb
MR-771: address cubic comment — drop vacuous __run__ check in cancel test
cubic correctly flagged that the assertion `!branches_after.iter().any(|b| b.starts_with("__run__"))` is vacuous because `branch_list()` already filters `__run__*` via `is_internal_system_branch`. The real structural property (no `__run__` branches can ever be created) is enforced by MR-771's deletion of `begin_run` etc. — that's a build-time invariant, not a runtime one.

Drop the vacuous assertion; document why. The remaining checks (public branch list unchanged + `_graph_runs.lance` never reappears) cover the actual cancel-safety properties.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 15:31:49 +02:00
Ragnor Comerford
a7109d5fba
MR-771: address PR review feedback
Three fixes from automated PR review on #65:

1. Internal-branch guard in mutation/load (Cursor Bugbot, Medium).
   Pre-MR-771 the begin_run path called ensure_public_branch_ref;
   the direct-publish replacements only normalized the name. A caller
   passing __run__* or __schema_apply_lock__ verbatim could write
   directly to a system branch. Re-add the explicit guard at the
   public write boundary in mutate_with_current_actor and load.

2. Panic-safe coordinator restoration (Cursor Bugbot, High).
   The previous swap-and-restore pattern would skip restore_coordinator
   if execute_named_mutation panicked, leaving the handle pinned to
   the wrong branch indefinitely. Replace with a CoordinatorRestoreGuard
   RAII type that captures the previous coordinator on swap and
   restores it in Drop.

3. Flaky cancel-safety test (cubic, P2).
   tests/runs.rs::cancelled_mutation_future_leaves_no_state asserted
   manifest version equality after handle.abort(), but abort races
   the spawned task. Re-frame around what actually defines cancel
   safety: no __run__* branches, no _graph_runs.lance, no synthesized
   public branches.

The fourth comment (Codex P1: branch_delete losing its in-flight
write barrier) is bigger in scope — fits in the MR-794 storage-trait
staging story rather than a hotfix here. Tracked there.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 15:17:00 +02:00
Ragnor Comerford
35be20cb05
MR-771: demote Run to direct-publish via expected_table_versions CAS
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 08:52:50 +02:00
Ragnor Comerford
034c49f82b
Merge pull request #64 from ModernRelay/ragnorc/architecture-diagrams
docs: add Mermaid architecture diagrams (system / storage / read + write flows)
2026-04-29 19:31:08 +02:00
Ragnor Comerford
97c6490c1d
docs: post-merge code-review fixes (Cursor + cubic on PR #60 era)
Two factual mismatches caught during code-grounded re-review:

- docs/architecture.md: "13 cmd families" was stale — the CLI has 17
  Command variants (Version, Embed, Init, Load, Ingest, Branch, Schema,
  Query, Snapshot, Export, Run, Commit, Read, Change, Policy, Optimize,
  Cleanup). Replaced the count with "command families" so the diagram
  doesn't drift again.

- docs/execution.md: the mutation prose said "every mutation runs on a
  fresh __run__<id> branch", which over-claims. mutation.rs:555 short-
  circuits when the target is already a __run__ branch — the assumption
  there is the caller is managing the surrounding run lifecycle. Added a
  one-paragraph caveat noting the exception with the file:line citation.

Both diagrams unchanged; only annotations / counts adjusted.
2026-04-29 17:14:48 +02:00
Ragnor Comerford
411d86b743
docs/execution: fix mutation atomicity narrative — run-branch + publish_run
The mutation flow diagram and prose previously claimed multi-statement
mutations publish through a single ManifestRepo::commit. That's wrong:
each statement commits independently to a __run__<id> branch via
commit_updates; atomicity comes from the publish_run step that promotes
the run-branch into the target (fast path) or three-way merges it
(merge path).

Diagram now shows:
- begin_run forks __run__<id> from target head
- per-statement commit_updates on the run-branch (loop)
- OCC pre-check on target head
- publish_run with fast-path / merge-path branches
- terminate_run(Published)

Prose now points at runs.md for the full run lifecycle and cites the
correct entry points: mutate_with_current_actor (mutation.rs:539),
publish_run (omnigraph.rs:858).

Addresses Codex review comment on PR #64.
2026-04-29 17:09:29 +02:00
Ragnor Comerford
64b9d56476
docs: add Mermaid architecture diagrams across architecture / storage / execution
Replace the single ASCII stack in docs/architecture.md with a hierarchy of
Mermaid diagrams that show the system from external context down to the
component level. Add an on-disk layout diagram in docs/storage.md and two
sequence diagrams (read query, mutation) in docs/execution.md so readers
can navigate from "what is OmniGraph" to "how does a query run" without
opening source.

Static structure (docs/architecture.md):

- System context — agents/clients, embedding providers, Cedar, object store.
- Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling
  via classDef, replacing the pre-existing ASCII art.
- Component zoom-ins — compiler, engine, storage trait, index lifecycle,
  server/CLI. Each zoom-in cites file:line entry points.

Aspirational shapes (storage trait, full reconciler) are visually marked
and pointed at the relevant invariants.md section so readers see the
intended seam without thinking it's already implemented.

On-disk layout (docs/storage.md):

- Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance,
  _graph_runs.lance, _refs/branches/ down into Lance's per-dataset
  internals (_versions/, data/, _indices/, _refs/, _transactions/).
- Annotated with the actual filenames so readers can `ls` the same paths.
- Slots in below the existing __manifest CAS / OCC / migration prose; does
  not move or rewrite that content.

Runtime flows (docs/execution.md):

- Read flow sequence: client → Omnigraph::query → typecheck → lower →
  execute_query → table_store → Lance scanner → RecordBatch stream.
- Mutation flow sequence: Omnigraph::mutate → resolve literals →
  Lance write op (Append / merge_insert) → ManifestRepo::commit →
  __manifest upsert.
- Both diagrams are followed by a "Code paths" block with verified
  file:line citations so readers can navigate from diagram element to
  source in one step.

Conventions established (this is the first Mermaid in the repo):

- L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed.
- Diagram size cap ~9 elements; more detail goes in a sub-diagram.
- Diagrams paired with prose; code-path citations follow each diagram.
- Consistent vocabulary across diagrams: frontend / compiler / engine /
  storage trait / Lance / object store. No accidental synonyms.

Subsequent PRs will add flow diagrams for schema apply, branch + merge,
run isolation, index reconcile, and the embedding pipeline in the same
conventions.
2026-04-29 16:58:56 +02:00
Ragnor Comerford
4e5374a85e
Merge pull request #63 from ModernRelay/claude/extend-manifest-batch-publisher-GQzB6 2026-04-29 14:45:33 +02:00
Claude
63babede4a
AGENTS.md: reframe 'minimize ongoing liability' as a general decision lens
The previous bullets read like a migration pattern (centralized
dispatcher, one match arm, no shape forks). That's one application, not
the principle. Reframe it as a bidirectional decision lens: ask "which
option has lower ongoing cost over time?" and let the answer be more
code, less code, DRYing, duplication, removal, addition, a new
abstraction, or flattening one — whichever shape converges over the
expected change horizon.

Add explicit examples of cases where the lower-liability option is
*more* code (dispatcher, migration framework, typed error variants) and
where it's *less* (premature abstractions, "just in case" paths,
helpers that wedge independently-evolving callers together) so readers
don't collapse the principle into "minimize code".
2026-04-29 12:32:30 +00:00
Claude
b6a596670a
AGENTS.md: add 'minimize ongoing liability' as a first principle
The always-on rules are concretizations of a broader engineering posture
that wasn't stated explicitly. Add a short section that frames the spirit
behind those rules:

- One centralized detection point, not many heal hooks.
- One dispatcher, not branch-on-shape in every consumer.
- One canonical shape after migration, not forks on "old vs new".
- Three similar lines beats a premature abstraction.
- Delete dead paths when their last caller leaves.

Plus a forward-looking review prompt ("what do these paths look like after
5 more changes like this?") so the principle bites at design time, not
just at review time.

The internal-schema-version mechanism we just shipped is a concrete
application: one stamp + one dispatcher + one match arm per change, no
heal hooks scattered across the engine. Codify the pattern so future
work doesn't drift back to ad-hoc.
2026-04-29 11:48:47 +00:00
Claude
243c0c3464
Add internal-schema versioning + auto-migration for __manifest
The on-disk shape of `__manifest` is reconciled with the binary via a single
stamp + dispatcher in `db/manifest/migrations.rs`:

- `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes.
- The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest
  dataset's schema-level metadata (Lance `update_schema_metadata`).
- `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from
  the on-disk stamp until it matches the binary, then returns. Idempotent.
- `init_manifest_repo` stamps the current version at creation; the publisher's
  open-for-write path runs pending migrations before reading state. Reads
  stay side-effect-free.
- Forward-version protection: a stamp higher than the binary's known version
  triggers a clear "upgrade omnigraph first" error so an old binary cannot
  clobber a newer schema.

Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step:
the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id`
that engages Lance's row-level CAS at commit time. New repos created via
`init` are stamped at v2 immediately and don't need migration.

Adding a future on-disk shape change is one constant bump, one match arm in
`migrate_internal_schema`, and one test — no new branches in unrelated code
paths. Code outside the migration module never inspects the stamp.

New tests in `manifest/tests.rs`:
- `test_init_stamps_internal_schema_version`
- `test_publish_migrates_pre_stamp_manifest_to_current_version`
- `test_publish_rejects_manifest_stamped_at_future_version`

Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated
per the AGENTS.md maintenance contract.
2026-04-29 11:44:14 +00:00
Claude
5eb47b8c13
docs: surface MR-766 publisher OCC in storage / errors / constants
- storage.md: document the row-level CAS annotation on `__manifest.object_id`
  and the `expected_table_versions` OCC contract on `ManifestBatchPublisher::publish`.
- errors.md: list `ManifestConflictDetails` and its variants alongside `ManifestError`.
- constants.md: add `PUBLISHER_RETRY_BUDGET = 5`.

Per AGENTS.md "Maintenance contract": new schema construct, new constant, and
new typed error shape all need to ship with the source change.
2026-04-29 07:56:18 +00:00
Claude
df0e158190
Add per-table expected-version OCC to ManifestBatchPublisher (MR-766)
Layered approach selected by the CAS-granularity investigation
(.context/merge-insert-cas-granularity.md):

- Annotate __manifest.object_id with `lance-schema:unenforced-primary-key`,
  enabling Lance row-level CAS via the bloom-filter conflict resolver.
  Closes a latent silent-duplicate bug where two concurrent publishes of
  the same `version:T@v=N+1` row could both land in disjoint fragments.
- Extend `ManifestBatchPublisher::publish` with `expected_table_versions:
  &HashMap<String, u64>`. Empty map preserves today's behavior; populated
  map asserts the manifest's latest non-tombstoned version per table
  matches the caller's view. Mismatches surface as a typed
  `ManifestConflictDetails::ExpectedVersionMismatch { table_key, expected,
  actual }` so callers can match without parsing strings.
- Set `merge_builder.conflict_retries(0)` so Lance's transparent rebase
  cannot silently break the OCC contract; retries are owned by the
  publisher loop, where each attempt re-runs `load_publish_state` and
  the expected-version pre-check.
- Surface `ManifestCoordinator::commit_with_expected` for the callers
  that need strict OCC (the run-demotion ticket); existing `commit` and
  `commit_changes` paths are unaffected.

New tests in `manifest/tests.rs` cover: matching expected versions,
stale expected with typed details, drift on an untouched expected
table, unknown expected table (actual=0), and the headline case of two
concurrent publishes with overlapping expected versions where exactly
one succeeds.
2026-04-29 00:34:20 +00:00
Claude
bb95fdceda
Investigate Lance MergeInsertBuilder CAS granularity (MR-766 prereq)
Confirms Lance v4.0.0 has row-level CAS for merge_insert only when the
join-key column carries lance-schema:unenforced-primary-key=true. Our
__manifest schema does not, so the publisher silently allows duplicate
object_id rows under concurrent writers. Note + reproducible scratch
crate select the layered (pre-check + row-level CAS) approach for the
publisher API ticket.
2026-04-28 23:30:17 +00:00
Ragnor Comerford
58dba6210e
Merge pull request #61 from ModernRelay/ragnorc/query-execution-deep-dive
docs/invariants: drop commercial section; separate patterns from invariants
2026-04-29 00:44:26 +02:00
Ragnor Comerford
56b30c5c5a
Restructure invariants doc: drop commercial, separate patterns from invariants
- Removed §IX (OSS / Cloud kernel-product split) — business strategy belongs
  in MR-738, not the technical invariants doc.
- Filled the §IV (Additivity / migration) placeholder with five evolution
  invariants.
- Reframed §I to be substrate-agnostic: invariants are about respecting any
  substrate; Lance / DataFusion are noted as the current chosen substrate
  rather than as the invariant itself.
- Added §VI Database guarantees (12 invariants): atomicity, schema integrity,
  isolation, durability, causal consistency, determinism, idempotency, no
  silent loss, bounded operations, failure scope, crash recovery, consistency
  model.
- Added §II.8 wire-protocol agnosticism (kernel transport-agnostic,
  Flight/HTTP at the server boundary).
- Reframed §VII as "Current architectural patterns" — explicitly distinct
  from invariants. Each pattern entry now names the underlying invariant it
  realizes (reconciler / Union / mutations-wrap-reads / SIP / factorize /
  stable row IDs / rank columns / policy predicates / Source).
- Pulled specific config defaults out of §VI (timeouts, memory caps);
  invariant is that bounds exist, values live in docs/constants.md.
- Split §IX deny-list into "invariant violations" (high bar) and "pattern
  violations" (overridable with justification).
- Added status legend: decided / open — see MR-X / aspirational. Annotated
  invariants and patterns that are not yet upheld in current code.
- Updated review checklist (§X) to cover database-guarantee dimensions and
  the wire-protocol / Source / patterns sections.
- Updated Living Document policy (§XI) to spell out how to revise patterns,
  resolve open invariants, and lift aspirational annotations.

Source tickets: MR-737, MR-744, MR-765, MR-694 family, MR-722/MR-725.
2026-04-29 00:39:11 +02:00
Ragnor Comerford
a9430978fb
Merge pull request #60 from ModernRelay/ragnorc/omnigraph-spec
Add AGENTS.md (map) + docs/ knowledge base + CI link check
2026-04-29 00:15:19 +02:00