Commit graph

19 commits

Author SHA1 Message Date
Ragnor Comerford
5afde54d69
agents: stop overclaiming atomic multi-table publish — describe the three layers honestly
External reviewer flagged that the capability matrix's "Atomic
multi-dataset publish" cell implied Lance gives us a single primitive
for cross-table atomicity. It doesn't. The real contract is three
layers stacked:

  (1) per-table Lance `commit_staged` for the data write
  (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for
      cross-table ordering
  (3) recovery-on-open reconciler for the residual gap between (1)
      and (2) — NOT YET SHIPPED, tracked in MR-847.

Until MR-847 lands, a failure between per-table `commit_staged` and
the manifest publish leaves drift on the partially-committed tables
(the "Phase B → Phase C residual" documented in `docs/runs.md`).

Also enumerate the legacy inline-commit residuals (`append_batch`,
`merge_insert_batches`, `overwrite_batch`, `create_*_index`) alongside
`delete_where` and `create_vector_index` — they remain on the trait
pending Phase 1b call-site conversion + Phase 9 demotion.

End the row with an explicit DO NOT: future agents reading the
capability matrix should not describe atomicity as "fully upheld"
until MR-847 ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 19:35:34 +02:00
Ragnor Comerford
9b0920b5da
address PR #70 bot review (Cubic + Cursor): 7 inline + failpoint test + invariants notes
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
  / `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
  `Dataset::add_columns` / `update_columns` / `drop_columns` /
  `truncate_table` / `restore` and the bare `.merge_insert(` /
  `.add_columns(` / `.update_columns(` / `.drop_columns(` /
  `.truncate_table(` method patterns. Deliberately avoid `.append(` /
  `.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
  arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
  `graph_coordinator.rs` — they're manifest-layer infra that legitimately
  uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
  `None`) to `open_dataset_head_for_write` for consistency with the
  sibling `indexed_tables` block. Schema apply rejects non-main
  branches at the lock-acquire step today, so behavior is unchanged;
  this is a defensive consistency fix that survives a future relaxation
  of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
  claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
  doc: softened the "stage_* + commit_staged are the only paths" /
  "trait funnels every write" overclaim. Inline-commit residuals
  (`delete_where`, `create_vector_index`) remain on the trait pending
  upstream Lance work (#6658, #6666); legacy `append_batch` etc.
  remain pending Phase 1b / Phase 9. Module doc now describes the
  current transitional state honestly.

Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
  but returned only `DeleteState`, dropping the post-delete dataset.
  Future callers migrating from the inherent `&mut Dataset` API would
  lose the post-delete dataset state needed for indexing /
  `table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
  matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
  the unused `Scanner` import (the marker existed only to suppress an
  unused-import warning — fixing the import is the cleaner answer).

Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
  with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
  between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
  triggers the failpoint via a schema-apply that adds a new node type;
  proves that existing tables are unaffected (Person mutation succeeds
  after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
  drift on tables outside the failed `added_tables` iteration.

`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
  writer-trait surface for inserts / updates / scalar-index builds /
  merge_insert / overwrite after MR-793 PR #70. Per-table
  commit_staged → manifest publish window remains; closing requires
  MR-847's recovery-on-open reconciler. `delete_where` and
  `create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
  primitives are the building blocks; the reconciler task itself is
  MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
  impl on `TableStore` with opaque-handle signatures; no test impl
  yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:47:07 +02:00
Ragnor Comerford
b87be5e9f0
agents: read every Lance page even slightly relevant, not just the obvious match
Behavior is interlocked across Lance pages — transactions reference
index lifecycle, index lifecycle references compaction, compaction
references row-id lineage. Skipping a "slightly relevant" page is how
alignment misses happen. The index alone is not a substitute for
reading the pages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:44:41 +02:00
Ragnor Comerford
17bf978d0e
MR-793 follow-up: lance docs alignment audit + mandate full-page fetch via mdrip
* AGENTS.md / docs/lance.md: agents must use `npx mdrip` (not summarizing
  WebFetch) when consulting Lance docs. WebFetch routinely drops
  load-bearing details — `pub(crate)` blockers, sub-specs behind nav hubs,
  default flags. Lesson learned during the MR-793 alignment audit.
* docs/lance.md: add "Last alignment audit: 2026-05-02" stanza
  documenting MemWAL gap, lance#6666 companion ticket, stable-row-ID
  status (experimental, may unblock MR-848), FRI as documented
  compaction-friendly alternative.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 17:41:32 +02:00
Ragnor Comerford
3135ff5d19
MR-793 phases 1-6: TableStorage trait + staged-write surface for engine writers
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.

Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
  sealed (only in-tree types can impl), single impl on `TableStore`
  delegating to existing inherent methods; `Omnigraph::storage()`
  accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
  `stage_create_btree_index`, `stage_create_inverted_index` —
  implementing the simple branch of Lance's `CreateIndexBuilder::execute`
  (scalar indices only; vector indices stay inline because
  `build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
  Six new tests in `tests/staged_writes.rs` pin both the new primitives
  and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
  walks engine source, fails on direct lance::* inline-commit API use
  outside `table_store.rs` / `db/manifest/`. Skips comment lines and
  honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
  `stage_create_*_index` + `commit_staged` instead of
  `create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
  named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
  merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
  delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
  use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
  inline because `InsertBuilder::execute_uncommitted` rejects empty
  data. The narrow inline window is bounded by `__schema_apply_lock__`.

Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
  (incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
  — 5 tests, all green.
* `cargo test --workspace` — green.

Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
  (mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
  residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
  removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
  (depends on Phase 1b).

Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
  vector-index builds can be staged outside the lance crate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 11:03:15 +02:00
Ragnor Comerford
a61e82f47a
MR-794 step 2: docs — runs/invariants/architecture/execution + cleanup
Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.

MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
  section; document the in-memory accumulator + D₂ rule + the
  LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
  upheld for inserts/updates. Within-query read-your-writes is now
  load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
  accumulator (MR-794)" subsection with per-op flow; refresh the
  engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
  for the staged-write path; updated the LoadMode table to call
  out per-mode commit semantics; rewrote load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
  contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
  test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
  description and the L2 capability matrix row.

Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
  from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
  remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
  legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
  with omnigraph commit list/show.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:43:19 +02:00
Ragnor Comerford
35be20cb05
MR-771: demote Run to direct-publish via expected_table_versions CAS
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-30 08:52:50 +02:00
Claude
63babede4a
AGENTS.md: reframe 'minimize ongoing liability' as a general decision lens
The previous bullets read like a migration pattern (centralized
dispatcher, one match arm, no shape forks). That's one application, not
the principle. Reframe it as a bidirectional decision lens: ask "which
option has lower ongoing cost over time?" and let the answer be more
code, less code, DRYing, duplication, removal, addition, a new
abstraction, or flattening one — whichever shape converges over the
expected change horizon.

Add explicit examples of cases where the lower-liability option is
*more* code (dispatcher, migration framework, typed error variants) and
where it's *less* (premature abstractions, "just in case" paths,
helpers that wedge independently-evolving callers together) so readers
don't collapse the principle into "minimize code".
2026-04-29 12:32:30 +00:00
Claude
b6a596670a
AGENTS.md: add 'minimize ongoing liability' as a first principle
The always-on rules are concretizations of a broader engineering posture
that wasn't stated explicitly. Add a short section that frames the spirit
behind those rules:

- One centralized detection point, not many heal hooks.
- One dispatcher, not branch-on-shape in every consumer.
- One canonical shape after migration, not forks on "old vs new".
- Three similar lines beats a premature abstraction.
- Delete dead paths when their last caller leaves.

Plus a forward-looking review prompt ("what do these paths look like after
5 more changes like this?") so the principle bites at design time, not
just at review time.

The internal-schema-version mechanism we just shipped is a concrete
application: one stamp + one dispatcher + one match arm per change, no
heal hooks scattered across the engine. Codify the pattern so future
work doesn't drift back to ad-hoc.
2026-04-29 11:48:47 +00:00
Ragnor Comerford
6f25c4f9f8
Address reviewer feedback (Cursor + cubic) on PR #60
All eight comments verified against source and applied:

- AGENTS.md: pull @docs/{invariants,lance,testing}.md imports out of
  the markdown blockquote. Claude Code's @-import parser expects @ at
  column 0; the leading "> " of a blockquote silently broke
  recognition, so the claimed auto-include did nothing. (Cursor,
  Medium severity.)
- docs/cli-reference.md: command-family count 13 → 17. The current
  enum Command in crates/omnigraph-cli/src/main.rs has 17 top-level
  variants. (cubic P2.)
- docs/ci.md: Homebrew tap update is a regular `git push`, not a
  force-push (release.yml:117 is `git push origin HEAD:main`). (cubic
  P2.)
- docs/errors.md: add the Storage variant to the NanoError list — it
  exists at error.rs:88-89 but the doc enumerated only 10 of 11.
  (cubic P2.)
- docs/storage.md: clarify tombstone semantics. There is no
  tombstone_version column; state.rs:180 reads the tombstone version
  from the table_version column on rows where object_type =
  table_tombstone. (cubic P2.)
- docs/branches-commits.md: split the GraphCommit pseudo-struct from
  the underlying storage. actor_id is joined in-memory from
  _graph_commit_actors.lance, not a column on _graph_commits.lance.
  (cubic P2.)
- docs/schema-language.md: rename IR_VERSION to SCHEMA_IR_VERSION to
  match the actual constant name in catalog/schema_ir.rs:11.
  (cubic P3.)
- docs/testing.md: engine integration test count 16 → 15 (matches
  `ls crates/omnigraph/tests/*.rs`). (cubic P3.)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-29 00:09:06 +02:00
Ragnor Comerford
ada58ccd7b
Make "check existing coverage first" a top-level testing principle
The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.

Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.

Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-29 00:03:50 +02:00
Ragnor Comerford
8be0e6a067
Add docs/testing.md as required-read every session
Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:55:21 +02:00
Ragnor Comerford
a06d8bcf82
Promote docs/lance.md to required-read every session
Adds @docs/lance.md alongside @docs/invariants.md so the Lance index
loads on every turn (Claude Code @-import; explicit-open instruction
for other agents). Reframes the directive from "when you hit a
Lance-shaped problem" to "consult before every task to identify which
upstream pages are relevant." The Lance docs are the authoritative
source for substrate behavior, so reasoning about them should start
every change rather than be triggered conditionally.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:50:42 +02:00
Ragnor Comerford
b6440d6b17
Add docs/lance.md — task-organized index of Lance upstream docs
Curates the Lance documentation site (lance.org) into a problem-domain
index so agents fetch the right page when working on Lance-touching
code instead of guessing or grepping our codebase. Organized by topic:
storage format & file layout, branching/tags/time travel, indexes
(scalar + system + vector), reads/writes, schema evolution, object
store, data types, performance, compaction, DataFusion integration,
SDK reference, plus quick-starts and the upstream AGENTS.md.

Skips ~200 irrelevant URLs from the upstream sitemap (Namespace REST
API model surface, Spark/Trino/Databricks/etc. integrations,
Python/Ray/HuggingFace docs, community pages) since omnigraph is
Rust-only and doesn't run a Lance Namespace catalog.

AGENTS.md surfaces it in the topic index and adds a directive: "when
you hit a Lance-shaped problem, consult docs/lance.md and fetch the
upstream URL before guessing."

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:48:28 +02:00
Ragnor Comerford
43724b9f18
Make docs/invariants.md required reading every session
Adds a top-of-file directive plus a Claude-Code @-import so the full
invariants document is loaded into context on every turn, not only
when an agent follows a pointer. Other agents are instructed to open
it explicitly at session start. The §IX deny-list and §X review
checklist apply to every change, so they should be in scope by
default rather than gated on the agent remembering to look.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:39:09 +02:00
Ragnor Comerford
1e7334275a
Trim always-on rules to architectural-level invariants
Drops four rules whose phrasing leaned on implementation specifics
(nearest LIMIT, __run__<id> branches, __schema_apply_lock__,
branch_list filter convention) — those are real constraints, but they
live at the implementation layer and would go stale if internals are
renamed or refactored. The architectural intent is captured by the
remaining six rules and by the per-area docs.

Reframes the kept rules at the survives-a-rename level: "multi-dataset
publish is atomic across the whole graph" rather than naming the
manifest table or the publisher type, etc.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:38:24 +02:00
Ragnor Comerford
c924e121d2
Add architectural invariants & deny-list as docs/invariants.md
A standing reference for invariants that hold across storage, engine,
server, schema, indexing, observability, and the OSS/Cloud split. Used
to check RFCs and PRs against the substrate boundaries (don't rebuild
what Lance gives us), layering rules (one trait boundary per layer),
distributability constraints (Send+Sync, location-neutral IR), honesty
expectations (estimate-vs-actual, bounded failure modes), unified
patterns (reconciler, Union polymorphism, SIP, factorize), the §IX
deny-list, and the §X review checklist.

§IV (additivity / migration) and §VIII (OSS/Cloud kernel-product split)
are referenced but not yet drafted — flagged as placeholders pending
upstream fill-in.

AGENTS.md surfaces it from the topic index, the always-on rules
section, and the maintenance contract; the deny-list is also inlined
there as a fast-pass review filter so it stays in scope every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:34:44 +02:00
Ragnor Comerford
a335d98854
Refactor AGENTS.md from encyclopedia to map; move spec into docs/
Splits the 990-line AGENTS.md into a 184-line map (architecture,
where-to-find index, always-on invariants, capability matrix,
maintenance contract) plus 18 new docs/*.md files holding the deep
content per topic (storage, schema and query languages, indexes,
embeddings, branches/commits, runs, merge, changes, execution, policy,
server, CLI reference, audit, errors, CI, constants, v0.3.1 notes).

Adds scripts/check-agents-md.sh and a check_agents_md CI job that
verifies every docs/ link in AGENTS.md resolves and every doc in the
canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:31:08 +02:00
Ragnor Comerford
cfea41e942
Add AGENTS.md as canonical agent guide; symlink CLAUDE.md to it
Captures the v0.3.1 feature spec (storage, schema/query languages, IR,
indexes, embeddings, branches/commits/runs, merge, server, CLI, policy,
deployment) and adds a §26 maintenance contract instructing agents to
keep this file current alongside any user-visible change. CLAUDE.md is
a symlink so there's one source of truth.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:10:09 +02:00