External reviewer flagged that the capability matrix's "Atomic
multi-dataset publish" cell implied Lance gives us a single primitive
for cross-table atomicity. It doesn't. The real contract is three
layers stacked:
(1) per-table Lance `commit_staged` for the data write
(2) `__manifest` row-level CAS via `ManifestBatchPublisher` for
cross-table ordering
(3) recovery-on-open reconciler for the residual gap between (1)
and (2) — NOT YET SHIPPED, tracked in MR-847.
Until MR-847 lands, a failure between per-table `commit_staged` and
the manifest publish leaves drift on the partially-committed tables
(the "Phase B → Phase C residual" documented in `docs/runs.md`).
Also enumerate the legacy inline-commit residuals (`append_batch`,
`merge_insert_batches`, `overwrite_batch`, `create_*_index`) alongside
`delete_where` and `create_vector_index` — they remain on the trait
pending Phase 1b call-site conversion + Phase 9 demotion.
End the row with an explicit DO NOT: future agents reading the
capability matrix should not describe atomicity as "fully upheld"
until MR-847 ships.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
/ `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
`Dataset::add_columns` / `update_columns` / `drop_columns` /
`truncate_table` / `restore` and the bare `.merge_insert(` /
`.add_columns(` / `.update_columns(` / `.drop_columns(` /
`.truncate_table(` method patterns. Deliberately avoid `.append(` /
`.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
`graph_coordinator.rs` — they're manifest-layer infra that legitimately
uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
`None`) to `open_dataset_head_for_write` for consistency with the
sibling `indexed_tables` block. Schema apply rejects non-main
branches at the lock-acquire step today, so behavior is unchanged;
this is a defensive consistency fix that survives a future relaxation
of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
doc: softened the "stage_* + commit_staged are the only paths" /
"trait funnels every write" overclaim. Inline-commit residuals
(`delete_where`, `create_vector_index`) remain on the trait pending
upstream Lance work (#6658, #6666); legacy `append_batch` etc.
remain pending Phase 1b / Phase 9. Module doc now describes the
current transitional state honestly.
Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
but returned only `DeleteState`, dropping the post-delete dataset.
Future callers migrating from the inherent `&mut Dataset` API would
lose the post-delete dataset state needed for indexing /
`table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
the unused `Scanner` import (the marker existed only to suppress an
unused-import warning — fixing the import is the cleaner answer).
Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
triggers the failpoint via a schema-apply that adds a new node type;
proves that existing tables are unaffected (Person mutation succeeds
after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
drift on tables outside the failed `added_tables` iteration.
`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
writer-trait surface for inserts / updates / scalar-index builds /
merge_insert / overwrite after MR-793 PR #70. Per-table
commit_staged → manifest publish window remains; closing requires
MR-847's recovery-on-open reconciler. `delete_where` and
`create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
primitives are the building blocks; the reconciler task itself is
MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
impl on `TableStore` with opaque-handle signatures; no test impl
yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior is interlocked across Lance pages — transactions reference
index lifecycle, index lifecycle references compaction, compaction
references row-id lineage. Skipping a "slightly relevant" page is how
alignment misses happen. The index alone is not a substitute for
reading the pages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.
Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
sealed (only in-tree types can impl), single impl on `TableStore`
delegating to existing inherent methods; `Omnigraph::storage()`
accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
`stage_create_btree_index`, `stage_create_inverted_index` —
implementing the simple branch of Lance's `CreateIndexBuilder::execute`
(scalar indices only; vector indices stay inline because
`build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
Six new tests in `tests/staged_writes.rs` pin both the new primitives
and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
walks engine source, fails on direct lance::* inline-commit API use
outside `table_store.rs` / `db/manifest/`. Skips comment lines and
honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
`stage_create_*_index` + `commit_staged` instead of
`create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
inline because `InsertBuilder::execute_uncommitted` rejects empty
data. The narrow inline window is bounded by `__schema_apply_lock__`.
Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
(incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
— 5 tests, all green.
* `cargo test --workspace` — green.
Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
(mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
(depends on Phase 1b).
Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
vector-index builds can be staged outside the lance crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.
MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
section; document the in-memory accumulator + D₂ rule + the
LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
upheld for inserts/updates. Within-query read-your-writes is now
load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
accumulator (MR-794)" subsection with per-op flow; refresh the
engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
for the staged-write path; updated the LoadMode table to call
out per-mode commit semantics; rewrote load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
description and the L2 capability matrix row.
Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
with omnigraph commit list/show.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous bullets read like a migration pattern (centralized
dispatcher, one match arm, no shape forks). That's one application, not
the principle. Reframe it as a bidirectional decision lens: ask "which
option has lower ongoing cost over time?" and let the answer be more
code, less code, DRYing, duplication, removal, addition, a new
abstraction, or flattening one — whichever shape converges over the
expected change horizon.
Add explicit examples of cases where the lower-liability option is
*more* code (dispatcher, migration framework, typed error variants) and
where it's *less* (premature abstractions, "just in case" paths,
helpers that wedge independently-evolving callers together) so readers
don't collapse the principle into "minimize code".
The always-on rules are concretizations of a broader engineering posture
that wasn't stated explicitly. Add a short section that frames the spirit
behind those rules:
- One centralized detection point, not many heal hooks.
- One dispatcher, not branch-on-shape in every consumer.
- One canonical shape after migration, not forks on "old vs new".
- Three similar lines beats a premature abstraction.
- Delete dead paths when their last caller leaves.
Plus a forward-looking review prompt ("what do these paths look like after
5 more changes like this?") so the principle bites at design time, not
just at review time.
The internal-schema-version mechanism we just shipped is a concrete
application: one stamp + one dispatcher + one match arm per change, no
heal hooks scattered across the engine. Codify the pattern so future
work doesn't drift back to ad-hoc.
All eight comments verified against source and applied:
- AGENTS.md: pull @docs/{invariants,lance,testing}.md imports out of
the markdown blockquote. Claude Code's @-import parser expects @ at
column 0; the leading "> " of a blockquote silently broke
recognition, so the claimed auto-include did nothing. (Cursor,
Medium severity.)
- docs/cli-reference.md: command-family count 13 → 17. The current
enum Command in crates/omnigraph-cli/src/main.rs has 17 top-level
variants. (cubic P2.)
- docs/ci.md: Homebrew tap update is a regular `git push`, not a
force-push (release.yml:117 is `git push origin HEAD:main`). (cubic
P2.)
- docs/errors.md: add the Storage variant to the NanoError list — it
exists at error.rs:88-89 but the doc enumerated only 10 of 11.
(cubic P2.)
- docs/storage.md: clarify tombstone semantics. There is no
tombstone_version column; state.rs:180 reads the tombstone version
from the table_version column on rows where object_type =
table_tombstone. (cubic P2.)
- docs/branches-commits.md: split the GraphCommit pseudo-struct from
the underlying storage. actor_id is joined in-memory from
_graph_commit_actors.lance, not a column on _graph_commits.lance.
(cubic P2.)
- docs/schema-language.md: rename IR_VERSION to SCHEMA_IR_VERSION to
match the actual constant name in catalog/schema_ir.rs:11.
(cubic P3.)
- docs/testing.md: engine integration test count 16 → 15 (matches
`ls crates/omnigraph/tests/*.rs`). (cubic P3.)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.
Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.
Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.
Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.
Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds @docs/lance.md alongside @docs/invariants.md so the Lance index
loads on every turn (Claude Code @-import; explicit-open instruction
for other agents). Reframes the directive from "when you hit a
Lance-shaped problem" to "consult before every task to identify which
upstream pages are relevant." The Lance docs are the authoritative
source for substrate behavior, so reasoning about them should start
every change rather than be triggered conditionally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Curates the Lance documentation site (lance.org) into a problem-domain
index so agents fetch the right page when working on Lance-touching
code instead of guessing or grepping our codebase. Organized by topic:
storage format & file layout, branching/tags/time travel, indexes
(scalar + system + vector), reads/writes, schema evolution, object
store, data types, performance, compaction, DataFusion integration,
SDK reference, plus quick-starts and the upstream AGENTS.md.
Skips ~200 irrelevant URLs from the upstream sitemap (Namespace REST
API model surface, Spark/Trino/Databricks/etc. integrations,
Python/Ray/HuggingFace docs, community pages) since omnigraph is
Rust-only and doesn't run a Lance Namespace catalog.
AGENTS.md surfaces it in the topic index and adds a directive: "when
you hit a Lance-shaped problem, consult docs/lance.md and fetch the
upstream URL before guessing."
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a top-of-file directive plus a Claude-Code @-import so the full
invariants document is loaded into context on every turn, not only
when an agent follows a pointer. Other agents are instructed to open
it explicitly at session start. The §IX deny-list and §X review
checklist apply to every change, so they should be in scope by
default rather than gated on the agent remembering to look.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops four rules whose phrasing leaned on implementation specifics
(nearest LIMIT, __run__<id> branches, __schema_apply_lock__,
branch_list filter convention) — those are real constraints, but they
live at the implementation layer and would go stale if internals are
renamed or refactored. The architectural intent is captured by the
remaining six rules and by the per-area docs.
Reframes the kept rules at the survives-a-rename level: "multi-dataset
publish is atomic across the whole graph" rather than naming the
manifest table or the publisher type, etc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A standing reference for invariants that hold across storage, engine,
server, schema, indexing, observability, and the OSS/Cloud split. Used
to check RFCs and PRs against the substrate boundaries (don't rebuild
what Lance gives us), layering rules (one trait boundary per layer),
distributability constraints (Send+Sync, location-neutral IR), honesty
expectations (estimate-vs-actual, bounded failure modes), unified
patterns (reconciler, Union polymorphism, SIP, factorize), the §IX
deny-list, and the §X review checklist.
§IV (additivity / migration) and §VIII (OSS/Cloud kernel-product split)
are referenced but not yet drafted — flagged as placeholders pending
upstream fill-in.
AGENTS.md surfaces it from the topic index, the always-on rules
section, and the maintenance contract; the deny-list is also inlined
there as a fast-pass review filter so it stays in scope every turn.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Splits the 990-line AGENTS.md into a 184-line map (architecture,
where-to-find index, always-on invariants, capability matrix,
maintenance contract) plus 18 new docs/*.md files holding the deep
content per topic (storage, schema and query languages, indexes,
embeddings, branches/commits, runs, merge, changes, execution, policy,
server, CLI reference, audit, errors, CI, constants, v0.3.1 notes).
Adds scripts/check-agents-md.sh and a check_agents_md CI job that
verifies every docs/ link in AGENTS.md resolves and every doc in the
canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures the v0.3.1 feature spec (storage, schema/query languages, IR,
indexes, embeddings, branches/commits/runs, merge, server, CLI, policy,
deployment) and adds a §26 maintenance contract instructing agents to
keep this file current alongside any user-visible change. CLAUDE.md is
a symlink so there's one source of truth.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>