Three bundled changes:
1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
the test function). OmniGraph is consumed by both humans and agents
- naming the test after one audience misframes the library.
2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
review-round labels from source, tests, and docs added by this
branch. Internal traceability belongs in commit messages and PR
descriptions, not in checked-in artifacts. Upstream
lance-format/lance issue refs and pre-existing MR-XXX refs in docs
not touched by this branch are left alone.
3. Two outstanding review findings addressed:
- `needs_index_work_node` / `needs_index_work_edge`: propagate
`count_rows` errors instead of `unwrap_or(0)`. Silently treating
transient I/O failures as "0 rows" risked skipping a table from
the recovery sidecar pin set that was actually about to be
modified.
- `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
strengthen the assertion to fail when sidecar B classifies under
a stale snapshot. The new assertion checks post-recovery Lance
HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
deleted + audit rows present" pair passed in both the bug and
fix paths because both delete the sidecar and write an audit
row; the differentiator is the post-recovery HEAD. Strengthening
the assertion exposed an additional nuance: in this overlapping-
sidecar scenario sidecar B's audit kind is RolledBack (no-op)
rather than RolledForward, since sidecar A's roll-forward
publishes Lance HEAD as the new manifest pin (absorbing B's
work). The docstring now explains why this is correct given
current `roll_forward_all` semantics.
All workspace tests pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the doc surface to reflect MR-847 having shipped end to end —
sidecar protocol, classifier, all-or-nothing decision tree, roll-forward
via ManifestBatchPublisher, roll-back via Dataset::restore with
fragment-set short-circuit, audit trail in
_graph_commit_recoveries.lance, OpenMode::{ReadWrite, ReadOnly}, and
the four migrated writers all carrying sidecars across Phase B → Phase C.
- docs/invariants.md §VI.23: change from "upheld at the writer-trait
surface for inserts/updates/etc., per-table commit_staged → manifest
publish window remains" to "upheld at the writer-trait surface AND
across process boundaries". The MR-847 sweep closes the residual on
the next Omnigraph::open. The "continuous in-process" property
(no ExpectedVersionMismatch surfacing to subsequent writers between
Phase B failure and process restart) is honest follow-up at MR-856.
- docs/runs.md: replace "Finalize → publisher residual" section with
"Open-time recovery sweep (MR-847)" — describes the sidecar protocol
lifecycle (Phases A-D), the sweep's classifier + decision dispatch,
the audit trail, and the operator-facing query
(omnigraph commit list --filter actor=omnigraph:recovery).
- AGENTS.md capability matrix "Atomic single-dataset commits" row:
drop the "Layer (3) is not yet shipped — tracked in MR-847" caveat;
describe the three layers as all shipping; reference MR-856 for the
background-reconciler follow-up.
- docs/storage.md: add _graph_commit_recoveries.lance and
__recovery/{ulid}.json to the on-disk layout (mermaid + prose).
- docs/branches-commits.md: new "Recovery audit trail (MR-847)"
subsection describing the join from
_graph_commits.lance:actor_id="omnigraph:recovery" to
_graph_commit_recoveries.lance:graph_commit_id for operator
post-mortem.
- docs/maintenance.md: note the MR-847 recovery floor on cleanup —
--keep < 3 may garbage-collect Lance versions the recovery sweep
needs as a rollback target. Default --keep 10 is safe.
- docs/testing.md: add tests/recovery.rs to the engine integration-test
table; expand the failpoints.rs row to mention the four MR-847
per-writer Phase B → recovery integration tests.
- .context/mr-847-design.md: prepend a "Status: DONE" stanza listing
every commit hash + scope across phases 1-10.
AGENTS.md ↔ docs/ cross-link check passes (26 links, 26 docs).
Full workspace test sweep passes with --features failpoints (361 tests
across 20 binaries).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
External reviewer flagged that the capability matrix's "Atomic
multi-dataset publish" cell implied Lance gives us a single primitive
for cross-table atomicity. It doesn't. The real contract is three
layers stacked:
(1) per-table Lance `commit_staged` for the data write
(2) `__manifest` row-level CAS via `ManifestBatchPublisher` for
cross-table ordering
(3) recovery-on-open reconciler for the residual gap between (1)
and (2) — NOT YET SHIPPED, tracked in MR-847.
Until MR-847 lands, a failure between per-table `commit_staged` and
the manifest publish leaves drift on the partially-committed tables
(the "Phase B → Phase C residual" documented in `docs/runs.md`).
Also enumerate the legacy inline-commit residuals (`append_batch`,
`merge_insert_batches`, `overwrite_batch`, `create_*_index`) alongside
`delete_where` and `create_vector_index` — they remain on the trait
pending Phase 1b call-site conversion + Phase 9 demotion.
End the row with an explicit DO NOT: future agents reading the
capability matrix should not describe atomicity as "fully upheld"
until MR-847 ships.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
/ `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
`Dataset::add_columns` / `update_columns` / `drop_columns` /
`truncate_table` / `restore` and the bare `.merge_insert(` /
`.add_columns(` / `.update_columns(` / `.drop_columns(` /
`.truncate_table(` method patterns. Deliberately avoid `.append(` /
`.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
`graph_coordinator.rs` — they're manifest-layer infra that legitimately
uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
`None`) to `open_dataset_head_for_write` for consistency with the
sibling `indexed_tables` block. Schema apply rejects non-main
branches at the lock-acquire step today, so behavior is unchanged;
this is a defensive consistency fix that survives a future relaxation
of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
doc: softened the "stage_* + commit_staged are the only paths" /
"trait funnels every write" overclaim. Inline-commit residuals
(`delete_where`, `create_vector_index`) remain on the trait pending
upstream Lance work (#6658, #6666); legacy `append_batch` etc.
remain pending Phase 1b / Phase 9. Module doc now describes the
current transitional state honestly.
Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
but returned only `DeleteState`, dropping the post-delete dataset.
Future callers migrating from the inherent `&mut Dataset` API would
lose the post-delete dataset state needed for indexing /
`table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
the unused `Scanner` import (the marker existed only to suppress an
unused-import warning — fixing the import is the cleaner answer).
Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
triggers the failpoint via a schema-apply that adds a new node type;
proves that existing tables are unaffected (Person mutation succeeds
after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
drift on tables outside the failed `added_tables` iteration.
`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
writer-trait surface for inserts / updates / scalar-index builds /
merge_insert / overwrite after MR-793 PR #70. Per-table
commit_staged → manifest publish window remains; closing requires
MR-847's recovery-on-open reconciler. `delete_where` and
`create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
primitives are the building blocks; the reconciler task itself is
MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
impl on `TableStore` with opaque-handle signatures; no test impl
yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior is interlocked across Lance pages — transactions reference
index lifecycle, index lifecycle references compaction, compaction
references row-id lineage. Skipping a "slightly relevant" page is how
alignment misses happen. The index alone is not a substitute for
reading the pages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.
Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
sealed (only in-tree types can impl), single impl on `TableStore`
delegating to existing inherent methods; `Omnigraph::storage()`
accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
`stage_create_btree_index`, `stage_create_inverted_index` —
implementing the simple branch of Lance's `CreateIndexBuilder::execute`
(scalar indices only; vector indices stay inline because
`build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
Six new tests in `tests/staged_writes.rs` pin both the new primitives
and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
walks engine source, fails on direct lance::* inline-commit API use
outside `table_store.rs` / `db/manifest/`. Skips comment lines and
honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
`stage_create_*_index` + `commit_staged` instead of
`create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
inline because `InsertBuilder::execute_uncommitted` rejects empty
data. The narrow inline window is bounded by `__schema_apply_lock__`.
Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
(incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
— 5 tests, all green.
* `cargo test --workspace` — green.
Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
(mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
(depends on Phase 1b).
Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
vector-index builds can be staged outside the lance crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.
MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
section; document the in-memory accumulator + D₂ rule + the
LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
upheld for inserts/updates. Within-query read-your-writes is now
load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
accumulator (MR-794)" subsection with per-op flow; refresh the
engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
for the staged-write path; updated the LoadMode table to call
out per-mode commit semantics; rewrote load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
description and the L2 capability matrix row.
Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
with omnigraph commit list/show.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous bullets read like a migration pattern (centralized
dispatcher, one match arm, no shape forks). That's one application, not
the principle. Reframe it as a bidirectional decision lens: ask "which
option has lower ongoing cost over time?" and let the answer be more
code, less code, DRYing, duplication, removal, addition, a new
abstraction, or flattening one — whichever shape converges over the
expected change horizon.
Add explicit examples of cases where the lower-liability option is
*more* code (dispatcher, migration framework, typed error variants) and
where it's *less* (premature abstractions, "just in case" paths,
helpers that wedge independently-evolving callers together) so readers
don't collapse the principle into "minimize code".
The always-on rules are concretizations of a broader engineering posture
that wasn't stated explicitly. Add a short section that frames the spirit
behind those rules:
- One centralized detection point, not many heal hooks.
- One dispatcher, not branch-on-shape in every consumer.
- One canonical shape after migration, not forks on "old vs new".
- Three similar lines beats a premature abstraction.
- Delete dead paths when their last caller leaves.
Plus a forward-looking review prompt ("what do these paths look like after
5 more changes like this?") so the principle bites at design time, not
just at review time.
The internal-schema-version mechanism we just shipped is a concrete
application: one stamp + one dispatcher + one match arm per change, no
heal hooks scattered across the engine. Codify the pattern so future
work doesn't drift back to ad-hoc.
All eight comments verified against source and applied:
- AGENTS.md: pull @docs/{invariants,lance,testing}.md imports out of
the markdown blockquote. Claude Code's @-import parser expects @ at
column 0; the leading "> " of a blockquote silently broke
recognition, so the claimed auto-include did nothing. (Cursor,
Medium severity.)
- docs/cli-reference.md: command-family count 13 → 17. The current
enum Command in crates/omnigraph-cli/src/main.rs has 17 top-level
variants. (cubic P2.)
- docs/ci.md: Homebrew tap update is a regular `git push`, not a
force-push (release.yml:117 is `git push origin HEAD:main`). (cubic
P2.)
- docs/errors.md: add the Storage variant to the NanoError list — it
exists at error.rs:88-89 but the doc enumerated only 10 of 11.
(cubic P2.)
- docs/storage.md: clarify tombstone semantics. There is no
tombstone_version column; state.rs:180 reads the tombstone version
from the table_version column on rows where object_type =
table_tombstone. (cubic P2.)
- docs/branches-commits.md: split the GraphCommit pseudo-struct from
the underlying storage. actor_id is joined in-memory from
_graph_commit_actors.lance, not a column on _graph_commits.lance.
(cubic P2.)
- docs/schema-language.md: rename IR_VERSION to SCHEMA_IR_VERSION to
match the actual constant name in catalog/schema_ir.rs:11.
(cubic P3.)
- docs/testing.md: engine integration test count 16 → 15 (matches
`ls crates/omnigraph/tests/*.rs`). (cubic P3.)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.
Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.
Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.
Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.
Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds @docs/lance.md alongside @docs/invariants.md so the Lance index
loads on every turn (Claude Code @-import; explicit-open instruction
for other agents). Reframes the directive from "when you hit a
Lance-shaped problem" to "consult before every task to identify which
upstream pages are relevant." The Lance docs are the authoritative
source for substrate behavior, so reasoning about them should start
every change rather than be triggered conditionally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Curates the Lance documentation site (lance.org) into a problem-domain
index so agents fetch the right page when working on Lance-touching
code instead of guessing or grepping our codebase. Organized by topic:
storage format & file layout, branching/tags/time travel, indexes
(scalar + system + vector), reads/writes, schema evolution, object
store, data types, performance, compaction, DataFusion integration,
SDK reference, plus quick-starts and the upstream AGENTS.md.
Skips ~200 irrelevant URLs from the upstream sitemap (Namespace REST
API model surface, Spark/Trino/Databricks/etc. integrations,
Python/Ray/HuggingFace docs, community pages) since omnigraph is
Rust-only and doesn't run a Lance Namespace catalog.
AGENTS.md surfaces it in the topic index and adds a directive: "when
you hit a Lance-shaped problem, consult docs/lance.md and fetch the
upstream URL before guessing."
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a top-of-file directive plus a Claude-Code @-import so the full
invariants document is loaded into context on every turn, not only
when an agent follows a pointer. Other agents are instructed to open
it explicitly at session start. The §IX deny-list and §X review
checklist apply to every change, so they should be in scope by
default rather than gated on the agent remembering to look.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops four rules whose phrasing leaned on implementation specifics
(nearest LIMIT, __run__<id> branches, __schema_apply_lock__,
branch_list filter convention) — those are real constraints, but they
live at the implementation layer and would go stale if internals are
renamed or refactored. The architectural intent is captured by the
remaining six rules and by the per-area docs.
Reframes the kept rules at the survives-a-rename level: "multi-dataset
publish is atomic across the whole graph" rather than naming the
manifest table or the publisher type, etc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A standing reference for invariants that hold across storage, engine,
server, schema, indexing, observability, and the OSS/Cloud split. Used
to check RFCs and PRs against the substrate boundaries (don't rebuild
what Lance gives us), layering rules (one trait boundary per layer),
distributability constraints (Send+Sync, location-neutral IR), honesty
expectations (estimate-vs-actual, bounded failure modes), unified
patterns (reconciler, Union polymorphism, SIP, factorize), the §IX
deny-list, and the §X review checklist.
§IV (additivity / migration) and §VIII (OSS/Cloud kernel-product split)
are referenced but not yet drafted — flagged as placeholders pending
upstream fill-in.
AGENTS.md surfaces it from the topic index, the always-on rules
section, and the maintenance contract; the deny-list is also inlined
there as a fast-pass review filter so it stays in scope every turn.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Splits the 990-line AGENTS.md into a 184-line map (architecture,
where-to-find index, always-on invariants, capability matrix,
maintenance contract) plus 18 new docs/*.md files holding the deep
content per topic (storage, schema and query languages, indexes,
embeddings, branches/commits, runs, merge, changes, execution, policy,
server, CLI reference, audit, errors, CI, constants, v0.3.1 notes).
Adds scripts/check-agents-md.sh and a check_agents_md CI job that
verifies every docs/ link in AGENTS.md resolves and every doc in the
canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures the v0.3.1 feature spec (storage, schema/query languages, IR,
indexes, embeddings, branches/commits/runs, merge, server, CLI, policy,
deployment) and adds a §26 maintenance contract instructing agents to
keep this file current alongside any user-visible change. CLAUDE.md is
a symlink so there's one source of truth.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>