Warm-up commit for the policy chassis epic (MR-722). PR #1 of the
chassis series — same role as schema-lint v1's commit #1 baseline.
Zero behavioral change; establishes the regression test, the
load-bearing doc comment, and the user-doc paragraph for an
invariant already true in code.
Server auth already resolves `actor_id` from the matched bearer
token at `omnigraph-server/src/lib.rs:692-694`, overwriting whatever
the handler put in the PolicyRequest. The principle is named in
docs/dev/invariants.md Hard Invariant 11 ("clients cannot set actor
identity directly"). What was missing: a regression test, a
load-bearing doc comment at the resolution site, and a user-facing
documentation paragraph. This commit adds all three.
Why first. The actor-identity invariant is the foundation every
other policy decision stands on. If `actor_id` can be spoofed, every
chassis primitive (per-row scope, audit log, two-person rule)
becomes ungated. Pinning the invariant first means PR #2 (the
chassis core) doesn't have to re-prove this assertion.
Changes:
* crates/omnigraph-server/tests/server.rs — new regression test
actor_id_resolves_from_bearer_token_ignoring_client_supplied_headers
with three sub-assertions:
- spoof-up: bearer for denied actor + X-Actor-Id naming allowed
actor → 403 (header doesn't promote)
- spoof-down: bearer for allowed actor + X-Actor-Id naming denied
actor → 200 (header doesn't demote)
- empty-string spoof: empty X-Actor-Id doesn't clear resolved actor
Cross-link to MR-777 (auth boundary cases — actor-id collision +
malformed bearer) noted in the test docstring.
* crates/omnigraph-server/src/lib.rs — expanded doc comment at
the actor-resolution site explaining the SECURITY INVARIANT,
citing Hard Invariant 11, the Supabase RLS history footgun, and
the regression test that pins the contract. Reader thinking "I
should let clients override actor_id for impersonation" hits
this comment first.
* docs/user/policy.md — new "Actor identity (signed-claim-only)"
section near the existing Server enforcement section. Closes the
user-facing doc gap MR-731's "Done when" requires.
Architectural decisions for PR #2+ pinned this session (not
implemented here, recorded so future implementers don't re-litigate):
- PolicyEngine moves to new `omnigraph-policy` workspace crate so
both engine and server can depend on it (Q2).
- `enforce(action, scope, actor)` will take a new `ResourceScope`
enum, leaving room for MR-725's per-type and per-row variants (Q3).
- `PolicyAction::Admin` is kept and wired (Option A) — meta-action
for policy-management surfaces (hot reload, audit log query,
approvals list) as those consumer features land (Q4).
Test results:
- cargo test -p omnigraph-server --test server: 45 pass (44 existing
+ 1 new); no regressions
- scripts/check-agents-md.sh: passes (34 links / 33 docs OK)
Out of scope (PR #2+):
- Omnigraph::with_policy() + enforce() method
- omnigraph-policy crate creation
- ResourceScope enum
- CLI policy injection into Omnigraph
- HTTP-layer redundant-check removal
- MR-724 Admin action wiring (PR #2)
- MR-723 default-deny 3-state (PR #4)
- MR-736 severity warn/deny (PR #5)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
* schema-lint chassis v1 (WIP): tier surfacing + plan doc
First commit of the chassis v1 branch. Lands a small, foundational
slice without behavior change, plus a planning doc that lays out the
remaining 7 commits in sequence so the PR can be reviewed
incrementally.
This commit:
- Adds SchemaMigrationStep::diagnostic() returning the full
&'static DiagnosticCode (family + tier + severity) for
UnsupportedChange steps with codes. Renderers can now reach the
tier without re-implementing the code → tier lookup.
- CLI `omnigraph schema plan` output now displays tier alongside
code:
unsupported change on node:Person.age [OG-DS-104, destructive]:
removing property 'Person.age' is not supported in schema
migration v1
Operators see at-a-glance the kind of risk each rejection
represents — not just the rule identifier.
- No behavior change. All 11 existing schema_apply tests still pass.
Planning doc at docs/schema-lint-v1-plan.md tracks the 7 remaining
commits to bring v1 to feature-complete:
1. (this commit) Tier surfacing in plan output.
2. Soft / Hard mode enum on drop steps.
3. Tombstone fields on catalog IR.
4. Planner emits DropProperty { Soft } by default.
5. Apply path implements Soft mode.
6. Convert PR #62 destructive-rejection tests.
7. --allow-data-loss flag + Hard mode.
8. (optional) Tombstone unhide / restore command.
Delete the planning doc when v1 lands. Intentionally checked in to
the WIP branch so the scope is reviewable; not intended as a
permanent doc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* schema-lint v1 commit 2: DropMode + dormant Drop* variants
Second commit of the chassis v1 branch. Lands the type-level shape
of soft/hard drops without wiring them up. Variants are reachable
from emitters but the planner doesn't produce them yet; the apply
path returns an explicit not-yet-implemented error if one shows up
via deserialization.
Added:
- `DropMode { Soft, Hard }` — orthogonal to `SafetyTier`. Tier
classifies the rule's risk class; mode is the operator's intent
for data treatment.
- `Soft` → catalog tombstone, data retained. Tier: safe.
- `Hard` → Lance-level removal. Tier: destructive; will require
--allow-data-loss to apply (commit 7).
- `SchemaMigrationStep::DropType { type_kind, name, mode }` and
`SchemaMigrationStep::DropProperty { type_kind, type_name,
property_name, mode }` variants.
- Re-export `DropMode` from `omnigraph_compiler::DropMode` so
downstream crates don't reach into the catalog submodule.
- CLI `render_schema_plan_step` arms for both variants, surfacing
the mode in plan output: `drop property 'Person.age' of node
'Person' (soft mode)`.
- `apply_schema_with_lock` exhaustive match arm for the two new
variants that returns `manifest_internal` with a clear
not-yet-implemented message. If a SchemaIR JSON containing
Drop{Type,Property} arrives (e.g. from a future tool or hand-
written), the apply path fails explicitly rather than silently
misclassifying.
- Two new in-source tests:
- `drop_steps_round_trip_through_serde` — pins the wire shape
for all four (variant × mode) combinations.
- `drop_mode_serde_uses_snake_case` — pins external-tool-
friendly serialization (`"soft"` / `"hard"`).
Build: clean, only pre-existing warnings.
Tests:
- omnigraph-compiler schema_plan: 6/6 (4 existing + 2 new).
- omnigraph-engine schema_apply: 11/11 (unchanged — planner still
emits UnsupportedChange for removal paths).
Next commit (commit 3 per docs/schema-lint-v1-plan.md): add the
`tombstoned: bool` fields to NodeIR / EdgeIR / PropertyIR for the
catalog representation of soft-mode tombstones.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* plan doc: reframe v1 around Lance native drop_columns
After a substrate audit of the Lance data-evolution guide on
2026-05-13, the v1 plan was simplified. Two key findings:
1. Lance's `drop_columns()` is already metadata-only and reversible
via time travel until cleanup. No need for a parallel
`tombstoned: bool` field in our catalog IR — Lance's version
graph IS the tombstone.
2. The full schema_apply substrate migration (add_columns,
drop_columns, alter_columns vs. stage_overwrite across all step
types) is consolidated in MR-948 as a sibling issue. v1 only
uses the relevant slice (drop_columns for OG-DS-1XX).
Net plan changes:
- Commit 3 (original): tombstone fields on catalog IR → dropped.
No catalog IR change needed. The Lance drop_columns commit IS the
tombstone.
- Commit 5 (original): apply path writes tombstoned: true → replaced
with: apply path calls Dataset::drop_columns([name]).
- Commit 7 Hard mode: stage_overwrite removing the column → replaced
with: drop_columns + compact_files + cleanup_old_versions. Same
APIs omnigraph cleanup already uses.
- Commit 8 (original): omnigraph schema unhide → dropped. Time
travel is the undo (omnigraph snapshot --at <commit>).
Net result: 8 commits → 5 commits. ~250 LoC less surface. More
substrate-aligned.
The chassis types from commit 2 (DropMode enum, DropType /
DropProperty variants) are kept exactly as designed; only the
implementation strategy changed.
The Lance docs quote is included in the doc so future readers see
the substrate behavior cited verbatim.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* schema-lint v1 commit 3: emit + apply DropProperty { Soft }
Wire the dormant DropProperty variant end-to-end for the Soft case.
Per docs/schema-lint-v1-plan.md, commit #3 of the schema-lint chassis
v1 series (MR-694).
Planner (schema_plan.rs):
- plan_properties: emit DropProperty { type_kind, type_name,
property_name, mode: Soft } instead of UnsupportedChange when a
property exists in accepted but not in desired. Plan is now
supported = true for drop-only changes.
Apply (schema_apply.rs):
- Route DropProperty { Soft } through rewritten_tables. The existing
batch_for_schema_apply_rewrite path already iterates the *target*
schema fields, so a property absent from desired_catalog is
naturally projected away. The prior Lance version retains the
dropped column for time-travel reversibility (until cleanup runs).
- DropType still errors (lands in commit #4 with different mechanics:
__manifest entry removal instead of column projection).
- DropProperty { Hard } still errors (lands in commit #5 with
--allow-data-loss CLI flag + immediate compact_files +
cleanup_old_versions).
Tests:
- Planner unit test plan_emits_soft_drop_for_removed_nullable_property
asserts the variant emission + supported = true + no UnsupportedChange.
- Integration test apply_schema_drops_a_nullable_property_softly_
preserves_prior_version (replaces the former
apply_schema_rejects_dropping_a_property_with_data) asserts:
(a) plan contains DropProperty { Soft }
(b) apply succeeds + manifest advances + row count unchanged
(c) current dataset schema lacks the dropped column
(d) snapshot_at_version(pre_drop) still has the dropped column
(e) reopen consistency — drop preserved across engine restart
Recovery: rides on SidecarKind::SchemaApply per MR-847. No new
sidecar kind needed; the entire apply path is already sidecar-wrapped.
Substrate alignment: this commit uses the stage_overwrite full-rewrite
path (full_rewrite cost class) rather than Lance native drop_columns
(catalog_only cost class). MR-948 is the follow-up substrate-alignment
refactor that introduces a LanceColumnOp surface and switches the
metadata-only case onto drop_columns. Functional outcome is identical;
cost-class improvement deferred.
Test results:
- cargo test -p omnigraph-compiler --lib: 238 passed
- cargo test -p omnigraph-engine --test schema_apply: 11 passed
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* docs: move schema-lint-v1-plan into docs/dev/ + add to index
Post-rebase fixup for the docs split (#93). The plan doc was added
to docs/ at the top level before main reorganized to docs/{user,dev}/.
This moves it into docs/dev/ and adds an entry to docs/dev/index.md
under a new "Active Implementation Plans" section so the
check-agents-md.sh link check passes.
Per the original commit message (617a77d), the plan doc is intentionally
temporary — it will be deleted when v1 lands.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous "fetch the full page" recommendation in AGENTS.md and
docs/dev/lance.md pointed at an unknown-author npm CLI that, on consent,
wrote agent-targeted content into AGENTS.md and modified .gitignore /
tsconfig.json. Source audit was clean of malicious code but the
self-perpetuating prompt-injection pattern combined with a single
maintainer and ~21 downloads/day made it not worth the risk. Switched
to the curl + pandoc command already documented as the no-tool option.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Branch protection on main, declared as code rather than as opaque
GitHub UI state. Pairs with the CODEOWNERS chassis (#88): once this
PR lands and an admin runs the apply script, every PR to main must
satisfy code-owner review and the listed required checks.
Components:
- .github/branch-protection.json — the policy. Edit this to change
required checks, review counts, etc. Includes a _comment field for
human readers; the apply script strips it before PUT.
- scripts/apply-branch-protection.sh — idempotent apply via `gh api`.
Reads back current state for verification. Supports DRY_RUN=1.
- docs/branch-protection.md — explains the policy, how to apply, how
to change, why declared as code.
- AGENTS.md topic-index row.
Policy summary:
- Required status checks (strict): Classify Changes, Check AGENTS.md
Links, Test Workspace, Test omnigraph-server --features aws,
CODEOWNERS / drift, CODEOWNERS / noedit.
- Required approving reviews: 1, must be a code owner.
- Dismiss stale reviews on new commits.
- Required linear history (squash or rebase merges only).
- No force pushes, no deletions, no admin bypasses.
- Required conversation resolution.
What's NOT in this PR:
- Required signed commits — not yet; maintainers must enroll GPG/SSH
signing first or merges will block.
- Tag protection for v* tags — separate PR.
- Additional required checks (cargo deny, audit, fmt, clippy, CodeQL,
schema-lint MR-946) — separate PRs as each lands.
- The script is NOT run by CI. Branch-protection changes are admin
actions; CI-driven auto-apply would defeat the purpose. Manual
invocation is the audit point.
How to apply after merge:
./scripts/apply-branch-protection.sh
Requires gh-CLI auth with repo-admin permissions.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* codeowners: generator + drift CI + initial roles
Source-of-truth approach to CODEOWNERS: yml is hand-edited, CODEOWNERS
is generated and CI-enforced. Every role change is a reviewable PR
with a permanent in-repo audit trail. No GitHub UI clicks, no shadow
state.
Initial roles:
engineering @aaltshuler owns crates/** + default (.github/,
scripts/, Cargo.*, openapi.json,
everything else not docs)
docs @aaltshuler @ragnorc owns docs/**, README.md, AGENTS.md,
CLAUDE.md, SECURITY.md
Per GitHub semantics, multiple owners on a CODEOWNERS line means "any
one satisfies the review" — for docs, either named member can approve.
Strict "N distinct approvers" would need a CI workaround (not wired
today; tracked for future hardening).
Components:
- .github/codeowners-roles.yml — source of truth. Edit this.
- .github/scripts/render-codeowners.py — generator (PyYAML; ~100 LoC).
- .github/CODEOWNERS — generated. CI rejects hand-edits.
- .github/workflows/codeowners.yml — two checks:
* drift: re-render and assert CODEOWNERS matches.
* noedit: reject PRs that edit CODEOWNERS without editing the yml.
- docs/codeowners.md — explains the source-of-truth pattern, how to
change roles, how to add new roles.
- AGENTS.md topic-index row.
What's NOT in this PR:
- Branch protection on main (separate PR; needs `gh api` call against
the org).
- Required-reviewer enforcement (depends on branch protection landing).
- Required CI status checks (depends on branch protection landing).
- Scheduled rotation (the schedule: block in the yml + a weekly
workflow). Today's roles are stable; rotation isn't needed yet.
- Linear-as-source-of-truth integration (Approach 4 from the design
discussion; deferred).
Verified:
- Generator output is deterministic (idempotent re-runs).
- scripts/check-agents-md.sh OK (28 links, 28 docs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* codeowners: fix catch-all ordering (Devin review #88)
Devin caught a real bug: GitHub CODEOWNERS uses "last match wins"
semantics, but the generator emitted the catch-all `*` AFTER specific
patterns. Net effect: `*` won for every file, silently nullifying the
docs role and never routing reviews to @ragnorc.
Fix is one-line — emit the default `*` line before iterating the
specific paths. Also:
- Added a regression assertion in the generator: after rendering, the
first non-comment line must start with `*` if a default is
configured. Generator exits non-zero otherwise. Catches the same
class of mistake in any future refactor.
- Rewrote the yml header comment, which incorrectly stated "keep
more-specific paths after broader patterns" (correct for GitHub
semantics but the generator was doing the opposite — so the comment
read as a description of behavior when it was actually a contradicted
intention).
Verified by re-rendering: `*` is now line 12, `crates/**` is line 14,
`docs/**` is line 15, etc. README.md matches both `*` and `README.md`;
`README.md` is later → wins → @aaltshuler + @ragnorc both assigned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First slice of the schema-lint chassis. Adds stable `OG-XXX-NNN`
codes to schema-migration rejections so operators can suppress, look
up, and filter on identifiers rather than free-text prose. Atlas-style
chassis adapted to omnigraph's typed-IR substrate (no SQL injection
vector, no per-engine locks, native edge/vector/embedding types).
What's in v0:
- New `omnigraph-compiler/src/lint/` module with:
- `diagnostic.rs` — Family / SafetyTier / Severity enums covering ten
families (DS, MF, CD, BC, NM, OW, NL, VE, ED, LK). Only DS and MF
are populated in this PR.
- `codes.rs` — 8 DiagnosticCode constants (OG-DS-101..105,
OG-MF-103, OG-MF-104, OG-MF-106). Five of the eight are wired to
real emission sites; the other three are reserved.
- Unit tests for catalog invariants: codes unique, prefix matches
family, suffixes are 3-digit, destructive defaults to error,
lookup() works, EMITTED_IN_V0 codes exist in ALL_CODES.
- `SchemaMigrationStep::UnsupportedChange` gains an optional
`code: Option<String>` field. New `unsupported_error_message()`
helper prefixes the message with `[code]` when present.
- 5 of 17 existing rejection paths now carry codes:
- `removing node type` → OG-DS-102
- `removing edge type` → OG-DS-103
- `removing property` → OG-DS-104
- `adding required property without backfill` → OG-MF-103
- `changing property type` → OG-MF-106
Remaining 12 paths carry `code: None` and are tagged as future work.
- `schema_apply` surfaces the formatted error (with `[code]` prefix);
CLI `omnigraph schema plan` renders the code on the
`unsupported change on <entity>` line.
- PR #62 destructive-rejection tests in `tests/schema_apply.rs` now
assert on the stable code (`msg.contains("OG-DS-104")`) instead of
the error-message substring. 11/11 tests pass.
- New `docs/schema-lint.md` documents the v0 catalog + the 10 families
+ Atlas prior art. AGENTS.md index updated.
What's explicitly NOT in v0 (subsequent PRs):
- No severity config in `omnigraph.yaml` (MR-694 §2).
- No `@allow(OG-XXX-NNN, "rationale")` suppression directive (§3).
- No `--allow-data-loss` flag or destructive-tier enforcement.
- No new `SchemaMigrationStep` variants (soft/hard drops, default,
widen/narrow). MR-700, MR-697 land those.
- No pre-migration checks (MR-941).
- No CD / VE / LK / NM family rules (MR-942..945).
- No CI integration (MR-946).
Tests: 235 compiler tests, 11 schema_apply integration tests, 14
lint module tests, 55 CLI tests — all green.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The L1 capability list claimed the flag was enabled "for the
commit-graph and run-registry datasets" — stale. Every Lance
dataset OmniGraph creates has enable_stable_row_ids: true; the
run-registry datasets are gone since MR-771. Replace with a single
paragraph capturing the invariant, the consequences (row-version
columns available, CreateIndex × Rewrite not retryable, Lance reader
version required), the legacy-dataset constraint (one-way at create,
dump-and-reload to migrate), and a pointer to the regression test in
staged_writes.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reframes the first-principle section to lead with Winters' "engineering
is programming integrated over time" as the lens, keeping "minimize
ongoing liability" as the operative directive and folding in "complexity
should be earned." Adds a new Tiebreakers subsection with two rules
that the prior section lacked clean appeals for:
- correctness > simplicity > performance (lexicographic)
- reversibility shapes evidence demand (reversible → prod metrics over
napkin math over RFCs; irreversible → RFC up-front)
Adds a Hyrum's-Law deny-list entry in both AGENTS.md and
docs/invariants.md §IX: shipping observable behavior is shipping a
contract, even when undocumented.
Net always-on context cost: ~7 lines. No renumbering of §I–VIII
invariants; Hyrum's Law lands in the deny-list to avoid breaking
back-references.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The audit of test coverage flagged three holes:
- `omnigraph optimize` and `omnigraph cleanup` had no integration tests
(no `maintenance.rs`). Add one covering empty/idempotent edges, the
policy-validation contract on `cleanup`, and head preservation under
aggressive policies.
- `apply_schema` only covered I32 -> I64 type-change rejection. Add the
symmetric narrowing case plus rejections for the other destructive
shapes (drop property with data, drop node type, drop edge type, add
required property without backfill) and assert the manifest version
doesn't advance. Add a positive `@rename_from` case to pin the
stable-type-id contract preserves rows through a rename.
- `docs/testing.md` was missing `validators.rs` and the new
`maintenance.rs` from its file table; bump the count and add rows.
* MR-786: merge-pair truth table with exhaustive op-variant matrix
Add crates/omnigraph/tests/merge_truth_table.rs that enumerates every
(left_op, right_op) cell from the operation vocabulary named in the
ticket — {noop, addNode, removeNode, addEdge, removeEdge, setProperty,
dropProperty, addLabel, removeLabel} — and asserts the deterministic
outcome of Omnigraph::branch_merge against a structured oracle.
The matrix is built in a 9x9 match in build_case, so adding a new
OpVariant is a compile-time, fail-on-omission task. Today's mutation
grammar only exposes insert | update set | delete (see
docs/query-language.md), so the 36 cells over the first six ops are
executable and the 45 cells involving dropProperty/addLabel/removeLabel
are recorded as Expected::Unsupported with a note. Each executable cell
spins up a fresh tempdir, applies one mutation per branch, calls
branch_merge, and asserts either:
* MergeOutcome (AlreadyUpToDate / FastForward / Merged) plus a
GraphAssert on the affected entities, or
* an OmniError::MergeConflicts whose entries match the expected
table_key + MergeConflictKind (row_id is optional because edge
ULIDs are generated at runtime).
branch_merge is directional, so the (L, R) and (R, L) cells live in
separate entries in the matrix and are run independently — the
op-pair symmetry encoded in build_case serves as the commutativity
oracle without doubling the runtime. End-to-end the suite runs in
~10s on a fresh build, well under the 30s budget asserted at the
bottom of the test.
Also adds a row to docs/testing.md so the test-coverage map points
future agents at this file.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* Use one Omnigraph handle for both branches
Self-review caught that the runner was opening two Omnigraph handles
on the same temp dataset (one for main, a second via Omnigraph::open
for feature). tests/branching.rs uses one handle and passes the branch
name to mutate_branch — same pattern works here and avoids any
cache-coherency surprises between the two handles. Also drops the
post-merge reopen, which only existed to give the second handle a
fresh snapshot.
Runtime drops ~10s -> ~9s.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* Assert exact conflict count, not subset inclusion
cubic and Devin Review both flagged that check_outcome's
Expected::Conflicts arm only enforces want ⊆ got, so a regression that
produces a spurious extra conflict (e.g. emitting both OrphanEdge and
a stray DivergentInsert) would silently pass the truth-table cell.
For a deterministic oracle that's the wrong direction — the cell pins
the exact conflict-artifact set, not a lower bound. Add an
assert_eq!(got.len(), want.len()) before the existence loop. All 36
executable cells still pass; runtime unchanged.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* Subsume 4 conflict tests in branching.rs into truth table
The four `branch_merge_reports_*_conflict` tests
(DivergentUpdate / DivergentInsert / DeleteVsUpdate / OrphanEdge)
were redundant with the deterministic-oracle cells in the new
`merge_truth_table.rs` and only added drift risk.
To preserve the post-conflict invariant that lived in
`branch_merge_reports_divergent_update_conflict` (target unchanged
after a failed merge), the truth-table runner now generalizes it:
on every `Conflicts` cell, main's state is asserted against
`state_after_apply_only(right_op)`. That gives strictly more
coverage than the deleted tests carried, since the invariant now
applies to *all* seven conflict cells, not just one.
The `UniqueViolation` and `CardinalityViolation` cases stay in
`branching.rs` — they're combinatorial (require >1 op per side
with a non-default schema) and out of scope for the pair-wise
truth table.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
* Fix misleading 'Total edges: 0' comment in (AddEdge, RemoveEdge) cell
Devin Review flagged that the comment said 'Total edges: 0' while the
parenthetical math evaluates to 1 (matching `GraphAssert::base()`).
The assertion is correct; only the leading number in the comment was
wrong. Reworded to 'Net edges: … = 1 (matches base)' so the prose
agrees with both the math and the assertion.
Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>
---------
Co-authored-by: Ragnor <ragnor@modernrelay.com>
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The architectural rule "no cross-query BEGIN/COMMIT; branches fill that
role" lives in docs/invariants.md §VI.23 but is not surfaced anywhere
user-facing. New users coming from Postgres/MySQL hit the gap when they
realize multiple queries on main are independently atomic, not jointly
atomic.
This page explains the model with worked examples:
* Single-query multi-statement (atomic by default)
* Two separate queries on main (NOT atomic — common surprise)
* Many queries via a branch (atomic at merge)
* Coordinating multiple agents via branch-per-agent
Plus a comparison table to BEGIN/COMMIT, failure-mode rundown, and
"when to use what" decision matrix.
Linked from AGENTS.md "Where to find each topic" between
branches-commits.md and runs.md.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups bundled because they're both single-line, post-MR-686
hygiene flagged by cubic during PR review:
- docs/server.md:102 said "Rate limiting — none" while the new
admission-control section earlier in the file documents 429s on the
five mutating handlers. Replace with a pointer to the admission
section and clarify that no graph-wide rate limiter is wired.
- schema_apply.rs:445-451 called `db.version().await` twice — once
for the conditional check, once in the error format string —
creating a cosmetic TOCTOU under interior mutability. Cache the
result in `current_manifest_version` so the error message reflects
the version that triggered the rejection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- docs/server.md: new "Per-actor admission control (MR-686)" section
documenting WorkloadController defaults, the 429/503 mapping with
Retry-After semantics, the Cedar-then-admission ordering, and the
/change-only-for-now scope. Adds 429 / 503 to the listed HTTP status
codes and `too_many_requests` / `service_unavailable` to the ErrorCode
enumeration in the error model paragraph.
- docs/architecture.md: server/CLI diagram updated. Adds WorkloadController
and WriteQueueManager nodes; flow is HTTP -> auth -> Cedar -> admission
-> engine -> queue. Engine label changed to "Arc<Omnigraph>" to reflect
the AppState flip. Prose now points at server.md and runs.md for the
admission/queue contracts. The CLI's bypass-admission note is preserved.
- docs/invariants.md §VI.23 status annotation: explicitly cites the
per-(table, branch) writer-queue + revalidation-under-queue as closing
the Lance-HEAD-vs-manifest drift class under concurrent writers once
the global RwLock is removed (PR 2 Step F). Continuous in-process
rollback recovery still aspirational (MR-870 ticket).
scripts/check-agents-md.sh passes (26 links, 26 docs).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three new §VI invariants name what OmniGraph commits to as an agent-native
system of record: branches as the cross-query coordination primitive,
per-query isolation as a per-query opt-in (Serializable up, eventual down),
and type-aware agent-resolvable merges. Plus an explicit non-commitments
subsection so reviewers see what is intentionally out of scope (Strict
Serializable across queries, cross-process linearizable single-object writes,
auto-resolution of ambiguous merge conflicts).
§VII and §VIII renumber by +3 to make room (35-43 -> 38-46, 44-47 -> 47-50);
deny-list and review-checklist references in §IX/§X follow. testing.md's
pre-existing stale §VII.33/34/36 references resolve to their actual
§VIII.47/48/50 targets in the same pass. staged_writes.rs:866's docstring
gains an MR-686 forward reference so the load-bearing concurrency-hazard
test points readers at the queue work that closes the gap.
§VI.34 is preserved alongside the broader §VI.36 to keep its MR-425
pointer addressable; the overlap is documented in §VI.36's status line.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three terminologies were calling themselves Phase A/B in PR #72:
1. Per-writer recovery (canonical, four phases A/B/C/D — sidecar /
commit_staged loop / manifest publish / sidecar delete in
`docs/runs.md:157`).
2. Per-table staged-write contract from MR-793 (two phases —
`stage_*` then `commit_staged`).
3. Test-narrative scaffolding (Phase A = setup the failure, Phase B
= verify recovery — used as section dividers in failpoints.rs).
Same letters, three meanings; three reviewers including the bots have
already misread the code in the resulting fog. This change keeps
"Phase A/B/C/D" exclusively for #1 and rewrites the other two:
- `ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`
→ `ensure_indices_stage_btree_failure_leaves_existing_tables_writable`
(matches the `stage_create_btree_index` API verb).
- Comment at `table_ops.rs:610` and the test docstring at
`failpoints.rs:807` rewrite "a Phase A failure in the staged-index
path" → "a stage-step failure in the staged-index path".
- Twelve `// Phase A:` / `// Phase B:` test scaffolding comment
headers in `failpoints.rs` (across six test fns) become
`// Setup:` / `// Recovery:`.
- A "Phase letter convention" note added near `docs/runs.md:165`
spells the rule out for future readers.
Also bundled: rename
`composite_flow_init_load_branch_merge_time_travel_optimize_cleanup`
→ `composite_flow_canonical_lifecycle` so it pairs as a story name
with `composite_flow_multi_branch_sequential_merges` (the previously-
deferred symmetry rename).
No behaviour change. Both renamed tests pass; full failpoints (18) +
composite_flow (2) suites pass; workspace baseline + clippy clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `composite_flow_multi_branch_sequential_merges` covering the
agent-workflow pattern that single-merge tests in `branching.rs`
cannot reach: two feature branches diverging from main with main
writes interleaved between every diverge point, sequential merges
into main, time-travel through the resulting merge DAG, and reopen
consistency over a multi-merge history.
The script (18 numbered steps with assertions per step):
init+load → mutate main → branch feat-a → mutate main → mutate
feat-a → branch feat-b → mutate feat-b → mutate feat-a (with
edge) → merge feat-a → mutate main → merge feat-b → time-travel
to pre-merge-a + pre-merge-b → reopen + verify.
Catches eight compositional gap categories that only surface with
≥2 merges and main mutations between them: base/LCA recomputation
across two merges, manifest-pin propagation through merge commits,
time-travel through merge DAG without state bleed-through, branch-
DAG consistency, sibling-branch isolation under writes elsewhere,
post-merge main-write integration, multi-merge reopen replay, and
clean-flow recovery-sidecar absence.
`composite_flow.rs` was added to `docs/testing.md` so the before-
every-task checklist points agents at the file before duplicating
coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bundle of three correctness fixes plus a shared invariants helper that
existing tests now use.
1. SchemaApply atomicity: close the residual gap where a sidecar exists
but staging files don't (e.g., Phase B failure BEFORE
`_schema.pg.staging` write). `recover_schema_state_files` now returns
a `SchemaStateRecovery` discriminator (`Noop` /
`CleanedStaging` / `CompletedStagingRename { schema_apply_sidecar }`);
the token threads through `recover_manifest_drift` →
`process_sidecar`. SchemaApply sidecars are eligible for roll-forward
ONLY when the staging rename completed in the same recovery pass.
Full mode rolls back; RollForwardOnly defers. Without this, recovery
would publish the manifest pin against new-schema data while
`_schema.pg` stayed old (real corruption). New failpoint
`schema_apply.before_staging_write` + new test
`schema_apply_without_schema_staging_rolls_back_on_next_open` pin
the gating.
2. Rollback target correction. Rollback now restores Lance HEAD to the
current manifest pin (`state.manifest_pinned`) instead of the
sidecar's `expected_version`. For UnexpectedAtP1/UnexpectedMultistep
classifications these can differ; the old code could regress Lance
HEAD past the manifest pin, re-introducing drift in the OTHER
direction. The new behavior establishes `Lance HEAD == manifest pin`
post-rollback — the canonical drift-free invariant. Param renamed
from `expected_version` → `target_version` to match. Audit
`to_version` records the actual restore target.
This is a latent-behavior change. Any external consumer that compared
`audit.to_version` against `sidecar.expected_version` for non-trivial
classifications now sees the manifest pin instead.
3. Audit commit-graph unification. `record_audit` now opens the
per-branch commit graph for ANY sidecar with `sidecar.branch.is_some()`
— not just BranchMerge. Plain Mutation/Load/EnsureIndices commits on a
feature branch now correctly land on that branch's commit graph,
instead of main's. Closes the class of bug analogous to D2 but for
non-merge writers.
Pre-existing repos with non-main commits already on main's commit
graph stay where they are; future recoveries write to the per-branch
ref. Mixed-version compatibility is asymmetric but safe (old binaries
ignore per-branch refs they don't know about; new binaries read both).
4. Recovery invariants helper + branch-axis cells. New
`tests/helpers/recovery.rs` (~505 LOC) exports
`assert_post_recovery_invariants(repo, op_id, RecoveryExpectation)`
plus a `TableExpectation` builder. Six existing recovery tests
refactored to call it; per-test bespoke assertions replaced. Two new
branch-axis cells added in `tests/failpoints.rs`:
- `recovery_rolls_forward_load_on_feature_branch`
- `recovery_rolls_forward_ensure_indices_on_feature_branch`
The loader gains a `mutation.post_finalize_pre_publisher` failpoint
hook (gated on the `failpoints` feature; zero-cost in release) so the
load test can pin the same Phase B → Phase C boundary the mutation
path uses.
Misc:
- `Omnigraph::refresh` extracts `reload_schema_if_source_changed`:
early-return when schema source unchanged (saves IR parse + catalog
rebuild on the steady-state refresh path).
- New test injection point
`failpoint_publish_table_head_without_index_rebuild_for_test`
under `#[cfg(feature = "failpoints")]`.
Tests: 31 recovery + failpoint integration tests pass (14 + 17, up from
14 + 16). Full workspace sweep with `--features failpoints` clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds RecoveryMode { Full, RollForwardOnly } and wires Omnigraph::refresh
to invoke roll-forward-only recovery. This closes the documented
"long-running server between Phase B failure and process restart"
residual without requiring a restart, for the common case (mutation /
load finalize → publisher failure).
Why roll-forward only and not full sweep:
* Roll-forward is safe under concurrency (publisher uses row-level
CAS).
* Roll-back uses Dataset::restore, which "wins" against concurrent
Append/Update/Delete/CreateIndex/Merge per check_restore_txn —
silently orphaning the concurrent writer's commit (pinned by
tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning).
Sidecars that classify as RollBack-eligible are LEFT ON DISK for the
next ReadWrite open, where no concurrent writers exist and full
restore is safe.
Implementation:
* recovery.rs: RecoveryMode enum; recover_manifest_drift takes mode;
process_sidecar branches on mode for Abort and RollBack — both
defer to next ReadWrite open under RollForwardOnly. RollForward
behavior unchanged.
* omnigraph.rs: Omnigraph::refresh promoted to pub; calls
recover_manifest_drift in RollForwardOnly mode after coordinator
refresh. Steady-state cost: one list_dir of __recovery (early
return on empty). Adds refresh_coordinator_only — pub(crate) —
for engine-internal callers that hold an in-flight sidecar (the
schema_apply lease-check + lock-release paths). Without this split,
refresh would race the in-flight sidecar.
* schema_apply.rs: switch all 6 internal db.refresh() call sites to
refresh_coordinator_only().
Tests:
* refresh_runs_roll_forward_recovery_in_process — trigger
mutation.post_finalize_pre_publisher; without restart, call
db.refresh(); assert sidecar deleted, drifted row visible,
subsequent mutation succeeds.
* refresh_defers_rollback_eligible_sidecar_to_next_open — synthesize
a Mutation sidecar with bogus expected (UnexpectedAtP1 → RollBack);
refresh leaves it on disk and Lance HEAD unchanged; drop and reopen
runs the full sweep which advances HEAD via restore.
Docs:
* docs/runs.md "Long-running servers" caveat updated to describe the
refresh-time roll-forward path and the rollback-defer behavior.
* docs/invariants.md §VI.23 status line updated to reflect in-process
closure of the common case.
Workspace tests pass with --features failpoints; no regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three bundled changes:
1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
the test function). OmniGraph is consumed by both humans and agents
- naming the test after one audience misframes the library.
2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
review-round labels from source, tests, and docs added by this
branch. Internal traceability belongs in commit messages and PR
descriptions, not in checked-in artifacts. Upstream
lance-format/lance issue refs and pre-existing MR-XXX refs in docs
not touched by this branch are left alone.
3. Two outstanding review findings addressed:
- `needs_index_work_node` / `needs_index_work_edge`: propagate
`count_rows` errors instead of `unwrap_or(0)`. Silently treating
transient I/O failures as "0 rows" risked skipping a table from
the recovery sidecar pin set that was actually about to be
modified.
- `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
strengthen the assertion to fail when sidecar B classifies under
a stale snapshot. The new assertion checks post-recovery Lance
HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
deleted + audit rows present" pair passed in both the bug and
fix paths because both delete the sidecar and write an audit
row; the differentiator is the post-recovery HEAD. Strengthening
the assertion exposed an additional nuance: in this overlapping-
sidecar scenario sidecar B's audit kind is RolledBack (no-op)
rather than RolledForward, since sidecar A's roll-forward
publishes Lance HEAD as the new manifest pin (absorbing B's
work). The docstring now explains why this is correct given
current `roll_forward_all` semantics.
All workspace tests pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bot reviewers (cubic, cursor, chatgpt-codex) caught 4 merge-blocking
bugs + 3 strongly-recommended fixes + 3 doc errors in the initial PR.
Each fix has a paired test demonstrating the bug before the fix.
Merge-blocking fixes:
- BranchMerge moved to loose-match classifier arm. publish_rewritten_
merge_table runs multiple commit_staged calls per table (merge_insert
+ delete_where + index rebuilds). Strict classification rolled back
valid completed Phase B work as UnexpectedMultistep. Three new unit
tests pin the loose-match behavior for BranchMerge.
- branch_merge sidecar uses self.active_branch() (the resolved target
branch) instead of inferring from the first sorted table key. The
previous heuristic could record None (== main) when the merge target
was a non-main branch, causing recovery to publish to the wrong
manifest namespace.
- Best-effort sidecar delete in all 5 writer sites (mutation, loader,
schema_apply, branch_merge, ensure_indices). Previously, a sidecar
cleanup failure after a successful manifest publish would error out
the user's call for a write that already landed. Now: log a warning
and ignore — the next open's recovery sweep tidies the stale sidecar
via NoMovement classification.
- ensure_indices sidecar scoped to tables that need work via new
helpers needs_index_work_node / needs_index_work_edge. Previously
the sidecar pinned every catalog table; if only one needed indexing,
the others classified as NoMovement and the all-or-nothing decision
rolled back legitimate index work.
Strongly-recommended fixes:
- recover_manifest_drift now takes &mut GraphCoordinator and refreshes
between sidecars. Sidecar B's classification needs to see sidecar
A's manifest changes, otherwise B can be classified against stale
pins and incorrectly roll back work that just landed.
- list_sidecars sorts URIs before reading. Sidecar filenames are
ULIDs (chronologically sortable), so this gives deterministic,
time-ordered processing. Filesystem-order was nondeterministic.
- ReadOnly opens skip recover_schema_state_files too (was: only the
MR-847 sweep was gated). Read-only consumers may run with read-only
credentials; silent open-time mutations violate the contract.
Doc cleanups:
- Removed stale "Phase 4 placeholder" comment from
recover_manifest_drift.
- docs/runs.md decision-tree wording now correctly surfaces the
InvariantViolation abort path.
- docs/branches-commits.md clarifies actor_id is in
_graph_commit_actors.lance (joined by graph_commit_id), not on
_graph_commits.lance itself.
Test surface (post-fixes):
- 25 unit tests in db::manifest::recovery (+4 from this commit).
- 10 integration tests in tests/recovery.rs (+3 from this commit).
- ~672 tests across ~25 binaries pass with --features failpoints.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update the doc surface to reflect MR-847 having shipped end to end —
sidecar protocol, classifier, all-or-nothing decision tree, roll-forward
via ManifestBatchPublisher, roll-back via Dataset::restore with
fragment-set short-circuit, audit trail in
_graph_commit_recoveries.lance, OpenMode::{ReadWrite, ReadOnly}, and
the four migrated writers all carrying sidecars across Phase B → Phase C.
- docs/invariants.md §VI.23: change from "upheld at the writer-trait
surface for inserts/updates/etc., per-table commit_staged → manifest
publish window remains" to "upheld at the writer-trait surface AND
across process boundaries". The MR-847 sweep closes the residual on
the next Omnigraph::open. The "continuous in-process" property
(no ExpectedVersionMismatch surfacing to subsequent writers between
Phase B failure and process restart) is honest follow-up at MR-856.
- docs/runs.md: replace "Finalize → publisher residual" section with
"Open-time recovery sweep (MR-847)" — describes the sidecar protocol
lifecycle (Phases A-D), the sweep's classifier + decision dispatch,
the audit trail, and the operator-facing query
(omnigraph commit list --filter actor=omnigraph:recovery).
- AGENTS.md capability matrix "Atomic single-dataset commits" row:
drop the "Layer (3) is not yet shipped — tracked in MR-847" caveat;
describe the three layers as all shipping; reference MR-856 for the
background-reconciler follow-up.
- docs/storage.md: add _graph_commit_recoveries.lance and
__recovery/{ulid}.json to the on-disk layout (mermaid + prose).
- docs/branches-commits.md: new "Recovery audit trail (MR-847)"
subsection describing the join from
_graph_commits.lance:actor_id="omnigraph:recovery" to
_graph_commit_recoveries.lance:graph_commit_id for operator
post-mortem.
- docs/maintenance.md: note the MR-847 recovery floor on cleanup —
--keep < 3 may garbage-collect Lance versions the recovery sweep
needs as a rollback target. Default --keep 10 is safe.
- docs/testing.md: add tests/recovery.rs to the engine integration-test
table; expand the failpoints.rs row to mention the four MR-847
per-writer Phase B → recovery integration tests.
- .context/mr-847-design.md: prepend a "Status: DONE" stanza listing
every commit hash + scope across phases 1-10.
AGENTS.md ↔ docs/ cross-link check passes (26 links, 26 docs).
Full workspace test sweep passes with --features failpoints (361 tests
across 20 binaries).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes MR-793 acceptance §1 via option (b): every inline-commit method
remaining on the trait surface is named, the upstream blocker or
internal phase that closes it is cited, and the call-site residual
comment is mandated.
Reframes the criterion text in the MR-793 ticket comment from "either
full sealing OR all residuals enumerated" — this commit ships the
"enumerated" path. The "full sealing" path (Phase 1b + Phase 9 + the
two Lance upstream tickets) closes the matrix entirely.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fetched https://lance.org/format/table/mem_wal/ in full via npx mdrip.
The "Overview / Details / Implementation" sidebar items turned out to
be anchor sections on the same URL, not separate pages.
Key findings (relevant to MR-847's recovery reconciler design):
* MemWAL is opt-in. Requires (1) unenforced primary key in schema,
(2) explicit shard config, (3) writers using the LSM-tree write
path. omnigraph does NOT enable it; we use direct write_fragments +
commit(Operation::Append).
* MemWAL is intra-table — addresses streaming-write throughput for
one Lance base table via MemTables → flushed MemTables → async
merge. It does not coordinate across multiple tables.
* MemWAL's recovery is intra-table: WAL replay reconstructs MemTable
state for one table. It does NOT help with omnigraph's cross-table
manifest-pinned-vs-Lance-HEAD drift class.
Conclusion: MR-847's recovery reconciler design is unaffected. The
two operate at different abstraction layers.
Borrowable: MemWAL's epoch-based fencing pattern is structurally
similar to a future multi-coordinator sidecar protocol; noted on
MR-847 for if MR-668 (multi-process) ever lands.
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
/ `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
`Dataset::add_columns` / `update_columns` / `drop_columns` /
`truncate_table` / `restore` and the bare `.merge_insert(` /
`.add_columns(` / `.update_columns(` / `.drop_columns(` /
`.truncate_table(` method patterns. Deliberately avoid `.append(` /
`.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
`graph_coordinator.rs` — they're manifest-layer infra that legitimately
uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
`None`) to `open_dataset_head_for_write` for consistency with the
sibling `indexed_tables` block. Schema apply rejects non-main
branches at the lock-acquire step today, so behavior is unchanged;
this is a defensive consistency fix that survives a future relaxation
of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
doc: softened the "stage_* + commit_staged are the only paths" /
"trait funnels every write" overclaim. Inline-commit residuals
(`delete_where`, `create_vector_index`) remain on the trait pending
upstream Lance work (#6658, #6666); legacy `append_batch` etc.
remain pending Phase 1b / Phase 9. Module doc now describes the
current transitional state honestly.
Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
but returned only `DeleteState`, dropping the post-delete dataset.
Future callers migrating from the inherent `&mut Dataset` API would
lose the post-delete dataset state needed for indexing /
`table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
the unused `Scanner` import (the marker existed only to suppress an
unused-import warning — fixing the import is the cleaner answer).
Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
triggers the failpoint via a schema-apply that adds a new node type;
proves that existing tables are unaffected (Person mutation succeeds
after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
drift on tables outside the failed `added_tables` iteration.
`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
writer-trait surface for inserts / updates / scalar-index builds /
merge_insert / overwrite after MR-793 PR #70. Per-table
commit_staged → manifest publish window remains; closing requires
MR-847's recovery-on-open reconciler. `delete_where` and
`create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
primitives are the building blocks; the reconciler task itself is
MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
impl on `TableStore` with opaque-handle signatures; no test impl
yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hoists Lance's stage+commit two-phase write pattern from "discipline at
each writer" to a sealed trait surface (`TableStorage`). New engine code
that needs to advance Lance HEAD MUST go through `stage_*` + `commit_staged`;
the trait's opaque `SnapshotHandle` / `StagedHandle` types keep
`lance::Dataset` and `lance::Transaction` out of trait signatures.
Phases landed (see .context/mr-793-design.md for the full plan):
* 1a: `crates/omnigraph/src/storage_layer.rs` — `TableStorage` trait,
sealed (only in-tree types can impl), single impl on `TableStore`
delegating to existing inherent methods; `Omnigraph::storage()`
accessor returns `&dyn TableStorage`.
* 2: three new staged primitives — `stage_overwrite`,
`stage_create_btree_index`, `stage_create_inverted_index` —
implementing the simple branch of Lance's `CreateIndexBuilder::execute`
(scalar indices only; vector indices stay inline because
`build_index_metadata_from_segments` is `pub(crate)` in lance-4.0.0).
Six new tests in `tests/staged_writes.rs` pin both the new primitives
and the inline residuals (`delete_where`, `create_vector_index`).
* 3: `tests/forbidden_apis.rs` — defense-in-depth integration test
walks engine source, fails on direct lance::* inline-commit API use
outside `table_store.rs` / `db/manifest/`. Skips comment lines and
honors `// forbidden-api-allow:` sentinels.
* 4: `ensure_indices` migration — scalar index builds now route through
`stage_create_*_index` + `commit_staged` instead of
`create_*_index(&mut Dataset)`. Vector indices stay inline (residual,
named honestly at the call site).
* 5: `branch_merge::publish_rewritten_merge_table` migration — the
merge_insert phase now uses `stage_merge_insert` + `commit_staged`;
delete phase stays inline (Lance #6658 residual, named honestly).
* 6: `schema_apply` rewritten_tables migration — non-empty rewrites
use `stage_overwrite` + `commit_staged`; empty-batch rewrites stay
inline because `InsertBuilder::execute_uncommitted` rejects empty
data. The narrow inline window is bounded by `__schema_apply_lock__`.
Verified-green test surface:
* `cargo test -p omnigraph-engine` — 68 lib + ~120 integration tests
(incl. 6 new staged_writes tests + the new forbidden_apis test).
* `cargo test -p omnigraph-engine --features failpoints --test failpoints`
— 5 tests, all green.
* `cargo test --workspace` — green.
Deferred to follow-up sessions (see design doc §17 split):
* Phase 1b — convert remaining engine call sites to `&dyn TableStorage`
(mostly READS that don't touch the staged-write invariant).
* Phase 7 — recovery-on-open reconciler (closes Phase B → Phase C
residual across process restarts; new subsystem).
* Phase 8 — index-coverage reconciler (full §VII.35 compliance —
removes synchronous index work from the publish path).
* Phase 9 — demote unused `TableStore` inherent methods to `pub(crate)`
(depends on Phase 1b).
Lance upstream blockers documented:
* lance-format/lance#6658 — two-phase delete API (open, no PRs).
* Companion: `build_index_metadata_from_segments` should be `pub` so
vector-index builds can be staged outside the lance crate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five fixes from PR #68 review (Cursor Bugbot + Codex + Cubic):
* **scan_with_pending gains merge-shadow semantics** (Codex P1, Cubic P1#1):
new `key_column: Option<&str>` parameter. When set, committed rows
whose key value appears in any pending batch are excluded from the
scan — making `scan_with_pending` correctly merge-semantic for chained
updates instead of naively unioning. execute_update calls with
Some("id"). Without this, a chained `update where age > 30` could
match a row whose pending value already moved out of range.
* **Multi-delete on same table no longer trips ExpectedVersionMismatch**
(Cursor Bugbot HIGH): open_table_for_mutation routes through
reopen_for_mutation when staging.inline_committed has the table,
using the post-inline-commit Lance version captured at record_inline
time. The legacy open_for_mutation_on_branch fence (Lance HEAD ==
manifest pinned) is correct cross-writer but wrong intra-query when
deletes have already advanced HEAD on this table. Branch goes away
when Lance ships two-phase delete (lance-format/lance#6658).
* **Cardinality validation consolidated** (Cursor LOW + Codex P2 +
Cubic P1#2 + Cubic P2): new exec/staging::count_src_per_edge +
enforce_cardinality_bounds shared by mutation and loader paths.
Restores the missing min-cardinality check on the engine path.
Loader Merge mode passes Some("id") to dedupe edges being updated
by id (not double-count committed + pending). Loader Append mode
and engine path pass None (ULID-generated ids never collide).
* **Dead count_rows_with_pending removed** (Cursor LOW): never called.
* **Misleading concat-helper comment fixed** (Cubic P3): claimed
schema normalization the helper doesn't implement. Updated to match
reality.
* **Documentation honesty** (Cubic P1#3): MR-794 narrows but doesn't
eliminate the "Lance HEAD ahead of __manifest" drift class. Drift is
unreachable for op-execution failures (the partial_failure test pins
this), but a residual remains at the finalize→publisher boundary
because Lance has no multi-dataset commit primitive: per-table
commit_staged calls run sequentially before manifest commit. Updated
docs/runs.md, docs/invariants.md §VI.25, docs/releases/v0.4.1.md to
scope the claim precisely.
* **Failpoint test pinning the residual**: new
mutation.post_finalize_pre_publisher failpoint + two tests in
tests/failpoints.rs that confirm the documented residual behavior.
Catches future regressions that widen the residual.
Test additions on tests/runs.rs:
* chained_updates_with_overlapping_predicate_respects_intermediate_value
* multi_statement_delete_on_same_node_table
* cascade_delete_node_then_explicit_delete_edge_on_same_table
* mutation_insert_edge_enforces_min_cardinality
* load_merge_mode_dedupes_edge_for_cardinality_count
113/113 engine integration tests pass (runs + end_to_end + consistency
+ staged_writes + validators). Failpoints feature build runs in CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.
MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
section; document the in-memory accumulator + D₂ rule + the
LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
upheld for inserts/updates. Within-query read-your-writes is now
load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
accumulator (MR-794)" subsection with per-op flow; refresh the
engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
for the staged-write path; updated the LoadMode table to call
out per-mode commit semantics; rewrote load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
description and the L2 capability matrix row.
Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
with omnigraph commit list/show.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mutate_as and load now write directly to target tables and call the
publisher once at the end with per-table expected versions; the Run
state machine, _graph_runs.lance writers, __run__ staging branches,
and server /runs/* endpoints are removed. Multi-statement mutations
remain atomic at the manifest level via an in-memory MutationStaging
accumulator that gives read-your-writes within a query and a single
publish at the end. Concurrent-writer conflicts surface as
ExpectedVersionMismatch (HTTP 409 manifest_conflict) instead of the
old DivergentUpdate merge shape. Documents one known limitation in
docs/runs.md: a multi-statement mid-query failure where op-N writes
a Lance fragment and op-N+1 fails leaves Lance HEAD ahead of the
manifest until a follow-up introduces per-table Lance branches.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two factual mismatches caught during code-grounded re-review:
- docs/architecture.md: "13 cmd families" was stale — the CLI has 17
Command variants (Version, Embed, Init, Load, Ingest, Branch, Schema,
Query, Snapshot, Export, Run, Commit, Read, Change, Policy, Optimize,
Cleanup). Replaced the count with "command families" so the diagram
doesn't drift again.
- docs/execution.md: the mutation prose said "every mutation runs on a
fresh __run__<id> branch", which over-claims. mutation.rs:555 short-
circuits when the target is already a __run__ branch — the assumption
there is the caller is managing the surrounding run lifecycle. Added a
one-paragraph caveat noting the exception with the file:line citation.
Both diagrams unchanged; only annotations / counts adjusted.
The mutation flow diagram and prose previously claimed multi-statement
mutations publish through a single ManifestRepo::commit. That's wrong:
each statement commits independently to a __run__<id> branch via
commit_updates; atomicity comes from the publish_run step that promotes
the run-branch into the target (fast path) or three-way merges it
(merge path).
Diagram now shows:
- begin_run forks __run__<id> from target head
- per-statement commit_updates on the run-branch (loop)
- OCC pre-check on target head
- publish_run with fast-path / merge-path branches
- terminate_run(Published)
Prose now points at runs.md for the full run lifecycle and cites the
correct entry points: mutate_with_current_actor (mutation.rs:539),
publish_run (omnigraph.rs:858).
Addresses Codex review comment on PR #64.
Replace the single ASCII stack in docs/architecture.md with a hierarchy of
Mermaid diagrams that show the system from external context down to the
component level. Add an on-disk layout diagram in docs/storage.md and two
sequence diagrams (read query, mutation) in docs/execution.md so readers
can navigate from "what is OmniGraph" to "how does a query run" without
opening source.
Static structure (docs/architecture.md):
- System context — agents/clients, embedding providers, Cedar, object store.
- Layer view — eight-layer stack with L1 (Lance) / L2 (OmniGraph) styling
via classDef, replacing the pre-existing ASCII art.
- Component zoom-ins — compiler, engine, storage trait, index lifecycle,
server/CLI. Each zoom-in cites file:line entry points.
Aspirational shapes (storage trait, full reconciler) are visually marked
and pointed at the relevant invariants.md section so readers see the
intended seam without thinking it's already implemented.
On-disk layout (docs/storage.md):
- Tree from repo URI through __manifest, nodes/, edges/, _graph_commits.lance,
_graph_runs.lance, _refs/branches/ down into Lance's per-dataset
internals (_versions/, data/, _indices/, _refs/, _transactions/).
- Annotated with the actual filenames so readers can `ls` the same paths.
- Slots in below the existing __manifest CAS / OCC / migration prose; does
not move or rewrite that content.
Runtime flows (docs/execution.md):
- Read flow sequence: client → Omnigraph::query → typecheck → lower →
execute_query → table_store → Lance scanner → RecordBatch stream.
- Mutation flow sequence: Omnigraph::mutate → resolve literals →
Lance write op (Append / merge_insert) → ManifestRepo::commit →
__manifest upsert.
- Both diagrams are followed by a "Code paths" block with verified
file:line citations so readers can navigate from diagram element to
source in one step.
Conventions established (this is the first Mermaid in the repo):
- L1 = orange (#fef3e8), L2 = blue (#e8f4fd), aspirational = dashed.
- Diagram size cap ~9 elements; more detail goes in a sub-diagram.
- Diagrams paired with prose; code-path citations follow each diagram.
- Consistent vocabulary across diagrams: frontend / compiler / engine /
storage trait / Lance / object store. No accidental synonyms.
Subsequent PRs will add flow diagrams for schema apply, branch + merge,
run isolation, index reconcile, and the embedding pipeline in the same
conventions.
The on-disk shape of `__manifest` is reconciled with the binary via a single
stamp + dispatcher in `db/manifest/migrations.rs`:
- `INTERNAL_MANIFEST_SCHEMA_VERSION = 2` declares the shape this binary writes.
- The on-disk stamp `omnigraph:internal_schema_version` lives in the manifest
dataset's schema-level metadata (Lance `update_schema_metadata`).
- `migrate_internal_schema(&mut dataset)` walks `match`-arm steps forward from
the on-disk stamp until it matches the binary, then returns. Idempotent.
- `init_manifest_repo` stamps the current version at creation; the publisher's
open-for-write path runs pending migrations before reading state. Reads
stay side-effect-free.
- Forward-version protection: a stamp higher than the binary's known version
triggers a clear "upgrade omnigraph first" error so an old binary cannot
clobber a newer schema.
Self-heals existing pre-MR-766 deployments by auto-applying the v1→v2 step:
the `lance-schema:unenforced-primary-key` annotation on `__manifest.object_id`
that engages Lance's row-level CAS at commit time. New repos created via
`init` are stamped at v2 immediately and don't need migration.
Adding a future on-disk shape change is one constant bump, one match arm in
`migrate_internal_schema`, and one test — no new branches in unrelated code
paths. Code outside the migration module never inspects the stamp.
New tests in `manifest/tests.rs`:
- `test_init_stamps_internal_schema_version`
- `test_publish_migrates_pre_stamp_manifest_to_current_version`
- `test_publish_rejects_manifest_stamped_at_future_version`
Docs: `docs/storage.md`, `docs/maintenance.md`, `docs/constants.md` updated
per the AGENTS.md maintenance contract.
- storage.md: document the row-level CAS annotation on `__manifest.object_id`
and the `expected_table_versions` OCC contract on `ManifestBatchPublisher::publish`.
- errors.md: list `ManifestConflictDetails` and its variants alongside `ManifestError`.
- constants.md: add `PUBLISHER_RETRY_BUDGET = 5`.
Per AGENTS.md "Maintenance contract": new schema construct, new constant, and
new typed error shape all need to ship with the source change.
- Removed §IX (OSS / Cloud kernel-product split) — business strategy belongs
in MR-738, not the technical invariants doc.
- Filled the §IV (Additivity / migration) placeholder with five evolution
invariants.
- Reframed §I to be substrate-agnostic: invariants are about respecting any
substrate; Lance / DataFusion are noted as the current chosen substrate
rather than as the invariant itself.
- Added §VI Database guarantees (12 invariants): atomicity, schema integrity,
isolation, durability, causal consistency, determinism, idempotency, no
silent loss, bounded operations, failure scope, crash recovery, consistency
model.
- Added §II.8 wire-protocol agnosticism (kernel transport-agnostic,
Flight/HTTP at the server boundary).
- Reframed §VII as "Current architectural patterns" — explicitly distinct
from invariants. Each pattern entry now names the underlying invariant it
realizes (reconciler / Union / mutations-wrap-reads / SIP / factorize /
stable row IDs / rank columns / policy predicates / Source).
- Pulled specific config defaults out of §VI (timeouts, memory caps);
invariant is that bounds exist, values live in docs/constants.md.
- Split §IX deny-list into "invariant violations" (high bar) and "pattern
violations" (overridable with justification).
- Added status legend: decided / open — see MR-X / aspirational. Annotated
invariants and patterns that are not yet upheld in current code.
- Updated review checklist (§X) to cover database-guarantee dimensions and
the wire-protocol / Source / patterns sections.
- Updated Living Document policy (§XI) to spell out how to revise patterns,
resolve open invariants, and lift aspirational annotations.
Source tickets: MR-737, MR-744, MR-765, MR-694 family, MR-722/MR-725.
All eight comments verified against source and applied:
- AGENTS.md: pull @docs/{invariants,lance,testing}.md imports out of
the markdown blockquote. Claude Code's @-import parser expects @ at
column 0; the leading "> " of a blockquote silently broke
recognition, so the claimed auto-include did nothing. (Cursor,
Medium severity.)
- docs/cli-reference.md: command-family count 13 → 17. The current
enum Command in crates/omnigraph-cli/src/main.rs has 17 top-level
variants. (cubic P2.)
- docs/ci.md: Homebrew tap update is a regular `git push`, not a
force-push (release.yml:117 is `git push origin HEAD:main`). (cubic
P2.)
- docs/errors.md: add the Storage variant to the NanoError list — it
exists at error.rs:88-89 but the doc enumerated only 10 of 11.
(cubic P2.)
- docs/storage.md: clarify tombstone semantics. There is no
tombstone_version column; state.rs:180 reads the tombstone version
from the table_version column on rows where object_type =
table_tombstone. (cubic P2.)
- docs/branches-commits.md: split the GraphCommit pseudo-struct from
the underlying storage. actor_id is joined in-memory from
_graph_commit_actors.lance, not a column on _graph_commits.lance.
(cubic P2.)
- docs/schema-language.md: rename IR_VERSION to SCHEMA_IR_VERSION to
match the actual constant name in catalog/schema_ir.rs:11.
(cubic P3.)
- docs/testing.md: engine integration test count 16 → 15 (matches
`ls crates/omnigraph/tests/*.rs`). (cubic P3.)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.
Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.
Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.
Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.
Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Curates the Lance documentation site (lance.org) into a problem-domain
index so agents fetch the right page when working on Lance-touching
code instead of guessing or grepping our codebase. Organized by topic:
storage format & file layout, branching/tags/time travel, indexes
(scalar + system + vector), reads/writes, schema evolution, object
store, data types, performance, compaction, DataFusion integration,
SDK reference, plus quick-starts and the upstream AGENTS.md.
Skips ~200 irrelevant URLs from the upstream sitemap (Namespace REST
API model surface, Spark/Trino/Databricks/etc. integrations,
Python/Ray/HuggingFace docs, community pages) since omnigraph is
Rust-only and doesn't run a Lance Namespace catalog.
AGENTS.md surfaces it in the topic index and adds a directive: "when
you hit a Lance-shaped problem, consult docs/lance.md and fetch the
upstream URL before guessing."
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A standing reference for invariants that hold across storage, engine,
server, schema, indexing, observability, and the OSS/Cloud split. Used
to check RFCs and PRs against the substrate boundaries (don't rebuild
what Lance gives us), layering rules (one trait boundary per layer),
distributability constraints (Send+Sync, location-neutral IR), honesty
expectations (estimate-vs-actual, bounded failure modes), unified
patterns (reconciler, Union polymorphism, SIP, factorize), the §IX
deny-list, and the §X review checklist.
§IV (additivity / migration) and §VIII (OSS/Cloud kernel-product split)
are referenced but not yet drafted — flagged as placeholders pending
upstream fill-in.
AGENTS.md surfaces it from the topic index, the always-on rules
section, and the maintenance contract; the deny-list is also inlined
there as a fast-pass review filter so it stays in scope every turn.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Splits the 990-line AGENTS.md into a 184-line map (architecture,
where-to-find index, always-on invariants, capability matrix,
maintenance contract) plus 18 new docs/*.md files holding the deep
content per topic (storage, schema and query languages, indexes,
embeddings, branches/commits, runs, merge, changes, execution, policy,
server, CLI reference, audit, errors, CI, constants, v0.3.1 notes).
Adds scripts/check-agents-md.sh and a check_agents_md CI job that
verifies every docs/ link in AGENTS.md resolves and every doc in the
canonical set is linked. CLAUDE.md remains a symlink to AGENTS.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- docs/deployment.md: new "Token sources" section listing the three
bearer-token source precedences (AWS SM, JSON file/env, single token).
New "Build Variants" section explaining default vs aws builds and
their release-artifact naming. New "AWS Secrets Manager" section
covering env var, secret payload format, IAM role credential
discovery, and the hard error for feature-less builds.
- CONTRIBUTING.md: documents the `aws` feature and the two test
commands contributors should run when touching auth code.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>