Commit graph

10 commits

Author SHA1 Message Date
Ragnor Comerford
7aca6ddac5
docs: PR 2 documentation pass (server / architecture / §VI.23)
- docs/server.md: new "Per-actor admission control (MR-686)" section
  documenting WorkloadController defaults, the 429/503 mapping with
  Retry-After semantics, the Cedar-then-admission ordering, and the
  /change-only-for-now scope. Adds 429 / 503 to the listed HTTP status
  codes and `too_many_requests` / `service_unavailable` to the ErrorCode
  enumeration in the error model paragraph.

- docs/architecture.md: server/CLI diagram updated. Adds WorkloadController
  and WriteQueueManager nodes; flow is HTTP -> auth -> Cedar -> admission
  -> engine -> queue. Engine label changed to "Arc<Omnigraph>" to reflect
  the AppState flip. Prose now points at server.md and runs.md for the
  admission/queue contracts. The CLI's bypass-admission note is preserved.

- docs/invariants.md §VI.23 status annotation: explicitly cites the
  per-(table, branch) writer-queue + revalidation-under-queue as closing
  the Lance-HEAD-vs-manifest drift class under concurrent writers once
  the global RwLock is removed (PR 2 Step F). Continuous in-process
  rollback recovery still aspirational (MR-870 ticket).

scripts/check-agents-md.sh passes (26 links, 26 docs).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 17:09:49 +02:00
Ragnor Comerford
c12f6adb0c
docs/invariants: add §VI.35-37 + non-commitments for MR-686
Three new §VI invariants name what OmniGraph commits to as an agent-native
system of record: branches as the cross-query coordination primitive,
per-query isolation as a per-query opt-in (Serializable up, eventual down),
and type-aware agent-resolvable merges. Plus an explicit non-commitments
subsection so reviewers see what is intentionally out of scope (Strict
Serializable across queries, cross-process linearizable single-object writes,
auto-resolution of ambiguous merge conflicts).

§VII and §VIII renumber by +3 to make room (35-43 -> 38-46, 44-47 -> 47-50);
deny-list and review-checklist references in §IX/§X follow. testing.md's
pre-existing stale §VII.33/34/36 references resolve to their actual
§VIII.47/48/50 targets in the same pass. staged_writes.rs:866's docstring
gains an MR-686 forward reference so the load-bearing concurrency-hazard
test points readers at the queue work that closes the gap.

§VI.34 is preserved alongside the broader §VI.36 to keep its MR-425
pointer addressable; the overlap is documented in §VI.36's status line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-07 14:45:54 +02:00
Ragnor Comerford
aaa031e834
recovery: refresh-time roll-forward closes the in-process residual
Adds RecoveryMode { Full, RollForwardOnly } and wires Omnigraph::refresh
to invoke roll-forward-only recovery. This closes the documented
"long-running server between Phase B failure and process restart"
residual without requiring a restart, for the common case (mutation /
load finalize → publisher failure).

Why roll-forward only and not full sweep:
  * Roll-forward is safe under concurrency (publisher uses row-level
    CAS).
  * Roll-back uses Dataset::restore, which "wins" against concurrent
    Append/Update/Delete/CreateIndex/Merge per check_restore_txn —
    silently orphaning the concurrent writer's commit (pinned by
    tests/staged_writes.rs::lance_restore_loses_to_concurrent_append_via_orphaning).
    Sidecars that classify as RollBack-eligible are LEFT ON DISK for the
    next ReadWrite open, where no concurrent writers exist and full
    restore is safe.

Implementation:
  * recovery.rs: RecoveryMode enum; recover_manifest_drift takes mode;
    process_sidecar branches on mode for Abort and RollBack — both
    defer to next ReadWrite open under RollForwardOnly. RollForward
    behavior unchanged.
  * omnigraph.rs: Omnigraph::refresh promoted to pub; calls
    recover_manifest_drift in RollForwardOnly mode after coordinator
    refresh. Steady-state cost: one list_dir of __recovery (early
    return on empty). Adds refresh_coordinator_only — pub(crate) —
    for engine-internal callers that hold an in-flight sidecar (the
    schema_apply lease-check + lock-release paths). Without this split,
    refresh would race the in-flight sidecar.
  * schema_apply.rs: switch all 6 internal db.refresh() call sites to
    refresh_coordinator_only().

Tests:
  * refresh_runs_roll_forward_recovery_in_process — trigger
    mutation.post_finalize_pre_publisher; without restart, call
    db.refresh(); assert sidecar deleted, drifted row visible,
    subsequent mutation succeeds.
  * refresh_defers_rollback_eligible_sidecar_to_next_open — synthesize
    a Mutation sidecar with bogus expected (UnexpectedAtP1 → RollBack);
    refresh leaves it on disk and Lance HEAD unchanged; drop and reopen
    runs the full sweep which advances HEAD via restore.

Docs:
  * docs/runs.md "Long-running servers" caveat updated to describe the
    refresh-time roll-forward path and the rollback-defer behavior.
  * docs/invariants.md §VI.23 status line updated to reflect in-process
    closure of the common case.

Workspace tests pass with --features failpoints; no regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 00:15:42 +02:00
Ragnor Comerford
05e52f2ee0
recovery: rename composite test, strip ticket references, address review
Three bundled changes:

1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
   the test function). OmniGraph is consumed by both humans and agents
   - naming the test after one audience misframes the library.

2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
   review-round labels from source, tests, and docs added by this
   branch. Internal traceability belongs in commit messages and PR
   descriptions, not in checked-in artifacts. Upstream
   lance-format/lance issue refs and pre-existing MR-XXX refs in docs
   not touched by this branch are left alone.

3. Two outstanding review findings addressed:
   - `needs_index_work_node` / `needs_index_work_edge`: propagate
     `count_rows` errors instead of `unwrap_or(0)`. Silently treating
     transient I/O failures as "0 rows" risked skipping a table from
     the recovery sidecar pin set that was actually about to be
     modified.
   - `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
     strengthen the assertion to fail when sidecar B classifies under
     a stale snapshot. The new assertion checks post-recovery Lance
     HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
     deleted + audit rows present" pair passed in both the bug and
     fix paths because both delete the sidecar and write an audit
     row; the differentiator is the post-recovery HEAD. Strengthening
     the assertion exposed an additional nuance: in this overlapping-
     sidecar scenario sidecar B's audit kind is RolledBack (no-op)
     rather than RolledForward, since sidecar A's roll-forward
     publishes Lance HEAD as the new manifest pin (absorbing B's
     work). The docstring now explains why this is correct given
     current `roll_forward_all` semantics.

All workspace tests pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 13:56:36 +02:00
Ragnor Comerford
932334ba01
recovery: document MR-847 ship across all reference docs (Phase 10)
Update the doc surface to reflect MR-847 having shipped end to end —
sidecar protocol, classifier, all-or-nothing decision tree, roll-forward
via ManifestBatchPublisher, roll-back via Dataset::restore with
fragment-set short-circuit, audit trail in
_graph_commit_recoveries.lance, OpenMode::{ReadWrite, ReadOnly}, and
the four migrated writers all carrying sidecars across Phase B → Phase C.

- docs/invariants.md §VI.23: change from "upheld at the writer-trait
  surface for inserts/updates/etc., per-table commit_staged → manifest
  publish window remains" to "upheld at the writer-trait surface AND
  across process boundaries". The MR-847 sweep closes the residual on
  the next Omnigraph::open. The "continuous in-process" property
  (no ExpectedVersionMismatch surfacing to subsequent writers between
  Phase B failure and process restart) is honest follow-up at MR-856.

- docs/runs.md: replace "Finalize → publisher residual" section with
  "Open-time recovery sweep (MR-847)" — describes the sidecar protocol
  lifecycle (Phases A-D), the sweep's classifier + decision dispatch,
  the audit trail, and the operator-facing query
  (omnigraph commit list --filter actor=omnigraph:recovery).

- AGENTS.md capability matrix "Atomic single-dataset commits" row:
  drop the "Layer (3) is not yet shipped — tracked in MR-847" caveat;
  describe the three layers as all shipping; reference MR-856 for the
  background-reconciler follow-up.

- docs/storage.md: add _graph_commit_recoveries.lance and
  __recovery/{ulid}.json to the on-disk layout (mermaid + prose).

- docs/branches-commits.md: new "Recovery audit trail (MR-847)"
  subsection describing the join from
  _graph_commits.lance:actor_id="omnigraph:recovery" to
  _graph_commit_recoveries.lance:graph_commit_id for operator
  post-mortem.

- docs/maintenance.md: note the MR-847 recovery floor on cleanup —
  --keep < 3 may garbage-collect Lance versions the recovery sweep
  needs as a rollback target. Default --keep 10 is safe.

- docs/testing.md: add tests/recovery.rs to the engine integration-test
  table; expand the failpoints.rs row to mention the four MR-847
  per-writer Phase B → recovery integration tests.

- .context/mr-847-design.md: prepend a "Status: DONE" stanza listing
  every commit hash + scope across phases 1-10.

AGENTS.md ↔ docs/ cross-link check passes (26 links, 26 docs).
Full workspace test sweep passes with --features failpoints (361 tests
across 20 binaries).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-03 00:46:24 +02:00
Ragnor Comerford
9b0920b5da
address PR #70 bot review (Cubic + Cursor): 7 inline + failpoint test + invariants notes
Cubic findings:
* `tests/forbidden_apis.rs`: expand `FORBIDDEN_PATTERNS` with `Dataset::write`
  / `Dataset::append` / `Dataset::delete` / `Dataset::merge_insert` /
  `Dataset::add_columns` / `update_columns` / `drop_columns` /
  `truncate_table` / `restore` and the bare `.merge_insert(` /
  `.add_columns(` / `.update_columns(` / `.drop_columns(` /
  `.truncate_table(` method patterns. Deliberately avoid `.append(` /
  `.delete(` / `.write(` (over-match `Vec::append`, `.delete_branch(`,
  arrow-array `.append(`, etc.). Allow-list `commit_graph.rs` and
  `graph_coordinator.rs` — they're manifest-layer infra that legitimately
  uses `Dataset::write` for system tables.
* `schema_apply.rs:253`: pass `entry.table_branch.as_deref()` (not
  `None`) to `open_dataset_head_for_write` for consistency with the
  sibling `indexed_tables` block. Schema apply rejects non-main
  branches at the lock-acquire step today, so behavior is unchanged;
  this is a defensive consistency fix that survives a future relaxation
  of the lock check.
* `storage_layer.rs:131` doc: was `Vec<&StagedWrite>` with lifetime
  claim; actually returns `Vec<StagedWrite>` (cloned). Fixed.
* `AGENTS.md:201` capability matrix row + `storage_layer.rs:1` module
  doc: softened the "stage_* + commit_staged are the only paths" /
  "trait funnels every write" overclaim. Inline-commit residuals
  (`delete_where`, `create_vector_index`) remain on the trait pending
  upstream Lance work (#6658, #6666); legacy `append_batch` etc.
  remain pending Phase 1b / Phase 9. Module doc now describes the
  current transitional state honestly.

Cursor Bugbot findings:
* `storage_layer.rs:360`: trait `delete_where` consumed `SnapshotHandle`
  but returned only `DeleteState`, dropping the post-delete dataset.
  Future callers migrating from the inherent `&mut Dataset` API would
  lose the post-delete dataset state needed for indexing /
  `table_state` queries. Fixed: returns `(SnapshotHandle, DeleteState)`
  matching `append_batch` / `overwrite_batch` shape.
* `storage_layer.rs:824`: removed dead `_scanner_type_marker` fn and
  the unused `Scanner` import (the marker existed only to suppress an
  unused-import warning — fixing the import is the cleaner answer).

Engine-level Phase A failpoint test (closes the partial-criterion
flagged in Cubic's acceptance-criteria checklist):
* `db/omnigraph/table_ops.rs::stage_and_commit_btree`: instrumented
  with `crate::failpoints::maybe_fail("ensure_indices.post_stage_pre_commit_btree")`
  between `stage_create_btree_index` and `commit_staged`.
* `tests/failpoints.rs::ensure_indices_phase_a_btree_failure_leaves_existing_tables_writable`:
  triggers the failpoint via a schema-apply that adds a new node type;
  proves that existing tables are unaffected (Person mutation succeeds
  after the failed apply) — i.e. Phase A failure leaves no Lance-HEAD
  drift on tables outside the failed `added_tables` iteration.

`docs/invariants.md` transitional notes:
* §VI.23 (atomicity per query): annotated as upheld at the
  writer-trait surface for inserts / updates / scalar-index builds /
  merge_insert / overwrite after MR-793 PR #70. Per-table
  commit_staged → manifest publish window remains; closing requires
  MR-847's recovery-on-open reconciler. `delete_where` and
  `create_vector_index` remain inline pending lance#6658 / #6666.
* §VII.35 (reconciler pattern): annotated as partial — staged
  primitives are the building blocks; the reconciler task itself is
  MR-848.
* §VIII.45 (reference impl per trait): `TableStorage` has its primary
  impl on `TableStore` with opaque-handle signatures; no test impl
  yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-02 18:47:07 +02:00
Ragnor Comerford
3223b51cf1
MR-794 step 2: address PR #68 review — merge semantics, cardinality, residual
Five fixes from PR #68 review (Cursor Bugbot + Codex + Cubic):

* **scan_with_pending gains merge-shadow semantics** (Codex P1, Cubic P1#1):
  new `key_column: Option<&str>` parameter. When set, committed rows
  whose key value appears in any pending batch are excluded from the
  scan — making `scan_with_pending` correctly merge-semantic for chained
  updates instead of naively unioning. execute_update calls with
  Some("id"). Without this, a chained `update where age > 30` could
  match a row whose pending value already moved out of range.

* **Multi-delete on same table no longer trips ExpectedVersionMismatch**
  (Cursor Bugbot HIGH): open_table_for_mutation routes through
  reopen_for_mutation when staging.inline_committed has the table,
  using the post-inline-commit Lance version captured at record_inline
  time. The legacy open_for_mutation_on_branch fence (Lance HEAD ==
  manifest pinned) is correct cross-writer but wrong intra-query when
  deletes have already advanced HEAD on this table. Branch goes away
  when Lance ships two-phase delete (lance-format/lance#6658).

* **Cardinality validation consolidated** (Cursor LOW + Codex P2 +
  Cubic P1#2 + Cubic P2): new exec/staging::count_src_per_edge +
  enforce_cardinality_bounds shared by mutation and loader paths.
  Restores the missing min-cardinality check on the engine path.
  Loader Merge mode passes Some("id") to dedupe edges being updated
  by id (not double-count committed + pending). Loader Append mode
  and engine path pass None (ULID-generated ids never collide).

* **Dead count_rows_with_pending removed** (Cursor LOW): never called.

* **Misleading concat-helper comment fixed** (Cubic P3): claimed
  schema normalization the helper doesn't implement. Updated to match
  reality.

* **Documentation honesty** (Cubic P1#3): MR-794 narrows but doesn't
  eliminate the "Lance HEAD ahead of __manifest" drift class. Drift is
  unreachable for op-execution failures (the partial_failure test pins
  this), but a residual remains at the finalize→publisher boundary
  because Lance has no multi-dataset commit primitive: per-table
  commit_staged calls run sequentially before manifest commit. Updated
  docs/runs.md, docs/invariants.md §VI.25, docs/releases/v0.4.1.md to
  scope the claim precisely.

* **Failpoint test pinning the residual**: new
  mutation.post_finalize_pre_publisher failpoint + two tests in
  tests/failpoints.rs that confirm the documented residual behavior.
  Catches future regressions that widen the residual.

Test additions on tests/runs.rs:

* chained_updates_with_overlapping_predicate_respects_intermediate_value
* multi_statement_delete_on_same_node_table
* cascade_delete_node_then_explicit_delete_edge_on_same_table
* mutation_insert_edge_enforces_min_cardinality
* load_merge_mode_dedupes_edge_for_cardinality_count

113/113 engine integration tests pass (runs + end_to_end + consistency
+ staged_writes + validators). Failpoints feature build runs in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 13:47:55 +02:00
Ragnor Comerford
a61e82f47a
MR-794 step 2: docs — runs/invariants/architecture/execution + cleanup
Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.

MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
  section; document the in-memory accumulator + D₂ rule + the
  LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
  upheld for inserts/updates. Within-query read-your-writes is now
  load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
  accumulator (MR-794)" subsection with per-op flow; refresh the
  engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
  for the staged-write path; updated the LoadMode table to call
  out per-mode commit semantics; rewrote load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
  contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
  test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
  description and the L2 capability matrix row.

Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
  from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
  remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
  legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
  with omnigraph commit list/show.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:43:19 +02:00
Ragnor Comerford
56b30c5c5a
Restructure invariants doc: drop commercial, separate patterns from invariants
- Removed §IX (OSS / Cloud kernel-product split) — business strategy belongs
  in MR-738, not the technical invariants doc.
- Filled the §IV (Additivity / migration) placeholder with five evolution
  invariants.
- Reframed §I to be substrate-agnostic: invariants are about respecting any
  substrate; Lance / DataFusion are noted as the current chosen substrate
  rather than as the invariant itself.
- Added §VI Database guarantees (12 invariants): atomicity, schema integrity,
  isolation, durability, causal consistency, determinism, idempotency, no
  silent loss, bounded operations, failure scope, crash recovery, consistency
  model.
- Added §II.8 wire-protocol agnosticism (kernel transport-agnostic,
  Flight/HTTP at the server boundary).
- Reframed §VII as "Current architectural patterns" — explicitly distinct
  from invariants. Each pattern entry now names the underlying invariant it
  realizes (reconciler / Union / mutations-wrap-reads / SIP / factorize /
  stable row IDs / rank columns / policy predicates / Source).
- Pulled specific config defaults out of §VI (timeouts, memory caps);
  invariant is that bounds exist, values live in docs/constants.md.
- Split §IX deny-list into "invariant violations" (high bar) and "pattern
  violations" (overridable with justification).
- Added status legend: decided / open — see MR-X / aspirational. Annotated
  invariants and patterns that are not yet upheld in current code.
- Updated review checklist (§X) to cover database-guarantee dimensions and
  the wire-protocol / Source / patterns sections.
- Updated Living Document policy (§XI) to spell out how to revise patterns,
  resolve open invariants, and lift aspirational annotations.

Source tickets: MR-737, MR-744, MR-765, MR-694 family, MR-722/MR-725.
2026-04-29 00:39:11 +02:00
Ragnor Comerford
c924e121d2
Add architectural invariants & deny-list as docs/invariants.md
A standing reference for invariants that hold across storage, engine,
server, schema, indexing, observability, and the OSS/Cloud split. Used
to check RFCs and PRs against the substrate boundaries (don't rebuild
what Lance gives us), layering rules (one trait boundary per layer),
distributability constraints (Send+Sync, location-neutral IR), honesty
expectations (estimate-vs-actual, bounded failure modes), unified
patterns (reconciler, Union polymorphism, SIP, factorize), the §IX
deny-list, and the §X review checklist.

§IV (additivity / migration) and §VIII (OSS/Cloud kernel-product split)
are referenced but not yet drafted — flagged as placeholders pending
upstream fill-in.

AGENTS.md surfaces it from the topic index, the always-on rules
section, and the maintenance contract; the deny-list is also inlined
there as a fast-pass review filter so it stays in scope every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 23:34:44 +02:00