omnigraph/docs/dev/testing.md

# Testing

This file is the always-on map of the test surface. **Consult it before every task** so you know what tests already cover the area you're about to change, what helpers to reuse, and where a new test belongs. The architectural invariant for boundary-matched tests lives in [docs/dev/invariants.md](invariants.md).

## Where tests live, per crate

| Crate | Path | Style |
|---|---|---|
| `omnigraph` (engine) | `crates/omnigraph/tests/` | Integration tests (28 files), fixture-driven, share `tests/helpers/mod.rs` |
| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | Per-area suites (post-modularization): `cli_cluster.rs` (cluster command surface + operator-actor cascade), `cli_cluster_e2e.rs` (spawned-binary lifecycle compositions — lost-state re-import recovery, out-of-band drift, graph-root destruction, multi-graph mixed-disposition convergence), `cli_data.rs` (load/read/change/branch/commit/export/snapshot/policy/embed/maintenance + operator format cascade), `cli_schema_config.rs` (init/config, schema plan/apply, RFC-008 deprecation warnings + `config migrate` + strict mode), `cli_queries.rs`, `system_local.rs` (full-cycle cluster lifecycle with a spawned `--cluster` server, applied-policy enforcement over HTTP, keyed-credential auth, operator aliases), `system_remote.rs`; share `tests/support/mod.rs` (hermetic `OMNIGRAPH_HOME` by default) |
| `omnigraph-cluster` | mostly in-source `#[cfg(test)] mod tests`; `tests/failpoints.rs` (feature-gated); `tests/s3_cluster.rs` (bucket-gated full lifecycle on object storage) | Cluster config parser, local JSON state diff, state CAS/lock handling/recovery, read-only validate/plan/status plus explicit refresh/import graph observations, config-only apply (content-addressed payload publish, disposition gating, composite-digest convergence, idempotent re-apply), catalog payload verification (status read-only, refresh drift + self-heal), failpoint crash-mid-apply / CAS-race coverage, Stage 4A graph creation (create executor, recovery sidecars + sweep rows, create crash windows), Stage 4B schema apply (migration previews in plan, schema executor, schema-apply sweep classification, schema crash windows), Stage 4C gated deletes (digest-bound approvals, delete executor + tombstones, delete sweep rows, delete crash windows), and 5A policy binding metadata (applies_to in the applied revision, binding-change diffing + convergence, pre-5A backfill), and the 5B serving-snapshot read API (converged read, refusal rows) |
| `omnigraph-server` | `crates/omnigraph-server/tests/` | Per-area suites (post-modularization): `auth_policy.rs`, `data_routes.rs`, `schema_routes.rs`, `stored_queries.rs`, `multi_graph.rs` (cluster-mode boot — converged serving, policy binding wiring, boot refusals — + the concurrent branch-ops matrix), `boot_settings.rs` (mode inference, PolicySource), `s3.rs` (bucket-gated: single-graph serving + config-free `--cluster s3://` boot), `openapi.rs` (OpenAPI drift / regeneration); share `tests/support/mod.rs` |
| `omnigraph-compiler` | mostly in-source `#[cfg(test)] mod tests` | Parser, type-checker, IR lowering, lint |

The engine's `tests/` is the principal coverage surface; most graph-shaped behavior is exercised there.

## Engine integration tests (`crates/omnigraph/tests/`)

| File | Covers |
|---|---|
| `end_to_end.rs` | Full init → load → query/mutate flow |
| `branching.rs` | Branch create / list / delete, lazy fork |
| `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9×9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. |
| `writes.rs` | Direct-publish writes: cancellation, non-strict insert/merge rebase under the per-table queue, strict stale-write conflicts, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
| `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) — primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead |
| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
| `lifecycle.rs` | Graph lifecycle, schema state |
| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
| `changes.rs` | `diff_between` / `diff_commits` |
| `consistency.rs` | Cross-table snapshot isolation, atomic publish |
| `schema_apply.rs` | Migration plan + apply, schema-apply lock |
| `search.rs` | FTS / vector / hybrid (`bm25`, `nearest`, `rrf`) |
| `traversal.rs` | `Expand`, variable-length hops, anti-join (CSR path — `OMNIGRAPH_TRAVERSAL_MODE` unset) |
| `traversal_indexed.rs` | BTREE-indexed Expand (`execute_expand_indexed`) forced via `OMNIGRAPH_TRAVERSAL_MODE`, asserted semantically equal to the CSR path; own binary, all `#[serial]` so env writes never race |
| `proptest_equivalence.rs` | Property-based query-correctness invariants over generated graphs (shared key alphabet forces cross-type id collisions, cycles, self-loops) — pins Expand-mode equivalence so a future fork divergence fails loudly instead of silently; `#[serial]` |
| `ordering.rs` | ORDER BY contract: descending, multi-key precedence, deterministic key-column tie-break (total order, so `ORDER … LIMIT` is deterministic), NULL placement (`nulls_first = !descending`) |
| `literal_filters.rs` | Execution goldens for non-string/non-integer scalar literal filters (F64/F32/Bool/Date/DateTime) across both the in-memory comparison arm and the Lance-pushdown arm |
| `aggregation.rs` | `count`, `sum`, `avg`, `min`, `max` |
| `export.rs` | NDJSON streaming export filters |
| `s3_storage.rs` | S3-backed graph (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) |
| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice |
| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). |
| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |

## Fixtures

`crates/omnigraph/tests/fixtures/` holds the canonical schema (`.pg`), seed data (`.jsonl`), and queries (`.gq`) shared across tests. Reuse these before inventing new ones — the helpers harness already knows how to load them.

## Test helpers

- **Engine** — `crates/omnigraph/tests/helpers/mod.rs`: `init_and_load()` (bootstrap a temp graph + load standard fixture), `snapshot_main()`, `snapshot_branch()`, query/mutation runners, row collection and counting. Use these instead of hand-rolling.
- **CLI** — `crates/omnigraph-cli/tests/support/mod.rs`: `Command`-style wrapper for invoking `omnigraph`, server-process spawning, fixture resolution, output assertion helpers.
- **Server** — no shared helpers; server tests call the `Omnigraph` engine API directly and exercise endpoints over the wire.

> Note: the storage adapter has an in-memory backend (`ObjectStorageAdapter::in_memory()`, full contract including true conditional updates) used by the adapter contract tests in `storage.rs`. It covers only the text-object layer (sidecars, schema staging, cluster state) — **Lance datasets bypass the adapter**, so engine integration tests still use `tempfile::tempdir()`. An in-memory Lance substrate remains an architectural ask — keep it explicit in [docs/dev/invariants.md](invariants.md) under known gaps.

## Failpoints (fault injection)

- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` (in `crates/omnigraph/Cargo.toml` **and** `crates/omnigraph-cluster/Cargo.toml`; the cluster feature does not enable the engine's).
- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` expose `maybe_fail("name")` and `ScopedFailPoint` for tests.
- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, cluster apply's payload→state-write window, etc.).
- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (crash-mid-apply + state CAS race via `fail::cfg_callback`; integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.

## RustFS / S3 integration

CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job):

- `cargo test -p omnigraph-engine --test s3_storage`
- `cargo test -p omnigraph-server --test s3` (single-graph serving + config-free `--cluster s3://` boot)
- `cargo test -p omnigraph-cluster --test s3_cluster` (full control-plane lifecycle on the bucket)
- `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
- `cargo test -p omnigraph-engine --features failpoints --test failpoints s3_` (recovery-sidecar lifecycle on a real bucket)

Locally, set `OMNIGRAPH_S3_TEST_BUCKET` (and the usual `AWS_*` vars including `AWS_ENDPOINT_URL_S3` for non-AWS) before running. Without those, S3 tests skip gracefully.

## System e2e requirements and suppression

The CLI system tests (`system_local.rs`) spawn the workspace-built `omnigraph` and `omnigraph-server` binaries (cargo provides paths via `CARGO_BIN_EXE_*`), bind ephemeral localhost ports, and use local-FS temp dirs — no external services, no env vars required; they run in the default `cargo test --workspace`. The comprehensive cluster lifecycle e2es (multi-server-restart flows) honor an opt-out for constrained sandboxes: set `OMNIGRAPH_SKIP_SYSTEM_E2E=1` to skip them with a logged message (the same graceful-skip pattern as the S3 gate). Cargo-native filtering also works: `cargo test --test system_local -- --skip local_cluster`.

## OpenAPI drift

`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json` and diffs against the checked-in copy. CI auto-commits the regeneration on same-repository PRs and otherwise runs in strict-check mode (env: `OMNIGRAPH_UPDATE_OPENAPI`).

## Examples & benches

- `crates/omnigraph/examples/bench_expand.rs` — runnable example (not part of CI).
- No `benches/` directories. Add `benches/` per crate when you ship a perf-driven change, and include the motivating workload with the optimization.

## Coverage tooling — what's missing

There is **no** coverage tooling in the repository today: no `tarpaulin.toml`, no `codecov.yml`, no coverage CI step. If you want to know whether your change is covered, the answer comes from reading and running the relevant integration tests, not from a tool.

If introducing coverage tooling is in scope for your task, the natural first step is `cargo-llvm-cov` wired into a separate CI job, and a per-crate threshold rather than a global one.

## First principle: check what already covers it

**Before writing any new test, check whether an existing test already covers the case.** The cost of duplicating coverage is high: more code to read, more places to keep in sync when behavior changes, and more drift when one copy lags. The cost of *extending* an existing test is usually one extra assertion or one extra fixture row.

How to check:

1. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `writes.rs`, `search.rs`, etc.). The filename usually names the area.
2. **Open the file and skim every test fn name.** Test fn names are the index — read them all, not just the first few.
3. **Grep for the symbol or path you're changing.** `rg <FunctionName>` or `rg <enum_variant>` across all `tests/` directories surfaces existing coverage you might miss.
4. **Decide one of three outcomes**, in this order of preference:
   - *Existing test already asserts the new behavior* → no new test needed; this PR is a refactor or no-op behaviorally. Confirm by running the existing test against the change.
   - *Existing test covers the area but not your case* → **add an assertion or a fixture row to the existing test**, don't write a new function with `init_and_load()` again.
   - *No existing coverage in any test file* → only then write a new test; put it in the file that owns the area, or open a new file only if the area itself is new.

Three duplicated `init_and_load() → run_query → assert_eq` blocks where one parameterized test would do is the most common form of test rot in this repository. Don't add to it.

## Before-every-task checklist

When you pick up any change, walk through this:

1. **Find existing coverage** (per the principle above). Don't just look at the first test file by name — grep for the symbol you're touching across every crate's `tests/`.
2. **Run those tests locally before editing.** `cargo test --workspace --locked` for the broad pass; `-p <crate> --test <file>` for a focused loop. Confirm a clean baseline.
3. **Decide extend-vs-new** explicitly. If you can extend an existing test (assertion, fixture row, parameterization), do that. Only add a new test fn or new file if no existing one owns the area.
4. **Reuse the helpers.** `init_and_load()`, fixture files, the CLI `support` harness — re-use them. Don't bootstrap a fresh graph by hand if a helper exists.
5. **Mind the boundary.** Per [docs/dev/invariants.md](invariants.md), test at the layer the change lives at — planner-level changes deserve planner-level tests, not just end-to-end.
6. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
7. **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
8. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.

When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								# Testing
-												docs: split user and developer docs (#93)
											
										
										
											2026-05-15 03:45:22 +03:00
+								This file is the always-on map of the test surface. **Consult it before every task** so you know what tests already cover the area you're about to change, what helpers to reuse, and where a new test belongs. The architectural invariant for boundary-matched tests lives in [docs/dev/invariants.md](invariants.md).
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
 								## Where tests live, per crate
 								| Crate | Path | Style |
 								|---|---|---|
-												docs(testing): bring the test map up to release truth

Lands an orphaned-but-accurate working-tree edit (the engine table rows
for forbidden_apis.rs, lance_surface_guards.rs, traversal_indexed,
proptest_equivalence, ordering, literal_filters, policy_engine_chassis —
all real files; 21 -> 28 count) and replaces the stale pre-modularization
crate rows: the CLI and server entries now describe the per-area suites
(#192/#193 splits) plus this cycle's additions (RFC-008 deprecation
coverage, keyed-credential auth, hermetic OMNIGRAPH_HOME harness, the
bucket-gated s3 suites).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-12 13:22:34 +03:00
+								| `omnigraph` (engine) | `crates/omnigraph/tests/` | Integration tests (28 files), fixture-driven, share `tests/helpers/mod.rs` |
 								| `omnigraph-cli` | `crates/omnigraph-cli/tests/` | Per-area suites (post-modularization): `cli_cluster.rs` (cluster command surface + operator-actor cascade), `cli_cluster_e2e.rs` (spawned-binary lifecycle compositions — lost-state re-import recovery, out-of-band drift, graph-root destruction, multi-graph mixed-disposition convergence), `cli_data.rs` (load/read/change/branch/commit/export/snapshot/policy/embed/maintenance + operator format cascade), `cli_schema_config.rs` (init/config, schema plan/apply, RFC-008 deprecation warnings + `config migrate` + strict mode), `cli_queries.rs`, `system_local.rs` (full-cycle cluster lifecycle with a spawned `--cluster` server, applied-policy enforcement over HTTP, keyed-credential auth, operator aliases), `system_remote.rs`; share `tests/support/mod.rs` (hermetic `OMNIGRAPH_HOME` by default) |
-												test(cluster,server): gated object-storage cluster e2e + CI wiring + docs

s3_cluster.rs runs the full control-plane lifecycle against a real
bucket (CI: containerized RustFS; locally the RustFS binary): import →
lock released (pins the drop-time release regression caught on the first
live smoke) → apply (graph roots + catalog on the bucket, nothing local)
→ serving snapshots from both the config dir and the bare URI → schema
evolution → approved delete (prefix removal) → empty-cluster refusal.
The server suite gains the config-free boot test: --cluster s3://… with
zero local files serves a stored query over HTTP.

CI: the rustfs job runs both suites; the classify filter covers the
cluster store/serve modules and the new test files. The server smoke
drops its name filter — every test in the s3 target is bucket-gated, and
a filter matching nothing passes vacuously (which silently ran zero
tests for a while).

Docs: deployment.md gains the Bucket-no-volume shape as the preferred
cloud deployment; cluster.md/server.md document --cluster <uri>;
testing.md maps the new suite.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-11 15:56:40 +03:00
+								| `omnigraph-cluster` | mostly in-source `#[cfg(test)] mod tests`; `tests/failpoints.rs` (feature-gated); `tests/s3_cluster.rs` (bucket-gated full lifecycle on object storage) | Cluster config parser, local JSON state diff, state CAS/lock handling/recovery, read-only validate/plan/status plus explicit refresh/import graph observations, config-only apply (content-addressed payload publish, disposition gating, composite-digest convergence, idempotent re-apply), catalog payload verification (status read-only, refresh drift + self-heal), failpoint crash-mid-apply / CAS-race coverage, Stage 4A graph creation (create executor, recovery sidecars + sweep rows, create crash windows), Stage 4B schema apply (migration previews in plan, schema executor, schema-apply sweep classification, schema crash windows), Stage 4C gated deletes (digest-bound approvals, delete executor + tombstones, delete sweep rows, delete crash windows), and 5A policy binding metadata (applies_to in the applied revision, binding-change diffing + convergence, pre-5A backfill), and the 5B serving-snapshot read API (converged read, refusal rows) |
-												docs(testing): bring the test map up to release truth

Lands an orphaned-but-accurate working-tree edit (the engine table rows
for forbidden_apis.rs, lance_surface_guards.rs, traversal_indexed,
proptest_equivalence, ordering, literal_filters, policy_engine_chassis —
all real files; 21 -> 28 count) and replaces the stale pre-modularization
crate rows: the CLI and server entries now describe the per-area suites
(#192/#193 splits) plus this cycle's additions (RFC-008 deprecation
coverage, keyed-credential auth, hermetic OMNIGRAPH_HOME harness, the
bucket-gated s3 suites).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-12 13:22:34 +03:00
+								| `omnigraph-server` | `crates/omnigraph-server/tests/` | Per-area suites (post-modularization): `auth_policy.rs`, `data_routes.rs`, `schema_routes.rs`, `stored_queries.rs`, `multi_graph.rs` (cluster-mode boot — converged serving, policy binding wiring, boot refusals — + the concurrent branch-ops matrix), `boot_settings.rs` (mode inference, PolicySource), `s3.rs` (bucket-gated: single-graph serving + config-free `--cluster s3://` boot), `openapi.rs` (OpenAPI drift / regeneration); share `tests/support/mod.rs` |
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								| `omnigraph-compiler` | mostly in-source `#[cfg(test)] mod tests` | Parser, type-checker, IR lowering, lint |
 								The engine's `tests/` is the principal coverage surface; most graph-shaped behavior is exercised there.
 								## Engine integration tests (`crates/omnigraph/tests/`)
 								| File | Covers |
 								|---|---|
 								| `end_to_end.rs` | Full init → load → query/mutate flow |
 								| `branching.rs` | Branch create / list / delete, lazy fork |
-												MR-786: merge-pair truth table with exhaustive op-variant matrix (#81)

* MR-786: merge-pair truth table with exhaustive op-variant matrix

Add crates/omnigraph/tests/merge_truth_table.rs that enumerates every
(left_op, right_op) cell from the operation vocabulary named in the
ticket — {noop, addNode, removeNode, addEdge, removeEdge, setProperty,
dropProperty, addLabel, removeLabel} — and asserts the deterministic
outcome of Omnigraph::branch_merge against a structured oracle.

The matrix is built in a 9x9 match in build_case, so adding a new
OpVariant is a compile-time, fail-on-omission task. Today's mutation
grammar only exposes insert | update set | delete (see
docs/query-language.md), so the 36 cells over the first six ops are
executable and the 45 cells involving dropProperty/addLabel/removeLabel
are recorded as Expected::Unsupported with a note. Each executable cell
spins up a fresh tempdir, applies one mutation per branch, calls
branch_merge, and asserts either:

  * MergeOutcome (AlreadyUpToDate / FastForward / Merged) plus a
    GraphAssert on the affected entities, or
  * an OmniError::MergeConflicts whose entries match the expected
    table_key + MergeConflictKind (row_id is optional because edge
    ULIDs are generated at runtime).

branch_merge is directional, so the (L, R) and (R, L) cells live in
separate entries in the matrix and are run independently — the
op-pair symmetry encoded in build_case serves as the commutativity
oracle without doubling the runtime. End-to-end the suite runs in
~10s on a fresh build, well under the 30s budget asserted at the
bottom of the test.

Also adds a row to docs/testing.md so the test-coverage map points
future agents at this file.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* Use one Omnigraph handle for both branches

Self-review caught that the runner was opening two Omnigraph handles
on the same temp dataset (one for main, a second via Omnigraph::open
for feature). tests/branching.rs uses one handle and passes the branch
name to mutate_branch — same pattern works here and avoids any
cache-coherency surprises between the two handles. Also drops the
post-merge reopen, which only existed to give the second handle a
fresh snapshot.

Runtime drops ~10s -> ~9s.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* Assert exact conflict count, not subset inclusion

cubic and Devin Review both flagged that check_outcome's
Expected::Conflicts arm only enforces want ⊆ got, so a regression that
produces a spurious extra conflict (e.g. emitting both OrphanEdge and
a stray DivergentInsert) would silently pass the truth-table cell.

For a deterministic oracle that's the wrong direction — the cell pins
the exact conflict-artifact set, not a lower bound. Add an
assert_eq!(got.len(), want.len()) before the existence loop. All 36
executable cells still pass; runtime unchanged.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* Subsume 4 conflict tests in branching.rs into truth table

The four `branch_merge_reports_*_conflict` tests
(DivergentUpdate / DivergentInsert / DeleteVsUpdate / OrphanEdge)
were redundant with the deterministic-oracle cells in the new
`merge_truth_table.rs` and only added drift risk.

To preserve the post-conflict invariant that lived in
`branch_merge_reports_divergent_update_conflict` (target unchanged
after a failed merge), the truth-table runner now generalizes it:
on every `Conflicts` cell, main's state is asserted against
`state_after_apply_only(right_op)`. That gives strictly more
coverage than the deleted tests carried, since the invariant now
applies to *all* seven conflict cells, not just one.

The `UniqueViolation` and `CardinalityViolation` cases stay in
`branching.rs` — they're combinatorial (require >1 op per side
with a non-default schema) and out of scope for the pair-wise
truth table.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

* Fix misleading 'Total edges: 0' comment in (AddEdge, RemoveEdge) cell

Devin Review flagged that the comment said 'Total edges: 0' while the
parenthetical math evaluates to 1 (matching `GraphAssert::base()`).
The assertion is correct; only the leading number in the comment was
wrong. Reworded to 'Net edges: … = 1 (matches base)' so the prose
agrees with both the math and the assertion.

Co-Authored-By: Ragnor Comerford <ragnor.comerford@gmail.com>

---------

Co-authored-by: Ragnor <ragnor@modernrelay.com>
Co-authored-by: Ragnor Comerford <ragnor.comerford@gmail.com>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
											
										
										
											2026-05-12 22:36:01 +03:00
+								| `merge_truth_table.rs` | Merge-pair truth table (MR-786): all 9×9 `(left_op, right_op)` cells from `{noop, addNode, removeNode, addEdge, removeEdge, setProperty, dropProperty, addLabel, removeLabel}`. Adding a new op to `OpVariant` forces a compile error in `build_case` until the new row + column are dispositioned. 36 executable cells run through real `branch_merge` with a structured oracle (`MergeOutcome` / `MergeConflictKind` + graph-state assert); 45 cells involving `dropProperty`/`addLabel`/`removeLabel` are recorded as `Unsupported` until the mutation grammar grows. |
-												fix(maintenance): route uncovered drift through repair (#156)

* docs(invariants): note the non-atomic manifest->commit-graph publish gap

Every graph publish commits __manifest then appends _graph_commits as two
separate writes; a crash between them leaves the manifest ahead of the commit
DAG. Live reads + durability are unaffected (reads resolve via the manifest) and
recovery does not repair it; impact is bounded to commit history / time-travel
by commit id / merge-base completeness. Pre-existing across all publishes, not
the optimize reconcile specifically. Documented as a Known Gap; the fix is a
commit-graph reconcilable from the manifest, not a recovery sidecar.

* fix(maintenance): route uncovered drift through repair

* fix(maintenance): harden repair review feedback
											
										
										
											2026-06-09 14:42:54 +02:00
+								| `writes.rs` | Direct-publish writes: cancellation, non-strict insert/merge rebase under the per-table queue, strict stale-write conflicts, multi-statement atomicity, MR-794 staged-write rewire (D₂ rejection, insert+update coalesce, multi-append coalesce, partial-failure recovery, load RI/cardinality recovery) |
-												MR-794 step 2: docs — runs/invariants/architecture/execution + cleanup

Refresh user-facing and agent-facing docs for the staged-write rewire
and clean up stale Run-state-machine references that survived MR-771.

MR-794-specific updates:
* docs/runs.md — remove "Known limitation: mid-query partial failure"
  section; document the in-memory accumulator + D₂ rule + the
  LoadMode::Overwrite residual.
* docs/invariants.md §VI.25 — flip from aspirational/open to
  upheld for inserts/updates. Within-query read-your-writes is now
  load-bearing for the publisher CAS contract.
* docs/architecture.md — add "Mutation atomicity — in-memory
  accumulator (MR-794)" subsection with per-op flow; refresh the
  engine + state diagrams to drop RunRegistry and add MutationStaging.
* docs/execution.md — rewrite the mutation flow sequence diagram
  for the staged-write path; updated the LoadMode table to call
  out per-mode commit semantics; rewrote load vs ingest.
* docs/query-language.md — document the D₂ parse-time rule.
* docs/errors.md — add the D₂ BadRequest rejection path.
* docs/testing.md — extend the runs.rs row to cover the new MR-794
  contract tests; add the staged_writes.rs row.
* docs/releases/v0.4.1.md (new) — release note covering the rewire,
  test additions, residuals, and files changed.
* AGENTS.md (CLAUDE.md symlink) — update the atomic-per-query
  description and the L2 capability matrix row.

Stale-reference cleanup (MR-771 leftovers):
* docs/storage.md — drop live _graph_runs.lance / _graph_run_actors.lance
  from the layout diagram and prose; mark legacy.
* docs/branches-commits.md — move __run__<id> to a legacy note;
  remove publish_run from the publish-trigger list.
* docs/audit.md — refresh _as API list (drop begin_run_as / publish_run_as);
  legacy RunRecord.actor_id moved to a historical note.
* docs/constants.md — mark run registry / branch-prefix rows as legacy.
* docs/cli.md — replace the legacy omnigraph run * quickstart block
  with omnigraph commit list/show.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-05-01 10:43:19 +02:00
+								| `staged_writes.rs` | TableStore staged-write primitives (`stage_append`, `stage_merge_insert`, `commit_staged`, `scan_with_staged`, `count_rows_with_staged`) — primitive-level only; engine code uses the in-memory `MutationStaging` accumulator instead |
-												docs(testing): bring the test map up to release truth

Lands an orphaned-but-accurate working-tree edit (the engine table rows
for forbidden_apis.rs, lance_surface_guards.rs, traversal_indexed,
proptest_equivalence, ordering, literal_filters, policy_engine_chassis —
all real files; 21 -> 28 count) and replaces the stale pre-modularization
crate rows: the CLI and server entries now describe the per-area suites
(#192/#193 splits) plus this cycle's additions (RFC-008 deprecation
coverage, keyed-credential auth, hermetic OMNIGRAPH_HOME harness, the
bucket-gated s3 suites).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-12 13:22:34 +03:00
+								| `forbidden_apis.rs` | Defense-in-depth source-walk guard: engine code (`exec/`, `db/omnigraph/`, `loader/`, `changes/`) must not reach around the sealed storage trait to Lance inline-commit APIs; `// forbidden-api-allow: <reason>` sentinel exempts reviewed lines |
 								| `lance_surface_guards.rs` | Pins the Lance API surfaces omnigraph depends on (named runtime + compile-only guards; see [lance.md](lance.md)) — the first smoke check on any Lance version bump; e.g. `compact_files_still_fails_on_blob_columns` turns red when the upstream blob-compaction fix lands |
-												Rename repo terminology to graph (#118)
											
										
										
											2026-05-24 16:46:00 +01:00
+								| `lifecycle.rs` | Graph lifecycle, schema state |
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								| `point_in_time.rs` | Snapshots, time travel (`snapshot_at_version`, `entity_at`) |
 								| `changes.rs` | `diff_between` / `diff_commits` |
 								| `consistency.rs` | Cross-table snapshot isolation, atomic publish |
 								| `schema_apply.rs` | Migration plan + apply, schema-apply lock |
 								| `search.rs` | FTS / vector / hybrid (`bm25`, `nearest`, `rrf`) |
-												docs(testing): bring the test map up to release truth

Lands an orphaned-but-accurate working-tree edit (the engine table rows
for forbidden_apis.rs, lance_surface_guards.rs, traversal_indexed,
proptest_equivalence, ordering, literal_filters, policy_engine_chassis —
all real files; 21 -> 28 count) and replaces the stale pre-modularization
crate rows: the CLI and server entries now describe the per-area suites
(#192/#193 splits) plus this cycle's additions (RFC-008 deprecation
coverage, keyed-credential auth, hermetic OMNIGRAPH_HOME harness, the
bucket-gated s3 suites).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-12 13:22:34 +03:00
+								| `traversal.rs` | `Expand`, variable-length hops, anti-join (CSR path — `OMNIGRAPH_TRAVERSAL_MODE` unset) |
 								| `traversal_indexed.rs` | BTREE-indexed Expand (`execute_expand_indexed`) forced via `OMNIGRAPH_TRAVERSAL_MODE`, asserted semantically equal to the CSR path; own binary, all `#[serial]` so env writes never race |
 								| `proptest_equivalence.rs` | Property-based query-correctness invariants over generated graphs (shared key alphabet forces cross-type id collisions, cycles, self-loops) — pins Expand-mode equivalence so a future fork divergence fails loudly instead of silently; `#[serial]` |
 								| `ordering.rs` | ORDER BY contract: descending, multi-key precedence, deterministic key-column tie-break (total order, so `ORDER … LIMIT` is deterministic), NULL placement (`nulls_first = !descending`) |
 								| `literal_filters.rs` | Execution goldens for non-string/non-integer scalar literal filters (F64/F32/Bool/Date/DateTime) across both the in-memory comparison arm and the Lance-pushdown arm |
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								| `aggregation.rs` | `count`, `sum`, `avg`, `min`, `max` |
 								| `export.rs` | NDJSON streaming export filters |
-												Rename repo terminology to graph (#118)
											
										
										
											2026-05-24 16:46:00 +01:00
+								| `s3_storage.rs` | S3-backed graph (skipped unless `OMNIGRAPH_S3_TEST_BUCKET` is set) |
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								| `lance_version_columns.rs` | Per-row `_row_last_updated_at_version` behavior |
-												Add maintenance + destructive-migration test coverage

The audit of test coverage flagged three holes:

- `omnigraph optimize` and `omnigraph cleanup` had no integration tests
  (no `maintenance.rs`). Add one covering empty/idempotent edges, the
  policy-validation contract on `cleanup`, and head preservation under
  aggressive policies.
- `apply_schema` only covered I32 -> I64 type-change rejection. Add the
  symmetric narrowing case plus rejections for the other destructive
  shapes (drop property with data, drop node type, drop edge type, add
  required property without backfill) and assert the manifest version
  doesn't advance. Add a positive `@rename_from` case to pin the
  stable-type-id contract preserves rows through a rename.
- `docs/testing.md` was missing `validators.rs` and the new
  `maintenance.rs` from its file table; bump the count and add rows.

											
										
										
											2026-04-28 23:31:52 +00:00
+								| `validators.rs` | Schema constraint enforcement (enum, range, unique, cardinality) across JSONL, insert, update paths |
-												docs(testing): bring the test map up to release truth

Lands an orphaned-but-accurate working-tree edit (the engine table rows
for forbidden_apis.rs, lance_surface_guards.rs, traversal_indexed,
proptest_equivalence, ordering, literal_filters, policy_engine_chassis —
all real files; 21 -> 28 count) and replaces the stale pre-modularization
crate rows: the CLI and server entries now describe the per-area suites
(#192/#193 splits) plus this cycle's additions (RFC-008 deprecation
coverage, keyed-credential auth, hermetic OMNIGRAPH_HOME harness, the
bucket-gated s3 suites).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-12 13:22:34 +03:00
+								| `policy_engine_chassis.rs` | Engine-layer Cedar enforcement (MR-722): allow + deny through every `_as` writer via the SDK directly — no HTTP — proving embedded and CLI callers hit the same gate as the server, with action × scope shapes matching `authorize_request` |
-												fix(maintenance): route uncovered drift through repair (#156)

* docs(invariants): note the non-atomic manifest->commit-graph publish gap

Every graph publish commits __manifest then appends _graph_commits as two
separate writes; a crash between them leaves the manifest ahead of the commit
DAG. Live reads + durability are unaffected (reads resolve via the manifest) and
recovery does not repair it; impact is bounded to commit history / time-travel
by commit id / merge-base completeness. Pre-existing across all publishes, not
the optimize reconcile specifically. Documented as a Known Gap; the fix is a
commit-graph reconcilable from the manifest, not a recovery sidecar.

* fix(maintenance): route uncovered drift through repair

* fix(maintenance): harden repair review feedback
											
										
										
											2026-06-09 14:42:54 +02:00
+								| `maintenance.rs` | `optimize` (compaction), `repair` (explicit uncovered-drift publish), and `cleanup` (version GC): empty/idempotent/no-op edges, policy validation, head preservation; `optimize` publishes its own compaction (`optimize_publishes_compaction_to_manifest_so_schema_apply_succeeds`), skips pre-existing uncovered drift (`optimize_skips_preexisting_manifest_head_drift`), and refuses to run while a `__recovery` sidecar is pending (`optimize_defers_when_recovery_sidecar_is_pending`); `repair` previews/heals verified maintenance drift, refuses raw semantic drift without `--force`, and forced repair publishes only by explicit operator choice |
-												Recovery liveness, storage fault-injection matrix, and one storage implementation over object_store (#203)

* test(engine): pin the long-lived-handle heal contract for sidecar-covered drift

A Phase B -> Phase C failure (commit_staged advanced Lance HEAD, manifest
publish did not land, recovery sidecar persists) currently wedges every
subsequent staged write on the same engine handle: the commit-time drift
guard rejects with 'run omnigraph repair', but repair itself refuses
while a recovery sidecar is pending, so a long-lived server can only
recover by restart. The documented contract (writes.md 'Long-running
servers', invariants.md invariant 5) says refresh-time roll-forward
closes this residual without restart -- but no write path runs it.

Two red tests pin the intended contract at the write entry points:
a follow-up load (the POST /ingest shape: shared handle, no reopen)
and a follow-up mutation must heal roll-forward-eligible sidecars
in-process and then succeed.

Currently failing with:
  table 'node:Company' has Lance HEAD version 2 ahead of manifest
  version 1; run `omnigraph repair` before writing

The fix lands in the next commit.

* fix(engine): heal pending recovery sidecars at the staged-write entry points

Close the long-lived-process gap in the recovery protocol: a Phase B ->
Phase C residual (per-table commit_staged landed, manifest publish did
not, sidecar persists) previously recovered only at the next ReadWrite
open or via an explicit refresh() that no production write path called,
so a long-lived server wedged every subsequent write on the commit-time
drift guard until restart.

New recovery::heal_pending_sidecars_roll_forward:
- one list_dir of __recovery/ at write entry (empty -> immediate
  return, the steady state), so the per-write cost is one storage list;
- per sidecar, acquires the same per-(table_key, table_branch) write
  queues every sidecar writer holds from before write_sidecar until
  after delete_sidecar, then re-checks sidecar existence -- this
  serializes the heal against live writers instead of rolling an
  in-flight sidecar forward from under its writer (which would fail
  that writer's publish CAS spuriously). Lock order queues ->
  coordinator matches every writer's commit->publish path. This is the
  queue-acquisition design recovery.rs and write_queue.rs already
  documented for in-process recovery;
- processes in RollForwardOnly mode: the common residual rolls forward
  in-process; rollback-eligible sidecars still defer to the next
  ReadWrite open (Dataset::restore is unsafe under concurrency).

Wire it into load_as and mutate_as (before the inline delete path can
advance any HEAD), and rebase Omnigraph::refresh onto the same helper
so refresh stops racing live writers' sidecars.

The maintenance entry points (apply_schema_as, branch_merge_as,
ensure_indices) intentionally keep their strict fail-loud preconditions
for now; wiring the same heal there is a follow-up with its own tests.

Turns the previous commit's two red tests green.

* fix(engine): name the right recovery path in the commit-time drift guard

The drift guard's 'run omnigraph repair before writing' advice is a
dead end when the drift is covered by a pending recovery sidecar:
repair refuses while a sidecar is pending. With the write-entry heal in
place, reaching this guard with sidecar-covered drift means the heal
deferred it (rollback-eligible), and the actual recovery path is a
read-write reopen. Distinguish the two classes on the error path only
(one sidecar list, after the conflict is already certain); a listing
failure falls back to the uncovered-drift wording rather than masking
the conflict.

Pinned by extending refresh_defers_rollback_eligible_sidecar_to_next_open
with a write attempt against the deferred sidecar.

* docs: write-entry in-process sidecar heal — contract and coverage

Update the recovery contract docs to match the previous two commits:
invariant 5 now states that the staged-write entry points and refresh
run in-process roll-forward recovery (long-lived processes converge on
the next write, not at restart); writes.md 'Long-running servers'
describes the heal's queue-acquisition concurrency contract, the
improved drift-guard error, and the entry points that intentionally do
not heal yet; testing.md indexes the new failpoint tests; AGENTS.md
capability matrix drops the claim that in-process recovery is entirely
future work (only the rollback path remains with the background
reconciler).

* test(engine): pin the entry heal contract for schema apply and branch merge

Without the write-entry heal, the two maintenance writers do worse than
wedge on sidecar-covered drift -- they proceed and decide its fate
implicitly:

- schema apply re-plans table rewrites from the manifest pin, orphaning
  the drifted Phase-B commit (its rows silently vanish from the
  rewritten table) while the stale sidecar lingers to misclassify
  against the post-apply pins;
- branch merge publishes over the drift, making the failed writer's
  commit visible as an unattributed side effect (no recovery audit
  row), and leaves the stale sidecar behind.

Two red tests pin the intended contract: both entry points heal the
sidecar first (attributed roll-forward), then run on the converged
state. Currently failing on the stale-sidecar / dropped-rows
assertions; the fix lands in the next commit.

* fix(engine): heal pending recovery sidecars at the schema-apply and branch-merge entries

Extend the write-entry heal to the remaining two write entry points.
Unlike load/mutate (which wedge on the drift guard), these proceeded
over sidecar-covered drift and decided its fate implicitly:

- schema apply re-planned table rewrites from the manifest pin,
  orphaning the drifted Phase-B commit -- its rows silently vanished
  from the rewritten table -- while the stale sidecar lingered to
  misclassify against the post-apply pins;
- branch merge published over the drift, making the failed writer's
  commit visible without a recovery audit row, and left the stale
  sidecar behind.

Both now run the same queue-serialized roll-forward heal at entry,
before their own sidecar exists, so recovery is attributed (audit row)
and deterministic. ensure_indices stays heal-free: it runs inside the
load / schema-apply flows after their entry heal.

Turns the previous commit's two red tests green. Docs updated in the
same change (invariant 5, writes.md, testing.md, AGENTS.md).

* test(engine): pin Phase A sidecar-write failure semantics

Storage fault-injection matrix, row 1: a sidecar PUT failure (S3
PutObject / fs write) in Phase A. New failpoint recovery.sidecar_write
at the top of write_sidecar -- the single choke point all five sidecar
writers go through -- models the storage error backend-generically.

Also adds the other three storage-fault failpoints used by the
following commits (recovery.sidecar_delete, recovery.sidecar_list,
recovery.record_audit); each is a no-op without the failpoints feature.

Pinned contract: every writer writes its sidecar BEFORE its first
HEAD-advancing commit, so a put failure aborts with zero drift (no
sidecar, Lance HEAD == manifest pin, no rows) and a transient fault
never wedges the graph -- the same handle writes/merges normally once
it clears. Covered for load (the staging writer) and branch_merge (the
multi-table writer, forced onto the RewriteMerged path by diverging
both sides).

* test(engine): pin Phase D delete, list, and audit-append storage-fault semantics

Storage fault-injection matrix, rows 2/3/5, plus the real-backend run:

- recovery.sidecar_delete: a Phase D delete failure (S3 DeleteObject)
  must NOT fail the user's write -- the manifest publish already
  landed, so the caller's data is durable. The swallowed failure
  leaves a stale sidecar; the next write's entry heal consumes it via
  the stale-sidecar audit-recovery path (RolledForward, attributed).

- recovery.sidecar_list: a __recovery/ list failure (S3 ListObjectsV2)
  is loud at every consumer -- the write-entry heal fails the write
  and the open-time sweep fails the open. Silently skipping recovery
  over a pending sidecar would be consumer tolerance of drift. Once
  the fault clears, open recovers the pending sidecar normally.

- recovery.record_audit: an audit write failure after the
  roll-forward's manifest publish aborts that recovery attempt and
  keeps the sidecar; re-entry detects the already-published manifest,
  records exactly ONE RolledForward audit row, and converges -- the
  retry tolerance documented on record_audit, exercised end-to-end.

- s3_load_recovers_after_publisher_failure_without_reopen: the
  same-handle heal scenario on a real bucket (gated on
  OMNIGRAPH_S3_TEST_BUCKET, skips locally), exercising sidecar
  put/list/delete through S3StorageAdapter instead of the local-FS
  adapter. CI wiring lands in a follow-up commit.

* test(engine): refuse corrupt recovery sidecars loudly

Storage fault-injection matrix, row 4 (no failpoint needed -- the
corrupt file is written by hand, sibling to the unknown-schema-version
refusal test): a truncated/garbage __recovery/{ulid}.json must be
refused loudly by both the write-entry heal (the write fails naming
the parse error) and the open-time sweep (ReadWrite open fails naming
the file), with the file left on disk for operator inspection.
Read-only opens still work -- the sweep is skipped there.

* test(engine): run the S3 sidecar-lifecycle coverage in CI + document the fault matrix

- ci.yml rustfs_integration: new step running the bucket-gated
  failpoints tests (name filter s3_) against the RustFS container, so
  sidecar put/list/delete are exercised through S3StorageAdapter on
  every storage-affecting PR.
- writes.md: sidecar I/O failure semantics -- Phase A put failure
  aborts with zero drift; Phase D delete failure is swallowed (write
  already durable) and healed by the next write; list failures are
  loud at heal and open; corrupt sidecars are refused with the file
  kept for inspection; audit-append failures are retried to exactly
  one audit row.
- testing.md: index the storage-fault matrix in the failpoints.rs row
  and the new RustFS CI line.

* test(engine): pin read-visibility of acknowledged local if-absent writes

The cluster lib test import_missing_state_creates_state_with_graph_-
observation flakes at ~50% under full-workspace load ('EOF while
parsing a value' reading back the state.json its own import just
acknowledged). Root cause is in the engine's local storage adapter:
write_text_if_absent writes through a buffered tokio::fs::File and
returns when write_all resolves -- which, per tokio's documented File
semantics, means the bytes reached tokio's internal buffer, not the
file. The actual write completes in a background blocking task after
drop, so a caller that acknowledges success and reads the object back
can see an empty or partial file. Under load the window widens; the
red run fails at iteration 0 with 0 of 8192 bytes on disk.

The regression test pins the contract at the adapter boundary: when
write_text_if_absent resolves, the full contents are visible to any
reader; a losing second claim leaves the winner's object untouched.

The fix lands in the next commit.

* fix(engine): publish local storage writes with atomic visibility

Close the class, not the instance. The local adapter admitted three
ways for a reader to observe a write that was acknowledged or visible
before its bytes were complete:

1. write_text_if_absent acknowledged success when the buffered
   tokio::fs::File write_all resolved -- i.e. when the bytes reached
   tokio's internal buffer, not the file. A caller reading back its own
   acknowledged write could see an empty object (the ~50% cluster
   import flake under full-workspace load; the regression test failed
   at iteration 0 with 0 of 8192 bytes visible).
2. The same call published its CLAIM (create_new) before its CONTENT,
   so concurrent readers saw an empty claimed file in the window.
3. write_text (plain tokio::fs::write) exposed truncated content
   mid-replace -- silently falsifying write_sidecar's 'readers either
   see the complete sidecar or none' contract on local FS (true on S3,
   where PutObject is atomic).

A flush in write_text_if_absent would have fixed only (1). Instead,
both local write paths now publish complete temp files atomically:
rename for replace (write_text -- the idiom write_text_if_match
already used) and hard_link for no-replace (write_text_if_absent --
link fails AlreadyExists, so exactly one of N concurrent claimants
wins and the winner's object is fully readable at the instant it
becomes visible). The local adapter now honors the same object-level
atomic-visibility contract as the S3 adapter, which is what every
caller (recovery sidecar protocol, cluster state CAS) was written
against. Crash-orphaned *.tmp.* files are inert: the sidecar sweep
filters to .json, and cluster state reads address state.json by name.

fsync/durability policy is unchanged (no fsync before, none now);
this fix is about visibility ordering, not power-loss durability.

Pre-existing on main (landed with the multi-graph server mode change,
PR #119); surfaced by this branch's heal work only because one extra
list_dir per write shifted test timing. Cluster lib suite: 12/25
failures before, 0/25 after. Turns the previous commit's red test
green.

* refactor(engine): one storage implementation over object_store for every backend

Collapse LocalStorageAdapter (hand-rolled tokio::fs) and
S3StorageAdapter into a single ObjectStorageAdapter backed by
Arc<dyn object_store::ObjectStore> -- LocalFileSystem for local URIs,
the existing AmazonS3 build for s3://, plus a pub in_memory()
constructor (full contract including TRUE conditional updates; the
in-memory test backend testing.md asked for at the adapter level).

Why: the acknowledged-before-visible bug showed the two-impl shape has
no referee -- one prose contract, two independent answers. Upstream
LocalFileSystem::put_opts is byte-for-byte the staged-temp+rename/
hard_link idiom that fix converged on, and Lance's own commit protocol
is built on the same primitives (put-if-not-exists / rename-if-not-
exists), so the substrate-aligned move is to stop hand-rolling it.
The per-backend residue shrinks to a UriCodec (URI <-> object path)
and one capability flag.

Semantics preserved by construction, with three deliberate deltas:
- exists() is now object-store-semantics everywhere (head + non-empty
  prefix fallback): an EMPTY local directory no longer 'exists'. The
  only dir-shaped caller (_graph_commits.lance probes) self-heals via
  ensure_commit_graph_initialized where it previously wedged loudly.
- A directory at an object path reads as NotFound, not as an IO error
  ('only objects exist'). The cluster unreadable-payload test used a
  same-named directory as a portable non-NotFound trigger; it now uses
  chmod 000, which still models genuine transient IO.
- write_text_if_match keeps content-token semantics on local
  (PutMode::Update is NotImplemented upstream for LocalFileSystem in
  0.12.5 and 0.13.2); the capability flag gates the token SOURCE in
  read_text_versioned too -- an ETag token with content-compare writes
  would lose every CAS.

delete_prefix keeps a local remove_dir_all branch: directories are a
local-FS concept, and list+delete would leave empty skeletons that
cluster graph_root_exists (raw Path::exists) reports as still present.

LocalStorageAdapter remains as a delegating shim so the pinned
contract tests gate this swap textually unchanged; the shim and the
test parameterization over local + in-memory land next. Cargo gains
the explicit 'fs' feature (already transitively enabled by lance).

* test(engine): one executable storage contract, run against every backend

Remove the LocalStorageAdapter delegation shim and migrate its
construction sites to ObjectStorageAdapter::local(). Replace the
per-backend duplicated tests with a single contract_suite asserting
the trait's promises (atomic replace, exists incl. the dataset-root
prefix probe, one-winner if_absent, versioned CAS with loud CAS-lost,
rename, list round-trip with no sibling-prefix bleed, idempotent
delete/delete_prefix), run against the local backend and the new
in-memory backend -- which implements true conditional updates, so the
strong-CAS path is exercised without a bucket. The bucket-gated S3
variant already exists (s3_adapter_conditional_writes_contract).

New local-specific pins for the deliberate semantic edges of the
collapse: empty directories are not objects (exists=false; the Lance
dataset-root probe shape is the non-empty case), file://-anchored and
spaces-in-path list output round-trips byte-identically into
read_text, dot-segment paths are lexically absolutized (the CLI's
./graph.omni shape), and upstream rename creating missing destination
parents. The acknowledged-write visibility regression test stays, now
documenting that the cross-API std::fs read-back is the point.

* refactor(cluster): drop put_json's per-backend atomicity branch

The local temp+rename dance predates the storage adapter guaranteeing
atomic visibility; now that write_text publishes via a staged temp +
rename on the filesystem (and a single atomic PUT on object stores) by
contract, the branch duplicated upstream behavior. One call, both
backends.

* docs: storage adapter collapse — contract, in-memory backend, local CAS gap

- testing.md: the 'no MemStorage backend' note is half-closed —
  ObjectStorageAdapter::in_memory() covers the text-object layer with
  the full contract (true conditional updates); Lance datasets bypass
  the adapter, so the engine substrate ask stays open.
- invariants.md: truth-matrix Tests row updated; new Known Gap for
  local write_text_if_match (upstream PutMode::Update is unimplemented
  for LocalFileSystem; content-token emulation is safe only under the
  cluster lock protocol — close before admitting a lock-free caller).
- writes.md: backend notes for the unified adapter (name#N staging
  residue invisible to the sweep, backend-wrapped error text with
  exists()-probing for missing-vs-error, loud permission failures).

* docs: finish renaming the storage adapters in user docs and test comments

storage.md's URI-scheme table and the S3 failpoint test's doc comment
still named the deleted LocalStorageAdapter/S3StorageAdapter; both now
describe the unified ObjectStorageAdapter over object_store, including
the relative-path absolutization note for local URIs.

* test(engine): pin branch-awareness of the drift guard's recovery advice

A pending sidecar on ANOTHER branch does not cover this branch's
drift: with a deferred feature-branch sidecar on disk and genuinely
uncovered drift on main, the main write's error must still point at
omnigraph repair -- a read-write reopen recovers the sidecar but
cannot repair main's uncovered drift. Currently red: the guard
matches sidecar pins by table_key only, so the feature sidecar flips
main's advice to the reopen path. Fix in the next commit.

Surfaced by external review of the drift-guard change.

* fix(engine): branch-aware sidecar matching in the drift guard's advice

The commit-time drift guard's sidecar-covered check matched pins by
table_key alone, so a pending sidecar on another branch flipped this
branch's uncovered-drift advice from 'run omnigraph repair' to the
reopen path -- and a reopen recovers that sidecar but cannot repair
this branch's drift. Compare the pin's table_branch too. Turns the
previous commit's red test green.

Surfaced by external review of the drift-guard change.

* test(engine): pin heal non-interference with a live schema apply

The write-entry heal's schema-staging reconcile runs before any queue
acquisition, so a load on the same handle, overlapping a schema apply
parked between its staging write and manifest commit, promotes the
apply's staging files (new catalog live against the old manifest),
classifies the LIVE apply's sidecar, and publishes its registrations
out from under it. The resumed apply then collides with its own stolen
commit. Currently red with:

  Lance("Concurrent modification: table version 3 already exists for
  node:Tag")

The fix (per-sidecar reconcile under the sidecar's write-queue guards,
plus a serialization key the schema-apply writer and the heal both
acquire) lands in the next commit.

Surfaced by external review of the write-entry heal.

* fix(engine): serialize the heal's schema-staging reconcile with live schema applies

The write-entry heal ran recover_schema_state_files up front, before
acquiring any queue guards. Overlapping a live schema apply parked
between its staging write and manifest commit, the heal promoted the
apply's staging files (new catalog live against the old manifest),
classified the LIVE apply's sidecar, and published its registrations —
the resumed apply then collided with its own stolen commit.

Correct by construction:

- New schema-apply serialization queue key, acquired by the schema-
  apply writer (alongside its per-table keys) from before write_sidecar
  until after delete_sidecar. Per-table keys alone don't cover a
  registration-only migration, which pins no existing tables but has a
  sidecar and staging files on disk.
- The heal reconciles schema staging lazily, PER SchemaApply sidecar,
  after acquiring that sidecar's guards (including the serialization
  key) and re-confirming the sidecar exists — a sidecar that survives
  the queue wait belongs to a dead writer, so the reconcile can no
  longer race a live apply. Recomputing per sidecar also removes the
  staleness of one up-front result across a multi-sidecar pass.
- Omnigraph::refresh drops its up-front reconcile-and-pass-through
  (same race, and a pre-promoted result would make the heal's guarded
  reconcile see clean staging and wrongly defer the sidecar): it now
  reconciles standalone only when NO sidecar exists — which cannot
  race a live apply, whose sidecar always precedes its staging files —
  and otherwise defers entirely to the heal.

The open-time sweep keeps its precomputed reconcile: open has no
concurrent writers. Turns the previous commit's red test green.

Surfaced by external review of the write-entry heal.

Self-audit addendum folded in: refresh's no-sidecar gate had a TOCTOU
(a live apply could write its sidecar + staging between the empty
check and the reconcile) — the standalone reconcile now holds the
serialization key across the list-then-reconcile pair. The remaining
residual is cross-process only (in-process queues cannot serialize
against a writer in another process; the open-time sweep has the same
pre-existing exposure) and is now an explicit Known Gap in
invariants.md rather than an implicit one.

* test(engine): pin catalog reload after the heal recovers a schema apply

When the write-entry heal rolls a crashed apply's SchemaApply sidecar
forward on the same handle, disk and manifest move to the new schema
(staging promoted, registrations published) but the handle's in-memory
schema_source/catalog do not. Subsequent writes then validate against
the stale catalog and reject rows of types the graph already has.
Currently red with:

  record 1: unknown node type 'Tag'

refresh() reloads after its heal; the write entry points must too.
Fix in the next commit.

Surfaced by external review of the write-entry heal.

* fix(engine): reload the in-memory catalog after the heal recovers a schema apply

heal_pending_recovery_sidecars refreshed the coordinator and
invalidated the runtime cache after processing sidecars, but never
reloaded schema_source/catalog — so a write whose entry heal rolled a
crashed SchemaApply sidecar forward proceeded to validate against the
OLD schema while disk and manifest were already on the new one.
reload_schema_if_source_changed is the same post-heal step refresh()
already runs; it no-ops on the (overwhelmingly common) non-schema heal
because the on-disk source is unchanged. Turns the previous commit's
red test green.

Surfaced by external review of the write-entry heal.

* test(engine): pin that a deleted-branch sidecar cannot wedge the graph

A rollback-eligible sidecar pinned to a branch is deferred by every
roll-forward-only pass; if the branch is then deleted, the sidecar
survives, referencing a branch with no manifest tree. The heal (every
write entry) and the open-time sweep (every ReadWrite open) both fail
opening the dead branch, and repair refuses while a sidecar is pending
-- a terminal read-only state with manual sidecar surgery as the only
exit. Currently red with:

  Lance("Not found: .../__manifest/tree/feature/_versions")

The branch's tree and forks are already reclaimed, so the pinned drift
is unreachable and the sidecar is provably moot; the fix classifies it
as an orphaned-branch terminal state (audit + discard) in both passes.

Surfaced by review (P1, verified by repro).

* fix(engine): classify deleted-branch sidecars as orphaned instead of wedging

A deferred (rollback-eligible) sidecar pinned to a branch survives
branch_delete; both the write-entry heal and the open-time sweep then
failed unconditionally opening the dead branch -- every write and
every ReadWrite open errored, and repair refuses while a sidecar
pends. Terminal state, manual sidecar surgery the only exit.

The branch's tree and per-table forks are already reclaimed at delete,
so the drift the sidecar pins is unreachable and the sidecar is
provably moot. Both passes now check the sidecar's branch against the
manifest's branch list (the authority -- deliberately NOT inferred
from a Not-found on open, which could be a transient storage error
masking real recovery intent) and discard orphans with an
OrphanedBranchDiscarded audit row, commit appended on main since the
sidecar's own branch no longer has a commit graph.

The open-time half is pre-existing; the write-entry heal made it hot.
Turns the previous commit's red test green.

Surfaced by review (P1, verified by repro).

* chore: harden review nits — vacuous CI filter, root-runner skip, liveness note

- ci.yml: the RustFS sidecar-lifecycle step now fails loudly if the
  's3_' name filter matches zero tests (cargo passes vacuously on an
  empty filter; the step exists specifically to prove S3 sidecar I/O
  coverage). The pre-existing CLI smoke step has the same shape and is
  left for a follow-up.
- cluster unreadable-payload test: cfg(unix) + a skip-with-log when
  running as root (mode 000 is still readable to root, common in
  container dev runners), so the test degrades instead of failing.
- refresh: document the one-pass-late convergence for legacy staging
  residue while non-SchemaApply sidecars pend, so nobody 'fixes' it by
  re-running the reconcile unserialized — the exact race the
  serialization key closes.

* test(engine): pin orphan-discard idempotency across a delete fault

discard_orphaned_branch_sidecar writes its audit row and main commit
before deleting the sidecar; a Phase D delete fault leaves the sidecar
on disk with the audit already durable, and the retry repeated the
whole path -- a second OrphanedBranchDiscarded audit row (and commit)
for the same operation. Currently red: 2 rows after one fault + retry.
The retry must only finish the delete. Fix next.

Also promotes the recovery-audit kinds reader into the shared test
helpers (it was recovery.rs-local).

Surfaced by external review of the orphan-discard fix.

* fix(engine): orphan-discard idempotency + heal reports acted-vs-deferred

Two review findings on the recovery surface:

- discard_orphaned_branch_sidecar now checks the audit table for an
  existing (operation_id, OrphanedBranchDiscarded) row before appending
  the commit + audit pair, so a Phase D delete fault retries ONLY the
  delete instead of duplicating audit rows and commit-graph entries.
  Cold path: the list scan runs only when an orphaned sidecar exists.
  Turns the previous commit's red test green (exactly one audit row
  across fault + retry).

- process_sidecar returns whether durable state changed; the heal sets
  processed_any only for sidecars that were actually rolled forward /
  rolled back / audit-recovered (orphan discards count). Deferred
  sidecars (rollback-eligible, invariant-violating, unpromoted
  SchemaApply) no longer trigger a per-write schema reload + full
  runtime-cache invalidation while they pend -- the cache is
  snapshot-keyed so this was waste, not corruption, but it was paid on
  every write until reopen. Acted-paths' processed=true remains pinned
  by load_after_schema_apply_phase_b_failure_uses_recovered_catalog
  (the reload depends on it).

Surfaced by external review.

* test(engine): pin the orphan-discard audit-append fault leg as documented tolerance

The orphan discard's commit append and audit append are two writes; a
failure between them leaves a recovery commit with no audit row, and
the retry (keyed on the audit row, the operator-facing record) appends
a second commit before the audit lands. This is the same
not-atomic-pair-write tolerance record_audit documents and the
manifest->commit-graph Known Gap covers for every publish: bounded
commit-graph noise, audit row exactly-once under clean failures.
Keying idempotency on commit rows instead would need an operation_id
column on _graph_commits, and audit-before-commit would dangle the
graph_commit_id join -- both worse than the documented residual.

Make the tolerance explicit instead of implicit: docstring names the
window, a failpoint sits inside it, and the new test pins convergence
across the fault (sidecar consumed, exactly one audit row), completing
the orphan-discard fault matrix alongside the delete-fault leg.

Surfaced by external review of the orphan-discard idempotency.

* test(engine): pin honest drift-guard advice when sidecar listing fails

The guard's unwrap_or(false) conflated 'classified as uncovered' with
'could not classify': a transient list fault on the guard's second
list (the entry heal's first list having succeeded) confidently routed
the operator to omnigraph repair even when the heal had just deferred
a rollback-eligible sidecar -- and repair refuses while a sidecar is
pending. Currently red: the error says 'run omnigraph repair' with no
mention of the reopen path. The fix names both paths plus the failure
cause when classification is impossible.

Surfaced by external review of the drift-guard fallback.

* fix(engine): admit ambiguity in the drift guard when sidecar listing fails

Replace the unwrap_or(false) fallback with a tri-state: covered ->
reopen advice; uncovered -> repair advice; listing FAILED -> say the
drift could not be classified, name the cause, and give both paths in
order ('run repair, or reopen read-write if repair reports a pending
sidecar'). The old fallback confidently routed a transient list fault
to repair, which refuses while a sidecar is pending -- a self-
correcting but pointless detour. The conflict itself is still always
raised; only the advice degrades honestly. Turns the previous commit's
red test green.

Surfaced by external review of the drift-guard fallback.
											
										
										
											2026-06-13 11:20:08 +02:00
+								| `failpoints.rs` | Failure-injection coverage (gated on `failpoints` feature). Includes the five per-writer Phase B → recovery integration tests (`recovery_rolls_forward_after_finalize_publisher_failure`, `schema_apply_phase_b_failure_recovered_on_next_open`, `branch_merge_phase_b_failure_recovered_on_next_open`, `ensure_indices_phase_b_failure_recovered_on_next_open`, `optimize_phase_b_failure_recovered_on_next_open`) and the write-entry in-process heal contract (the four `*_after_finalize_publisher_failure_heals_without_reopen` tests — load, mutation, schema apply, branch merge: a follow-up write on the same handle rolls a sidecar-covered residual forward without reopen/refresh) and the storage-fault matrix for the sidecar lifecycle (`recovery.sidecar_{write,delete,list}` / `recovery.record_audit` failpoints: Phase A put failure aborts with zero drift, Phase D delete failure is swallowed and healed by the next write, list failures are loud at heal and open, audit-append failures are retried to exactly one audit row; plus the bucket-gated `s3_load_recovers_after_publisher_failure_without_reopen`). |
-												recovery: rename composite test, strip ticket references, address review

Three bundled changes:

1. Rename `tests/agent_lifecycle.rs` -> `tests/composite_flow.rs` (and
   the test function). OmniGraph is consumed by both humans and agents
   - naming the test after one audience misframes the library.

2. Strip Linear ticket IDs, PR numbers, bot reviewer names, and
   review-round labels from source, tests, and docs added by this
   branch. Internal traceability belongs in commit messages and PR
   descriptions, not in checked-in artifacts. Upstream
   lance-format/lance issue refs and pre-existing MR-XXX refs in docs
   not touched by this branch are left alone.

3. Two outstanding review findings addressed:
   - `needs_index_work_node` / `needs_index_work_edge`: propagate
     `count_rows` errors instead of `unwrap_or(0)`. Silently treating
     transient I/O failures as "0 rows" risked skipping a table from
     the recovery sidecar pin set that was actually about to be
     modified.
   - `recovery_multi_sidecar_requires_fresh_snapshot_for_correctness`:
     strengthen the assertion to fail when sidecar B classifies under
     a stale snapshot. The new assertion checks post-recovery Lance
     HEAD == v3 (no `Dataset::restore` ran). The previous "sidecar
     deleted + audit rows present" pair passed in both the bug and
     fix paths because both delete the sidecar and write an audit
     row; the differentiator is the post-recovery HEAD. Strengthening
     the assertion exposed an additional nuance: in this overlapping-
     sidecar scenario sidecar B's audit kind is RolledBack (no-op)
     rather than RolledForward, since sidecar A's roll-forward
     publishes Lance HEAD as the new manifest pin (absorbing B's
     work). The docstring now explains why this is correct given
     current `roll_forward_all` semantics.

All workspace tests pass with --features failpoints.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

											
										
										
											2026-05-03 13:56:36 +02:00
+								| `recovery.rs` | Open-time recovery sweep — sidecar I/O, classifier dispatch (NoMovement / RolledPastExpected / UnexpectedAtP1 / UnexpectedMultistep / InvariantViolation), all-or-nothing decision, roll-forward via `ManifestBatchPublisher::publish`, roll-back via `Dataset::restore`, audit row in `_graph_commit_recoveries.lance`, `OpenMode::ReadOnly` skip path |
-												fix: optimize publishes compaction; recovery roll-back converges manifest (#141)

* test(optimize): cover manifest publish + HEAD-drift reconcile

Red against the pre-fix optimize, which ran compact_files without
publishing the compacted version to __manifest:

- maintenance: optimize must publish so the manifest table_version
  tracks the compacted Lance HEAD and a later schema apply succeeds;
  and must reconcile a pre-existing manifest-behind-HEAD drift (forged
  via raw Lance compaction) so strict writes commit again.
- end_to_end + composite_flow: post-optimize query / strict update /
  reopen in the full lifecycle (the canonical flow previously omitted
  post-optimize writes as a documented "known limitation").
- failpoints: a crash between compaction and the manifest publish rolls
  forward on next open.

* fix(optimize): publish compaction to manifest and reconcile HEAD drift

optimize ran Lance compact_files without publishing the new version to
__manifest, so the manifest table_version lagged the Lance HEAD: reads
stayed pinned to the pre-compaction version, and the next schema apply or
strict update/delete failed its HEAD-vs-manifest precondition with
"stale view ... refresh and retry" (open-time recovery rollback inflated
the gap on retry).

optimize now publishes each compacted table's version under the
per-(table, main) write queue, guarded by a manifest CAS and a
SidecarKind::Optimize recovery sidecar (loose-match; roll-forward is safe
because compaction is content-preserving). When a table has nothing left
to compact but its Lance HEAD is already ahead of the manifest pin
(pre-fix drift, or a recovery restore commit), optimize reconciles the
manifest forward to HEAD (metadata-only, no sidecar). Caches and the
CSR/CSC graph index are invalidated after a publish.

Docs updated (maintenance, storage, branches-commits, writes, testing).

* test(recovery): rollback convergence + optimize-defer regressions

Red against the current code, landed before the fix:
- recovery: after the open-time sweep rolls a sidecar back, the manifest
  must track Lance HEAD (no residual drift) so a follow-up schema apply
  succeeds — the original "+1 per retry" loop. Today roll-back restores
  without publishing, so the manifest lags HEAD and the apply fails its
  HEAD-vs-manifest precondition.
- maintenance: optimize must refuse while a recovery sidecar is pending —
  operating on an unrecovered graph could publish a partial write the
  sweep would roll back.

Also removes optimize_reconciles_preexisting_manifest_head_drift: the
ad-hoc drift reconcile it covered is replaced by recovery-side convergence.

* fix(recovery): converge manifest on roll-back; optimize defers on pending recovery

Root of PR #141's review findings and the original "+1 per retry" loop:
a Lance HEAD ahead of the manifest was ambiguous (benign content-preserving
drift vs. a partial write a sidecar will roll back), and optimize's reconcile
guessed it benign. Close the class instead of guessing:

- Recovery roll-back now PUBLISHES the restored version (via a
  push_table_update_at_head helper shared with roll-forward), so the manifest
  tracks the Lance HEAD after recovery — symmetric with roll-forward. This
  fixes the +1 loop (after one roll-back the retry's HEAD-vs-manifest
  precondition passes) and removes the only remaining source of orphaned
  drift. The audit still records the logical rolled-back-to version; the
  manifest is published at the restore commit (identical content).
- optimize drops the ad-hoc drift reconcile and instead REFUSES when a
  __recovery sidecar is pending, so it only ever operates on a recovered
  graph (manifest == HEAD); its compaction publish can no longer commit a
  partial write. With the reconcile gone, the blob-skip-vs-reconcile gap is
  moot.

Updates the rollback recovery-test helper (manifest == HEAD after roll-back),
the failpoints assertions, and the user/dev docs.

* test(recovery): fix rollback assertion for manifest convergence

The roll-back-publishes change makes the manifest version advance after a
SchemaApply roll-back (to the old-schema content), so the
schema_apply_without_schema_staging_rolls_back_on_next_open assertion must
be `version > pre`, not `version == pre`. This update was dropped during
the commit churn and surfaced as a CI Test Workspace failure; the
old-schema-preserved intent stays covered by count_rows + _schema.pg + the
RolledBack convergence invariant.
											
										
										
											2026-06-08 01:50:12 +02:00
+								| `composite_flow.rs` | Compositional/narrative end-to-end stories — multi-step flows that compose mechanics covered by other test files. Catches integration regressions where individual operations all pass their unit tests but their composition breaks (sequential merges, post-merge main writes, time-travel through merge DAG, reopen consistency over multi-merge histories, post-optimize and post-cleanup strict writes). |
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
 								## Fixtures
 								`crates/omnigraph/tests/fixtures/` holds the canonical schema (`.pg`), seed data (`.jsonl`), and queries (`.gq`) shared across tests. Reuse these before inventing new ones — the helpers harness already knows how to load them.
 								## Test helpers
-												Rename repo terminology to graph (#118)
											
										
										
											2026-05-24 16:46:00 +01:00
+								- **Engine** — `crates/omnigraph/tests/helpers/mod.rs`: `init_and_load()` (bootstrap a temp graph + load standard fixture), `snapshot_main()`, `snapshot_branch()`, query/mutation runners, row collection and counting. Use these instead of hand-rolling.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								- **CLI** — `crates/omnigraph-cli/tests/support/mod.rs`: `Command`-style wrapper for invoking `omnigraph`, server-process spawning, fixture resolution, output assertion helpers.
 								- **Server** — no shared helpers; server tests call the `Omnigraph` engine API directly and exercise endpoints over the wire.
-												Recovery liveness, storage fault-injection matrix, and one storage implementation over object_store (#203)

* test(engine): pin the long-lived-handle heal contract for sidecar-covered drift

A Phase B -> Phase C failure (commit_staged advanced Lance HEAD, manifest
publish did not land, recovery sidecar persists) currently wedges every
subsequent staged write on the same engine handle: the commit-time drift
guard rejects with 'run omnigraph repair', but repair itself refuses
while a recovery sidecar is pending, so a long-lived server can only
recover by restart. The documented contract (writes.md 'Long-running
servers', invariants.md invariant 5) says refresh-time roll-forward
closes this residual without restart -- but no write path runs it.

Two red tests pin the intended contract at the write entry points:
a follow-up load (the POST /ingest shape: shared handle, no reopen)
and a follow-up mutation must heal roll-forward-eligible sidecars
in-process and then succeed.

Currently failing with:
  table 'node:Company' has Lance HEAD version 2 ahead of manifest
  version 1; run `omnigraph repair` before writing

The fix lands in the next commit.

* fix(engine): heal pending recovery sidecars at the staged-write entry points

Close the long-lived-process gap in the recovery protocol: a Phase B ->
Phase C residual (per-table commit_staged landed, manifest publish did
not, sidecar persists) previously recovered only at the next ReadWrite
open or via an explicit refresh() that no production write path called,
so a long-lived server wedged every subsequent write on the commit-time
drift guard until restart.

New recovery::heal_pending_sidecars_roll_forward:
- one list_dir of __recovery/ at write entry (empty -> immediate
  return, the steady state), so the per-write cost is one storage list;
- per sidecar, acquires the same per-(table_key, table_branch) write
  queues every sidecar writer holds from before write_sidecar until
  after delete_sidecar, then re-checks sidecar existence -- this
  serializes the heal against live writers instead of rolling an
  in-flight sidecar forward from under its writer (which would fail
  that writer's publish CAS spuriously). Lock order queues ->
  coordinator matches every writer's commit->publish path. This is the
  queue-acquisition design recovery.rs and write_queue.rs already
  documented for in-process recovery;
- processes in RollForwardOnly mode: the common residual rolls forward
  in-process; rollback-eligible sidecars still defer to the next
  ReadWrite open (Dataset::restore is unsafe under concurrency).

Wire it into load_as and mutate_as (before the inline delete path can
advance any HEAD), and rebase Omnigraph::refresh onto the same helper
so refresh stops racing live writers' sidecars.

The maintenance entry points (apply_schema_as, branch_merge_as,
ensure_indices) intentionally keep their strict fail-loud preconditions
for now; wiring the same heal there is a follow-up with its own tests.

Turns the previous commit's two red tests green.

* fix(engine): name the right recovery path in the commit-time drift guard

The drift guard's 'run omnigraph repair before writing' advice is a
dead end when the drift is covered by a pending recovery sidecar:
repair refuses while a sidecar is pending. With the write-entry heal in
place, reaching this guard with sidecar-covered drift means the heal
deferred it (rollback-eligible), and the actual recovery path is a
read-write reopen. Distinguish the two classes on the error path only
(one sidecar list, after the conflict is already certain); a listing
failure falls back to the uncovered-drift wording rather than masking
the conflict.

Pinned by extending refresh_defers_rollback_eligible_sidecar_to_next_open
with a write attempt against the deferred sidecar.

* docs: write-entry in-process sidecar heal — contract and coverage

Update the recovery contract docs to match the previous two commits:
invariant 5 now states that the staged-write entry points and refresh
run in-process roll-forward recovery (long-lived processes converge on
the next write, not at restart); writes.md 'Long-running servers'
describes the heal's queue-acquisition concurrency contract, the
improved drift-guard error, and the entry points that intentionally do
not heal yet; testing.md indexes the new failpoint tests; AGENTS.md
capability matrix drops the claim that in-process recovery is entirely
future work (only the rollback path remains with the background
reconciler).

* test(engine): pin the entry heal contract for schema apply and branch merge

Without the write-entry heal, the two maintenance writers do worse than
wedge on sidecar-covered drift -- they proceed and decide its fate
implicitly:

- schema apply re-plans table rewrites from the manifest pin, orphaning
  the drifted Phase-B commit (its rows silently vanish from the
  rewritten table) while the stale sidecar lingers to misclassify
  against the post-apply pins;
- branch merge publishes over the drift, making the failed writer's
  commit visible as an unattributed side effect (no recovery audit
  row), and leaves the stale sidecar behind.

Two red tests pin the intended contract: both entry points heal the
sidecar first (attributed roll-forward), then run on the converged
state. Currently failing on the stale-sidecar / dropped-rows
assertions; the fix lands in the next commit.

* fix(engine): heal pending recovery sidecars at the schema-apply and branch-merge entries

Extend the write-entry heal to the remaining two write entry points.
Unlike load/mutate (which wedge on the drift guard), these proceeded
over sidecar-covered drift and decided its fate implicitly:

- schema apply re-planned table rewrites from the manifest pin,
  orphaning the drifted Phase-B commit -- its rows silently vanished
  from the rewritten table -- while the stale sidecar lingered to
  misclassify against the post-apply pins;
- branch merge published over the drift, making the failed writer's
  commit visible without a recovery audit row, and left the stale
  sidecar behind.

Both now run the same queue-serialized roll-forward heal at entry,
before their own sidecar exists, so recovery is attributed (audit row)
and deterministic. ensure_indices stays heal-free: it runs inside the
load / schema-apply flows after their entry heal.

Turns the previous commit's two red tests green. Docs updated in the
same change (invariant 5, writes.md, testing.md, AGENTS.md).

* test(engine): pin Phase A sidecar-write failure semantics

Storage fault-injection matrix, row 1: a sidecar PUT failure (S3
PutObject / fs write) in Phase A. New failpoint recovery.sidecar_write
at the top of write_sidecar -- the single choke point all five sidecar
writers go through -- models the storage error backend-generically.

Also adds the other three storage-fault failpoints used by the
following commits (recovery.sidecar_delete, recovery.sidecar_list,
recovery.record_audit); each is a no-op without the failpoints feature.

Pinned contract: every writer writes its sidecar BEFORE its first
HEAD-advancing commit, so a put failure aborts with zero drift (no
sidecar, Lance HEAD == manifest pin, no rows) and a transient fault
never wedges the graph -- the same handle writes/merges normally once
it clears. Covered for load (the staging writer) and branch_merge (the
multi-table writer, forced onto the RewriteMerged path by diverging
both sides).

* test(engine): pin Phase D delete, list, and audit-append storage-fault semantics

Storage fault-injection matrix, rows 2/3/5, plus the real-backend run:

- recovery.sidecar_delete: a Phase D delete failure (S3 DeleteObject)
  must NOT fail the user's write -- the manifest publish already
  landed, so the caller's data is durable. The swallowed failure
  leaves a stale sidecar; the next write's entry heal consumes it via
  the stale-sidecar audit-recovery path (RolledForward, attributed).

- recovery.sidecar_list: a __recovery/ list failure (S3 ListObjectsV2)
  is loud at every consumer -- the write-entry heal fails the write
  and the open-time sweep fails the open. Silently skipping recovery
  over a pending sidecar would be consumer tolerance of drift. Once
  the fault clears, open recovers the pending sidecar normally.

- recovery.record_audit: an audit write failure after the
  roll-forward's manifest publish aborts that recovery attempt and
  keeps the sidecar; re-entry detects the already-published manifest,
  records exactly ONE RolledForward audit row, and converges -- the
  retry tolerance documented on record_audit, exercised end-to-end.

- s3_load_recovers_after_publisher_failure_without_reopen: the
  same-handle heal scenario on a real bucket (gated on
  OMNIGRAPH_S3_TEST_BUCKET, skips locally), exercising sidecar
  put/list/delete through S3StorageAdapter instead of the local-FS
  adapter. CI wiring lands in a follow-up commit.

* test(engine): refuse corrupt recovery sidecars loudly

Storage fault-injection matrix, row 4 (no failpoint needed -- the
corrupt file is written by hand, sibling to the unknown-schema-version
refusal test): a truncated/garbage __recovery/{ulid}.json must be
refused loudly by both the write-entry heal (the write fails naming
the parse error) and the open-time sweep (ReadWrite open fails naming
the file), with the file left on disk for operator inspection.
Read-only opens still work -- the sweep is skipped there.

* test(engine): run the S3 sidecar-lifecycle coverage in CI + document the fault matrix

- ci.yml rustfs_integration: new step running the bucket-gated
  failpoints tests (name filter s3_) against the RustFS container, so
  sidecar put/list/delete are exercised through S3StorageAdapter on
  every storage-affecting PR.
- writes.md: sidecar I/O failure semantics -- Phase A put failure
  aborts with zero drift; Phase D delete failure is swallowed (write
  already durable) and healed by the next write; list failures are
  loud at heal and open; corrupt sidecars are refused with the file
  kept for inspection; audit-append failures are retried to exactly
  one audit row.
- testing.md: index the storage-fault matrix in the failpoints.rs row
  and the new RustFS CI line.

* test(engine): pin read-visibility of acknowledged local if-absent writes

The cluster lib test import_missing_state_creates_state_with_graph_-
observation flakes at ~50% under full-workspace load ('EOF while
parsing a value' reading back the state.json its own import just
acknowledged). Root cause is in the engine's local storage adapter:
write_text_if_absent writes through a buffered tokio::fs::File and
returns when write_all resolves -- which, per tokio's documented File
semantics, means the bytes reached tokio's internal buffer, not the
file. The actual write completes in a background blocking task after
drop, so a caller that acknowledges success and reads the object back
can see an empty or partial file. Under load the window widens; the
red run fails at iteration 0 with 0 of 8192 bytes on disk.

The regression test pins the contract at the adapter boundary: when
write_text_if_absent resolves, the full contents are visible to any
reader; a losing second claim leaves the winner's object untouched.

The fix lands in the next commit.

* fix(engine): publish local storage writes with atomic visibility

Close the class, not the instance. The local adapter admitted three
ways for a reader to observe a write that was acknowledged or visible
before its bytes were complete:

1. write_text_if_absent acknowledged success when the buffered
   tokio::fs::File write_all resolved -- i.e. when the bytes reached
   tokio's internal buffer, not the file. A caller reading back its own
   acknowledged write could see an empty object (the ~50% cluster
   import flake under full-workspace load; the regression test failed
   at iteration 0 with 0 of 8192 bytes visible).
2. The same call published its CLAIM (create_new) before its CONTENT,
   so concurrent readers saw an empty claimed file in the window.
3. write_text (plain tokio::fs::write) exposed truncated content
   mid-replace -- silently falsifying write_sidecar's 'readers either
   see the complete sidecar or none' contract on local FS (true on S3,
   where PutObject is atomic).

A flush in write_text_if_absent would have fixed only (1). Instead,
both local write paths now publish complete temp files atomically:
rename for replace (write_text -- the idiom write_text_if_match
already used) and hard_link for no-replace (write_text_if_absent --
link fails AlreadyExists, so exactly one of N concurrent claimants
wins and the winner's object is fully readable at the instant it
becomes visible). The local adapter now honors the same object-level
atomic-visibility contract as the S3 adapter, which is what every
caller (recovery sidecar protocol, cluster state CAS) was written
against. Crash-orphaned *.tmp.* files are inert: the sidecar sweep
filters to .json, and cluster state reads address state.json by name.

fsync/durability policy is unchanged (no fsync before, none now);
this fix is about visibility ordering, not power-loss durability.

Pre-existing on main (landed with the multi-graph server mode change,
PR #119); surfaced by this branch's heal work only because one extra
list_dir per write shifted test timing. Cluster lib suite: 12/25
failures before, 0/25 after. Turns the previous commit's red test
green.

* refactor(engine): one storage implementation over object_store for every backend

Collapse LocalStorageAdapter (hand-rolled tokio::fs) and
S3StorageAdapter into a single ObjectStorageAdapter backed by
Arc<dyn object_store::ObjectStore> -- LocalFileSystem for local URIs,
the existing AmazonS3 build for s3://, plus a pub in_memory()
constructor (full contract including TRUE conditional updates; the
in-memory test backend testing.md asked for at the adapter level).

Why: the acknowledged-before-visible bug showed the two-impl shape has
no referee -- one prose contract, two independent answers. Upstream
LocalFileSystem::put_opts is byte-for-byte the staged-temp+rename/
hard_link idiom that fix converged on, and Lance's own commit protocol
is built on the same primitives (put-if-not-exists / rename-if-not-
exists), so the substrate-aligned move is to stop hand-rolling it.
The per-backend residue shrinks to a UriCodec (URI <-> object path)
and one capability flag.

Semantics preserved by construction, with three deliberate deltas:
- exists() is now object-store-semantics everywhere (head + non-empty
  prefix fallback): an EMPTY local directory no longer 'exists'. The
  only dir-shaped caller (_graph_commits.lance probes) self-heals via
  ensure_commit_graph_initialized where it previously wedged loudly.
- A directory at an object path reads as NotFound, not as an IO error
  ('only objects exist'). The cluster unreadable-payload test used a
  same-named directory as a portable non-NotFound trigger; it now uses
  chmod 000, which still models genuine transient IO.
- write_text_if_match keeps content-token semantics on local
  (PutMode::Update is NotImplemented upstream for LocalFileSystem in
  0.12.5 and 0.13.2); the capability flag gates the token SOURCE in
  read_text_versioned too -- an ETag token with content-compare writes
  would lose every CAS.

delete_prefix keeps a local remove_dir_all branch: directories are a
local-FS concept, and list+delete would leave empty skeletons that
cluster graph_root_exists (raw Path::exists) reports as still present.

LocalStorageAdapter remains as a delegating shim so the pinned
contract tests gate this swap textually unchanged; the shim and the
test parameterization over local + in-memory land next. Cargo gains
the explicit 'fs' feature (already transitively enabled by lance).

* test(engine): one executable storage contract, run against every backend

Remove the LocalStorageAdapter delegation shim and migrate its
construction sites to ObjectStorageAdapter::local(). Replace the
per-backend duplicated tests with a single contract_suite asserting
the trait's promises (atomic replace, exists incl. the dataset-root
prefix probe, one-winner if_absent, versioned CAS with loud CAS-lost,
rename, list round-trip with no sibling-prefix bleed, idempotent
delete/delete_prefix), run against the local backend and the new
in-memory backend -- which implements true conditional updates, so the
strong-CAS path is exercised without a bucket. The bucket-gated S3
variant already exists (s3_adapter_conditional_writes_contract).

New local-specific pins for the deliberate semantic edges of the
collapse: empty directories are not objects (exists=false; the Lance
dataset-root probe shape is the non-empty case), file://-anchored and
spaces-in-path list output round-trips byte-identically into
read_text, dot-segment paths are lexically absolutized (the CLI's
./graph.omni shape), and upstream rename creating missing destination
parents. The acknowledged-write visibility regression test stays, now
documenting that the cross-API std::fs read-back is the point.

* refactor(cluster): drop put_json's per-backend atomicity branch

The local temp+rename dance predates the storage adapter guaranteeing
atomic visibility; now that write_text publishes via a staged temp +
rename on the filesystem (and a single atomic PUT on object stores) by
contract, the branch duplicated upstream behavior. One call, both
backends.

* docs: storage adapter collapse — contract, in-memory backend, local CAS gap

- testing.md: the 'no MemStorage backend' note is half-closed —
  ObjectStorageAdapter::in_memory() covers the text-object layer with
  the full contract (true conditional updates); Lance datasets bypass
  the adapter, so the engine substrate ask stays open.
- invariants.md: truth-matrix Tests row updated; new Known Gap for
  local write_text_if_match (upstream PutMode::Update is unimplemented
  for LocalFileSystem; content-token emulation is safe only under the
  cluster lock protocol — close before admitting a lock-free caller).
- writes.md: backend notes for the unified adapter (name#N staging
  residue invisible to the sweep, backend-wrapped error text with
  exists()-probing for missing-vs-error, loud permission failures).

* docs: finish renaming the storage adapters in user docs and test comments

storage.md's URI-scheme table and the S3 failpoint test's doc comment
still named the deleted LocalStorageAdapter/S3StorageAdapter; both now
describe the unified ObjectStorageAdapter over object_store, including
the relative-path absolutization note for local URIs.

* test(engine): pin branch-awareness of the drift guard's recovery advice

A pending sidecar on ANOTHER branch does not cover this branch's
drift: with a deferred feature-branch sidecar on disk and genuinely
uncovered drift on main, the main write's error must still point at
omnigraph repair -- a read-write reopen recovers the sidecar but
cannot repair main's uncovered drift. Currently red: the guard
matches sidecar pins by table_key only, so the feature sidecar flips
main's advice to the reopen path. Fix in the next commit.

Surfaced by external review of the drift-guard change.

* fix(engine): branch-aware sidecar matching in the drift guard's advice

The commit-time drift guard's sidecar-covered check matched pins by
table_key alone, so a pending sidecar on another branch flipped this
branch's uncovered-drift advice from 'run omnigraph repair' to the
reopen path -- and a reopen recovers that sidecar but cannot repair
this branch's drift. Compare the pin's table_branch too. Turns the
previous commit's red test green.

Surfaced by external review of the drift-guard change.

* test(engine): pin heal non-interference with a live schema apply

The write-entry heal's schema-staging reconcile runs before any queue
acquisition, so a load on the same handle, overlapping a schema apply
parked between its staging write and manifest commit, promotes the
apply's staging files (new catalog live against the old manifest),
classifies the LIVE apply's sidecar, and publishes its registrations
out from under it. The resumed apply then collides with its own stolen
commit. Currently red with:

  Lance("Concurrent modification: table version 3 already exists for
  node:Tag")

The fix (per-sidecar reconcile under the sidecar's write-queue guards,
plus a serialization key the schema-apply writer and the heal both
acquire) lands in the next commit.

Surfaced by external review of the write-entry heal.

* fix(engine): serialize the heal's schema-staging reconcile with live schema applies

The write-entry heal ran recover_schema_state_files up front, before
acquiring any queue guards. Overlapping a live schema apply parked
between its staging write and manifest commit, the heal promoted the
apply's staging files (new catalog live against the old manifest),
classified the LIVE apply's sidecar, and published its registrations —
the resumed apply then collided with its own stolen commit.

Correct by construction:

- New schema-apply serialization queue key, acquired by the schema-
  apply writer (alongside its per-table keys) from before write_sidecar
  until after delete_sidecar. Per-table keys alone don't cover a
  registration-only migration, which pins no existing tables but has a
  sidecar and staging files on disk.
- The heal reconciles schema staging lazily, PER SchemaApply sidecar,
  after acquiring that sidecar's guards (including the serialization
  key) and re-confirming the sidecar exists — a sidecar that survives
  the queue wait belongs to a dead writer, so the reconcile can no
  longer race a live apply. Recomputing per sidecar also removes the
  staleness of one up-front result across a multi-sidecar pass.
- Omnigraph::refresh drops its up-front reconcile-and-pass-through
  (same race, and a pre-promoted result would make the heal's guarded
  reconcile see clean staging and wrongly defer the sidecar): it now
  reconciles standalone only when NO sidecar exists — which cannot
  race a live apply, whose sidecar always precedes its staging files —
  and otherwise defers entirely to the heal.

The open-time sweep keeps its precomputed reconcile: open has no
concurrent writers. Turns the previous commit's red test green.

Surfaced by external review of the write-entry heal.

Self-audit addendum folded in: refresh's no-sidecar gate had a TOCTOU
(a live apply could write its sidecar + staging between the empty
check and the reconcile) — the standalone reconcile now holds the
serialization key across the list-then-reconcile pair. The remaining
residual is cross-process only (in-process queues cannot serialize
against a writer in another process; the open-time sweep has the same
pre-existing exposure) and is now an explicit Known Gap in
invariants.md rather than an implicit one.

* test(engine): pin catalog reload after the heal recovers a schema apply

When the write-entry heal rolls a crashed apply's SchemaApply sidecar
forward on the same handle, disk and manifest move to the new schema
(staging promoted, registrations published) but the handle's in-memory
schema_source/catalog do not. Subsequent writes then validate against
the stale catalog and reject rows of types the graph already has.
Currently red with:

  record 1: unknown node type 'Tag'

refresh() reloads after its heal; the write entry points must too.
Fix in the next commit.

Surfaced by external review of the write-entry heal.

* fix(engine): reload the in-memory catalog after the heal recovers a schema apply

heal_pending_recovery_sidecars refreshed the coordinator and
invalidated the runtime cache after processing sidecars, but never
reloaded schema_source/catalog — so a write whose entry heal rolled a
crashed SchemaApply sidecar forward proceeded to validate against the
OLD schema while disk and manifest were already on the new one.
reload_schema_if_source_changed is the same post-heal step refresh()
already runs; it no-ops on the (overwhelmingly common) non-schema heal
because the on-disk source is unchanged. Turns the previous commit's
red test green.

Surfaced by external review of the write-entry heal.

* test(engine): pin that a deleted-branch sidecar cannot wedge the graph

A rollback-eligible sidecar pinned to a branch is deferred by every
roll-forward-only pass; if the branch is then deleted, the sidecar
survives, referencing a branch with no manifest tree. The heal (every
write entry) and the open-time sweep (every ReadWrite open) both fail
opening the dead branch, and repair refuses while a sidecar is pending
-- a terminal read-only state with manual sidecar surgery as the only
exit. Currently red with:

  Lance("Not found: .../__manifest/tree/feature/_versions")

The branch's tree and forks are already reclaimed, so the pinned drift
is unreachable and the sidecar is provably moot; the fix classifies it
as an orphaned-branch terminal state (audit + discard) in both passes.

Surfaced by review (P1, verified by repro).

* fix(engine): classify deleted-branch sidecars as orphaned instead of wedging

A deferred (rollback-eligible) sidecar pinned to a branch survives
branch_delete; both the write-entry heal and the open-time sweep then
failed unconditionally opening the dead branch -- every write and
every ReadWrite open errored, and repair refuses while a sidecar
pends. Terminal state, manual sidecar surgery the only exit.

The branch's tree and per-table forks are already reclaimed at delete,
so the drift the sidecar pins is unreachable and the sidecar is
provably moot. Both passes now check the sidecar's branch against the
manifest's branch list (the authority -- deliberately NOT inferred
from a Not-found on open, which could be a transient storage error
masking real recovery intent) and discard orphans with an
OrphanedBranchDiscarded audit row, commit appended on main since the
sidecar's own branch no longer has a commit graph.

The open-time half is pre-existing; the write-entry heal made it hot.
Turns the previous commit's red test green.

Surfaced by review (P1, verified by repro).

* chore: harden review nits — vacuous CI filter, root-runner skip, liveness note

- ci.yml: the RustFS sidecar-lifecycle step now fails loudly if the
  's3_' name filter matches zero tests (cargo passes vacuously on an
  empty filter; the step exists specifically to prove S3 sidecar I/O
  coverage). The pre-existing CLI smoke step has the same shape and is
  left for a follow-up.
- cluster unreadable-payload test: cfg(unix) + a skip-with-log when
  running as root (mode 000 is still readable to root, common in
  container dev runners), so the test degrades instead of failing.
- refresh: document the one-pass-late convergence for legacy staging
  residue while non-SchemaApply sidecars pend, so nobody 'fixes' it by
  re-running the reconcile unserialized — the exact race the
  serialization key closes.

* test(engine): pin orphan-discard idempotency across a delete fault

discard_orphaned_branch_sidecar writes its audit row and main commit
before deleting the sidecar; a Phase D delete fault leaves the sidecar
on disk with the audit already durable, and the retry repeated the
whole path -- a second OrphanedBranchDiscarded audit row (and commit)
for the same operation. Currently red: 2 rows after one fault + retry.
The retry must only finish the delete. Fix next.

Also promotes the recovery-audit kinds reader into the shared test
helpers (it was recovery.rs-local).

Surfaced by external review of the orphan-discard fix.

* fix(engine): orphan-discard idempotency + heal reports acted-vs-deferred

Two review findings on the recovery surface:

- discard_orphaned_branch_sidecar now checks the audit table for an
  existing (operation_id, OrphanedBranchDiscarded) row before appending
  the commit + audit pair, so a Phase D delete fault retries ONLY the
  delete instead of duplicating audit rows and commit-graph entries.
  Cold path: the list scan runs only when an orphaned sidecar exists.
  Turns the previous commit's red test green (exactly one audit row
  across fault + retry).

- process_sidecar returns whether durable state changed; the heal sets
  processed_any only for sidecars that were actually rolled forward /
  rolled back / audit-recovered (orphan discards count). Deferred
  sidecars (rollback-eligible, invariant-violating, unpromoted
  SchemaApply) no longer trigger a per-write schema reload + full
  runtime-cache invalidation while they pend -- the cache is
  snapshot-keyed so this was waste, not corruption, but it was paid on
  every write until reopen. Acted-paths' processed=true remains pinned
  by load_after_schema_apply_phase_b_failure_uses_recovered_catalog
  (the reload depends on it).

Surfaced by external review.

* test(engine): pin the orphan-discard audit-append fault leg as documented tolerance

The orphan discard's commit append and audit append are two writes; a
failure between them leaves a recovery commit with no audit row, and
the retry (keyed on the audit row, the operator-facing record) appends
a second commit before the audit lands. This is the same
not-atomic-pair-write tolerance record_audit documents and the
manifest->commit-graph Known Gap covers for every publish: bounded
commit-graph noise, audit row exactly-once under clean failures.
Keying idempotency on commit rows instead would need an operation_id
column on _graph_commits, and audit-before-commit would dangle the
graph_commit_id join -- both worse than the documented residual.

Make the tolerance explicit instead of implicit: docstring names the
window, a failpoint sits inside it, and the new test pins convergence
across the fault (sidecar consumed, exactly one audit row), completing
the orphan-discard fault matrix alongside the delete-fault leg.

Surfaced by external review of the orphan-discard idempotency.

* test(engine): pin honest drift-guard advice when sidecar listing fails

The guard's unwrap_or(false) conflated 'classified as uncovered' with
'could not classify': a transient list fault on the guard's second
list (the entry heal's first list having succeeded) confidently routed
the operator to omnigraph repair even when the heal had just deferred
a rollback-eligible sidecar -- and repair refuses while a sidecar is
pending. Currently red: the error says 'run omnigraph repair' with no
mention of the reopen path. The fix names both paths plus the failure
cause when classification is impossible.

Surfaced by external review of the drift-guard fallback.

* fix(engine): admit ambiguity in the drift guard when sidecar listing fails

Replace the unwrap_or(false) fallback with a tri-state: covered ->
reopen advice; uncovered -> repair advice; listing FAILED -> say the
drift could not be classified, name the cause, and give both paths in
order ('run repair, or reopen read-write if repair reports a pending
sidecar'). The old fallback confidently routed a transient list fault
to repair, which refuses while a sidecar is pending -- a self-
correcting but pointless detour. The conflict itself is still always
raised; only the advice degrades honestly. Turns the previous commit's
red test green.

Surfaced by external review of the drift-guard fallback.
											
										
										
											2026-06-13 11:20:08 +02:00
+								> Note: the storage adapter has an in-memory backend (`ObjectStorageAdapter::in_memory()`, full contract including true conditional updates) used by the adapter contract tests in `storage.rs`. It covers only the text-object layer (sidecars, schema staging, cluster state) — **Lance datasets bypass the adapter**, so engine integration tests still use `tempfile::tempdir()`. An in-memory Lance substrate remains an architectural ask — keep it explicit in [docs/dev/invariants.md](invariants.md) under known gaps.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
 								## Failpoints (fault injection)
-												docs(cluster): record Stage 3B failpoint + verification coverage

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-10 02:15:13 +03:00
+								- Cargo feature: `failpoints = ["dep:fail", "fail/failpoints"]` (in `crates/omnigraph/Cargo.toml` **and** `crates/omnigraph-cluster/Cargo.toml`; the cluster feature does not enable the engine's).
 								- Wrappers: `crates/omnigraph/src/failpoints.rs` and `crates/omnigraph-cluster/src/failpoints.rs` expose `maybe_fail("name")` and `ScopedFailPoint` for tests.
 								- Call sites are inserted at sensitive transaction boundaries (branch create, graph publish commit, cluster apply's payload→state-write window, etc.).
 								- Activated tests: `crates/omnigraph/tests/failpoints.rs` and `crates/omnigraph-cluster/tests/failpoints.rs` (crash-mid-apply + state CAS race via `fail::cfg_callback`; integration binaries, never in-source — the fail registry is process-global). Run with `cargo test -p omnigraph-engine --features failpoints --test failpoints` / `cargo test -p omnigraph-cluster --features failpoints --test failpoints`.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
 								## RustFS / S3 integration
 								CI runs three S3-backed tests against a containerized RustFS server (`.github/workflows/ci.yml` → `rustfs_integration` job):
 								- `cargo test -p omnigraph-engine --test s3_storage`
-												test(cluster,server): gated object-storage cluster e2e + CI wiring + docs

s3_cluster.rs runs the full control-plane lifecycle against a real
bucket (CI: containerized RustFS; locally the RustFS binary): import →
lock released (pins the drop-time release regression caught on the first
live smoke) → apply (graph roots + catalog on the bucket, nothing local)
→ serving snapshots from both the config dir and the bare URI → schema
evolution → approved delete (prefix removal) → empty-cluster refusal.
The server suite gains the config-free boot test: --cluster s3://… with
zero local files serves a stored query over HTTP.

CI: the rustfs job runs both suites; the classify filter covers the
cluster store/serve modules and the new test files. The server smoke
drops its name filter — every test in the s3 target is bucket-gated, and
a filter matching nothing passes vacuously (which silently ran zero
tests for a while).

Docs: deployment.md gains the Bucket-no-volume shape as the preferred
cloud deployment; cluster.md/server.md document --cluster <uri>;
testing.md maps the new suite.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-11 15:56:40 +03:00
+								- `cargo test -p omnigraph-server --test s3` (single-graph serving + config-free `--cluster s3://` boot)
 								- `cargo test -p omnigraph-cluster --test s3_cluster` (full control-plane lifecycle on the bucket)
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								- `cargo test -p omnigraph-cli --test system_local local_cli_s3_end_to_end_init_load_read_flow`
-												Recovery liveness, storage fault-injection matrix, and one storage implementation over object_store (#203)

* test(engine): pin the long-lived-handle heal contract for sidecar-covered drift

A Phase B -> Phase C failure (commit_staged advanced Lance HEAD, manifest
publish did not land, recovery sidecar persists) currently wedges every
subsequent staged write on the same engine handle: the commit-time drift
guard rejects with 'run omnigraph repair', but repair itself refuses
while a recovery sidecar is pending, so a long-lived server can only
recover by restart. The documented contract (writes.md 'Long-running
servers', invariants.md invariant 5) says refresh-time roll-forward
closes this residual without restart -- but no write path runs it.

Two red tests pin the intended contract at the write entry points:
a follow-up load (the POST /ingest shape: shared handle, no reopen)
and a follow-up mutation must heal roll-forward-eligible sidecars
in-process and then succeed.

Currently failing with:
  table 'node:Company' has Lance HEAD version 2 ahead of manifest
  version 1; run `omnigraph repair` before writing

The fix lands in the next commit.

* fix(engine): heal pending recovery sidecars at the staged-write entry points

Close the long-lived-process gap in the recovery protocol: a Phase B ->
Phase C residual (per-table commit_staged landed, manifest publish did
not, sidecar persists) previously recovered only at the next ReadWrite
open or via an explicit refresh() that no production write path called,
so a long-lived server wedged every subsequent write on the commit-time
drift guard until restart.

New recovery::heal_pending_sidecars_roll_forward:
- one list_dir of __recovery/ at write entry (empty -> immediate
  return, the steady state), so the per-write cost is one storage list;
- per sidecar, acquires the same per-(table_key, table_branch) write
  queues every sidecar writer holds from before write_sidecar until
  after delete_sidecar, then re-checks sidecar existence -- this
  serializes the heal against live writers instead of rolling an
  in-flight sidecar forward from under its writer (which would fail
  that writer's publish CAS spuriously). Lock order queues ->
  coordinator matches every writer's commit->publish path. This is the
  queue-acquisition design recovery.rs and write_queue.rs already
  documented for in-process recovery;
- processes in RollForwardOnly mode: the common residual rolls forward
  in-process; rollback-eligible sidecars still defer to the next
  ReadWrite open (Dataset::restore is unsafe under concurrency).

Wire it into load_as and mutate_as (before the inline delete path can
advance any HEAD), and rebase Omnigraph::refresh onto the same helper
so refresh stops racing live writers' sidecars.

The maintenance entry points (apply_schema_as, branch_merge_as,
ensure_indices) intentionally keep their strict fail-loud preconditions
for now; wiring the same heal there is a follow-up with its own tests.

Turns the previous commit's two red tests green.

* fix(engine): name the right recovery path in the commit-time drift guard

The drift guard's 'run omnigraph repair before writing' advice is a
dead end when the drift is covered by a pending recovery sidecar:
repair refuses while a sidecar is pending. With the write-entry heal in
place, reaching this guard with sidecar-covered drift means the heal
deferred it (rollback-eligible), and the actual recovery path is a
read-write reopen. Distinguish the two classes on the error path only
(one sidecar list, after the conflict is already certain); a listing
failure falls back to the uncovered-drift wording rather than masking
the conflict.

Pinned by extending refresh_defers_rollback_eligible_sidecar_to_next_open
with a write attempt against the deferred sidecar.

* docs: write-entry in-process sidecar heal — contract and coverage

Update the recovery contract docs to match the previous two commits:
invariant 5 now states that the staged-write entry points and refresh
run in-process roll-forward recovery (long-lived processes converge on
the next write, not at restart); writes.md 'Long-running servers'
describes the heal's queue-acquisition concurrency contract, the
improved drift-guard error, and the entry points that intentionally do
not heal yet; testing.md indexes the new failpoint tests; AGENTS.md
capability matrix drops the claim that in-process recovery is entirely
future work (only the rollback path remains with the background
reconciler).

* test(engine): pin the entry heal contract for schema apply and branch merge

Without the write-entry heal, the two maintenance writers do worse than
wedge on sidecar-covered drift -- they proceed and decide its fate
implicitly:

- schema apply re-plans table rewrites from the manifest pin, orphaning
  the drifted Phase-B commit (its rows silently vanish from the
  rewritten table) while the stale sidecar lingers to misclassify
  against the post-apply pins;
- branch merge publishes over the drift, making the failed writer's
  commit visible as an unattributed side effect (no recovery audit
  row), and leaves the stale sidecar behind.

Two red tests pin the intended contract: both entry points heal the
sidecar first (attributed roll-forward), then run on the converged
state. Currently failing on the stale-sidecar / dropped-rows
assertions; the fix lands in the next commit.

* fix(engine): heal pending recovery sidecars at the schema-apply and branch-merge entries

Extend the write-entry heal to the remaining two write entry points.
Unlike load/mutate (which wedge on the drift guard), these proceeded
over sidecar-covered drift and decided its fate implicitly:

- schema apply re-planned table rewrites from the manifest pin,
  orphaning the drifted Phase-B commit -- its rows silently vanished
  from the rewritten table -- while the stale sidecar lingered to
  misclassify against the post-apply pins;
- branch merge published over the drift, making the failed writer's
  commit visible without a recovery audit row, and left the stale
  sidecar behind.

Both now run the same queue-serialized roll-forward heal at entry,
before their own sidecar exists, so recovery is attributed (audit row)
and deterministic. ensure_indices stays heal-free: it runs inside the
load / schema-apply flows after their entry heal.

Turns the previous commit's two red tests green. Docs updated in the
same change (invariant 5, writes.md, testing.md, AGENTS.md).

* test(engine): pin Phase A sidecar-write failure semantics

Storage fault-injection matrix, row 1: a sidecar PUT failure (S3
PutObject / fs write) in Phase A. New failpoint recovery.sidecar_write
at the top of write_sidecar -- the single choke point all five sidecar
writers go through -- models the storage error backend-generically.

Also adds the other three storage-fault failpoints used by the
following commits (recovery.sidecar_delete, recovery.sidecar_list,
recovery.record_audit); each is a no-op without the failpoints feature.

Pinned contract: every writer writes its sidecar BEFORE its first
HEAD-advancing commit, so a put failure aborts with zero drift (no
sidecar, Lance HEAD == manifest pin, no rows) and a transient fault
never wedges the graph -- the same handle writes/merges normally once
it clears. Covered for load (the staging writer) and branch_merge (the
multi-table writer, forced onto the RewriteMerged path by diverging
both sides).

* test(engine): pin Phase D delete, list, and audit-append storage-fault semantics

Storage fault-injection matrix, rows 2/3/5, plus the real-backend run:

- recovery.sidecar_delete: a Phase D delete failure (S3 DeleteObject)
  must NOT fail the user's write -- the manifest publish already
  landed, so the caller's data is durable. The swallowed failure
  leaves a stale sidecar; the next write's entry heal consumes it via
  the stale-sidecar audit-recovery path (RolledForward, attributed).

- recovery.sidecar_list: a __recovery/ list failure (S3 ListObjectsV2)
  is loud at every consumer -- the write-entry heal fails the write
  and the open-time sweep fails the open. Silently skipping recovery
  over a pending sidecar would be consumer tolerance of drift. Once
  the fault clears, open recovers the pending sidecar normally.

- recovery.record_audit: an audit write failure after the
  roll-forward's manifest publish aborts that recovery attempt and
  keeps the sidecar; re-entry detects the already-published manifest,
  records exactly ONE RolledForward audit row, and converges -- the
  retry tolerance documented on record_audit, exercised end-to-end.

- s3_load_recovers_after_publisher_failure_without_reopen: the
  same-handle heal scenario on a real bucket (gated on
  OMNIGRAPH_S3_TEST_BUCKET, skips locally), exercising sidecar
  put/list/delete through S3StorageAdapter instead of the local-FS
  adapter. CI wiring lands in a follow-up commit.

* test(engine): refuse corrupt recovery sidecars loudly

Storage fault-injection matrix, row 4 (no failpoint needed -- the
corrupt file is written by hand, sibling to the unknown-schema-version
refusal test): a truncated/garbage __recovery/{ulid}.json must be
refused loudly by both the write-entry heal (the write fails naming
the parse error) and the open-time sweep (ReadWrite open fails naming
the file), with the file left on disk for operator inspection.
Read-only opens still work -- the sweep is skipped there.

* test(engine): run the S3 sidecar-lifecycle coverage in CI + document the fault matrix

- ci.yml rustfs_integration: new step running the bucket-gated
  failpoints tests (name filter s3_) against the RustFS container, so
  sidecar put/list/delete are exercised through S3StorageAdapter on
  every storage-affecting PR.
- writes.md: sidecar I/O failure semantics -- Phase A put failure
  aborts with zero drift; Phase D delete failure is swallowed (write
  already durable) and healed by the next write; list failures are
  loud at heal and open; corrupt sidecars are refused with the file
  kept for inspection; audit-append failures are retried to exactly
  one audit row.
- testing.md: index the storage-fault matrix in the failpoints.rs row
  and the new RustFS CI line.

* test(engine): pin read-visibility of acknowledged local if-absent writes

The cluster lib test import_missing_state_creates_state_with_graph_-
observation flakes at ~50% under full-workspace load ('EOF while
parsing a value' reading back the state.json its own import just
acknowledged). Root cause is in the engine's local storage adapter:
write_text_if_absent writes through a buffered tokio::fs::File and
returns when write_all resolves -- which, per tokio's documented File
semantics, means the bytes reached tokio's internal buffer, not the
file. The actual write completes in a background blocking task after
drop, so a caller that acknowledges success and reads the object back
can see an empty or partial file. Under load the window widens; the
red run fails at iteration 0 with 0 of 8192 bytes on disk.

The regression test pins the contract at the adapter boundary: when
write_text_if_absent resolves, the full contents are visible to any
reader; a losing second claim leaves the winner's object untouched.

The fix lands in the next commit.

* fix(engine): publish local storage writes with atomic visibility

Close the class, not the instance. The local adapter admitted three
ways for a reader to observe a write that was acknowledged or visible
before its bytes were complete:

1. write_text_if_absent acknowledged success when the buffered
   tokio::fs::File write_all resolved -- i.e. when the bytes reached
   tokio's internal buffer, not the file. A caller reading back its own
   acknowledged write could see an empty object (the ~50% cluster
   import flake under full-workspace load; the regression test failed
   at iteration 0 with 0 of 8192 bytes visible).
2. The same call published its CLAIM (create_new) before its CONTENT,
   so concurrent readers saw an empty claimed file in the window.
3. write_text (plain tokio::fs::write) exposed truncated content
   mid-replace -- silently falsifying write_sidecar's 'readers either
   see the complete sidecar or none' contract on local FS (true on S3,
   where PutObject is atomic).

A flush in write_text_if_absent would have fixed only (1). Instead,
both local write paths now publish complete temp files atomically:
rename for replace (write_text -- the idiom write_text_if_match
already used) and hard_link for no-replace (write_text_if_absent --
link fails AlreadyExists, so exactly one of N concurrent claimants
wins and the winner's object is fully readable at the instant it
becomes visible). The local adapter now honors the same object-level
atomic-visibility contract as the S3 adapter, which is what every
caller (recovery sidecar protocol, cluster state CAS) was written
against. Crash-orphaned *.tmp.* files are inert: the sidecar sweep
filters to .json, and cluster state reads address state.json by name.

fsync/durability policy is unchanged (no fsync before, none now);
this fix is about visibility ordering, not power-loss durability.

Pre-existing on main (landed with the multi-graph server mode change,
PR #119); surfaced by this branch's heal work only because one extra
list_dir per write shifted test timing. Cluster lib suite: 12/25
failures before, 0/25 after. Turns the previous commit's red test
green.

* refactor(engine): one storage implementation over object_store for every backend

Collapse LocalStorageAdapter (hand-rolled tokio::fs) and
S3StorageAdapter into a single ObjectStorageAdapter backed by
Arc<dyn object_store::ObjectStore> -- LocalFileSystem for local URIs,
the existing AmazonS3 build for s3://, plus a pub in_memory()
constructor (full contract including TRUE conditional updates; the
in-memory test backend testing.md asked for at the adapter level).

Why: the acknowledged-before-visible bug showed the two-impl shape has
no referee -- one prose contract, two independent answers. Upstream
LocalFileSystem::put_opts is byte-for-byte the staged-temp+rename/
hard_link idiom that fix converged on, and Lance's own commit protocol
is built on the same primitives (put-if-not-exists / rename-if-not-
exists), so the substrate-aligned move is to stop hand-rolling it.
The per-backend residue shrinks to a UriCodec (URI <-> object path)
and one capability flag.

Semantics preserved by construction, with three deliberate deltas:
- exists() is now object-store-semantics everywhere (head + non-empty
  prefix fallback): an EMPTY local directory no longer 'exists'. The
  only dir-shaped caller (_graph_commits.lance probes) self-heals via
  ensure_commit_graph_initialized where it previously wedged loudly.
- A directory at an object path reads as NotFound, not as an IO error
  ('only objects exist'). The cluster unreadable-payload test used a
  same-named directory as a portable non-NotFound trigger; it now uses
  chmod 000, which still models genuine transient IO.
- write_text_if_match keeps content-token semantics on local
  (PutMode::Update is NotImplemented upstream for LocalFileSystem in
  0.12.5 and 0.13.2); the capability flag gates the token SOURCE in
  read_text_versioned too -- an ETag token with content-compare writes
  would lose every CAS.

delete_prefix keeps a local remove_dir_all branch: directories are a
local-FS concept, and list+delete would leave empty skeletons that
cluster graph_root_exists (raw Path::exists) reports as still present.

LocalStorageAdapter remains as a delegating shim so the pinned
contract tests gate this swap textually unchanged; the shim and the
test parameterization over local + in-memory land next. Cargo gains
the explicit 'fs' feature (already transitively enabled by lance).

* test(engine): one executable storage contract, run against every backend

Remove the LocalStorageAdapter delegation shim and migrate its
construction sites to ObjectStorageAdapter::local(). Replace the
per-backend duplicated tests with a single contract_suite asserting
the trait's promises (atomic replace, exists incl. the dataset-root
prefix probe, one-winner if_absent, versioned CAS with loud CAS-lost,
rename, list round-trip with no sibling-prefix bleed, idempotent
delete/delete_prefix), run against the local backend and the new
in-memory backend -- which implements true conditional updates, so the
strong-CAS path is exercised without a bucket. The bucket-gated S3
variant already exists (s3_adapter_conditional_writes_contract).

New local-specific pins for the deliberate semantic edges of the
collapse: empty directories are not objects (exists=false; the Lance
dataset-root probe shape is the non-empty case), file://-anchored and
spaces-in-path list output round-trips byte-identically into
read_text, dot-segment paths are lexically absolutized (the CLI's
./graph.omni shape), and upstream rename creating missing destination
parents. The acknowledged-write visibility regression test stays, now
documenting that the cross-API std::fs read-back is the point.

* refactor(cluster): drop put_json's per-backend atomicity branch

The local temp+rename dance predates the storage adapter guaranteeing
atomic visibility; now that write_text publishes via a staged temp +
rename on the filesystem (and a single atomic PUT on object stores) by
contract, the branch duplicated upstream behavior. One call, both
backends.

* docs: storage adapter collapse — contract, in-memory backend, local CAS gap

- testing.md: the 'no MemStorage backend' note is half-closed —
  ObjectStorageAdapter::in_memory() covers the text-object layer with
  the full contract (true conditional updates); Lance datasets bypass
  the adapter, so the engine substrate ask stays open.
- invariants.md: truth-matrix Tests row updated; new Known Gap for
  local write_text_if_match (upstream PutMode::Update is unimplemented
  for LocalFileSystem; content-token emulation is safe only under the
  cluster lock protocol — close before admitting a lock-free caller).
- writes.md: backend notes for the unified adapter (name#N staging
  residue invisible to the sweep, backend-wrapped error text with
  exists()-probing for missing-vs-error, loud permission failures).

* docs: finish renaming the storage adapters in user docs and test comments

storage.md's URI-scheme table and the S3 failpoint test's doc comment
still named the deleted LocalStorageAdapter/S3StorageAdapter; both now
describe the unified ObjectStorageAdapter over object_store, including
the relative-path absolutization note for local URIs.

* test(engine): pin branch-awareness of the drift guard's recovery advice

A pending sidecar on ANOTHER branch does not cover this branch's
drift: with a deferred feature-branch sidecar on disk and genuinely
uncovered drift on main, the main write's error must still point at
omnigraph repair -- a read-write reopen recovers the sidecar but
cannot repair main's uncovered drift. Currently red: the guard
matches sidecar pins by table_key only, so the feature sidecar flips
main's advice to the reopen path. Fix in the next commit.

Surfaced by external review of the drift-guard change.

* fix(engine): branch-aware sidecar matching in the drift guard's advice

The commit-time drift guard's sidecar-covered check matched pins by
table_key alone, so a pending sidecar on another branch flipped this
branch's uncovered-drift advice from 'run omnigraph repair' to the
reopen path -- and a reopen recovers that sidecar but cannot repair
this branch's drift. Compare the pin's table_branch too. Turns the
previous commit's red test green.

Surfaced by external review of the drift-guard change.

* test(engine): pin heal non-interference with a live schema apply

The write-entry heal's schema-staging reconcile runs before any queue
acquisition, so a load on the same handle, overlapping a schema apply
parked between its staging write and manifest commit, promotes the
apply's staging files (new catalog live against the old manifest),
classifies the LIVE apply's sidecar, and publishes its registrations
out from under it. The resumed apply then collides with its own stolen
commit. Currently red with:

  Lance("Concurrent modification: table version 3 already exists for
  node:Tag")

The fix (per-sidecar reconcile under the sidecar's write-queue guards,
plus a serialization key the schema-apply writer and the heal both
acquire) lands in the next commit.

Surfaced by external review of the write-entry heal.

* fix(engine): serialize the heal's schema-staging reconcile with live schema applies

The write-entry heal ran recover_schema_state_files up front, before
acquiring any queue guards. Overlapping a live schema apply parked
between its staging write and manifest commit, the heal promoted the
apply's staging files (new catalog live against the old manifest),
classified the LIVE apply's sidecar, and published its registrations —
the resumed apply then collided with its own stolen commit.

Correct by construction:

- New schema-apply serialization queue key, acquired by the schema-
  apply writer (alongside its per-table keys) from before write_sidecar
  until after delete_sidecar. Per-table keys alone don't cover a
  registration-only migration, which pins no existing tables but has a
  sidecar and staging files on disk.
- The heal reconciles schema staging lazily, PER SchemaApply sidecar,
  after acquiring that sidecar's guards (including the serialization
  key) and re-confirming the sidecar exists — a sidecar that survives
  the queue wait belongs to a dead writer, so the reconcile can no
  longer race a live apply. Recomputing per sidecar also removes the
  staleness of one up-front result across a multi-sidecar pass.
- Omnigraph::refresh drops its up-front reconcile-and-pass-through
  (same race, and a pre-promoted result would make the heal's guarded
  reconcile see clean staging and wrongly defer the sidecar): it now
  reconciles standalone only when NO sidecar exists — which cannot
  race a live apply, whose sidecar always precedes its staging files —
  and otherwise defers entirely to the heal.

The open-time sweep keeps its precomputed reconcile: open has no
concurrent writers. Turns the previous commit's red test green.

Surfaced by external review of the write-entry heal.

Self-audit addendum folded in: refresh's no-sidecar gate had a TOCTOU
(a live apply could write its sidecar + staging between the empty
check and the reconcile) — the standalone reconcile now holds the
serialization key across the list-then-reconcile pair. The remaining
residual is cross-process only (in-process queues cannot serialize
against a writer in another process; the open-time sweep has the same
pre-existing exposure) and is now an explicit Known Gap in
invariants.md rather than an implicit one.

* test(engine): pin catalog reload after the heal recovers a schema apply

When the write-entry heal rolls a crashed apply's SchemaApply sidecar
forward on the same handle, disk and manifest move to the new schema
(staging promoted, registrations published) but the handle's in-memory
schema_source/catalog do not. Subsequent writes then validate against
the stale catalog and reject rows of types the graph already has.
Currently red with:

  record 1: unknown node type 'Tag'

refresh() reloads after its heal; the write entry points must too.
Fix in the next commit.

Surfaced by external review of the write-entry heal.

* fix(engine): reload the in-memory catalog after the heal recovers a schema apply

heal_pending_recovery_sidecars refreshed the coordinator and
invalidated the runtime cache after processing sidecars, but never
reloaded schema_source/catalog — so a write whose entry heal rolled a
crashed SchemaApply sidecar forward proceeded to validate against the
OLD schema while disk and manifest were already on the new one.
reload_schema_if_source_changed is the same post-heal step refresh()
already runs; it no-ops on the (overwhelmingly common) non-schema heal
because the on-disk source is unchanged. Turns the previous commit's
red test green.

Surfaced by external review of the write-entry heal.

* test(engine): pin that a deleted-branch sidecar cannot wedge the graph

A rollback-eligible sidecar pinned to a branch is deferred by every
roll-forward-only pass; if the branch is then deleted, the sidecar
survives, referencing a branch with no manifest tree. The heal (every
write entry) and the open-time sweep (every ReadWrite open) both fail
opening the dead branch, and repair refuses while a sidecar is pending
-- a terminal read-only state with manual sidecar surgery as the only
exit. Currently red with:

  Lance("Not found: .../__manifest/tree/feature/_versions")

The branch's tree and forks are already reclaimed, so the pinned drift
is unreachable and the sidecar is provably moot; the fix classifies it
as an orphaned-branch terminal state (audit + discard) in both passes.

Surfaced by review (P1, verified by repro).

* fix(engine): classify deleted-branch sidecars as orphaned instead of wedging

A deferred (rollback-eligible) sidecar pinned to a branch survives
branch_delete; both the write-entry heal and the open-time sweep then
failed unconditionally opening the dead branch -- every write and
every ReadWrite open errored, and repair refuses while a sidecar
pends. Terminal state, manual sidecar surgery the only exit.

The branch's tree and per-table forks are already reclaimed at delete,
so the drift the sidecar pins is unreachable and the sidecar is
provably moot. Both passes now check the sidecar's branch against the
manifest's branch list (the authority -- deliberately NOT inferred
from a Not-found on open, which could be a transient storage error
masking real recovery intent) and discard orphans with an
OrphanedBranchDiscarded audit row, commit appended on main since the
sidecar's own branch no longer has a commit graph.

The open-time half is pre-existing; the write-entry heal made it hot.
Turns the previous commit's red test green.

Surfaced by review (P1, verified by repro).

* chore: harden review nits — vacuous CI filter, root-runner skip, liveness note

- ci.yml: the RustFS sidecar-lifecycle step now fails loudly if the
  's3_' name filter matches zero tests (cargo passes vacuously on an
  empty filter; the step exists specifically to prove S3 sidecar I/O
  coverage). The pre-existing CLI smoke step has the same shape and is
  left for a follow-up.
- cluster unreadable-payload test: cfg(unix) + a skip-with-log when
  running as root (mode 000 is still readable to root, common in
  container dev runners), so the test degrades instead of failing.
- refresh: document the one-pass-late convergence for legacy staging
  residue while non-SchemaApply sidecars pend, so nobody 'fixes' it by
  re-running the reconcile unserialized — the exact race the
  serialization key closes.

* test(engine): pin orphan-discard idempotency across a delete fault

discard_orphaned_branch_sidecar writes its audit row and main commit
before deleting the sidecar; a Phase D delete fault leaves the sidecar
on disk with the audit already durable, and the retry repeated the
whole path -- a second OrphanedBranchDiscarded audit row (and commit)
for the same operation. Currently red: 2 rows after one fault + retry.
The retry must only finish the delete. Fix next.

Also promotes the recovery-audit kinds reader into the shared test
helpers (it was recovery.rs-local).

Surfaced by external review of the orphan-discard fix.

* fix(engine): orphan-discard idempotency + heal reports acted-vs-deferred

Two review findings on the recovery surface:

- discard_orphaned_branch_sidecar now checks the audit table for an
  existing (operation_id, OrphanedBranchDiscarded) row before appending
  the commit + audit pair, so a Phase D delete fault retries ONLY the
  delete instead of duplicating audit rows and commit-graph entries.
  Cold path: the list scan runs only when an orphaned sidecar exists.
  Turns the previous commit's red test green (exactly one audit row
  across fault + retry).

- process_sidecar returns whether durable state changed; the heal sets
  processed_any only for sidecars that were actually rolled forward /
  rolled back / audit-recovered (orphan discards count). Deferred
  sidecars (rollback-eligible, invariant-violating, unpromoted
  SchemaApply) no longer trigger a per-write schema reload + full
  runtime-cache invalidation while they pend -- the cache is
  snapshot-keyed so this was waste, not corruption, but it was paid on
  every write until reopen. Acted-paths' processed=true remains pinned
  by load_after_schema_apply_phase_b_failure_uses_recovered_catalog
  (the reload depends on it).

Surfaced by external review.

* test(engine): pin the orphan-discard audit-append fault leg as documented tolerance

The orphan discard's commit append and audit append are two writes; a
failure between them leaves a recovery commit with no audit row, and
the retry (keyed on the audit row, the operator-facing record) appends
a second commit before the audit lands. This is the same
not-atomic-pair-write tolerance record_audit documents and the
manifest->commit-graph Known Gap covers for every publish: bounded
commit-graph noise, audit row exactly-once under clean failures.
Keying idempotency on commit rows instead would need an operation_id
column on _graph_commits, and audit-before-commit would dangle the
graph_commit_id join -- both worse than the documented residual.

Make the tolerance explicit instead of implicit: docstring names the
window, a failpoint sits inside it, and the new test pins convergence
across the fault (sidecar consumed, exactly one audit row), completing
the orphan-discard fault matrix alongside the delete-fault leg.

Surfaced by external review of the orphan-discard idempotency.

* test(engine): pin honest drift-guard advice when sidecar listing fails

The guard's unwrap_or(false) conflated 'classified as uncovered' with
'could not classify': a transient list fault on the guard's second
list (the entry heal's first list having succeeded) confidently routed
the operator to omnigraph repair even when the heal had just deferred
a rollback-eligible sidecar -- and repair refuses while a sidecar is
pending. Currently red: the error says 'run omnigraph repair' with no
mention of the reopen path. The fix names both paths plus the failure
cause when classification is impossible.

Surfaced by external review of the drift-guard fallback.

* fix(engine): admit ambiguity in the drift guard when sidecar listing fails

Replace the unwrap_or(false) fallback with a tri-state: covered ->
reopen advice; uncovered -> repair advice; listing FAILED -> say the
drift could not be classified, name the cause, and give both paths in
order ('run repair, or reopen read-write if repair reports a pending
sidecar'). The old fallback confidently routed a transient list fault
to repair, which refuses while a sidecar is pending -- a self-
correcting but pointless detour. The conflict itself is still always
raised; only the advice degrades honestly. Turns the previous commit's
red test green.

Surfaced by external review of the drift-guard fallback.
											
										
										
											2026-06-13 11:20:08 +02:00
+								- `cargo test -p omnigraph-engine --features failpoints --test failpoints s3_` (recovery-sidecar lifecycle on a real bucket)
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
 								Locally, set `OMNIGRAPH_S3_TEST_BUCKET` (and the usual `AWS_*` vars including `AWS_ENDPOINT_URL_S3` for non-AWS) before running. Without those, S3 tests skip gracefully.
-												test(cli): address review — assert schema-show success, document exit-code stance, add e2e opt-out

- The drift-heal verification now asserts `schema show` succeeded and
  produced a schema before checking the rogue field's absence (a failed
  command previously made the negative assertion vacuously pass).
- cluster_cli documents why it deliberately does not assert exit codes
  (blocked applies exit non-zero by contract while emitting the structured
  output callers assert on).
- The comprehensive lifecycle e2es honor OMNIGRAPH_SKIP_SYSTEM_E2E=1
  (graceful skip-with-message, the S3-gate pattern) for constrained
  sandboxes; requirements + suppression documented in testing.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

											
										
										
											2026-06-10 19:05:12 +03:00
+								## System e2e requirements and suppression
 								The CLI system tests (`system_local.rs`) spawn the workspace-built `omnigraph` and `omnigraph-server` binaries (cargo provides paths via `CARGO_BIN_EXE_*`), bind ephemeral localhost ports, and use local-FS temp dirs — no external services, no env vars required; they run in the default `cargo test --workspace`. The comprehensive cluster lifecycle e2es (multi-server-restart flows) honor an opt-out for constrained sandboxes: set `OMNIGRAPH_SKIP_SYSTEM_E2E=1` to skip them with a logged message (the same graceful-skip pattern as the S3 gate). Cargo-native filtering also works: `cargo test --test system_local -- --skip local_cluster`.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								## OpenAPI drift
-												Rename repo terminology to graph (#118)
											
										
										
											2026-05-24 16:46:00 +01:00
+								`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json` and diffs against the checked-in copy. CI auto-commits the regeneration on same-repository PRs and otherwise runs in strict-check mode (env: `OMNIGRAPH_UPDATE_OPENAPI`).
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
 								## Examples & benches
 								- `crates/omnigraph/examples/bench_expand.rs` — runnable example (not part of CI).
-												docs: split user and developer docs (#93)
											
										
										
											2026-05-15 03:45:22 +03:00
+								- No `benches/` directories. Add `benches/` per crate when you ship a perf-driven change, and include the motivating workload with the optimization.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
 								## Coverage tooling — what's missing
-												Rename repo terminology to graph (#118)
											
										
										
											2026-05-24 16:46:00 +01:00
+								There is **no** coverage tooling in the repository today: no `tarpaulin.toml`, no `codecov.yml`, no coverage CI step. If you want to know whether your change is covered, the answer comes from reading and running the relevant integration tests, not from a tool.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
 								If introducing coverage tooling is in scope for your task, the natural first step is `cargo-llvm-cov` wired into a separate CI job, and a per-crate threshold rather than a global one.
-												Make "check existing coverage first" a top-level testing principle

The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.

Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.

Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-29 00:03:50 +02:00
+								## First principle: check what already covers it
 								**Before writing any new test, check whether an existing test already covers the case.** The cost of duplicating coverage is high: more code to read, more places to keep in sync when behavior changes, and more drift when one copy lags. The cost of *extending* an existing test is usually one extra assertion or one extra fixture row.
 								How to check:
-												docs: rename runs.md/runs.rs → writes and repoint all references (#131)

The Run state machine was removed in MR-771 (v0.4.0); `docs/dev/runs.md`
and `crates/omnigraph/tests/runs.rs` have since documented and tested the
direct-publish write path, so the "runs" name was misleading.

- git mv docs/dev/runs.md → docs/dev/writes.md (reframe H1 + intro;
  keep MR-771 history note)
- git mv crates/omnigraph/tests/runs.rs → tests/writes.rs (reframe header)
- repoint every runs.md / runs.rs reference across docs, AGENTS.md, and
  source comments
- fix four pre-existing broken `docs/runs.md` links (the file never lived
  at that path) to `docs/dev/writes.md`
- fix the stale v0.4.0 anchor to the live section

No behavior change: every source edit is a comment. Engine builds and the
renamed test passes 25/25; scripts/check-agents-md.sh passes.

The run-removal cleanup itself (run_registry.rs guard, __run__ prefix) is
deferred to MR-770.
											
										
										
											2026-05-30 23:20:56 +02:00
+. **Map the change to an area** — use the engine integration-test table above (`branching.rs`, `writes.rs`, `search.rs`, etc.). The filename usually names the area.
-												Make "check existing coverage first" a top-level testing principle

The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.

Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.

Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-29 00:03:50 +02:00
+. **Open the file and skim every test fn name.** Test fn names are the index — read them all, not just the first few.
 . **Grep for the symbol or path you're changing.** `rg <FunctionName>` or `rg <enum_variant>` across all `tests/` directories surfaces existing coverage you might miss.
 . **Decide one of three outcomes**, in this order of preference:
 								   - *Existing test already asserts the new behavior* → no new test needed; this PR is a refactor or no-op behaviorally. Confirm by running the existing test against the change.
 								   - *Existing test covers the area but not your case* → **add an assertion or a fixture row to the existing test**, don't write a new function with `init_and_load()` again.
 								   - *No existing coverage in any test file* → only then write a new test; put it in the file that owns the area, or open a new file only if the area itself is new.
-												Rename repo terminology to graph (#118)
											
										
										
											2026-05-24 16:46:00 +01:00
+								Three duplicated `init_and_load() → run_query → assert_eq` blocks where one parameterized test would do is the most common form of test rot in this repository. Don't add to it.
-												Make "check existing coverage first" a top-level testing principle

The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.

Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.

Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-29 00:03:50 +02:00
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+								## Before-every-task checklist
 								When you pick up any change, walk through this:
-												Make "check existing coverage first" a top-level testing principle

The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.

Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.

Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-29 00:03:50 +02:00
+. **Find existing coverage** (per the principle above). Don't just look at the first test file by name — grep for the symbol you're touching across every crate's `tests/`.
 . **Run those tests locally before editing.** `cargo test --workspace --locked` for the broad pass; `-p <crate> --test <file>` for a focused loop. Confirm a clean baseline.
 . **Decide extend-vs-new** explicitly. If you can extend an existing test (assertion, fixture row, parameterization), do that. Only add a new test fn or new file if no existing one owns the area.
-												Rename repo terminology to graph (#118)
											
										
										
											2026-05-24 16:46:00 +01:00
+. **Reuse the helpers.** `init_and_load()`, fixture files, the CLI `support` harness — re-use them. Don't bootstrap a fresh graph by hand if a helper exists.
-												docs: split user and developer docs (#93)
											
										
										
											2026-05-15 03:45:22 +03:00
+. **Mind the boundary.** Per [docs/dev/invariants.md](invariants.md), test at the layer the change lives at — planner-level changes deserve planner-level tests, not just end-to-end.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
+. **For substrate-touching changes** (Lance behavior), reach for `failpoints` or fixture-driven scenarios, not stubbed-out mocks.
 . **For server / API changes**, confirm the OpenAPI regeneration happens in `openapi.rs` and that the diff lands in `openapi.json`.
-												Make "check existing coverage first" a top-level testing principle

The original docs/testing.md mentioned finding existing tests as step 1
of the checklist but never explicitly said "if existing coverage
already addresses your case, extend it; don't duplicate." Adds a
prominent "First principle" section that names extend-vs-new as the
preferred outcome and lists three duplicated init_and_load blocks as
the most common form of test rot.

Adds an extra checklist item: verify your change makes an *existing*
test fail before it makes a new one pass — if you can break the code
without breaking a test, that coverage gap is the bug to fix first.

Strengthens the AGENTS.md callout so the principle ("always check what
already covers it") is in scope from the top of every session.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-29 00:03:50 +02:00
+. **Verify your change makes an existing test fail before it makes the new one pass.** If you can break the code without breaking a test, your coverage gap is the problem to fix first.
-												Add docs/testing.md as required-read every session

Maps the test surface (engine integration tests by area, CLI/server
tests, helpers harness, fixtures, failpoints feature, RustFS S3
integration, OpenAPI drift) and gives a before-every-task checklist:
find existing tests for the area, run them as a clean baseline, plan
the new test up front, reuse helpers, mind the layer boundary per
invariants §VII.33.

Notes that there's no coverage tooling today — coverage knowledge
comes from reading and running the relevant integration tests, not a
tarpaulin/codecov report.

Threaded into AGENTS.md as the third required-reading file alongside
invariants.md and lance.md, with a Claude-Code @-import so agents
load it on every turn.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

											
										
										
											2026-04-28 23:55:21 +02:00
-												docs: split user and developer docs (#93)
											
										
										
											2026-05-15 03:45:22 +03:00
+								When in doubt, re-read [docs/dev/invariants.md](invariants.md) — quality gates apply to every change.