# OmniGraph — Agent Guide
This file is the always-on map for AI coding agents (Claude Code, Codex, Cursor, Cline) working in this codebase. It is loaded into context on every turn, so it stays as a **map plus the rules and invariants that need to be in scope at all times** — the encyclopedia content lives under [`docs/`](docs/). When you need depth, follow a pointer.
**Required reading every session, every change:**
1. **[docs/dev/invariants.md](docs/dev/invariants.md)** — the architectural invariants and deny-list. Apply to every PR, not only architecture work.
2. **[docs/dev/lance.md](docs/dev/lance.md)** — the curated index of upstream Lance docs. **Consult it before every task** to identify which Lance pages are relevant. **Then fetch every page in the matching domain section, plus every page that is even slightly relevant** — not just the page whose title most obviously matches the task. Behavior is interlocked across pages (transactions reference index lifecycle; index lifecycle references compaction; compaction references row-id lineage), and skipping a "slightly relevant" page is how alignment misses happen. The index itself is not a substitute for reading the pages — never act on the index alone. **Always fetch the FULL page content, not summaries** — use `curl -sL <url> | pandoc -f html -t markdown` or paste the rendered page text manually. Tools that summarize pages (like Claude's `WebFetch`) drop load-bearing details — we have caught alignment misses (default flags, `pub(crate)` blockers, three-page sub-specs hidden behind navigation hubs) only after dumping the full markdown.
3. **[docs/dev/testing.md](docs/dev/testing.md)** — the test-coverage map. **Always check what already covers your change before writing a new test.** Extending an existing test (an assertion, a fixture row, a parameterization) is preferred over a duplicated `init_and_load()` block. Walk the before-every-task checklist to identify existing coverage, run those tests as a clean baseline, and only add a new test fn or file when no existing one owns the area.
Tools that support `@`-imports (Claude Code) auto-include all three files via the imports below — note these must sit at column 0 (not inside a blockquote) for the parser to recognize them. Other agents (Codex, Cursor, Cline, …) must open them explicitly at the start of each session.
@docs/dev/invariants.md
@docs/dev/lance.md
@docs/dev/testing.md
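The column-0 rule above is easy to break when reformatting this file, so it can help to check it mechanically. The following is a minimal sketch of such a check, assuming only what this file states (an import is a line of the form `@path.md`, and it must start at column 0). The function name `misplaced_imports` is hypothetical and is not part of any existing tooling in this repository.

```rust
/// Return the 1-based line numbers of `@`-imports that do not start at
/// column 0, for example because they sit inside a blockquote or are indented.
/// Hypothetical helper: it assumes an import is a whole line of the form
/// `@some/path.md` with no spaces in it.
pub fn misplaced_imports(markdown: &str) -> Vec<usize> {
    markdown
        .lines()
        .enumerate()
        .filter_map(|(i, line)| {
            // Strip indentation and blockquote markers to find the payload.
            let payload = line
                .trim_start_matches(|c: char| c == '>' || c.is_whitespace())
                .trim_end();
            let is_import = payload.starts_with('@')
                && payload.ends_with(".md")
                && !payload.contains(char::is_whitespace);
            // Flag only imports whose raw line does not begin with `@`.
            (is_import && !line.starts_with('@')).then_some(i + 1)
        })
        .collect()
}
```

Given a file whose second import is written as `> @docs/dev/lance.md`, the function reports line 2. A prose mention such as `See @docs/dev/lance.md for details` is ignored because the line as a whole is not an import.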
`CLAUDE.md` is a symlink to this file — there is exactly one source of truth. Edit `AGENTS.md`.
**Version surveyed:** 0.8.0
**Workspace crates:** `omnigraph-compiler`, `omnigraph` (engine), `omnigraph-policy`, `omnigraph-api-types` (shared HTTP wire DTOs), `omnigraph-cluster`, `omnigraph-cli`, `omnigraph-server`
**Storage substrate:** Lance 7.x (columnar, versioned, branchable)
**License:** MIT
**Toolchain:** Rust stable, edition 2024
---
## Start here — what is this?
OmniGraph is a typed property-graph engine built as a coordination layer over many Lance datasets. Highlights:
- **Storage**: per node/edge type a separate Lance dataset; multi-dataset commits coordinated atomically through one `__manifest` table.
- **Languages**: a `.pg` schema language and a `.gq` query language, both Pest-based, with a typed IR.
- **Multi-modal querying**: vector ANN (`nearest`), full-text (`search`/`fuzzy`/`match_text`/`bm25`), Reciprocal Rank Fusion (`rrf`), and graph traversal (`Expand`, anti-join `not { … }`) in one runtime.
- **Branches and commits across the whole graph**: Git-style — every successful publish appends to a commit DAG; merges are three-way at the row level.
- **Atomic per-query writes**: `mutate_as` and `load` accumulate insert/update batches into an in-memory `MutationStaging.pending` per touched table; one `stage_*` + `commit_staged` per table runs at end-of-query, then `ManifestBatchPublisher::publish` commits the manifest atomically with per-table `expected_table_versions` CAS. A mid-query failure leaves Lance HEAD untouched on staged tables — no drift, no run state machine, no staging branches. Deletes stage through the same path (MR-A: `stage_delete` via Lance 7.0 `DeleteBuilder::execute_uncommitted`), so they no longer advance Lance HEAD inline. D₂ at parse time is a deliberate boundary — one mutation query is constructive (insert/update) XOR destructive (delete) — so read-your-writes within a query stays unambiguous and each table commits at most one version; compose mixed operations via separate mutations, or a branch for single-commit atomicity.
- **HTTP server**: Axum + utoipa OpenAPI, bearer auth (SHA-256 hashed, optional AWS Secrets Manager). Cedar policy enforcement is engine-wide — every `_as` writer calls `Omnigraph::enforce(action, scope, actor)`, so HTTP, CLI, and embedded SDK consumers all hit the same gate. **Cluster-only boot** (RFC-011): the server always boots from a cluster directory (`--cluster <dir | s3://…>`, RFC-005) and serves N graphs (N ≥ 1) under multi-graph routes (`/graphs/{graph_id}/...` + read-only `GET /graphs` enumeration); there are no single-graph flat routes and no positional-URI boot. Per-graph + server-level Cedar policies. Runtime add/remove (`POST /graphs`, `DELETE /graphs/{id}`) is not exposed — operators run `cluster apply` and restart.
- **CLI** with two-surface config (RFC-007/008): the team-owned cluster directory (`cluster.yaml`) plus the per-operator `~/.omnigraph/config.yaml` (servers, clusters, credentials, actor, profiles, aliases, defaults). Graphs are addressed via `--store`/`--server`/`--cluster`/`--profile`/operator defaults (RFC-011). Multi-format output (json/jsonl/csv/kv/table).
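The atomic per-query write path in the highlights above can be modeled in a few lines. This is a toy, in-memory sketch and not the engine's code: `Manifest`, `Staging`, `Kind`, and `PublishError` are hypothetical names, table versions are plain integers, and the D₂ check runs when a statement is recorded rather than at parse time. It only illustrates the shape: stage per table, then publish every touched table together under an expected-version check.

```rust
use std::collections::BTreeMap;

/// Kind of a mutation statement; D₂ allows one kind per query.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Kind {
    Constructive, // insert / update
    Destructive,  // delete
}

/// Toy stand-in for the manifest: table name -> committed version.
#[derive(Debug, Default)]
pub struct Manifest {
    versions: BTreeMap<String, u64>,
}

impl Manifest {
    /// Committed version of `table`, or 0 if it was never published.
    pub fn version(&self, table: &str) -> u64 {
        self.versions.get(table).copied().unwrap_or(0)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum PublishError {
    /// A query mixed insert/update with delete (the D₂ boundary).
    MixedKinds,
    /// Another writer moved a table after this query pinned it.
    VersionMismatch { table: String, expected: u64, actual: u64 },
}

/// Per-query staging area; nothing here changes the manifest until `publish`.
#[derive(Debug, Default)]
pub struct Staging {
    kind: Option<Kind>,
    /// table -> (version pinned at first touch, staged row count)
    pending: BTreeMap<String, (u64, usize)>,
}

impl Staging {
    /// Record a statement against `table`, enforcing one kind per query.
    pub fn record(
        &mut self,
        manifest: &Manifest,
        table: &str,
        kind: Kind,
        rows: usize,
    ) -> Result<(), PublishError> {
        match self.kind {
            Some(k) if k != kind => return Err(PublishError::MixedKinds),
            _ => self.kind = Some(kind),
        }
        let pinned = manifest.version(table);
        let entry = self.pending.entry(table.to_string()).or_insert((pinned, 0));
        entry.1 += rows;
        Ok(())
    }

    /// All-or-nothing publish: verify every expected version first, then
    /// advance each touched table by exactly one version.
    pub fn publish(self, manifest: &mut Manifest) -> Result<(), PublishError> {
        for (table, (expected, _rows)) in &self.pending {
            let actual = manifest.version(table);
            if actual != *expected {
                return Err(PublishError::VersionMismatch {
                    table: table.clone(),
                    expected: *expected,
                    actual,
                });
            }
        }
        for (table, (expected, _rows)) in self.pending {
            manifest.versions.insert(table, expected + 1);
        }
        Ok(())
    }
}
```

In this model, a query that records inserts on two tables and then publishes advances both tables by one version. A query whose pinned version has gone stale fails with `VersionMismatch` and leaves the manifest unchanged, which is the "no drift" property the bullet describes.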
Throughout the docs, capabilities are split into **L1 — Inherited from Lance** vs **L2 — Added by OmniGraph**.
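For readers unfamiliar with the `rrf` operator listed among the highlights, Reciprocal Rank Fusion combines several ranked lists by summing `1 / (k + rank)` for each item. A minimal sketch of the general technique follows. The function name, its signature, and the `k = 60` used in the note below are assumptions drawn from common practice, not from OmniGraph's implementation.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of ids with Reciprocal Rank Fusion.
/// Each list contributes 1 / (k + rank) per id, with rank starting at 1.
/// Illustrative only; this is not the engine's `rrf`.
pub fn rrf(rankings: &[Vec<&str>], k: f64) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for ranking in rankings {
        for (i, id) in ranking.iter().enumerate() {
            let rank = (i as f64) + 1.0;
            *scores.entry((*id).to_string()).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

With a vector ranking `["a", "b", "c"]`, a text ranking `["b", "c", "a"]`, and `k = 60.0`, the fused order is `b`, `a`, `c`: `b` sits near the top of both lists, while `a` is first in one list and last in the other.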
---
## Architecture at a glance
```
CLI (omnigraph)              HTTP Server (omnigraph-server, Axum)
│                            │
└─────────────┬──────────────┘
omnigraph-compiler ── Pest grammars, catalog, IR, lowering, lint, migration plan
omnigraph (engine) ── ManifestCoordinator, CommitGraph, RunRegistry, GraphIndex (CSR/CSC), exec
Lance 7.x ── columnar Arrow, fragments, per-dataset versions/branches, indexes
Object store (file / s3 / RustFS / MinIO / S3-compat)
```
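
For readers unfamiliar with the `GraphIndex (CSR/CSC)` entry in the diagram above: compressed sparse row (CSR) stores each node's outgoing neighbors contiguously, and CSC is the same layout built over reversed edges, which gives incoming neighbors. The sketch below is a generic textbook construction that assumes dense `u32` node ids. The `Csr` type, its methods, and `reversed` are hypothetical names for illustration and are not taken from OmniGraph's source.

```rust
/// Compressed sparse row adjacency (generic sketch, not OmniGraph's `GraphIndex`).
/// Neighbors of node `u` live in `targets[offsets[u]..offsets[u + 1]]`.
struct Csr {
    offsets: Vec<usize>,
    targets: Vec<u32>,
}

impl Csr {
    /// Build from an edge list using a counting pass, a prefix sum, and a fill pass.
    fn from_edges(node_count: usize, edges: &[(u32, u32)]) -> Self {
        // Pass 1: count the out-degree of each source node, shifted by one slot.
        let mut offsets = vec![0usize; node_count + 1];
        for &(src, _) in edges {
            offsets[src as usize + 1] += 1;
        }
        // Pass 2: prefix sum turns counts into start offsets.
        for i in 0..node_count {
            offsets[i + 1] += offsets[i];
        }
        // Pass 3: place each target at the next free slot of its source's range.
        let mut cursor = offsets.clone();
        let mut targets = vec![0u32; edges.len()];
        for &(src, dst) in edges {
            let slot = cursor[src as usize];
            targets[slot] = dst;
            cursor[src as usize] += 1;
        }
        Csr { offsets, targets }
    }

    /// Neighbors of `u` as a contiguous slice, in edge-list order.
    fn neighbors(&self, u: u32) -> &[u32] {
        let u = u as usize;
        &self.targets[self.offsets[u]..self.offsets[u + 1]]
    }
}

/// Flip every edge; building a `Csr` over the result yields the CSC (incoming) view.
fn reversed(edges: &[(u32, u32)]) -> Vec<(u32, u32)> {
    edges.iter().map(|&(src, dst)| (dst, src)).collect()
}
```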

Full diagram and concurrency model: [docs/dev/architecture.md](docs/dev/architecture.md).

---

## Where to find each topic

| Area | Read |
|---|---|
| **User docs entry point (public CLI/API/operator docs)** | **[docs/user/index.md](docs/user/index.md)** |
| **Developer docs entry point (architecture, invariants, testing, internals)** | **[docs/dev/index.md](docs/dev/index.md)** |
| **Architectural invariants & deny-list (read before any non-trivial proposal or review)** | **[docs/dev/invariants.md](docs/dev/invariants.md)** |
| **Lance docs index — fetch upstream Lance docs by problem domain** | **[docs/dev/lance.md](docs/dev/lance.md)** |
| **Test coverage map — what's covered, what helpers to reuse, before-every-task checklist** | **[docs/dev/testing.md](docs/dev/testing.md)** |
| Architecture, L1/L2 framing, concurrency model | [docs/dev/architecture.md](docs/dev/architecture.md) |
| Storage layout, `__manifest` schema, URI schemes, S3 env vars | [docs/user/concepts/storage.md](docs/user/concepts/storage.md) |
| `.pg` schema language, types, constraints, annotations, migration planning | [docs/user/schema/index.md](docs/user/schema/index.md) |
| Schema-lint codes (`OG-XXX-NNN`), families, severity, suppression | [docs/user/schema/lint.md](docs/user/schema/lint.md) |
| `.gq` query language, MATCH/RETURN/ORDER, IR ops, lint codes | [docs/user/queries/index.md](docs/user/queries/index.md) |
| Mutations — insert/update/delete, D2, atomicity | [docs/user/mutations/index.md](docs/user/mutations/index.md) |
| Search funcs (`nearest`/`bm25`/`rrf`), hybrid ranking | [docs/user/search/index.md](docs/user/search/index.md) |
| Indexes (BTREE / inverted / vector / graph topology) | [docs/user/search/indexes.md](docs/user/search/indexes.md) |
| Embeddings (engine client, env vars, `@embed`) | [docs/user/search/embeddings.md](docs/user/search/embeddings.md) |
| Concepts — what OmniGraph is, L1/L2 framing | [docs/user/concepts/index.md](docs/user/concepts/index.md) |
| Quickstart — init → load → query → branch | [docs/user/quickstart.md](docs/user/quickstart.md) |
| Branches, commit graph, system branches | [docs/user/branching/index.md](docs/user/branching/index.md) |
| Snapshots & time travel | [docs/user/branching/time-travel.md](docs/user/branching/time-travel.md) |
| Three-way merge and conflict kinds (user-facing) | [docs/user/branching/merge.md](docs/user/branching/merge.md) |
| Transactions and atomicity (per-query atomic; branches as multi-query transactions) | [docs/user/branching/transactions.md](docs/user/branching/transactions.md) |
| Direct-publish write path (staging, D2, recovery sidecars; the former Run state machine) | [docs/dev/writes.md](docs/dev/writes.md) |
| Three-way merge and conflict kinds (developer-facing) | [docs/dev/merge.md](docs/dev/merge.md) |
| Diff / change feed (`diff_between`, `diff_commits`) | [docs/user/branching/changes.md](docs/user/branching/changes.md) |
| Query execution, mutation execution, bulk loader, `load` vs `ingest` | [docs/dev/execution.md](docs/dev/execution.md) |
| `optimize` (compaction) and `cleanup` (version GC) | [docs/user/operations/maintenance.md](docs/user/operations/maintenance.md) |
| Upgrade across a storage-format change (export/import rebuild) | [docs/user/operations/upgrade.md](docs/user/operations/upgrade.md) |
| Versioning & compatibility policy (release / wire / storage strict-single-version / Lance) | [docs/dev/versioning.md](docs/dev/versioning.md) |
| Cluster operator guide (deploy/manage clusters, approvals, recovery, serving) | [docs/user/clusters/index.md](docs/user/clusters/index.md) |
| Cedar policy actions, scopes, CLI | [docs/user/operations/policy.md](docs/user/operations/policy.md) |
| HTTP server endpoints, auth, error model, body limits | [docs/user/operations/server.md](docs/user/operations/server.md) |
| CLI quick-start | [docs/user/cli/index.md](docs/user/cli/index.md) |
| CLI command surface and config schema (`~/.omnigraph/config.yaml`) | [docs/user/cli/reference.md](docs/user/cli/reference.md) |
| Audit / actor tracking | [docs/user/operations/audit.md](docs/user/operations/audit.md) |
| Error taxonomy and result serialization | [docs/user/operations/errors.md](docs/user/operations/errors.md) |
| Install (binary / Homebrew / source / channels) | [docs/user/install.md](docs/user/install.md) |
| Deployment (binary / container / S3-local testing / auth / build variants) | [docs/user/deployment.md](docs/user/deployment.md) |
| CI / release workflows | [docs/dev/ci.md](docs/dev/ci.md) |
| Branch protection policy (declarative, applied via `scripts/apply-branch-protection.sh`) | [docs/dev/branch-protection.md](docs/dev/branch-protection.md) |
| Constants & tunables cheat sheet | [docs/user/reference/constants.md](docs/user/reference/constants.md) |
| Per-version release notes | [docs/releases/](docs/releases/) |
---
## First principle: engineering is programming integrated over time
Software engineering is **programming integrated over time** (Winters, *Software Engineering at Google*). A line of code costs you at every future read, refactor, migration, and dependent change — not just at write-time. So the operative question for any change is: **which option has the lower ongoing liability?** Not "shorter now," not "fastest to ship," but which leaves the codebase narrower in the long run. **Complexity should be earned** — by demonstrated correctness, performance, or future-shape cost; never by speculation.
This is a decision lens, not a code-size rule. It cuts both ways. Sometimes the lower-liability option is:
- **More code.** A centralized dispatcher costs more lines than an ad-hoc heal hook, but each future change adds a match arm instead of a new hook scattered through the engine.
- **Less code.** Three similar lines that may diverge later cost less to maintain than a premature abstraction that has to be retrofitted every time a caller deviates.
- **DRYing.** Two copies of business logic that must stay in sync are a perpetual drift risk.
- **Duplication.** Two callers that look similar today but have independent evolution pressure shouldn't be wedged through a shared helper just because the lines match.
- **Removal.** A "just in case" code path with no caller is pure surface area: tests for it, docs that mention it, future changes that have to consider it.
- **Addition.** A migration framework, a typed error variant, a feature flag — each adds code now and lowers the cost of every future change in its surface.
- **A new abstraction**, when the absence forces every consumer to re-derive the same logic. Or **flattening one**, when the abstraction has accumulated more special-cases than the code it replaced.
When evaluating a design, ask: *"what does this look like after 5 more changes like it?"* If the answer is "this converges to one shape", cost is bounded. If it's "this forks every time", the option is mortgaging the future for present convenience — pick differently.
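As a rough illustration of the "converges to one shape" case, here is a minimal Rust sketch of a centralized dispatcher. The `Repair` enum, its variants, and `dispatch` are hypothetical names invented for this example, not OmniGraph APIs; the point is only that the sixth change of this kind is one more variant and one more match arm, and the compiler flags any arm that was forgotten.

```rust
/// Hypothetical repair kinds; the names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repair {
    RebuildIndex,
    RefreshManifest,
}

/// The single place where repair kinds are routed.
/// Adding a kind means adding a variant and one arm here; the exhaustive
/// `match` fails to compile until the new arm exists.
fn dispatch(repair: Repair) -> &'static str {
    match repair {
        Repair::RebuildIndex => "rebuild-index",
        Repair::RefreshManifest => "refresh-manifest",
    }
}

fn main() {
    for repair in [Repair::RebuildIndex, Repair::RefreshManifest] {
        println!("{:?} -> {}", repair, dispatch(repair));
    }
}
```

The ad-hoc alternative would put each handler at a different call site, which is the "forks every time" shape.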
The same lens has a structural corollary: **one source of truth, cheaply derived.** Lance and the manifest are the source of truth; everything else is a derived view. Maintaining a parallel copy invites drift that compounds over time, and re-deriving a view from the full source on every call makes its cost grow with history. Both are liabilities integrated over time, so both are ruled out the same way: hold a warm derived view and refresh it with a cheap probe, never shadow the source or rebuild from it cold. Invariant 15 in [docs/dev/invariants.md](docs/dev/invariants.md) states this; invariants 1 (respect the substrate) and 7 (indexes are derived state) are instances.
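The "warm derived view, cheap probe" pattern can be sketched in a few lines of Rust. This is a toy model under stated assumptions, not the engine's implementation: `Source`, `WarmView`, `probe_version`, and `full_scan` are hypothetical names, and the "source of truth" is an in-memory struct standing in for Lance and the manifest.

```rust
use std::cell::Cell;

/// Hypothetical stand-in for the source of truth.
struct Source {
    version: u64,
    tables: Vec<String>,
    /// Counts expensive re-derivations so the cost is observable.
    full_scans: Cell<u32>,
}

impl Source {
    /// Cheap probe: returns only the latest version.
    fn probe_version(&self) -> u64 {
        self.version
    }

    /// Expensive path: re-derive the whole view from the source.
    fn full_scan(&self) -> (u64, Vec<String>) {
        self.full_scans.set(self.full_scans.get() + 1);
        (self.version, self.tables.clone())
    }
}

/// Derived view held warm between reads.
struct WarmView {
    version: u64,
    tables: Vec<String>,
}

impl WarmView {
    fn load(source: &Source) -> Self {
        let (version, tables) = source.full_scan();
        WarmView { version, tables }
    }

    /// Probe first; rebuild only when the source has moved.
    fn read(&mut self, source: &Source) -> &[String] {
        if source.probe_version() != self.version {
            let (version, tables) = source.full_scan();
            self.version = version;
            self.tables = tables;
        }
        &self.tables
    }
}

fn main() {
    let mut source = Source {
        version: 1,
        tables: vec!["Person".to_string()],
        full_scans: Cell::new(0),
    };
    let mut view = WarmView::load(&source);

    // Two warm reads: each costs one probe and no additional scan.
    view.read(&source);
    view.read(&source);
    assert_eq!(source.full_scans.get(), 1);

    // The source moves; the next read detects it and refreshes once.
    source.version = 2;
    source.tables.push("Doc".to_string());
    assert_eq!(view.read(&source).len(), 2);
    assert_eq!(source.full_scans.get(), 2);
}
```

The view never becomes a second source of truth: it is discarded and rebuilt whenever the probe disagrees, and its steady-state cost is one probe regardless of how much history the source has accumulated.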
### Tiebreakers when liability alone is silent
- **Correctness > simplicity > performance.** Lexicographic — give up performance for simpler code; give up simplicity for correct code; never give up correctness. The deny-list ("no silent failures," "no acks before durable persistence," "no reads of partial commits") is this rule's hard floor.
- **Reversibility shapes evidence demand.** Reversible changes wait for evidence: prefer prod metrics over napkin math over RFCs. Irreversible changes (substrate choice, on-disk format, database guarantees) earn an RFC, because by the time prod tells you they were wrong, you've shipped years of dependent code. Reviewers should spot both failure modes — RFC-ing a one-line config, and measuring-your-way into a substrate decision.
The always-on rules below and the deny-list in [docs/dev/invariants.md](docs/dev/invariants.md) are specific applications of this principle; when the rules are silent, fall back to it.
---
## Always-on rules (load these into your working memory)
These are architectural rules that need to be in scope on every change. They're framed at the level that survives renames and refactors — the deeper implementation specifics (function names, lock names, branch-prefix conventions, enforcement points) live in the per-area docs and may evolve. The full architectural invariants and deny-list are in [docs/dev/invariants.md](docs/dev/invariants.md); the deny-list is the fastest first-pass when reviewing any change.
1. **Multi-dataset publish is atomic across the whole graph.** A graph commit flips every relevant sub-table version visible together, in one manifest write. Don't introduce code paths that publish per sub-table outside the unified publish path — that loses cross-table snapshot isolation.
2. **Snapshot isolation per query.** A query holds one snapshot for its lifetime. Don't re-read the current head mid-query.
3. **Mutations are atomic at the commit boundary.** Multi-statement change queries publish one commit. Don't commit per-statement.
4. **Bearer-token plaintext never persists in process memory.** Tokens are hashed at startup; auth uses constant-time comparison; the actor id is server-resolved from the hash match and must not be settable by the client.
5. **Reads always see the current index state for the branch they're reading.** Indexes track the branch head, not historical snapshots. If you change index lifecycle, preserve this guarantee.
6. **Stable type IDs survive renames.** Schema migration relies on identity that's stable across rename — don't mint new IDs on rename.
- test_apply_schema_rewrite_defers_index_then_reconciler_restores (was test_apply_schema_rewrite_preserves_existing_indices): an AddProperty rewrite no longer rebuilds indexes inline; assert ensure_indices restores id BTREE + name FTS after the rewrite. Verified by grep that these + the server/CLI tests are the complete set of "apply builds an index" assertions; all other index-presence tests run after load/ensure_indices/primitives, which still build. Refs dev-graph iss-848. * fix(engine): optimize always reports pending indexes, not only on create-work (PR review) Cursor Bugbot (MED): pending_indexes was filled only when needs_index_create was true, but the vector trainability precheck makes needs_index_work_node exclude an untrainable Vector column. So a table whose sole missing index is untrainable, but which optimize still compacts or reindexes, returned an empty pending_indexes — contradicting the documented operator contract for deferred columns. Run the (idempotent) build chokepoint unconditionally once past the no-op gate, rather than gating it on needs_index_create. It skips existing indexes, builds any buildable missing one, and reports an untrainable column as pending whether the table entered for compaction, reindex, or index creation. needs_index_create still gates the no-op decision (so an index-only table still enters the path). Refs dev-graph iss-848. * test(engine): reframe staged-BTREE-failure failpoint onto the reconciler path ensure_indices_stage_btree_failure_leaves_existing_tables_writable fired `ensure_indices.post_stage_pre_commit_btree` and expected `apply_schema` (adding a type) to fail mid-BTREE-build. iss-848 removed apply's inline index build, so that apply now succeeds and the test's unwrap_err panicked — it exercised a removed code path. 
Reframe onto where BTREE builds happen now: seed Person, add an `@index` on `age` (apply records intent, defers the build), then `ensure_indices` builds the deferred BTREE and the failpoint fires between stage and commit. Person's HEAD is unchanged (no drift) and its EnsureIndices sidecar pins NoMovement; a write to a different, unpinned table (Company) is unaffected (mutations/loads heal roll-forward and proceed, unlike optimize/repair which refuse on a pending sidecar). Preserves the original coverage (staged-index stage failure leaves other tables writable, no drift) in the new architecture. Refs dev-graph iss-848. * feat(server): converge deferred indexes promptly after schema apply (iss-848) Schema apply records @index intent but defers the physical build. On a long-lived server, spawn a detached best-effort ensure_indices after a successful apply so the indexes converge promptly instead of waiting for the operator's next optimize. Fire-and-forget: it never blocks or fails the apply response, and a failure is logged (the index still converges on the next optimize). Guarded on result.applied. The CLI is one-shot, so it has no equivalent; its convergence path is the optimize cadence. handle.engine is already an Arc, so the spawn takes an owned clone. Convergence itself is covered by the engine ensure_indices/optimize tests; the existing empty-graph schema-apply route tests confirm the response is unaffected (the spawn is a read-only no-op on an empty table). Refs dev-graph iss-848. * docs(maintenance): list pending_indexes in optimize per-table stats (consistency)
2026-06-15 18:48:43 +02:00
7. **Logical contract over physical state.** Physical state (index coverage, fragment layout, compaction versions, staged writes) is derived and rebuildable; it must never fail a logical operation. Check preconditions against logical state and let reconciliation converge the physical state idempotently; genuine logical conflicts still fail loudly. This is the rule that rules 1–6 instantiate; full statement and applications in [docs/dev/invariants.md](docs/dev/invariants.md).
8. **One source of truth, cheaply derived.** Lance and the manifest are the source of truth; runtime state is a derived view of them. Don't maintain a parallel copy that can drift, and don't re-derive a view from cold storage on every call (that makes cost grow with history). Hold it warm, refresh with a cheap probe.
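
The "hold it warm, refresh with a cheap probe" shape from rule 8 can be sketched as follows. This is a minimal illustration, not the engine's implementation: `Source`, `Warm`, and `MemSource` are hypothetical names, the sketch uses `std::sync::RwLock` rather than an async lock, and it assumes the source of truth exposes a version probe that is much cheaper than a full load.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::RwLock;

/// Hypothetical stand-in for the source of truth (for example, a manifest).
/// `probe_version` is assumed cheap; `load` is assumed expensive.
pub trait Source {
    type View: Clone;
    fn probe_version(&self) -> u64;
    fn load(&self) -> (u64, Self::View);
}

/// A derived view held warm and re-derived only when the probe reports a change.
pub struct Warm<S: Source> {
    pub source: S,
    cached: RwLock<Option<(u64, S::View)>>,
}

impl<S: Source> Warm<S> {
    pub fn new(source: S) -> Self {
        Self { source, cached: RwLock::new(None) }
    }

    pub fn get(&self) -> S::View {
        let latest = self.source.probe_version();
        {
            // Fast path: the cached view is still current, so serve it under the read lock.
            let slot = self.cached.read().unwrap();
            if let Some((v, view)) = slot.as_ref() {
                if *v == latest {
                    return view.clone();
                }
            }
        }
        // Slow path: take the write lock and re-check, since another caller
        // may have refreshed the view while we were waiting.
        let mut slot = self.cached.write().unwrap();
        if let Some((v, view)) = slot.as_ref() {
            if *v == latest {
                return view.clone();
            }
        }
        let (v, view) = self.source.load();
        *slot = Some((v, view.clone()));
        view
    }
}

/// Hypothetical in-memory source that counts expensive loads, for demonstration only.
pub struct MemSource {
    version: AtomicU64,
    loads: AtomicUsize,
}

impl MemSource {
    pub fn new(version: u64) -> Self {
        Self { version: AtomicU64::new(version), loads: AtomicUsize::new(0) }
    }
    /// Simulates an external writer advancing the source of truth.
    pub fn bump(&self) {
        self.version.fetch_add(1, Ordering::SeqCst);
    }
    pub fn loads(&self) -> usize {
        self.loads.load(Ordering::SeqCst)
    }
}

impl Source for MemSource {
    type View = String;
    fn probe_version(&self) -> u64 {
        self.version.load(Ordering::SeqCst)
    }
    fn load(&self) -> (u64, String) {
        self.loads.fetch_add(1, Ordering::SeqCst);
        let v = self.version.load(Ordering::SeqCst);
        (v, format!("view@{v}"))
    }
}
```

Repeated `get` calls at an unchanged version cost one probe and no load, so read cost stays flat as history grows; a version change triggers exactly one re-derivation.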
### Deny-list (fast-pass review filter — full reasoning in [docs/dev/invariants.md](docs/dev/invariants.md))
If a proposal fits one of these, the burden is on the proposer to justify why this case is the exception:
- Synchronous-inline index updates for indexes expensive to build (vector ANN, FTS) — use the reconciler pattern.
- Custom WAL / transaction manager / buffer pool — Lance owns these.
- Job queue for state derivable from manifest — reconciler pattern instead.
- Per-feature lowering for shapes that share a structure (interfaces, wildcards, alternation) — use one mechanism.
- Eager materialization of cross-products in multi-hop — factorize; flatten only when needed.
- Ad-hoc IN-list filtering when SIP fits.
- String-flattened SQL filter generation when structured pushdown is available.
- In-process-only `Dataset` impls — `Send + Sync`, remote descriptors.
- Cost-blind plan choice — lowering-order execution is not a planner.
- Hidden statistics — if a metric matters for plan choice, it must be exposed through the trait surface.
- Side-channels for query semantics — search modes, mutations, polymorphism are first-class IR concepts.
- Discarding rank in retrieval — score and rank propagate as columns.
- State that drifts from the manifest — derive from observable state.
- Cloud-only correctness fixes — correctness is always OSS.
- Forking the codebase for Cloud — trait-extension only.
- Hand-rolling something Lance already does — check the spec first.
- Shadowing the source of truth with a maintained parallel copy, or re-deriving a derived view from cold storage per call (cost then scales with history). Hold it warm and refresh cheaply.
- Mutating in place state that should be immutable (Lance fragments, index segments) — new segments instead.
- Silent failures — OOM, timeout, partial result must all be surfaced and bounded.
- Shipping observable behavior as if it weren't part of the contract — output ordering, error-message text, timestamp precision, default-flag values, latency profile. Per Hyrum's Law, every observable behavior gets depended on once shipped; don't expose what you don't want to commit to.
---
## Build, test, lint
Rust stable workspace (edition 2024). `protoc` is a build dependency (`brew install protobuf` / `apt-get install protobuf-compiler libprotobuf-dev`). **Crate dir ≠ package name** for the engine: the directory is `crates/omnigraph` but its Cargo package is `omnigraph-engine` (use that in `-p`). The CLI binary built from `omnigraph-cli` is named `omnigraph`.
```bash
cargo build --workspace --locked # build everything
cargo test --workspace --locked # the canonical CI gate (matches CI exactly)
cargo run -p omnigraph-cli -- <args> # run the `omnigraph` CLI from source
cargo run -p omnigraph-server -- --cluster <dir|s3://...> --bind 0.0.0.0:8080 # run the server from source
# Run one crate / one test file / one test fn
cargo test -p omnigraph-engine --test traversal # one integration-test file (see docs/dev/testing.md)
cargo test -p omnigraph-engine --test writes concurrent # one test fn by name substring
cargo test -p omnigraph-engine some_inline_test -- --nocapture # show stdout
# Feature-gated suites (each is its own job in CI, not part of the default run)
cargo test -p omnigraph-engine --features failpoints --test failpoints # fault injection
cargo build -p omnigraph-server --features aws # AWS Secrets Manager bearer-token source
```
S3-backed tests (`s3_storage`, and the S3 paths in server/CLI system tests) **skip** unless `OMNIGRAPH_S3_TEST_BUCKET` + `AWS_*` (incl. `AWS_ENDPOINT_URL_S3` for non-AWS) are set; CI runs them against containerized RustFS. To run RustFS/MinIO yourself, see [docs/user/deployment.md](docs/user/deployment.md) → *Testing against S3 locally*.
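A rough preflight for the skip behavior described above, as a minimal sketch. The function name `s3_tests_status` is hypothetical and not part of the repo, and the check is only an approximation: it looks for the bucket variable and for the presence of any `AWS_*` variable, whereas the tests themselves decide what counts as a complete configuration.

```bash
# Hypothetical helper (not part of the repo): guess whether the S3-backed
# tests would run or skip in the current shell environment.
# Assumption: "any AWS_* variable is exported" stands in for "credentials are
# configured"; the real skip condition lives in the test code.
s3_tests_status() {
  if [ -z "${OMNIGRAPH_S3_TEST_BUCKET:-}" ]; then
    echo "skip"   # no bucket configured, so S3 tests skip
  elif ! env | grep -q '^AWS_'; then
    echo "skip"   # bucket set, but no AWS_* variables exported
  else
    echo "run"    # bucket and at least one AWS_* variable present
  fi
}
```

Calling `s3_tests_status` before `cargo test --workspace --locked` gives a quick hint about whether a green run actually exercised the S3 paths.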
CI does **not** run `clippy` or `rustfmt` as gates — but `cargo test --workspace --locked` is the exact gate, so run it before pushing. Two non-test CI checks: `scripts/check-agents-md.sh` (doc cross-link integrity — run it after moving/renaming docs) and OpenAPI drift (`crates/omnigraph-server/tests/openapi.rs` regenerates `openapi.json`; set `OMNIGRAPH_UPDATE_OPENAPI=1` to update the checked-in copy when a server/API change is intentional).
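The checks above can be wrapped as two small shell functions; this is a sketch under stated assumptions, not a script the repo ships. The function names `prepush_checks` and `update_openapi` are hypothetical, and `--test openapi` assumes the integration-test target takes Cargo's default name from `tests/openapi.rs`.

```bash
# Hypothetical helper (not part of the repo): run the test gate, then the
# doc cross-link check, stopping at the first failure. Run from the repo root.
prepush_checks() {
  cargo test --workspace --locked &&   # the CI test gate
  scripts/check-agents-md.sh           # doc cross-link integrity
}

# Hypothetical helper: refresh the checked-in openapi.json after an
# intentional server/API change.
# Assumption: the test target is named `openapi` (Cargo's default for
# crates/omnigraph-server/tests/openapi.rs).
update_openapi() {
  OMNIGRAPH_UPDATE_OPENAPI=1 cargo test -p omnigraph-server --test openapi
}
```

Defining these in a shell session and running `prepush_checks` before a push mirrors the two checks most likely to fail in CI; `update_openapi` is only for cases where the API change is deliberate.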
---
## Quick-reference flows
```bash
# Initialize an S3-backed graph
omnigraph init --schema ./schema.pg s3://my-bucket/graph.omni
# Bulk load
omnigraph load --data ./seed.jsonl --mode overwrite s3://my-bucket/graph.omni
# Load a review batch onto its own branch (--from forks it if missing)
omnigraph load --branch review/2026-04-25 --from main --mode merge --data ./batch.jsonl s3://my-bucket/graph.omni
# Run a hybrid (vector + BM25) query — ad-hoc .gq against a store (positional = query name)
omnigraph query --query ./queries.gq find_similar \
--params '{"q":"trends in AI safety"}' --format table --store s3://my-bucket/graph.omni
# Plan + apply schema migration
omnigraph schema plan --schema ./next.pg s3://my-bucket/graph.omni
omnigraph schema apply --schema ./next.pg s3://my-bucket/graph.omni --json
# Merge review branch back
omnigraph branch merge review/2026-04-25 --into main s3://my-bucket/graph.omni
# Compact, preview any uncovered drift, then repair/GC after review
omnigraph optimize s3://my-bucket/graph.omni
omnigraph repair s3://my-bucket/graph.omni
omnigraph repair --confirm s3://my-bucket/graph.omni
# For suspicious/unverifiable drift only after deliberate review:
# omnigraph repair --force --confirm s3://my-bucket/graph.omni
omnigraph cleanup --keep 10 --older-than 7d s3://my-bucket/graph.omni
omnigraph cleanup --keep 10 --older-than 7d --confirm s3://my-bucket/graph.omni
# Stand up the HTTP server (token from env)
OMNIGRAPH_SERVER_BEARER_TOKEN=xxxx \
omnigraph-server --cluster s3://my-bucket/cluster --bind 0.0.0.0:8080
# Cedar policy explain
omnigraph policy explain --cluster ./company-brain --graph knowledge --actor act-alice --action change --branch main
```
---
## Capability matrix — "Lance by default vs. added by OmniGraph"
| Capability | L1 (Lance default) | L2 (OmniGraph adds) |
|---|---|---|
| Columnar storage on object store | ✅ Arrow/Lance | URI normalization, S3 env-var plumbing |
| Per-dataset versioning + time travel | ✅ | `snapshot_at_version`, `entity_at`, snapshot-pinned reads across many tables |
| Per-dataset branches | ✅ | **Graph-level** branches (atomic across all sub-tables), lazy fork, system branch filtering |
2026-06-27 16:48:41 +02:00
| Atomic single-dataset commits | ✅ | **Multi-table publish via three layers**, NOT a single Lance primitive: (1) per-table Lance `commit_staged` for the data write, (2) `__manifest` row-level CAS via `ManifestBatchPublisher` for cross-table ordering, (3) the open-time recovery sweep for the residual gap between (1) and (2). All three layers ship; the five migrated writers (`MutationStaging::finalize`, `schema_apply`, `branch_merge`, `ensure_indices`, `optimize_all_tables`) write a `__recovery/{ulid}.json` sidecar before Phase B and delete it after Phase C. The next `Omnigraph::open` (gated on `OpenMode::ReadWrite`) runs the sweep in `db/manifest/recovery.rs`: classify, decide all-or-nothing per sidecar, roll forward via a single `ManifestBatchPublisher::publish` or roll back via `Dataset::restore` followed by a manifest publish of the restored version (so both directions converge to `manifest == HEAD`, with no residual drift), and record an audit row in `_graph_commit_recoveries.lance` (queryable via `omnigraph commit list --filter actor=omnigraph:recovery`). The write entry points (`load_as`, `mutate_as`, `apply_schema_as`, `branch_merge_as`) and `refresh` additionally run an in-process roll-forward-only heal (serialized against live writers via the per-table write queues), so a long-lived server converges on its next write without restart; only rollback-eligible sidecars still defer to the next read-write open (closing that gap is the goal of a future background reconciler). Engine writes route through a sealed `TableStorage` trait (`db.storage()`) exposing only `stage_*` + `commit_staged` + reads; the sole inline-commit residual (`create_vector_index`) is split onto a separate sealed `InlineCommitResidual` trait reached via `db.storage_inline_residual()` (MR-854), so the default surface cannot couple a write with a HEAD advance; §1 holds by construction. `delete` migrated to the staged path in MR-A (`stage_delete` via Lance 7.0 `DeleteBuilder::execute_uncommitted`, [#6658](https://github.com/lance-format/lance/issues/6658)); `create_vector_index` stays inline until upstream Lance ships a public two-phase API ([#6666](https://github.com/lance-format/lance/issues/6666)); `LoadMode::Overwrite` uses Lance `Overwrite` staged transactions. |
| Compaction (`compact_files`) + reindex (`optimize_indices`) | ✅ | `omnigraph optimize` orchestrates over all node/edge tables, bounded concurrency; per table runs `compact_files` **then Lance `optimize_indices`** (folds appended/rewritten fragments back into existing indexes — incremental merge, not retrain) and **publishes the resulting version to `__manifest`** (so the manifest tracks the Lance HEAD — required for reads to observe the work and for schema apply / strict writes to pass their HEAD-vs-manifest precondition), under the per-`(table, main)` write queue with `SidecarKind::Optimize` recovery coverage spanning both ops; **commits even with no compaction work if index coverage is stale**; **refuses on an unrecovered graph**; **skips uncovered HEAD > manifest drift** with `DriftNeedsRepair`; **skips blob-bearing tables** (reported via `TableOptimizeStats.skipped`, not silent; reindex is skipped for them too today), gated on `LANCE_SUPPORTS_BLOB_COMPACTION` until the upstream blob-v2 compaction-decode bug is fixed (see [docs/dev/invariants.md](docs/dev/invariants.md) Known Gaps) |
| Repair uncovered drift | — | `omnigraph repair` explicitly classifies uncovered table `HEAD > manifest` drift: verified maintenance drift (`ReserveFragments`/`Rewrite`) can be published with `--confirm`; suspicious or unverifiable drift requires `--force --confirm`. Sidecar-covered crash residuals still recover automatically on open. |
| Cleanup (`cleanup_old_versions`) | ✅ | `omnigraph cleanup` with `--keep` / `--older-than` policy |
| BTREE / inverted (FTS) / vector indexes | ✅ | `@index`/`@key` declares intent; the physical index is derived state that never fails a logical op. Built per column through one chokepoint (`build_indices_on_dataset_for_catalog`, type-dispatched by `node_prop_index_kind`: enum + orderable scalar → BTREE, free-text String → FTS, Vector → vector); idempotent; lazy across branches. **Schema apply builds nothing** (records intent only); `load`/`mutate` build inline but **defer an untrainable Vector column** (no trainable vectors yet) as *pending* rather than aborting. `ensure_indices`/`optimize` is the reconciler that materializes declared-but-missing indexes and restores coverage of appended/rewritten fragments (`optimize_indices`), reporting still-pending columns (see Compaction row). |
| `merge_insert` upsert | ✅ | `LoadMode::Merge`, mutation `update`/`insert`/`delete` lowering |
| Vector search | ✅ | `nearest()` query op; embedding pipeline (Gemini / OpenAI clients); `@embed` in schema |
| Full-text search | ✅ | `search/fuzzy/match_text/bm25` query ops |
| Hybrid ranking | — | `rrf(...)` Reciprocal Rank Fusion in one runtime |
| Graph traversal | — | CSR/CSC topology index, `Expand` IR op, variable-length hops, `not { }` anti-join |
| Schema language | — | `.pg` + Pest grammar + catalog + interfaces + constraints + annotations |
| Query language | — | `.gq` + Pest grammar + IR + lowering + linter |
| Schema migration planning | — | `plan_schema_migration` + `apply_schema` step types + `__schema_apply_lock__` |
| Commit graph (DAG) across whole graph | — | Lineage (linear + merge parents, ULID ids, actor) stored as `graph_commit`/`graph_head` rows in `__manifest`, written in the same publish CAS as the table-version rows (RFC-013 Phase 7 — atomic with the graph commit). The in-memory commit graph is a pure projection of those rows; the legacy `_graph_commits.lance` / `_graph_commit_actors.lance` tables are **retired** (a fresh graph creates neither) |
| Per-query atomic writes | — | In-memory `MutationStaging.pending` accumulator + `stage_*` / `commit_staged` per touched table at end-of-query + publisher CAS via `commit_with_expected` (single manifest commit per `mutate_as` / `load`); D₂ parse-time rule keeps inserts/updates and deletes from mixing |
| Three-way row-level merge | — | `OrderedTableCursor` + `StagedTableWriter`, structured `MergeConflictKind` |
| Change feeds | — | `diff_between` / `diff_commits` with manifest fast path + ID streaming |
| Cedar policy | — | Per-graph actions plus server-scoped actions (see [docs/user/operations/policy.md](docs/user/operations/policy.md) for the current list), branch / target_branch / protected scopes, validate/test/explain CLI. **Engine-wide enforcement** (MR-722): every `_as` writer (`apply_schema_as`, `mutate_as`, `load_as` — the deprecated `ingest_as` shims route through it — `branch_create_as` / `branch_create_from_as`, `branch_delete_as`, `branch_merge_as`) calls `Omnigraph::enforce(action, scope, actor)` — HTTP, CLI, embedded SDK all hit the same gate. |
| HTTP server | — | Axum, OpenAPI via utoipa, bearer auth (SHA-256, AWS Secrets Manager option), `authorize_request` at the HTTP boundary (resolves bearer→actor, applies admission control), NDJSON streaming export, **cluster-only boot (RFC-011): always `--cluster <dir | s3://…>`, serving N graphs (N ≥ 1) under multi-graph routes + read-only `GET /graphs` enumeration + per-graph + server-level Cedar policies. Add/remove graphs via `cluster apply` and restart.** |
| CLI with config | — | two-surface config (team `cluster.yaml` dir + per-operator `~/.omnigraph/config.yaml`), scope addressing (`--store`/`--server`/`--cluster`/`--profile`/defaults, RFC-011), aliases, multi-format output (json/jsonl/csv/kv/table) |
| Audit / actor tracking | — | `_as` write APIs + actor recorded with each graph commit (lineage rows in `__manifest`, projected into the in-memory commit graph) |
| Local S3 testing | — | run RustFS/MinIO + the `AWS_*` env; see [docs/user/deployment.md](docs/user/deployment.md) → *Testing against S3 locally* |
| Agent skill | — | `skills/omnigraph` — operational playbook for driving Omnigraph; install with `npx skills add ModernRelay/omnigraph@omnigraph` |
---
## Maintenance contract for agents
When you change something user-visible, **update the relevant `docs/user/<area>.md` in the same change**. Use [docs/user/index.md](docs/user/index.md) for public behavior and [docs/dev/index.md](docs/dev/index.md) for contributor/internal mechanics. Pointers from this file to those docs must keep working — CI enforces cross-link integrity via `scripts/check-agents-md.sh`.
When proposing or reviewing a non-trivial change, walk [docs/dev/invariants.md](docs/dev/invariants.md) — at minimum the deny-list and review checklist. Add to the deny-list when a new anti-pattern surfaces; relaxing an invariant requires the same review process as code.
Rules:
1. **Update in the same PR.** New endpoint, query function, CLI flag, env var, constant, schema construct, or invariant: update both the source code and the doc in the same change. Never split documentation drift into a follow-up.
2. **Bump version on release.** When a release boundary is crossed (e.g. v0.3.1 → v0.3.2), update the version line at the top of this file and add a `docs/releases/<version>.md` describing the user-visible delta. Update [docs/dev/architecture.md](docs/dev/architecture.md) only if the architecture itself changed.
3. **Write OSS-facing release notes.** Release docs are public project history. Describe capabilities, behavior changes, breaking changes, upgrade notes, and user impact; do not reference private ticket systems, internal codenames, or planning shorthand that an outside contributor cannot inspect.
4. **Keep versioning coherent.** A release bump must update every published crate manifest, local path dependency constraint, `Cargo.lock`, generated API metadata such as `openapi.json`, and this file's surveyed version. Do not leave mixed package versions unless the release plan explicitly calls for them.
5. **Keep docs audience-neutral.** Prefer stable public identifiers (versions, PR numbers, public issue links, crate names, endpoint names) over organization-specific labels. If internal context is useful for maintainers, translate it into a durable public rationale before committing it.
6. **Don't lie.** If a section becomes wrong but you can't rewrite it fully right now, replace the wrong line with `*(stale — needs update after <change>)*` rather than leaving silently incorrect text. Then fix it ASAP.
7. **Re-verify before recommending.** If you cite a flag, env var, endpoint, or constant to the user or in code, grep for it in source first. Memory and docs go stale; the code is authoritative.
8. **Keep AGENTS.md short.** This file is always loaded into agent context, so every added line has a recurring context-window cost. Prefer pointers and terse invariants here; put detail in `docs/`.
9. **Keep AGENTS.md a map, not an encyclopedia.** New deep content goes into `docs/`. Add an entry to "Where to find each topic" instead of pasting prose into this file. The "Always-on rules" section is the exception — it's for invariants that should always be in scope.
10. **Re-read on schema/query/IR changes.** Edits to `schema.pest`, `query.pest`, `ir/lower.rs`, `query/typecheck.rs`, or `query/lint.rs` should trigger a re-read of [docs/user/schema/index.md](docs/user/schema/index.md), [docs/user/queries/index.md](docs/user/queries/index.md), and [docs/dev/execution.md](docs/dev/execution.md) to confirm they still describe reality.
11. **Always make smaller commits.** Each commit does one thing, compiles, and passes tests; mechanical refactors land separately from the behavior changes they enable.
12. **Test-first for bug fixes.** When fixing an identified bug, write a regression test that reproduces the failure first. Confirm it fails against the current code with the predicted symptom (not an unrelated error). Then land the fix in a separate commit and confirm the test turns green. The test commit lands just before the fix commit so the red → green pair is visible in `git log` and a reviewer can check out the test commit alone and reproduce the failure.
13. **Correct by design over symptomatic patches.** When a bug surfaces, identify the root cause and make the fix correct by construction. Don't patch the symptom. If the design admits the bug class, the fix is to close the class, not to add a guard around the latest instance. A symptomatic patch is acceptable only as a stop-gap, with an explicit note in the commit message and a follow-up issue tracking the design fix.
CI check: `scripts/check-agents-md.sh` verifies that docs links in this file and the audience indexes resolve, and that every canonical doc is linked from either [docs/user/index.md](docs/user/index.md) or [docs/dev/index.md](docs/dev/index.md). Run it locally before opening a PR if you've moved or renamed docs.
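
To make the first half of that check concrete, here is a hedged sketch of link resolution for a single Markdown file. It is not the contents of `scripts/check-agents-md.sh`; the function name `broken_links` and the regex are illustrative assumptions, and it handles only inline `[text](target)` links, resolved relative to the file that contains them.

```python
import re
from pathlib import Path

# Inline Markdown links: [text](target). Reference-style links are not handled.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything with a URI scheme (https:, mailto:, ...) is treated as external.
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*:")


def broken_links(md_file: Path) -> list[str]:
    """Return the relative link targets in `md_file` that do not exist on disk."""
    missing = []
    for target in LINK_RE.findall(md_file.read_text(encoding="utf-8")):
        # External URLs and in-page anchors cannot be checked against the tree.
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        # Drop any #fragment before testing for the file itself.
        path = target.split("#", 1)[0]
        if not (md_file.parent / path).exists():
            missing.append(target)
    return missing
```

Calling `broken_links(Path("AGENTS.md"))` from the repo root returns an empty list when every relative link resolves. The second half of the CI check (every canonical doc being linked from an audience index) is not covered by this sketch.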